Posts

13th April 2026

BPE as a Structural Probe for Unknown Writing Systems

Byte pair encoding is best known for tokenization in NLP, but it can also serve as a diagnostic tool for unknown writing systems. The compression curve, merge history, and cross-linguistic ranking expose structural properties that traditional entropy measures miss.

14th April 2026

What Counts as a Character? Alphabet Hypotheses and the Compression Baseline

What is the correct unit of analysis for BPE compression? Different decompositions of the Voynich script yield different alphabets, different text lengths, and different compression fingerprints. Seven alphabet hypotheses are compared, showing that the anomaly diminishes, but does not disappear, under progressively coarser alphabets.

16th April 2026

What BPE Reveals about the Internal Structure of Voynichese

Applying the BPE probe to the Voynich manuscript’s internal structure reveals three properties that constrain theories of the script: rigid positional slots, fast entropy convergence, and systematic variation across sections and scribal hands.