Posts
13th April 2026
BPE as a Structural Probe for Unknown Writing Systems
Byte pair encoding (BPE) is best known as a tokenization method in NLP, but it can also serve as a diagnostic tool for unknown writing systems. The compression curve, merge history, and cross-linguistic ranking expose structural properties that traditional entropy measures miss.
14th April 2026
What Counts as a Character? Alphabet Hypotheses and the Compression Baseline
What is the correct unit of analysis for BPE compression? Different decompositions of the Voynich script yield different alphabets, different text lengths, and different compression fingerprints. The post presents seven alphabet hypotheses and shows that the anomaly diminishes, but does not disappear, under progressively coarser alphabets.