BPE as a Structural Probe for Unknown Writing Systems
Byte pair encoding is best known for tokenization in NLP — but it can also serve as a diagnostic tool for unknown writing systems. The compression curve, merge history, and cross-linguistic ranking expose structural properties that traditional entropy measures miss.
Byte pair encoding (BPE), originally introduced by Gage (1994) as a data compression algorithm, was repurposed by Sennrich, Haddow & Birch (2016) as a subword tokenization method for neural machine translation. It has since become a foundational component of modern large language model architectures.
In this post, I apply BPE in a different capacity: not for compression or tokenization, but as a structural probe for writing systems. The central observation is that the rate at which a text compresses under iterative pair-merging, together with the pattern of merges produced, is sensitive to the character-level combinatorics of the writing system — independently of semantic content.
I developed this probe to investigate the Voynich manuscript (Beinecke MS 408), but the method is general. Any undeciphered or poorly understood writing system — from Linear A to the Indus Valley script — could be subjected to the same analysis.
Algorithm
BPE operates as follows. Given a sequence of symbols:
- Count all adjacent pairs.
- Replace every occurrence of the most frequent pair with a new symbol.
- Repeat.
Each iteration constitutes a merge. After n merges, the text has been shortened: some pairs of symbols have been collapsed into single tokens. The ratio of the resulting length to the original length is the compression ratio at step n.
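The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the merge procedure and the compression-ratio bookkeeping, not the exact implementation behind the figures in this post:

```python
from collections import Counter

def bpe_compress(text, n_merges):
    """Iteratively merge the most frequent adjacent symbol pair.

    Returns the merge history [(pair, count), ...] and the
    compression ratio (current length / original length) after
    each merge step.
    """
    seq = list(text)
    original_len = len(seq)
    history, curve = [], []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        # Replace every non-overlapping occurrence, scanning left to right.
        merged, i = [], 0
        while i < len(seq):
            if i < len(seq) - 1 and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)  # the pair collapses into one token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        history.append(((a, b), count))
        curve.append(len(seq) / original_len)
    return history, curve
```

Running this on any string yields both the merge sequence and the compression curve used throughout the analysis below.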
BPE merges are not arbitrary. They recover the dominant character-level regularities in the text. In English, the first merges typically find th, he, in, er — the bigrams that carry the most redundancy. A writing system with different combinatorial structure will produce a different merge sequence.
Motivation: beyond character entropy
The standard measure of character-level predictability is conditional entropy (h₂): the average information, in bits, carried by the next character given the previous one. Lower h₂ indicates more predictable character sequences.
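For concreteness, h₂ can be computed directly from bigram counts: it is the expected value of −log₂ p(next char | previous char). A short sketch:

```python
import math
from collections import Counter

def conditional_entropy(text):
    """h2: average bits carried by the next character given the previous one."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])  # marginal counts of the conditioning character
    n_pairs = sum(pair_counts.values())
    h2 = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n_pairs               # joint probability p(a, b)
        p_b_given_a = c / prev_counts[a]  # conditional probability p(b | a)
        h2 -= p_ab * math.log2(p_b_given_a)
    return h2
```

A fully periodic string such as "ababab…" scores h₂ = 0: each character is perfectly predicted by its predecessor.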
The Voynich manuscript scores anomalously low on this measure. Using the methodology and corpus of Lindemann & Bowern (2020), which comprises 294 Wikipedia language samples, Voynichese exhibits a conditional character entropy below that of every natural language in the sample. Montemurro & Zanette (2013) reached a complementary conclusion using entropy at the word level, finding long-range statistical structure consistent with linguistic content.
However, h₂ is a scalar. It quantifies the degree of predictability but does not characterize its structure. BPE provides three additional diagnostics:
Compression ratio as a function of merge count reveals whether predictability is concentrated in a small number of patterns (steep initial decline) or distributed across many (gradual decline).
The merge sequence itself — the ordered list of which pairs are most productive — exposes the combinatorial structure that generates the text.
Merge-count decay — the rate at which successive merge frequencies diminish — indicates whether the text is dominated by a few highly productive patterns (large initial counts that fall off rapidly) or exhibits a flatter distribution (counts that diminish gradually).
Together, these form a compression fingerprint: a richer characterization than any single statistic.
Compression curves
The figure above presents compression curves for the Voynich manuscript, English, and Latin, all size-matched to the Voynich text length of approximately 39,000 tokens. Dashed lines indicate character-shuffled controls — texts preserving the same character frequency distribution but with randomized sequential order, thereby destroying all character-level structure.
Three observations stand out:
- Voynichese compresses more rapidly than either natural language at every merge step. By merge 50, its compression ratio falls below 0.35: BPE has identified enough redundancy to remove nearly two-thirds of the text.

- The effect is not an alphabet-size artefact. English has 26 characters; Voynich in the EVA transliteration (Landini 2001; Zandbergen n.d.) has approximately 20–25 basic characters, depending on how ligatures and multi-stroke glyphs are counted (a question explored in a subsequent post). Despite comparable alphabet sizes, the compression behaviours differ dramatically.

- Shuffling eliminates the anomaly. The shuffled Voynich control compresses substantially less aggressively, confirming that the effect resides in the sequential structure of characters, not merely in their frequency distribution.
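A character-shuffled control of the kind used here is trivial to construct: randomize the order of characters while keeping the frequency distribution intact. A minimal sketch (the seed is arbitrary, included only for reproducibility):

```python
import random

def shuffled_control(text, seed=0):
    """Randomize character order, preserving the frequency distribution."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)
```

Because the multiset of characters is unchanged, any difference in compression between a text and its shuffled control must come from sequential structure alone.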
Merge history
The first 30 BPE merges reveal which character combinations dominate each text.
In English, the merge sequence recovers familiar bigrams and trigrams: t+h → th, th+e → the, i+n → in. These reflect the morphological and phonological structure of the language.

In Voynich, the pattern differs markedly. The initial merges recover units such as c+h → ch, s+h → sh, e+e → ee, ch+e → che — highly productive syllable-like combinations, consistent with the positional regularities identified by Stolfi (n.d.) and the slot-based analysis of Currier (1976). The associated counts are disproportionately large: the most frequent Voynich merge exceeds the most frequent English merge by a wide margin, despite comparable text sizes.
This concentration — a small number of merges accounting for a large proportion of the total compression — is characteristic of a system with limited combinatorial degrees of freedom.
Cross-linguistic ranking
To calibrate these observations against a typologically diverse baseline, BPE was applied to 268 of the 294 Wikipedia languages in the Lindemann & Bowern (2020) corpus, selecting those with sufficient text to size-match to the Voynich text length.
The Voynich manuscript ranks 3rd out of 270 texts (268 languages plus the Voynich text and its shuffled control), with a compression ratio of 0.332. The natural-language mean is 0.530 (σ = 0.075), placing Voynichese 2.64 standard deviations below the mean: 99.3% of the natural languages in the sample compress less aggressively than Voynichese.
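As a sanity check, the quoted deviation follows directly from the reported numbers:

```python
# Reported values: Voynich compression ratio, natural-language mean and std dev.
ratio, mean, sigma = 0.332, 0.530, 0.075
z = (ratio - mean) / sigma
print(round(z, 2))  # -2.64
```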
The only texts that compress more aggressively employ scripts with very small effective alphabets (typically syllabaries or logographic systems in which the Unicode code points already encode complex units). A 20–25 character alphabetic script compressing at this rate is not attested elsewhere in this sample.
Merge-count decay
The rate at which merge counts diminish from the first merge to the fiftieth is also informative. The grey bands in the figure above indicate the 50th, 80th, and 95th percentiles across all 268 natural languages. Voynich (red) begins far above the 95th percentile.
This result extends beyond the observation that Voynichese contains one unusually frequent bigram. The entire distribution of character-pair frequencies is shifted toward concentration. The text behaves as though it were generated by a process with fewer independent degrees of freedom than natural language typically exhibits.
After approximately 15–16 merges, however, the Voynich decay curve falls back within the natural-language envelope. The anomaly is concentrated in a small set of extremely productive character combinations; beyond those, the remaining combinatorial structure resembles that of ordinary language. This is a structural clue: whatever process generated the text produced a few dominant patterns layered on top of otherwise unremarkable character-level statistics.
Scope and limitations
BPE compression analysis does not determine what Voynichese is. The method cannot, on its own, distinguish between a natural language with unusually constrained phonotactics, an artificial language, a cipher, a hoax, or a meaningless but structured generation process.
What it does provide is a quantified structural property — hyper-concentrated character-level patterns producing anomalous compressibility — that any adequate theory of the manuscript must account for.
The following post examines a prior question that any BPE analysis must address: what counts as a character? Different decompositions of the Voynich script yield different compression baselines, and the answer constrains everything that follows.
References
- Currier, P. H. (1976). Some important new statistical findings. In M. E. D’Imperio (Ed.), New Research on the Voynich Manuscript: Proceedings of a Seminar. Washington, D.C.
- Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12(2), 23–38.
- Landini, G. (2001). Evidence of linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia, 25(4), 275–295.
- Lindemann, L. & Bowern, C. (2020). Character entropy in modern and historical texts: Comparison metrics for an undeciphered manuscript. arXiv:2010.14697.
- Montemurro, M. A. & Zanette, D. H. (2013). Keywords and co-occurrence patterns in the Voynich manuscript: An information-theoretic analysis. PLoS ONE, 8(6), e66344.
- Sennrich, R., Haddow, B. & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the ACL, 1715–1725.
- Stolfi, J. (n.d.). Voynich manuscript stuff. Retrieved from https://www.ic.unicamp.br/~stolfi/voynich/
- Zandbergen, R. (n.d.). The Voynich manuscript. Retrieved from https://voynich.nu/