BPE as a Structural Probe for Unknown Writing Systems
Byte pair encoding is best known for tokenization in NLP — but it can also serve as a diagnostic tool for unknown writing systems. The compression curve, merge history, and cross-linguistic ranking expose structural properties that traditional entropy measures miss.
Byte pair encoding (BPE), originally introduced by Gage (1994) as a data compression algorithm, was repurposed by Sennrich, Haddow & Birch (2016) as a subword tokenization method for neural machine translation. It has since become a foundational component of modern large language model architectures.
In this post, I apply BPE in a different capacity: not for compression or tokenization, but as a structural probe for writing systems. The central observation is that the rate at which a text compresses under iterative pair-merging, together with the pattern of merges produced, is sensitive to the character-level combinatorics of the writing system — independently of semantic content.
I developed this probe to investigate the Voynich manuscript (Beinecke MS 408), but the method is general. Any undeciphered or poorly understood writing system — from Linear A to the Indus Valley script — could be subjected to the same analysis.
Algorithm
BPE operates as follows. Given a sequence of symbols:
- Count all adjacent pairs.
- Replace every occurrence of the most frequent pair with a new symbol.
- Repeat.
Each iteration constitutes a merge. After n merges, the text has been shortened: some pairs of symbols have been collapsed into single tokens. The ratio of the resulting length to the original length is the compression ratio at step n.
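The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the merge procedure and the compression-ratio bookkeeping, not the exact implementation behind the figures in this post:

```python
from collections import Counter

def bpe_compress(text, n_merges):
    """Iteratively merge the most frequent adjacent symbol pair.

    Returns the merge history [(pair, count), ...] and the
    compression ratio (current length / original length) after
    each merge step.
    """
    seq = list(text)
    original_len = len(seq)
    history, curve = [], []
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        # Replace every non-overlapping occurrence, scanning left to right.
        merged, i = [], 0
        while i < len(seq):
            if i < len(seq) - 1 and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)  # the pair collapses into one token
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        history.append(((a, b), count))
        curve.append(len(seq) / original_len)
    return history, curve
```

Running this on any string yields both the merge sequence and the compression curve used throughout the analysis below.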
BPE merges are not arbitrary. They recover the dominant character-level regularities in the text. In English, the first merges typically find th, he, in, er — the bigrams that carry the most redundancy. A writing system with different combinatorial structure will produce a different merge sequence.
Motivation: beyond character entropy
The standard measure of character-level predictability is conditional entropy (h₂): the average information, in bits, carried by the next character given the previous one. Lower h₂ indicates more predictable character sequences.
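For concreteness, h₂ can be computed directly from bigram counts: it is the expected value of −log₂ p(next char | previous char). A short sketch:

```python
import math
from collections import Counter

def conditional_entropy(text):
    """h2: average bits carried by the next character given the previous one."""
    pair_counts = Counter(zip(text, text[1:]))
    prev_counts = Counter(text[:-1])  # marginal counts of the conditioning character
    n_pairs = sum(pair_counts.values())
    h2 = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n_pairs               # joint probability p(a, b)
        p_b_given_a = c / prev_counts[a]  # conditional probability p(b | a)
        h2 -= p_ab * math.log2(p_b_given_a)
    return h2
```

A fully periodic string such as "ababab…" scores h₂ = 0: each character is perfectly predicted by its predecessor.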
The Voynich manuscript scores anomalously low on this measure. Using the methodology and corpus of Lindemann & Bowern (2020), which comprises 294 Wikipedia language samples, Voynichese exhibits a conditional character entropy below that of every natural language in the sample. Montemurro & Zanette (2013) reached a complementary conclusion using entropy at the word level, finding long-range statistical structure consistent with linguistic content.
However, h₂ is a scalar. It quantifies the degree of predictability but does not characterize its structure. BPE provides three additional diagnostics:
Compression ratio as a function of merge count reveals whether predictability is concentrated in a small number of patterns (steep initial decline) or distributed across many (gradual decline).
The merge sequence itself — the ordered list of which pairs are most productive — exposes the combinatorial structure that generates the text.
Merge-count decay — the rate at which successive merge frequencies diminish — indicates whether the text is dominated by a few highly productive patterns (large initial counts that fall off rapidly) or exhibits a flatter distribution (counts that diminish gradually).
Together, these form a compression fingerprint: a richer characterization than any single statistic.
Compression curves
The figure above presents compression curves for the Voynich manuscript, English, and Latin, all size-matched to the Voynich text length of approximately 39,000 tokens. Dashed lines indicate character-shuffled controls — texts preserving the same character frequency distribution but with randomized sequential order, thereby destroying all character-level structure.
Three observations stand out:
- Voynichese compresses more rapidly than either natural language at every merge step. By merge 50, its compression ratio falls below 0.35: BPE has identified enough redundancy to remove nearly two-thirds of the text.

- The effect is not an alphabet-size artefact. English has 26 characters; Voynich in the EVA transliteration (Landini 2001; Zandbergen n.d.) has approximately 20–25 basic characters, depending on how ligatures and multi-stroke glyphs are counted (a question explored in a subsequent post). Despite comparable alphabet sizes, the compression behaviours differ dramatically.

- Shuffling eliminates the anomaly. The shuffled Voynich control compresses substantially less aggressively, confirming that the effect resides in the sequential structure of characters, not merely in their frequency distribution.
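A character-shuffled control of the kind used here is trivial to construct: randomize the order of characters while keeping the frequency distribution intact. A minimal sketch (the seed is arbitrary, included only for reproducibility):

```python
import random

def shuffled_control(text, seed=0):
    """Randomize character order, preserving the frequency distribution."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)
```

Because the multiset of characters is unchanged, any difference in compression between a text and its shuffled control must come from sequential structure alone.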
Merge history
The first 30 BPE merges reveal which character combinations dominate each text.
In English, the merge sequence recovers familiar bigrams and trigrams: t+h → th, th+e → the, i+n → in. These reflect the morphological and phonological structure of the language.

In Voynich, the pattern differs markedly. The initial merges recover units such as c+h → ch, s+h → sh, e+e → ee, ch+e → che — highly productive syllable-like combinations, consistent with the positional regularities identified by Stolfi (n.d.) and the slot-based analysis of Currier (1976). The associated counts are disproportionately large: the most frequent Voynich merge exceeds the most frequent English merge by a wide margin, despite comparable text sizes.
This concentration — a small number of merges accounting for a large proportion of the total compression — is characteristic of a system with limited combinatorial degrees of freedom.
Cross-linguistic ranking
To calibrate these observations against a typologically diverse baseline, BPE was applied to 268 of the 294 Wikipedia languages in the Lindemann & Bowern (2020) corpus, selecting those with sufficient text to size-match to the Voynich text length.
The Voynich manuscript ranks 3rd out of 270 texts (268 languages plus the Voynich text and its shuffled control), with a compression ratio of 0.332. The natural-language mean is 0.530 (σ = 0.075), placing Voynichese 2.64 standard deviations below the mean: 99.3% of the natural languages in the sample compress less aggressively than Voynichese.
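As a sanity check, the quoted deviation follows directly from the reported numbers:

```python
# Reported values: Voynich compression ratio, natural-language mean and std dev.
ratio, mean, sigma = 0.332, 0.530, 0.075
z = (ratio - mean) / sigma
print(round(z, 2))  # -2.64
```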
The only texts that compress more aggressively employ scripts with very small effective alphabets (typically syllabaries or logographic systems in which the Unicode code points already encode complex units). A 20–25 character alphabetic script compressing at this rate is not attested elsewhere in this sample.
Merge-count decay
The rate at which merge counts diminish from the first merge to the fiftieth is also informative. The grey bands in the figure above indicate the 50th, 80th, and 95th percentiles across all 268 natural languages. Voynich (red) begins far above the 95th percentile.
This result extends beyond the observation that Voynichese contains one unusually frequent bigram. The entire distribution of character-pair frequencies is shifted toward concentration. The text behaves as though it were generated by a process with fewer independent degrees of freedom than natural language typically exhibits.
After approximately 15–16 merges, however, the Voynich decay curve falls back within the natural-language envelope. The anomaly is concentrated in a small set of extremely productive character combinations; beyond those, the remaining combinatorial structure resembles that of ordinary language. This is a structural clue: whatever process generated the text produced a few dominant patterns layered on top of otherwise unremarkable character-level statistics.
Scope and limitations
BPE compression analysis does not determine what Voynichese is. The method cannot, on its own, distinguish between a natural language with unusually constrained phonotactics, an artificial language, a cipher, a hoax, or a meaningless but structured generation process.
What it does provide is a quantified structural property — hyper-concentrated character-level patterns producing anomalous compressibility — that any adequate theory of the manuscript must account for.
The following post examines a prior question that any BPE analysis must address: what counts as a character? Different decompositions of the Voynich script yield different compression baselines, and the answer constrains everything that follows.
References
- Currier, P. H. (1976). Some important new statistical findings. In M. E. D’Imperio (Ed.), New Research on the Voynich Manuscript: Proceedings of a Seminar. Washington, D.C.
- Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12(2), 23–38.
- Landini, G. (2001). Evidence of linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia, 25(4), 275–295.
- Lindemann, L. & Bowern, C. (2020). Character entropy in modern and historical texts: Comparison metrics for an undeciphered manuscript. arXiv:2010.14697.
- Montemurro, M. A. & Zanette, D. H. (2013). Keywords and co-occurrence patterns in the Voynich manuscript: An information-theoretic analysis. PLoS ONE, 8(6), e66344.
- Sennrich, R., Haddow, B. & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the ACL, 1715–1725.
- Stolfi, J. (n.d.). Voynich manuscript stuff. Retrieved from https://www.ic.unicamp.br/~stolfi/voynich/
- Zandbergen, R. (n.d.). The Voynich manuscript. Retrieved from https://voynich.nu/