What BPE Reveals about the Internal Structure of Voynichese

Applying the BPE probe to the Voynich manuscript’s internal structure reveals three properties that constrain theories of the script: rigid positional slots, fast entropy convergence, and systematic variation across sections and scribal hands.

The first two posts established BPE as a structural probe and showed how the choice of alphabet affects the compression baseline. This post turns the probe inward, examining what the merge sequence and compression behaviour reveal about the internal structure of Voynichese itself.

Three properties emerge. Each is quantified against English and Latin controls, and each places a constraint on what kind of system could have produced the text.

Positional slot structure

Stolfi observed that Voynich characters occupy rigid positional “slots” within words: some characters appear almost exclusively at the beginning of words, others only in the middle, others only at the end. This is markedly different from natural languages, where most characters appear in all positions (even if with varying frequency).

BPE provides an independent test of this observation. If the slot structure is real, BPE merges should respect positional boundaries — merging characters that occupy the same or adjacent slots, and rarely merging characters from opposite ends of a word.

To quantify this, I compute the slot purity of each alphabet: for each character with at least 100 occurrences, I measure the proportion of occurrences in its dominant position (initial, medial, or final). The average across all qualifying characters gives an overall purity score. A perfectly constrained alphabet scores 100%; a perfectly uniform one scores 33.3%.
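The purity score described above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the code behind the figures in this post: the function name `slot_purity`, the decision to skip one-character words, and the toy word list are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def slot_purity(words, min_count=100):
    """Mean share of each character's occurrences that fall in its dominant slot."""
    counts = defaultdict(Counter)  # char -> Counter over initial/medial/final
    for word in words:
        if len(word) < 2:
            continue  # assumption: a one-character word has no well-defined slot
        last = len(word) - 1
        for i, ch in enumerate(word):
            slot = "initial" if i == 0 else "final" if i == last else "medial"
            counts[ch][slot] += 1
    # Keep only characters that occur often enough to give a stable proportion.
    per_char = {
        ch: max(c.values()) / sum(c.values())
        for ch, c in counts.items()
        if sum(c.values()) >= min_count
    }
    if not per_char:
        raise ValueError("no character reaches min_count")
    return sum(per_char.values()) / len(per_char), per_char

# Toy input (made up for illustration, not manuscript data).
toy_words = ["qokedy", "qokain", "shedy", "daiin", "okaiin"]
overall, by_char = slot_purity(toy_words, min_count=1)
```

With three slots, a character spread evenly across positions contributes 1/3, which is where the 33.3% floor comes from.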

Running this analysis across the 268 languages in the Lindemann & Bowern (2020) corpus produces a cross-linguistic baseline. The natural-language mean is 67.1% (σ = 5.3%). Voynichese scores 80.6%, placing it at +2.54σ — rank 9 out of 269 texts.

The eight languages with higher slot purity are all written in Southeast Asian or South Asian scripts: Thai (88.2%), Tamil (85.9%), Khmer (84.9%), Shan (84.2%), Lao (83.1%), Malayalam (82.8%), Bishnupriya Manipuri (82.3%), and Sanskrit (82.0%).

A caveat is warranted here. Four of these languages (Thai, Khmer, Shan, and Lao) are written in scriptio continua, with no spaces between words in natural text. The word boundaries in their corpus representations were inserted by NLP tokenizers, not by scribes, so their high slot purity may partly reflect tokenizer behaviour rather than intrinsic script structure. The Voynich manuscript, by contrast, has visible hand-drawn spaces. When the comparison is restricted to languages with natural whitespace segmentation, Voynichese moves closer to the top of the ranking: of the eight higher-scoring texts, only Tamil, Malayalam, Bishnupriya Manipuri, and Sanskrit remain above it.

With that qualification, the pattern is informative. Voynichese clusters not with alphabetic scripts like English or Latin, but with writing systems where character position within a word is structurally constrained — abugidas, syllabaries, and scripts with fixed positional slots.

Among Voynich characters specifically: q and s are almost entirely word-initial; y and n are almost entirely word-final; o and a are predominantly medial.

BPE can only merge adjacent characters, but we can ask: when two characters are merged, do they tend to occupy the same positional role within words, or different roles? In Voynich, the merged pairs overwhelmingly come from the same slot or from neighbouring slots (e.g., two medial characters, or a medial and a final). English and Latin show a similar pattern, but with more variety in which positional combinations appear, consistent with Voynich having a more rigid positional template than natural-language alphabets.

Figure: Distribution of BPE merge types across the first 30 merges. “Same slot” merges combine characters that share a dominant positional role; “adjacent” merges combine characters from neighbouring slots (e.g., medial + final). No cross-slot merges (initial + final) occur in any of the three texts.

This is consistent with a system in which words are assembled by concatenating elements drawn from position-specific inventories — the pattern expected from a table-based or syllabary-like construction.
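One way to label merges as same-slot, adjacent, or cross-slot is sketched below. The names `dominant_slots`, `classify_merge`, and `merge_type_counts` are hypothetical, and the rule for multi-character tokens (look only at the two characters that meet at the join) is an assumption, since the post does not say how longer tokens are assigned a slot.

```python
from collections import Counter, defaultdict

SLOT_RANK = {"initial": 0, "medial": 1, "final": 2}

def dominant_slots(words):
    """Map each character to the slot where it occurs most often."""
    counts = defaultdict(Counter)
    for word in words:
        if len(word) < 2:
            continue  # assumption: skip one-character words
        last = len(word) - 1
        for i, ch in enumerate(word):
            slot = "initial" if i == 0 else "final" if i == last else "medial"
            counts[ch][slot] += 1
    return {ch: c.most_common(1)[0][0] for ch, c in counts.items()}

def classify_merge(left, right, slots):
    """Label a merge by the slots of the two characters that meet at the join."""
    # Assumption: for multi-character tokens only the touching characters count.
    gap = abs(SLOT_RANK[slots[left[-1]]] - SLOT_RANK[slots[right[0]]])
    return ("same slot", "adjacent", "cross-slot")[gap]

def merge_type_counts(merges, words):
    """Tally merge types for a list of (left, right) merge pairs."""
    slots = dominant_slots(words)
    return Counter(classify_merge(left, right, slots) for left, right in merges)

# Toy input (made up for illustration, not manuscript data).
toy_words = ["qoky", "qoly", "soky"]
toy_merges = [("o", "k"), ("q", "o"), ("qo", "ky")]
tally = merge_type_counts(toy_merges, toy_words)
```

A gap of two slot ranks corresponds to the initial + final case that the figure reports as absent.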

Entropy convergence

Character-level conditional entropy (h₂) measures how predictable the next character is given the previous one. As noted in the first post, Voynichese has anomalously low h₂. But what happens to h₂ as BPE progressively absorbs character-level regularities into its merged tokens?

If the low entropy is an artefact of EVA’s analytical decomposition — splitting single glyphs into multiple strokes — then BPE merging should gradually normalize h₂ toward the natural-language range as it reassembles those glyphs. If the anomaly runs deeper, h₂ will remain low even after extensive merging.

I measure token-level h₁ (unigram entropy) and h₂ (conditional bigram entropy) after 0, 5, 10, 20, 50, 100, 150, and 200 BPE merges, for Voynich, English, and Latin. A methodological note: at merge 0, these are character-level values; at higher merge counts, they are computed over merged tokens of increasing size. The reference baseline (natural-language mean) is character-level. This means the comparison is strictly like-for-like only at merge 0; the convergence trajectory should be read as exploratory.
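For reference, the two quantities can be computed from any token sequence as in the sketch below. It uses the standard plug-in estimates, with h₂ taken as the entropy of a token given the one before it. How word boundaries are handled (for example, whether a space counts as a token) is left to the caller here and is an assumption, since the post does not specify it.

```python
import math
from collections import Counter

def h1(tokens):
    """Unigram entropy in bits."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def h2(tokens):
    """Conditional entropy in bits of a token given the preceding token."""
    pairs = Counter(zip(tokens, tokens[1:]))
    firsts = Counter()
    for (a, _), c in pairs.items():
        firsts[a] += c  # how often `a` appears as the first element of a pair
    n = sum(pairs.values())
    # Sum over pairs of p(a, b) * log2 p(b | a), negated.
    return -sum(c / n * math.log2(c / firsts[a]) for (a, _), c in pairs.items())

# At merge 0 the tokens are single characters; after k merges, pass the
# merged tokens instead. Toy sequence, made up for illustration.
tokens = list("qokeedy qokedy shedy")
values = (h1(tokens), h2(tokens))
```

Because the token inventory grows with every merge, values from different merge counts are measured over different units, which is the like-for-like caveat raised above.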

At merge 0 (raw EVA), Voynichese h₂ is 2.08 bits — a z-score of −4.58 relative to the natural-language mean of 3.75 bits (σ = 0.36). English starts at 3.34 (z = −1.13) and Latin at 3.36 (z = −1.06).

As merges increase, Voynichese h₂ rises steeply. By merge 20 it reaches 3.42 (z = −0.91), recovering most of the gap. This is consistent with part of the anomaly being a transcription-level effect: EVA’s analytical decomposition inflates predictability, and BPE’s merging partially undoes it.

But the trajectories then diverge. English h₂ stabilizes around 4.4 bits by merge 100 and Latin around 4.8, whereas Voynichese keeps climbing: it overtakes English around merge 50 and reaches 4.95 at merge 200, above both controls and well above the character-level natural-language baseline of 3.75 bits.

This pattern is suggestive but should be interpreted with caution given the unit mismatch noted above. At low merge counts, Voynichese is more predictable than natural language (the EVA decomposition effect). At high merge counts, the merged tokens appear less regular in their sequencing than natural-language subword units, but this may partly reflect the different measurement scales rather than a purely structural property. A more rigorous analysis would compute a merged-token baseline for the natural languages at each step.
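The like-for-like comparison suggested here could be organised as below: for each merge count, the reference mean and spread come from control texts measured at that same merge count, not from the character-level baseline. The function name `z_per_step`, the dictionary layout, and the use of the population standard deviation are assumptions for the sketch, and the numbers in the usage lines are placeholders, not measurements.

```python
import statistics

def z_per_step(target, baseline):
    """z-score of the target text at each merge count.

    target:   {merge_count: h2 of the text under study}
    baseline: {merge_count: [h2 of each control language at that merge count]}
    """
    scores = {}
    for step, value in target.items():
        reference = baseline[step]
        mu = statistics.mean(reference)
        sigma = statistics.pstdev(reference)  # assumption: population sigma
        if sigma == 0:
            raise ValueError(f"zero spread in baseline at merge {step}")
        scores[step] = (value - mu) / sigma
    return scores

# Placeholder values only, to show the call shape.
placeholder_target = {0: 2.0, 50: 4.0}
placeholder_baseline = {0: [3.0, 3.5, 4.0], 50: [3.8, 4.2, 4.6]}
scores = z_per_step(placeholder_target, placeholder_baseline)
```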

The alphabet hypothesis analysis from the previous post corroborates the low end: even under H3 (benches, i-sequences, doubled e), h₂ remains at −2.90σ. Under H4 (Stolfi groups), h₂ rises to 3.36 (z = −1.07) — within the natural-language range, consistent with the compression ratio result.

Section and hand variation

The Voynich manuscript is not a homogeneous text. Currier (1976) identified two distinct “languages” (A and B) based on statistical differences in character frequencies. Davis (2020) identified multiple scribal hands. The manuscript’s sections — herbal, biological, astronomical, cosmological, pharmaceutical, stars, zodiac, and text-only — differ in illustration style and may differ in textual properties.

Splitting the text by these metadata fields and running BPE independently on each partition reveals that the compression anomaly is not uniform.

Currier B compresses more aggressively than Currier A. At 200 merges, Currier B sections have a compression ratio of 0.321, compared to 0.350 for Currier A (Δ = −0.029). The full manuscript ratio is 0.344. BPE quantifies what Currier observed qualitatively: the two “languages” have different character-pair redundancy profiles.
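A per-partition run can be sketched as follows. Two assumptions are baked in and may differ from the earlier posts: merges are learned within words only, and the compression ratio is taken to be the number of tokens left after merging divided by the number of characters before merging. The function name and the toy partitions are hypothetical.

```python
from collections import Counter

def bpe_compression_ratio(words, n_merges=200):
    """Learn up to n_merges BPE merges; return (ratio, merges in order)."""
    vocab = Counter(tuple(w) for w in words if w)  # word type -> frequency
    start = sum(len(w) * f for w, f in vocab.items())
    if start == 0:
        raise ValueError("empty partition")
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w, f in vocab.items():
            for pair in zip(w, w[1:]):
                pairs[pair] += f
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for w, f in vocab.items():
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged[tuple(out)] += f
        vocab = merged
    end = sum(len(w) * f for w, f in vocab.items())
    return end / start, merges

# Toy partitions (made up): each one is trained independently.
partitions = {"A": ["daiin", "chol", "chor"], "B": ["qokedy", "shedy", "chedy"]}
ratios = {name: bpe_compression_ratio(ws, n_merges=5)[0]
          for name, ws in partitions.items()}
```

Training each partition from scratch means the merge tables differ between partitions, so a lower ratio reflects more pair redundancy within that partition, not better reuse of a shared vocabulary.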

Section variation is substantial. The biological section (exclusively Currier B) compresses to 0.283, dramatically lower than any other section and far below the manuscript average. The herbal section spans both Currier languages, and splitting it reveals different compression profiles: Herbal A (0.343) sits near the manuscript average, while Herbal B (0.328) compresses more aggressively, consistent with the overall Currier B pattern.

Section          Words    Ratio   Δ from full   Top merge
Biological       6,386    0.283   −0.060        d+y
Stars            11,643   0.322   −0.022        c+h
Herbal B         3,477    0.328   −0.016        c+h
Text-only        2,362    0.329   −0.015        c+h
Zodiac           1,314    0.330   −0.014        o+t
Pharmaceutical   2,589    0.332   −0.012        o+l
Herbal A         8,062    0.343   −0.001        c+h
Astronomical     883      0.356   +0.012        c+h
Cosmological     2,254    0.374   +0.030        c+h

The top merge varies across sections. While c+h (ch) dominates in most sections, the biological section is led by d+y (dy), the pharmaceutical by o+l (ol), and the zodiac by o+t (ot). These are not just frequency shifts; they reflect different combinatorial structures in different parts of the manuscript.

Section ratios should be read with caution, particularly for the smaller sections. The Astronomical section (883 words) and Zodiac (1,314 words) are small enough that sampling position within the manuscript can shift the ratio by ±0.02 or more; the larger partitions (Herbal A at ~8,000 words, Stars at ~11,500) are more stable. Bootstrap confidence intervals would strengthen these comparisons but are not yet computed.
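A percentile bootstrap is one way such intervals could be obtained. The sketch below resamples words with replacement, which treats words as exchangeable and ignores line and page order. That is a simplifying assumption, and a block bootstrap over lines or pages may be more appropriate for position effects. The function name is hypothetical, and mean word length stands in for the statistic; a compression-ratio function would be passed in its place.

```python
import random

def bootstrap_ci(words, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(words)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(words)
    stats = sorted(
        statistic(rng.choices(words, k=n)) for _ in range(n_resamples)
    )
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_index], stats[hi_index]

def mean_word_length(words):
    """Stand-in statistic for the example."""
    return sum(len(w) for w in words) / len(words)

# Toy section (made up for illustration, not manuscript data).
toy_section = ["qokedy", "shedy", "daiin", "ol", "chedy", "qokain"] * 20
low, high = bootstrap_ci(toy_section, mean_word_length, n_resamples=200)
```

Small partitions should show visibly wider intervals than large ones, which would make the ±0.02 sensitivity noted above directly comparable across sections.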

Constraints on theories

These three properties — slot purity, entropy convergence, and section variation — together narrow the space of plausible generating mechanisms.

The text is not random. Random character generation produces flat compression curves, low slot purity, and no section variation. Voynichese fails all three predictions of the random hypothesis.

The text is not a simple substitution cipher. A monoalphabetic cipher preserves the compression profile and positional statistics of the source language. Voynichese does not match any natural language’s profile under any alphabet hypothesis tested.

The text has positional structure. Slot purity at +2.54σ places Voynichese among Southeast Asian and South Asian syllabaries and abugidas — scripts with inherent positional constraints. This is the pattern expected from a table-based construction (as in Rugg’s Cardan grille hypothesis) or a constrained syllabary, not from an alphabetic script.

The text has internal variation. Currier A and B differ quantifiably in compression behaviour, and the variation tracks known palaeographic divisions. Any adequate theory must account for this variation, whether as two different generating processes, two different portions of a single process, or scribal variation within a single system.

A subsequent post will revisit these findings using a cleaner cross-linguistic baseline and a more rigorous statistical framework, before turning to the question of which generative models can reproduce the observed properties.

Note

As noted in the previous posts, the cross-linguistic baseline used here (Lindemann & Bowern 2020) has known quality issues. A subsequent post will revisit these comparisons using a cleaner multilingual corpus.

Note (added April 20, 2026)

An earlier version of this post treated the herbal section as a single partition and described the biological section as “predominantly Currier B.” Torsten Timm pointed out that the herbal section spans both Currier languages with dramatically different vocabularies, and that the biological section is exclusively (not predominantly) Currier B. The section table has been recomputed with the herbal section split into Herbal A (8,062 words) and Herbal B (3,477 words). The split confirms that Herbal B compresses more aggressively (0.328) than Herbal A (0.343), consistent with the overall Currier B pattern. A missing chart comparing merge slot types across Voynich, English, and Latin has also been added.

References

  • Currier, P. H. (1976). Some important new statistical findings. In M. E. D’Imperio (Ed.), New Research on the Voynich Manuscript: Proceedings of a Seminar. Washington, D.C.
  • Davis, L. F. (2020). How many glyphs and how many scribes? Digital paleography and the Voynich manuscript. Manuscript Studies, 5(1), 164–180.
  • Lindemann, L. & Bowern, C. (2020). Character entropy in modern and historical texts: Comparison metrics for an undeciphered manuscript. https://arxiv.org/abs/2010.14697
  • Stolfi, J. (n.d.). Voynich manuscript stuff. https://www.ic.unicamp.br/~stolfi/voynich/ — see also word-level grammar.