What Counts as a Character? Alphabet Hypotheses and the Compression Baseline

What is the correct unit of analysis for BPE compression? Different decompositions of the Voynich script yield different alphabets, different text lengths, and different compression fingerprints. This post presents seven hypotheses and shows that the anomaly shrinks as the alphabet gets coarser: it persists under conservative pre-merging and largely disappears only under the most aggressive groupings.

The previous post established BPE as a structural probe and showed that Voynichese compresses anomalously fast relative to 268 natural languages. But that analysis used the raw EVA transliteration, in which each stroke-level glyph is a separate character. This is a choice, and not necessarily the correct one.

What constitutes a “character” in the Voynich script has been debated since the earliest transcription efforts. EVA (the European Voynich Alphabet, later renamed the Extensible Voynich Alphabet; Landini & Zandbergen, late 1990s; see Zandbergen 2022 for the naming history) was designed as a minimal, stroke-level encoding: every visually distinguishable stroke receives its own symbol. But many researchers believe that certain EVA character sequences function as single glyphs. If so, the raw EVA analysis overstates the text’s redundancy — some of the “compression” BPE finds is simply reassembling units that EVA artificially decomposed.

This post presents seven alphabet hypotheses, arranged in order of increasing pre-merging. Each represents a different claim about the script’s character inventory. By running BPE on each version, I observe how the compression anomaly responds to different assumptions about the underlying alphabet.
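For readers who want to follow along, here is a minimal sketch of the kind of measurement involved. It is not the pipeline used for the results in this series. In particular, the ratio definition (tokens remaining after a fixed number of merges, divided by the initial symbol count), the function name `bpe_compression_ratio`, and the choice to keep merges inside word boundaries are all assumptions made for illustration.

```python
from collections import Counter

def bpe_compression_ratio(words, num_merges):
    """Apply up to num_merges BPE merges and return tokens_after / tokens_before.

    ASSUMPTION: this ratio definition is illustrative only; the exact metric
    used in the series may differ. Each word may be a string (one symbol per
    character) or a list of string symbols (for pre-merged alphabets).
    """
    seqs = [list(w) for w in words]
    initial = sum(len(s) for s in seqs)
    if initial == 0:
        raise ValueError("no symbols to compress")
    for _ in range(num_merges):
        # Count adjacent symbol pairs; pairs never span a word boundary.
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + b
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                # Replace each non-overlapping occurrence of (a, b), left to right.
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return sum(len(s) for s in seqs) / initial
```

A lower ratio means the text shrinks faster under the same merge budget. Passing lists of symbols instead of strings lets the same function run on a pre-merged alphabet.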

The hypotheses

The seven hypotheses form a cumulative sequence. Each builds on the previous one by pre-merging additional character combinations before BPE begins.

H0: Raw EVA. The baseline. Every EVA stroke is treated as an independent character. Alphabet size: 20. This is what the first post analysed.

H1: Benches. Pre-merge ch and sh. These are widely treated as single glyphs: the “bench” characters that form the visual backbone of many Voynich words. EVA’s h rarely occurs independently; it is almost exclusively part of a bench (ch, sh) or benched gallows (cth, ckh, cph, cfh). In the ZL transcription, h outside these contexts accounts for roughly 1% of all h occurrences. Alphabet size: 22 (two new symbols replace three EVA characters in combination).

H2: + i-sequences. Additionally pre-merge in and iin. This follows the Currier/First Study Group (FSG) convention, which treats these as composite characters. EVA chose to decompose them analytically into individual strokes, but they are often treated as units in the transcription tradition. Alphabet size: 24.

H3: + doubles. Additionally pre-merge ee into a single symbol. Whether ee is a distinct glyph or simply two adjacent e strokes is debated; this hypothesis treats it as a single unit for the purposes of compression analysis. Alphabet size: 25.

H3d: + Davis. A variant of H3 incorporating Davis’s paleographic argument that EVA p and f are abbreviations for ke and te respectively (Davis 2022). Rather than merging, this hypothesis expands two characters into sequences before applying the remaining pre-merges. Alphabet size: 23 (fewer symbols because p and f are eliminated).

H4: Stolfi groups. A more aggressive hypothesis that builds on H3 by pre-merging a larger set of common character combinations as atomic units. Stolfi (n.d.) partitioned the EVA alphabet into “hard letters” (gallows, benches, e) and “soft letters” (q a o y d s j m n r l i), and identified common soft-letter groups, that is, character sequences that frequently co-occur in the same word positions: am, ar, al, om, or, ol, ain, aiin, oin, oiin, plus qo, dy, and air. Stolfi was not claiming these are single characters in the Voynich script; he was describing positional regularities in how characters co-occur. H4 makes no such claim either: it simply tests the effect on BPE compression when these groups are pre-merged as if they were atomic units. Alphabet size: 38.

H4d: Stolfi+Davis. The most aggressive hypothesis: Stolfi groups combined with Davis expansions. Alphabet size: 36.
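The cumulative structure of the hypotheses can be sketched in a few lines of Python. The group lists and the Davis expansions are taken from the descriptions above; everything else is an assumption for illustration, in particular the names (`HYPOTHESES`, `tokenize`, `alphabet_size`) and the greedy longest-match rule used to resolve overlaps such as aiin versus iin, which this post does not specify.

```python
# Pre-merge groups as listed in the hypothesis descriptions.
BENCHES = ["ch", "sh"]
I_SEQS = ["in", "iin"]
DOUBLES = ["ee"]
STOLFI = ["am", "ar", "al", "om", "or", "ol",
          "ain", "aiin", "oin", "oiin", "qo", "dy", "air"]
DAVIS = {"p": "ke", "f": "te"}  # expansions, applied before any pre-merge

H3_GROUPS = BENCHES + I_SEQS + DOUBLES
# hypothesis name -> (pre-merge groups, apply Davis expansion?)
HYPOTHESES = {
    "H0": ([], False),
    "H1": (BENCHES, False),
    "H2": (BENCHES + I_SEQS, False),
    "H3": (H3_GROUPS, False),
    "H3d": (H3_GROUPS, True),
    "H4": (H3_GROUPS + STOLFI, False),
    "H4d": (H3_GROUPS + STOLFI, True),
}

def tokenize(word, hypothesis):
    """Split an EVA word into symbols under the given hypothesis.

    ASSUMPTION: overlaps are resolved greedily, longest group first,
    scanning left to right. The original analysis may use another rule.
    """
    groups, expand = HYPOTHESES[hypothesis]
    if expand:
        word = "".join(DAVIS.get(c, c) for c in word)
    ordered = sorted(groups, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for g in ordered:
            if word.startswith(g, i):
                out.append(g)
                i += len(g)
                break
        else:
            # No group matches here: emit the single EVA character.
            out.append(word[i])
            i += 1
    return out

def alphabet_size(hypothesis, base=20):
    """Bookkeeping check: base EVA size, plus one symbol per group,
    minus p and f when the Davis expansion removes them."""
    groups, expand = HYPOTHESES[hypothesis]
    return base + len(groups) - (len(DAVIS) if expand else 0)
```

Under these assumptions, `tokenize("daiin", "H2")` gives `['d', 'a', 'iin']` while `tokenize("daiin", "H4")` gives `['d', 'aiin']`, and `alphabet_size` reproduces the sizes quoted above (22, 24, 25, 23, 38, 36).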

Compression under each hypothesis

The figure and table above show the compression ratio for each hypothesis, plotted against the natural-language mean (0.530, dashed green line) and its standard deviation bands (±1σ, ±2σ).

Interpretation

The anomaly is robust under conservative hypotheses. H1 through H3 represent interpretations that are broadly accepted or at least seriously argued in the literature. Under all of them, the Voynich text remains well below the natural-language mean — at or beyond −1.8σ. Pre-merging the widely accepted ligatures reduces the anomaly but does not eliminate it.

The Stolfi groups normalize the compression ratio. Under H4, the ratio rises to 0.500 — within one standard deviation of the natural-language mean. If the Voynich script operates as a constrained syllabary, this is the expected result: once the syllable-level units are treated as atomic characters, the remaining combinatorial structure resembles that of natural language. This does not confirm Stolfi’s specific groupings, but it shows that syllable-sized functional units would be compatible with the observed compression behaviour.
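The σ figures quoted in this section are ordinary z-scores against the natural-language distribution. A small sketch of the arithmetic follows. The mean of 0.530 is from this post; the standard deviation is not stated here, so the value used in the example is a placeholder and not a measured quantity.

```python
NL_MEAN = 0.530  # natural-language mean compression ratio (from this post)

def z_score(ratio, sigma, mean=NL_MEAN):
    """Distance of a compression ratio from the natural-language mean,
    in units of the natural-language standard deviation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (ratio - mean) / sigma

# HYPOTHETICAL sigma, chosen only to make the arithmetic concrete.
PLACEHOLDER_SIGMA = 0.04
h4_z = z_score(0.500, PLACEHOLDER_SIGMA)  # H4 ratio from this post
```

With that placeholder, the H4 ratio of 0.500 lands at about −0.75σ; the real figure depends on the actual spread across the natural-language sample.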

The Davis expansion shifts the ratio slightly. Expanding p → ke and f → te before analysis (H3d vs H3) lowers the compression ratio by about 0.008. The direction is consistent with these characters functioning as abbreviations for bigrams, but note that the expansion makes the text more compressible, not less, so it does not help explain away the anomaly.

Implications for the BPE probe

The alphabet hypothesis analysis establishes a baseline for all subsequent BPE results. Any finding reported under a particular alphabet should be interpreted in light of the pre-merging assumptions it entails.

For the remainder of this series, I report results under both H0 (raw EVA) and H3 (a conservative alphabet that includes benches, i-sequences, and doubled e). Where the choice of alphabet materially affects a conclusion, this is noted explicitly.

The compression anomaly is real but partially alphabet-dependent. Under conservative alphabets, the anomaly persists at approximately −1.8σ to −2.6σ. Under aggressive syllabary-level pre-merging, it largely disappears. This places a constraint on theories of the script’s structure: a theory that posits syllable-level functional units (as in Stolfi’s analysis) is compatible with this pattern, while a theory that treats each EVA character as independent must account for the anomaly directly.

The seven hypotheses presented here are not exhaustive. Zandbergen (2022) has formalized the full analytical-to-synthetic spectrum in his Super Transliteration Alphabet (STA), which identifies 235 distinct characters and combinations in the ZL transliteration file alone. A future post will extend this analysis across STA regularization levels.

The next post will move beyond the alphabet question to examine what BPE reveals about the internal structure of Voynichese: slot purity, entropy convergence, and variation across the manuscript’s sections and scribal hands.

References

  • Currier, P. H. (1976). Some important new statistical findings. In M. E. D’Imperio (Ed.), New Research on the Voynich Manuscript: Proceedings of a Seminar. Washington, D.C.
  • Davis, L. F. (2022). Voynich paleography. Keynote address. In C. Layfield & J. Abela (Eds.), Proceedings of the 1st International Conference on the Voynich Manuscript (VOY2022) (CEUR-WS Vol. 3313). https://ceur-ws.org/Vol-3313/keynote2.pdf
  • Landini, G. (2001). Evidence of linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia, 25(4), 275–295.
  • Stolfi, J. (n.d.). Prefix-midfix-suffix decomposition of Voynichese words. https://www.voynich.nu/hist/stolfi/prefix-midfix-suffix.html (see also: word-level grammar).
  • Zandbergen, R. (2022). Transliteration of the Voynich MS text. In C. Layfield & J. Abela (Eds.), Proceedings of the 1st International Conference on the Voynich Manuscript (VOY2022) (CEUR-WS Vol. 3313). https://ceur-ws.org/Vol-3313/keynote1.pdf
  • Zandbergen, R. (2025). A superset of transliteration alphabets. https://voynich.nu/extra/sta.html