What Counts as a Character? Alphabet Hypotheses and the Compression Baseline
What is the correct unit of analysis for BPE compression? Different decompositions of the Voynich script yield different alphabets, different text lengths, and different compression fingerprints. This post tests seven alphabet hypotheses and shows that the anomaly shrinks under progressively coarser alphabets: it persists under conservative pre-merging and largely disappears only under the most aggressive groupings.
The previous post established BPE as a structural probe and showed that Voynichese compresses anomalously fast relative to 268 natural languages. But that analysis used the raw EVA transliteration, in which each stroke-level glyph is a separate character. This is a choice, and not necessarily the correct one.
What constitutes a “character” in the Voynich script has been debated since the earliest transcription efforts. EVA (the European Voynich Alphabet, later renamed the Extensible Voynich Alphabet; Landini & Zandbergen, late 1990s; see Zandbergen 2022 for the naming history) was designed as a minimal, stroke-level encoding: every visually distinguishable stroke receives its own symbol. But many researchers believe that certain EVA character sequences function as single glyphs. If so, the raw EVA analysis overstates the text’s redundancy — some of the “compression” BPE finds is simply reassembling units that EVA artificially decomposed.
This post presents seven alphabet hypotheses, arranged in order of increasing pre-merging. Each represents a different claim about the script’s character inventory. By running BPE on each version, I observe how the compression anomaly responds to different assumptions about the underlying alphabet.
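To make the procedure concrete, here is a minimal Python sketch of a BPE compression measurement. It assumes that the compression ratio is the token count after a fixed number of merges divided by the initial token count. The previous post defines the actual metric, so this definition, the function name `bpe_compression_ratio`, and the toy words are illustrative assumptions, not the pipeline behind the reported results.

```python
from collections import Counter


def bpe_compression_ratio(words, num_merges):
    """Run a simple BPE on `words` (a list of symbol lists).

    Returns (ratio, merges), where ratio is tokens-after / tokens-before.
    ASSUMPTION: this ratio definition is a plausible stand-in, not
    necessarily the metric used in the series.
    """
    corpus = [list(w) for w in words]
    initial = sum(len(w) for w in corpus)
    if initial == 0:
        raise ValueError("corpus is empty")

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs inside each word (never across words).
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left that repeats
        merges.append(a + b)

        # Replace every non-overlapping occurrence of (a, b) with one symbol.
        new_corpus = []
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus

    final = sum(len(w) for w in corpus)
    return final / initial, merges


# Toy input only: three EVA-like words split into single characters (H0 style).
toy_words = [list("chol"), list("chol"), list("shol")]
ratio, merges = bpe_compression_ratio(toy_words, num_merges=1)
print(ratio)  # 0.75: one merge removes 3 of the 12 tokens
```

Ties between equally frequent pairs are broken by whichever pair was counted first. That is an arbitrary choice in this sketch.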
The hypotheses
The seven hypotheses form a largely cumulative sequence. H0 through H4 each build on the previous one by pre-merging additional character combinations before BPE begins; H3d and H4d are variants of H3 and H4 that add the Davis expansions.
H0: Raw EVA. The baseline. Every EVA stroke is treated as an independent character. Alphabet size: 20. This is what the first post analysed.
H1: Benches. Pre-merge ch and sh. These are widely treated
as single glyphs: the “bench” characters that form the visual backbone of
many Voynich words. EVA’s h rarely occurs independently; it is almost
exclusively part of a bench (ch, sh) or benched gallows (cth, ckh,
cph, cfh). In the ZL transcription, h outside these contexts accounts
for roughly 1% of all h occurrences. Alphabet size: 22 (the 20 EVA
characters plus two new symbols for the merged benches).
H2: + i-sequences. Additionally pre-merge in and iin. This
follows the Currier/First Study Group (FSG) convention, which treats these as
composite characters. EVA chose to decompose them analytically into individual
strokes, but they are often treated as units in the transcription tradition.
Alphabet size: 24.
H3: + doubles. Additionally pre-merge ee into a single symbol.
Whether ee is a distinct glyph or simply two adjacent e strokes
is debated; this hypothesis treats it as a single unit for the purposes of
compression analysis. Alphabet size: 25.
H3d: + Davis. A variant of H3 incorporating Davis’s paleographic
argument that EVA p and f are abbreviations for ke and te respectively
(Davis 2022). Rather than merging, this hypothesis expands two characters
into sequences before applying the remaining pre-merges. Alphabet size: 23
(fewer symbols because p and f are eliminated).
H4: Stolfi groups. A more aggressive hypothesis that builds on H3 by
pre-merging a larger set of common character combinations as atomic units.
Stolfi (n.d.) partitioned the EVA alphabet into “hard letters” (gallows,
benches, e) and “soft letters” (q a o y d s j m n r l i), and identified
common soft-letter groups — character sequences that frequently co-occur in
the same word positions: am, ar, al, om, or, ol, ain, aiin,
oin, oiin, plus qo, dy, and air. Stolfi did not claim that these groups are
single characters in the Voynich script; he was describing positional
regularities in how characters co-occur. H4 makes no such claim either. It
simply tests the effect on BPE compression when these groups are pre-merged
as if they were atomic units.
Alphabet size: 38.
H4d: Stolfi+Davis. The most aggressive hypothesis: Stolfi groups combined with Davis expansions. Alphabet size: 36.
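The pre-merging steps above can be sketched as a tokenizer over EVA words. This is an assumption about the mechanics, not the code behind the reported numbers: it applies the Davis expansions by plain character substitution, then splits each word by greedy longest-match against the hypothesis's unit list. The names `HYPOTHESES`, `tokenize`, `apply_hypothesis`, and `nominal_alphabet_size` are hypothetical. The nominal size is the 20 raw EVA characters plus one symbol per pre-merged unit, minus p and f when they are expanded away, which reproduces the alphabet sizes quoted above.

```python
BENCHES = ["ch", "sh"]
I_SEQUENCES = ["in", "iin"]
DOUBLES = ["ee"]
STOLFI_GROUPS = ["am", "ar", "al", "om", "or", "ol",
                 "ain", "aiin", "oin", "oiin", "qo", "dy", "air"]
DAVIS_EXPANSIONS = {"p": "ke", "f": "te"}
RAW_EVA_SIZE = 20  # alphabet size stated for H0

_H3_UNITS = BENCHES + I_SEQUENCES + DOUBLES

# name -> (pre-merged units, apply Davis expansions first?)
HYPOTHESES = {
    "H0": ([], False),
    "H1": (BENCHES, False),
    "H2": (BENCHES + I_SEQUENCES, False),
    "H3": (_H3_UNITS, False),
    "H3d": (_H3_UNITS, True),
    "H4": (_H3_UNITS + STOLFI_GROUPS, False),
    "H4d": (_H3_UNITS + STOLFI_GROUPS, True),
}


def tokenize(word, units):
    """Split `word` left to right, preferring the longest matching unit.

    ASSUMPTION: greedy longest-match is one way to resolve overlaps such as
    'iin' inside 'aiin'; the original analysis may resolve them differently.
    """
    by_length = sorted(units, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(word):
        for unit in by_length:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            tokens.append(word[i])  # no unit matched: keep the raw character
            i += 1
    return tokens


def apply_hypothesis(word, name):
    units, use_davis = HYPOTHESES[name]
    if use_davis:
        word = "".join(DAVIS_EXPANSIONS.get(c, c) for c in word)
    return tokenize(word, units)


def nominal_alphabet_size(name):
    units, use_davis = HYPOTHESES[name]
    removed = len(DAVIS_EXPANSIONS) if use_davis else 0
    return RAW_EVA_SIZE + len(units) - removed


for name in HYPOTHESES:
    print(name, nominal_alphabet_size(name), apply_hypothesis("qokeedy", name))
```

Under this sketch the word daiin is five tokens under H0, three under H2 (d, a, iin), and two under H4 (d, aiin), which shows how each coarser alphabet shortens the text before BPE sees it.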
Compression under each hypothesis
The figure and table above show the compression ratio for each hypothesis, plotted against the natural-language mean (0.530, dashed green line) and its standard deviation bands (±1σ, ±2σ).
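The position of a ratio relative to the natural-language distribution is a z-score: (ratio − mean) / σ. The mean of 0.530 comes from the text. The standard deviation is not restated in this post, so the value used below is a placeholder chosen only to make the example run, and `sigma_offset` is a hypothetical helper name.

```python
NATURAL_MEAN = 0.530  # natural-language mean ratio quoted in the text


def sigma_offset(ratio, sigma, mean=NATURAL_MEAN):
    """Return how many standard deviations `ratio` lies from `mean`."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (ratio - mean) / sigma


# PLACEHOLDER: not the measured standard deviation of the 268 languages.
PLACEHOLDER_SIGMA = 0.04

# H4's ratio of 0.500, expressed in (placeholder) standard deviations.
print(round(sigma_offset(0.500, PLACEHOLDER_SIGMA), 2))
```

Whatever the measured σ is, the statement that 0.500 lies within one standard deviation of 0.530 implies that σ is at least 0.030.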
Interpretation
The anomaly is robust under conservative hypotheses. H1 through H3 represent interpretations that are broadly accepted or at least seriously argued in the literature. Under all of them, the Voynich text remains well below the natural-language mean — at or beyond −1.8σ. Pre-merging the widely accepted ligatures reduces the anomaly but does not eliminate it.
The Stolfi groups normalize the compression ratio. Under H4, the ratio rises to 0.500 — within one standard deviation of the natural-language mean. If the Voynich script operates as a constrained syllabary, this is the expected result: once the syllable-level units are treated as atomic characters, the remaining combinatorial structure resembles that of natural language. This does not confirm Stolfi’s specific groupings, but it shows that syllable-sized functional units would be compatible with the observed compression behaviour.
The Davis expansion shifts the ratio slightly. Expanding
p → ke and f → te before analysis (H3d vs H3) lowers the
compression ratio by about 0.008. The direction is consistent with these
characters functioning as abbreviations for bigrams. Note, however, that
the expansion makes the text more compressible, not less, so it does not
help explain away the anomaly.
Implications for the BPE probe
The alphabet hypothesis analysis establishes a baseline for all subsequent BPE results. Any finding reported under a particular alphabet should be interpreted in light of the pre-merging assumptions it entails.
For the remainder of this series, I report results under both H0 (raw EVA) and H3 (a conservative alphabet that includes benches, i-sequences, and doubled e). Where the choice of alphabet materially affects a conclusion, this is noted explicitly.
The compression anomaly is real but partially alphabet-dependent. Under conservative alphabets, the anomaly persists at approximately −1.8σ to −2.6σ. Under aggressive syllabary-level pre-merging, it largely disappears. This places a constraint on theories of the script’s structure: a theory that posits syllable-level functional units (as in Stolfi’s analysis) is compatible with this pattern, while a theory that treats each EVA character as independent must account for the anomaly directly.
The seven hypotheses presented here are not exhaustive. Zandbergen (2022) has formalized the full analytical-to-synthetic spectrum in his Super Transliteration Alphabet (STA), which identifies 235 distinct characters and combinations in the ZL transliteration file alone. A future post will extend this analysis across STA regularization levels.
The next post will move beyond the alphabet question to examine what BPE reveals about the internal structure of Voynichese: slot purity, entropy convergence, and variation across the manuscript’s sections and scribal hands.
References
- Currier, P. H. (1976). Some important new statistical findings. In M. E. D’Imperio (Ed.), New Research on the Voynich Manuscript: Proceedings of a Seminar. Washington, D.C.
- Davis, L. F. (2022). Voynich paleography. Keynote address. In C. Layfield & J. Abela (Eds.), Proceedings of the 1st International Conference on the Voynich Manuscript (VOY2022) (CEUR-WS Vol. 3313). https://ceur-ws.org/Vol-3313/keynote2.pdf
- Landini, G. (2001). Evidence of linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia, 25(4), 275–295.
- Stolfi, J. (n.d.). Prefix-midfix-suffix decomposition of Voynichese words. https://www.voynich.nu/hist/stolfi/prefix-midfix-suffix.html — see also word-level grammar.
- Zandbergen, R. (2022). Transliteration of the Voynich MS text. In C. Layfield & J. Abela (Eds.), Proceedings of the 1st International Conference on the Voynich Manuscript (VOY2022) (CEUR-WS Vol. 3313). https://ceur-ws.org/Vol-3313/keynote1.pdf
- Zandbergen, R. (2025). A superset of transliteration alphabets. https://voynich.nu/extra/sta.html