What Counts as a Character? Alphabet Hypotheses and the Compression Baseline

The previous post established BPE as a structural probe and showed that Voynichese compresses anomalously fast relative to 268 natural languages. But that analysis used the raw EVA transliteration, in which each stroke-level glyph is a separate character. This is a choice, and not necessarily the correct one.

What constitutes a “character” in the Voynich script has been debated since the earliest transcription efforts. EVA (the European Voynich Alphabet, later renamed the Extensible Voynich Alphabet; Landini & Zandbergen, late 1990s; see Zandbergen 2022 for the naming history) was designed as a minimal, stroke-level encoding: every visually distinguishable stroke receives its own symbol. But many researchers believe that certain EVA character sequences function as single glyphs. If so, the raw EVA analysis overstates the text’s redundancy — some of the “compression” BPE finds is simply reassembling units that EVA artificially decomposed.

This post presents seven alphabet hypotheses, arranged in order of increasing pre-merging. Each represents a different claim about the script’s character inventory. By running BPE on each version, I observe how the compression anomaly responds to different assumptions about the underlying alphabet.

The hypotheses

The seven hypotheses form a cumulative sequence. Each builds on the previous one by pre-merging additional character combinations before BPE begins.

H0: Raw EVA. The baseline. Every EVA stroke is treated as an independent character. Alphabet size: 20. This is what the first post analysed.

H1: Benches. Pre-merge ch and sh. These are widely treated as single glyphs: the “bench” characters that form the visual backbone of many Voynich words. EVA’s h rarely occurs independently; it is almost exclusively part of a bench (ch, sh) or benched gallows (cth, ckh, cph, cfh). In the ZL transcription, h outside these contexts accounts for roughly 1% of all h occurrences. Alphabet size: 22 (two symbols are added for ch and sh; c, s, and h stay in the inventory because they also occur outside plain benches).

H2: + i-sequences. Additionally pre-merge in and iin. This follows the Currier/First Study Group (FSG) convention, which treats these as composite characters. EVA decomposes them analytically into individual strokes, but the transcription tradition often treats them as units. Alphabet size: 24.

H3: + doubles. Additionally pre-merge ee into a single symbol. Whether ee is a distinct glyph or simply two adjacent e strokes is debated; this hypothesis treats it as a single unit for the purposes of compression analysis. Alphabet size: 25.

H3d: + Davis. A variant of H3 incorporating the hypothesis that EVA p and f are abbreviations for ke and te respectively. Pelling (2020) proposed this on statistical grounds (p and f are almost never followed by e, unlike k and t); Davis (2022) arrived at the same conclusion independently from paleographic evidence. Rather than merging, this hypothesis expands these two characters into their two-character sequences before applying the remaining pre-merges. Alphabet size: 23 (two fewer symbols than H3, because p and f are eliminated).

H4: Stolfi groups. A more aggressive hypothesis that builds on H3 by pre-merging a larger set of common character combinations as atomic units. Stolfi (n.d.) partitioned the EVA alphabet into “hard letters” (gallows, benches, e) and “soft letters” (q a o y d s j m n r l i), and identified common soft-letter groups, that is, character sequences that frequently co-occur in the same word positions: am, ar, al, om, or, ol, ain, aiin, oin, oiin, plus qo, dy, and air. Stolfi was not claiming that these groups are single characters in the Voynich script; he was describing positional regularities in how characters co-occur. H4 does not make that claim either. It simply tests the effect on BPE compression when these 13 groups are pre-merged as if they were atomic units. Alphabet size: 38.

H4d: Stolfi+Davis. The most aggressive hypothesis: Stolfi groups combined with Davis expansions. Alphabet size: 36.
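The cumulative construction described above can be sketched as follows. This is a minimal illustration, not the code used for the post: the function name premerge, the greedy longest-match rule for overlapping units (for example aiin versus iin), and the representation of a merged unit as a multi-character token are all assumptions.

```python
# Unit lists taken from the hypothesis descriptions above.
H1 = ["ch", "sh"]
H2 = H1 + ["in", "iin"]
H3 = H2 + ["ee"]
STOLFI = ["am", "ar", "al", "om", "or", "ol",
          "ain", "aiin", "oin", "oiin", "qo", "dy", "air"]
H4 = H3 + STOLFI

# Davis expansion: p and f are replaced before any pre-merge is applied.
DAVIS = {"p": "ke", "f": "te"}


def premerge(word, units, expand=None):
    """Split an EVA word into tokens, treating each unit as one symbol.

    Assumption: overlaps are resolved greedily, longest unit first,
    scanning left to right. The post does not specify this rule.
    """
    if expand:
        word = "".join(expand.get(ch, ch) for ch in word)
    ordered = sorted(units, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(word):
        for unit in ordered:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            # No unit starts here: keep the single EVA character.
            tokens.append(word[i])
            i += 1
    return tokens


print(premerge("daiin", H2))           # ['d', 'a', 'iin']
print(premerge("daiin", H4))           # ['d', 'aiin']
print(premerge("chopy", H3, DAVIS))    # ['ch', 'o', 'k', 'e', 'y']
```

Passing an empty unit list reproduces H0, and passing DAVIS together with H3 or H4 gives H3d and H4d respectively.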

Compression under each hypothesis

The figure and table above show the compression ratio for each hypothesis, plotted against the natural-language mean (0.530, dashed green line) and its standard deviation bands (±1σ, ±2σ).
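The σ figures quoted in this post are distances from the natural-language mean in units of its standard deviation. A small sketch of that arithmetic is given here; the mean 0.530 and the H4 ratio 0.500 come from the post, while the value of sigma is a placeholder chosen only for illustration, since the post does not state it.

```python
def z_score(ratio: float, mean: float, sigma: float) -> float:
    """Signed distance of a compression ratio from the mean, in sigma units."""
    return (ratio - mean) / sigma


def band(z: float) -> str:
    """Name the standard-deviation band that a z-score falls into."""
    a = abs(z)
    if a <= 1:
        return "within 1 sigma"
    if a <= 2:
        return "within 2 sigma"
    return "beyond 2 sigma"


NL_MEAN = 0.530            # natural-language mean, from the post
SIGMA_PLACEHOLDER = 0.04   # placeholder, NOT a value reported in the post

z = z_score(0.500, NL_MEAN, SIGMA_PLACEHOLDER)
print(round(z, 2), band(z))   # -0.75 within 1 sigma
```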

Interpretation

The anomaly is robust under conservative hypotheses. H1 through H3 represent interpretations that are broadly accepted or at least seriously argued in the literature. Under all of them, the Voynich text remains well below the natural-language mean — at or beyond −1.8σ. Pre-merging the widely accepted ligatures reduces the anomaly but does not eliminate it.

The Stolfi groups absorb the compression anomaly. Under H4, the ratio rises to 0.500, which is within one standard deviation of the natural-language mean. This should be read as an isolation test, not a structural confirmation: pre-merging the most frequent character combinations before BPE begins mechanically removes those pairs from BPE’s reach, so a higher ratio is expected. The useful result is that the anomaly is largely localized in Stolfi’s specific groups. Once those groups are collapsed, the remaining inter-group combinatorics are unremarkable.
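The mechanical effect can be seen on a toy example. The sketch below assumes the compression ratio is the token count after a fixed number of BPE merges divided by the token count before any merges, with both counts taken under the same alphabet. That definition, the function name bpe_ratio, and the tie-breaking rule are assumptions for illustration, and the three words are a toy sample rather than a corpus.

```python
from collections import Counter


def bpe_ratio(words, num_merges):
    """Tokens after num_merges BPE merges divided by tokens before.

    words is a list of token sequences (strings or lists of symbols).
    Assumption: ties between equally frequent pairs are broken by
    comparing the pairs themselves.
    """
    words = [list(w) for w in words]
    before = sum(len(w) for w in words)
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)   # replace the pair by one token
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged_words.append(out)
        words = merged_words
    return sum(len(w) for w in words) / before


raw = ["chol", "chor", "shol"]                       # stroke-level symbols
benches = [["ch", "o", "l"], ["ch", "o", "r"], ["sh", "o", "l"]]

print(bpe_ratio(raw, 1))       # 0.75  (12 tokens -> 9)
print(bpe_ratio(benches, 1))   # 7/9, about 0.778  (9 tokens -> 7)
```

With the benches already merged, the single most productive pair is no longer available to BPE, so the same merge budget removes proportionally fewer tokens and the ratio comes out higher.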

The Davis expansion shifts the ratio slightly. Expanding p → ke and f → te before analysis (H3d vs H3) lowers the compression ratio by about 0.008. The direction is consistent with these characters functioning as abbreviations for bigrams. Note, however, that the expansion makes the text more compressible, not less, so it does not help explain away the anomaly.
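The statistical observation behind the Davis expansion (p and f are almost never followed by e, unlike k and t) can be checked on any word list with a sketch like the one below. The function name and the toy word list are hypothetical; real rates would have to be computed from a full transliteration file.

```python
from collections import Counter


def followed_by_e_rate(words, letters="ktpf"):
    """For each letter, the share of its occurrences directly followed by e."""
    total = Counter()
    with_e = Counter()
    for w in words:
        # Pair every character with its successor; the last one gets a space.
        for cur, nxt in zip(w, w[1:] + " "):
            if cur in letters:
                total[cur] += 1
                if nxt == "e":
                    with_e[cur] += 1
    return {c: with_e[c] / total[c] for c in letters if total[c]}


# Toy sample for illustration only, not manuscript statistics.
sample = ["okeey", "otedy", "opchy", "ofar", "qokal"]
print(followed_by_e_rate(sample))
# {'k': 0.5, 't': 1.0, 'p': 0.0, 'f': 0.0}
```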

Implications for the BPE probe

The alphabet hypothesis analysis establishes a baseline for all subsequent BPE results. Any finding reported under a particular alphabet should be interpreted in light of the pre-merging assumptions it entails.

For the remainder of this series, I report results under both H0 (raw EVA) and H3 (a conservative alphabet that includes benches, i-sequences, and doubled e). Where the choice of alphabet materially affects a conclusion, this is noted explicitly.

The compression anomaly is real but partially alphabet-dependent. Under conservative alphabets, the anomaly persists at approximately −1.8σ to −2.6σ. Under aggressive syllabary-level pre-merging, it largely disappears. This places a constraint on theories of the script’s structure: a theory that posits syllable-level functional units (as in Stolfi’s analysis) is compatible with this pattern, while a theory that treats each EVA character as independent must account for the anomaly directly.

The seven hypotheses presented here are not exhaustive. Zandbergen (2022) has formalized the full analytical-to-synthetic spectrum in his Super Transliteration Alphabet (STA), which identifies 235 distinct characters and combinations in the ZL transliteration file alone. A future post will extend this analysis across STA regularization levels.

The next post moves beyond the alphabet question to examine what BPE reveals about the internal structure of Voynichese: slot purity, entropy convergence, and variation across the manuscript’s sections and scribal hands.

Note (added April 16, 2026)

I am grateful to Torsten Timm for correspondence that prompted a deeper investigation of the baseline corpus and the robustness of these results across transcription systems. Subsequent work using a cleaner cross-linguistic corpus and comparisons across multiple transcriptions (ZL, IT, GC/v101) confirms and extends this post’s central finding. A later post will present these results in detail.

References

Currier, P. H. (1976). Some important new statistical findings. In M. E. D’Imperio (Ed.), New Research on the Voynich Manuscript: Proceedings of a Seminar. Washington, D.C.
Davis, L. F. (2022). Voynich paleography. Keynote address. In C. Layfield & J. Abela (Eds.), Proceedings of the 1st International Conference on the Voynich Manuscript (VOY2022) (CEUR-WS Vol. 3313). https://ceur-ws.org/Vol-3313/keynote2.pdf
Landini, G. (2001). Evidence of linguistic structure in the Voynich manuscript using spectral analysis. Cryptologia, 25(4), 275–295.
Pelling, N. (2020). On single-leg gallows and the benefits of having more questions than answers. Cipher Mysteries. https://ciphermysteries.com/2020/09/05/on-single-leg-gallows-and-the-benefits-of-having-more-questions-than-answers
Stolfi, J. (n.d.). Prefix-midfix-suffix decomposition of Voynichese words. https://www.voynich.nu/hist/stolfi/prefix-midfix-suffix.html — see also word-level grammar.
Zandbergen, R. (2022). Transliteration of the Voynich MS text. In C. Layfield & J. Abela (Eds.), Proceedings of the 1st International Conference on the Voynich Manuscript (VOY2022) (CEUR-WS Vol. 3313). https://ceur-ws.org/Vol-3313/keynote1.pdf
Zandbergen, R. (2025). A superset of transliteration alphabets. https://voynich.nu/extra/sta.html