The Voynich Manuscript is a 15th-century book written in an unknown script. No one’s been able to translate it, and many think it’s a hoax, a cipher, or a constructed language. I wasn’t trying to decode it — I just wanted to see: does it behave like a structured language?
I stripped a handful of common suffix-like endings (aiin, dy, etc.) to isolate what looked like root forms. I know that’s a strong assumption, and I call it out directly in the repo, but it helped clarify the clustering. From there, I used SBERT embeddings and KMeans to group similar roots, inferred part-of-speech-like (POS-like) roles from position and frequency, and built a Markov transition matrix to visualize cluster-to-cluster flow.
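For readers who want a concrete picture, here is a minimal sketch of the suffix-stripping and transition-matrix steps. It is not the code from the repo. The suffix list holds only the two endings named above, the sample words are illustrative rather than a verified excerpt, and the `cluster_of` lookup is a hypothetical stand-in for the SBERT + KMeans step so that the example runs with the standard library alone.

```python
# Hypothetical suffix list: only "aiin" and "dy" are named in the post.
SUFFIXES = ("aiin", "dy")


def strip_suffix(word, suffixes=SUFFIXES):
    """Drop the longest matching suffix, but never reduce a word to nothing."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def transition_matrix(labels, n_clusters):
    """Row-normalized first-order Markov matrix over a sequence of cluster ids."""
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for current, following in zip(labels, labels[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no outgoing transitions stay all-zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy transliterated words (illustrative only).
words = "qokaiin shedy qokedy daiin chedy".split()
roots = [strip_suffix(w) for w in words]

# Assumption: cluster ids would come from KMeans on embeddings;
# here they are assigned by first letter purely for demonstration.
cluster_of = {"q": 0, "s": 1, "d": 2, "c": 1}
labels = [cluster_of[root[0]] for root in roots]

matrix = transition_matrix(labels, n_clusters=3)
for row in matrix:
    print(row)
```

Each row of `matrix` gives the estimated probability of moving from one cluster to each of the others, which is the quantity a cluster-to-cluster flow plot would visualize.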
It’s not translation. It’s not decryption. It’s structural modeling — and it revealed some surprisingly consistent syntax across the manuscript, especially when broken out by section (Botanical, Biological, etc.).
GitHub repo: https://github.com/brianmg/voynich-nlp-analysis
Write-up: https://brig90.substack.com/p/modeling-the-voynich-manuscrip...
I’m new to the NLP space, so I’m sure there are things I got wrong — but I’d love feedback from people who’ve worked with structured language modeling or weird edge cases like this.
nine_k•5h ago
<quote>
Key Findings
* Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group
* Cluster 3 has high diversity and flexible positioning — likely a root content class
* Transition matrix shows strong internal structure, far from random
* Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)
Hypothesis
The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.
</quote>
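For anyone who wants to check statistics like these themselves, a rough sketch of how per-cluster frequency, diversity, and line-start rate could be computed follows. This is an assumption about how such measures might be defined (diversity as distinct roots divided by tokens), not the repo's method, and the roots and cluster ids in the toy data are made up.

```python
from collections import defaultdict


def cluster_profile(lines):
    """Summarize clusters from lines of (root, cluster_id) pairs.

    Returns, per cluster: token count, type/token diversity,
    and the share of its tokens that begin a line.
    """
    tokens = defaultdict(int)
    types = defaultdict(set)
    starts = defaultdict(int)
    for line in lines:
        for position, (root, cluster) in enumerate(line):
            tokens[cluster] += 1
            types[cluster].add(root)
            if position == 0:
                starts[cluster] += 1
    return {
        c: {
            "tokens": tokens[c],
            "diversity": len(types[c]) / tokens[c],
            "line_start_rate": starts[c] / tokens[c],
        }
        for c in tokens
    }


# Toy data: hypothetical roots and cluster ids, two lines of text.
lines = [
    [("qok", 8), ("che", 3), ("she", 3)],
    [("qok", 8), ("ot", 3), ("qok", 8)],
]
profile = cluster_profile(lines)
for cluster, stats in sorted(profile.items()):
    print(cluster, stats)
```

In this toy data, cluster 8 repeats a single root and often starts lines, while cluster 3 never repeats a root, which mirrors the function-word versus content-word contrast described in the findings.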
brig90•5h ago
gchamonlive•5h ago
InsideOutSanta•4h ago
I don't see how it could be random, regardless of whether it is an actual language. Humans are famously terrible at generating randomness.
nine_k•3h ago
InsideOutSanta•3h ago
I wouldn't assume that the writer made decisions based on these goals, but rather that the writer attempted to create a simulacrum of a real language. However, even if they did not, I would expect an attempt at generating a "random" language to ultimately mirror many of the properties of the person's native language.
The arguments that this book is written in a real language rest on the assumption that a human being making up gibberish would not produce something that exhibits many of the properties of a real language; however, I don't see anyone offering any evidence to support this claim.