17 weird facts about the Hunspell dictionary format

https://zverok.space/blog/2021-03-16-spellchecking-dictionaries.html

2•bmacho•1mo ago

Comments

jll29•1mo ago

[Hunspell has been very successful, as the OP correctly points out, and my comments are intended to improve on the state of the art rather than to badmouth the fantastic work of its authors, two of whom are friends of mine.]

Hunspell uses an ad-hoc file format and an ad-hoc method. The original code was developed in OCaml, and it evolved from there to where we are today (one of the developers, VT, shared an office with me for a few years, so I am a past "ear witness" of sorts).

There is an opportunity now to rebuild something more systematic based on the XFST formalism originally devised at Xerox Research Center Europe in Grenoble under Prof. Lauri Karttunen, Kenneth Beesley and team [1], especially since Mans Hulden has re-created their toolset as FOMA, a C re-implementation that has been open sourced [3].

The beauty of XFST and friends is that it's a formalization of regular relations, the relations generated and accepted by extended finite state transducers, which are automata that can be run in either direction. The XFST formalism leads to more readable/maintainable lexicons and rules, and it can also be used to generate word forms, not just to analyze them.
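To make the "generate, not just analyze" point concrete, here is a toy sketch in plain Python. It is not XFST or FOMA syntax and uses neither tool's API; the arc table, the tag names (`+N`, `+Sg`, `+Pl`) and the function `apply` are all made up for illustration. It only shows that one and the same set of arcs can be read top-down (lemma plus tags to surface form) or bottom-up (surface form to lemma plus tags).

```python
from typing import Iterator, List

# Hypothetical toy transducer for the single noun "cat".
# Each arc is (source_state, upper_symbol, lower_symbol, target_state);
# the empty string stands for epsilon (nothing consumed or emitted).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),
    (4, "+Sg", "", 5),
    (4, "+Pl", "s", 5),
]
FINAL = {5}


def apply(symbols: List[str], direction: str = "down", state: int = 0) -> Iterator[List[str]]:
    """Yield every output sequence for the given input symbols.

    direction="down" reads the upper side and emits the lower side (generation);
    direction="up" reads the lower side and emits the upper side (analysis).
    The toy arc table has no epsilon cycles, so the recursion terminates.
    """
    if not symbols and state in FINAL:
        yield []
    for src, upper, lower, dst in ARCS:
        if src != state:
            continue
        inp, out = (upper, lower) if direction == "down" else (lower, upper)
        if inp == "":
            rest = symbols              # epsilon on the input side: consume nothing
        elif symbols and symbols[0] == inp:
            rest = symbols[1:]          # consume one matching input symbol
        else:
            continue
        for tail in apply(rest, direction, dst):
            yield ([out] if out else []) + tail


# Generation: lemma + tags -> surface form
print(list(apply(["c", "a", "t", "+N", "+Pl"], "down")))
# Analysis: surface form -> lemma + tags
print(list(apply(list("cats"), "up")))
```

A real system would compile lexicons and rewrite rules into such a network rather than listing arcs by hand; the sketch only illustrates the bidirectionality.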

[1] https://www.amazon.com/Finite-State-Morphology-Kenneth-Beesl...

[2] https://dsacl3-2018.github.io/xfst-demo/ and others (simply search for e.g. "xfst|foma fst")

[3] Hulden, Mans (2009) https://aclanthology.org/E09-2008/ (A Python interface already exists, too: Hulden, M. et al. (2024) https://aclanthology.org/2024.acl-demos.24/ .)

There are many training resources for the XFST family of formalisms, and it is taught in computational linguistics courses around the world [2]. There is also tool support in the form of e.g. syntax coloring support for vim https://www.vim.org/scripts/script.php?script_id=3441 etc. - all this would make the set of potential contributors for a future version of the spell checker vastly larger (compared to requiring interested parties to analyze an obscure ad-hoc format). It would also open up future possibilities for new functionality in Open Office - e.g. the generation capability could be used to offer a button "pluralize word".
