[Hunspell has been very successful, as the OP correctly points out, and my comments are intended to improve on the state of the art, not to badmouth the fantastic work of its authors, two of whom are friends of mine.]
Hunspell uses an ad-hoc file format and an ad-hoc method. The original code was developed in OCaml at the time, and it evolved from there to where we are today (one of the developers, VT, shared an office with me for a few years, so I am a past "ear witness" of sorts).
There is an opportunity now to rebuild something more systematic based on the XFST formalism, originally devised at Xerox Research Centre Europe in Grenoble by Prof. Lauri Karttunen, Kenneth Beesley and team [1]. This is especially true since Mans Hulden has re-created their toolset as FOMA, a C re-implementation that has been open sourced [3].
The beauty of XFST and friends is that it is a formalization of regular relations, i.e. the relations encoded by extended finite-state transducers, which are automata that can be run in both directions. The XFST formalism leads to more readable and maintainable lexicons and rules, and the same description can be used to generate word forms, not just to analyze them.
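To make the "both directions" point concrete, here is a minimal sketch in plain Python. It is not XFST or FOMA code and uses neither tool's API; the class `ToyFST`, the tags `+N`, `+Sg`, `+Pl` and the two-word lexicon are all made up for illustration. Each arc pairs a lexical-side symbol with a surface-side symbol, and the same arc table serves both generation and analysis.

```python
from collections import defaultdict


class ToyFST:
    """A tiny, hypothetical finite-state transducer (illustration only)."""

    def __init__(self, start, finals, arcs):
        # arcs: iterable of (source, lexical_symbol, surface_symbol, target);
        # an empty string stands for epsilon (nothing consumed or emitted).
        self.start = start
        self.finals = set(finals)
        self.arcs = defaultdict(list)
        for src, lexical, surface, dst in arcs:
            self.arcs[src].append((lexical, surface, dst))

    def _apply(self, text, down):
        # Depth-first search over (state, input position, output so far).
        # Assumes the machine has no epsilon-input cycles.
        results = set()
        stack = [(self.start, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(text) and state in self.finals:
                results.add(out)
            for lexical, surface, dst in self.arcs[state]:
                # Swap the roles of the two sides depending on direction.
                read, emit = (lexical, surface) if down else (surface, lexical)
                if text.startswith(read, pos):
                    stack.append((dst, pos + len(read), out + emit))
        return results

    def generate(self, lexical_form):
        """Lexical side to surface side, e.g. cat+N+Pl to cats."""
        return self._apply(lexical_form, down=True)

    def analyze(self, surface_form):
        """Surface side to lexical side, e.g. foxes to fox+N+Pl."""
        return self._apply(surface_form, down=False)


fst = ToyFST(
    start=0,
    finals=[9],
    arcs=[
        (0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),
        (3, "+N", "", 4), (4, "+Sg", "", 9), (4, "+Pl", "s", 9),
        (0, "f", "f", 5), (5, "o", "o", 6), (6, "x", "x", 7),
        (7, "+N", "", 8), (8, "+Sg", "", 9), (8, "+Pl", "es", 9),
    ],
)

print(fst.generate("cat+N+Pl"))
print(fst.analyze("foxes"))
```

With this toy machine, generating from `cat+N+Pl` yields `cats` and analyzing `foxes` yields `fox+N+Pl`; a word outside the lexicon produces an empty result set. In XFST or FOMA the arcs would be compiled from lexicons and rules rather than written by hand.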
There are many training resources for the XFST family of formalisms, and it is taught in computational linguistics courses around the world [2]. There is also tool support, e.g. syntax coloring for vim (https://www.vim.org/scripts/script.php?script_id=3441). All this would make the set of potential contributors to a future version of the spell checker vastly larger, compared to requiring interested parties to analyze an obscure ad-hoc format. It would also open up possibilities for new functionality in OpenOffice, e.g. the generation capability could be used to offer a "pluralize word" button.
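A "pluralize word" feature could be sketched as analyze, swap the number tag, then generate. The snippet below is only an illustration under stated assumptions: the tag names (`+N`, `+Sg`, `+Pl`), the function names and the tiny word list are hypothetical, and the relation is written out as a plain set of pairs, whereas a real implementation would query a compiled transducer.

```python
# Hypothetical toy relation: (lexical form, surface form) pairs.
# A compiled transducer would encode such a relation compactly.
RELATION = {
    ("cat+N+Sg", "cat"),
    ("cat+N+Pl", "cats"),
    ("fox+N+Sg", "fox"),
    ("fox+N+Pl", "foxes"),
    ("mouse+N+Sg", "mouse"),
    ("mouse+N+Pl", "mice"),
}


def analyze(surface):
    """Return all lexical forms related to a surface form."""
    return {lex for lex, surf in RELATION if surf == surface}


def generate(lexical):
    """Return all surface forms related to a lexical form."""
    return {surf for lex, surf in RELATION if lex == lexical}


def pluralize(word):
    """Analyze the word, swap +Sg for +Pl, and generate again."""
    plurals = set()
    for analysis in analyze(word):
        if analysis.endswith("+Sg"):
            plurals |= generate(analysis[: -len("+Sg")] + "+Pl")
    return sorted(plurals)


print(pluralize("mouse"))
print(pluralize("fox"))
```

Because irregular forms live in the same relation as regular ones, `mouse` maps to `mice` without any special casing in `pluralize`. Words that are unknown, or already plural, simply return an empty list.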
jll29•15h ago
[1] https://www.amazon.com/Finite-State-Morphology-Kenneth-Beesl...
[2] https://dsacl3-2018.github.io/xfst-demo/ and others (simply search for e.g. "xfst|foma fst")
[3] Hulden, Mans (2009) https://aclanthology.org/E09-2008/ (A Python interface already exists, too: Hulden, M. et al. (2024) https://aclanthology.org/2024.acl-demos.24/ .)