Very interesting topic though.
I had the same question, but more generally - is this an American company thing? I can't imagine not wanting to tap the vastly larger Android market, especially for an app like this, which could be marketed as a fun English learning game.
“Double lock” > “clasp” > “grab” > “dislodge”
It’s just a quick example, but I think it follows their “rough synonym” style connections, and it’s not less reasonable than the examples.
To me, it feels like this project is kind of hampered by not having a rigorous definition of what is allowable, and then mixing in the sort of random effects of an LLM
Since this is getting eyeballs here, I will look for some less tortured long-paths to add as examples.
(It is a game best played with a grandparent's pre-war dictionary before tea-time)
The only good news is it works offline.
You could either store the word graph as a partitioned set of S3 buckets, or have a back-end that serves individual words and does rate-limiting. I guess that the back-end might be better to avoid surprise egress charges from anyone trying to download the entire dataset.
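For the back-end option, a per-client token bucket is one common way to do the rate limiting. This is only a minimal sketch: the class name, capacity, and refill rate are hypothetical, and nothing here reflects how the app is actually served.

```python
import time
from typing import Optional


class TokenBucket:
    """Per-client token bucket: allows short bursts, caps the sustained rate."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now: Optional[float] = None):
        self.capacity = capacity              # maximum burst size (hypothetical value)
        self.refill_per_sec = refill_per_sec  # sustained lookups per second
        self.tokens = capacity                # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if one word lookup may be served at time `now`."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A server would keep one bucket per client key (IP or API token) and reject the request when `allow()` returns False, which bounds how fast anyone can crawl the whole graph.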
I want to try out the game but I'm discouraged by the download size.
[0] https://storage.googleapis.com/books/ngrams/books/datasetsv3...
We use the top Google Ngrams in two ways: (a) we share them in the reference mode of our app, i.e. common words before or after; (b) we use longer N-grams, like a 4-gram, where possible, to choose literary examples that also show a common pattern.
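To make the "common words before or after" idea concrete, here is a rough sketch of indexing 2-gram counts by their left and right word. The counts are made-up toy numbers, not real Ngram data, and the function names are hypothetical.

```python
from collections import defaultdict


def build_neighbors(bigram_counts):
    """Index bigram counts so each word maps to what precedes and follows it."""
    before = defaultdict(list)  # word -> [(count, previous_word), ...]
    after = defaultdict(list)   # word -> [(count, next_word), ...]
    for (w1, w2), count in bigram_counts.items():
        after[w1].append((count, w2))
        before[w2].append((count, w1))
    return before, after


def top_k(index, word, k=3):
    """Return the k most frequent neighbors, ties broken alphabetically."""
    ranked = sorted(index.get(word, []), key=lambda cw: (-cw[0], cw[1]))
    return [w for _, w in ranked[:k]]


# Toy counts for illustration only.
TOY_COUNTS = {
    ("post", "office"): 50,
    ("press", "office"): 20,
    ("office", "hours"): 30,
    ("office", "space"): 10,
}
before, after = build_neighbors(TOY_COUNTS)
print(top_k(before, "office"))  # words commonly seen before "office"
print(top_k(after, "office"))   # words commonly seen after "office"
```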
I'll also use this post to wish that more people would edit Wiktionary. It has such a good mission (information on all words) and yet there are only like 80 people editing on any given day or whatever. In some languages, it's even the best or most updated dictionary available. The barriers to entry and bureaucracy are really not high for HN audience types.
If it's anything like Wikipedia, there is probably a reason more people aren't working on it, and it's because the existing people discourage it.
https://en.wiktionary.org/wiki/Wiktionary:Criteria_for_inclu...
Which words should be attested? Presumably only uncommon ones? And how is it done, is the "quotes" section the attestation? Is there vandalism to clean up, like people adding their own names to define themselves as awesome? Wiktionary seems to "just work", and I don't really understand what holds it together.
They’ll be unable to effectively patrol or prevent generative updates to the project, and for all intensive porpoises, humans will be unwilling to step foot into disputes, and AI will have free reign to redefine all human knowledge.
From the OP: "This research and computational scale was made possible by $295k NSF SBIR seed funding (#2329817) and $150k Microsoft Azure compute resources." Does that NSF funding mean it's open source? Also, I'm not 100% sure that the quote applies to all the research rather than just one component of it.
> I'll also use this post to wish that more people would edit Wiktionary. It has such a good mission (information on all words) ...
I support open source, contribute to it, and love the spirit of Wiktionary, but I don't understand the practical reality of applying 'wisdom of the crowds' to a dictionary, especially the English edition, for two reasons:
Definitions are highly accurate (complete, correct, consistent), highly precise things - otherwise, what is their value? Assuming Wiktionary is descriptive - reporting the words' actual usage - it takes quite a bit of scholarship, skill, and editorial resources not to mislead people. I can't just write what I think it means - the meaning to me might not match the meaning to the person at the next desk. It takes quite a bit of research, using powerful (and sometimes expensive) tools, and understanding of lexicography to be complete and also precisely correct, including usages in places and times that are mostly unknown to any particular author. Also, writing definitions is tricky: You are using words - which have those aforementioned problems with meaning - to define words. Also, any writing anywhere can be easily misinterpreted - skill and editors are needed to avoid misunderstanding. How is the accuracy and precision problem solved?
Also, in English there are already many authoritative sources, many with a century of professional lexicography behind them by the best in the business. Some are free. There are also meta-lookup engines such as Wordnik and OneLook. Why use Wiktionary? The few times I've compared definitions or etymologies, the authoritative sources almost always exceed or equal Wiktionary (though online copies of older print editions suffer from the minimalism caused by the constraint of printing costs). Arguably, there is nothing else both unabridged and free: Oxford unabridged costs $, so does Merriam-Webster (the free edition is abridged); American Heritage is free, but has the minimalism issue I mentioned above.
I enjoy etymology, maybe too much. It's like magic, finding out what a barrow was, or how filibuster has a direct lineage to pirates (freebooters... In Dutch.)
I can't afford, really, the nicer Old English, Scandi, Frisian, Norse, etc. etymology dictionaries. I have incomplete scans that were printed and bound of some of them. I still have 6 etymology dictionaries, so I can be about as quick getting a dictionary as getting on the computer and going to !eo.
Sociologically speaking, however, it is precisely that agreement that evolves alongside changes in spelling, pronunciation (and occasionally "new" words).
A few things.
>we have to agree on a decent subset of overall definitions.
Yes but we should fairly obviously understand that a word can have multiple, often competing meanings, and make an effort to learn the new ones as they become available.
As language shifts, and it's shifted rapidly in my own lifetime, you can either make an effort to keep up, or be a sourpuss and refuse to understand changes in language.
It seems to me there's usually a political dimension to people who refuse to understand what people mean, because it's easier to denigrate people if they cling to definitions that aren't intended by their political opponents' use of a word.
I see this shit constantly, mind. Gender. Liberty. Capitalism. Communism. People get stuck fighting useless battles over the right to define a word instead of just learning and embracing their opponent's intention.
and to an extent, the rest of your comment - the solution, according to my PhD friend, is to establish the framing of the argument before you actually have the argument. It's more fun to not establish framing, but it's more effective to establish framing first. I wonder if I have the publication (thesis?) he made on my NAS.
Unabridged dictionaries take decades to release new editions and are still navigating transition into the exploding digital age. They are so expansive in scope, while often so limited in resources, and barely accept any crowd contributions. Such deliberately slow-going is often a good thing, but words also change quite quickly and these sources are now playing a very long game of catch-up. (Yesterday I tried to verify the latter English senses of "fandango" on Wiktionary with other dictionaries; OED's entry has not been touched for 131 years! What am I going to do with that, I need to use / understand the word now!)
Wiktionary is the big web-native word-resource (and is not cluttered with commercial junk) – allowing links, expandable quotes, images, diagrams, etc. that print's minimalism suffers from as you mention. When someone in 2025 wants information on a word, they'll likely use a search engine and click a link to Wiktionary (where Google blurbs steal some data from). Maybe they are a student wanting to confirm their nonstandard pronunciation with the IPA (still rarely used in mainstream English dictionaries) or if it's recognized in their own dialect (mainstream dictionaries rarely provide more than UK and US pronunciations) – if enough people have the same question, Wiktionary seems like the best place to put the answer – or see an accessible etymology tree. While you probably know this, it's also worth reminding that English Wiktionary isn't just for English words, it is a dictionary of all languages' words, which is written in English. It has metadata and links connecting languages' words that you can't find elsewhere.
Yes, I indeed do want people to just write what they think a word means – as a starting point in a collaborative refining process. I believe the number of word-users in the world with valuable potential contributions is a lot closer to a billion than the thousand gatekeepers working hard on classical dictionaries. The barrier to entry is really low, but the tooling could still be much better. This is one reason I'm putting my appeal under this article - because I think (professional) lexicography can stand to evolve more in the 21st century. (And are people today really buying enough dictionaries to sustain a professional version of Wiktionary, or even a professional dictionary offered in structured data form?) If we don't contribute to a crowdsourced dictionary, then we won't have any such thing.
(Meta-lookup sites are link/search engines, not dictionaries and IME really don't do a good job synthesizing their information or conventions.)
> Unabridged dictionaries take decades to release new editions and are still navigating transition into the exploding digital age.
OED is now a 100% online service - a website - that releases updates every quarter, like much software. I don't see them 'still navigating' at all.
> barely accept any crowd contributions.
OED is famous for being arguably the first crowd-sourced research project. James Murray, the first great editor and driving force behind the first edition, solicited contributions from the public of usages of words and had a massive filing system of slips with all the contributions.
"Dictionary work relied on so much correspondence that a post box was installed right outside Murray’s Oxford home ...". "His children (eventually there were eleven) were paid pocket money to sort the dictionary slips into alphabetical order upon arrival." [0]
Today OED still solicits contributions, including specific appeals to the public. Every entry in the OED has a 'Contribute' button.
https://www.oed.com/information/using-the-oed/contributing-t...
> (Yesterday I tried to verify the latter English senses of "fandango" on Wiktionary with other dictionaries; OED's entry has not been touched for 131 years! What am I going to do with that, I need to use / understand the word now!)
You are misunderstanding what 'revise' means to the OED (which is unnecessarily confusing); they still update entries without a full revision. If you look at the entry history:
fandango, n. was first published in 1894; not yet revised.
fandango, n. was last modified in March 2025.
> I don't think definitions "are" highly accurate precise things. Sometimes yes. The same scholarship, skill, and need to not mislead also applies for so many other things: encyclopedic articles, taxonomies, news, maps, operating systems. Do people still question the value of Wikipedia, OpenStreetMap?
I think there's a difference between requirements - or expectations - for a dictionary and Wikipedia:
My guess is that people don't question Wikipedia because they have different expectations for it: They don't expect accuracy, as defined by the Three Cs: Completeness, Correctness, Consistency. Wikipedia is more the accumulation of information generally believed about a topic (with some standards, imperfectly followed, for secondary source support - but secondary sources reflect general, consensus belief). It's not expected to be Complete; no encyclopedia can completely cover any topic - the point is to be a starting place, a summary - and anyway Wikipedia is a sort of work in progress. It's not expected to be Correct; it's what people generally believe. And Consistency is tough with so many authors. It's really a product of the post-truth era; that's what people want - just try questioning it.
People's expectation for dictionaries - or my expectation at least :) - is not a starting point but the final word. Almost always I already have an idea of what the word means - from partial knowledge, from experience, from context, from its components. I'm expecting the Three Cs from the dictionary, to put a fine point on my understanding and use of the word, to fill in my blind spots - including knowledge of how others have been understanding and using the word.
Maybe Wiktionary just isn't for me. But I worry that people do assume it's CCC - many people believe anything they read is accurate, especially something from an authoritative-looking source - and are confused by it.
[0] https://www.oed.com/information/about-the-oed/history-of-the...
I can answer that one. I have free access to the Oxford English Dictionary (OED), which is brilliant and generally more detailed and reliable than Wiktionary when it has the word I'm looking for, but their login page is so awful that I sometimes use en.wiktionary.org instead just to save my time and temper. Also, en.wiktionary.org has proper nouns, other languages, and occasionally it has some recent or technical English word that OED does not have. So if I'm doing some serious amateur research: OED. But if I'm doing a crossword and want to check that a word exists and is spelt how I think it is: Wiktionary.
I've used the OED login page: username, pw, [] keep me logged in. What is so awful?
Years ago, I wrote a puzzlehunt puzzle that involved navigating through words where an edge existed if the two words formed a common 2-gram (that is, they often appeared one after another in a text dump of Wikipedia).
For example, a fragment of the graph from the puzzle is: mit -> press -> office <- post <- blog.
This work is obviously much more advanced, and it's very cool to see that they managed to make it work with semantic connections. I was able to get away with a much simpler approach since I only cared about 2-grams over a set of about 1000 words (I literally used a grep command over the entire text of the English Wikipedia; it took about a day to run).
But the core idea is shared: 1) wanting to build a graph representation of word connections for a puzzle, 2) it being way too much work to do that manually, 3) you would miss a bunch of edges if you did do it manually, so 4) use programming tools to construct a dataset, and then 5) the end result is surprisingly fun for the user because the dataset is comprehensive and it feels really natural.
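For anyone who wants to try the 2-gram approach without grep, a small sketch of the same idea follows. It is not the original script: the function name, tokenizer, and `min_count` threshold are assumptions, and the sample text is invented to reproduce the fragment above.

```python
import re
from collections import Counter


def bigram_edges(text, vocab, min_count=1):
    """Return directed edges (a, b) where a is immediately followed by b
    at least `min_count` times and both words are in the puzzle vocabulary."""
    # Crude tokenizer: lowercase runs of letters; punctuation is ignored,
    # so pairs can span sentence boundaries.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        (a, b) for a, b in zip(tokens, tokens[1:]) if a in vocab and b in vocab
    )
    return {pair for pair, n in counts.items() if n >= min_count}


# Invented sample text, standing in for a Wikipedia dump.
SAMPLE = (
    "The MIT press office is near the post office. "
    "My blog post praised the MIT press and the post office."
)
VOCAB = {"mit", "press", "office", "post", "blog"}
edges = bigram_edges(SAMPLE, VOCAB)
print(sorted(edges))
```

On a real corpus `min_count` would need to be much higher than 1 so that only common 2-grams become edges.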
If anyone is curious, the puzzlehunt puzzle is here: https://dhashe.com/files/puzzles/word-wide-web.pdf
And the solution is here: https://dhashe.com/files/puzzles/word-wide-web-sol.pdf
And a fair warning to anyone unfamiliar with puzzlehunt puzzles: they do not come with instructions and it is very common to get stuck when solving them, especially when solving them alone. You have not completely solved a puzzlehunt puzzle until you extract an answer word or phrase from the puzzle. This one has an extra layer after filling in the words in the graph. Peeking at the solution is encouraged if you get stuck.
I know this is not related to the app but still wanted to appreciate the thought
Is there anything the user could do to modify the next steps, other than picking a word? Perhaps selecting some sort of valence related to metaphor or meaning? "I want to pick 'pacify', but in the sense of calming down, not to utterly destroy."
On a shorter horizon, I can tune the probability that on-path terms appear in the cloud. We store a larger pool of words than are displayed, and calculate lookaheads (and lookbacks from the target).
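One way the lookback and the tunable probability could fit together, purely as a guess: run a breadth-first search backwards from the target to get each word's distance to it, then give extra sampling weight to candidates that move closer. The function names and the `on_path_boost` parameter are hypothetical.

```python
import random
from collections import deque


def distances_to_target(graph, target):
    """Lookback: BFS over reversed edges, giving hops from each word to target."""
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        word = queue.popleft()
        for prev in reverse.get(word, []):
            if prev not in dist:
                dist[prev] = dist[word] + 1
                queue.append(prev)
    return dist


def pick_cloud(graph, current, target, size, on_path_boost=3.0, rng=None):
    """Sample up to `size` neighbors of `current` without replacement,
    weighting words that reduce the distance to `target` by `on_path_boost`."""
    rng = rng or random.Random()
    dist = distances_to_target(graph, target)
    here = dist.get(current)
    pool = list(graph.get(current, []))

    def weight(word):
        closer = here is not None and word in dist and dist[word] < here
        return on_path_boost if closer else 1.0

    chosen = []
    while pool and len(chosen) < size:
        pick = rng.choices(pool, weights=[weight(w) for w in pool])[0]
        pool.remove(pick)
        chosen.append(pick)
    return chosen
```

Raising `on_path_boost` makes the cloud more helpful; setting it to 1.0 makes on-path words no more likely than any other neighbor.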
But maybe that adds an entirely new normalization function - user types 'runs' or 'ran', the app has to normalize to 'run'.
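A normalizer for typed input does not have to be elaborate if it only needs to land on words already in the graph. A minimal sketch, assuming a small hand-made table of irregular forms plus naive suffix stripping checked against the vocabulary (the table and function name are made up for illustration):

```python
# Tiny hypothetical table; a real one would come from a lemma list.
IRREGULAR = {"ran": "run", "went": "go", "mice": "mouse"}


def normalize(word, vocab):
    """Map typed input to a word in `vocab`, or return None if nothing fits."""
    w = word.strip().lower()
    if w in vocab:
        return w
    lemma = IRREGULAR.get(w)
    if lemma in vocab:
        return lemma
    for suffix in ("ing", "ed", "es", "s"):
        if not w.endswith(suffix):
            continue
        stem = w[: -len(suffix)]
        candidates = [stem, stem + "e"]          # "dislodged" -> "dislodge"
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])         # "running" -> "run"
        for candidate in candidates:
            if candidate in vocab:
                return candidate
    return None
```

Because every candidate is checked against the vocabulary, bad guesses from the suffix rules are simply discarded rather than shown to the user.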
The app could just have a 'more words' button, loading the next 17.
I had tried hard to pick a set of fairly simple words, thinking I had an intricately unique association in my head, only to find out that the reported connections were nothing more than average. My partner obviously landed in an extremely high percentile by instantly picking the first words that came to her without much thought.
My words were: apple, shotgun, stardust, anger, hygiene, etymology, proctology, slant, dictator, and displacement.
Seriously, when is the last time a casual speaker, writer, or translator used “domicile” in place of “house” in your world? It’s an archaic term appropriated into legal jargon. Flattening out language and drawing lines between terms is funny to me.
The only issue is normalizing “Thesaurus bashing” type mentalities - like this - to degrade the value of coherent, purposeful, meaningful use of English. It’s an amalgamation language with extremely difficult fluency. It’s rife with idioms and contradictory emotional context.
Oh well, I can grasp that I tend to yell at clouds when it comes to this sort of thing. It doesn’t change my opinion this is a harmful exercise and probably should not exist. There are few instances where playing a game will actually make one more stupid, but here we are.
It’d be highly valuable as a thesaurus API.
In French, there is a game to build relations between words (they provide a word, and you have to type the most related words): https://www.jeuxdemots.org They reached 677 million relations in 2024!
Alpha Omega
Sticky Terms (I struggle with this)
Typeshift
Blackbar (old, not maintained, but we can still play. Not a game in strict sense, very enjoyable)
I suppose other languages have way less word games than English?
michaeld123•1d ago
marviel•1d ago
Which embedding types did you try? I'm surprised that embeddings weren't able to take you further with this.
michaeld123•1d ago
gagzilla•1d ago
michaeld123•1d ago
* Technical jargon → unrelated obscurities: gryllacridid (cricket family) → microclots
* Proper nouns → common words: Trish Stratus (wrestler) → federating
* Numbers → anything: 9451 → shoulds
Mostly hyper-specific terms with few inbound connections, obscure conjugations, or rare idioms.
Superconnectors: We systematically removed generic hubs, but your question prompted us to analyze which words still act as natural bridges. Added it to the article with an interactive explorer! Top survivors:
* polish (0.18% of paths) - verb/nationality homograph
* symbiosis (0.14%) - biology → cooperation bridge
* treaty (0.13%) - conflict → resolution bridge
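A "% of paths" figure like the ones above can be approximated by taking one shortest path per word pair and counting how often each word shows up as an interior step. The sketch below is an assumption about the general technique, not the method actually used for the article; the graph and pairs are toy data.

```python
from collections import Counter, deque


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal via BFS, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def bridge_share(graph, pairs):
    """Fraction of connected pairs whose path passes through each word
    as an interior node (endpoints are not counted)."""
    through = Counter()
    total = 0
    for a, b in pairs:
        path = shortest_path(graph, a, b)
        if path is None:
            continue
        total += 1
        through.update(path[1:-1])
    return {w: n / total for w, n in through.items()} if total else {}


# Toy graph: "treaty" bridges conflict words and resolution words.
TOY = {"war": ["treaty"], "feud": ["treaty"], "treaty": ["peace", "truce"],
       "peace": [], "truce": []}
share = bridge_share(TOY, [("war", "peace"), ("feud", "truce"), ("war", "treaty")])
print(share)
```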
Thanks for the curiosity—it led to an interesting addition. Age correlation: No hard data, but I suspect you're right. Older words have had centuries to accumulate meanings and develop polysemous bridges.