Depends on what we accept as the norm.
250k words at a generous 100 bytes per word is only 25MB of memory...
Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although they are not as fast as hashmaps for retrieving items. You can hash the top 1k most common words and check the rest using a trie.
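A minimal sketch of that hybrid lookup, assuming the word list arrives sorted by frequency. The names `HybridDictionary`, `TrieNode`, and `hot_size` are made up for illustration. It only shows the lookup path: a dict-per-node trie in Python will not deliver the memory savings mentioned above, which come from compact trie implementations.

```python
class TrieNode:
    """One trie node: child links keyed by character, plus an end-of-word flag."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


class HybridDictionary:
    """Hash set for the most common words, trie for everything else."""

    def __init__(self, words_by_frequency, hot_size=1000):
        # Fast path: the hot_size most common words go into a hash set.
        self.hot = set(words_by_frequency[:hot_size])
        # Slow path: the long tail is stored in the trie.
        self.root = TrieNode()
        for word in words_by_frequency[hot_size:]:
            self._insert(word)

    def _insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        if word in self.hot:
            return True
        # Walk the trie one character at a time.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

Usage would look like `"sluices" in HybridDictionary(words, hot_size=1000)`. Most lookups in real text should hit the hash set, since a small number of words account for most tokens.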
The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.
I... did not expect this to be so popular
using this, a combo of "covered enough" for the bit and easy to use
also, since i'm tracking every word (technically a better name for this project would be The Bluesky Corpus) all inflected forms are different words, which aligns with my thinking
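a rough sketch of what "every inflected form is its own word" means in practice. this is not the site's actual code: the regex, the lowercasing, and the names `WORD_RE` and `new_words` are all assumptions for illustration.

```python
import re

# Assumed tokenizer: runs of ASCII letters, with internal apostrophes allowed.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def new_words(post, dictionary, seen):
    """Return dictionary words in `post` that were not seen before.

    Tokens are matched as exact surface forms, so "sluice" and "sluices"
    are tracked as two separate words. `seen` is updated in place.
    """
    found = []
    for token in WORD_RE.findall(post.lower()):
        if token in dictionary and token not in seen:
            seen.add(token)
            found.append(token)
    return found
```

with no stemming or lemmatizing step, "sluices" showing up in a post does not mark "sluice" as seen.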
And what ingress bandwidth do you have?
Ingress is actually pretty manageable, ~900kbps
- Search unseen words
made me chuckle
(The website in question uses jetstream also.)
The dictionary site has only checked 4,920,000 posts, which is 0.28% of all messages.
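For scale, those two figures imply a total message count, worked out below. The variable names are illustrative, and since 0.28% is a rounded figure the result is only approximate.

```python
checked_posts = 4_920_000   # posts the dictionary site has checked
share = 0.28 / 100          # stated fraction of all messages

# If checked_posts is `share` of the total, the total is checked_posts / share.
implied_total = checked_posts / share
print(f"implied total: about {implied_total / 1e9:.2f} billion messages")
```

That comes to roughly 1.76 billion messages overall.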
> We just visited wheal Martyn museum in Cornwall, nice scones and a waterwheel, they also have a lot of gutters, sluices and pipes and a bit of a fixation about China Clay. More importantly they appear to be unattached at the moment
Both "wheal" (kind of cheating, that should be Wheal and is a place name) and "sluices" were new to the dictionary.