Depends on what we accept as the norm.
250k words at a generous 100 bytes per word is only 25MB of memory...
Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although they are not as fast as hashmaps for retrieving items. You can hash the top 1k most common words and check the rest using a trie.
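A rough sketch of that hybrid lookup, in case it helps. The class name `HybridDictionary`, the `hot_size` parameter, and the assumption that the word list arrives sorted most-common-first are all made up for illustration. Note that a dict-of-dicts trie in Python shows the lookup logic only; the memory savings would need a compact trie implementation.

```python
END = ""  # sentinel key marking "a word ends here"; cannot collide with a 1-char key


class HybridDictionary:
    """Hash set for the most common words, trie for everything else."""

    def __init__(self, words_by_frequency, hot_size=1000):
        # Assumption: words_by_frequency is sorted most-common first.
        self.hot = set(words_by_frequency[:hot_size])
        self.trie = {}
        for word in words_by_frequency[hot_size:]:
            self._insert(word)

    def _insert(self, word):
        node = self.trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def __contains__(self, word):
        # Fast path: common words hit the hash set.
        if word in self.hot:
            return True
        # Slow path: walk the trie one character at a time.
        node = self.trie
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return END in node
```

Usage would look like `d = HybridDictionary(words, hot_size=1000)` followed by `"catalog" in d`. Prefixes that are not words themselves (e.g. `"cata"`) come back as not present.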
The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.
I... did not expect this to be so popular
using this, a combo of "covered enough" for the bit and easy to use
also, since i'm tracking every word (technically a better name for this project would be The Bluesky Corpus) all inflected forms are different words, which aligns with my thinking
And what ingress bandwidth do you have?
Ingress is actually pretty manageable, ~900kbps
- Search unseen words
made me chuckle
(The website in question uses jetstream also.)
The dictionary site has only checked 4,920,000 posts, which is 0.28% of all messages.
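For scale, those two numbers imply a total of roughly 1.76 billion messages. A quick back-of-the-envelope check, assuming the 0.28% figure is exact (the variable names are just for illustration):

```python
posts_checked = 4_920_000
fraction_checked = 0.0028  # 0.28%, taken at face value

# Total messages implied by the two figures above.
estimated_total = posts_checked / fraction_checked
print(f"{estimated_total:,.0f}")
```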