Depends on what we accept as the norm.
250k words at a generous 100 bytes per word is only 25MB of memory...
Tries are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although they are not as fast as hashmaps for lookups. You can keep the top 1k most common words in a hashmap and check the rest using a trie.
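Rough sketch of that hybrid lookup, assuming the word list is already sorted by frequency. The names (HybridDictionary, hot_size) are made up for illustration. Note that a dict-per-node trie like this one will not show the memory savings in plain Python; those come from compact implementations (array-backed or compressed tries), so this only shows the lookup logic.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a word ends at this node


class HybridDictionary:
    """Hash set for the most common words, trie for everything else."""

    def __init__(self, words_by_frequency, hot_size=1000):
        # Assumption: words_by_frequency is ordered most common first.
        self.hot = set(words_by_frequency[:hot_size])
        self.root = TrieNode()
        for word in words_by_frequency[hot_size:]:
            self._insert(word)

    def _insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        # Fast path: the common words are a single hash lookup.
        if word in self.hot:
            return True
        # Slow path: walk the trie one character at a time.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

Usage would be something like `known = HybridDictionary(words)` and then `token in known` for each token; anything that returns False is an unseen word.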
The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.
I... did not expect this to be so popular
- Search unseen words
made me chuckle