To answer, I ended up building ngrep: a grep-like tool that extends regular expressions with a new operator ~(token) that matches a word by meaning using word2vec-style embeddings (FastText, GloVe, Wikipedia2Vec).
A simple demo: "~(big)+ \b~(animal;0.35)+\b" over Moby-Dick can find many ways used to refer to a large animal, surfacing "great whale", "enormous creature", "huge elephant" and so on. Pipe it through sort | uniq -c and the winner is, unsurprisingly, "great whale" :)
Built in Rust on top of the awesome fancy-regex, and ~() composes with all standard operators (negative lookahead, quantifiers, etc.). Currently a PoC with many missing optimizations (e.g: no caching, no compilation to standard regex, etc.), obviously without the guarantees of plain regex and subject to the limits of w2v-style embeddings...but thought it was worth sharing!
jruohonen•2h ago
https://linux.die.net/man/8/ngrep.
xnan•1h ago