I kept running into the same pattern: calling an LLM API thousands of times with the same prompt template, just swapping in different text. Classify this contract clause. Route this support ticket. Categorize this log line.
For teams handling contracts, patient records, or internal logs, sending that data to a third-party API isn't always an option. And at scale, you're paying per-token for what's essentially pattern matching.
So I built a CLI that trains a small local classifier from labeled examples. Give it 50 input/output pairs and it trains a ~230KB model on your machine; after that, inference runs locally with no network calls. Everything runs on Node.js - no Python, no GPU, no Docker.
Under the hood it uses all-MiniLM-L6-v2 for sentence embeddings (a one-time ~80MB download; runs locally after that) and trains a small neural network on top of your labels.
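To make the "small neural network on top" concrete, here's a minimal sketch of the idea - not the actual implementation, and all names and hyperparameters are made up. It trains a softmax (multinomial logistic) classifier over fixed embedding vectors; the tiny 3-dim vectors below stand in for real 384-dim MiniLM embeddings:

```typescript
// Sketch: a softmax classifier head trained on top of fixed sentence
// embeddings. Hypothetical names; vectors are toy stand-ins for MiniLM output.

type Example = { embedding: number[]; label: number };

function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

class SoftmaxClassifier {
  weights: number[][]; // weights[c][d]: class c, embedding dimension d
  biases: number[];

  constructor(numClasses: number, dim: number) {
    this.weights = Array.from({ length: numClasses }, () => new Array(dim).fill(0));
    this.biases = new Array(numClasses).fill(0);
  }

  predictProba(embedding: number[]): number[] {
    const logits = this.weights.map(
      (w, c) => w.reduce((s, wi, d) => s + wi * embedding[d], this.biases[c]),
    );
    return softmax(logits);
  }

  // Plain gradient descent on cross-entropy loss, one example at a time.
  train(data: Example[], epochs = 200, lr = 0.5): void {
    for (let e = 0; e < epochs; e++) {
      for (const { embedding, label } of data) {
        const p = this.predictProba(embedding);
        for (let c = 0; c < this.weights.length; c++) {
          const grad = p[c] - (c === label ? 1 : 0); // dLoss/dLogit_c
          this.biases[c] -= lr * grad;
          for (let d = 0; d < embedding.length; d++) {
            this.weights[c][d] -= lr * grad * embedding[d];
          }
        }
      }
    }
  }

  predict(embedding: number[]): number {
    const p = this.predictProba(embedding);
    return p.indexOf(Math.max(...p));
  }
}

// Toy usage: two classes that point in different embedding directions.
const clf = new SoftmaxClassifier(2, 3);
clf.train([
  { embedding: [1, 0, 0], label: 0 },
  { embedding: [0.9, 0.1, 0], label: 0 },
  { embedding: [0, 0, 1], label: 1 },
  { embedding: [0.1, 0, 0.9], label: 1 },
]);
console.log(clf.predict([0.95, 0.05, 0])); // class 0
```

The appeal of this shape is that the embedding model does all the heavy lifting, so the trainable part stays tiny - which is how the saved model can be a few hundred KB.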
For topic/domain classification - where the categories cover genuinely different subject matter - I'm seeing 80-95% accuracy with 50 examples. It struggles with sentiment and tone (44-50%), because "amazing camera" and "terrible camera" produce nearly identical embedding vectors. I've documented this openly in the benchmarks.
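That failure mode is easy to check yourself: compare cosine similarities between embeddings. A sketch below - the vectors are made-up 4-dim stand-ins chosen to illustrate the pattern (real MiniLM embeddings are 384-dim, and the actual numbers will differ), but the shape of the result is what the benchmarks show:

```typescript
// Cosine similarity between two embedding vectors. Topic classification works
// because different topics point in different directions; flipping sentiment
// barely moves the vector, so similarity stays close to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Illustrative toy vectors (NOT real MiniLM output):
const amazingCamera = [0.8, 0.5, 0.2, 0.1];  // "amazing camera"
const terribleCamera = [0.8, 0.5, 0.1, 0.2]; // "terrible camera" - nearly same direction
const bankingQuery = [0.1, 0.2, 0.9, 0.4];   // different topic entirely

console.log(cosineSimilarity(amazingCamera, terribleCamera).toFixed(2)); // 0.99
console.log(cosineSimilarity(amazingCamera, bankingQuery).toFixed(2));   // much lower
```

A linear classifier over embeddings can only separate what the embedding space separates, so topic splits are easy and sentiment splits are not.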
The benchmarks use real text from AG News (127K articles) and 20 Newsgroups (18K posts), with only 50 training samples drawn from each. The test harness and all fixture data are in the repo - clone it and run npx tsx tests/harness/run.ts to reproduce.
This isn't trying to replace LLMs. It's specifically for the repetitive classification tasks where the same prompt structure processes different data every time.
Open source, Apache 2.0. Still early. Curious whether anyone has tried similar embedding+classifier approaches for their own workflows, or if there's demand for multi-label classification.