
My trick for getting consistent classification from LLMs

https://verdik.substack.com/p/how-to-get-consistent-classification
92•frenchmajesty•1w ago

Comments

jawns•3h ago
If you already have your categories defined, you might even be able to skip a step and just compare embeddings.

I wrote a categorization script that sorts customer-service calls into one of 10 categories. I wrote descriptions of each category, then translated each description into an embedding.

Then I created embeddings for the call notes and matched each one to the closest category using cosine similarity.
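A minimal sketch of that matching step, with toy vectors standing in for real embeddings (the category names and dimensions here are illustrative, not from the actual script):

```python
import math

# Toy stand-ins for real category-description embeddings; in practice these
# would come from an embedding model run over each category's description.
CATEGORY_VECS = {
    "billing": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.1],
    "returns": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # cos(a, b) = a·b / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify(note_vec):
    # Pick the category whose description embedding is closest to the note.
    return max(CATEGORY_VECS, key=lambda c: cosine_similarity(note_vec, CATEGORY_VECS[c]))

print(classify([0.8, 0.2, 0.1]))  # closest to "billing"
```

The same shape works for any fixed category set: embed the descriptions once, then each new note costs one embedding call plus a handful of dot products.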

kurttheviking•3h ago
Out of curiosity, what embedding model did you use for this?
nerdponx•3h ago
How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?
olliepro•2h ago
Sentence embedding models are great for this type of thing.
minimaxir•26m ago
Modern embedding models (particularly those with context windows of 2048+ tokens) allow you to YOLO and just plop the entire text blob in, and you can still get meaningful vectors.

Formatting the input text to have a consistent schema is optional but recommended to get better comparisons between vectors.
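A sketch of that formatting step; the field names here are hypothetical, and any template works as long as every input follows the same one:

```python
def format_for_embedding(record):
    # Hypothetical schema. The point is consistency: every record is rendered
    # with the same fields in the same order before being embedded, so
    # structural noise doesn't leak into the vector comparison.
    return (
        f"Author: {record.get('author', 'unknown')}\n"
        f"Date: {record.get('date', 'unknown')}\n"
        f"Text: {record['text']}"
    )

doc = format_for_embedding({"author": "alice", "text": "I love burgers"})
print(doc)
```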

svachalek•3h ago
That was my first thought, why even generate tags? Curious to see if anyone's proved it's worse empirically though.
soldeace•2h ago
In a recent project I was asked to create a user story classifier to identify whether stories were "new development" or "maintenance of existing features". I tried both approaches, embeddings + cosine distance vs. directly asking a language model to classify the user story. The embeddings approach was, despite being fueled by the most powerful SOTA embedding model available, surprisingly worse than simply asking GPT 4.1 to give me the correct label.
frenchmajesty•2h ago
OP here. It depends on what you use it for. You do want the tags if you intend to generate data. Let's say you prompt an LLM to tweet on your behalf for a week, having the ability to:

- Fetch a list of my unique tags to get a sense of my topics of interest

- Have the AI dig into those specific niches to see what people have been discussing lately

- Craft a few random tweets that are topic-relevant and present them to me to curate

That is a very powerful workflow, and one that is hard to deliver on without the class labels.

copypaper•1h ago
I originally settled on doing this, but the problem is that you have to re-calculate everything if you ever add/remove a category. If your categories will always be static, that will work fine. But it's more than likely you'll eventually have to add another category down the line.

If your categories are dynamic, the way OP handles it will be much cheaper as the number of tweets (or customer service calls in your case) grows, as long as the cache hit rate is >0%. Each tweet gets its own label, e.g. "joke_about_bad_technology_choices". Each of these labels gets put into a category, e.g. "tech_jokes". If you add/remove a category you would still need to re-calculate, but only the mapping of labels to categories, as opposed to every single tweet. Since similar tweets can share the same label, you end up with fewer labels than total tweets. As you reach the asymptotic ceiling mentioned in OP's post, the cost to re-embed labels to categories also hits an asymptotic ceiling.

If the number of items you're categorizing is a couple thousand at most and you rarely add/remove categories, it's probably not worth the complexity. But in my case (and OP's) it's worth it, as the number of items grows without bound.
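The two-level scheme described above can be sketched as follows. The tweets, labels, and category-assignment logic are all illustrative (a real version would compare label and category embeddings); the point is that changing the category set only re-computes the label-to-category mapping, never the tweet-to-label one:

```python
# Level 1: tweets -> fine-grained labels (normally produced by the LLM + cache).
tweet_to_label = {
    "JS frameworks change weekly lol": "joke_about_bad_technology_choices",
    "Another day, another npm vuln": "joke_about_bad_technology_choices",
    "New GPU benchmarks are out": "hardware_news",
}

def assign_categories(labels, categories):
    # Stand-in for the embedding comparison: a crude keyword match, purely to
    # show that only len(labels) assignments are recomputed, not len(tweets).
    mapping = {}
    for label in labels:
        mapping[label] = next(
            (c for c in categories if c.split("_")[0] in label), "other"
        )
    return mapping

labels = set(tweet_to_label.values())  # fewer labels than tweets
# Level 2: labels -> coarse categories. Re-run only this when categories change.
label_to_category = assign_categories(labels, ["joke_posts", "hardware_news"])

def category_of(tweet):
    return label_to_category[tweet_to_label[tweet]]

print(category_of("Another day, another npm vuln"))
```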

minimaxir•1h ago
This works in a pinch but is much less reliable than using a curated set of representative examples from each targeted class.
axpy906•3h ago
Arthur’s classifier will only be as accurate as its retrieval. The approach depends on the candidates being the correct ones for classification to work.
frenchmajesty•2h ago
OP here. This is true. If you make your min_score .99 you can have very high confidence in copy-pasting the label, but then this is not very useful. The big question is then how far can you get from 0.99 while still having satisfying results?
sethkim•2h ago
Under-discussed superpower of LLMs is open-set labeling, which I sort of consider to be inverse classification. Instead of using a static set of pre-determined labels, you're using the LLM to find the semantic clusters within a corpus of unstructured data. It feels like "data mining" in the truest sense.
frenchmajesty•2h ago
OP here. This is exactly right! You perfectly encapsulated the idea I stumbled upon.
dinobones•2h ago
Dunno if this passes the bootstrapping test.

This is sensitive to the initial candidate set of labels that the LLM generates.

Meaning if you ran this a few times over the same corpus, you’d probably get different performance depending on the order in which you input the data and the classification tag the LLM ultimately decided upon.

Here’s an idea that is order invariant: embed first, take samples from clusters, and ask the LLM to label the 5 or so samples you’ve taken. The clusters are serving as soft candidate labels and the LLM turns them into actual interpretable explicit labels.

deepsquirrelnet•1h ago
I think a less order biased, more straightforward way would be just to vectorize everything, perform clustering and then label the clusters with the LLM.
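The cluster-first pipeline suggested above can be sketched like this. The greedy threshold clustering and the `label_cluster_with_llm` stub are stand-ins: a real pipeline would use an order-invariant algorithm such as k-means plus an actual LLM call, and the vectors here are toy embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cluster(vectors, threshold=0.9):
    # Greedy clustering for brevity; swap in k-means/HDBSCAN for the
    # order-invariance the parent comments care about.
    clusters = []  # list of (representative_vector, member_indices)
    for i, v in enumerate(vectors):
        for rep, members in clusters:
            if cosine(v, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return clusters

def label_cluster_with_llm(samples):
    # Placeholder for the LLM call that names a cluster from ~5 samples.
    return "LABEL(" + "; ".join(samples[:5]) + ")"

texts = ["I love burgers", "burgers are great", "rust is fast"]
vecs = [[1.0, 0.1], [0.98, 0.15], [0.0, 1.0]]  # toy embeddings
for rep, members in cluster(vecs):
    print(label_cluster_with_llm([texts[i] for i in members]))
```

Here the LLM only names clusters instead of labeling every item, which is where the cost saving comes from.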
frenchmajesty•1h ago
OP here. Yes, that works too and gets you to the same result. It removes the risk of bias, but the trade-off is higher marginal cost and latency.

The idea is also that this would be a classification system used in production whereby you classify data as it comes, so the "rolling labels" problem still exists there.

In my experience though, you can dramatically reduce unwanted bias by tuning your cosine similarity filter.

kpw94•1h ago
Nice!

So the cache check tries to find if a previously existing text embedding has >0.8 match with the current text.

If you get a cache hit here, IIUC, you return that matched text's label right away. But do you also insert a text embedding of the current text in the text embeddings table? Or do you only insert it in case of a cache miss?

From reading the GitHub readme it seems you only "store text embedding for future lookups" in the case of cache miss. This is by design to keep the text embedding table not too big?

frenchmajesty•1h ago
OP here. Yes, that's right. We also insert the current text's embedding on misses, to expand the boundaries of the cluster.

For instance: "I love McDonalds" (1.0), "I love burgers" (0.99), "I love cheeseburgers with ketchup" (?).

This is a crude example, but the last text could end up right at the boundary of similarity to that first label if we did not store the second, which could cause a cluster miss we don't want.

We only store the text on cache misses, though you could do both. I had not considered that idea but it makes sense. I'm not very concerned about dataset size because vector storage is generally cheap (~$2/mo for 1M vectors), and the savings in $$$ not spent generating tokens cover that expense generously.
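The lookup-then-insert-on-miss behavior discussed in this subthread can be sketched as below (threshold, storage, and the `classify` callback are all simplified; the real system presumably uses a vector database rather than a linear scan):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class LabelCache:
    """Sketch of the cache: reuse a stored label on a near match,
    otherwise classify with the LLM and store the new embedding,
    which expands the cluster's boundary for future lookups."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # (embedding, label)

    def lookup_or_insert(self, vec, classify):
        best_label, best_sim = None, -1.0
        for stored_vec, label in self.entries:
            sim = cosine(vec, stored_vec)
            if sim > best_sim:
                best_label, best_sim = label, sim
        if best_label is not None and best_sim >= self.threshold:
            return best_label, True            # cache hit: reuse the label
        label = classify(vec)                  # cache miss: call the LLM
        self.entries.append((vec, label))      # store to expand the cluster
        return label, False

cache = LabelCache()
label1, hit1 = cache.lookup_or_insert([1.0, 0.0], lambda v: "food")
label2, hit2 = cache.lookup_or_insert([0.95, 0.2], lambda v: "never_called")
```

The second lookup hits because its vector is within the 0.8 threshold of the first stored embedding, so the `classify` callback is never invoked for it.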

dan_h•1h ago
This is very similar to how I've approached classifying RSS articles by topic on my personal project[1]. However to generate the embedding vector for each topic, I take the average vector of the top N articles tagged with that topic when sorted by similarity to the topic vector itself. Since I only consider topics created in the last few months, it helps adjust topics to account for semantic changes over time. It also helps with flagging topics that are "too similar" and merging them when clusters sufficiently overlap.

There's certainly more tweaking that needs to be done but I've been pretty happy with the results so far.

1: jesterengine.com

minimaxir•42m ago
There is a flaw with the base problem: each tweet only gets one label, while a tweet is often about many different things and can't be delineated so cleanly. Here's an alternate approach that allows both multiple labels and lower marginal cost (albeit a higher initial cost) for each tweet classified.

1. Curate a large representative subsample of tweets.

2. Feed all of them to an LLM in a single call with the prompt along the lines of "generate N unique labels and their descriptions for the tweets provided". This bounds the problem space.

3. For each tweet, feed it to an LLM along with a prompt like "Here are labels and their corresponding descriptions: classify this tweet with up to X of those labels". This creates a synthetic dataset for training.

4. Encode each tweet as a vector as normal.

5. Then train a bespoke small model (e.g. a MLP) using tweet embeddings as input to create a multilabel classification model, where the model predicts the probability for each label that it is the correct one.

The small MLP will be super fast and cost effectively nothing above what it takes to create the embedding. It saves time/cost from performing a vector search or even maintaining a live vector database.
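Step 5 above can be sketched with a tiny multilabel classifier. For brevity a single sigmoid layer stands in for the MLP, trained in pure Python on toy 2-d "embeddings" with made-up labels (a real version would use a framework and real embedding vectors):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(X, Y, n_labels, epochs=500, lr=0.5):
    # One independent logistic unit per label: multilabel, not multiclass.
    dim = len(X[0])
    W = [[0.0] * dim for _ in range(n_labels)]
    b = [0.0] * n_labels
    for _ in range(epochs):
        for x, y in zip(X, Y):
            for k in range(n_labels):
                p = sigmoid(sum(w * xi for w, xi in zip(W[k], x)) + b[k])
                g = p - y[k]  # gradient of binary cross-entropy
                for j in range(dim):
                    W[k][j] -= lr * g * x[j]
                b[k] -= lr * g
    return W, b

def predict(W, b, x):
    # Returns one independent probability per label.
    return [sigmoid(sum(w * xi for w, xi in zip(W[k], x)) + b[k])
            for k in range(len(W))]

# Toy data: label 0 fires on the first axis, label 1 on the second,
# and one example carries both labels at once (the multilabel case).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1]]
Y = [[1, 0], [0, 1], [1, 1], [1, 0]]
W, b = train(X, Y, n_labels=2)
probs = predict(W, b, [1.0, 0.05])
```

Each output is a per-label probability, so a tweet can cross the decision threshold for several labels at once; inference is a handful of dot products on top of the embedding.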

nreece•21m ago
Am I understanding it right that for each new text (tweet) you generate its embedding first, try to match it against the existing vector embeddings of all other texts (full text or bag of words), and then send the text to the LLM for tag classification only if no match is found; otherwise you assign it the tag of the matched text?

Would it be any better to send a list of the existing tags along with each new text to the LLM, and ask it to pick one of them or generate a new tag? Possibly even skipping embeddings and vector search altogether.
