Also, just in case people want to do a further lit review on this topic: they call their method "programmatic data curation," but I believe this approach is also known as model distillation and/or student-teacher training.
We chose a set of tasks with different levels of complexity to see how this approach would scale. For LLMs, the "challenge" with NER is not the task itself but the arbitrariness of the labels in the dataset. I agree it's still much simpler than the other tasks we present (agentic RAG, agentic tool use, maze navigation).
There are definitely strong parallels to model distillation and student-teacher training, with the primary difference being that we don't simply take all the data from the larger model but rather filter the dataset based on metrics from the environment. In the "Does curation even matter?" section, we show that this generally improves the result by a good margin.
We link to Vicuna, which might be the closest reference as prior art: https://lmsys.org/blog/2023-03-30-vicuna/
Thanks!
But broadly speaking, yes, we generate data using a large model, curate the best samples using metrics from the environment, and fine-tune on that data. This isn't a novel technique from an academic perspective; our focus is on applying it to different use cases (e.g. agentic RAG, agentic tool use) and models (OpenAI, Google, Qwen).
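For the curious, here's a minimal sketch of that generate → curate → fine-tune loop. Everything here is illustrative: `teacher_generate`, `env_score`, and the 0.8 threshold are placeholders I made up, not the actual implementation from the post.

```python
import json
import random

def teacher_generate(task: str) -> dict:
    """Placeholder for sampling a trajectory from the large 'teacher' model
    (e.g. via an API call). Returns the prompt plus the model's response."""
    return {"prompt": task, "response": f"<teacher output for {task!r}>"}

def env_score(sample: dict) -> float:
    """Placeholder for the environment metric, e.g. exact-match reward,
    task success in a RAG/tool-use harness, or maze completion."""
    return random.random()

# 1. Generate candidate training data with the teacher model.
tasks = ["task-1", "task-2", "task-3"]
candidates = [teacher_generate(t) for t in tasks for _ in range(4)]

# 2. Curate: keep only samples the environment scores above a threshold
#    (0.8 is an arbitrary illustration, not a recommended value).
curated = [s for s in candidates if env_score(s) >= 0.8]

# 3. Write the curated set in a fine-tuning-friendly format (JSONL),
#    ready for a hosted fine-tuning job or a local trainer.
with open("curated.jsonl", "w") as f:
    for s in curated:
        f.write(json.dumps(s) + "\n")
```

The curation step is the part that distinguishes this from plain distillation: rather than training on everything the teacher emits, you only keep the samples the environment verifies.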
Thanks!
I think this is called “logit distillation”, which is a particular form of distillation but not the only one.
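For anyone unfamiliar with the term, here's a minimal PyTorch-style sketch of what logit distillation looks like (the Hinton-style softened-KL loss; the function name and toy shapes are illustrative, not from the post):

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions over the vocabulary."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across
    # temperatures (as in Hinton et al., 2015).
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Toy usage: batch of 2, vocab of 5.
student_logits = torch.randn(2, 5, requires_grad=True)
teacher_logits = torch.randn(2, 5)
loss = logit_distillation_loss(student_logits, teacher_logits)
loss.backward()
```

Since this loss needs the teacher's raw logits as training targets, it's only an option when you control the teacher's weights, which is exactly why it doesn't work through hosted fine-tuning APIs that only accept text pairs.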
> so you wouldn't be able to do that with fine-tuning APIs (OpenAI + Google in our blog post)
Distillation from competitors' APIs is so common that it has been given a name: “distealing”.
Good luck!
alchemist1e9•4h ago
Anyone gone that route and know of projects with very high-quality curated source materials? Ideally categorized and labeled.