Also, just in case people want to do a further lit review on this topic: they call their method "programmatic data curation," but I believe this approach is also known as model distillation and/or student-teacher training.
We chose a set of tasks with different levels of complexity to see how this approach would scale. For LLMs, the "challenge" with NER is not the task itself but the arbitrariness of the labels in the dataset. I agree it's still much simpler than the other tasks we present (agentic RAG, agentic tool use, maze navigation).
There are definitely strong parallels to model distillation and student-teacher training, with the primary difference being that we don't simply take all the data from the larger model but rather filter the dataset based on metrics from the environment. In the "Does curation even matter?" section, we show that this generally improves the result by a good margin.
We link to Vicuna, which might be the closest reference as prior art: https://lmsys.org/blog/2023-03-30-vicuna/
Thanks!
But broadly speaking, yes, we generate data using a large model, curate the best samples using metrics from the environment, and fine-tune on that data. This isn't a novel technique from an academic perspective; our focus is on applying it to different use cases (e.g. agentic RAG, agentic tool use) and models (OpenAI, Google, Qwen).
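For the curious, here's a minimal sketch of that generate → curate → fine-tune loop. Everything here is illustrative: `teacher_generate`, `env_score`, and the 0.8 threshold are placeholders I made up, not the actual implementation from the post.

```python
import json
import random

def teacher_generate(task: str) -> dict:
    """Placeholder for sampling a trajectory from the large 'teacher' model
    (e.g. via an API call). Returns the prompt plus the model's response."""
    return {"prompt": task, "response": f"<teacher output for {task!r}>"}

def env_score(sample: dict) -> float:
    """Placeholder for the environment metric, e.g. exact-match reward,
    task success in a RAG/tool-use harness, or maze completion."""
    return random.random()

# 1. Generate candidate training data with the teacher model.
tasks = ["task-1", "task-2", "task-3"]
candidates = [teacher_generate(t) for t in tasks for _ in range(4)]

# 2. Curate: keep only samples the environment scores above a threshold
#    (0.8 is an arbitrary illustration, not a recommended value).
curated = [s for s in candidates if env_score(s) >= 0.8]

# 3. Write the curated set in a fine-tuning-friendly format (JSONL),
#    ready for a hosted fine-tuning job or a local trainer.
with open("curated.jsonl", "w") as f:
    for s in curated:
        f.write(json.dumps(s) + "\n")
```

The curation step is the part that distinguishes this from plain distillation: rather than training on everything the teacher emits, you only keep the samples the environment verifies.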
Thanks!
I think this is called “logit distillation”, which is a particular form of distillation but not the only one.
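For anyone unfamiliar with the term, here's a minimal PyTorch-style sketch of what logit distillation looks like (the Hinton-style softened-KL loss; the function name and toy shapes are illustrative, not from the post):

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    distributions over the vocabulary."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across
    # temperatures (as in Hinton et al., 2015).
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Toy usage: batch of 2, vocab of 5.
student_logits = torch.randn(2, 5, requires_grad=True)
teacher_logits = torch.randn(2, 5)
loss = logit_distillation_loss(student_logits, teacher_logits)
loss.backward()
```

Since this loss needs the teacher's raw logits as training targets, it's only an option when you control the teacher's weights, which is exactly why it doesn't work through hosted fine-tuning APIs that only accept text pairs.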
> so you wouldn't be able to do that with fine-tuning APIs (OpenAI + Google in our blog post)
Distillation from competitors' APIs is so common that it has been given a name: “distealing”.
Good luck!
alchemist1e9•4h ago
Anyone gone that route and know of projects with very high-quality curated source materials? Ideally categorized and labeled.