The End of the Train-Test Split

https://folio.benguzovsky.com/train-test

35•gmays•2mo ago

Comments

elpakal•2mo ago

> Since the data will always be flawed and the test set won't be blind, the machine learning engineer's priority should be spent working with policy teams to improve the data.

It's interesting to watch this dynamic shift from dataset-size measuring contests to quality and representativeness. In "A small number of samples can poison LLMs of any size", Anthropic hits on the same shift, but their position is more about security considerations than quality.

https://www.anthropic.com/research/small-samples-poison

henning•2mo ago

> Two months later, you've cracked it

Hehe.

roadside_picnic•2mo ago

> You make an LLM decision tree, one LLM call per policy section, and aggregate the results.

I can never understand why people jump to these weird direct calls to the LLM rather than working with embeddings for classification tasks.

I have a hard time believing that

- the context text embedding

- the image vector representation

- the policy text embedding(s)

cannot be combined to create a classification model. Such a model would likely be several orders of magnitude faster than chaining calls to an LLM, and I wouldn't be remotely surprised to see it perform notably better on the task described.
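As a rough illustration of the embedding approach described above, here is a minimal sketch that concatenates the three vectors into one feature vector and fits a small logistic regression on it. Everything here is an assumption made for illustration: the function names are hypothetical, the vectors are random stand-ins rather than real embeddings, and the hand-rolled trainer stands in for whatever library classifier you would normally use.

```python
import math
import random


def concat_features(context_vec, image_vec, policy_vec):
    """Join the three (hypothetical) embeddings into one feature vector."""
    return list(context_vec) + list(image_vec) + list(policy_vec)


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X, y, epochs=200, lr=0.1):
    """Plain batch gradient descent on the logistic loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return True when the model scores the item as a policy violation."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


def fake_example(rng, violating, dim=4):
    # Synthetic stand-in data: violating items are shifted in the context
    # and image dimensions; the policy dimensions are pure noise.
    shift = 1.0 if violating else -1.0
    context = [rng.gauss(shift, 1.0) for _ in range(dim)]
    image = [rng.gauss(shift, 1.0) for _ in range(dim)]
    policy = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return concat_features(context, image, policy)


rng = random.Random(0)
labels = [i % 2 for i in range(200)]
X = [fake_example(rng, bool(lbl)) for lbl in labels]

# Fit on the first 150 items, score the remaining 50.
w, b = train_logreg(X[:150], labels[:150])
held_out_acc = sum(
    predict(w, b, x) == bool(lbl) for x, lbl in zip(X[150:], labels[150:])
) / 50
print(f"held-out accuracy: {held_out_acc:.2f}")
```

Inference here is a single dot product per item, which is where the speed argument comes from. Whether it would actually beat chained LLM calls on a real moderation task is the commenter's conjecture and is not something this toy data can show.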

I have used LLMs as classifiers, and it does make sense in cases of extremely limited data (though they rarely work well enough). But if you're going to be calling the LLM in such complex ways, it's better to stop thinking of this as a classic ML problem and instead think of it as an agentic content moderator.

In this case you can ignore the train/test split in favor of evals, which you would create as you would for any other LLM agent workflow.
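For anyone unfamiliar with what "evals" look like in practice, here is a minimal sketch: a fixed list of labeled cases is run through the moderator, and the harness reports precision, recall, and the individual failures. The names `EvalCase`, `toy_moderator`, and `run_evals` are hypothetical, and the keyword matcher is only a stand-in for a real LLM-backed moderator.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    content: str
    expected_violation: bool


def toy_moderator(content: str) -> bool:
    """Hypothetical stand-in; a real moderator would call a model here."""
    return "buy followers" in content.lower()


def run_evals(moderator, cases):
    """Run every case through the moderator and summarize the results."""
    tp = fp = fn = 0
    failures = []
    for case in cases:
        got = moderator(case.content)
        if got and case.expected_violation:
            tp += 1
        elif got and not case.expected_violation:
            fp += 1
            failures.append(case.content)
        elif not got and case.expected_violation:
            fn += 1
            failures.append(case.content)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall, "failures": failures}


# Made-up cases for illustration only.
CASES = [
    EvalCase("Buy followers cheap, limited offer", True),
    EvalCase("BUY FOLLOWERS now", True),
    EvalCase("Get 10k fans overnight, DM me", True),
    EvalCase("How do I grow my followers organically?", False),
]

report = run_evals(toy_moderator, CASES)
print(report)
```

The failure list tends to be the useful part: each miss becomes either a prompt or policy fix, or a new case added to the suite.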

stephantul•2mo ago

I don’t really believe this is a paradigm shift with regard to train/test splits.

Before LLMs you would do a lot of these things; it’s just become a lot easier to get started without training anything. What the author describes is very similar to the standard ML product loop in companies, including it being very difficult to “beat” the incumbent model because it has been overfit on the test set that is used to compare the incumbent to your own model.
