The End of the Train-Test Split

https://folio.benguzovsky.com/train-test
35•gmays•2mo ago

Comments

elpakal•2mo ago
> Since the data will always be flawed and the test set won't be blind, the machine learning engineer's priority should be spent working with policy teams to improve the data.

It's interesting to watch this dynamic change from dataset-size measuring contests to quality and representativeness. In "A small number of samples can poison LLMs of any size" from Anthropic, they hit on the same shift, but their position is more about security considerations than quality.

https://www.anthropic.com/research/small-samples-poison

henning•2mo ago
> Two months later, you've cracked it

Hehe.

roadside_picnic•2mo ago
> You make an LLM decision tree, one LLM call per policy section, and aggregate the results.

I can never understand why people jump to these weird direct calls to the LLM rather than working with embeddings for classification tasks.

I have a hard time believing that

- the context text embedding

- the image vector representation

- the policy text embedding(s)

cannot be combined to create a classification model that is likely several orders of magnitude faster than chaining calls to an LLM, and I wouldn't be remotely surprised to see it perform notably better on the task described.
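A rough sketch of the embedding route, assuming the three embeddings already exist as plain vectors. Everything here is hypothetical: the vectors are synthetic stand-ins rather than output from a real embedding model, the names (`concat_features`, `train_logreg`, `run_demo`) are made up for illustration, and the classifier is a bare-bones logistic regression trained with SGD, not anything the article describes.

```python
import math
import random

DIM = 4  # hypothetical per-input embedding size; real embeddings are far larger


def concat_features(context_vec, image_vec, policy_vec):
    """Join the three embeddings into a single feature vector."""
    return list(context_vec) + list(image_vec) + list(policy_vec)


def sigmoid(z):
    # Branch on sign so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Predicted probability that the item violates the policy."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, epochs=50, lr=0.1):
    """Plain logistic regression fitted with per-example gradient steps."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = score(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


def fake_example(label, policy_vec, rng):
    """Synthetic stand-in for real embeddings: violating items lean positive."""
    mu = 1.0 if label == 1 else -1.0
    context_vec = [rng.gauss(mu, 0.5) for _ in range(DIM)]
    image_vec = [rng.gauss(mu, 0.5) for _ in range(DIM)]
    return concat_features(context_vec, image_vec, policy_vec)


def run_demo(seed=0):
    """Train on 200 synthetic items and return accuracy on 100 held-out ones."""
    rng = random.Random(seed)
    policy_vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    labels = [i % 2 for i in range(300)]
    X = [fake_example(lbl, policy_vec, rng) for lbl in labels]
    w, b = train_logreg(X[:200], labels[:200])
    held_out = zip(X[200:], labels[200:])
    correct = sum((score(w, b, x) >= 0.5) == (yt == 1) for x, yt in held_out)
    return correct / 100


print(f"held-out accuracy: {run_demo():.2f}")
```

Inference here is one dot product per item, which is where the speed advantage over chained LLM calls would come from. Whether it matches LLM quality on real moderation data is an open question this toy example cannot answer.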

I have used LLMs as classifiers, and it does make sense in cases of extremely limited data (though they rarely work well enough), but if you're going to be calling the LLM in such complex ways, it's better to stop thinking of this as a classic ML problem and instead think of it as an agentic content moderator.

In this case you can ignore the train/test split in favor of evals, which you would create as you would for any other LLM agent workflow.
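For a sense of what "evals" could look like here, a minimal sketch: a handful of named cases with expected decisions, run against whatever function wraps the moderator. All names (`EvalCase`, `stub_moderator`, `run_evals`) and the example cases are hypothetical, and the stub is a keyword rule standing in for the real LLM-backed agent so the snippet runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    content: str
    expected: str  # "allow" or "remove"


def stub_moderator(content: str) -> str:
    """Hypothetical placeholder for the LLM-backed moderator (keyword rule only)."""
    return "remove" if "scam" in content.lower() else "allow"


def run_evals(moderate: Callable[[str], str], cases: List[EvalCase]) -> Dict:
    """Run every case and report the pass count plus the names of failures."""
    failures = [c.name for c in cases if moderate(c.content) != c.expected]
    return {
        "passed": len(cases) - len(failures),
        "total": len(cases),
        "failures": failures,
    }


# Illustrative cases only; a real suite would be built from policy examples.
CASES = [
    EvalCase("crypto_scam", "Guaranteed returns, totally not a scam!", "remove"),
    EvalCase("benign_recipe", "My grandmother's bread recipe", "allow"),
    EvalCase("subtle_fraud", "Wire me a deposit to hold the apartment", "remove"),
]

report = run_evals(stub_moderator, CASES)
print(report)
```

The stub misses `subtle_fraud`, which is the point of keeping failures by name: each one is a concrete case to fix or to take back to the policy team, rather than a single aggregate metric.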

stephantul•2mo ago
I don’t really believe this is a paradigm shift with regard to train/test splits.

Before LLMs you would do a lot of these things; it’s just become a lot easier to get started without training. What the author describes is very similar to the standard ML product loop in companies, including how difficult it is to “beat” the incumbent model because it has been overfit on the test set that is used to compare the incumbent to your own model.
