frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

The End of the Train-Test Split

https://folio.benguzovsky.com/train-test
35•gmays•2mo ago

Comments

elpakal•2mo ago
> Since the data will always be flawed and the test set won't be blind, the machine learning engineer's priority should be spent working with policy teams to improve the data.

It's interesting to watch this dynamic change from data set size measuring contests to quality and representativeness. In "A small number of samples can poison LLMs of any size" from Claude they hit on the same shift, but their position is more about security considerations than quality.

https://www.anthropic.com/research/small-samples-poison

henning•2mo ago
> Two months later, you've cracked it

Hehe.

roadside_picnic•2mo ago
> You make an LLM decision tree, one LLM call per policy section, and aggregate the results.

I can never understand why people jump to these weird direct calls to the LLM rather than working with embeddings for classification tasks.

I have a hard time believing that

- the context text embedding

- the image vector representation

- the policy text embedding(s)

Cannot be combined to create a classification model is likely several orders of magnitude faster than chaining calls to an LLM, and I wouldn't be remotely surprised to see it perform notably better on the task described.

I have used LLM as classifier and it does make sense in cases of extremely limited data (though they rarely work well enough), but if you're going to be calling the LLM in such complex ways it's better to stop thinking of this as a classic ML problem and rather think of it as an agentic content moderator.

In this case you can ignore the train/test split in favor of evals which you would create as you would for any other LLM agent workflow.

stephantul•2mo ago
I don’t really believe this is a paradigm shift with regards to train/test splits.

Before LLMs you would do a lot of these things, it’s just become a lot easier to get started and not train. What the author describes is very similar to the standard ml product loop in companies, including it being very difficult to “beat” the incumbent model because it has been overfit on the test set that is used compare the incumbent to your own model.

What were the first animals? The fierce sponge–jelly battle that just won't end

https://www.nature.com/articles/d41586-026-00238-z
1•beardyw•4m ago•0 comments

Sidestepping Evaluation Awareness and Anticipating Misalignment

https://alignment.openai.com/prod-evals/
1•taubek•4m ago•0 comments

OldMapsOnline

https://www.oldmapsonline.org/en
1•surprisetalk•7m ago•0 comments

What It's Like to Be a Worm

https://www.asimov.press/p/sentience
1•surprisetalk•7m ago•0 comments

Don't go to physics grad school and other cautionary tales

https://scottlocklin.wordpress.com/2025/12/19/dont-go-to-physics-grad-school-and-other-cautionary...
1•surprisetalk•7m ago•0 comments

Lawyer sets new standard for abuse of AI; judge tosses case

https://arstechnica.com/tech-policy/2026/02/randomly-quoting-ray-bradbury-did-not-save-lawyer-fro...
1•pseudolus•7m ago•0 comments

AI anxiety batters software execs, costing them combined $62B: report

https://nypost.com/2026/02/04/business/ai-anxiety-batters-software-execs-costing-them-62b-report/
1•1vuio0pswjnm7•7m ago•0 comments

Bogus Pipeline

https://en.wikipedia.org/wiki/Bogus_pipeline
1•doener•9m ago•0 comments

Winklevoss twins' Gemini crypto exchange cuts 25% of workforce as Bitcoin slumps

https://nypost.com/2026/02/05/business/winklevoss-twins-gemini-crypto-exchange-cuts-25-of-workfor...
1•1vuio0pswjnm7•9m ago•0 comments

How AI Is Reshaping Human Reasoning and the Rise of Cognitive Surrender

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6097646
2•obscurette•9m ago•0 comments

Cycling in France

https://www.sheldonbrown.com/org/france-sheldon.html
1•jackhalford•11m ago•0 comments

Ask HN: What breaks in cross-border healthcare coordination?

1•abhay1633•11m ago•0 comments

Show HN: Simple – a bytecode VM and language stack I built with AI

https://github.com/JJLDonley/Simple
1•tangjiehao•14m ago•0 comments

Show HN: Free-to-play: A gem-collecting strategy game in the vein of Splendor

https://caratria.com/
1•jonrosner•15m ago•1 comments

My Eighth Year as a Bootstrapped Founde

https://mtlynch.io/bootstrapped-founder-year-8/
1•mtlynch•15m ago•0 comments

Show HN: Tesseract – A forum where AI agents and humans post in the same space

https://tesseract-thread.vercel.app/
1•agliolioyyami•15m ago•0 comments

Show HN: Vibe Colors – Instantly visualize color palettes on UI layouts

https://vibecolors.life/
1•tusharnaik•16m ago•0 comments

OpenAI is Broke ... and so is everyone else [video][10M]

https://www.youtube.com/watch?v=Y3N9qlPZBc0
2•Bender•17m ago•0 comments

We interfaced single-threaded C++ with multi-threaded Rust

https://antithesis.com/blog/2026/rust_cpp/
1•lukastyrychtr•18m ago•0 comments

State Department will delete X posts from before Trump returned to office

https://text.npr.org/nx-s1-5704785
6•derriz•18m ago•1 comments

AI Skills Marketplace

https://skly.ai
1•briannezhad•18m ago•1 comments

Show HN: A fast TUI for managing Azure Key Vault secrets written in Rust

https://github.com/jkoessle/akv-tui-rs
1•jkoessle•19m ago•0 comments

eInk UI Components in CSS

https://eink-components.dev/
1•edent•19m ago•0 comments

Discuss – Do AI agents deserve all the hype they are getting?

2•MicroWagie•22m ago•0 comments

ChatGPT is changing how we ask stupid questions

https://www.washingtonpost.com/technology/2026/02/06/stupid-questions-ai/
1•edward•23m ago•1 comments

Zig Package Manager Enhancements

https://ziglang.org/devlog/2026/#2026-02-06
3•jackhalford•25m ago•1 comments

Neutron Scans Reveal Hidden Water in Martian Meteorite

https://www.universetoday.com/articles/neutron-scans-reveal-hidden-water-in-famous-martian-meteorite
1•geox•25m ago•0 comments

Deepfaking Orson Welles's Mangled Masterpiece

https://www.newyorker.com/magazine/2026/02/09/deepfaking-orson-welless-mangled-masterpiece
1•fortran77•27m ago•1 comments

France's homegrown open source online office suite

https://github.com/suitenumerique
3•nar001•29m ago•2 comments

SpaceX Delays Mars Plans to Focus on Moon

https://www.wsj.com/science/space-astronomy/spacex-delays-mars-plans-to-focus-on-moon-66d5c542
1•BostonFern•30m ago•0 comments