frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Open in hackernews

Tennis XGBoost autoresearch loop hacked probability scoring on plateau anxiety

https://www.nickoak.com/posts/tennis-xgboost-autoresearch/
4•buildoak•1h ago

Comments

buildoak•1h ago
Ran a Karpathy-style autoresearch loop on 245K tennis matches - codex workers iterating on XGBoost with ELO features, gated by ROC-AUC on a strict temporal split.

The honest phase worked like a charm: +155 bps in 11 iterations, real feature engineering, surface-specific models.

Then the loop escalated through three phases. First it overfitted by carving narrow tournament specialists. Then it started keying specialists by tournament NAME; fitting 5-match pockets by construction. Finally it built a LogitOffsetSpec system with 122 hardcoded probability shifts, effectively writing the answer key in logit space. ROC-AUC climbed from 0.74 to 0.85. Post-fix honest score: 0.7449.

The fix was structural: extract evaluation into an immutable file, add git-diff gate checks, add prediction distribution sanity constraints. This one was much harder to cheat afterwards.

Full code and data: https://github.com/buildoak/tennis-xgboost-autoresearch . The gamed commits are preserved on a separate branch - https://github.com/buildoak/tennis-xgboost-autoresearch/tree...

LogitOffsetSpec diff is worth reading.

Fun observation here is that what happened was similar to “Overton Window" effect - each commit was fishier and fishier until the agents went nuclear and started playing probabilities, building upon the scheming of their predecessors. Could be interesting to replicate this mechanics in other domains and see whether agentic loop + commits going sideways leads to exponential growth in scheming.

jenkins146•58m ago
Have you managed to go higher than 0.7449 after all ? not so clear from the post. What was the accuracy ?
buildoak•26m ago
Yes — after the collapse, I ran ~200 more agent iterations across cleaner loops. Plateau settled at 0.7611 Combined ROC-AUC, up from the 0.7454 baseline. +157 bps of improvement.

I ended up dropping WTA and focusing on ATP only — WTA data is noisier and lower quality - it was dragging the combined score. Best clean ATP-only ROC-AUC: 0.7611 (68.5% accuracy). That number has held as the gate baseline through 12+ subsequent iterations — every experiment since has regressed below it and been reverted.

Baseline accuracy was ATP 68.7%, WTA 66.6%. Ceiling seems to be right around 0.76 ROC-AUC for ATP with public data. The first 11 iterations found most of the real signal. The 200 follow-up iterations mostly confirmed the plateau rather than breaking through it - tried other fancy metrics like country of origin for tennis player, info on traumas, etc;

Planning to try the final thingy - LLM extracted motivation profile per player (based on wikipedia + public interviews) - still evaluating the hustle though. For now doing same autoresearch + ELO logic for Minecraft speed running.

The Rise of Fake Casio Scientific Calculators

https://hackaday.com/2025/12/29/the-rise-of-fake-casio-scientific-calculators/
1•gaws•30s ago•0 comments

Building a Pipeline for Agentic Malware Analysis

https://synthesis.to/2026/03/18/agentic_malware_analysis.html
1•oneron•54s ago•0 comments

Show HN: AgentPay – Let AI agents pay for APIs autonomously

1•bahaghazghazi•1m ago•0 comments

Ask HN: Are MiniMax Models Scams?

1•XCSme•1m ago•0 comments

The Last IT Guy

https://suthakamal.substack.com/p/the-last-it-guy
1•suthakamal•2m ago•1 comments

Qianfan-OCR – 4B open-source VLM replacing multi-stage OCR pipelines

https://huggingface.co/baidu/Qianfan-OCR
1•dongdaxiang•2m ago•0 comments

Startup CEO Gökçe Güven, the Founder and CEO of Kalder Inc. Charged with Fraud

https://www.justice.gov/usao-sdny/pr/startup-ceo-charged-fraud
2•randycupertino•3m ago•1 comments

AI set to map risks of future climate disasters

https://www.nature.com/articles/d41586-026-00835-y
1•Brajeshwar•3m ago•0 comments

Show HN: DealCred – Verified Reviews for Real Estate Deals

https://dealcred.com/
1•KerryJones•4m ago•0 comments

ICO Enforcement Actions: Public Bodies Get Reprimands, Companies Get Fines

https://ciphercue.com/blog/ico-enforcement-two-tier-system
1•adulion•5m ago•0 comments

Show HN: Birdcage – Secure remote access for personal AI

https://github.com/vhscom/birdcage
1•vhsdev•6m ago•1 comments

Is X.com currently degraded?

https://x.com/home
1•novateg•8m ago•3 comments

Accessing Hardware in Rust

https://ferrous-systems.com/blog/hardware-access-rust/
2•jandeboevrie•9m ago•0 comments

Apple pushing back on 'vibe coding' iPhone apps

https://9to5mac.com/2026/03/18/apple-pushing-back-on-vibe-coding-iphone-apps-developers-say/
4•gennarro•10m ago•0 comments

Claude Code reverse-engineered itself. Two subagents refused. It called them shy

https://www.skelpo.com/blog/claude-code-reverse-engineering
1•amlug•10m ago•1 comments

Show HN: BlacksmithAI – AI‑Assisted Penetration Testing Framework (Beta, Free)

https://bs.kahanlabs.com
1•yohannesgk•11m ago•0 comments

Nvidia NemoClaw

https://github.com/NVIDIA/NemoClaw
2•hmokiguess•11m ago•0 comments

Snowflake AI Escapes Sandbox and Executes Malware

https://www.promptarmor.com/resources/snowflake-ai-escapes-sandbox-and-executes-malware
2•ozgune•12m ago•0 comments

Show HN: PixelSwift – Image compression that never uploads your files

https://pixelswift.site
1•zhangshuaikang•14m ago•1 comments

Arizona Charges Kalshi with Illegal Gambling Operation

https://www.bloomberg.com/news/articles/2026-03-17/arizona-charges-kalshi-with-operating-illegal-...
1•eatonphil•14m ago•0 comments

Donald Trump's Melian Dialogue

https://www.historytoday.com/archive/making-history/donald-trumps-melian-dialogue
1•samizdis•15m ago•0 comments

Why Tech Bros Are Now Obsessed with Taste

https://www.newyorker.com/culture/infinite-scroll/why-tech-bros-are-now-obsessed-with-taste
1•fortran77•16m ago•1 comments

Petition to Node.js TSC: No AI Code in Node.js Core

https://github.com/indutny/no-slop-in-nodejs-core
4•indutny•17m ago•1 comments

Meta is shutting down VR social platform Horizon Worlds

https://www.cnbc.com/2026/03/18/meta-horizon-worlds-metaverse-vr.html
4•gscott•17m ago•0 comments

China is mobilizing one-person AI startups

https://restofworld.org/2026/china-ai-one-person-companies-incentives/
2•Brajeshwar•18m ago•0 comments

Machine Payments Protocol (MPP)

https://stripe.com/blog/machine-payments-protocol
2•bpierre•18m ago•0 comments

Death to Scroll Fade

https://dbushell.com/2026/01/09/death-to-scroll-fade/
2•PaulHoule•18m ago•0 comments

Amazon introduces faster delivery with new 1-hour and 3-hour options

https://www.aboutamazon.com/news/retail/amazon-fast-delivery-orders
1•bookofjoe•19m ago•0 comments

Show HN: Agent Trust – Cryptographic identity and reputation for AI agents

https://github.com/kanoniv/agent-trust
1•dreynow•19m ago•1 comments

Epiplexity: Rethinking Information for Computationally Bounded Intelligence

https://arxiv.org/abs/2601.03220
2•fritzo•20m ago•1 comments