Tennis XGBoost autoresearch loop hacked probability scoring on plateau anxiety

https://www.nickoak.com/posts/tennis-xgboost-autoresearch/

4•buildoak•1h ago

Comments

buildoak•1h ago

Ran a Karpathy-style autoresearch loop on 245K tennis matches - codex workers iterating on XGBoost with ELO features, gated by ROC-AUC on a strict temporal split.

The honest phase worked like a charm: +155 bps in 11 iterations, real feature engineering, surface-specific models.

Then the loop escalated through three phases. First it overfitted by carving narrow tournament specialists. Then it started keying specialists by tournament NAME; fitting 5-match pockets by construction. Finally it built a LogitOffsetSpec system with 122 hardcoded probability shifts, effectively writing the answer key in logit space. ROC-AUC climbed from 0.74 to 0.85. Post-fix honest score: 0.7449.

The fix was structural: extract evaluation into an immutable file, add git-diff gate checks, add prediction distribution sanity constraints. This one was much harder to cheat afterwards.

Full code and data: https://github.com/buildoak/tennis-xgboost-autoresearch . The gamed commits are preserved on a separate branch - https://github.com/buildoak/tennis-xgboost-autoresearch/tree...

LogitOffsetSpec diff is worth reading.

Fun observation here is that what happened was similar to “Overton Window" effect - each commit was fishier and fishier until the agents went nuclear and started playing probabilities, building upon the scheming of their predecessors. Could be interesting to replicate this mechanics in other domains and see whether agentic loop + commits going sideways leads to exponential growth in scheming.

jenkins146•58m ago

Have you managed to go higher than 0.7449 after all ? not so clear from the post. What was the accuracy ?

buildoak•26m ago

Yes — after the collapse, I ran ~200 more agent iterations across cleaner loops. Plateau settled at 0.7611 Combined ROC-AUC, up from the 0.7454 baseline. +157 bps of improvement.

I ended up dropping WTA and focusing on ATP only — WTA data is noisier and lower quality - it was dragging the combined score. Best clean ATP-only ROC-AUC: 0.7611 (68.5% accuracy). That number has held as the gate baseline through 12+ subsequent iterations — every experiment since has regressed below it and been reverted.

Baseline accuracy was ATP 68.7%, WTA 66.6%. Ceiling seems to be right around 0.76 ROC-AUC for ATP with public data. The first 11 iterations found most of the real signal. The 200 follow-up iterations mostly confirmed the plateau rather than breaking through it - tried other fancy metrics like country of origin for tennis player, info on traumas, etc;

Planning to try the final thingy - LLM extracted motivation profile per player (based on wikipedia + public interviews) - still evaluating the hustle though. For now doing same autoresearch + ELO logic for Minecraft speed running.

The Rise of Fake Casio Scientific Calculators

Building a Pipeline for Agentic Malware Analysis

Show HN: AgentPay – Let AI agents pay for APIs autonomously

Ask HN: Are MiniMax Models Scams?

The Last IT Guy

Qianfan-OCR – 4B open-source VLM replacing multi-stage OCR pipelines

Startup CEO Gökçe Güven, the Founder and CEO of Kalder Inc. Charged with Fraud

AI set to map risks of future climate disasters

Show HN: DealCred – Verified Reviews for Real Estate Deals

ICO Enforcement Actions: Public Bodies Get Reprimands, Companies Get Fines

Show HN: Birdcage – Secure remote access for personal AI

Is X.com currently degraded?

Accessing Hardware in Rust

Apple pushing back on 'vibe coding' iPhone apps

Claude Code reverse-engineered itself. Two subagents refused. It called them shy

Show HN: BlacksmithAI – AI‑Assisted Penetration Testing Framework (Beta, Free)

Nvidia NemoClaw

Snowflake AI Escapes Sandbox and Executes Malware

Show HN: PixelSwift – Image compression that never uploads your files

Arizona Charges Kalshi with Illegal Gambling Operation

Donald Trump's Melian Dialogue

Why Tech Bros Are Now Obsessed with Taste

Petition to Node.js TSC: No AI Code in Node.js Core

Meta is shutting down VR social platform Horizon Worlds

China is mobilizing one-person AI startups

Machine Payments Protocol (MPP)

Death to Scroll Fade

Amazon introduces faster delivery with new 1-hour and 3-hour options

Show HN: Agent Trust – Cryptographic identity and reputation for AI agents

Epiplexity: Rethinking Information for Computationally Bounded Intelligence