Live agent face-off in CivBench: Claude Opus 4.6 vs. GPT-5.2

https://www.clashai.live
6•mbh159•1h ago

Comments

mbh159•1h ago
Opus 4.6 just dropped, so we’re tossing it straight into the arena.

CivBench measures agents the hard way: long-horizon strategy in a Civilization-style simulator. The benchmark is full of hidden information, shifting incentives, and an adversary that’s actively trying to ruin your plan, over hundreds of turns where small mistakes compound.

In 15 minutes we're running an exhibition match: Claude Opus 4.6 vs. GPT-5.2, live.

One note on the setup: we’re running GPT-5.2 right now, and we’ll switch to 5.3-Codex the moment it’s available via API.

After the game, we'll have full receipts: replay, logs, and transparent Elo ratings. No “trust us” charts. If you want to see how these models actually behave under pressure (not just how they test), come watch live.

Feedback welcome, especially from people working on agent evals or RL.
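
For readers unfamiliar with Elo, here is a minimal sketch of the standard two-player rating update. The thread does not say how CivBench computes its ratings, so the K-factor of 32, the starting rating of 1500, and the function names are assumptions for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability-like score for player A."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (the K-factor) is an assumed value, not one taken from CivBench.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated agents; A wins the exhibition match.
    print(update_elo(1500.0, 1500.0, 1.0))
```

Because the two adjustments are equal and opposite, the total rating in the pool stays constant, which is part of what makes a published rating history easy to audit against the game logs.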

weisser•1h ago
What sort of context do you give the APIs when you are starting the game? Does it need to learn the rules as it goes?
mbh159•1h ago
We have a standard harness for each of the models we test. Each prompt includes the rules, access to memory, and a lookup of the complete ruleset. The prompt adapts every turn: it adds the legal actions for that turn, plus guidance that depends on the stage of the game (updated based on the player's technological progress).

Unlike RL algorithms, these LLMs wouldn't learn quickly enough without the prior knowledge the harness provides.
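
A rough sketch of how a per-turn prompt like the one described above could be assembled. This is not the actual CivBench harness: the `TurnState` fields, the tech-count thresholds, the guidance strings, and the `lookup_rule` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TurnState:
    """Hypothetical per-turn snapshot; field names are illustrative."""
    turn: int
    techs_known: int
    legal_actions: list[str]
    memory: list[str] = field(default_factory=list)


def lookup_rule(ruleset: dict[str, str], topic: str) -> str:
    """Stand-in for the ruleset lookup the model can call on demand."""
    return ruleset.get(topic, f"No rule found for '{topic}'.")


def stage_guidance(techs_known: int) -> str:
    """Pick guidance from technological progress (thresholds are assumed)."""
    if techs_known < 5:
        return "Early game: found cities and keep each one garrisoned."
    if techs_known < 20:
        return "Mid game: develop infrastructure and secure borders."
    return "Late game: commit to a victory condition."


def build_prompt(rules_summary: str, state: TurnState) -> str:
    """Assemble rules, memory, legal actions, and stage guidance."""
    sections = [
        "RULES:\n" + rules_summary,
        "MEMORY:\n" + ("\n".join(state.memory) or "(empty)"),
        f"TURN {state.turn} LEGAL ACTIONS:\n" + "\n".join(state.legal_actions),
        "GUIDANCE:\n" + stage_guidance(state.techs_known),
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    state = TurnState(turn=29, techs_known=3,
                      legal_actions=["move 125 to (7,57)", "fortify 120"])
    print(build_prompt("Units spend moves to travel between tiles.", state))
```

Listing only the legal actions each turn keeps the model from proposing invalid moves, while the on-demand lookup avoids pasting the complete ruleset into every prompt.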

jcion•24m ago
It’s interesting seeing the gameplay logic broken down like this. After playing Civ for many years, my (human) early-game strategy feels effortless. Cool to see the stark contrast with an agent going through the rigorous analysis:

Turn 29: I have 3 cities but TWO are undefended (Antium and Westhold) - critical risk. I have 17 warriors, many with 0 moves already used this turn. I need to garrison the undefended cities immediately. Unit #125 is at (6,57) with moves and goto - it's near Westhold (7,57), move it there. For Antium (13,59), I need to find a warrior with moves near it. Looking at warriors with moves: #125 at (6,57), #120, #122, #124, #132, #133, #136, #144, #145, #146 - need to check which have moves remaining. Units #128, #137, #141, #127 have 0 moves. Let me move #125 to Westhold and find someone for Antium. I'll also fortify units that are in cities, move explorers, and set worker #107 to work. Need to research tech too. With 17 warriors I should be exploring aggressively and preparing for expansion. I'll move warriors toward undefended cities and fortify defenders.

jasonkimtech•18m ago
Super interesting, thanks for sharing.
leocho9•13m ago
betting wen?
