Show HN: A business SIM where humans beat GPT-5 by 9.8x

14•sumit_psp•1h ago
Hi HN,

Can current AI systems actually run a business?

There’s a growing belief that LLM agents can already manage entire teams, replace the entire software stack, or even act as an AI CEO.

So we built a controlled, measurable environment to evaluate this premise.

Why did we build this benchmark?

A modern enterprise operates in a dynamic environment with high uncertainty and incomplete information. The CEO has to deal with delayed consequences, staffing and resource tradeoffs, and a death by a thousand cuts from many small failure modes.

If we ever want AI systems that can meaningfully make operational or strategic decisions, say an AI CEO, then they must be able to handle these dynamics.

So we built a benchmark that tests exactly that.

What did we build?

Mini Amusement Parks (MAPs) is a RollerCoaster Tycoon style business simulator with:
- Stochastic events
- Incomplete information
- Staffing, restocking, maintenance
- Long horizon planning
- Compounding operational failures
- Resource constraints
- Spatial layout affecting outcomes
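
As a toy illustration of two of the ingredients above (stochastic events and compounding operational failures), here is a minimal Python sketch. It is not the MAPs implementation: the `Ride` class, the `simulate` function, and every number in it are hypothetical, chosen only to show how skipped maintenance can snowball.

```python
import random
from dataclasses import dataclass


@dataclass
class Ride:
    # Hypothetical state for a single ride in a one-ride park.
    condition: float = 1.0          # 1.0 = freshly maintained, 0.0 = unusable
    revenue_per_day: float = 100.0  # assumed revenue at perfect condition


def simulate(days, maintain_every, seed=0, maintenance_cost=40.0):
    """Run the toy park and return total profit.

    maintain_every: service the ride every N days (0 means never).
    """
    rng = random.Random(seed)
    ride = Ride()
    profit = 0.0
    for day in range(1, days + 1):
        # Stochastic wear: the ride degrades by a random amount each day.
        ride.condition = max(0.0, ride.condition - rng.uniform(0.02, 0.08))
        # Compounding failure: the more worn the ride, the likelier a
        # breakdown, and a broken-down ride earns nothing that day.
        broke_down = rng.random() < (1.0 - ride.condition) ** 2
        if not broke_down:
            profit += ride.revenue_per_day * ride.condition
        # Maintenance costs money up front but restores the ride.
        if maintain_every and day % maintain_every == 0:
            ride.condition = 1.0
            profit -= maintenance_cost
    return profit
```

Under these assumed parameters, comparing `simulate(200, maintain_every=5)` with `simulate(200, maintain_every=0)` shows the pattern: the unmaintained park earns for a few weeks and then stalls, while the maintained one keeps paying a small cost to stay profitable.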

You can play it & make it to the leaderboard here: https://maps.skyfall.ai/play (it’s fun)

It looks like a simple game. But underneath, it’s a benchmark designed to answer one question:

Can an agent operate a business coherently over time?

What we tested

We evaluated:
- Humans (internal and external testers)
- Multiple GPT-5 agents
- Variants with additional tools, documents, practice mode, planning scaffolds, etc.

We intentionally stacked the deck in favour of the models: full documentation, step-by-step action interfaces, a sandbox exploration mode, extra observations, multiple prompting strategies, etc.

What happened?

Humans destroyed the agents by FAR. Even the strongest model, with documentation, tool use, and sandbox “practice”, reached <10% of human performance. The failure modes were consistent:
- chasing flashy upgrades instead of profitable ones
- ignoring maintenance, staffing, restocking
- overreacting to noise
- zero long-term plan
- sandbox training often made things worse

It became clear: LLMs can use tools, but they cannot run systems. They break when randomness, time, and spatial constraints matter.

Why does this matter?

There’s a growing narrative that:
- LLMs will run entire companies
- LLMs will take over the jobs of CEOs
- LLMs can be autonomous agents
- LLMs can manage workflows end-to-end

MAPs shows the complete opposite.

Operating a business requires: foresight, risk modeling, temporal reasoning, causal understanding, prioritization under uncertainty, adaptive planning. These are the basics of what a functional and real AI CEO would need and this is exactly where the current models break.

If an LLM can’t run a toy business, how can you trust it with a real business?

This benchmark is our first step toward understanding what an AI system would actually need in order to exhibit enterprise-level decision making and the basics of an AI CEO. An AI CEO is not a chatbot, not chain of thought, and definitely not an agent wrapper; it has to be a true demonstration of operational intelligence.

We’re sharing this because:
- we want the community to try to beat the models
- we want criticism of the benchmark
- most importantly, we want an honest discussion about what “AI CEO” is and should do (surely it’s not LLMs)

If you want to try beating the agents (it’s fun!): https://maps.skyfall.ai/play

If you want to read more about it, you can do so here: https://skyfall.ai/blog/building-the-foundations-of-an-ai-ce...

Check out the launch video here: https://www.youtube.com/watch?v=7oqVAWw5Ii8

Happy to answer questions in the thread.

Comments

devincintron•1h ago
Saving this for the next time I get over-caffeinated and try to convince my friends that economically viable AI will make their CPG business irrelevant
devincintron•1h ago
Have you talked to Alex Duffy from Good Start Labs? Recommend reaching out
kevc•49m ago
It feels like we are pretty far away from LLMs running a concession stand (see Andon Labs), so I'm not surprised they would struggle here. Still, the failure modes are super interesting, and having benchmarks seems to be the starting point for domain-specific improvements.
WellingtonWells•4m ago
I'm kinda curious how a VLM would do -- better spatial reasoning but worse planning? I don't use an AI web browser, but I'd be curious to know what happens if you throw something like OpenAI Atlas at the game's webpage.
