
Show HN: A business SIM where humans beat GPT-5 by 9.8 X

23•sumit_psp•2mo ago
Hi HN,

Can current AI systems actually run a business?

There’s a growing belief that LLM agents can already manage entire teams, replace the entire software stack or even act as an AI CEO.

So we built a controlled, measurable environment to evaluate this premise.

Why did we build this benchmark?

A modern enterprise operates in a dynamic environment with high uncertainty and incomplete information. The CEO has to deal with delayed consequences, staffing/resource tradeoffs and death by a thousand cuts of failure modes.

If we ever want AI systems that can meaningfully make operational or strategic decisions, say an AI CEO, then they must be able to handle these dynamics.

So we made one.

What did we build?

Mini Amusement Parks (MAPs) is a RollerCoaster Tycoon style business simulator with:
- Stochastic events
- Incomplete information
- Staffing, restocking, maintenance
- Long horizon planning
- Compounding operational failures
- Resource constraints
- Spatial layout affecting outcomes
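To make "stochastic events" and "compounding operational failures" concrete, here is a toy sketch. It is not the MAPs code. The single ride, the costs, the wear rate and the `run_park` name are all made up for illustration. It only shows how skipping maintenance lets breakdown risk build up day after day while demand stays random.

```python
import random


def run_park(days, maintain_every, seed=0):
    """Toy one-ride park (hypothetical numbers, not the MAPs rules).

    maintain_every=0 means the operator never does maintenance.
    Returns the cash balance after `days` days.
    """
    rng = random.Random(seed)
    cash = 0.0
    wear = 0.0  # accumulated wear, used directly as breakdown probability
    for day in range(1, days + 1):
        # Scheduled maintenance costs a little and resets wear.
        if maintain_every and day % maintain_every == 0:
            cash -= 50.0
            wear = 0.0
        # Stochastic event: the ride breaks down with probability `wear`.
        if rng.random() < min(wear, 1.0):
            cash -= 200.0  # repair bill, and no revenue that day
            wear = 0.0
        else:
            guests = rng.randint(20, 40)  # noisy daily demand
            cash += guests * 5.0
            wear += 0.02  # neglect compounds over time
    return cash


if __name__ == "__main__":
    for policy in (0, 7, 1):
        print(policy, run_park(days=90, maintain_every=policy, seed=42))
```

Even in this tiny version, the best maintenance interval depends on numbers the operator would have to estimate from noisy outcomes. That is the kind of tradeoff the benchmark stresses.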

You can play it & make it to the leaderboard here: https://maps.skyfall.ai/play (it’s fun)

It looks like a simple game. But underneath, it’s a benchmark designed to answer one question:

Can an agent operate a business coherently over time?

What we tested

We evaluated:
- Humans (internal and external testers)
- Multiple GPT-5 agents
- Variants with additional tools, documents, practice mode, planning scaffolds, etc.

We intentionally stacked the deck in favour of the models: full documentation, step-by-step action interfaces, sandbox exploration mode, extra observations, multiple prompting strategies, etc.

What happened?

Humans destroyed the agents by FAR. Even the strongest model, with documentation, tool use, and sandbox “practice”, reached <10% of human performance.

The failure modes were consistent:
- chasing flashy upgrades instead of profitable ones
- ignoring maintenance, staffing, restocking
- overreacting to noise
- zero long-term plan
- sandbox training often made things worse

It became clear: LLMs can use tools, but they cannot run systems. They break when randomness, time, and spatial constraints matter.

Why does this matter?

There’s a growing narrative that:
- LLMs will run entire companies
- LLMs will take over the jobs of CEOs
- LLMs can be autonomous agents
- LLMs can manage workflows end-to-end

MAPs shows the complete opposite.

Operating a business requires: foresight, risk modeling, temporal reasoning, causal understanding, prioritization under uncertainty, adaptive planning. These are the basics of what a functional and real AI CEO would need and this is exactly where the current models break.

If an LLM can’t run a toy business, how can you trust it with a real business?

This benchmark is our first step toward understanding what an AI system would actually need in order to exhibit enterprise level decision making and the basics of the AI CEO. AI CEO is not a chatbot, not chain of thought, definitely not an agent wrapper but a true demonstration of operational intelligence.

We’re sharing this because:
- we want the community to try to beat the models
- we want criticism of the benchmark
- most importantly, we want an honest discussion about what “AI CEO” is and should do (surely it’s not LLMs)

If you want to try beating the agents (it’s fun!): https://maps.skyfall.ai/play

If you want to read more about it, you can do so here: https://skyfall.ai/blog/building-the-foundations-of-an-ai-ce...

Check out the launch video here: https://www.youtube.com/watch?v=7oqVAWw5Ii8

Happy to answer questions in the thread.

Comments

devincintron•2mo ago
Saving this for next time I get over caffeinated and try to convince my friends that economically viable AI will make their CPG business irrelevant
devincintron•2mo ago
Have you talked to Alex Duffy from Good Start Labs? Recommend reaching out
kevc•2mo ago
It feels like we are pretty far away from LLMs running a concession stand (see andon labs) so not surprised it would struggle here. Still the failure modes are super interesting and having benchmarks seems to be the starting point to domain-specific improvements.
WellingtonWells•2mo ago
I'm kinda curious how a VLM would do -- better spatial reasoning but worse planning? I don't use an AI web browser, but I'd be curious to know what happens if you throw something like OpenAI Atlas at the game's webpage.
chaosadm•2mo ago
So there are a couple of papers that try to use LLMs for UI-based enterprise task benchmarking, like WorkArena++ (ServiceNow), where the agent has to solve a couple of relatively simple enterprise tasks (like creating incident tickets based on criteria that the agent has to determine, etc). This benchmark in particular had quite low accuracy numbers, especially on the more composite tasks. Curious about the OpenAI Atlas thing too.
devincintron•2mo ago
What business has the smallest context window to operate?

Like maybe if you can have constraints in place such that the space of variables is minimal we already have economically relevant AI

Like a drop shipping t-shirt thing - surely the right sequence of LMs can

(1) parse out vibes/trends (e.g., "67 is currently a meme")
(2) tool call that out to a print shop
(3) spam it on twitter

Seems like there's just so much white space on benchmarks and gyms for this

sumit_psp•2mo ago
Even in the minimal example there are way more variables than it first seems.

1. How many shirts do we order?
2. When is it worth moving on to the next trend?
3. How should we handle shipping? Do we market globally or locally?

Even the smallest businesses require a lot of balancing of priorities and planning for the long run with uncertain returns.

devincintron•2mo ago
True

What's like the most minimally scoped business someone could operate entirely digitally though? Is it the drop ship crap? Or maybe like a web game w/ ad revenue?

WellingtonWells•2mo ago
Webgame with ad rev might work -- I was thinking some kind of churned out self-publishing of children's books? Though I'm not sure if you'd actually turn a profit. Whatever it was, it'd definitely have to be heavily engineered though -- custom tools, and basically a glorified flow chart.
devincintron•2mo ago
Okay so yeah this interests me because it seems so plausible that the right set of LMs w/ custom tools surely could turn a profit on self-publishing of children's e-books.

Like how could that NOT be the case?

I feel like for any sufficiently variable-sparse business w/ minimal tool call needs, it'd have to be the case that you can turn on a model and get something going.

t-shirt drop ship, children e-books, webgame ad-rev, something of this sort

sohilbhatia•2mo ago
Insanely cool
marvy101•2mo ago
cool
chaosadm•2mo ago
The game environment looks pretty neat. Not surprised to see LLMs struggling, but with a benchmark to focus new techniques on, I am excited to see how some of the new solutions trying to top the leaderboard will do.
ludograngerFRA•2mo ago
Awesome!