Show HN: I built a game where domain experts try to break frontier AI

https://www.rusmarterthananllm.com/

2•camillemolas•1h ago

Comments

camillemolas•1h ago

I built this: rusmarterthananllm.com

Domain experts (doctors, lawyers, engineers) submit questions from their field that probe where frontier AI actually fails. Claude, GPT, and Gemini all attempt each question simultaneously. Experts flag errors with professional reasoning. Other credentialed professionals in the same domain verify them.
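
To make that flow concrete, here is a minimal sketch of how one question could move from submission to a verified failure. It is purely illustrative and says nothing about how the site is actually built: every class, field, and method name is hypothetical, and the rules that a reviewer must share the domain, cannot verify their own flag, and that one verification is enough are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: none of these names come from the site itself.
MODELS = ("Claude", "GPT", "Gemini")


@dataclass
class Flag:
    model: str       # which model's answer is being flagged
    expert: str      # who flagged it
    reasoning: str   # the professional reasoning behind the flag
    verified_by: set = field(default_factory=set)


@dataclass
class Submission:
    domain: str
    author: str
    question: str
    answers: dict = field(default_factory=dict)  # model name -> answer text
    flags: list = field(default_factory=list)

    def record_answer(self, model: str, answer: str) -> None:
        if model not in MODELS:
            raise ValueError(f"unknown model: {model}")
        self.answers[model] = answer

    def flag_error(self, model: str, expert: str, reasoning: str) -> Flag:
        if model not in self.answers:
            raise ValueError(f"no answer recorded for {model}")
        flag = Flag(model, expert, reasoning)
        self.flags.append(flag)
        return flag

    def verify(self, flag: Flag, reviewer: str, reviewer_domain: str) -> None:
        # Assumption: reviewers must share the domain and cannot verify their own flag.
        if reviewer_domain != self.domain:
            raise ValueError("reviewer is not in the same domain")
        if reviewer == flag.expert:
            raise ValueError("experts cannot verify their own flags")
        flag.verified_by.add(reviewer)

    def verified_failures(self, min_verifiers: int = 1) -> list:
        # Assumption: one independent verification is enough; the real threshold is unknown.
        return [f for f in self.flags if len(f.verified_by) >= min_verifiers]


# Usage: one finance question, one flagged answer, one same-domain verification.
s = Submission("finance", "alice", "How should this LP transfer clause be read?")
for m in MODELS:
    s.record_answer(m, "placeholder answer")
f = s.flag_error("GPT", "alice", "Misreads the consent requirement.")
s.verify(f, "bob", "finance")
print(len(s.verified_failures()))
```

Keeping the verifier set on the flag, rather than a single boolean, leaves room for requiring more than one independent reviewer later.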

AI benchmark performance has decoupled from real-world professional capability. Models score at or near ceiling on standard evaluations while still failing in ways that domain professionals catch immediately. The benchmarks that exist are either saturated, constructed by the labs themselves, or simply don't capture the judgment that comes from years of field experience.

What's missing is a benchmark built by the people whose expertise is actually at stake. Professionals motivated to find failures, not validate models. Every verified failure becomes a permanent data point. The benchmark compounds continuously and can't be reverse-engineered because the questions come from human judgment, not datasets.

This extends to multimodal inputs. A radiologist can submit an X-ray. A cardiologist can upload a heart sound. A structural engineer can attach a blueprint. The same adversarial evaluation applies across text, image, audio, and documents, in the domains where multimodal model failures matter most.

The downstream goal is a verified record of where frontier AI breaks across professional domains. Useful for labs evaluating models, researchers studying capability gaps, and professionals who need to know where to trust AI and where not to.

Early domains: medicine, law, finance, engineering, coding, trades

Would love domain experts to throw their hardest questions at it. What breaks in your field?

diegovergara47•1h ago

This is interesting. I work in private equity secondaries, and I wonder if I can beat the LLM. How is the data I generate helpful, and is the plan to eventually pay users like me?

camillemolas•1h ago

Yes, private equity secondaries is a great domain for this. The valuation edge cases and LP agreement interpretation are exactly where frontier models fail confidently. The data becomes part of a verified record of AI capability gaps and is valuable to labs and enterprises building finance AI.

Payment is coming. Right now we’re building the expert network. Verified failures will be compensated monetarily. Would love to have you as an early finance expert. Throw your hardest question at it.

caillahmolas•1h ago

Nice

camillemolas•1h ago

Thanks! Hoping you join as well.

jasonkim-io•47m ago

Interesting stuff! Will check it out

camillemolas•46m ago

Thanks! Hopefully you get to beat it and earn not just a $$ payout but also bragging rights!

vrajshroff•38m ago

Oh wow! Super interesting. Let me try to ask about antioxidants and oxidative stress. I feel like it’s niche enough that it might just work haha

camillemolas•37m ago

If it fails let me know!! That’s exactly what we are looking for.

camillemolas•35m ago

We’re also very much interested in multimodal. Do you take pictures, recordings, videos, or anything along those lines in your domain? We want to find out whether models fail on those as well!

