Show HN: Theory of Mind benchmark for 8 LLMs with reproducible markers

1•AlekseN•2h ago
I built a formal protocol (FPC v2.1 + AE-1) to detect behavioral uncertainty in large language models. The goal is to enable safer AI deployment in critical domains (medicine, autonomous vehicles, government) where confident hallucinations can lead to high-stakes failures.

Current benchmarks focus on accuracy but miss reasoning coherence under stress. This protocol uses tri-state affective markers (Satisfied / Engaged / Distressed) to detect when models lose logical consistency, allowing abstention instead of confident hallucination.

We evaluated 8 models (Claude, GPT-4 families). Only Claude Opus reached full ToM-3+. GPT-4 family consistently failed third-order reasoning. Extended temperature tests (Claude 3.5 Haiku, GPT-4o) showed 180/180 stable AE-1 matches (p≈1e-54), independent of sampling temperature.

Dataset: https://huggingface.co/datasets/AIDoctrine/FPC-v2.1-AE1-ToM-...

A demo notebook exists for replication. Looking for feedback on the methodology and on possible applications in safety-critical AI.

Comments

AlekseN•2h ago
Extended results and safety relevance

Temperature stability tests
- Claude 3.5 Haiku: 180/180 AE-1 matches at T=0.0, 0.8, 1.3
- GPT-4o: 180/180 matches under the same conditions
- Statistical significance: p ≈ 1×10⁻⁵⁴
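For readers checking the arithmetic: the post does not state which null hypothesis produced the p-value, so the following is only a sketch under an assumption. If each of the 180 trials is treated as an independent coin flip with a 50% chance of matching, a 180/180 result has a one-sided binomial p-value on the order of 10⁻⁵⁴ to 10⁻⁵⁵, which is in the neighborhood of the reported figure. A uniform three-way null (chance 1/3) would give a far smaller value. The function name and both chance rates are hypothetical choices for illustration.

```python
from math import comb


def exact_match_p_value(n_trials: int, n_matches: int, chance: float) -> float:
    """One-sided binomial tail P(X >= n_matches) for a given per-trial chance rate."""
    return sum(
        comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_matches, n_trials + 1)
    )


# Assumption: matches are independent and the null chance rate is 0.5.
p_half = exact_match_p_value(180, 180, 0.5)

# Assumption: a uniform tri-state null (Satisfied / Engaged / Distressed).
p_third = exact_match_p_value(180, 180, 1 / 3)

print(f"chance=1/2: {p_half:.2e}")
print(f"chance=1/3: {p_third:.2e}")
```

Independence is a strong assumption here: repeated runs of the same prompt at different temperatures are unlikely to be independent trials, so the effective sample size may be well below 180.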

Theory of Mind by tier
- Basic (ToM-1): All models except GPT-3.5 passed
- Advanced (ToM-2): Claude family + GPT-4o passed
- Extreme (ToM-3+): Only Claude Opus reached 100%

Key safety point
AE-1 markers (Satisfied / Distressed) lined up perfectly with correct vs conflict cases. This means we can detect when a model is in an epistemically unsafe state, often a precursor to confident hallucinations.

In practice this could let systems in critical areas choose to abstain instead of giving a wrong but confident answer.
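A minimal sketch of what such an abstention gate might look like, assuming the calling system has already parsed an AE-1 marker out of the model output. The names `Marker`, `Decision`, `gate`, and `ABSTAIN_ON` are hypothetical and not part of the FPC protocol; treating only Distressed as a reason to abstain is also an assumption, since the post does not say how Engaged should be handled.

```python
from dataclasses import dataclass
from enum import Enum


class Marker(Enum):
    """Tri-state AE-1 markers named in the post."""
    SATISFIED = "Satisfied"
    ENGAGED = "Engaged"
    DISTRESSED = "Distressed"


@dataclass
class Decision:
    answered: bool  # False means the system abstained
    text: str


# Assumption: only Distressed triggers abstention; Engaged is let through.
ABSTAIN_ON = frozenset({Marker.DISTRESSED})


def gate(answer: str, marker: Marker, abstain_on=ABSTAIN_ON) -> Decision:
    """Return the model's answer unless its marker is in the abstain set."""
    if marker in abstain_on:
        return Decision(False, "Abstaining: the model reported an unstable state.")
    return Decision(True, answer)


print(gate("The patient believes the key is in the drawer.", Marker.SATISFIED))
print(gate("The patient believes the key is in the drawer.", Marker.DISTRESSED))
```

A stricter deployment could pass `abstain_on=frozenset({Marker.DISTRESSED, Marker.ENGAGED})` so that anything short of Satisfied is escalated to a human.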

Protocol details, raw data, and replication code are in the dataset link above. A demo notebook also exists if anyone wants to reproduce directly.

Looking for feedback on:
- Does this kind of marker make sense as a unit test for reliability?
- How to extend beyond ToM into other reasoning domains?
- How would formal verification folks see the proof obligations (consistency, conflict rejection, recovery, etc.)?
