frontpage.
newsnewestaskshowjobs

Made with ♥ by @iamnishanth

Open Source @Github

fp.

Seedance2 – multi-shot AI video generation

https://www.genstory.app/story-template/seedance2-ai-story-generator
1•RyanMu•3m ago•1 comments

Πfs – The Data-Free Filesystem

https://github.com/philipl/pifs
1•ravenical•7m ago•0 comments

Go-busybox: A sandboxable port of busybox for AI agents

https://github.com/rcarmo/go-busybox
1•rcarmo•7m ago•0 comments

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery [pdf]

https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf
1•gmays•8m ago•0 comments

xAI Merger Poses Bigger Threat to OpenAI, Anthropic

https://www.bloomberg.com/news/newsletters/2026-02-03/musk-s-xai-merger-poses-bigger-threat-to-op...
1•andsoitis•8m ago•0 comments

Atlas Airborne (Boston Dynamics and RAI Institute) [video]

https://www.youtube.com/watch?v=UNorxwlZlFk
1•lysace•9m ago•0 comments

Zen Tools

http://postmake.io/zen-list
1•Malfunction92•12m ago•0 comments

Is the Detachment in the Room? – Agents, Cruelty, and Empathy

https://hailey.at/posts/3mear2n7v3k2r
1•carnevalem•12m ago•0 comments

The purpose of Continuous Integration is to fail

https://blog.nix-ci.com/post/2026-02-05_the-purpose-of-ci-is-to-fail
1•zdw•14m ago•0 comments

Apfelstrudel: Live coding music environment with AI agent chat

https://github.com/rcarmo/apfelstrudel
1•rcarmo•15m ago•0 comments

What Is Stoicism?

https://stoacentral.com/guides/what-is-stoicism
3•0xmattf•16m ago•0 comments

What happens when a neighborhood is built around a farm

https://grist.org/cities/what-happens-when-a-neighborhood-is-built-around-a-farm/
1•Brajeshwar•16m ago•0 comments

Every major galaxy is speeding away from the Milky Way, except one

https://www.livescience.com/space/cosmology/every-major-galaxy-is-speeding-away-from-the-milky-wa...
2•Brajeshwar•16m ago•0 comments

Extreme Inequality Presages the Revolt Against It

https://www.noemamag.com/extreme-inequality-presages-the-revolt-against-it/
2•Brajeshwar•16m ago•0 comments

There's no such thing as "tech" (Ten years later)

1•dtjb•17m ago•0 comments

What Really Killed Flash Player: A Six-Year Campaign of Deliberate Platform Work

https://medium.com/@aglaforge/what-really-killed-flash-player-a-six-year-campaign-of-deliberate-p...
1•jbegley•18m ago•0 comments

Ask HN: Anyone orchestrating multiple AI coding agents in parallel?

1•buildingwdavid•19m ago•0 comments

Show HN: Knowledge-Bank

https://github.com/gabrywu-public/knowledge-bank
1•gabrywu•24m ago•0 comments

Show HN: The Codeverse Hub Linux

https://github.com/TheCodeVerseHub/CodeVerseLinuxDistro
3•sinisterMage•26m ago•2 comments

Take a trip to Japan's Dododo Land, the most irritating place on Earth

https://soranews24.com/2026/02/07/take-a-trip-to-japans-dododo-land-the-most-irritating-place-on-...
2•zdw•26m ago•0 comments

British drivers over 70 to face eye tests every three years

https://www.bbc.com/news/articles/c205nxy0p31o
35•bookofjoe•26m ago•12 comments

BookTalk: A Reading Companion That Captures Your Voice

https://github.com/bramses/BookTalk
1•_bramses•27m ago•0 comments

Is AI "good" yet? – tracking HN's sentiment on AI coding

https://www.is-ai-good-yet.com/#home
3•ilyaizen•28m ago•1 comments

Show HN: Amdb – Tree-sitter based memory for AI agents (Rust)

https://github.com/BETAER-08/amdb
1•try_betaer•28m ago•0 comments

OpenClaw Partners with VirusTotal for Skill Security

https://openclaw.ai/blog/virustotal-partnership
2•anhxuan•29m ago•0 comments

Show HN: Seedance 2.0 Release

https://seedancy2.com/
2•funnycoding•29m ago•0 comments

Leisure Suit Larry's Al Lowe on model trains, funny deaths and Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
1•thelok•29m ago•0 comments

Towards Self-Driving Codebases

https://cursor.com/blog/self-driving-codebases
1•edwinarbus•29m ago•0 comments

VCF West: Whirlwind Software Restoration – Guy Fedorkow [video]

https://www.youtube.com/watch?v=YLoXodz1N9A
1•stmw•30m ago•1 comments

Show HN: COGext – A minimalist, open-source system monitor for Chrome (<550KB)

https://github.com/tchoa91/cog-ext
1•tchoa91•31m ago•1 comments
Open in hackernews

Show HN: Theory of Mind benchmark for 8 LLMs with reproducible markers

1•AlekseN•4mo ago
I built a formal protocol (FPC v2.1 + AE-1) to detect behavioral uncertainty in large language models. The goal is enabling safer AI deployment in critical domains medicine, autonomous vehicles, government where confident hallucinations can lead to high-stakes failures.

Current benchmarks focus on accuracy but miss reasoning coherence under stress. This protocol uses tri-state affective markers (Satisfied / Engaged / Distressed) to detect when models lose logical consistency, allowing abstention instead of confident hallucination.

We evaluated 8 models (Claude, GPT-4 families). Only Claude Opus reached full ToM-3+. GPT-4 family consistently failed third-order reasoning. Extended temperature tests (Claude 3.5 Haiku, GPT-4o) showed 180/180 stable AE-1 matches (p≈1e-54), independent of sampling temperature.

Dataset: https://huggingface.co/datasets/AIDoctrine/FPC-v2.1-AE1-ToM-...

A demo notebook exists for replication. Looking for feedback on methodology and possible applications in safety critical AI.

Comments

AlekseN•4mo ago
Extended results and safety relevance

Temperature stability tests Claude 3.5 Haiku: 180/180 AE-1 matches at T=0.0, 0.8, 1.3 GPT-4o: 180/180 matches under the same conditions Statistical significance: p ≈ 1×10⁻⁵⁴

Theory of Mind by tier Basic (ToM-1): All models except GPT-3.5 passed Advanced (ToM-2): Claude family + GPT-4o passed Extreme (ToM-3+): Only Claude Opus reached 100%

Key safety point AE-1 markers (Satisfied / Distressed) lined up perfectly with correct vs conflict cases. This means we can detect when a model is in an epistemically unsafe state, often a precursor to confident hallucinations.

In practice this could let systems in critical areas choose to abstain instead of giving a wrong but confident answer.

Protocol details, raw data, and replication code are in the dataset link above. A demo notebook also exists if anyone wants to reproduce directly.

Looking for feedback on: - Does this kind of marker make sense as a unit test for reliability? - How to extend beyond ToM into other reasoning domains? - How would formal verification folks see the proof obligations (consistency, conflict rejection, recovery, etc.)?