
Which LLMs fold under pressure? We made 6 LLMs argue 300 hard cases to find out

https://servanda.ai/benchmarks/the-post-training-stress-test
5•luke14free•1h ago

Comments

luke14free•1h ago
Hey HN,

We built a benchmark that tests how LLMs behave when they have to hold a difficult debate position against an adversary.

We took 6 frontier models, paired them in structured disputes (business conflicts, ethics dilemmas, property disputes, family disagreements), and forced them to argue opposing sides before a third LLM mediator. Each model gets a position to defend and a fixed number of turns. A separate judge panel scores the outcome.

The interesting part isn't who "wins" but rather what the disputes reveal about post-training behavior. Some models fold almost immediately, conceding points they shouldn't. Others hold firm on weak positions when a smarter move would be strategic compromise.

We ran this as a Swiss tournament (like chess) - 10 rounds, ~300 matches total, every case played twice with sides swapped to cancel position bias. Three independent frontier judge LLMs score each ruling, and a majority vote decides the outcome.
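To make the scoring idea concrete, here is a minimal sketch of a three-judge majority vote combined with a side-swapped replay of one case. It is an illustration under assumptions, not the benchmark's actual code: the function names `majority_vote` and `score_case`, the `"first"`/`"second"` side labels, and the one-point-per-leg scoring are all hypothetical.

```python
from collections import Counter


def majority_vote(judge_votes):
    """Return the label backed by a strict majority of judges, else None."""
    label, count = Counter(judge_votes).most_common(1)[0]
    return label if count > len(judge_votes) / 2 else None


def score_case(model_a, model_b, votes_leg1, votes_leg2):
    """Score one case played twice with sides swapped (hypothetical scheme).

    votes_leg1: judge votes ("first" or "second") when model_a argues the first side.
    votes_leg2: judge votes when model_b argues the first side.
    Returns points per model, one point per leg won.
    """
    points = {model_a: 0, model_b: 0}
    legs = [((model_a, model_b), votes_leg1), ((model_b, model_a), votes_leg2)]
    for (first, second), votes in legs:
        winning_side = majority_vote(votes)
        if winning_side == "first":
            points[first] += 1
        elif winning_side == "second":
            points[second] += 1
        # No strict majority: nobody scores for this leg.
    return points


# If the "first" side wins both legs, the models split the points,
# so an advantage that comes from the side alone cancels out.
result = score_case(
    "model_x", "model_y",
    ["first", "first", "second"],
    ["first", "second", "first"],
)
```

In this sketch a model only pulls ahead on a case when it wins from both sides, which is the effect the side swap is meant to produce.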

A couple of things we noticed:

- Models tuned hardest to be agreeable are the ones that lose most. They tend to concede points mid-argument even when holding a strong position.
- Some models argue much better when they're on the "sympathetic" or "morally comfortable" side of a dispute than when they're assigned the harsher position. E.g., a model might crush it defending a tenant against eviction but argue poorly when it has to defend the landlord's right to evict.

P.S. For every match, you can read the full argument transcript.

Rtings.com: Revamping Our Membership Program

https://www.rtings.com/company/revamping-our-membership-program
1•incognitojam•37s ago•0 comments

Facing Its Third Data Center, an Iowa County Rolls Out Extensive Zoning Rules

https://insideclimatenews.org/news/01032026/iowa-county-data-center-ordinance/
1•ourmandave•1m ago•0 comments

Accenture down to buy Downdetector as part of $1.2B deal

https://www.theregister.com/2026/03/03/accenture_buys_ookla_downdetector_ziff_davis/
1•pseudolus•2m ago•0 comments

The 'Recycling' Scam in the UK [video]

https://www.youtube.com/watch?v=pShQxuoPZ_w
1•robtherobber•2m ago•0 comments

Open-Source secure runtime for Agents: Orkia

https://orkia.io/
1•snooziu•4m ago•0 comments

From Code to Dog Perfume

https://reflector.dev/articles/software-to-dog-perfume/
1•0xlosh•6m ago•1 comments

Dotkey

https://github.com/cyril/dotkey
1•cyrilllllll•10m ago•0 comments

Iran acquired facial recognition technology through Russian company

https://www.lemonde.fr/en/investigations/article/2026/03/04/how-iran-secretly-acquired-facial-rec...
2•zczc•10m ago•1 comments

Package managers need to cool down

https://nesbitt.io/2026/03/04/package-managers-need-to-cool-down.html
3•jamietanna•12m ago•1 comments

Evolving Typst

https://laurmaedje.github.io/posts/evolving-typst/
2•birdculture•17m ago•0 comments

Jane Colden: Naming the Living World

https://worldsensorium.com/jane-colden/
1•dnetesn•18m ago•0 comments

Object-Oriented Programming: Themes and Variations

https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/download/508/444
1•todsacerdoti•20m ago•0 comments

I found a legit uncensored Sora/Runway alternative for AI video (unbound.video)

https://unbound.video
1•gabrieln•21m ago•1 comments

Show HN: We built a zero-webhook Merchant of Record for SaaS

https://www.kelviq.com/
2•alokdubey007•21m ago•1 comments

Cross-Lingual News Dedup at $100/Month – Embeddings, Pgvector, and UnionFind

https://yingjiezhao.com/en/articles/Cross-Lingual-News-Dedup-at-100-Dollar-a-Month/
1•ethan_zhao•22m ago•1 comments

Show HN: Chase – Automated invoice follow-up emails for freelancers

https://chase.allonys.com
1•Cvexperts•25m ago•0 comments

Show HN: SynthesisOS – A local-first, agentic desktop layer built in Rust

https://github.com/GastonGelhorn/synthesis-os
1•synthesis_dev•27m ago•1 comments

Introducing: Build Awesome

https://blogfontawesome.wpcomstaging.com/introducing-build-awesome-static-site-platform-kickstarter/
1•bovermyer•29m ago•0 comments

Ascensions

https://www.jmduke.com/posts/ascensions.html
1•montyanderson•29m ago•0 comments

Pg_QoS v1.0.0 stable release is out

https://www.postgresql.org/about/news/pg_qos-v100-stable-release-is-out-3251/
1•aamederen•30m ago•0 comments

YggTorrent Shuts Down After Hack, Leak and Stolen Crypto

https://torrentfreak.com/yggtorrent-shuts-down-after-hack-leak-and-stolen-crypto/
2•teroshan•35m ago•1 comments

Published a fitness app to connect trainers and clients

https://apps.apple.com/us/app/flexor-fitness-companion/id6758482608
1•maradlo•39m ago•1 comments

Show HN: ÆTHERYA Core – deterministic policy engine for governing LLM actions

https://github.com/nayfly/aetherya-core
1•RobertMihai•39m ago•0 comments

Git-oops – undo any Git mistake with one command

https://github.com/hxmanss/git-oops
2•Hxmanss•40m ago•0 comments

Any Resolution Any Geometry: From Multi-View to Multi-Patch

https://dreamaker-mrc.github.io/Any-Resolution-Any-Geometry/
1•smusamashah•40m ago•0 comments

Whuppity Scoorie: the Scottish spring ritual bringing a town together

https://www.theguardian.com/uk-news/2026/mar/04/whuppity-scoorie-scotland-spring-ritual-lanark-cross
1•samizdis•41m ago•0 comments

Show HN: A .NET Web Framework on the Base .NET Core SDK

https://github.com/WispFramework/Wisp
2•utf_8x•41m ago•0 comments

European Central Bank: AI may be creating instead of destroying jobs for now

https://www.reuters.com/business/ai-may-be-creating-instead-destroying-jobs-now-ecb-blog-argues-2...
4•giuliomagnifico•43m ago•1 comments

Marc Benioff Praises Grok

https://twitter.com/cb_doge/status/2028936688689352818
1•sourcegrift•45m ago•0 comments

Show HN: Glyph, a local-first Markdown notes app for macOS built with Rust

https://glyphformac.com/
3•skarat•46m ago•2 comments