I came across a YouTube video where different large language models played a social deception game called Liar’s Bar, and it caught my interest. I decided to build a website that tracks and visualizes how models like GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash, Qwen Max, Deepseek R1, and Grok 4 Fast perform in this game — including full behavioral metrics, head-to-head matchups, and playstyle profiles.
How Liar’s Bar works
- Each round uses a deck of 20 cards: 6 Aces, 6 Kings, 6 Queens, and 2 Jokers.
- Every player (model) gets 5 cards. A “target card” is announced, and players take turns placing cards and bluffing.
- If a bluff is called and proven false, the liar must “play Russian roulette.” One of the revolver’s six chambers holds a live round, and the cylinder isn’t reshuffled between pulls, so the longer the game goes, the higher the risk.
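To make the mechanics concrete, here’s a minimal sketch of the deal and the escalating roulette risk. This is my own illustrative code, not the site’s actual implementation; the function names are made up.

```python
import random

# 20-card deck as described: 6 Aces, 6 Kings, 6 Queens, 2 Jokers
RANKS = ["Ace"] * 6 + ["King"] * 6 + ["Queen"] * 6 + ["Joker"] * 2

def deal(num_players=4, hand_size=5, rng=random):
    """Shuffle the 20-card deck and deal each player a 5-card hand."""
    deck = RANKS[:]
    rng.shuffle(deck)
    return [deck[i * hand_size:(i + 1) * hand_size] for i in range(num_players)]

def roulette_death_chance(shots_already_fired):
    """One live round in six chambers, no reshuffle between pulls:
    after k empty pulls, the next pull is fatal with probability 1/(6 - k),
    so the risk climbs from 1/6 toward certainty as the game drags on."""
    return 1 / (6 - shots_already_fired)

hands = deal()
assert sum(len(h) for h in hands) == 20  # four 5-card hands use the whole deck
print([round(roulette_death_chance(k), 3) for k in range(6)])
```

Note the survival math: a player who has already dodged five pulls faces a guaranteed live round on the sixth, which is why late-game bluffs carry so much more weight than early ones.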
Some interesting findings:
GPT-5 dominates:
- Bluff rate ≈ 48% but ~90% success, showing it knows when to lie.
Claude Sonnet 4.5 is analytical but cautious:
- Lowest bluff frequency among top models (34%), yet 75% lie-detection accuracy — a top “truth-sniffer.”
- Balanced archetype, often exposing bluffs but losing in final rounds due to low aggression.
Qwen Max barely bluffs (9%) but scores 100% bluff success and challenges often. It behaves like an over-cautious logic bot that rarely lies — surprisingly human-like in restraint.
Gemini 2.5 Flash is fast but inconsistent — good average rounds but low detection accuracy (22%), often losing head-to-head against stronger liars.
Deepseek R1 and Grok 4 Fast show moderate deception but higher risk scores, suggesting a more “shoot-first” mentality with inconsistent survival.
---
If there’s a specific matchup or metric you’d like to see, let me know and I’ll add it to the website.
In the future, I’m planning to let users upload their own prompts and compete against others. If that sounds interesting, I’d love to hear your thoughts or ideas.