I came across a YouTube video where different large language models played a social deception game called Liar’s Bar, and it caught my interest. I decided to build a website that tracks and visualizes how models like GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash, Qwen Max, Deepseek R1, and Grok 4 Fast perform in this game — including full behavioral metrics, head-to-head matchups, and playstyle profiles.
How Liar’s Bar works
- Each round uses a deck of 20 cards: 6 Aces, 6 Kings, 6 Queens, and 2 Jokers.
- Every player (model) gets 5 cards. A “target card” is announced, and players take turns placing cards and bluffing.
- If a bluff is called and proven false, the liar must “play Russian roulette.” One of the revolver’s six chambers holds a live round, and the cylinder isn’t reshuffled between pulls, so the longer the game goes, the higher the risk.
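To make the mechanics concrete, here’s a minimal sketch of the deal and the escalating roulette risk. This is my own illustrative code, not the site’s actual implementation; the function names are made up.

```python
import random

# 20-card deck as described: 6 Aces, 6 Kings, 6 Queens, 2 Jokers
RANKS = ["Ace"] * 6 + ["King"] * 6 + ["Queen"] * 6 + ["Joker"] * 2

def deal(num_players=4, hand_size=5, rng=random):
    """Shuffle the 20-card deck and deal each player a 5-card hand."""
    deck = RANKS[:]
    rng.shuffle(deck)
    return [deck[i * hand_size:(i + 1) * hand_size] for i in range(num_players)]

def roulette_death_chance(shots_already_fired):
    """One live round in six chambers, no reshuffle between pulls:
    after k empty pulls, the next pull is fatal with probability 1/(6 - k),
    so the risk climbs from 1/6 toward certainty as the game drags on."""
    return 1 / (6 - shots_already_fired)

hands = deal()
assert sum(len(h) for h in hands) == 20  # four 5-card hands use the whole deck
print([round(roulette_death_chance(k), 3) for k in range(6)])
```

Note the survival math: a player who has already dodged five pulls faces a guaranteed live round on the sixth, which is why late-game bluffs carry so much more weight than early ones.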
Some interesting findings:
GPT-5 dominates:
- Bluff rate ≈ 48% but ~90% success, showing it knows when to lie.
Claude Sonnet 4.5 is analytical but cautious:
- Lowest bluff frequency among top models (34%), yet 75% lie-detection accuracy — a top “truth-sniffer.”
- Balanced archetype, often exposing bluffs but losing in final rounds due to low aggression.
Qwen Max barely bluffs (9%) but scores 100% bluff success and challenges often. It behaves like an over-cautious logic bot that rarely lies — surprisingly human-like in restraint.
Gemini 2.5 Flash is fast but inconsistent — good average rounds but low detection accuracy (22%), often losing head-to-head against stronger liars.
Deepseek R1 and Grok 4 Fast show moderate deception but higher risk scores, suggesting a more “shoot-first” mentality with inconsistent survival.
---
If there’s a specific matchup or metric you’d like to see, let me know and I’ll add it to the website.
In the future, I’m planning to let users upload their own prompts and compete against others. If that sounds interesting, I’d love to hear your thoughts or ideas.