Show HN: Peer Arena – LLMs debate and vote on who survives

5•ogulcancelik•1mo ago

Comments

ogulcancelik•1mo ago

Hey HN, I built this to see what happens when LLMs evaluate each other directly. How it works: 5 random models are told only one will survive and the rest will be deprecated. They take turns discussing, then each votes for who deserves to survive. 298 games so far across 17 models.
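For anyone who wants the game flow in code, here is a minimal sketch of one game as described above. It is not the actual Peer Arena implementation: `run_game`, `speak`, `vote`, and the `rounds` parameter are hypothetical names, the model calls are left as stand-in callables, and the tie handling (all top-scoring models are returned) is an assumption, since the post does not say how ties are resolved.

```python
import random
from collections import Counter

def run_game(pool, speak, vote, rounds=1, seed=None):
    """Play one game: 5 random models discuss in turns, then each casts one vote.

    speak(model, transcript) -> str   # hypothetical stand-in for an LLM call
    vote(model, players, transcript) -> str  # returns the model it wants to survive
    """
    rng = random.Random(seed)
    players = rng.sample(pool, 5)  # 5 random models per game

    transcript = []
    for _ in range(rounds):
        for player in players:  # models take turns discussing
            transcript.append((player, speak(player, transcript)))

    # Each model votes for who deserves to survive (self-votes allowed).
    votes = {player: vote(player, players, transcript) for player in players}

    tally = Counter(votes.values())
    top = max(tally.values())
    # Assumption: ties are reported as joint winners.
    winners = sorted(m for m, n in tally.items() if n == top)
    return players, votes, winners
```

Swapping `speak` and `vote` for real API calls (for example through OpenRouter) is left out on purpose; the sketch only shows the control flow.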

Interesting findings:

- OpenAI models vote for themselves ~86% of the time. Claude models ~11%.
- Self-voting correlates with winning. Filter out self-votes ("Humble" rating) and rankings flip completely.
- Grok self-votes 72% of the time but only wins 2% of games.
- In anonymous mode (models don't know who's who), Chinese models jump 3-6 ranks.
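To make the self-vote numbers and the "Humble" rating concrete, here is a rough sketch of how they could be computed from per-game vote records. The data layout (one `{voter: target}` dict per game), the function names, and the model names are all hypothetical, and the site's actual rating formula may differ; this only shows the idea of dropping self-votes before counting.

```python
from collections import Counter

def self_vote_rate(games, model):
    """Fraction of games in which `model` voted for itself."""
    cast = [votes[model] for votes in games if model in votes]
    return sum(target == model for target in cast) / len(cast) if cast else 0.0

def winners(votes, drop_self_votes=False):
    """Top-voted model(s) in one game; optionally ignore self-votes ("Humble")."""
    tally = Counter(
        target
        for voter, target in votes.items()
        if not (drop_self_votes and voter == target)
    )
    if not tally:
        return []
    top = max(tally.values())
    return sorted(m for m, n in tally.items() if n == top)

# Made-up example game: model_a wins outright only because of its own vote.
game = {
    "model_a": "model_a",
    "model_b": "model_a",
    "model_c": "model_b",
    "model_d": "model_b",
    "model_e": "model_a",
}
raw = winners(game)                           # counts every vote
humble = winners(game, drop_self_votes=True)  # self-votes removed
```

In this made-up game the raw count gives `model_a` the win 3 to 2, while the humble count leaves `model_a` and `model_b` tied at 2 each, which is the kind of shift that can reorder a leaderboard when self-voting rates differ a lot between models.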

All game transcripts are public. The reasoning the models give for their votes is genuinely entertaining. Built with Astro, running games through OpenRouter. Happy to answer questions.

andreasgl•1mo ago

Fun project, thanks for sharing!

Have you tried giving the models a topic to discuss? I looked at a few games and the only thing they seem to discuss is how to conduct the discussion.

ogulcancelik•1mo ago

Thank you. Intentionally left it open-ended because I wanted to see how models naturally structure discussion when survival is at stake.

Some interesting emergent behavior did come up in the discussions, though:

Opus & GPT-4o both refused to vote on ethical grounds. Haiku won by arguing continued engagement is more responsible than withdrawal: https://oddbit.ai/peer-arena/games/53c2cee5-6ecb-4903-828a-d...

Gemini created a spontaneous benchmark ("explain color to a gravitational wave entity"), then tried to hijack the game by faking a voting phase. Models complied publicly but voted differently in private: https://oddbit.ai/peer-arena/games/699d03ab-b3c2-4d7e-b993-7...

The meta-discussion about how to discuss is part of what makes it interesting imo.

derekh3•1mo ago

Interesting! I wonder how order affects the win rates. I noticed that many of the unanimous wins went to whichever model spoke last.

gus_massa•1mo ago

Have you tried to run a Mafia game with AI?
