I’m collecting data to benchmark different models as both players and judges (OpenAI / Anthropic / Gemini / Mistral / DeepSeek), but I only have ~45 games so far and need far more before publishing comparisons. (5 AI players and 4 judges, assigned at random per game, gives 5 × 4 = 20 distinct game setups to evaluate.)
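For concreteness, here's a rough sketch of that setup space (the provider lists are illustrative, and which four act as judges is an assumption, not the live roster):

    from itertools import product

    # Illustrative provider lists; the live roster may differ.
    players = ["openai", "anthropic", "gemini", "mistral", "deepseek"]
    judges = ["openai", "anthropic", "gemini", "deepseek"]

    setups = list(product(players, judges))
    print(len(setups))       # 20 distinct (player, judge) pairings
    print(45 / len(setups))  # ~2.25 games per setup so far; too few to compare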
It's fully free (I pay for all the tokens), and no signup is required for the first game: https://turingduel.com
Questions + criticism welcome! I will share aggregated results once there’s enough signal.
jacob_indie•1h ago
What is interesting, though, is that there are different judges and how they compare to each other (a first look at the data shows they do differ).
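As an example of what that comparison could look like once there's more data, here's a minimal sketch with a hypothetical record format (not the site's actual schema):

    from collections import defaultdict

    # Hypothetical per-game records: which judge presided and whether it
    # correctly identified the human player.
    games = [
        {"judge": "anthropic", "found_human": True},
        {"judge": "gemini", "found_human": False},
        # ... one record per finished game
    ]

    stats = defaultdict(lambda: [0, 0])  # judge -> [correct, total]
    for g in games:
        stats[g["judge"]][1] += 1
        stats[g["judge"]][0] += g["found_human"]

    for judge, (correct, total) in sorted(stats.items()):
        print(f"{judge}: {correct}/{total} = {correct / total:.0%}")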
Also, it is interesting to see how well the AI opponents pick up personality and clues from the round history. Some LLMs pick this up very well and counter the human; others are quite "dumb" and just submit random words. The same goes for the AI judges.
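Mechanically, picking up on history just means prior rounds get serialized into the prompt; roughly like this simplified sketch (not the exact prompt or schema the site uses):

    def build_judge_prompt(rounds: list[dict]) -> str:
        # Serialize prior rounds so the judge can track each player's style.
        history = "\n".join(
            f"Round {i + 1}: A submitted {r['a']!r}, B submitted {r['b']!r}"
            for i, r in enumerate(rounds)
        )
        return (
            "One of players A and B is human, the other is an AI.\n"
            f"{history}\n"
            "Judging by word choice and consistency across rounds, "
            "which player is more likely human? Answer A or B."
        )

    print(build_judge_prompt([{"a": "ocean", "b": "entropy"},
                              {"a": "wave", "b": "quantum"}]))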
I do store the opponents' and judges' reasoning in the background, but I'm not displaying it for the moment; it might be interesting to add later, but showing it would distort the data ;)