We put the two models in a 100-game tournament. Before each of its moves, we gave the smaller model a few examples of winning moves from past games.
The results were clear. Without the examples, the smaller model struggled against GPT-4.1. With them, its effectiveness increased by nearly 200%, and it won consistently.
It's a simple demonstration, but it shows that a smaller, faster model with good, timely examples can outperform a more capable base model.
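For the curious, here's a minimal sketch of the prompting loop, assuming an OpenAI-style chat completions API. The model name, the game-state strings, and the winning_examples format are placeholders, not the repo's actual code (see the repo for that):

    import random
    from openai import OpenAI

    client = OpenAI()

    def build_prompt(position, winning_examples, k=3):
        """Prepend k winning moves from past games to the current position."""
        shots = random.sample(winning_examples, min(k, len(winning_examples)))
        examples = "\n\n".join(
            f"Position:\n{ex['position']}\nWinning move: {ex['move']}"
            for ex in shots
        )
        return (
            "Here are winning moves from past games:\n\n"
            f"{examples}\n\n"
            f"Current position:\n{position}\n"
            "Reply with your move only."
        )

    def pick_move(position, winning_examples):
        # Inject the examples right before the model chooses its move.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder for "the smaller model"
            messages=[{"role": "user",
                       "content": build_prompt(position, winning_examples)}],
        )
        return resp.choices[0].message.content.strip()

The idea is just that winning_examples accumulates from earlier games, so each prompt carries fresh, relevant examples at the moment the model picks its move.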
The full write-up and code are in the repo.
totisjosema•6h ago
We have a short video walkthrough of the setup here: https://www.youtube.com/watch?v=z1MhXgmHbwk