Following up on my prompt tool. I realized that "prompt engineering" is useless if we don't know which model actually handles the instruction best. We often argue about whether GPT-5 is better than Gemini 3 Pro, but the answer usually depends on the specific use case.
So I built a module to create Head-to-Head Model Benchmarks (Tier Lists).
How it works:
Define a specific task (e.g., "Writing a LinkedIn launch post").
Run it simultaneously on 14+ models (including the latest GPT-5, Gemini 3, Claude 4.5 families).
Compare metrics: speed, token usage, and qualitative output (a rough harness for this step is sketched after the list).
Rank them: Drag and drop models into S, A, B, C tiers based on the results.
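To make the "run simultaneously, then compare" step concrete, here is a minimal sketch of how such a harness could work. It is not the tool's actual implementation: it assumes an OpenAI-compatible gateway (e.g. OpenRouter) behind the standard openai Python SDK, and the model IDs and prompt are placeholders you would swap for whatever your provider exposes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI  # assumes an OpenAI-compatible gateway is configured

# Placeholder model IDs -- substitute the identifiers your gateway actually exposes.
MODELS = ["gpt-5", "gemini-3-pro", "claude-4.5-sonnet"]
PROMPT = "Write a LinkedIn launch post for a prompt-benchmarking tool."

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment


def run_one(model: str) -> dict:
    """Send the same prompt to one model and record latency and token usage."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return {
        "model": model,
        "latency_s": round(time.perf_counter() - start, 2),
        "total_tokens": resp.usage.total_tokens,
        "output": resp.choices[0].message.content,
    }


# Fire all requests in parallel so every model sees the prompt at the same moment.
with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    results = list(pool.map(run_one, MODELS))

# Quantitative side of the comparison; the qualitative S/A/B/C ranking stays manual.
for r in sorted(results, key=lambda r: r["latency_s"]):
    print(f'{r["model"]:>20}  {r["latency_s"]:>6}s  {r["total_tokens"]:>6} tokens')
```

The numbers only cover speed and token usage; reading the outputs side by side and dragging models into tiers is still the human part of the workflow.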
Why I built this: I found that for coding, Model X might be S-Tier, but for creative writing, it drops to C-Tier. This tool lets the community build a library of proven "best models" for specific intents.
The experiment linked above shows how Gemini 3 Pro outperformed the rest on marketing copy (S-Tier) for my specific criteria, while the other models' output felt too robotic.
Would love to see you create your own benchmarks and share the results. Which model is your current daily driver for coding vs. writing?