As far as Opus vs. GPT 5.5 etc. goes, I generally decide with:
1. Code? -> Opus
2. Docs? -> GPT
3. Real-time or recent information needed? -> Gemini
It's far from perfect though. Would love to hear others' thoughts.
Perplexity for deep research
Claude Opus for coding, Sonnet for writing
Gemma4 for local AI overviews and analysis
Qwen coder for local prototyping
PaulHoule•5h ago
If you want to know accurately whether one model is better than another, you have to test both on hundreds if not thousands of examples that are carefully graded in difficulty, not in the training sets, etc.
Practically, you might try model A and model B, use each one 2-3 times on different tasks, and walk out with the impression that A is really good and B sux. But it could be that model A looked good only because you happened to ask it things it is good at, or because it landed on the right answer by chance.
See https://arxiv.org/html/2410.12972v1 and https://arxiv.org/pdf/2505.14810 -- those papers are considering a general space of tasks but you could totally do the same kind of eval for the tasks you care about.
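To put rough numbers on the "2-3 tries" problem, here is a minimal sketch, not taken from either paper. It assumes you grade each task pass/fail and, for the head-to-head comparison, record which model did better on each task (ties dropped). The function names `wilson_interval` and `sign_test_p` are my own; the techniques are the standard Wilson score interval and the exact two-sided sign test.

```python
from math import comb, sqrt


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a pass rate."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign test: p-value for 'A and B are equally good'
    given how many tasks each model won (ties excluded)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(wilson_interval(3, 3))   # model A passed 3 of 3 tasks
    print(sign_test_p(3, 0))       # A beat B on 3 of 3 tasks
    print(sign_test_p(60, 40))     # A beat B on 60 of 100 tasks
```

With 3 passes out of 3, the interval still reaches down to roughly 0.44, and a 3-0 head-to-head gives p = 0.25, so neither result rules out plain luck. The sample only starts to carry weight once you are into the tens or hundreds of tasks.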
bix6•3h ago
PaulHoule•1h ago