I tested:
- Gemini Pro 3
- Opus 4.6
- GLM-5
- Kimi 2.5
My rough criteria:
- Code correctness (first-pass compile success)
- Quality of architectural suggestions
- Refactor clarity
- Handling of existing code context
- Cost per useful output
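For the compile-success criterion, a cheap automated stand-in is just checking whether the generated Go even parses (a sketch, not my actual harness; the helper name is made up):

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

// firstPassOK reports whether a model's generated Go source at least
// parses. A full `go build` is the real test, but a parse check is a
// fast first filter that needs no module setup.
func firstPassOK(src string) bool {
	fset := token.NewFileSet()
	_, err := parser.ParseFile(fset, "gen.go", src, 0)
	return err == nil
}

func main() {
	good := "package main\nfunc main() {}\n"
	bad := "package main\nfunc main( {}\n" // unbalanced paren
	fmt.Println(firstPassOK(good), firstPassOK(bad))
}
```

Running `go build` in a temp module catches type errors too, but the parse check alone already separates a surprising number of outputs.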
Surprisingly (at least to me), Kimi 2.5 gave the best cost/performance ratio for this particular workload. It wasn’t always the most “verbose” or polished, but it required the fewest correction loops per dollar spent.
Opus 4.6 felt strong on reasoning-heavy changes, but cost scaled quickly. Gemini Pro 3 was decent but inconsistent in multi-file refactors. GLM-5 was interesting but sometimes hallucinated internal project structures.
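To make "correction loops per dollar" concrete, here's roughly the kind of arithmetic I mean (illustrative only; prices and token counts below are hypothetical, not real model pricing):

```go
package main

import "fmt"

// costPerUsefulOutput estimates dollars spent to land one accepted
// change. inTok/outTok are tokens for a single attempt; prices are per
// 1M tokens; loops is how many correction round-trips were needed.
// Crude assumption: each correction loop costs about as much as the
// original attempt.
func costPerUsefulOutput(inTok, outTok, inPrice, outPrice float64, loops int) float64 {
	perAttempt := inTok/1e6*inPrice + outTok/1e6*outPrice
	return perAttempt * float64(1+loops)
}

func main() {
	// Hypothetical: 20k input tokens, 5k output tokens, $3/$15 per 1M,
	// two correction loops before the change was accepted.
	fmt.Printf("$%.4f per accepted change\n", costPerUsefulOutput(20000, 5000, 3.0, 15.0, 2))
}
```

A cheaper model that needs three loops can still beat a pricier one that nails it first try, which is why I tracked loops and cost together rather than either alone.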
This is obviously anecdotal and project-specific.
Curious:
- What models are people here using for real-world codebases?
- Has anyone benchmarked cost against correction loops?
- Are you optimizing for raw quality, or for iteration speed per dollar?
Would love to hear other dev experiences, especially from people working in Go or other statically typed backends.