I always feel like there's a gap between benchmarks and real-world performance for all models, but especially open models as of late. I've used DeepSeek and Kimi (albeit 2.6) extensively, and while they work well for maybe 70 percent of tasks, it's the last 30 percent they always trip themselves up on.
They all seem to be very poor at long-running tasks, maintaining context when a change spans multiple areas/layers of a codebase, and making architectural choices. That said, for making the next todo app or AI-powered calorie tracker, they are just fine, and they have the most consumer-friendly pricing.
noah34•1h ago