1. Agentic coding tasks consume ~1000× more tokens than chat or reasoning workloads. Input tokens, not output tokens, become the dominant cost driver, because each round feeds the entire trajectory back into the model.
2. More tokens ≠ better outcomes. Runs on the same task can vary by up to 30× in token use, and accuracy often peaks at intermediate cost. Beyond that point, extra spending tends to reflect redundant exploration rather than further performance gains.
3. Models differ substantially in token efficiency. On the same successfully solved tasks, Kimi-K2 and Claude Sonnet-4.5 use roughly twice as many tokens as GPT-5.2. The gap widens further on tasks that all the models fail.
4. Human-rated task difficulty only weakly predicts actual cost. "Easy" tasks for humans can be surprisingly expensive for agents, and vice versa. The classic "Moravec's Paradox" holds for coding agents too!
5. Agents struggle to predict their own costs. Self-prediction correlations top out around 0.39, and every model we tested systematically underestimates what a task will cost. Result-based pricing still has a long way to go when we cannot even figure out the token cost beforehand.
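
To make point 1 concrete, here is a minimal sketch of why input tokens dominate when every round re-feeds the whole trajectory. The function name, the per-round token counts, and the system-prompt size are all illustrative assumptions, not figures from the study, and the sketch ignores prompt caching and context truncation.

```python
def trajectory_token_totals(rounds, system_prompt_tokens=0):
    """Total input/output tokens for an agent loop that re-sends the
    full trajectory on every round.

    rounds: list of (generated_tokens, observation_tokens) pairs, one per
            round. These are hypothetical inputs, not measured data.
    """
    context = system_prompt_tokens  # tokens currently in the trajectory
    total_input = 0
    total_output = 0
    for generated, observation in rounds:
        total_input += context             # whole trajectory is read again
        total_output += generated          # only new tokens count as output
        context += generated + observation # both join the trajectory
    return total_input, total_output


# Hypothetical run: 50 rounds, 200 generated + 800 tool-output tokens each.
total_in, total_out = trajectory_token_totals([(200, 800)] * 50, 2000)
print(total_in, total_out, total_in / total_out)
```

Because the context grows every round, cumulative input grows roughly quadratically with the number of rounds, while output grows only linearly.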
jiaxinpei•1h ago