On coding agents: are there any good / efficient ways to measure how well different implementations actually work? SWE-bench seems good, but it's expensive to run. Effectively I'm curious about things like: given tool definition X vs Y (e.g. diff vs full-file edit), the prompt for tool X vs Y (how it's described, whether it uses examples), model choice (e.g. MCP with Claude, but inline python-exec with GPT-5), sub-agents, todo lists, etc., how much does each ablation actually matter? And measure not just success, but cost to success too (efficiency).
Overall, it seems like in the phase space of options everything "kinda works," but I'm very curious whether there are any major lifts, big gotchas, etc.
I ask because it feels like the Claude Code CLI always does a little bit better, subjectively, for me, but I haven't seen an LLMarena-style or clear A-vs-B comparison or measure.
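To make "cost to success" concrete, here's a rough sketch of the kind of ablation harness I have in mind. Everything here is a placeholder: run_agent(), the task IDs, and the variant lists are assumptions, not a real benchmark API.

    # Sketch of a tiny ablation harness: run_agent(), the task IDs, and the variant
    # lists below are placeholders/assumptions, not a real benchmark API. The point
    # is the shape of the measurement: success rate AND tokens-per-success.
    from dataclasses import dataclass
    from itertools import product

    @dataclass
    class Result:
        passed: bool
        tokens_in: int
        tokens_out: int

    @dataclass
    class VariantStats:
        runs: int = 0
        passes: int = 0
        total_tokens: int = 0

        @property
        def success_rate(self) -> float:
            return self.passes / self.runs if self.runs else 0.0

        @property
        def tokens_per_success(self) -> float:
            # Cost-to-success: total spend divided by solved tasks (inf if none solved).
            return self.total_tokens / self.passes if self.passes else float("inf")

    def run_agent(model: str, edit_tool: str, prompt_style: str, task: str) -> Result:
        # Stub: replace with a real agent run plus test-based grading.
        return Result(passed=False, tokens_in=0, tokens_out=0)

    MODELS = ["claude", "gpt"]                  # assumed backends to compare
    EDIT_TOOLS = ["diff", "full_file"]          # the tool-definition ablation
    PROMPT_STYLES = ["terse", "with_examples"]  # how the tool is described
    TASKS = ["task_001", "task_002"]            # small, cheap task set (not SWE-bench scale)

    stats: dict[tuple, VariantStats] = {}
    for model, tool, style in product(MODELS, EDIT_TOOLS, PROMPT_STYLES):
        s = stats.setdefault((model, tool, style), VariantStats())
        for task in TASKS:
            r = run_agent(model, tool, style, task)
            s.runs += 1
            s.passes += int(r.passed)
            s.total_tokens += r.tokens_in + r.tokens_out

    # Rank variants by efficiency, not just raw pass rate.
    for key, s in sorted(stats.items(), key=lambda kv: kv[1].tokens_per_success):
        print(key, f"success={s.success_rate:.0%}", f"tokens/success={s.tokens_per_success:.0f}")

Ranking by tokens-per-success rather than raw pass rate is the part I rarely see reported.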
I've been using it a lot lately and anything beyond basic usage is an absolute chore.
binalpatel•1h ago
https://github.com/caesarnine/rune-code
Part of the reason I switched to it initially wasn't so much its niceties as being disgusted at how poor the documentation/experience of using LiteLLM was; I figured the folks who make Pydantic would do a better job of the "universal" interface.
ziftface•57m ago