For example, the models from Anthropic, OpenAI, Google, etc. can be accessed via:

- IDE integration, e.g. VS Code, JetBrains, etc.
- Dedicated apps and CLIs, e.g. Codex, Claude, Copilot CLI, etc.
It's already bad enough that SWE orgs are struggling to quantify the strengths and weaknesses of the models themselves. Now we have their integrations/entry points to test out too, and I'm not sure how we can even begin to systematically evaluate these tools...
How are you approaching this? What's worked for you and what hasn't?