Hi HN, I noticed it is almost impossible to run evals or train models on 3rd party integrations, so I built interactive environments for them. Feedback is more than welcome. Thanks!
Interesting fact - when we ran evals on 40 tasks against the Linear API, most frontier models scored surprisingly well:
- Claude Opus 4.5: 95% (38/40)
- GLM 4.6: 87.5% (35/40)
- Claude Sonnet 4.5: 85% (34/40)
- Claude Haiku 4.5: 82.5% (33/40)
- Kimi K2: 82.5% (33/40)
- Grok 4.1 Fast: 80% (32/40)
- GPT 5.1: 77.5% (31/40)
This makes me wonder whether we really need to reinvent the wheel and build special interfaces (MCPs) for agents to interact with services, when they can just use the APIs as they are.
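For a concrete picture, this is roughly what "using the API as it is" means for Linear - a plain GraphQL call, no MCP layer in between. This is a minimal sketch, not code from our eval harness; the create_issue helper is just illustrative:

    # Minimal sketch: the kind of raw Linear API call an agent can write itself.
    # Assumes LINEAR_API_KEY is set in the environment.
    import os
    import requests

    def create_issue(team_id: str, title: str) -> dict:
        """Create a Linear issue via a plain GraphQL mutation."""
        query = """
        mutation CreateIssue($teamId: String!, $title: String!) {
          issueCreate(input: {teamId: $teamId, title: $title}) {
            success
            issue { id identifier url }
          }
        }
        """
        resp = requests.post(
            "https://api.linear.app/graphql",
            json={"query": query, "variables": {"teamId": team_id, "title": title}},
            headers={"Authorization": os.environ["LINEAR_API_KEY"]},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()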
hugobiais•2mo ago
Super interesting!
At my company we have our agent writing code to make API calls, and we were looking for a way to evaluate our agent on exactly that! The problem with doing that yourself against the real Gmail, Linear, or Slack API is that you quickly hit rate limits, but with a copy of the service, problem solved.
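To be concrete, the appeal is that the agent's code stays identical and only the endpoint changes between the real API and the copy. A rough sketch of what I mean - the LINEAR_BASE_URL variable is something we'd define ourselves, not part of this project:

    # Same agent-generated call, pointed at either the real API or a local copy.
    # LINEAR_BASE_URL is a made-up env var for illustration.
    import os
    import requests

    LINEAR_URL = os.environ.get("LINEAR_BASE_URL", "https://api.linear.app/graphql")

    def run_agent_call(query: str, variables: dict) -> dict:
        """Execute an agent-written GraphQL query against whichever
        endpoint LINEAR_BASE_URL points at, so eval runs never hit
        real rate limits."""
        resp = requests.post(
            LINEAR_URL,
            json={"query": query, "variables": variables},
            headers={"Authorization": os.environ.get("LINEAR_API_KEY", "test-key")},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()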
Will definitely try this!
hubertmarek•2mo ago
Were rate limits the main blocker for you?
akshay326•1mo ago
thanks for sharing, and love the transparency of posting the test results too.
mildly curious - why did you choose Slack & Linear? why not something else?