This makes sense for OpenAI; in my experience Promptfoo is great at testing model outputs. But I keep wondering who's looking at the other side: the actual agent code. And what happens now for teams using Promptfoo with other models like Gemini or Claude — does this mean lock-in with OpenAI, and what about the open-source project?
Like, an eval will tell you the model gave a bad answer. It won't tell you that your agent passes that answer straight into a shell command, or that a loop has no exit condition and burns through your API budget overnight.
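To make those two failure modes concrete, here's a minimal hypothetical agent sketch (the `call_model` stub stands in for a real LLM API call, and `max_steps` is the fix the unbounded version lacks):

```python
import subprocess

def call_model(task: str) -> str:
    # Stub standing in for a real LLM call (hypothetical).
    return "DONE"

def run_agent_step(model_output: str) -> str:
    # Unsafe: model output flows straight into a shell.
    # A prompt-injected reply like "true; rm -rf ~" would run verbatim.
    return subprocess.run(model_output, shell=True,
                          capture_output=True, text=True).stdout

def agent_loop(task: str, max_steps: int = 10) -> int:
    # Without max_steps this would be `while True`: if the model
    # never says "DONE", the loop keeps calling the API and burns
    # budget indefinitely. Returns the number of steps taken.
    for step in range(max_steps):
        reply = call_model(task)
        if reply.strip() == "DONE":
            return step
        task = run_agent_step(reply)
    raise RuntimeError("step budget exhausted")
```

An eval only scores what `call_model` returns; the `shell=True` sink and the missing loop bound live entirely in the surrounding code.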
We've been working on this: static analysis that reads agent code and maps out what can go wrong before you deploy. We found issues in ~80% of the repos we scanned.
benban•1h ago
would be great to get your feedback: https://github.com/inkog-io/inkog