I'm planning on adding automated evals that run your prompts + tool definitions against whatever models you want each time you make a change. Want to know how "dumb" of a model you can get away with? I want to add that as an option too. New model comes out and you need to see how it'll do? You get the idea.
I don't have billing/pricing set up yet, which is why it's on the waitlist, but sign up and I'll let you in to test it out and give feedback.