So we built Fallom (homage to Asimov), a platform where you can compare how multiple models perform on your own evals or production data. You can easily see the cost and performance differences and decide whether it’s worth switching models.
Would love feedback from anyone who’s built internal model testing pipelines. We learned a lot the hard way and are still learning.