With new AI models being released what feels like every other week, how are companies running internal evals to determine which model is best for their use case?
Comments
veloryn•32m ago
A lot of teams still seem to rely on ad-hoc eval sets and manual spot checks, especially for domain-specific use cases. The harder problem starts when agents or tool use enter the picture: the evaluation surface expands beyond model output quality to things like tool selection reliability, reasoning loops, cost stability, and cascading failure modes across steps. At that point you're effectively evaluating system behavior rather than just model accuracy.
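To make that concrete, here's a minimal sketch (Python) of what moving from output-only evals to system-level evals can look like. Everything in it is illustrative: run_agent is a hypothetical entry point assumed to return the final answer, the ordered tool-call names, and the run cost, and the scoring is a naive substring match you'd swap for real graders.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str
    expected_answer: str
    expected_tools: list[str]  # tool names the agent should call, in order
    max_steps: int = 10        # cap to catch runaway reasoning loops


@dataclass
class EvalResult:
    answer_correct: bool
    tools_correct: bool
    steps: int
    cost_usd: float


def run_case(case: EvalCase, run_agent) -> EvalResult:
    # run_agent is a stand-in for your agent entry point; assumed here to
    # return (final_answer, ordered_tool_call_names, cost_in_usd).
    answer, tool_calls, cost = run_agent(case.prompt, max_steps=case.max_steps)
    return EvalResult(
        answer_correct=case.expected_answer.lower() in answer.lower(),
        tools_correct=list(tool_calls) == case.expected_tools,
        steps=len(tool_calls),
        cost_usd=cost,
    )


def summarize(results: list[EvalResult]) -> dict:
    # Aggregate system-level metrics, not just answer accuracy.
    n = len(results)
    return {
        "answer_accuracy": sum(r.answer_correct for r in results) / n,
        "tool_selection_accuracy": sum(r.tools_correct for r in results) / n,
        "mean_cost_usd": sum(r.cost_usd for r in results) / n,
        "max_steps_observed": max(r.steps for r in results),
    }
```

Even something this crude surfaces regressions in tool selection and cost when you swap models, which manual spot checks of final answers tend to miss.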