Whether model routing works is an empirical problem. Existing empirical efforts rely on benchmark accuracy retention, i.e. how does a model routing system score compared to a sophisticated model like Opus 4.7 on a complex task benchmark like Terminal-Bench 2.0.
However, that metric is completely divorced from what we care about. The better metric is utility retention, which takes into account task importance.
fmaccomber•1h ago
However, that metric is completely divorced from what we care about. The better metric is utility retention, which takes into account task importance.