aayushkumar121•1h ago
Interesting idea, especially the “navigation vs selection” framing.
In practice, how are you measuring the +10pp gain? Are you using fixed eval sets or something more dynamic?
I’ve seen small models look better on benchmarks but regress pretty quickly once prompts/tools change slightly, so curious how stable these gains are over time.
pranabsarkar•1h ago
Fixed eval — 80 tools, 200 queries, 4 model sizes. +10pp came from "all tools" vs "tiered" on 1.5B.
You're right about stability. Haven't run rotated/rephrased evals yet. The 89% baseline (when models knew where to look) suggests selection capability is fine, but I'd expect some regression with adversarial prompts.
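A rotated/rephrased eval like the one discussed above can be cheap to bolt onto a fixed set: re-run the same queries with paraphrased wording and compare tool-selection accuracy. A minimal sketch, where `select_tool`, the query set, and the paraphrase table are all hypothetical stand-ins, not the author's actual harness:

```python
# Sketch of a stability probe for a fixed tool-selection eval.
# `select_tool` stands in for the model under test; PARAPHRASES is a
# hand-written rewording table (in practice you'd generate these).

PARAPHRASES = {
    "get weather in {city}": "what's the forecast for {city}",
    "convert {amt} usd to eur": "how much is {amt} dollars in euros",
}

def accuracy(select_tool, queries):
    """Fraction of (query, expected_tool) pairs the model gets right."""
    correct = sum(1 for q, expected in queries if select_tool(q) == expected)
    return correct / len(queries)

def rotated_eval(select_tool, queries):
    """Same queries, paraphrased wording -- a cheap regression check."""
    rephrased = [(PARAPHRASES.get(q, q), tool) for q, tool in queries]
    return accuracy(select_tool, rephrased)
```

The gap between `accuracy` on the original set and `rotated_eval` on the reworded one is a rough measure of how much of the +10pp survives prompt drift.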
gbibas•1h ago
This is cool. I'm working on something similar for AI code-schema reads, which were costing me a lot of tokens. I'll share once it's battle-tested. Abstracting the schema and then giving the model a tree to follow is where I landed too.
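The "abstract, then give it a tree to follow" idea can be sketched roughly as: expose only top-level names first, and let the model expand one branch on demand instead of reading the whole schema up front. A minimal illustration with a made-up schema (`SCHEMA`, `outline`, and `expand` are all hypothetical names, not from either commenter's code):

```python
# Toy nested schema standing in for whatever the model would otherwise
# read in full.
SCHEMA = {
    "users": {"id": "int", "profile": {"name": "str", "email": "str"}},
    "orders": {"id": "int", "total": "float"},
}

def outline(node):
    """Top-level keys only -- the cheap 'navigation' view."""
    return sorted(node) if isinstance(node, dict) else node

def expand(path):
    """Drill into one branch on demand ('selection' happens here)."""
    node = SCHEMA
    for key in path:
        node = node[key]
    return outline(node)
```

The model pays for `outline(SCHEMA)` once, then only for the branches it actually follows, e.g. `expand(["users", "profile"])`.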