v0.4.4: Added adaptive 2nd-hop impact analysis to callers search — when a function has ≤10 unique callers, tilth automatically traces callers-of-callers in a single scan. First full 26-task Opus baseline (previously 5 hard tasks only). Haiku adoption improved from 42% to 78%, flipping Haiku from a cost regression to -38% $/correct.
v0.4.5: Bumped TOKEN_THRESHOLD from 3500 to 6000 estimated tokens (~24KB), so mid-sized files return full content instead of an outline that agents then read back via 5–7 sequential --section calls. Fixed two major regressions: gin_radix_tree (+35% → ~tie) and rg_search_dispatch (+90% → -26% win). Sonnet hit 100% accuracy (52/52) and -34% $/correct overall.
--
https://github.com/jahala/tilth/
Full results: https://github.com/jahala/tilth/blob/main/benchmark/README.m...
-- PS: I dont have the budget to run the benchmark a lot (especially with Opus), so if any token whales has capacity to run some benchmarks, please feel free to PR results.
jahala•1h ago