tripplyons•52m ago
Great work! I wonder if there is a way to combine similar cache items instead of dropping unlikely ones. Could the proposed attention estimation be used for that?
yorwba•2m ago
Yes, for example https://arxiv.org/pdf/2506.05410 merges the two neighboring tokens with the lowest sum of past attention scores, and this method would enable using expected future attention instead.
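A minimal NumPy sketch of that neighbor-merging idea (not the exact procedure from either paper; the attention-weighted averaging and the score bookkeeping are my assumptions):

```python
import numpy as np

def merge_lowest_attention_pair(keys, values, attn_scores):
    """Merge the neighboring KV-cache pair with the lowest summed attention.

    keys, values: (seq_len, head_dim) cached tensors for one head
    attn_scores: (seq_len,) cumulative attention each cached token received
                 (or, per the comment above, expected future attention)
    Returns keys, values, attn_scores with seq_len - 1 entries.
    """
    # Score each adjacent pair (i, i+1) by its summed attention
    pair_scores = attn_scores[:-1] + attn_scores[1:]
    i = int(np.argmin(pair_scores))  # least-attended neighboring pair

    # Attention-weighted average, so the merged entry leans toward the
    # token that mattered more (an assumed merge rule, for illustration)
    w = attn_scores[i : i + 2]
    w = w / max(w.sum(), 1e-12)
    merged_k = w[0] * keys[i] + w[1] * keys[i + 1]
    merged_v = w[0] * values[i] + w[1] * values[i + 1]

    keys = np.concatenate([keys[:i], merged_k[None], keys[i + 2 :]])
    values = np.concatenate([values[:i], merged_v[None], values[i + 2 :]])
    attn_scores = np.concatenate(
        [attn_scores[:i], [attn_scores[i] + attn_scores[i + 1]], attn_scores[i + 2 :]]
    )
    return keys, values, attn_scores
```

Calling this in a loop until the cache fits the budget gives gradual compression without ever fully dropping a token's contribution.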
yalok•35m ago
The paper only reports evals for RULER 4K and 16K; I wish they'd gone further and measured longer context windows. I was also wondering whether this method could actually beat the no-compression baseline. Their Qwen results on RULER 16K seem to allude to that: at small compression ratios the evals look better than baseline, which would mean they're not just improving inference speed and memory but also addressing the attention-dilution problem.