Show HN: RAG chunk size "best practices" failed on legal text – I benchmarked it

https://medium.com/@TheWake/i-built-a-rag-tuning-tool-and-discovered-intuition-fails-on-legal-text-9744be9a4bc5

2•metawake•2w ago

Comments

metawake•2w ago

Author here. Built RagTune to stop guessing at RAG configs.

Surprising findings:

1. On legal text (CaseHOLD), 1024 chunks scored WORST (0.618). The "small" 256 chunks won (0.664). That's a ~7% relative swing (0.046 absolute).

2. On Wikipedia text? All chunk sizes hit ~99%. No difference.

3. Plot twist: At 5K docs, optimal chunk size FLIPPED from 256→1024. Scale changes everything.

Code is MIT: github.com/metawake/ragtune

Happy to discuss methodology.

patrakov•2w ago

Now that you have 5K docs, can you try estimating the statistical uncertainty of the Recall@5 and MRR metrics measured via smaller datasets? Just make some different 400-document subsets of the whole 5K HotpotQA dataset and recalculate the metrics.

metawake•2w ago

Great suggestion! This is exactly the right methodology for establishing confidence intervals.

I've added this to the roadmap as `--bootstrap N`:

    ragtune simulate --queries queries.json --bootstrap 5
    
    # Output:
    # Recall@5:  0.664 ± 0.012 (n=5)
    # MRR:       0.533 ± 0.008 (n=5)

The implementation would sample N random subsets from the query set (or corpus), run each independently, and report mean ± std.
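
A rough sketch of that idea in Python, not RagTune's actual code. It assumes you already have one score per query (Recall@5 or reciprocal rank) and resamples those scores; the function names are hypothetical. Note that this draws subsets without replacement (subsampling), whereas a textbook bootstrap samples with replacement, and that corpus subsets would require re-running retrieval for each subset rather than reusing scores.

```python
import random
import statistics


def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = set(ranked_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def subset_stats(per_query_scores, subset_size, n_subsets, seed=0):
    """Mean and sample std of a metric across random query subsets.

    Hypothetical helper: draws n_subsets subsets of subset_size queries
    (without replacement within a subset) and averages the metric in each.
    """
    if n_subsets < 2:
        raise ValueError("need at least 2 subsets to estimate a std")
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    subset_means = [
        statistics.mean(rng.sample(per_query_scores, subset_size))
        for _ in range(n_subsets)
    ]
    return statistics.mean(subset_means), statistics.stdev(subset_means)
```

Averaging a metric over all queries gives MRR (from `reciprocal_rank`) or mean Recall@5 (from `recall_at_k`); `subset_stats` then reports how much that average moves between subsets.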

This also makes it possible to tell real regressions from noise, e.g. "Recall dropped 3% ± 0.8%" is actionable, while "dropped 3%" alone isn't.
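
One simple way to turn that into a check, as a sketch under stated assumptions: treat the std across subsets as each run's uncertainty, assume the two runs are independent, and flag a drop only when it exceeds a chosen multiple of the combined std. The function name and the threshold of 2 are illustrative choices, not anything RagTune ships.

```python
import math


def is_real_regression(baseline_mean, baseline_std, new_mean, new_std, z=2.0):
    """Return True if the metric dropped by more than z combined stds.

    Assumes independent runs, so the uncertainties add in quadrature.
    The default z=2.0 is an arbitrary illustrative threshold.
    """
    drop = baseline_mean - new_mean
    noise = math.sqrt(baseline_std ** 2 + new_std ** 2)
    return drop > z * noise
```

With a 0.03 drop, per-run stds of 0.008 give a combined noise of about 0.011, so the drop is flagged; with per-run stds of 0.02 the same drop falls inside the noise and is not.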

Will ship this in the next few weeks. Thanks for the push toward more rigorous methodology; this is exactly what's missing from most RAG benchmarks.