krona•22h ago
Thank you for writing a clear explanation of how false positive rates determine your minimum change threshold. I've found it surprisingly difficult to explain this to developers and QA engineers without a basic statistical background.
What makes the situation worse, which you didn't mention, is that developers like to write a suite of benchmarks, each of which can produce a false-positive regression. So even if the FP rate of an individual benchmark is <1%, you can easily end up with a 10% FP rate across the suite if it's large enough.
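For concreteness, a quick back-of-the-envelope sketch in Python (assuming the benchmarks are independent, which is roughly the worst case here):

    # With n independent benchmarks, each with false-positive rate p, the
    # chance that at least one falsely flags a regression is 1 - (1 - p)^n.
    def suite_fp_rate(p: float, n: int) -> float:
        return 1 - (1 - p) ** n

    print(suite_fp_rate(0.01, 10))   # ~0.096: already ~10% with 10 benchmarks
    print(suite_fp_rate(0.01, 100))  # ~0.634: a large suite flags most runs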
art049•21h ago
I hadn't considered this, but it would be really interesting to take into account. The size of the benchmark suite directly affects the false positive rate: counterintuitively, the more benchmarks in the suite, the higher the chance of a false positive, even with super steady benchmarks. (Thanks, this could also be an interesting follow-up article!)
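One standard way to take it into account (not something from the article, just a textbook statistical fix) would be a Šidák-style correction: pick the per-benchmark threshold so the suite-level FP rate stays at a target, again assuming independent benchmarks:

    # Per-benchmark false-positive rate that keeps the whole suite at
    # suite_alpha, assuming the n benchmarks are independent (Šidák).
    def per_benchmark_alpha(suite_alpha: float, n: int) -> float:
        return 1 - (1 - suite_alpha) ** (1 / n)

    print(per_benchmark_alpha(0.05, 100))  # ~0.0005: each benchmark needs a
                                           # much stricter threshold

In other words, the bigger the suite, the stricter (and the slower to detect real regressions) each individual benchmark has to be.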
IshKebab•15h ago
I think the best way to do this would be to use something deterministic, like instruction counts, for the actual pass/fail. You can still include wall time for information.
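Something like this, as a rough sketch on Linux (perf must be installed; the baseline value, tolerance, and ./my_benchmark are placeholders):

    import subprocess
    import time

    def run_benchmark(cmd):
        """Run cmd under `perf stat` and return (instruction count, wall time)."""
        start = time.monotonic()
        result = subprocess.run(
            ["perf", "stat", "-x,", "-e", "instructions"] + cmd,
            capture_output=True, text=True, check=True,
        )
        wall = time.monotonic() - start
        # With -x,, perf writes CSV lines to stderr: value,unit,event,...
        for line in result.stderr.splitlines():
            fields = line.split(",")
            if len(fields) > 2 and fields[2].startswith("instructions"):
                return int(fields[0]), wall
        raise RuntimeError("no instruction count in perf output")

    BASELINE = 1_250_000_000  # placeholder: stored instruction-count baseline
    TOLERANCE = 0.001         # 0.1% drift allowance for the deterministic count

    count, wall = run_benchmark(["./my_benchmark"])  # placeholder binary
    print(f"instructions={count} wall={wall:.3f}s "
          f"regressed={count > BASELINE * (1 + TOLERANCE)}")

Instruction counts aren't perfectly deterministic either (ASLR, allocator behaviour, etc.), but their variance is orders of magnitude smaller than wall time's.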