I'd love to learn about more statistical techniques for doing such analysis. For example, one thing I looked into was correlating tenants and identifying a likely culprit, and it's often not just a matter of absolute request volume. If multiple tenants' latencies increase at once, it's usually because one of them started doing something. But it's hard to isolate what that is when there are many different types of workloads with unpredictable performance impact.
krona•6mo ago
What makes the situation worse, which you didn't mention, is that developers like to write a suite of benchmarks, each of which can produce a false-positive (FP) regression. So even if the FP rate of an individual benchmark is <1%, the chance that at least one benchmark in the suite reports a spurious regression can easily reach 10% if your suite is large enough.
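To put rough numbers on that, here's a minimal sketch. It assumes the benchmarks are independent and all share the same per-benchmark FP rate, which real suites usually don't satisfy exactly, so treat it as a back-of-the-envelope estimate. The function names are mine, just for illustration.

```python
import math


def suite_false_positive_rate(per_benchmark_fp: float, n_benchmarks: int) -> float:
    """Probability that at least one of n independent benchmarks
    reports a spurious regression: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - per_benchmark_fp) ** n_benchmarks


def benchmarks_to_reach(per_benchmark_fp: float, target_suite_fp: float) -> int:
    """Smallest suite size at which the suite-level FP rate reaches the target,
    under the same independence assumption."""
    return math.ceil(math.log(1.0 - target_suite_fp) / math.log(1.0 - per_benchmark_fp))


def bonferroni_per_benchmark_rate(target_suite_fp: float, n_benchmarks: int) -> float:
    """Bonferroni correction: a (conservative) per-benchmark FP budget
    that keeps the suite-level FP rate at or below the target."""
    return target_suite_fp / n_benchmarks


if __name__ == "__main__":
    p = 0.01  # assumed per-benchmark FP rate
    for n in (5, 10, 20, 50):
        print(f"{n:3d} benchmarks -> suite FP rate {suite_false_positive_rate(p, n):.1%}")
    print("suite size needed to hit 10%:", benchmarks_to_reach(p, 0.10))
    print("Bonferroni budget for 50 benchmarks at 10%:", bonferroni_per_benchmark_rate(0.10, 50))
```

Under those assumptions it only takes about a dozen benchmarks at 1% each to cross 10% for the suite. The Bonferroni helper shows the usual trade-off going the other way: you can hold the suite-level rate down by tightening each benchmark's threshold, at the cost of missing smaller real regressions.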
art049•6mo ago