Recently I tried to write down a more systematic way to reason about latency distributions in production: how different distribution shapes behave, why aggregation and sampling often lie to us, and why segmentation (by endpoint, tenant, region, workload) usually matters more than adding more percentiles.
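To make the aggregation point concrete, here's a toy sketch (numpy only; the endpoint names and distribution parameters are invented for illustration, not taken from the write-up). A slow endpoint that accounts for roughly 0.5% of traffic barely moves the global P99, even though its own percentiles are an order of magnitude worse:

```python
# Toy demonstration: a global P99 that hides a badly degraded segment.
# All names and numbers here are made up; assumes only numpy.
import numpy as np

rng = np.random.default_rng(42)

# Lognormal latencies in ms: right-skewed, like most real service latency.
# "browse" is fast and high-traffic; "checkout" is slow but rare (~0.5%).
browse = rng.lognormal(mean=3.0, sigma=0.3, size=100_000)   # ~20ms median
checkout = rng.lognormal(mean=5.5, sigma=0.4, size=500)     # ~245ms median

combined = np.concatenate([browse, checkout])

for name, samples in [("browse", browse), ("checkout", checkout), ("combined", combined)]:
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    print(f"{name:9s} n={len(samples):6d}  P50={p50:6.1f}ms  P95={p95:6.1f}ms  P99={p99:6.1f}ms")

# The combined P99 lands close to browse's P99 (~40ms vs ~43ms) even
# though every checkout request is ~10x slower. Segmenting by endpoint
# surfaces what the global percentile averages away.
```

The exact numbers don't matter; the point is that a global percentile is dominated by whichever segment carries the most traffic, so "add more percentiles" won't find this, but "break down by endpoint" will.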
I’m curious how others here approach this in practice:
- Do you have a mental model for interpreting P99 during incidents?
- What charts or breakdowns have actually helped you debug latency issues?
- Have you been burned by “good-looking” percentiles that hid real problems?
I wrote up my notes here for reference: https://optyxstack.com/performance/latency-distributions-in-practice-reading-p50-p95-p99-without-fooling-yourself
Would love to hear how people handle this in real systems.