In practice, when something breaks, it seems like the workflow is usually:
- an alert fires (Datadog/Sentry/CloudWatch/etc.)
- or a customer complains
- engineers then start checking logs, traces, and dashboards across multiple systems
- and eventually they manually reconstruct what happened across services
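
To make the "manually reconstruct" step concrete, here is roughly what I mean, as a minimal sketch. It assumes every service writes JSON log lines with a shared `request_id` field and an ISO-8601 `ts` field. The service names, field names, and log contents are all made up for illustration, not taken from any real system.

```python
import json
from collections import defaultdict

# Hypothetical log lines per service. Field names (ts, request_id, msg)
# are assumptions for this sketch, not any vendor's schema.
RAW_LOGS = {
    "api-gateway": [
        '{"ts": "2024-01-01T12:00:00.100Z", "request_id": "req-1", "msg": "POST /checkout received"}',
        '{"ts": "2024-01-01T12:00:00.900Z", "request_id": "req-1", "msg": "upstream returned 502"}',
    ],
    "payments": [
        '{"ts": "2024-01-01T12:00:00.300Z", "request_id": "req-1", "msg": "charge started"}',
        '{"ts": "2024-01-01T12:00:00.800Z", "request_id": "req-1", "msg": "provider timeout"}',
    ],
    "webhook-worker": [
        # No request_id here: this is the "missing context" case.
        '{"ts": "2024-01-01T12:00:05.000Z", "msg": "retrying delivery"}',
    ],
}


def stitch(raw_logs):
    """Group events from all services by request_id, ordered by timestamp."""
    timelines = defaultdict(list)
    for service, lines in raw_logs.items():
        for line in lines:
            event = json.loads(line)
            rid = event.get("request_id")
            if rid is None:
                continue  # cannot be correlated without an ID
            timelines[rid].append((event["ts"], service, event["msg"]))
    # Timestamps share one fixed-width ISO-8601 format, so sorting the
    # tuples lexicographically also sorts them chronologically.
    return {rid: sorted(events) for rid, events in timelines.items()}


if __name__ == "__main__":
    for rid, events in stitch(RAW_LOGS).items():
        print(rid)
        for ts, service, msg in events:
            print(f"  {ts}  {service:<12}  {msg}")
```

The async worker's event is silently dropped because it never received the ID, which is exactly the kind of gap I'm asking about.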
What I’m curious about:
- How do you actually trace a single failed request or transaction across multiple services today?
- What tools do you rely on most in practice (not in theory)?
- Where does it usually break down: logs, tracing, instrumentation, or just missing context?
- How long does it typically take to go from “something is wrong” → “we know exactly why it broke”?
- What part of this is still mostly manual stitching together of information?
I’m trying to understand the real pain points in practice, especially in systems with lots of external integrations and async flows.
verdverm•19m ago