Also, SigNoz supports rendering a practically unlimited number of spans in the trace detail UI and lets you filter them as well, which has been really useful for analyzing batch processes: https://signoz.io/blog/traces-without-limits/
You can further run aggregation on spans to monitor failures and latency.
PS: I am a SigNoz maintainer
---
While digging into Honeycomb's open source story, I did find these two awesome toys, one relevant to the otel discussion and one just neato
https://github.com/honeycombio/refinery (Apache 2) -- Refinery is a tail-based sampling proxy that operates at the level of an entire trace: it examines each whole trace and applies a sampling decision that determines whether that trace is kept in the sampled data forwarded to Honeycomb or dropped.
https://github.com/honeycombio/gritql (MIT) -- GritQL is a declarative query language for searching and modifying source code
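To make the "tail-based" part concrete: a sampler like this has to buffer every span of a trace until the trace is complete, and only then judge the whole thing. A toy Python sketch of the idea (not Refinery's actual logic; the thresholds, keep fraction, and `forward` stub are made up):

```python
# Toy tail-sampling sketch: buffer spans per trace, decide on the whole trace.
from collections import defaultdict

traces = defaultdict(list)  # trace_id -> list of span dicts (hypothetical shape)

def forward(spans):
    print(f"keeping trace with {len(spans)} spans")  # stand-in for a real exporter

def on_span(span):
    traces[span["trace_id"]].append(span)

def on_trace_complete(trace_id, keep_fraction=0.1):
    spans = traces.pop(trace_id)
    has_error = any(s.get("status") == "ERROR" for s in spans)
    slow = max(s.get("duration_ms", 0) for s in spans) > 1000
    # Keep every errored or slow trace; sample the boring ones down.
    if has_error or slow or (hash(trace_id) % 100) < keep_fraction * 100:
        forward(spans)
    # otherwise the entire trace is dropped
```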
Is that beginning "logged" at a separate point in time from when the span end is logged?
> AIUI, there aren't really start or end messages,
Can you explain this sentence a bit more? How does it have a duration without a start and end?
The thing is that at scale you'd never be able to guarantee that the start of a span showed up at a collector in chronological order anyway, especially since the queuing intervals are distinct per collection sidecar. But what you could do with two events is discover spans with no orderly ending to them. You could easily truncate traces that go over the span limit instead of just dropping them on the floor (fuck you for this, OTEL, this is the biggest bullshit in the entire spec). And you could reduce the number of traceids in your parsing buffer that have no metadata associated with them, both in aggregate and in the number of messages stuck in a limbo state per thousand events processed.
As such, it doesn't really have separate beginning and end messages; the span is reported as a single record that has fields for its timestamps and duration.
I'd check out the OTEL docs, since I think seeing the examples as JSON helps clarify things. It looks like they have events attached to spans, which are optional. https://opentelemetry.io/docs/concepts/signals/traces/
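To make that concrete, here's a minimal sketch with the OTel Python SDK (the span name and attribute are placeholders): the SDK hands the span to the exporter only when the span ends, and that single record carries the start timestamp, end timestamp, attributes, and any attached events together.

```python
# Minimal example: a span is exported as one record at span end,
# not as separate "start" and "end" messages.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example")

with tracer.start_as_current_span("batch-job") as span:  # placeholder name
    span.set_attribute("job.items", 42)                  # placeholder attribute
    span.add_event("checkpoint reached")                 # optional event on the span

# Only here, after the span has ended, does ConsoleSpanExporter print it as JSON,
# with start_time, end_time, attributes, and events all in one object.
```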
The other turns out to be our Ops team's problem more than OTEL's. Well, a little of both. If a trace goes over a limit then OTEL just silently drops the entire thing, and the default size on AWS is useful for toy problems, not for retrofitting onto live systems. It's the silent-failure defaults of OTEL that are giant footguns. Give me a fucking error log on data destruction, you asshats.
I'll just use Prometheus next time, which is apparently what our Ops team recommended (except for the one individual I happened to talk to).
We had Grafana Agent running, which wraps the reference-implementation OTEL collector written in Go, and it was pretty easy to see from the logs when data was being dropped.
I think some of the limitation is also in the storage backend. We were using Grafana Cloud Tempo, which imposes its own limits. I'd think using a backend that doesn't enforce recency would help.
With the OTEL collector, I'd think you could use some of the existing processors/connectors, or write your own, to handle individual spans that get too big. Not sure about backends, but my current company uses Datadog, and their proprietary solution handles >30k spans per trace pretty easily.
I think the biggest issue is the low-cohesion, high-DIY nature of OTEL. You can build powerful solutions, but you really need to get low level and assemble everything yourself, tuning timeouts, limits, etc. for your use case.
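As one example of the DIY angle, and of the "give me an error log on data destruction" complaint upthread: the Python SDK at least lets you bolt on your own span processor to leave evidence behind when a trace blows past a span budget. A rough sketch (this class, the budget, and the logger name are hypothetical, not an existing OTel component):

```python
import logging
from collections import Counter

from opentelemetry.sdk.trace import SpanProcessor

log = logging.getLogger("trace-budget")

class TraceBudgetProcessor(SpanProcessor):
    """Counts finished spans per trace and logs loudly when a trace gets huge."""

    def __init__(self, max_spans_per_trace=10_000):  # made-up budget
        self.max_spans = max_spans_per_trace
        self.counts = Counter()  # a real version would evict finished trace ids

    def on_start(self, span, parent_context=None):
        pass

    def on_end(self, span):
        trace_id = span.context.trace_id
        self.counts[trace_id] += 1
        if self.counts[trace_id] == self.max_spans:
            # At least leave a trail before a backend truncates or drops the trace.
            log.error("trace %032x exceeded %d spans", trace_id, self.max_spans)

    def shutdown(self):
        pass

    def force_flush(self, timeout_millis=30000):
        return True

# Registered alongside the exporter, e.g.:
#   provider.add_span_processor(TraceBudgetProcessor())
```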
OTEL is the SpringBoot of telemetry and if you think those are fighting words then I picked the right ones.
And if you’ve ever tried to trace a call tree using correlationIDs and Splunk queries and still say OTEL is ‘just a fancy’ then you’re in dangerous territory, even if it’s just by way of explanation. Don’t feed the masochists. When masochists derail attempts at pain reduction they become sadists.
We ended up taking tracing out of these jobs, and only using it on requests that finish in short order, like UI web requests. For our longer jobs and fan-out work, we started passing a metadata object around that accumulated timing data related to that specific job; at egress, we would capture the timing metadata and flag abnormalities.
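Roughly, the pattern looks like this (the field names, budget, and step names below are just placeholders, not the actual implementation):

```python
import time

def new_job_meta(job_id):
    # carried along with the job instead of a trace context
    return {"job_id": job_id, "timings": []}

def record_step(meta, step_name, started_at):
    meta["timings"].append(
        {"step": step_name, "ms": (time.monotonic() - started_at) * 1000}
    )

def flag_abnormalities_at_egress(meta, budget_ms=60_000):  # made-up budget
    total = sum(t["ms"] for t in meta["timings"])
    if total > budget_ms:
        print(f"job {meta['job_id']} over budget: {total:.0f}ms", meta["timings"])

# usage inside a long-running worker
meta = new_job_meta("job-123")
t0 = time.monotonic()
# ... do the fan-out work ...
record_step(meta, "fanout", t0)
flag_abnormalities_at_egress(meta)
```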
We essentially propagate the context manually between APIs and lambdas, through HTTP headers, SNS/SQS and even storage. We also ended up pulling Tempo’s Parquet files (we’re self-hosting the grafana stack) into Redash to be able to do real analysis. We got a lot of great insights and were able to do a lot of tuning that would have been impossible otherwise, but it was quite an investment. Would love to know if there is anything out there that would have made this less painful.
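For the manual propagation, the OTel propagators API does most of the mechanical work of serializing and restoring the trace context; a hedged sketch of carrying it through a message's attributes (the message shape and the queue send call are placeholders):

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("pipeline")  # placeholder name

# Producer side: write the current trace context (e.g. the W3C 'traceparent'
# header) into a dict that travels with the message.
def publish(message: dict):
    carrier = {}
    inject(carrier)
    message["trace_context"] = carrier
    # sqs_or_sns_send(message)  # placeholder for the actual transport call

# Consumer side: restore the context so new spans join the same trace.
def handle(message: dict):
    ctx = extract(message.get("trace_context", {}))
    with tracer.start_as_current_span("consume", context=ctx):
        ...  # actual work
```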
I was at first implementing otel throughout my API, but ran into some minor headaches and a lot of boilerplate. I shopped around a bit and saw that Sentry has a lot of nice integrations everywhere, and seems to have all the same features (metrics, traces, error reporting). I'm considering just using Sentry for both backend and frontend and other pieces as well.
Curious if anyone has thoughts on this. Assuming Sentry can fulfill our requirements, the only thing that really concerns me is vendor lock-in. But I'm wondering what other people think.
OTeL also has numerous integrations https://opentelemetry.io/ecosystem/registry/. In contrast, Sentry lacks traditional metrics and other capabilities that OTeL offers. IIRC, Sentry experimented with "DDM" (Delightful Developer Metrics), but this feature was deprecated and removed while still in alpha/beta.
Sentry excels at error tracking and provides excellent browser integration. This might be sufficient for your needs, but if you're looking for the comprehensive observability features that OpenTelemetry provides, you'd likely need a full observability platform.
1: https://github.com/getsentry/self-hosted/blob/25.5.1/docker-...
Otel can take a little while to understand because, like many standards, it's designed by committee and the code/documentation reflects that. LLMs can help, but the last time I asked them about otel they constantly gave me code that was out of date with the latest otel libraries.
Prometheus is bog easy to run, Grafana understands it, and anything involving alerting/monitoring from logs is a bad idea for future you, I PROMISE YOU, PLEASE DON'T!
Why is issuing alerts for log events a bad idea?
If you need an artifact from your system, it should be tested. We test our logs and many types of metrics. We've had too many incidents from logs or metrics changing and no longer causing alerts. I never got to build out my alert test bed that exercises all known alerts in prod, verifying they continue to work.
Biggest one: the sample rate is much higher (every log line), and this can cause problems if a service goes haywire and starts spewing logs everywhere. Logging pipelines tend to be very rigid as well, for various reasons. Metrics are easier to handle, since you can step back the sample rate, drop certain metrics, or spin up additional Prometheus instances.
The logging format becomes very rigid, and if the company goes multi-language this can be problematic, as different languages can behave differently. Is this exception something we care about or not? So we throw more code in an attempt to get log-based alerting into a state that doesn't drive everyone crazy, when if we were just doing "rate(critical_errors[5m]) > 10" in Prometheus, we would be all set!
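The counter-plus-rule version of that really is small. A sketch with the Python prometheus_client (the metric name, port, and handler are arbitrary; the client exposes the counter as critical_errors_total, so the rule would reference that name):

```python
from prometheus_client import Counter, start_http_server

CRITICAL_ERRORS = Counter("critical_errors",
                          "Errors we actually want to be paged on")

def do_work():
    ...  # placeholder for the real handler

def handle_request():
    try:
        do_work()
    except Exception:
        CRITICAL_ERRORS.inc()  # the alert rule only looks at this counter
        raise

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```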
From the other comments as well, it seems it's still worth trying to integrate otel. Appreciate everyone's insights.
Maybe this has changed?
I'll have to try this!
edit: actually Jaeger can just read those files directly, so no need to run a collector with the receiver. This is great!
reactordev•7mo ago
That said, if you own your infrastructure, I'd build out a SigNoz cluster in a heartbeat. Otel is awesome, but once you set down a path for your org, it's going to be extremely painful to switch. Choose otel if you're hybrid cloud or you have on-premises stuff. If you're on AWS, CloudWatch is a better option simply because they have the data. Dead simple tracing.
6r17•7mo ago
I wonder if there are any other adapters for trace ingest instead of OTEL?
mdaniel•7mo ago
https://github.com/uptrace/uptrace/blob/v1.7.6/LICENSE
https://github.com/openobserve/openobserve/blob/v0.14.7/LICE...
FunnyLookinHat•7mo ago
We've frequently seen a slowdown or error at the top of our stack, and the teams are able to immediately pinpoint the problem as a downstream service. Not only that, they can see the specific issue in the downstream service almost immediately!
Once you get to that level of detail, having your infrastructure metrics pulled into your Otel provider does start to make some sense. If you observe a slowdown in a service, being able to see that the DB CPU is pegged at the same time is meaningful, etc.
[Edit - Typo!]
makeavish•7mo ago
Also SigNoz has native correlation between different signals out of the box.
PS: I am a SigNoz maintainer
reactordev•7mo ago
Otel provides a means to sugar any metric with labels and attributes, which is great (until you have high cardinality), but there are still things at the infrastructure level that only CloudWatch knows about (on AWS). If you're running K8s on your own hardware, Otel would be my first choice.
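For anyone who hasn't used it, the attribute sugaring looks roughly like this in the OTel Python SDK (assuming a MeterProvider is configured elsewhere; the meter and attribute names are placeholders):

```python
from opentelemetry import metrics

meter = metrics.get_meter("checkout")  # placeholder name
requests = meter.create_counter("requests", description="handled requests")

def handle(route: str, status_code: int, user_id: str):
    # route and status code are bounded sets, so they make reasonable attributes;
    # something like user_id would explode cardinality into one series per user.
    requests.add(1, {"route": route, "http.status_code": status_code})
```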
elza_1111•7mo ago
Check this out, https://signoz.io/blog/6-silent-traps-inside-cloudWatch-that...
mdaniel•7mo ago
The demo for https://github.com/draios/sysdig was also just amazing, but I don't have any idea what the storage requirements would be for leaving it running