Scaling our observability platform by embracing wide events and replacing OTel

https://clickhouse.com/blog/scaling-observability-beyond-100pb-wide-events-replacing-otel

200•valyala•7mo ago

Comments

ofrzeta•7mo ago

Whenever I read things like this I think: You are doing it wrong. I guess it is an amazing engineering feat for Clickhouse but I think we (as in IT or all people) should really reduce the amount of data we create. It is wasteful.

XorNot•7mo ago

The problem with this is generally that you have logs from years ago, but no way to get a live stream of logs which are happening now.

(one of my immense frustrations with kubernetes - none of the commands for viewing logs seem to accept logical aggregates like "show me everything from this deployment").

knutzui•7mo ago

Maybe not via kubectl directly, but it is rather trivial to build this, by simply combining all log streams from pods of a deployment (or whatever else).

k9s (k9scli.io) supports this directly.

madduci•7mo ago

And what is the sense of keeping years of logs? I could probably understand very sensitive industries, but in general, I see a pure waste of resources. At most you need 60-90 days of logs.

brazzy•7mo ago

One nice side effect of the GDPR is that you're not allowed to keep logs indefinitely if there is any chance at all that they contain personal information. The easiest way to comply is to throw away logs after a month (accepted as the maximum justifiable for general error analysis) and be more deliberate about what you keep longer.

Sayrus•7mo ago

Access logs and payment information for compliance, troubleshooting and evaluating trends of something you didn't know existed until months or years later, finding out if an endpoint got exploited in the past for a vulnerability that you only now discovered, tracking events that may span across months. Logs are a very useful tool in many non-dev or longer term uses.

fc417fc802•7mo ago

My home computer has well over 20 TB of storage. I have several LLMs, easily half a TB worth. The combined logs generated by every single program on my system might total 100 GB per year but I doubt it. And that's before compression.

Would you delete a text file that's a few KB from a modern device in order to save space? It just doesn't make any sense.

sureglymop•7mo ago

It makes sense to keep a high fidelity history of what happened and why. However, I think the issue is more that this data is not refined correctly.

Even when it comes to logging in the first place, I have rarely seen developers do it well, instead logging things that make no sense just because it was convenient during development.

But that touches on something else. If your logs are important data, maybe logging is the wrong way to go about it. Instead think about how to clean, refine and persist the data you need like your other application data.

I see log and trace collecting in this way almost as a legacy compatibility thing, analogous to how kubernetes and containerization allow you to wrap up any old legacy application process into a uniform format: just collecting all logs and traces is backwards compatible with every application. But in order to not be wasteful and only keep what is valuable, a significant effort would be required afterwards. Well, storage and memory happen to be cheap enough to never have to care about that.

AlecBG•7mo ago

This sounds pretty easy to hack together with 10s of lines of python
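A minimal sketch of what those few lines of Python might look like, assuming `kubectl` is on the PATH and the deployment's pods share a label selector. The helper names (`merge_streams`, `pod_names`, `follow_pod`) are hypothetical; only `kubectl get pods -l ... -o name` and `kubectl logs -f` are real commands here.

```python
import queue
import subprocess
import threading


def merge_streams(streams):
    """Yield (name, line) pairs from several line iterators as lines arrive."""
    q = queue.Queue()
    done = object()  # sentinel: one per finished stream

    def pump(name, lines):
        for line in lines:
            q.put((name, line.rstrip("\n")))
        q.put(done)

    for name, lines in streams.items():
        threading.Thread(target=pump, args=(name, lines), daemon=True).start()

    remaining = len(streams)
    while remaining:
        item = q.get()
        if item is done:
            remaining -= 1
        else:
            yield item


def pod_names(selector, namespace="default"):
    """List pods matching a label selector, e.g. 'app=mydep' (assumed label)."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", selector, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def follow_pod(pod, namespace="default"):
    """Return a line iterator over `kubectl logs -f` for one pod."""
    proc = subprocess.Popen(
        ["kubectl", "logs", "-f", "-n", namespace, pod],
        stdout=subprocess.PIPE, text=True,
    )
    return proc.stdout


def tail_selector(selector, namespace="default"):
    """Print every line from every matching pod, prefixed with the pod name."""
    streams = {p: follow_pod(p, namespace) for p in pod_names(selector, namespace)}
    for pod, line in merge_streams(streams):
        print(f"[{pod}] {line}")
```

Calling `tail_selector("app=mydep")` would follow all current pods for that label; pods created afterwards are not picked up, which is one of the things tools like stern handle for you.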

Sayrus•7mo ago

Stern[1] does that. You can tail deployments, filter by labels and more.

[1] https://github.com/stern/stern

ofrzeta•7mo ago

What about "kubectl logs deploy/mydep --all-containers=true" but I guess you want more than that? Maybe https://www.kubetail.com?

shikhar•7mo ago

We have a customer using s2.dev for this capability – granular tail-able streams with granular access control (e.g. let an end user of a job tail it with a read-only access token). We'll be shipping an OTel endpoint soon to make it even easier.

CSDude•7mo ago

Blanket statements like this miss the point. Not all data is waste. Especially high-cardinality, non-sampled traces. On a 4-core ClickHouse node, we handled millions of spans per minute. Even short retention windows provided critical visibility for debugging and analysis.

Sure, we should cut waste, but compression exists for a reason. Dropping valuable observability data to save space is usually shortsighted.

And storage isn't the bottleneck it used to be. Tiered storage with S3 or similar backends is cheap and lets you keep full-fidelity data without breaking the budget.

ofrzeta•7mo ago

> Dropping valuable observability data to save space is usually shortsighted

That's a bit of a blanket statement, too :) I've seen many systems where a lot of stuff is logged without much thought. "Connection to database successful" - does this need to be logged on every connection request? Log level info, warning, debug? Codebases are full of this.

throwaway0665•7mo ago

There's always another log that could have been key to getting to the bottom of an incident. It's impossible to know completely what will be useful in advance.

citrin_ru•7mo ago

Probably not very useful for prod (non debug) logging, but it’s useful when such events are tracked in metrics (success/failure, connect/response times). And modern databases (including ClickHouse) can compress metrics efficiently so not much space will be spent on a few metrics.

nijave•7mo ago

Yes, it allows you to bisect a program to see the block of code between log statements where the program malfunctioned. More log statements slice the code into smaller blocks, meaning fewer places to look.

vidro3•7mo ago

in our app each user polls for a resource availability every 5 mins. do we really need "connection successful" 500x per minute? i dont see this as breaking up the logs into smaller sections. i see it as noise. i'd much rather have a ton of "connection failed" whenever that occurs than the "success" constantly

jiggawatts•7mo ago

I agree with both you and the person you're replying to, but...

My centrist take is that data can be represented wastefully, which is often ignored.

Most "wide" log formats are implemented... naively. Literally just JSON REST APIs or the equivalent.

Years ago I did some experiments where I captured every single metric Windows Server emits every second.

That's about 15K metrics, down to dozens of metrics per process, per disk, per everything!

There is a poorly documented API for grabbing everything ('*') as a binary blob of a bunch of 64-bit counters. My trick was that I then kept the previous such blob and simply took the binary difference. This set most values to zero, so then a trivial run length encoding (RLE) reduced a few hundred KB to a few hundred bytes. Collect an hour of that, compress, and you can store per-second metrics collected over a month for thousands of servers in a few terabytes. Then you can apply a simple "transpose" transformation to turn this into a bunch of columns and get 1000:1 compression ratios. The data just... crunches down into gigabytes that can be queried and graphed in real time.

I've experimented with Open Telemetry, and its flagrantly wasteful data representations make me depressed.

Why must everything be JSON!?

nijave•7mo ago

I think Prometheus works similarly to this with some other tricks like compressing metric names.

OTEL can do gRPC and a storage backend can encode that however it wants. However, I do agree it doesn't seem like efficiency was at the forefront when designing OTEL

valyala•7mo ago

These tricks are essential for every database optimized for metrics / logs / traces. For example, you can read on how VictoriaMetrics can compress production metrics to less than a byte per sample (every sample includes metric name, key=value labels, numeric metric value and metric timestamp with millisecond precision). https://faun.pub/victoriametrics-achieving-better-compressio...

pdimitar•7mo ago

Very curious to read your code doing it. Thought of a very similar approach but never had the time. Are you keeping it somewhere?

jiggawatts•7mo ago

I only ever got it to a proof of concept. The back end worked as advertised, the issue was that there are too many bugs in WMI so collecting that many performance counters had weird side effects.

Google was doing something comparable internally and this spawned some fun blog titles like “I have 64 cores but I can’t even move my mouse cursor.”

pdimitar•7mo ago

Ah, I don't mean the Windows-specific stuff. I mean the binary diffing and RLE.

While not difficult, I am just curious how others approached it.
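One way to sketch the binary-diff-plus-RLE idea from this sub-thread, in Python. This is not the commenter's code: the snapshot is simulated with `struct` rather than read from any Windows API, "binary difference" is interpreted as a byte-wise XOR (an assumption), and the function names are hypothetical.

```python
import struct


def xor_delta(prev: bytes, curr: bytes) -> bytes:
    """Byte-wise XOR of two equal-length snapshots; unchanged bytes become 0."""
    if len(prev) != len(curr):
        raise ValueError("snapshots must have the same length")
    return bytes(a ^ b for a, b in zip(prev, curr))


def rle_zeros_encode(data: bytes) -> bytes:
    """Encode as repeated records: (zero_run: u32, literal_len: u32, literal)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        start = i
        while i < n and data[i] == 0:
            i += 1
        zero_run = i - start
        start = i
        while i < n and data[i] != 0:
            i += 1
        literal = data[start:i]
        out += struct.pack("<II", zero_run, len(literal)) + literal
    return bytes(out)


def rle_zeros_decode(blob: bytes) -> bytes:
    """Inverse of rle_zeros_encode."""
    out = bytearray()
    pos = 0
    while pos < len(blob):
        zero_run, lit_len = struct.unpack_from("<II", blob, pos)
        pos += 8
        out += bytes(zero_run)            # run of zero bytes
        out += blob[pos:pos + lit_len]    # non-zero bytes copied verbatim
        pos += lit_len
    return bytes(out)


# Simulated snapshot: 15,000 64-bit counters, only three change between samples.
prev_counters = list(range(15_000))
curr_counters = list(prev_counters)
for idx in (3, 500, 9_999):
    curr_counters[idx] += 1

prev_blob = struct.pack("<15000Q", *prev_counters)
curr_blob = struct.pack("<15000Q", *curr_counters)

encoded = rle_zeros_encode(xor_delta(prev_blob, curr_blob))
restored = xor_delta(prev_blob, rle_zeros_decode(encoded))
print(len(curr_blob), "bytes raw,", len(encoded), "bytes encoded")
```

The receiver only needs the previous snapshot plus the encoded delta to rebuild the current one, so a periodic full snapshot followed by deltas is enough to reconstruct any point in time.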

valyala•7mo ago

This is called "progress". Humans always generate the amounts of data which can be stored and processed by the tools they have. The more data the tool can process under the given budget limit, the more data will be generated and stored.

tjungblut•7mo ago

tldr, they now do a zero (?) copy of raw bytes instead of marshaling and unmarshaling json.

the_real_cher•7mo ago

What is the trick that this and dynamo use?

Are they just basically large hash tables?

valyala•7mo ago

There are two tricks used by ClickHouse and similar databases:

- Smart placement of the data on disk, which allows skipping the majority of data and reading only the needed chunks (and these chunks are stored in a compressed form in order to reduce disk read IO usage even more). This includes column-oriented storage and LSM-like trees.

- Brute-force optimizations all over the place, which allow processing the found data at the maximum speed by employing all the compute resources (CPU, RAM, disk IO, network bandwidth) in the most efficient way. For example, ClickHouse can process more than a billion rows per second per CPU core, and the scan speed scales linearly with the number of available CPU cores.
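The first trick (skipping most of the data) can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: rows are sorted by timestamp, split into fixed-size chunks, and each chunk keeps its min/max so a range query can skip chunks without reading them. The `Granule` class and function names are hypothetical, and the real on-disk format of ClickHouse is considerably more involved.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Granule:
    min_ts: int
    max_ts: int
    timestamps: List[int]   # one column
    messages: List[str]     # another column, stored separately


def build_granules(rows: List[Tuple[int, str]], granule_size: int) -> List[Granule]:
    """Sort rows by timestamp and cut them into chunks with min/max metadata."""
    rows = sorted(rows)
    granules = []
    for i in range(0, len(rows), granule_size):
        chunk = rows[i:i + granule_size]
        ts = [r[0] for r in chunk]
        msgs = [r[1] for r in chunk]
        granules.append(Granule(ts[0], ts[-1], ts, msgs))
    return granules


def query(granules: List[Granule], lo: int, hi: int):
    """Return matching messages and the number of granules actually scanned."""
    scanned = 0
    hits = []
    for g in granules:
        if g.max_ts < lo or g.min_ts > hi:
            continue  # skipped using metadata only, no data read
        scanned += 1
        hits.extend(m for t, m in zip(g.timestamps, g.messages) if lo <= t <= hi)
    return hits, scanned


rows = [(t, f"event {t}") for t in range(10_000)]
granules = build_granules(rows, granule_size=1_000)
hits, scanned = query(granules, 2_500, 2_600)
print(f"{len(hits)} hits, scanned {scanned} of {len(granules)} granules")
```

A query that only needs timestamps would also never touch the `messages` column, which is the column-oriented half of the same idea.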

atemerev•7mo ago

When I get back from Clickhouse to Postgres, I am always shocked. Like, what is it doing for some minutes importing this 20G dump? Shouldn't it take seconds?

joshstrange•7mo ago

Every time I use Clickhouse I want to blow my brains out, especially knowing that Postgres exists. I’m not saying Clickhouse doesn’t have its place or that Postgres can do everything that Clickhouse can.

What I am saying is that I really dislike working in Clickhouse with all of the weird foot guns. Unless you are using it in a very specific, and in my opinion, limited way, it feels worse than Postgres in every way.

atemerev•7mo ago

I mostly need analytics, all data is immutable and append-only.

joshstrange•7mo ago

And that’s exactly the limited-ness I’m talking about. If that works for you, Clickhouse is amazing. For things like logs I can 100% see the value.

Other data that is ETL’d and might need to update? That sucks.

edmundsauto•7mo ago

There are design patterns / architectures that data engineers often employ to make this less "sucky". Data modeling is magical! (Specifically talking about things like datelist and cumulative tables)

atemerev•7mo ago

If you can afford rare, batched updates, it sucks much less.

Anyway, yes, if your data is highly mutable, or you cannot do batch writes, then yes, Clickhouse is the wrong choice. Otherwise... it is _really_ hard to ignore 50x (or more) speedup.

Logs, events, metrics, rarely updated things like phone numbers or geocoding, archives, embeddings... Whoooop — it slurps entire Reddit in 48 seconds. Straight from S3. Magic.

If you still want really fast analytics, but have more complex scenarios and/or data loading practices, there's also Kinetica... if you can afford the price. For tiny datasets (a few terabytes), DuckDB might be a great choice too. But Postgres is usually a wrong thing to make work.

slt2021•7mo ago

you are doing data warehousing wrong, need to learn basics of data warehousing best practices.

Data Warehouse consists of Slowly Changing Dimensions and Facts. none of these require updates

mdaniel•7mo ago

Anything in my life that uses Zookeeper or its dumbass etcd friend means I'm going to have a real bad time. I am thankful they're at least shipping their own ZK-ish but it seems to have fallen into the same trap as etcd, where membership has to be managed like the precious little pets that they are https://clickhouse.com/docs/guides/sre/keeper/clickhouse-kee...

jiggawatts•7mo ago

Zookeeper is the only clustering product I’ve ever used that actively refused to start a cluster after an all-nodes stop/start.

It blows my mind that a high availability system would purposefully prevent availability as a “feature”.

sciurus•7mo ago

Although this is oversimplifying things [0], in the face of partitions zookeeper emphasizes consistency over availability.

[0] https://martin.kleppmann.com/2015/05/11/please-stop-calling-...

jiggawatts•7mo ago

The problem with that is all nodes stop-start is not a partition!

A partition is when some nodes can’t reach other nodes.

Zookeeper instead has an issue where it does try to restart but the timeout (why?!) is too short, something like 30 seconds. If the majority of your nodes don’t all start within a certain time window the whole cluster stays down until someone manually intervenes.

I discovered this fun feature when keeping non-prod systems off to save money in the cloud.

It also has an impact when making certain big bang changes in production.

valyala•7mo ago

Just don't use ClickHouse for OLTP tasks. ClickHouse is an analytical database, which isn't optimized for transactional workloads. Keep calm and use PostgreSQL for OLTP, and ClickHouse for OLAP.

mrbluecoat•7mo ago

Noteworthy point:

> If a service is crash-looping or down, SysEx is unable to scrape data because the necessary system tables are unavailable. OpenTelemetry, by contrast, operates in a passive fashion. It captures logs emitted to stdout and stderr, even when the service is in a failed state. This allows us to collect logs during incidents and perform root cause analysis even if the service never became fully healthy.

fuzzy2•7mo ago

Everything OTel I ever did was fully active. So I wouldn't say this is very noteworthy. Instead it is wrong/incomplete information.

behemot•7mo ago

we use k8s + otel filelog receiver. in this case you don't have to connect to the clickhouse instance to collect what it's writing to stdout/stderr, just tail /var/log/pods/*/*/*.log.

jurgenkesker•7mo ago

So yeah, this is only really relevant for collecting logs from clickhouse. Not for logs from anything else. Good for them, and I really love Clickhouse, but not really relevant.

dangoodmanUT•7mo ago

You must be fun at parties

iw7tdb2kqo9•7mo ago

I haven't worked in ClickHouse level scale.

Can you search log data in this volume? ElasticSearch has query capabilities for small scale log data I think.

Why would I use ClickHouse instead of storing log data as json file for historical log data?

sethammons•7mo ago

Scale and costs. We are faced with logging scale at my work. A naive "push json into splunk" will cost us over $6M/year, but I can only get maybe 5-10% of that approved.

In the article, they talk about needing 8k cpu to process their json logs, but only 90 cpu afterward.

munchbunny•7mo ago

> Can you search log data in this volume?

(Context: I work at this scale)

Yes. However, as you can imagine, the processing costs can be potentially enormous. If your indexing/ordering/clustering strategy isn't set up well, a single query can easily end up costing you on the order of $1-$10 to do something as simple as "look for records containing this string".

My experiences line up with theirs: at the scale where you are moving petabytes of data, the best optimizations are, unsurprisingly, "touch as little data as few times as possible" and "move as little data as possible". Every time you have to serialize/de-serialize, and every time you have to perform disk/network I/O, you introduce a lot of performance cost and therefore overall cost to your wallet.

Naturally, this can put OTel directly at odds with efficiency because the OTel collector is an extra I/O and serialization hop. But then again, if you operate at the petabyte scale, the amount of money you save by throwing away a single hop can more than pay for an engineer whose only job is to write serializer/deserializer logic.

gnaman•7mo ago

How do engineers troubleshoot then? Our engineers would throw hands if they are asked not to parse through two months' worth of log volume for a single issue.

munchbunny•7mo ago

In practice, at the scale I work at, it's barely feasible to scan one week of logs, let alone two months, because you'll be waiting hours for the result. So you learn strategies to only need to scan a subset of the logs at a time.

h1fra•7mo ago

A couple of years ago ClickHouse wasn't that good with full text search; to me that was the biggest drawback. Yes it's faster and can handle ES scale but depending on your use case it's way faster to query ES when you do FTS or grouping without a pre-built index.

valyala•7mo ago

How much RAM does Elasticsearch need for fast full-text search over 100 petabytes of logs? 100 petabytes is 100 million gigabytes, just in case.

valyala•7mo ago

> Why would I use ClickHouse instead of storing log data as json file for historical log data?

There are multiple reasons:

1. Databases optimized for logs (such as ClickHouse or VictoriaLogs) store logs in a compressed form, where the values of each log field are grouped and compressed individually (aka column-oriented storage). This results in smaller storage space compared to plain files with JSON logs, even if they are compressed.

2. Databases optimized for logs perform typical queries much faster than grep over JSON files. Performance gains may be 1000x and more because these databases skip reading unneeded data. See https://chronicles.mad-scientist.club/tales/grepping-logs-re...

3. How are you going to grep 100 petabytes of JSON files? Databases optimized for logs allow querying such amounts of logs because they can scale horizontally by adding more storage nodes and storage space.
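Point 1 can be tried out with a small Python experiment. This is a sketch on synthetic data (the field names and value distributions are made up), using `zlib` as a stand-in for whatever codec a real database uses. It compresses the same records once as JSON lines and once grouped per field, then prints both sizes; on repetitive log-like data the per-field layout typically comes out smaller, but the exact ratio depends on the data.

```python
import json
import random
import zlib

rng = random.Random(0)  # fixed seed so runs are repeatable

# Synthetic log records; the schema is an assumption for illustration only.
rows = [
    {
        "ts": 1_700_000_000 + i,
        "level": rng.choice(["INFO", "WARN", "ERROR"]),
        "service": rng.choice(["api", "auth", "billing", "search"]),
        "latency_ms": rng.randint(1, 500),
    }
    for i in range(5_000)
]


def pack_rows(rows):
    """Row-oriented: one JSON object per line, compressed as a single stream."""
    text = "\n".join(json.dumps(r) for r in rows)
    return zlib.compress(text.encode("utf-8"))


def pack_columns(rows):
    """Column-oriented: values of each field grouped and compressed separately."""
    fields = list(rows[0].keys())
    return {
        f: zlib.compress(json.dumps([r[f] for r in rows]).encode("utf-8"))
        for f in fields
    }


def unpack_columns(packed):
    """Rebuild the original records from the per-field blobs."""
    columns = {f: json.loads(zlib.decompress(b)) for f, b in packed.items()}
    n = len(next(iter(columns.values())))
    return [{f: columns[f][i] for f in columns} for i in range(n)]


row_size = len(pack_rows(rows))
packed = pack_columns(rows)
col_size = sum(len(b) for b in packed.values())
print(f"row-oriented: {row_size} bytes, column-oriented: {col_size} bytes")
```

The per-field layout also means a query that only needs `level` can decompress that one blob and leave the rest untouched.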

revskill•7mo ago

This industry is mostly filled with half-baked or in-progress standards, which leads to segmentation of the ecosystems. From graphql, to openapi, to mcp,... to everything, nothing is perfect and it's fine.

The problem is, the people who created the specs are just following a trial-and-error approach, which is insane.

Thaxll•7mo ago

I mean if you don't get the logs when the service is down the entire solution is useless.

b0a04gl•7mo ago

tbh that's not the flex. storing 100PB of logs just means we haven't figured out what's actually worth logging. metrics + structured events can usually tell 90% of the story. the rest? trace level chaos no one reads unless prod's on fire. what could've been done better: auto pruning logs that no alert ever looked at. or logs that never hit a search query in 3 months. call it attention weighted retention. until then this is just high end digital landfill with compression

imiric•7mo ago

Sure, but if the data is already there, it's a sifting and pruning problem, which can be done after ingestion, if needed.

It's better to have all data and not need it, than to need it and not have it. Assuming you have the resources to ingest it in the first place, which seems like the focus of the optimization work they did.

hnlmorg•7mo ago

I’m of the opposite opinion. It’s better to ingest everything and then filter out the stuff you don’t want at the observability platform.

The problem of filtering out debug logs is you don’t need them, until you do. And then trying to recreate an event you can’t even debug is often impossible. So it’s easier to then retrieve those debug logs if they’re already there but hidden.

jgalt212•7mo ago

> then filter out the stuff you don’t want

This is often easier said than done. And there's ginormous costs associated with logging everything. Money that can be better spent elsewhere.

Also, logging everything creates yet another security hole to worry about.

hnlmorg•7mo ago

Not really. Most observability platforms already have tools to support this kind of workflow in a more cost effective way.

> Also, logging everything creates yet another security hole to worry about.

I think the real problem isn’t logging, it’s the fact that your developers are logging sensitive information. If they’re doing that, then it’s a moot point if those logs are also being pushed to a third party observability platform or not because you’re already leaking sensitive information.

jgalt212•7mo ago

Fair enough, but if you don't push them to "log everything" there are fewer chances for error.

hnlmorg•7mo ago

I disagree.

If developers think “log everything” means “log PII” then that developer is a liability regardless.

Also, this is the sort of thing that should get picked up in non-prod environments before it becomes a problem.

If you get to the point where logging is a risk then you’ve had other failures in processes.

phillipcarter•7mo ago

> And there's ginormous costs associated with logging everything

If you use a tool that defaults the log spew to a cheap archive, sampling to the fast store, and a way to pull from the archive on-demand much of that is resolved. FWIW I think most orgs get big scared at seeing $$$ in their cloud bills, but don't properly account for time spent by engineers rummaging around for data they need but don't have.

nijave•7mo ago

>but don't properly account for time spent by engineers rummaging around for data they need but don't have

This is a tricky one that's come up recently. How do you quantify the value of a $$$ observability platform? Anecdotally I know robust tracing data can help me find problems in 5-15 minutes that would have taken hours or days with manual probing and scouring logs.

Even then you have the additional challenge of quantifying the impact of the original issue.

phillipcarter•7mo ago

At the end of the day it's just vibes. If the company is one that sees:

- Reliability as a cost center

- Vendor costs are to be limited

- CIO-driven rather than CTO-driven

Then it's going to be a given that they prioritize costs that are easy to see, and will do things like force a dev team to work for a month to shave ~2k/month off of a cloud bill. In my experience, these orgs will also sometimes do a 180 when they learn that their SLAs involve paying out to customers at a premium during incidents, which is always very funny to observe. Then you talk to some devs and they say things like "we literally told them this would happen years ago and it fell on deaf ears" or something.

behemot•7mo ago

> Anecdotally I know robust tracing data can help me find problems in 5-15 minutes that would have taken hours or days with manual probing and scouring logs.

exactly. high-cardinality, wide structured events are the way.

hinkley•7mo ago

Java had particularly bad performance for logging for a good while and I used to make applications noticeably faster by clearing out the logs nobody cared about anymore. Just have to be careful about side effects in the log lines.

gavinray•7mo ago

"Better to have it and not need it; than to need it, and not have it..."

jkogara•7mo ago

Or more succinctly, albeit less eloquently: "Better to be looking at it than looking for it."

9dev•7mo ago

Until you’re working with personal information of EU customers, where the opposite maxim applies: "Only store what you absolutely need"

Seriously, storing petabytes of logs is a guarantee for someone on your team writing sensitive data to logs, and/or violating regulations.

jodrellblank•7mo ago

“You can’t have everything. Where would you put it?” - Steven Wright.

“Better to have hoarding disorder than to need a fifty year old carrier bag full of rotting bus tickets and not have one” really should need more justification than a quote about how convenient it is to have what you need. The reason caches exist as a thing is so you can have what you probably need handy because you can’t have everything handy and have to choose. The amount of things you might possibly want or need one day - including unforeseen needs - is unbounded, and refusing to make a decision is not good engineering, it’s a cop-out.

Apart from the storage cost, the more you keep, the more time and money you spend indexing, cataloging, and searching it. How many companies are going to run an internal Google-2002 sized infrastructure just to search their old hoarded data?

gavinray•7mo ago

I'm not sure what poor engineering practices you have seen, but in my painfully-gotten experience, application of this principle usually amounts to having varying levels of a debug log flag that dump this info either to stdout via JSONL that's piped somewhere, or as attributes in OTEL spans.

This has never been a source of significant issues for me.

hnlmorg•7mo ago

This is a really easy problem to solve.

Step one: add log severity to your log messages (pretty much every log library supports this out of the box).

Step two: add a log archive (you should have this anyway so that logs can be retained past the initial retention period of your log querying tools. Eg you might have a compliance requirement to keep logs for two years but you obviously wouldn’t want anything that old stored in your expensive fast log search)

Step three: create a way to ingest your archived logs (again, something your business should have, otherwise what’s the bloody point in having an archive)

Step four: have a rule that pushes logs of high severity straight into your log ingestion pipeline, and logs of lower severity into your archive.

Step four seems to be the piece that most people are oblivious to. But it’s generally really easy to implement. Particularly so if you’re using a reputable observability platform.
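A minimal sketch of the routing rule in step four, using Python's standard `logging` module. The two sinks are in-memory stand-ins (`ListHandler` is a hypothetical helper): in practice one would be the indexed log search and the other a cheap archive such as object storage. The WARNING threshold is an arbitrary choice for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted records in memory; stands in for a real sink."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


class BelowLevel(logging.Filter):
    """Pass only records strictly below the given level."""

    def __init__(self, level):
        super().__init__()
        self.level = level

    def filter(self, record):
        return record.levelno < self.level


# High severity goes straight to the (expensive, fast) ingestion pipeline.
hot_store = ListHandler()
hot_store.setLevel(logging.WARNING)

# Everything below that goes to the (cheap, slow) archive.
archive = ListHandler()
archive.addFilter(BelowLevel(logging.WARNING))

logger = logging.getLogger("severity_routing_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(hot_store)
logger.addHandler(archive)

logger.debug("cache miss for key=abc")
logger.info("connection to database successful")
logger.error("payment provider timed out")

print("hot store:", hot_store.lines)
print("archive:  ", archive.lines)
```

The application still logs everything at every level; only the destination changes, so debug output remains available for re-ingestion after an incident.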

People who think “log everything” means “log PII” or “stick everything in the same log ingestion pipeline” are simply doing logging wrong. I’m not normally one to say “you’re doing it wrong” but when it comes to logging, these tools are long since mature now. The problem isn’t the tooling, it’s people’s awareness of it.

lelanthran•7mo ago

> "Better to have it and not need it; than to need it, and not have it..."

Having it is pointless if your SNR is so low that it costs more money than simply waiting for the bug the next time it comes up.

IMO, if a bug never surfaces again, that's not a bug I care about anyway. Keeping all generated data in case someone wants to see the record from a bug 3 months ago is absolutely pointless - if it hasn't surfaced again in the last three weeks, you absolutely have more high-priority things to look at!

I want to see this mythical company, where a paid employee is dedicated by the company to look at a log from 3 months ago, to solve a bug that hasn't resurfaced in that three month period!

hinkley•7mo ago

Once a bug is closed the value of those logs starts to decay. And the fact is that we get punished for working on things that aren’t in the sprint, and working on “done done” stories is one of those ways. Even if you want to clean up your mess, there’s incentive not to. And many of us very clearly don’t like to clean up our own messes, so a ready excuse gets them out of the conversation about a task they don’t want to be voluntold to do.

pstuart•7mo ago

My approach for this is to add dev logging IN ALL CAPS so that it stands out as ugly and "needs adjusting", which is to delete it before merging to main.

hinkley•7mo ago

On my last project I was able to convince the team to clean up feature toggles before closing out epics. But I didn’t make much headway on logs. I came at them sideways and got all but one of my coworkers to stop trying to generate charts from Splunk and use Grafana instead. And I squeezed him by adding stats for things he liked to look at b

hnlmorg•7mo ago

In DevOps (et al) the value of those logs doesn’t decay in the same way it does in pure dev.

Also, as I pointed out elsewhere, modern observability platforms enable a way to have those debug logs available as an archive that can be optionally ingested after an incident but without filling up your regular quota of indexed logs. Thus giving you the best of both worlds (all logging but without the expense and flooding your daily logs with debug messages)

hinkley•7mo ago

> In DevOps (et al) the value of those logs doesn’t decay in the same way it does in pure dev.

I’ve been on-call, and I think you’re cherry picking. The world has too many devs who still debug with log statements. Those logs never had any value to anyone but the original author.

I’ve also seen too many devs who are perfectly happy trying to write vastly complex Splunk queries to generate charts, and those charts tend to break in a production incident because a bunch of people load them at once and blow up Splunk’s rate limiting. I’ve almost never had this problem with Grafana. It’s true that you can make a dashboard with long-term trends that will fall over, but you wouldn’t use that dashboard for triage, unless you make one that tries to do both, and the solution is to split it into two dashboards.

If you want to make a successfully scaling organization, you need a way for new members to join your core of troubleshooters, without pulling resources away from solving the trouble. So they can’t demand time, resources or attention that are in short supply from the core group.

Grafana fits that yardstick much better than log analyzers.

hnlmorg•7mo ago

You’re arguing a different argument.

You’re making a case that cryptic log messages are bad. And I agree.

You’re also making a case that logs are only one piece of the telemetry ecosystem. And I agree there too.

What I’m arguing is that there isn’t a need to filter logs based on cost because you can still work with them in observability platforms in a cost effective way.

Lastly, I didn’t say everything should be instantly available. Long term logs shouldn’t be in the same expensive storage pool as recent logs. But there should be a convenient way to import from older log archives into your immediate log querying tools (statement here is intentionally vague because different observability platforms will engineer this differently and call this process by different names)

As for complex queries, regardless of how easy to use your observability platform is, however many saved queries and dashboards you have built, there’s always going to be a need for upskilling your staff. That’s an inescapable problem.

UltraSane•7mo ago

You really need to define how much you are willing to spend on logging/observability as a percentage of total budget. IMHO 5% is the bare minimum; 10% is better. I've worked for a company that had a dedicated storage array just for logging with Splunk and it was amazing and very much worth the money.

Good automatic tiering for logs is very useful as the most recent logs tend to be the most useful. I like NVMe -> hard disk -> tape library. LTO tape storage is cheap enough you don't need to delete data until it is VERY old.

nikolayasdf123•7mo ago

that looks like tail-error sampling

nikolayasdf123•7mo ago

yeah, same thoughts.

business events + error/tail-sampled traces + metrics

... and logs in rare cases when none of the above works. logs are a dump of everything. why would you want to have so many logs in the first place? and then build whole infra to scale that? and who and how reads all those logs? they build metrics on top of that? so might as well just build metrics directly and purposefully? with such high volume, even LLMs would not read them (too slow and too costly).. and what would even an LLM tell from those logs? (may be sparse/low signal, hard to decipher without tool-calling, like creating metrics)

Spivak•7mo ago

> trace level chaos no one reads unless prod's on fire

God why do we keep these fire extinguishers around, they sit unused 99.999% of the time.

jiggawatts•7mo ago

“Just go back in time and turn on the specific log you will need!”

hinkley•7mo ago

That logging isn’t even free on the sending side, especially in languages where they are eager to get the logs to disk in case the final message reveals why the program crashed.

And there’s a lot of scanning blindness out there. Too much extraneous data can hide correlations between other log entries. And there’s a half-life to the value of logs written for bugs that are already closed, and it’s fairly short.

I prefer stats because of the way they get aggregated. Though for GIL languages some models like OTEL have higher overhead than they should.

nijave•7mo ago

In fairness, I think a lot of GIL languages already have high overhead and I've never been under the impression OTEL was optimized for performance and efficiency.

hinkley•7mo ago

It really isn’t. The code reads like it was designed by SpringBoot users. You have to read three different docs to suss out how to use multiple calls together to get a desired approach, and some of the docs leave out critical details. I think people forget that folks use Google, and what Google thinks is the top result isn’t necessarily the document the creators would assume people find for a topic. I’ve been trying to explain this to the Elixir community for instance.

“Can’t do X, doesn’t work.”

“Look, it’s easy. Did you even RTFM? http://blah.example.com/doc/articleb#section2”

“Uh, no, because search engine took me to http://blah.example.com/doc/articleg#section7”

eddd-ddde•7mo ago

Are there any tools that do log/trace capture on error conditions? I.e. we capture all local events, but only upload them when something meaningful happens, e.g. the server crashed or requests are returning 5xx.

mdaniel•7mo ago

I love this idea in principle, but in practice I would guess it means one of two sub-optimal things: either the node caches them for a window of time, in order to know whether to really transmit them, or the logs are mutated post-delivery as kind of a "tiny expiry"

Everything else I could write is just turning various trade-off knobs, which is why I'd guess you haven't seen an out-of-the-box offering that does what you're describing. There's not just one solution to it that would be reasonable for all audiences
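The first option above (the node caches events for a window, then decides whether to transmit) can be sketched with Python's standard `logging` module. This is an illustration under stated assumptions, not an existing product: `FlushOnErrorHandler` and `ListHandler` are hypothetical names, the window is a fixed record count rather than a time span, and `ListHandler` stands in for a real log shipper.

```python
import logging
from collections import deque


class ListHandler(logging.Handler):
    """Stand-in for a real shipper: keeps 'uploaded' messages in a list."""

    def __init__(self):
        super().__init__()
        self.shipped = []

    def emit(self, record):
        self.shipped.append(record.getMessage())


class FlushOnErrorHandler(logging.Handler):
    """Buffer the last `capacity` records; forward them only on an error."""

    def __init__(self, target, capacity=1000, trigger_level=logging.ERROR):
        super().__init__()
        self.target = target
        self.trigger_level = trigger_level
        # Bounded buffer: the oldest records are dropped once it is full.
        self.buffer = deque(maxlen=capacity)

    def emit(self, record):
        self.buffer.append(record)
        if record.levelno >= self.trigger_level:
            # Something meaningful happened: ship the buffered context too.
            while self.buffer:
                self.target.handle(self.buffer.popleft())


def build_logger(target, capacity=1000):
    """Return a logger that uploads to `target` only when an error occurs."""
    logger = logging.getLogger("flush-on-error-demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    logger.addHandler(FlushOnErrorHandler(target, capacity=capacity))
    return logger
```

The trade-off knobs mentioned above show up directly: `capacity` bounds memory and decides how much context survives, and anything buffered is lost if the process dies before the trigger fires.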

Macha•7mo ago

I've been in a bunch of companies that have pushed for reducing logs in favour of metrics and a limited set of events, usually motivated by "we're using datadog and it's contract renewal time and the number is staggering".

The problem is, if you knew what was going to go wrong, you'd have fixed it already. So when there's a report that something did not operate correctly and you want to find out WTF happened, the detailed logs are useful, but you don't know which logs are useful for that unless you have recurring problems.

__MatrixMan__•7mo ago

> auto pruning logs that no alert ever looked at

I'm sure someone somewhere is working on an AI that predicts whether a given log is likely to get looked at based on previous logs that did get looked at. You could store everything for 24h, slightly less for 7d, pruning more aggressively as the data gets stale so that 1y out the story is pretty thin--just the catastrophes.

ethan_smith•7mo ago

The "attention weighted retention" concept is brilliant. You could implement this with a simple counter tracking query/alert hits per log pattern, then use that for TTL policies in most observability platforms. This approach reduced our storage costs by 70% while preserving all actionable data.
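A minimal sketch of the counter idea described above, assuming log lines have already been reduced to a pattern string. The class name `AttentionRetention`, the hit thresholds, and the TTL values are all hypothetical; how the resulting TTL is applied depends on the observability platform.

```python
from collections import Counter


class AttentionRetention:
    """Map how often a log pattern was queried or alerted on to a TTL."""

    def __init__(self, default_ttl_days=1, tiers=((1, 7), (10, 30), (100, 365))):
        self.hits = Counter()
        self.default_ttl_days = default_ttl_days
        # (minimum hits, TTL in days), checked in ascending order of hits.
        self.tiers = sorted(tiers)

    def record_hit(self, pattern):
        """Call whenever a query or alert touches logs matching `pattern`."""
        self.hits[pattern] += 1

    def ttl_days(self, pattern):
        """Return the retention period for logs matching `pattern`."""
        ttl = self.default_ttl_days
        for min_hits, days in self.tiers:
            if self.hits[pattern] >= min_hits:
                ttl = days
        return ttl
```

Patterns nobody ever looks at keep the short default TTL, while frequently consulted ones are retained longer.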

solatic•7mo ago

If you work for a large enterprise, there are so many dev teams supporting so many products that "we haven't figured out what's actually worth logging" is just disconnected from the developer incentives in those teams (ship features fast, fix your problems even faster because nobody has time for that BS) as well as ops incentives (the servers ARE on fire, and the devs didn't log enough). FinOps comes last, if there's even cost tracking per team in the observability suite.

You don't understand why DataDog has a $44 billion market cap. It's yet another instance of Finance complaining that the transition to The Cloud gave every engineer a corporate credit card with no spend controls or a way for Finance to turn off the spigot.

CoolCold•7mo ago

wut?

> As you’ll read below, this saves us millions of dollars a year and allows us to scale out our ClickHouse Cloud service without having to be concerned about observability costs, or make compromises on the log data we retain.

https://clickhouse.com/blog/building-a-logging-platform-with...

behemot•7mo ago

hey there! I work at ClickHouse. to clarify: the vast majority of this 100PB is structured events. in our case logs are supplementary.

jappgar•7mo ago

Observability maximalism is a cult. A very rich one.

k__•7mo ago

Well, if you wanna investigate unknown unknowns, there isn't much alternative.

hinkley•7mo ago

Funny how they give you a problem and solve it for you for a small monthly fee.

the_arun•7mo ago

I didn’t see how long logs are kept - retention time. After x months you may need summary/aggregated data but not sure about raw data.

behemot•7mo ago

we keep it for 180 days.

henning•7mo ago

Yes, this is what the people who will curse you out and judge you for not using wide events omit: it will greatly increase storage costs compared to the normal metrics + traces + sample based logging that is conventional. It has both a benefit and a cost, and the cost part is always omitted.

valyala•7mo ago

Properly implemented wide events usually reduce storage costs compared to typical chaotic logging of everything. It is expected that a single external request leads to exactly one wide event with all the information about this request, which may be needed for further debugging and analytics. See https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide... .

edzhelyov•7mo ago

How would you add an outgoing request you make to an external system in the wide event? For example, I receive a request, and in that request I make an HTTP call to http://example.com. In tracing that will be a separate span, but how do you manage that in a single wide event?
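One possible way to fold outgoing calls into a single wide event is to record each call as an entry in a nested list on the event. This is a sketch of that assumption only; the thread does not say how ClickHouse or anyone else does it, and `WideEvent`, `outbound`, and the field names are hypothetical.

```python
import json
import time
from contextlib import contextmanager


class WideEvent:
    """One event per incoming request; outbound calls become nested entries."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.fields["outbound_calls"] = []

    def set(self, **fields):
        """Attach more context to the event as the request is processed."""
        self.fields.update(fields)

    @contextmanager
    def outbound(self, target):
        """Time a call to an external system and record it on the event."""
        call = {"target": target, "ok": True}
        start = time.monotonic()
        try:
            yield call  # the caller may add fields such as a status code
        except Exception as exc:
            call["ok"] = False
            call["error"] = type(exc).__name__
            raise
        finally:
            call["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
            self.fields["outbound_calls"].append(call)

    def to_json(self):
        """Serialize the event once, at the end of the request."""
        return json.dumps(self.fields, sort_keys=True)
```

Usage would look like `with event.outbound("http://example.com") as call: call["status"] = 200`, followed by a single `event.to_json()` when the request finishes. The parent/child timing structure that spans give you is flattened into a list here, which is the cost of this approach.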

behemot•7mo ago

ClickHouse is pretty good at compressing the wide events, so it's not that dramatic compared to the benefits of having high-cardinality telemetry. check this out: https://clickhouse.com/blog/optimize-clickhouse-codecs-compr...

Xcelerate•7mo ago

Do wide events really have to take up this much space? I mean, observability is to a large degree basically a sampling problem where the goal is to maximize the ability to reconstruct the state of the environment at a given time using a minimal amount of storage. You can accomplish that by either reducing the number of samples taken or by improving your compression capability.

For the latter, I have a very hard time believing we’ve squeezed most of the juice out of compression already. Surely there’s an absolutely massive amount of low-rank structure in all that redundant data. Yeah, I know these companies already use inverted indices and various sorts of trees, but I would have thought there are more research-y approaches (e.g. low rank tensor decomposition) that if we could figure out how to perform them efficiently would blow the existing methods out of the water. But IDK, I’m not in that industry so maybe I’m overlooking something.

behemot•7mo ago

> Do wide events really have to take up this much space?

100PB is the total volume of the raw, uncompressed data for the full retention period (180 days). compression is what makes it cost-efficient. on this dataset, we see ~15x compression, so we only store around 6.5PB at rest.
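The figures above can be checked with a few lines of arithmetic. The inputs (100PB raw, 180 days, ~15x, ~6.5PB) are from the comment; the helper names and the use of decimal units (1PB = 1000TB) are assumptions.

```python
RAW_PB = 100.0         # raw, uncompressed volume over the retention period
RETENTION_DAYS = 180
STATED_RATIO = 15.0    # "~15x compression"
STATED_STORED_PB = 6.5


def stored_pb(raw_pb: float, ratio: float) -> float:
    """Size at rest for a given raw volume and compression ratio."""
    return raw_pb / ratio


def implied_ratio(raw_pb: float, stored: float) -> float:
    """Compression ratio implied by raw and stored sizes."""
    return raw_pb / stored


at_rest = stored_pb(RAW_PB, STATED_RATIO)              # roughly 6.67 PB
ratio = implied_ratio(RAW_PB, STATED_STORED_PB)        # roughly 15.4x
raw_tb_per_day = RAW_PB * 1000 / RETENTION_DAYS        # roughly 556 TB/day
```

Both stated numbers are rounded, so the two are consistent with each other: 15x gives about 6.67PB, and 6.5PB implies about 15.4x.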

AntonCTO•7mo ago

There isn't much information about correlation. What are the state-of-the-art tools and techniques for observability in stateful use cases?

Let's take the example of an SFU-based video conferencing app, where user devices go through multiple API calls to join a session. Now imagine a user reports that they cannot see video from another participant. How can such problems be effectively traced?

Of course, I can manually filter logs and traces by the first user, then by the second user, and look at the signaling exchange and frontend/backend errors. But are there better approaches?
