Are they just basically large hash tables?
What I am saying is that I really dislike working in ClickHouse with all of the weird footguns. Unless you are using it in a very specific and, in my opinion, limited way, it feels worse than Postgres in every way.
> If a service is crash-looping or down, SysEx is unable to scrape data because the necessary system tables are unavailable. OpenTelemetry, by contrast, operates in a passive fashion. It captures logs emitted to stdout and stderr, even when the service is in a failed state. This allows us to collect logs during incidents and perform root cause analysis even if the service never became fully healthy.
Can you search log data at this volume? Elasticsearch has query capabilities for small-scale log data, I think.
Why would I use ClickHouse instead of storing historical log data as JSON files?
In the article, they talk about needing 8k CPUs to process their JSON logs, but only 90 afterward.
(Context: I work at this scale)
Yes. However, as you can imagine, the processing costs can be enormous. If your indexing/ordering/clustering strategy isn't set up well, a single query can easily end up costing you on the order of $1-$10 just to do something as simple as "look for records containing this string" (rough back-of-envelope below).
My experiences line up with theirs: at the scale where you are moving petabytes of data, the best optimizations are, unsurprisingly, "touch as little data, as few times, as possible" and "move as little data as possible". Every serialize/deserialize step and every round of disk/network I/O adds significant performance cost, and therefore cost to your wallet.
Naturally, this can put OTel directly at odds with efficiency because the OTel collector is an extra I/O and serialization hop. But then again, if you operate at the petabyte scale, the amount of money you save by throwing away a single hop can more than pay for an engineer whose only job is to write serializer/deserializer logic.
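To make the $1-$10 figure concrete, here is a rough back-of-envelope in Python. It assumes a pay-per-TB-scanned pricing model at roughly $5/TB (an illustrative number, not from the article); the point is just how quickly a near-full scan dwarfs a well-pruned one.

    # Rough cost of "find records containing this string" under a
    # pay-per-bytes-scanned model. Numbers are illustrative only.

    def query_cost_usd(scanned_tb: float, price_per_tb_usd: float = 5.0) -> float:
        return scanned_tb * price_per_tb_usd

    # Well-clustered table: sort key / skip indexes prune most of the data.
    print(query_cost_usd(0.05))  # ~$0.25 for ~50 GB actually read

    # Poorly clustered table: a substring match degenerates to a near-full scan.
    print(query_cost_usd(1.5))   # ~$7.50 for 1.5 TB read -- the $1-$10 range above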
The problem is, the people who created the spec are just following a trial-and-error approach, which is insane.
XorNot•2h ago
(One of my immense frustrations with Kubernetes: none of the commands for viewing logs seem to accept logical aggregates like "show me everything from this deployment".)
knutzui•2h ago
k9s (k9scli.io) supports this directly.
fc417fc802•1h ago
Would you delete a text file that's a few KB from a modern device in order to save space? It just doesn't make any sense.
sureglymop•1h ago
Even when it comes to logging in the first place, I have rarely seen developers do it well, instead logging things that make no sense just because it was convenient during development.
But that touches on something else. If your logs are important data, maybe logging is the wrong way to go about it. Instead, think about how to clean, refine, and persist the data you need, just like your other application data (a rough sketch of what I mean is below).
I see log and trace collecting in this way almost as a legacy compatibility thing, analogous to how Kubernetes and containerization let you wrap up any old legacy application process in a uniform format: just collecting all logs and traces is backwards compatible with every application. But in order not to be wasteful and only keep what is valuable, a significant effort would be required afterwards. Well, storage and memory happen to be cheap enough to never have to care about that.
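A minimal sketch of "persist the data like your other application data" instead of a free-text log line. The table and field names here are made up for illustration; the point is that the event becomes a typed row you can query and aggregate, rather than a string you have to parse back out of a log stream.

    import json, sqlite3, time

    conn = sqlite3.connect("events.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS checkout_events (
        ts REAL, order_id TEXT, status TEXT, amount_cents INTEGER, detail TEXT)""")

    def record_checkout(order_id: str, status: str, amount_cents: int, **detail):
        # Instead of: logger.info(f"checkout {order_id} finished: {status} ...")
        # persist a typed row that can be queried like any other application data.
        conn.execute("INSERT INTO checkout_events VALUES (?, ?, ?, ?, ?)",
                     (time.time(), order_id, status, amount_cents, json.dumps(detail)))
        conn.commit()

    record_checkout("ord-42", "paid", 1999, gateway="stripe", retries=0)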
Sayrus•2h ago
[1] https://github.com/stern/stern
CSDude•2h ago
Sure, we should cut waste, but compression exists for a reason. Dropping valuable observability data to save space is usually shortsighted.
And storage isn't the bottleneck it used to be. Tiered storage with S3 or similar backends is cheap and lets you keep full-fidelity data without breaking the budget.
ofrzeta•2h ago
That's a bit of a blanket statement, too :) I've seen many systems where a lot of stuff is logged without much thought. "Connection to database successful" - does this need to be logged on every connection request? Log level info, warning, debug? Codebases are full of this.
jiggawatts•41m ago
My centrist take is that data can be represented wastefully, which is often ignored.
Most "wide" log formats are implemented... naively. Literally just JSON REST APIs or the equivalent.
Years ago I did some experiments where I captured every single metric Windows Server emits every second.
That's about 15K metrics, down to dozens of metrics per process, per disk, per everything!
There is a poorly documented API for grabbing everything ('*') as a binary blob of a bunch of 64-bit counters. My trick was that I then kept the previous such blob and simply took the binary difference. This set most values to zero, so then a trivial run length encoding (RLE) reduced a few hundred KB to a few hundred bytes. Collect an hour of that, compress, and you can store per-second metrics collected over a month for thousands of servers in a few terabytes. Then you can apply a simple "transpose" transformation to turn this into a bunch of columns and get 1000:1 compression ratios. The data just... crunches down into gigabytes that can be queried and graphed in real time.
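Not the original Windows code or API, just a toy Python sketch of the delta + RLE step (the transpose/columnar part is left out). It assumes the blob is a flat array of 64-bit counters where only a handful of values change between samples.

    import struct

    def delta_rle(prev: bytes, cur: bytes) -> bytes:
        # XOR against the previous blob, then run-length encode the zero runs.
        assert len(prev) == len(cur)
        diff = bytes(a ^ b for a, b in zip(prev, cur))
        out, i = bytearray(), 0
        while i < len(diff):
            if diff[i] == 0:                      # run of zero bytes
                j = i
                while j < len(diff) and diff[j] == 0 and j - i < 0xFFFF:
                    j += 1
                out += b"\x00" + struct.pack("<H", j - i)
                i = j
            else:                                 # literal non-zero byte
                out += b"\x01" + bytes([diff[i]])
                i += 1
        return bytes(out)

    # ~15K metrics as 64-bit counters; only every 500th one ticks up this second.
    prev = struct.pack("<15000Q", *range(15000))
    cur  = struct.pack("<15000Q", *(v + (1 if v % 500 == 0 else 0) for v in range(15000)))
    print(len(cur), len(delta_rle(prev, cur)))  # ~120 KB in, a couple hundred bytes out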
I've experimented with OpenTelemetry, and its flagrantly wasteful data representations make me depressed.
Why must everything be JSON!?