frontpage.

Start all of your commands with a comma

https://rhodesmill.org/brandon/2009/commands-with-comma/
143•theblazehen•2d ago•42 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
668•klaussilveira•14h ago•202 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
949•xnx•19h ago•551 comments

How we made geo joins 400× faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
122•matheusalmeida•2d ago•33 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
53•videotopia•4d ago•2 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
229•isitcontent•14h ago•25 comments

Jeffrey Snover: "Welcome to the Room"

https://www.jsnover.com/blog/2026/02/01/welcome-to-the-room/
16•kaonwarb•3d ago•19 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
28•jesperordrup•4h ago•16 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
223•dmpetrov•14h ago•117 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
330•vecti•16h ago•143 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
494•todsacerdoti•22h ago•243 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
381•ostacke•20h ago•95 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
359•aktau•20h ago•181 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
288•eljojo•17h ago•169 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
412•lstoll•20h ago•278 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
19•bikenaga•3d ago•4 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
63•kmm•5d ago•6 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
90•quibono•4d ago•21 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
256•i5heu•17h ago•196 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
32•romes•4d ago•3 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
44•helloplanets•4d ago•42 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
12•speckx•3d ago•5 comments

Introducing the Developer Knowledge API and MCP Server

https://developers.googleblog.com/introducing-the-developer-knowledge-api-and-mcp-server/
59•gfortaine•12h ago•25 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
33•gmays•9h ago•12 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1066•cdrnsf•23h ago•446 comments

I spent 5 years in DevOps – Solutions engineering gave me what I was missing

https://infisical.com/blog/devops-to-solutions-engineering
150•vmatsiiako•19h ago•67 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
288•surprisetalk•3d ago•43 comments

Why I Joined OpenAI

https://www.brendangregg.com/blog/2026-02-07/why-i-joined-openai.html
149•SerCe•10h ago•138 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
183•limoce•3d ago•98 comments

Show HN: R3forth, a ColorForth-inspired language with a tiny VM

https://github.com/phreda4/r3
73•phreda4•13h ago•14 comments

ClickHouse raises $350M Series C

https://clickhouse.com/blog/clickhouse-raises-350-million-series-c-to-power-analytics-for-ai-era
119•caust1c•8mo ago

Comments

bananapub•8mo ago
oof, that sucks [for everyone else]. I hope someone figures out how to make a sustainable business of this sort, eventually.
candiddevmike•8mo ago
Being a profitable database vendor is really, really hard. You absolutely have to lock down big customers during your hype cycle or you're done for. The time to value for customers is so long that it becomes a major investment, and sales cycles get really laborious (speaking as a former DB SE).
hodgesrm•8mo ago
Or you focus on cost-efficient operation from the very beginning. Ironically, databases are also one of the markets where it's possible to achieve profitability by operating, extending, or supporting open source software. I did a talk at FOSDEM 2025 about how three specific companies (Percona, DBeaver, Altinity) achieved this. [0] It is possible because businesses depend on databases and are willing to pay real money to ensure they work properly.

[0] https://fosdem.org/2025/schedule/event/fosdem-2025-5320-buil...

Disclaimer: I run Altinity.

datavirtue•8mo ago
Yep, there will be alternatives on Azure and AWS soon enough, if not already.
dangoodmanUT•8mo ago
I love ClickHouse and a lot of the team members, but some of the "ClickHouse, Inc." people seem very counter to the original mission of CH, which has unfortunately been reflected in some negative ways on both the overall OLAP ecosystem and ClickHouse itself.

I've shared many of those thoughts with their team directly out of love.

Also that's Series D-E, money isn't real anymore

jasonjmcghee•8mo ago
Depends on trajectory and capital need among other things. There are series B this size.
Octoth0rpe•8mo ago
> Also that's Series D-E, money isn't real anymore

Could you explain this? Is this commentary on voting power dilution or their class a/b share rules?

ko_pivot•8mo ago
I’m guessing what they mean is that the valuation is so inflated at this point that the high dollar amount reflects the likelihood of an acquisition or IPO in the near term more than any substantive demonstration of confidence in the company and its founders.
mooreds•8mo ago
Not OP, but I took it to mean that the round was absurdly large. The norms/expectations around size of rounds are not what they once were.

I had the same thought the first time I heard about a $12M "seed" round.

PeterZaitsev•8mo ago
What differences from the original mission do you see?
nasretdinov•8mo ago
I personally see ClickHouse still improving in terms of overall usability and becoming much more polished, introducing features like full-text indexing, the JSON data type, etc., all open source and completely free. The commercial offering deviates from the "bare-bones", "build-it-yourself" storage, but, again, in my opinion it makes perfect sense to commercialise that part in order to allow the product as a whole to keep evolving and be successful. Otherwise ClickHouse as an open-source database probably wouldn't be able to evolve so quickly, since the needs of Yandex don't always align with the needs of its other users.
quantumwoke•8mo ago
Is this money for growth or exiting employee options?
vb-8448•8mo ago
I'm wondering why ClickHouse needs to raise more money. Aren't they profitable already?
jedberg•8mo ago
Usually by Series C you're at a point where you could be breakeven or profitable, but because you're tackling a huge market with a lot of opportunities, it makes sense to take on capital to accelerate growth and attack that market.
amazingamazing•8mo ago
How hard is it to self host clustered clickhouse? Is there parity with the hosted offering?
nasretdinov•8mo ago
It's quite easy to host your own instance; we did it ~7 years ago and ran a cluster of over 50 nodes without any major issues. What ClickHouse Cloud offers is "shared nothing" storage via SharedMergeTree, which uses S3 as a backing store and allows storage and compute to scale separately. The implementation is closed source.
amazingamazing•8mo ago
Interesting - hardware is so cheap though, I guess most enterprises don’t want the hassle.

Personally I’d just go to a colo center, buy a rack of Supermicro, and call it a day. No way that’s more expensive after a year (per public pricing).

nasretdinov•8mo ago
Sharding in the open-source version isn't automatic, so you have to manage it yourself: there is no automatic resharding, and you need to insert data accordingly. IMO that's the biggest bottleneck in its adoption at larger scale. Previously you didn't have a choice about whether to do sharding (and compute/storage separation if you wanted it); now you have more options, including one from the ClickHouse authors themselves.
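
For illustration, a minimal sketch of how manual sharding is usually wired up in open-source ClickHouse with the Distributed table engine (cluster, database, and column names are hypothetical); resharding the per-node tables later is still on you:

    CREATE TABLE events_local ON CLUSTER my_cluster
    (
        event_date Date,
        user_id    UInt64,
        payload    String
    )
    ENGINE = MergeTree
    ORDER BY (event_date, user_id);

    -- Routing layer: reads fan out to every shard, writes are placed by the sharding key.
    CREATE TABLE events ON CLUSTER my_cluster AS events_local
    ENGINE = Distributed(my_cluster, default, events_local, cityHash64(user_id));
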
nine_k•8mo ago
Apparently it's not a bottleneck, it's a sales funnel.
nasretdinov•8mo ago
I don't see a contradiction here tbh. There's nothing wrong with not providing some extra functionality for free (especially features that users will pay for). If you have the engineering resources to manage sharding manually, you're welcome to do so. Since ClickHouse is a commercial company and no longer part of Yandex, they need to earn money one way or another to fund development of the database.
marvinblum•8mo ago
It's not that hard, but there are a few pitfalls you can stumble into. I currently run three clusters for myself and have set up some for clients in the past.

Some of the default config options are weird and SSL is something that needs to be addressed. Overall, still one of the easier DBs to maintain.

the__alchemist•8mo ago
Is there an ELI5 for this company? I'm having a difficult time understanding it from their website. Is it an alternative to Postgres etc? Something that runs on top of it? And analyzes your DB automatically?
doix•8mo ago
I guess you could say it's an alternative to Postgres. It's a different, column-oriented database that makes different tradeoffs. I'd say DuckDB is a better comparison, if you're familiar with it.
pythonaut_16•8mo ago
Expanding on the original question:

Roughly speaking, Postgres is to SQLite what Clickhouse is to DuckDB.

OLTP -> Online Transaction Processing. Postgres and traditional RDBMS. Mainly focused on transactions and addressing specific rows. Queries like "show me all orders for customer X".

OLAP -> Online Analytical Processing. Clickhouse and other columnar-oriented databases. For analytical and calculation queries, like "show me the total value of all orders in March 2024". OLAP databases typically store data by column rather than by row, and usually have optimizations for storage space and query speed based on that. As a tradeoff they're typically slower for OLTP-type queries. Often you'd bring in an OLAP db like Clickhouse when you have a huge volume of data and your OLTP database is struggling to keep up.
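
To make the two query shapes concrete, a minimal sketch with hypothetical table and column names:

    -- OLTP-style: fetch specific rows for one customer
    SELECT order_id, total, created_at
    FROM orders
    WHERE customer_id = 42;

    -- OLAP-style: scan a couple of columns across many rows and aggregate
    SELECT sum(total)
    FROM orders
    WHERE created_at >= '2024-03-01' AND created_at < '2024-04-01';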

ksynwa•8mo ago
What's the significance of "online" in these acronyms?
stonemetal12•8mo ago
It is a rather old acronym. The other option was batch processing, the "you will get your results in the mail" type of thing.

Here "Online" means results while connected to the system, not real time since there is no time requirement for results.

edoceo•8mo ago
Live and real-time
IMTDb•8mo ago
Online means you expect the responses to come quickly (seconds) after launching the request. The opposite is "offline" where you expect the results to come a long time after making the request (hours / days).

ClickHouse is designed so you can build dashboards with it. Other, offline systems are designed for building reports that you send as PDFs over email.

lbhdc•8mo ago
It's a db company that offers an open source database and cloud-managed services.

The database is OLAP, where Postgres is an OLTP database. Essentially it is very fast at complex queries and is targeted at analytics workloads.

datavirtue•8mo ago
Postgres has been used as the basis for several OLAP systems. These guys are probably using a modified Greenplum.
lbhdc•8mo ago
As far as I am aware it is not a derivative of another database.

https://dbdb.io/db/clickhouse

__s•8mo ago
I got to see Citus at Microsoft fail to close against ClickHouse for an internal project

ClickHouse spun out of Yandex & is open source, https://github.com/ClickHouse/clickhouse

Disclosure: I started at Citus & ended up at ClickHouse

jameslk•8mo ago
When Postgres takes a while to answer analytical questions like "what's the 75th percentile of response time for these 900-some billion request rows, grouped by device, network, and date for the past 30 days", that's when you might want to try out ClickHouse.
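
That kind of question maps onto a single ClickHouse aggregation; a rough sketch, with made-up table and column names:

    SELECT
        device,
        network,
        toDate(timestamp) AS day,
        quantile(0.75)(response_time_ms) AS p75
    FROM requests
    WHERE timestamp >= now() - INTERVAL 30 DAY
    GROUP BY device, network, day;
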
NewJazz•8mo ago
I'm struggling with TimescaleDB performance right now and wondering if the grass is greener.
applied_heat•8mo ago
What is the workload or query that is causing issues?
NewJazz•8mo ago
We denormalized some data then wanted to quickly filter by it. I managed to find a decent index to get us through, but now I'm stuck with another dimension in my data that I'd rather not have. I think I'll have to create a new table, migrate data, then rename it.
sukruh•8mo ago
It is.
whatevermom•8mo ago
Migrated from TimescaleDB to ClickHouse and it was like night and day. A naive reimplementation of the service performed wayyyy better than TimescaleDB. Self-hosted.
andness•8mo ago
Started migrating away from TimescaleDB some time ago too. Initially we self-hosted to test it out. It was very quickly clear that it was a lot better for our use case and we decided to go with Clickhouse Cloud to not have to worry about the ops. The pricing for the cloud offering is very good IMO. We use it for telemetry data from a fleet of IoT devices.
cluckindan•8mo ago
Or literally any other OLAP database.

Is it a surprise that OLTP is not efficient at aggregation and analytics?

swyx•8mo ago
maybe HTAP works for most people though
nasretdinov•8mo ago
ClickHouse also has great compression, and it's easy to install and try since it's open source. It's also typically much faster than even other OLAP databases, often by a _lot_.
bandoti•8mo ago
Or if you have to use it because you’re self-hosting PostHog :)
jgalt212•8mo ago
I'm not sure storing 900B or 900MM records for analytics benefits anyone other than AWS. Why not sample?
sethhochberg•8mo ago
A use case where we reached for Clickhouse years ago at an old job was streaming music royalty reporting. Days of runtime on our beefy MySQL cluster, minutes of runtime on a very naively optimized Clickhouse server. And sampling wasn't an option, because rightsholders like the exactly correct amount of money per stream instead of some approximation of the right amount of money :)

There's nothing Clickhouse does that other OLAP DBs can't do, but the killer feature for us was just how trivially easy it was to replicate InnoDB data into Clickhouse and get great general performance out of the box. It was a very accessible option for a bunch of Rails developers who were moonlighting as DBAs in a small company.

jgalt212•8mo ago
Yes, payments is an N=all scenario. Analytics is not, however.
antisthenes•8mo ago
Use-case dependent. For some analytics, you really want to see the tail ends (e.g. rare events) which sampling can sometimes omit or under-represent.
NunoSempere•8mo ago
That seems like the kind of problem that could easily be handled with Monte Carlo approximation? How hard is it to get 1M random rows from a Postgres database?
sylvinus•8mo ago
ClickHouse has native support for sampling https://clickhouse.com/docs/sql-reference/statements/select/...
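
Sampling has to be declared on the table; a minimal sketch of how it is typically set up and used (table and column names are hypothetical):

    CREATE TABLE requests
    (
        timestamp        DateTime,
        user_id          UInt64,
        response_time_ms UInt32
    )
    ENGINE = MergeTree
    ORDER BY (toDate(timestamp), cityHash64(user_id))
    SAMPLE BY cityHash64(user_id);

    -- Read roughly 10% of the rows and scale the count back up by hand
    SELECT count() * 10 AS approx_total
    FROM requests
    SAMPLE 1/10;
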
simantel•8mo ago
It's an alternative to Postgres in the sense that they're both databases. Read up on OLAP vs. OLTP to see the difference.
whobre•8mo ago
It's not like Postgres at all, except at a very superficial level. It is an analytical engine like BigQuery, Snowflake, Teradata, etc.
arecurrence•8mo ago
Clickhouse has a wide range of really interesting technologies that are not in Postgres; fundamentally, it's not an OLTP database like Postgres but is aimed more at OLAP workloads. I really appreciate Clickhouse's focus on performance; quite a bit of work goes into optimizing memory allocation and the operations on different data types.

The heart of Clickhouse is its table engines (they don't exist in Postgres): https://clickhouse.com/docs/engines/table-engines . The primary column (or columns) is ordered in some way, and adjacent values in memory are from the same column in the table. Index entries span wide areas (e.g. by default there's only one entry in the primary index for every 8192 rows) because most operations in Clickhouse are aggregate in nature. Inserts are also expected to be done in bulk (they initially form a new physical part that is later merged into the main table structure). A single DELETE is an ALTER TABLE operation in the MergeTree engine. :)

This structure allows it to literally crunch billions of values per second (by brute force, not with pre-processing, erm, "tricks", although there is a lot of support for those in Clickhouse as well). I've had tables with hundreds of columns and 100+ billion rows that are nearly as performant as a million-row table if I can structure the query to work with the table's physical ordering.

Clickhouse recommends not using nullable fields because of the performance implications (it requires storing an extra bit somewhere for each value). That's how much they care about perf, and how closely their memory layout tracks the raw data types. :)
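
A minimal sketch of the MergeTree layout described above (names hypothetical); the sparse primary index keeps one entry per granule, 8192 rows by default:

    CREATE TABLE metrics
    (
        ts    DateTime,
        host  LowCardinality(String),
        value Float64
    )
    ENGINE = MergeTree
    ORDER BY (host, ts)                 -- physical sort order; filters on it can skip whole granules
    SETTINGS index_granularity = 8192;  -- the default mentioned above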

porridgeraisin•8mo ago
> Inserts are also expected to be in bulk (They are initially a new physical part that is later merged into the main table structure). A single DELETE is an ALTER TABLE operation in the MergeTree engine.

> They are initially a new physical part that is later merged into the main table structure

> A single DELETE is an ALTER TABLE operation

Can you explain these two further?

arecurrence•8mo ago
The Clickhouse docs are so good that I'd point straight to them https://clickhouse.com/docs/sql-reference/statements/alter/d... .

The reason I mentioned it is because it's a huge surprise to some people that... from the docs: "The ALTER TABLE prefix makes this syntax different from most other systems supporting SQL. It is intended to signify that unlike similar queries in OLTP databases this is a heavy operation not designed for frequent use. ALTER TABLE is considered a heavyweight operation that requires the underlying data to be merged before it is deleted."

There's also a "lightweight delete" available in many circumstances https://clickhouse.com/docs/sql-reference/statements/delete. Something really nice about the ClickHouse docs is that they devote quite a bit of text to describing the design and performance implications of using an operation. It reiterates the focus on performance that is pervasive across the product.

Edit: Per the other part of your question, why inserts create new parts and how they are merged is best described here https://clickhouse.com/docs/engines/table-engines/mergetree-...
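
For reference, the two deletion forms discussed above look like this (table name hypothetical):

    -- Mutation: heavyweight, asynchronously rewrites the affected data parts
    ALTER TABLE events DELETE WHERE user_id = 42;

    -- Lightweight delete: marks rows as deleted; physical cleanup happens on later merges
    DELETE FROM events WHERE user_id = 42;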

porridgeraisin•8mo ago
Thankyou!
Silasdev•8mo ago
SQL, OLAP. The primary use case is fast aggregations on append-only data, like usage analytics.

It's fast, it's........ really fast!!

But you need to get comfortable with their extended SQL dialect, which forces you to think a little differently than with usual SQL if you want to keep perf high.

joshstrange•8mo ago
If you go into it with MySQL/Postgres knowledge you will probably hate it.

Source: me

I almost wish it didn’t use SQL so that it was clear how different it is. Nothing works like you are used to, footguns galore, and I hate zookeeper.

I’d replace it with Postgres in a heartbeat if I thought I could get away with it; I don’t think our data size really needs CH. Unfortunately, my options are “spin up a cluster on company resources to prove my point” or “spin it up on my own infra” (which is not possible, since that would require pulling company data to my servers, which I would never do). So instead I’m stuck dealing with CH.

whiskeytwolima•8mo ago
I honestly just want to know why they didn't steam their shirts.
winterbloom•8mo ago
so like iron? is it that important? asking legitimately
whiskeytwolima•8mo ago
Nah I don't think it's important. I just think it's funny.
ajcp•8mo ago
I'm really confused by the wrinkle pattern. Were they stored with half the shirt stuffed in a Pringles can? Or were these shot out of an air cannon? The more I look at the picture the deeper the mystery gets.
tylerhannan•8mo ago
I actually took that picture a few years ago.

It's a fun story.

Our first swag shipment with the new colours had just arrived, the founders were in one place together for one of the first times, the weather wasn't terrible in Amsterdam for one day.

Not a Pringles can. Rather, they were stuffed in a shipping box that came from a warehouse, manhandled by customs, and thrown onto them for the purpose of taking the photo.

#startuplife eh?

whiskeytwolima•8mo ago
Love it, and I've definitely been there. Too funny though.
devops000•8mo ago
only 2k users?

With $200/month I have a good database. $1-5M revenue?

noleary•8mo ago
My understanding is that those 2,000 represent some very large, enterprise-y contracts. The GitHub repo itself has almost 2,000 contributors: https://github.com/ClickHouse/ClickHouse
brettgriffin•8mo ago
The ACV for a data warehouse is orders of magnitude beyond $200. Snowflake's ACV is something like $300k/yr.
arecurrence•8mo ago
I've worked at a number of companies using Clickhouse and they all self-hosted. I imagine Clickhouse corporate is focused on large customers.
wooque•8mo ago
We use the smallest cluster and it's $450/month; most companies probably pay much more.
Boxxed•8mo ago
Does anyone use clickhouse in production? I was initially pretty impressed but when I really put it through its paces I could OoM it as soon as I actually started querying non-trivial amounts of data:

https://github.com/ClickHouse/ClickHouse/issues/79064

hodgesrm•8mo ago
It's used in production by many thousands of companies at this point. The ClickHouse Inc numbers are just a fraction of the total users.

p.s., It's also possible to break ClickHouse as you demonstrated. It used to be a lot easier.

Boxxed•8mo ago
I guess I'm curious how; I breathe on it wrong and it OoMs.
bathtub365•8mo ago
You don’t need a good product to have a lot of users, just good marketing and salespeople.
nasretdinov•8mo ago
One easy way to achieve this is to store really large values, e.g. 10 MB per row. Since ClickHouse operates on large blocks, you'd easily cause an OOM just by trying to read chunks of 8192 rows (the default) at a time, especially during merges, where it needs to read large blocks from several parts at once.
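
If you really do have very wide rows, the read block size and per-query memory cap are tunable; a hedged sketch (table and values are hypothetical):

    SELECT payload
    FROM big_blobs
    SETTINGS max_block_size = 256,           -- read far fewer rows per block than the 8192 default
             max_memory_usage = 4000000000;  -- hard cap for this query, in bytes
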
hodgesrm•8mo ago
One of the tradeoffs for ClickHouse versus databases like Snowflake is that you have to have some knowledge about the internals to use it effectively. For example, Snowflake completely hides partitioning but on the other hand it does not deliver consistent, real-time response the way a well-tuned ClickHouse application can.

When you use INSERT ... SELECT in ClickHouse you do need to pay attention to the generated table partitions, as they coexist in memory before flushing to storage. The usual approach is to break up the insert into chunks so you can control how many parts are generated or to adjust the partitioning in the target table.

It's possible the problem might be somehow related to this behavior but that's just conjecture. It's usually pretty easy to work around. Meanwhile if it's a bug it will probably get fixed quickly.
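
The chunking approach described above often just means driving the INSERT ... SELECT one partition at a time; a sketch with hypothetical names, assuming the tables are partitioned by month:

    -- Copy one month per statement so only that partition's parts are built in memory
    INSERT INTO events_new
    SELECT *
    FROM events_old
    WHERE toYYYYMM(event_date) = 202401;

    -- Repeat for 202402, 202403, and so on.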

datavirtue•8mo ago
You have to have knowledge of the internals of any database you use. Not knowing is going to cost someone a lot of money and/or performance.
hackitup7•8mo ago
Yes for relatively large workloads
fishtoaster•8mo ago
Yep. Clickhouse is absolutely great for tons of production use cases.

Unless you try to join tables in it, in which case it will immediately explode.

More seriously, it's a columnar data store, not a relational database. It'll definitely pretend to be "postgres but faster", but that's a very thin and very leaky facade. You want to do a massively complex set of selects and conditional sums over one table with 3B rows and terabytes of data? You'll get a result in tens of seconds without optimization. You want to join two tables that Postgres could handle easily? You'll OOM a machine with a terabyte of memory.

So: good for very specific use cases. If you have those usecases, it's great! If you don't, use something else. Many large companies have those use cases.
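
The "conditional sums over one wide table" case above is the shape ClickHouse is happiest with; a sketch with made-up names, using the -If aggregate combinators:

    SELECT
        toDate(ts)                       AS day,
        countIf(status >= 500)           AS errors,
        sumIf(bytes, country = 'US')     AS us_bytes,
        avgIf(latency_ms, cache_hit = 0) AS origin_latency_ms
    FROM requests
    GROUP BY day;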

hodgesrm•8mo ago
> More seriously, it's a columnar data store, not a relational database.

Could you explain why you don't think ClickHouse is relational? The storage is an implementation detail. It affects how fast queries run but not the query model. Joins have already improved substantially and will continue to do so in future.

fishtoaster•8mo ago
The storage is not just an implementation detail because it affects how fast things run, which affects which tasks it's better or worse for. There's a reason people reach for a columnar datastore for some tasks and something like postgres or mysql for other tasks, even though both are technically capable of nearly the same queries.
Boxxed•8mo ago
Yeah, I think that's a good summary. For instance, ClickBench comprises >40 queries and there's not a single join in them: https://github.com/ClickHouse/ClickBench/blob/main/clickhous...
zX41ZdbW•8mo ago
There is the "versions benchmark," which includes a lot of queries with JOINs and compares ClickHouse performance on them: https://benchmark.clickhouse.com/versions/
Boxxed•8mo ago
I don't think that's right, it looks to be a set of 43 queries with zero joins: https://github.com/ClickHouse/ClickBench/blob/main/versions/...
zX41ZdbW•8mo ago
Here are the 75 queries from various benchmarks that form the versions benchmark: https://benchmark.clickhouse.com/versions/
Boxxed•8mo ago
Did you look at the queries? There is not a single join in any of them.
adrian17•8mo ago
The majority of our queries have joins before aggregations (plus our core logic often depends on fact-table expansion with `arrayJoin()`s), and we're doing fine. AFAIK whenever we hit memory issues, they are mostly due to high-cardinality aggregations (especially with uniqExact), not joins. But I'm sure it can depend on the specifics.
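
On the uniqExact point, the approximate counterpart is usually the first thing to try when memory gets tight; a small sketch (names hypothetical):

    SELECT
        uniqExact(user_id) AS exact_users,  -- keeps every distinct value in memory; can blow up on high cardinality
        uniq(user_id)      AS approx_users  -- bounded memory, small relative error (a percent or two)
    FROM events;
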
legorobot•8mo ago
Definitely agree with this; I think ClickHouse can do a lot with joins if you don't implement them naively. Keeping the server up to date is part of it too.

They've made strides in the last year or two to implement more join algorithms and to reorder your joins automatically (including what's on the "left" and "right" of the join, which matters for the performance of the algorithm).

Their release notes cover a lot of the highlights, and they have dedicated documentation regarding joins [1]. We've made order-of-magnitude improvements before just by reordering our joins to align with how ClickHouse processes them.

[1]: https://clickhouse.com/docs/guides/joining-tables

owenthejumper•8mo ago
I find Clickhouse fascinating, really good, and also really tough to run. It's a non-linear memory hog. It probably needs 32GB of RAM for the basics to run; otherwise it will OOM on a minimal amount of data. That said, it won't "OOM" as in crash. It will just report that the query would use too much memory and abort it.
david38•8mo ago
It’s fantastic but it’s a columnar store. It’s not a Postgres replacement.
mplanchard•8mo ago
Yes (via Clickhouse Cloud, which is pretty reasonably priced).

It’s important to structure your tables and queries in a way that aligns with the ordering keys, in order to optimize how much data needs to be loaded into RAM. You absolutely CANNOT just replicate your existing postgres DB and its primary keys or whatever over to CH. There are tricks like projections and incremental materialized views that can help to get the appropriate “lenses” for your queries. We use incremental MVs to, for example, continuously aggregate all-time stats about tens of billions of records. In general, for CH, space is cheap and RAM is expensive, so it’s better to duplicate a table’s data with a different ordering key than to make an inefficient query.

As long as the queries align with the ordering keys, it is insanely fast and able to enable analytics queries for truly massive amounts of data. We’ve been very impressed.
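
A minimal sketch of the incremental materialized view pattern described above, using an AggregatingMergeTree target (all names hypothetical):

    CREATE TABLE stats_by_day
    (
        day   Date,
        users AggregateFunction(uniq, UInt64),
        bytes AggregateFunction(sum, UInt64)
    )
    ENGINE = AggregatingMergeTree
    ORDER BY day;

    -- Populated incrementally as rows are inserted into the source table
    CREATE MATERIALIZED VIEW stats_by_day_mv TO stats_by_day AS
    SELECT
        toDate(ts)          AS day,
        uniqState(user_id)  AS users,
        sumState(bytes_out) AS bytes
    FROM events
    GROUP BY day;

    -- Read back with the -Merge combinators
    SELECT day, uniqMerge(users) AS users, sumMerge(bytes) AS bytes
    FROM stats_by_day
    GROUP BY day;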

Boxxed•8mo ago
Well that's exactly my complaint. The bug I filed above was pretty much the optimal case (one huge table, one very small table, both ordered by the join key) and it still OoMs.
mplanchard•8mo ago
Yeah it sucks at joins! If you can restructure your query to use an IN, or first limit the large table in a CTE and then JOIN, you may see better results. So far we haven’t found any cases where it couldn’t manage the task, but we have often had to be clever about join strategies.

Depending on your use case, an incremental materialized view can also be really effective: when new rows for one table come in, query for related rows in a secondary table and populate the combination into a MV for efficient querying.

You can also specify specific join strategies for queries, but we haven’t had as much luck with that so far.

The JOIN thing is definitely the biggest pain point, though, I’ll not debate that at all.
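
As a concrete illustration of the IN rewrite suggested at the top of this comment (table and column names hypothetical):

    -- Join form: the right-hand side is materialized for the hash join
    SELECT e.user_id, count()
    FROM events AS e
    INNER JOIN users AS u ON u.id = e.user_id
    WHERE u.plan = 'enterprise'
    GROUP BY e.user_id;

    -- IN form: often lighter, since only the matching key set is kept in memory
    SELECT user_id, count()
    FROM events
    WHERE user_id IN (SELECT id FROM users WHERE plan = 'enterprise')
    GROUP BY user_id;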

_gmax0•8mo ago
Heard through the grapevine that Cloudflare uses it for their analytics.
tveita•8mo ago
They don't make a secret of it: https://blog.cloudflare.com/log-analytics-using-clickhouse/

Clickhouse is great, but like any database if you run it at scale someone must tend to it.

_gmax0•8mo ago
Thanks for this, I hadn't come across it before.
lossolo•8mo ago
7 years, 24/7 high volume, self hosted, no issues really.
AlexClickHouse•8mo ago
Thanks for creating this issue, it is worth investigating!

I see you also created similar issues in Polars: https://github.com/pola-rs/polars/issues/17932 and DuckDB: https://github.com/duckdb/duckdb/issues/17066

ClickHouse has a built-in memory tracker, so even if there is not enough memory, it will stop the query and send an exception to the client, instead of crashing. It also allows fair sharing of memory between different workloads.

You need to provide more info on the issue for reproduction, e.g., how to fill the tables. 16 GB of memory should be enough even for a CROSS JOIN between a 10 billion-row and a 100-row table, because it is processed in a streaming fashion without accumulating a large amount of data in memory. The same should be true for a merge join.

However, there are places where a large buffer might be needed. For example, if you insert data into a table backed by S3 storage, it requires a buffer that can be on the order of 500 MB.

There is a possibility that your machine has 16 GB of memory, but most of it is consumed by Chrome, Slack, or Safari, and not much is left for the ClickHouse server.

Boxxed•8mo ago
Yeah, I feel like I'm on crazy pills; I'm trivially OOM'ing all these big data tools that everyone loves -- DuckDB OOM'd just loading a CSV file, and Polars OOM'd just reading the first couple of rows of a Parquet file?

I do want to get a better reproduction on CH because it seems like there's an interplay with the INSERT INTO ... SELECT. It's just a bit of work to generate synthetic data with the same profile as my production data (for what it's worth, I did put quite a bit of effort into following the doc guidelines for dealing with low-memory machines).