Optimizing ClickHouse for Intel's ultra-high core count processors

https://clickhouse.com/blog/optimizing-clickhouse-intel-high-core-count-cpu

225•ashvardanian•4mo ago

Comments

epistasis•4mo ago

This is my favorite type of HN post, and definitely going to be a classic in the genre for me.

> Memory optimization on ultra-high core count systems differs a lot from single-threaded memory management. Memory allocators themselves become contention points, memory bandwidth is divided across more cores, and allocation patterns that work fine on small systems can create cascading performance problems at scale. It is crucial to be mindful of how much memory is allocated and how memory is used.

In bioinformatics, one of the most popular alignment algorithms is roughly bottlenecked on random RAM access (the FM-index on the BWT of the genome), so I always wonder how these algorithms are going to perform on these beasts. It's been a decade since I spent any time optimizing large system performance for it though. NUMA was already challenging enough! I wonder how many memory channels these new chips have access to.

ashvardanian•4mo ago

My expectation is that they will perform great! I’m now mostly benchmarking on 192-core Intel, AMD, and Arm instances on AWS, and in some workloads they come surprisingly close to GPUs even on GPU-friendly workloads, once you get the SIMD and NUMA pinning parts right.

For bioinformatics specifically, I’ve just finished benchmarking Intel SPR 16-core UMA slices against Nvidia H100, and will try to extend them soon: https://github.com/ashvardanian/StringWa.rs

bob1029•4mo ago

The ideal arrangement is one in which you do not need to use the memory subsystem in the first place. If two threads need to communicate back and forth with each other in a very tight loop in order to get some kind of job done, there is almost certainly a much faster technique that could be run on a single thread. Physically moving the information between processing cores is the most expensive part. You can totally saturate the memory bandwidth of a Zen chip with somewhere around 8-10 cores if they're all going at a shared working set really aggressively.

Core-to-Core communication across infinity fabric is on the order of 50~100x slower than L1 access. Figuring out how to arrange your problem to meet this reality is the quickest path to success if you intend to leverage this kind of hardware. Recognizing that your problem is incompatible can also save you a lot of frustration. If your working sets must be massive monoliths and hierarchical in nature, it's unlikely you will be able to use a 256+ core monster part very effectively.

jeffbee•4mo ago

Note that none of the CPUs in the article have that Zen architecture.

One of the most interesting and poorly exploited features of these new Intel chips is that four cores share an L2 cache, so cooperation among 4 threads can have excellent efficiency.

They also have user-mode address monitoring, which should be awesome for certain tricks, but unfortunately like so many other ISA extensions, it doesn't work. https://www.intel.com/content/www/us/en/developer/articles/t...

Moto7451•4mo ago

One of the use cases for ClickHouse and related columnar stores is simply to process all your data as quickly as possible, where “all” is certainly more than what will fit in memory and in some cases more than what will fit on a single disk. For these I’d expect the allocator issue is contention when working with the MMU, TLB, or simply allocators that are not lock-free (like the standard glibc allocator). Where possible, one trick is to pre-allocate as much as possible for your worker pool so you get that out of the way and stop calling malloc once you begin processing. If you can swing it, you replace chunks of processed data with new data within the same allocated area. At a previous job our custom search engine did just this to scale out better on the AWS X1 instances we were using for processing data.
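
A minimal sketch of that pre-allocate-and-reuse idea, in Python for brevity. The names `BufferPool` and `load_chunk` are hypothetical and not taken from ClickHouse or from the search engine mentioned above; the point is only that all allocation happens once, up front, and new data overwrites old data in the same buffers.

```python
import queue


class BufferPool:
    """Fixed set of preallocated buffers handed out to workers and reused."""

    def __init__(self, count: int, size: int) -> None:
        self._free: "queue.Queue[bytearray]" = queue.Queue()
        for _ in range(count):
            # All allocation happens here, before processing starts.
            self._free.put(bytearray(size))

    def acquire(self) -> bytearray:
        # Blocks if every buffer is currently in use by another worker.
        return self._free.get()

    def release(self, buf: bytearray) -> None:
        self._free.put(buf)


def load_chunk(buf: bytearray, chunk: bytes) -> memoryview:
    """Overwrite the start of buf in place and return a view of the new data."""
    n = len(chunk)
    if n > len(buf):
        raise ValueError("chunk does not fit in the preallocated buffer")
    # Same-length slice assignment: the buffer is overwritten, not reallocated.
    buf[:n] = chunk
    return memoryview(buf)[:n]
```

A worker would call `acquire()`, fill the buffer with `load_chunk()`, process the returned view, and then `release()` the buffer for the next chunk.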

pixelpoet•4mo ago

This post looks like excellent low-level optimisation writing just in the first sections, and (I know this is kinda petty, but...) my heart absolutely sings at their use of my preferred C++ coding convention where & (ref) neither belongs to the type nor the variable name!

nivertech•4mo ago

I think it belongs to type, but since they use “auto” it looks standalone and can be confused with the “&” operator. I personally always used * and & as a prefix of the variable name, not as a suffix in the type name, except when used to specify types in templates.

pixelpoet•4mo ago

IMO it's a separate category of modifiers/decorators to the type, like how adjectives and nouns are distinguished, and the only reason we have the false-choice in C/C++ is because it's not alphanumeric (if the token were e.g. "ref" it would interfere with the type or variable name in either other convention).

If I were forced at gunpoint to choose one of the type or name, "obviously" I would also choose type.

bee_rider•4mo ago

288 cores is an absurd number of cores.

Do these things have AVX512? It looks like some of the Sierra Forest chips do have AVX512 with 2xFMA…

That’s pretty wide. Wonder if they should put that thing on a card and sell it as a GPU (a totally original idea that has never been tried, sure…).

ashvardanian•4mo ago

Sadly, no! On the bright side, they support new AVX2 VNNI extensions, that help with low precision integer dot products for Vector Search!

SimSIMD (inside USearch (inside ClickHouse)) already has those SIMD kernels, but I don’t yet have the hardware to benchmark :(

yvdriess•4mo ago

Something that could help is to use llvm-mca or similar to get an idea of the potential speedup.

Sesse__•4mo ago

A basic block simulator like llvm-mca is unlikely to give useful information here, as memory access is going to play a significant part in the overall performance.

pclmulqdq•4mo ago

AVX-512 is on the P-cores only (along with AMX now). The E-cores only support 256-bit vectors.

If you're doing a lot of loading and storing, these E-core chips are probably going to outperform the chips with huge cores, because the huge cores will be idling a lot. For CPU-bound tasks, the P-cores will win hands down.

sdairs•4mo ago

how long until I have 288 cores under my desk I wonder?

zokier•4mo ago

Does 2x160 cores count?

https://www.titancomputers.com/Titan-A900-Octane-Dual-AMD-EP...

sbarre•4mo ago

Damn, when I first landed on the page I saw $7,600 and thought "for 320 cores that's pretty amazing!" but that's the default configuration with 32 cores & 64GB of memory.

320 cores starts at $28,000.. $34k with 1TB of memory..

mrheosuper•4mo ago

The CPU has a launch price of $13k already, so $28k is a good deal imo

sbarre•4mo ago

It's not a good deal for me though. ;-)

bri3d•4mo ago

Sierra Forest (the 288-core one) does not have AVX512.

Intel split their server product line in two:

* Processors that have only P-cores (currently, Granite Rapids), which do have AVX512.

* Processors that have only E-cores (currently, Sierra Forest), which do not have AVX512.

On the other hand, AMD's high-core, lower-area offerings, like Zen 4c (Bergamo) do support AVX512, which IMO makes things easier.

ashvardanian•4mo ago

Largely true, but there is always a caveat.

On Zen4 and Zen4c the registers are 512 bits wide. However, internally, many of the “datapaths” (execution units, floating-point units, vector ALUs, etc.) behind the AVX-512 functional units are 256 bits wide…

Zen5 is supposed to be different, and again, I wrote the kernels for Zen5 last year, but still have no hardware to profile the impact of this implementation difference on practical systems :(

adgjlsfhk1•4mo ago

512 bits is the least important part of AVX-512. You still get all the masks and the fancy functions.

adrian_b•4mo ago

This is an often repeated myth, which is only half true.

On Zen 4 and Zen 4c, for most vector instructions the vector datapaths have the same width as in Intel's best Xeons, i.e. they can do two 512-bit instructions per clock cycle.

The exceptions where AMD has half throughput are the vector load and store instructions from the first level cache memory and the FMUL and FMA instructions, where the most expensive Intel Xeons can do two FMUL/FMA per clock cycle while Zen 4/4c can do only 1 FMUL/FMA + 1 FADD per clock cycle.

So only the link between the L1 cache and the vector registers and also the floating-point multiplier have half-width on Zen 4/4c, while the rest of the datapaths have the same width (2 x 512-bit) on both Zen 4/4c and Intel's Xeons.
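
Taking the figures in this comment at face value, the peak single-precision throughput per core per cycle can be worked out as below. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it counts an FMA as two floating-point operations per lane and an add as one.

```python
LANES_512_FP32 = 512 // 32  # 16 single-precision lanes per 512-bit vector


def peak_fp32_flops_per_cycle(fma_per_cycle: int, fadd_per_cycle: int) -> int:
    # An FMA counts as two floating-point operations per lane, an add as one.
    return LANES_512_FP32 * (2 * fma_per_cycle + fadd_per_cycle)


# Per the comment: two 512-bit FMAs per cycle on the most expensive Xeons.
xeon_two_fma = peak_fp32_flops_per_cycle(fma_per_cycle=2, fadd_per_cycle=0)

# Per the comment: one 512-bit FMA plus one 512-bit FADD per cycle on Zen 4/4c.
zen4 = peak_fp32_flops_per_cycle(fma_per_cycle=1, fadd_per_cycle=1)

print(xeon_two_fma, zen4)  # 64 48
```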

The server and desktop variants of Zen 5/5c (and also the laptop Fire Range and Strix Halo CPUs) double the width of all vector datapaths, exceeding the throughput of all past or current Intel CPUs. Only the server CPUs expected to be launched in 2026 by Intel (Diamond Rapids) are likely to be faster than Zen 5, but by then AMD might also launch Zen 6, so it remains to be seen which will be better by the end of 2026.

jsheard•4mo ago

It is pretty wide, but 288 cores with 8x FP32 lanes each is still only about a tenth of the lanes on an RTX 5090. GPUs are really, really, really wide.
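
The "about a tenth" figure can be checked with a quick calculation. The 21,760 figure below is an assumption: it is the published CUDA core count for the RTX 5090, treated here as one FP32 lane each, and the CPU side counts a single 256-bit vector per core.

```python
CPU_CORES = 288               # Sierra Forest core count from the thread
FP32_LANES_PER_CORE = 8       # one 256-bit vector = 8 x 32-bit floats
RTX_5090_FP32_LANES = 21_760  # assumption: published CUDA core count

cpu_lanes = CPU_CORES * FP32_LANES_PER_CORE
ratio = cpu_lanes / RTX_5090_FP32_LANES

print(cpu_lanes, round(ratio, 3))  # 2304 0.106
```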

rkagerer•4mo ago

640k of RAM is totally absurd.

So is 2 GB of storage.

And 2K of years.

NortySpock•4mo ago

I mean, yeah, it's "a lot" because we've been starved for so long, but having run analytics aggregation workloads I now sometimes wonder if 1k or 10k cores with a lot of memory bandwidth could be useful for some ad-hoc queries, or just being able to serve an absurd number of website requests...

CPU on PCIe card seems like it matches with the Intel Xeon Phi... I've wondered if that could boost something like an Erlang mesh cluster...

https://en.m.wikipedia.org/wiki/Xeon_Phi

singhrac•4mo ago

The 288 core SKU (I believe 6900E) isn't very widely available, I think only to big clouds?

bigiain•4mo ago

> 288 cores is an absurd number of cores.

Way back in the day, I built and ran the platform for a business on Pentium grade web & database servers which gave me 1 "core" in 2 rack units.

That's 24 cores per 48 unit rack, so 288 cores would be a dozen racks or pretty much an entire aisle of a typical data center.

I guess all of Palo Alto Internet eXchange (where two of my boxen lived) didn't have much more than a couple of thousand cores back in 98/99. I'm guessing there are homelabs with more cores than that entire PAIX data center had back then.

bee_rider•4mo ago

Oh yeah, it is not that many cores for the cluster-universe. Just neat to see the number of cores per socket increase.

A while ago I had access to an 8-socket shared memory machine… but this was the semi-olden days, so it was “only” 80 cores. It was a fun machine at the time! We’re so spoiled these days, haha.

jiehong•4mo ago

Great work!

I like duckdb, but clickhouse seems more focused on large scale performance.

I just thought that the article is written from the point of view of a single person, but has multiple authors, which is a bit weird. Did I misunderstand something?

hobo_in_library•4mo ago

Not sure what happened here, but it's not uncommon for a post to have one primary author and then multiple reviewers/supporters also credited

sdairs•4mo ago

Yep that's pretty much the case here!

sdairs•4mo ago

ClickHouse works in-process and on the CLI just like DuckDB, but also scales to hundreds of nodes - so it's really not limited to just large scale. Handling those smaller cases with a great experience is still a big focus for us

_jsmh•4mo ago

I'd like to see ClickHouse change its query engine to use Optimistic Concurrency Control.

secondcoming•4mo ago

Those ClickHouse people get to work on some cool stuff

sdairs•4mo ago

We do! (and we're hiring!)

vlovich123•4mo ago

I'm generally surprised they're still using the unmaintained old version of jemalloc instead of a newer allocator like the Bazel-based TCMalloc or mimalloc which have significantly better techniques due to better OS primitives & about a decade or so of R&D behind them.

mrits•4mo ago

Besides jemalloc also being used by other columnar databases, it has a lot of control and telemetry built in. I don't closely follow tcmalloc but I'm not sure it focuses on large objects and fragmentation over months/years.

jeffbee•4mo ago

TCMalloc has an absurd amount of bookkeeping and stats, but you have to understand the implementation deeply to make sense of the stats. https://github.com/google/tcmalloc/blob/master/docs/stats.md

drchaim•4mo ago

it seems they have tested it: https://github.com/ClickHouse/ClickHouse/issues/34157

lordnacho•4mo ago

ClickHouse is excellent btw. I took it for a spin, loading a few TB of orderbook changes into it as entire snapshots. The double compression (type-aware and generic) does wonders. It's amazing how you get both the benefit of small size and quick querying, with minimal tweaks. I don't think I changed any system level defaults, yet I can aggregate through the entire few billion snapshots in a few minutes.
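
To illustrate why stacking a type-aware step under a generic compressor helps, here is a small sketch using delta encoding followed by zlib. Both are stand-ins: the comment does not say which codecs were used, and the timestamp series is synthetic data made up for the example.

```python
import struct
import zlib


def delta_encode(values):
    """Type-aware step: store each value as the difference from the previous one."""
    prev = 0
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def pack_i64(values):
    """Serialize a list of integers as little-endian 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


# Synthetic, slowly increasing timestamps (hypothetical data).
timestamps = [1_700_000_000_000 + 250 * i + (i % 3) for i in range(10_000)]

# Generic compression alone versus delta encoding followed by generic compression.
generic_only = zlib.compress(pack_i64(timestamps))
delta_then_generic = zlib.compress(pack_i64(delta_encode(timestamps)))

print(len(generic_only), len(delta_then_generic))
```

The deltas form a short repeating pattern, which the generic compressor handles far better than the raw, ever-changing 64-bit values.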

fibers•4mo ago

By snapshots do you mean the entire orderbook at a specific point in time or the entire history that gets instantiated?

lordnacho•4mo ago

At each point in time, the entire orderbook at that time.

So you could replay the entire history of the book just by stepping through the rows.

kookamamie•4mo ago

NUMA is satan. Source: Working in real-time computer vision.

DeathArrow•4mo ago

>Intel's latest processor generations are pushing the number of cores in a server to unprecedented levels - from 128 P-cores per socket in Granite Rapids to 288 E-cores per socket in Sierra Forest, with future roadmaps targeting 200+ cores per socket.

It seems today's Intel CPU can replace yesteryear's data center.

Maybe someone can try, for fun, running 1000 Red Hat Linux 6.2 instances in parallel on one CPU, like it's the year 2000 again.

adrian_b•4mo ago

Due to a typo, the title is confusing. At first glance I thought that "Intel 280" might be some kind of Arrow Lake CPU (intermediate between Intel 275 and Intel 285), but the correct title should have said "Intel's 288-core processors", making clear that this is about the server CPUs with 288 E-cores, Sierra Forest and the future Clearwater Forest.

scott_w•4mo ago

Same here. I don't know if the linked title changed but it's now:

> Optimizing ClickHouse for Intel's ultra-high core count processors

Which is pretty unambiguous.

sdairs•4mo ago

Yeah "Optimizing ClickHouse for Intel's ultra-high core count processors" is the original, unchanged article title, it's just been submitted slightly differently on HN (surprised mods haven't changed it here actually)

tomhow•4mo ago

Thanks, we've just changed the title to match the article's title, which the guidelines ask us to do.

menaerus•4mo ago

    Two-character SIMD filtering improved performance significantly:
    ClickBench query Q20 sped up by 35%
    Other queries which perform substring matching saw an overall improvement of ~10%
    The geometric mean of all queries improved by 4.1%

The ClickBench dataset is ~70G IIRC, so I find it interesting that they measured such a substantial speedup while only using SSE4.1 (128-bit) - so, not even AVX2 and much less AVX-512. I wonder what the results would be if the latter had been the case.

And I also wonder if this is (partly) an artifact of more laser-focused utilization of a CPU core's ALU and memory subsystem. E.g. packing more work into a single instruction or a pair of instructions now leaves more room for other unrelated instructions to be retired.

kwillets•4mo ago

The SIMD string matching optimization unfortunately missed a trick -- it's more selective to match the first and last characters of a pattern than the first two, and it's the same cost.

Credit to Muła for this one: http://0x80.pl/notesen/2016-11-28-simd-strfind.html#generic-... .
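
A scalar sketch of the first-and-last-character filter, to show the idea without intrinsics. The function name `find_first_last` is made up for illustration; a SIMD version would run the two byte comparisons across a whole vector of candidate positions at once, whereas this loop checks one position at a time.

```python
def find_first_last(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first match of needle in haystack, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    first, last = needle[0], needle[m - 1]
    for i in range(n - m + 1):
        # Cheap filter: compare only the first and last bytes of the window.
        if haystack[i] == first and haystack[i + m - 1] == last:
            # Full comparison only for candidates that passed the filter.
            if haystack[i:i + m] == needle:
                return i
    return -1
```

For example, `find_first_last(b"hello world", b"world")` returns 6. The filter costs two byte comparisons either way; the claim in the comment is that first-plus-last rejects more false candidates than first-plus-second.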

fuy•4mo ago

Looking at the first optimization, I wonder if double-checking after acquiring the exclusive lock brings any performance benefits. The whole premise is that cache access is read-heavy, so not acquiring exclusive locks for reads eliminates by far the biggest problem.

Rare (I presume) cases of overlapping updates from different threads (considering updates themselves are also infrequent) don't seem like a big deal compared to lock elimination. Would be interesting to see benchmark numbers for those optimizations separately.
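
For readers unfamiliar with the pattern being questioned, here is a minimal sketch of a read-mostly cache with a second check under the exclusive lock. It is not ClickHouse's code: `ReadMostlyCache` is a hypothetical name, and because Python has no standard reader/writer lock, the shared-lock read path is replaced by an unlocked dictionary lookup, which is an assumption that relies on CPython's dictionary operations being atomic.

```python
import threading

_MISSING = object()  # sentinel so that None can be cached as a real value


class ReadMostlyCache:
    def __init__(self, loader):
        self._loader = loader
        self._data = {}
        self._lock = threading.Lock()
        self.loads = 0  # counts how often the loader actually ran

    def get(self, key):
        # Fast path: no exclusive lock for the common case of a hit.
        value = self._data.get(key, _MISSING)
        if value is not _MISSING:
            return value
        with self._lock:
            # Second check: another thread may have filled the entry
            # while this one was waiting for the lock.
            value = self._data.get(key, _MISSING)
            if value is _MISSING:
                value = self._loader(key)
                self.loads += 1
                self._data[key] = value
            return value
```

Without the second check, every thread that missed at the same time would run the loader and overwrite the entry. Whether avoiding that is measurable next to the gain from lock-free reads is exactly the open question in the comment.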
