Ask HN: Scheduling stateful nodes when MMAP makes memory accounting a lie

24•leo_e•2mo ago

We’re hitting a classic distributed systems wall and I’m looking for war stories or "least worst" practices.

The Context: We maintain a distributed stateful engine (think search/analytics). The architecture is standard: a Control Plane (Coordinator) assigns data segments to Worker Nodes. The workload involves heavy use of mmap and lazy loading for large datasets.

The Incident: We had a cascading failure where the Coordinator got stuck in a loop, DDOS-ing a specific node.

The Signal: Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."

The Action: Coordinator attempts to rebalance/load new segments onto Node A.

The Reality: Node A is actually sitting at 197GB RAM usage (near OOM). The data on it happens to be extremely wide (fat rows, huge blobs), so its logical row count is low, but physical footprint is massive.

The Loop: Node A rejects the load (or times out). The Coordinator ignores the backpressure, sees the low row count again, and retries immediately.

The Core Problem: We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.

Now we are staring at mmap. Because the OS manages the page cache, the application-level RSS is noisy and doesn't strictly reflect "required" memory vs "reclaimable" cache.

The Question: Attempting to enumerate every resource variable (CPU, IOPS, RSS, Disk, logical count) into a single scoring function feels like an NP-hard trap.

How do you handle placement in systems where memory usage is opaque/dynamic?

Dumb Coordinator, Smart Nodes: Should we just let the Coordinator blind-fire based on disk space, and rely 100% on the Node to return hard 429 Too Many Requests based on local pressure?

Cost Estimation: Do we try to build a synthetic "cost model" per segment (e.g., predicted memory footprint) and schedule based on credits, ignoring actual OS metrics?

Control Plane Decoupling: Separate storage balancing (disk) from query balancing (mem)?

Feels like we are reinventing the wheel. References to papers or similar architecture post-mortems appreciated.

Comments

otterley•2mo ago

It's not clear whether you're using Kubernetes, but the Kubernetes way of dealing with this problem is to declare a memory reservation (i.e., a request) along with the container specification. The amount of the reservation will be deducted from the host's available memory for scheduling purposes, regardless of whether the container actually consumes the reserved amount. It's also a best practice to configure the memory limit to be identical to the reservation, so if the container exceeds the reserved amount, the kernel will terminate it via the OOM killer.

Of course, for this to work, you have to figure out what that reserved amount should be. That is an exercise for the implementer (i.e., you).

See https://kubernetes.io/docs/concepts/configuration/manage-res...
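For a custom coordinator that is not running on Kubernetes, the same reservation idea can be sketched roughly as below. This is only an illustration: the `Node` class, the `place` function, and the byte figures are hypothetical, and the per-segment reservation still has to come from your own estimate.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity_bytes: int            # memory the scheduler may hand out on this host
    reserved_bytes: int = 0        # sum of reservations already placed here
    segments: list = field(default_factory=list)

    @property
    def free_bytes(self) -> int:
        # Unreserved memory, regardless of what the OS currently reports as used.
        return self.capacity_bytes - self.reserved_bytes


def place(segment_id: str, request_bytes: int, nodes: list):
    """Place a segment on the node with the most unreserved memory.

    Returns the chosen Node, or None when no node can fit the reservation.
    """
    candidates = [n for n in nodes if n.free_bytes >= request_bytes]
    if not candidates:
        return None  # caller should queue or scale out, not retry in a loop
    best = max(candidates, key=lambda n: n.free_bytes)
    best.reserved_bytes += request_bytes
    best.segments.append(segment_id)
    return best
```

The reservation is deducted whether or not the segment ever touches that much memory, which is what makes the accounting independent of page cache noise.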

> Attempting to enumerate every resource variable (CPU, IOPS, RSS, Disk, logical count) into a single scoring function feels like an NP-hard trap.

Yeah, don't do that. Figure out what resources your applications need and then declare them, and let the scheduler find the best node based on the requirements you've specified.

> We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.

A few things come to mind...

First, you're talking about a load balancer, but it's not clear that you're trying to balance load! A good metric to use for load balancing is one whose value is proportional to response latency.

It smells like you're trying to provision resources based on an optimistic prediction of your working set size. Perhaps you need a more pessimistic prediction. It might also be that you're relying too heavily on the kernel to handle paging, when what you really need is a cache tuned for your application that is scan-resistant, coupled with O_DIRECT for I/O.

majke•2mo ago

> Coordinator sees Node A has significantly fewer rows (logical count) than the cluster average. It flags Node A as "underutilized."

Ok, so you are dealing with a classic - you measure A, but what matters is B. For "load" balancing a decent metric is, well, response time (and jitter).

For data partitioning - I guess number of rows is not the right metric? Change it to number*avg_size or something?
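A rough sketch of the number*avg_size idea, assuming the coordinator already knows a row count and an average row size per segment (the function names and the input shape here are made up for illustration):

```python
def estimated_bytes(row_count: int, avg_row_bytes: float) -> float:
    """Approximate physical footprint of one segment."""
    return row_count * avg_row_bytes


def least_loaded(nodes: dict) -> str:
    """Return the node with the smallest estimated footprint.

    `nodes` maps a node name to a list of (row_count, avg_row_bytes)
    tuples, one tuple per segment hosted on that node.
    """
    totals = {
        name: sum(estimated_bytes(rows, size) for rows, size in segments)
        for name, segments in nodes.items()
    }
    return min(totals, key=totals.get)
```

With this metric a node holding a few very wide rows no longer looks empty just because its row count is low.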

If you can't measure the thing directly, then take a look at stuff like "PID controller". This can be approached as a typical control loop problem, although in 99% of cases doing PID for software systems is overkill.

leo_e•2mo ago

The trouble with mmap is the performance cliff. A node goes from 'fine' to 'dead' almost instantly, which breaks our balancing logic.

You are right that we need better backpressure. Instead of a smarter coordinator, we probably need 'dumber' nodes that aggressively shed load (return 429s) the moment local pressure spikes, rather than waiting for a re-balance.
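A minimal sketch of what that could look like on both sides, assuming the node can reduce its local pressure to a single number and the coordinator keeps per-node backoff state. The class names, the pressure scale, and the delay values are all hypothetical.

```python
import random


class NodeAdmission:
    """Node side: refuse new segments while local pressure is high."""

    def __init__(self, pressure_limit: float):
        self.pressure_limit = pressure_limit

    def handle_load_request(self, current_pressure: float) -> int:
        # 429 mirrors the "Too Many Requests" status mentioned in the thread.
        return 429 if current_pressure >= self.pressure_limit else 200


class Backoff:
    """Coordinator side: exponential backoff with jitter, tracked per node."""

    def __init__(self, base_s: float = 1.0, cap_s: float = 300.0):
        self.base_s = base_s
        self.cap_s = cap_s
        self.failures = {}     # node -> consecutive rejections
        self.not_before = {}   # node -> earliest time of the next attempt

    def record(self, node: str, status: int, now: float) -> None:
        if status == 429:
            n = self.failures.get(node, 0) + 1
            self.failures[node] = n
            delay = min(self.cap_s, self.base_s * 2 ** (n - 1))
            # Jitter keeps several coordinators from retrying in lockstep.
            self.not_before[node] = now + random.uniform(delay / 2, delay)
        else:
            self.failures.pop(node, None)
            self.not_before.pop(node, None)

    def eligible(self, node: str, now: float) -> bool:
        return now >= self.not_before.get(node, 0.0)
```

The point of the coordinator half is that a rejection changes what happens next, so a node that keeps saying no gets asked less and less often instead of being retried immediately.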

bcoates•2mo ago

Memory pressure (and a lot of other overload conditions) usually makes latency worse--does that show up in your system? Latency backpressure is a pretty conventional thing to do. You're going to want some way to close the loop back to your load balancer, if you're doing open-loop control (sending a "fair share" of traffic to each node and assuming it can handle it) issues like you describe will keep coming up.

This is a Hard Problem and you might be trying to get away with an unrealistically small amount of overprovisioning.

wmf•2mo ago

Have you measured Pressure Stall Information or active pages from /proc/meminfo?

> Attempting to enumerate every resource variable (CPU, IOPS, RSS, Disk, logical count) into a single scoring function feels like an NP-hard trap.

That's perfect for machine learning.

leo_e•2mo ago

I admit PSI wasn't on our radar for this specific issue. We've been staring at RSS and page fault counters, but they are indeed too noisy in an mmap-heavy workload.

Checking /proc/pressure/memory to distinguish between 'healthy caching' and 'thrashing' sounds exactly like the signal we are missing. We will try to incorporate some pressure metrics into the node's health report. Thanks for the pointer.
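For reference, a small sketch of reading that file, assuming a Linux kernel with PSI enabled. The file has a `some` line and a `full` line, each with `avg10`, `avg60`, `avg300`, and `total` fields. The `is_thrashing` threshold is a made-up placeholder, not a recommended value.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a PSI file such as /proc/pressure/memory.

    Returns {"some": {"avg10": float, ..., "total": int}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *pairs = line.split()
        fields = {}
        for pair in pairs:
            key, value = pair.split("=", 1)
            # "total" is cumulative stall time in microseconds; the rest are percentages.
            fields[key] = int(value) if key == "total" else float(value)
        result[kind] = fields
    return result


def is_thrashing(psi: dict, full_avg10_limit: float = 10.0) -> bool:
    # Hypothetical rule: treat the node as thrashing when all non-idle tasks
    # were stalled on memory for at least this share of the last 10 seconds.
    return psi.get("full", {}).get("avg10", 0.0) >= full_avg10_limit


def read_memory_psi(path: str = "/proc/pressure/memory") -> dict:
    with open(path) as f:
        return parse_psi(f.read())
```

A node could include the result of `read_memory_psi()` in its health report and leave the interpretation to whichever side owns the threshold.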

But still, too many metrics for us to balance

man8alexd•2mo ago

Don't use Active/Inactive pages from /proc/meminfo. They don't represent the actual size of active/inactive memory.

shanemhansen•2mo ago

This actually seems like a simple example of memory request vs limit.

Request the amount of memory needed to be healthy, you can potentially set the limit higher to account for "reclaimable cache".

Another way to approach it, if you find that there are too many limiting metrics to accurately model things, is to let the workers grab more segments until you determine that they are overloaded. Ideally, for this to work, you have some idea that the node is approaching saturation. So for example: keep adding segments as long as the nth percentile response time is under some threshold.

The advantage of this approach is you don't necessarily have to know which resource (memory, filehandles, etc) is at capacity. You don't even necessarily have to have deep knowledge of linux memory management. You just have to be able to probe the system to determine if it's healthy.
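A minimal sketch of such a health probe, assuming the node keeps a window of recent response times in milliseconds. The percentile, the threshold, and the minimum sample count are placeholder values.

```python
def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile (pct in 1..100) of a non-empty list."""
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def can_take_more(latencies_ms: list, pct: int = 99,
                  threshold_ms: float = 250.0, min_samples: int = 100) -> bool:
    """True when the node looks healthy enough to accept another segment."""
    if len(latencies_ms) < min_samples:
        return False  # not enough evidence yet, so stay conservative
    return percentile(latencies_ms, pct) < threshold_ms
```

Nothing in the probe knows which resource is the bottleneck; it only looks at the symptom.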

It can even go backwards with a binary split mechanism. You bring up a node that owns [A-H] (8 segments in this case). If that fails, bring up 2 nodes that own [A-D] and [E-H]; if that fails, keep splitting, all the way down to one segment per node.
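The binary split can be sketched as a short recursion. Here `fits` is a hypothetical callback standing in for "bring up a node with these segments and see whether it stays healthy".

```python
def plan_partitions(segments: list, fits) -> list:
    """Halve `segments` until each group passes `fits` or holds one segment."""
    if len(segments) <= 1 or fits(segments):
        return [segments]
    mid = len(segments) // 2
    return (plan_partitions(segments[:mid], fits)
            + plan_partitions(segments[mid:], fits))
```

For example, with eight segments where only A is heavy, the heavy half keeps splitting while the light half stays on one node:

```python
sizes = {"A": 5, "B": 1, "C": 1, "D": 1, "E": 1, "F": 1, "G": 1, "H": 1}
plan = plan_partitions(list("ABCDEFGH"), lambda group: sum(sizes[s] for s in group) <= 4)
# plan == [["A"], ["B"], ["C", "D"], ["E", "F", "G", "H"]]
```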

man8alexd•2mo ago

mmap'ed memory counts as that "reclaimable cache", which isn't always reclaimable (dirty or active pages are not immediately reclaimable). But Kubernetes memory accounting assumes that the page cache is always reclaimable. This creates a lot of surprises and unexpected OOMs. https://github.com/kubernetes/kubernetes/issues/43916

bcrl•2mo ago

There's a simple solution: don't use mmap(). There's a reason that databases use O_DIRECT to read into their own in memory cache. If it was Good Enough for Oracle in the 1990s, it's probably Good Enough for you.

mmap() is one of those things that looks like it's an easy solution when you start writing an application, but that's only because you don't know the complexity time bomb of what you're undertaking.

The entire point of the various ways of performing asynchronous disk I/O using APIs like io_uring is to manage when and where blocking of tasks for I/O occurs. When you know where blocking I/O gets done, you can make it part of your main event loop.

If you don't know when or where blocking occurs (be it on I/O or mutexes or other such things), you're forced to make up for it by increasing the size of your thread pool. But larger thread pools come with a penalty: task switches are expensive! Scheduling is expensive! AVX 512 registers alone are 2KB of state per task, and if a thread hasn't run for a while, you're probably missing on your L1 and L2 caches. That's pure overhead baked into the thread pool architecture that you can entirely avoid by using an event driven architecture.

All the high performance systems I've worked on use event driven architectures -- from various network protocol implementations (protocols like BGP on JunOS, the HA functionality) to high speed (persistent and non-persistent) messaging (at Solace). It just makes everything easier when you're able to keep threads hot and locked to a single core. Bonus: when the system is at maximum load, you remain at pretty much the same number of requests per second rather than degrading as the number of threads ready to run starts increasing and wasting your CPU resources needlessly when you need them most.

It's hard to believe that the event queue architecture I first encountered on an Amiga in the late 1980s when I was just a kid is still worth knowing today.

grep_it•2mo ago

Relevant: https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf

man8alexd•2mo ago

There is a database that uses `mmap()` - RavenDB. Their memory accounting is utter horror - they somehow use Committed_AS from /proc/meminfo in their calculations. Their recommendation to avoid OOMs is to have swap twice the size of RAM. Their Jepsen test results are pure comedy.

otterley•2mo ago

LMDB uses mmap() as well, but it only supports one process holding the database open at a time. It's also not intended for working sets larger than available RAM.

hyc_symas•2mo ago

Wrong, LMDB fully supports multiprocess concurrency as well as DBs multiple orders of magnitude larger than RAM. Wherever you got your info from is dead wrong.

Among embedded key/value stores, only LMDB and BerkeleyDB support multiprocess access. RocksDB, LevelDB, etc. are all single process.

otterley•2mo ago

My mistake. Doesn’t it have a global lock though?

Also, even if LMDB supports databases larger than RAM, that doesn’t mean it’s a good idea to have a working set that exceeds that size. Unless you’re claiming it’s scan resistant?

hyc_symas•2mo ago

It has a single writer transaction mutex, yes. But it's a process-shared mutex, so it will serialize write transactions across an arbitrary number of processes. And of course, read transactions are completely lockfree/waitfree across arbitrarily many processes.

As for working set size, that is always merely the height of the B+tree. Scans won't change that. It will always be far more efficient than any other DB under the same conditions.

otterley•2mo ago

> As for working set size, that is always merely the height of the B+tree.

This statement makes no sense to me. Are you using a different definition of "working set" than the rest of us? A working set size is application and access pattern dependent.

> It will always be far more efficient than any other DB under the same conditions

That depends on how broadly or narrowly one defines "same conditions" :-)

hyc_symas•2mo ago

Identical hardware, same RAM size, same data volume.

otterley•2mo ago

That’s a bold claim. Are you saying that LMDB outperforms every other database on the same hardware, regardless of access pattern? And if so, is there proof of this?

hyc_symas•2mo ago

Plenty of proof. http://www.lmdb.tech/bench/

otterley•2mo ago

Since the first question of my two-part inquiry was not explicitly answered in the affirmative: To be absolutely clear, you are claiming, in writing, that LMDB outperforms every other database there is, regardless of access pattern, using the same hardware?

hyc_symas•2mo ago

Not every.

LMDB is optimized for read-heavy workloads. I make no particular claims about write-heavy workloads.

Because it's so efficient, it can retain more useful data in-memory than other DBs for a given RAM size. For DBs much larger than RAM it will get more useful work done with the available RAM than other DBs. You can examine the benchmark reports linked above, they provide not just the data but also the analysis of why the results are as they are.

hyc_symas•2mo ago

You don't have to take my word for it. Plenty of other developers know. https://www.youtube.com/watch?v=CfiQ0h4bGWM

leo_e•2mo ago

You're right. O_DIRECT is the endgame, but that's a full engine rewrite for us.

We're trying to stabilize the current architecture first. The complexity of hidden page fault blocking is definitely what's killing us, but we have to live with mmap for now.

bcrl•2mo ago

I am curious -- what is the application and the language it's written in?

There are insanely dirty hacks that you could do to start controlling the fallout of the page faults (like playing games with userfaultfd), but they're unmaintainable in the long term as they introduce a fragility that results in unexpected complexity at the worst possible times (bugs). Rewriting / refactoring is not that hard once one understands the pattern, and I've done that quite a few times. Depending on the language, there may be other options. Doing an mlock() on the memory being used could help, but then it's absolutely necessary to carefully limit how much memory is pinned by such mappings.

Having been a kernel developer for a long time makes it a lot easier to spot what will work well for applications versus what can be considered glass jaws.

man8alexd•2mo ago

I can suggest measuring working set size (WSS) instead of RSS. See https://docs.kernel.org/admin-guide/mm/multigen_lru.html and https://docs.kernel.org/mm/damon/index.html

toast0•2mo ago

This is where having a little bit of swap can help you out. Not because you need swap, but because swap use % and swap I/O rates are good indicators. Something like 512 MB to maybe 1 G; not something like 2x your memory (unless you're on a very small system, and then use min(2x memory, 512 MB)); having too much swap extends the amount of time your system can be swapping to death before it actually dies.

If your swap use jumps 10 points in a small time frame, you are running out of memory quickly. If your swap use hits 50% or 80% or [whatever threshold] without any big jumps, you're running out of memory slowly.

If your swap I/O is all output, not a huge deal... you're swapping stuff you never read. If you've got a lot of swapping in, chances are you're swapping to death.
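A rough sketch of turning those swap rules of thumb into a check, assuming two samples of /proc/meminfo taken a short interval apart. `SwapTotal` and `SwapFree` are the real field names; the function names and the default thresholds (10 points, 50%) are just placeholders taken from the examples above.

```python
def swap_used_percent(meminfo_text: str) -> float:
    """Compute swap usage in percent from the text of /proc/meminfo."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = values.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - values.get("SwapFree", 0)) / total


def classify(prev_pct: float, curr_pct: float,
             jump_points: float = 10.0, slow_limit_pct: float = 50.0) -> str:
    """Compare two consecutive samples of swap usage."""
    if curr_pct - prev_pct >= jump_points:
        return "running out fast"
    if curr_pct >= slow_limit_pct:
        return "running out slowly"
    return "ok"
```

Swap-in versus swap-out rates would need a second source (the `pswpin` and `pswpout` counters in /proc/vmstat), which this sketch leaves out.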

> The Core Problem: We are trying to write a "God Equation" for our load balancer. We started with row_count, which failed. We looked at disk usage, but that doesn't correlate with RAM because of lazy loading.

I'm a big fan of straight-up even distribution of requests. It's simple and predictable; although it's not going to get you the best throughput, predictability and simplicity are often better than perfection. If you always send each node 1/Nth of requests, the worst case for a node that is broken but looks up is that you're still sending it a share when it should get nothing. With some sort of utilization-based metric, a node that looks underutilized because it's just dropping requests and responding with success but empty sucks up all your requests. Alternatively, people have good results selecting M nodes by metrics and then choosing randomly among those. But also, IMHO, you want to reduce the work your load balancer(s) do, because load balancing load balancers is hard.
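The "select M nodes by metrics, then pick randomly" variant is small enough to sketch. The load values and the `pick_node` name are hypothetical, and any metric where lower means less loaded would work.

```python
import random


def pick_node(load_by_node: dict, m: int = 2, rng=random) -> str:
    """Pick uniformly at random among the m least-loaded nodes."""
    if not load_by_node:
        raise ValueError("no nodes to choose from")
    best = sorted(load_by_node, key=load_by_node.get)[:max(1, m)]
    return rng.choice(best)
```

The random step is what limits the damage from a node that only looks idle: it can attract at most about 1/M of the traffic instead of all of it.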
