How We Found 7 TiB of Memory Just Sitting Around

https://render.com/blog/how-we-found-7-tib-of-memory-just-sitting-around

120•anurag•1d ago

Comments

shanemhansen•1d ago

The unreasonable effectiveness of profiling and digging deep strikes again.

hinkley•7h ago

The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.

zahlman•6h ago

> The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.

The sympathy is also needed. Problems aren't found when people don't care, or consider the current performance acceptable.

> There’s about a factor of 3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried.

It's hard for profilers to identify slowdowns that are due to the architecture. Making the function do less work to get its result feels different from determining that the function's result is unnecessary.

hinkley•5h ago

Architecture, cache eviction, memory bandwidth, thermal throttling.

All of which have gotten perhaps an order of magnitude worse in the time since I started on this theory.

Negitivefrags•6h ago

I think the biggest tool is higher expectations. Most programmers really haven't come to grips with the idea that computers are fast.

If you see a database query that takes 1 hour to run, and only touches a few GB of data, you should be thinking "Well, NVMe bandwidth is multiple gigabytes per second, why can't it run in 1 second or less?"
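A rough sketch of that back-of-envelope estimate. The 4 GB data size and 3 GB/s read bandwidth are assumed figures chosen only for illustration, and the result is a lower bound that ignores parsing, indexing, and random access:

```python
def scan_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to read data_gb once at a sustained bandwidth."""
    return data_gb / bandwidth_gb_per_s


# Assumed figures for illustration: 4 GB of data, 3 GB/s sequential read.
estimate = scan_seconds(4.0, 3.0)
print(f"{estimate:.2f} s")  # about 1.33 s, versus the 3600 s the query took
```

A gap of three orders of magnitude between the estimate and the observed time is the signal that something other than raw I/O dominates.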

The idea that anyone would accept a request to a website taking longer than 30ms (the time it takes for a game to render its entire world, including both the CPU and GPU parts, at 60fps) is insane, and nobody should really accept it, but we commonly do.

javier2•5h ago

It's also about cost. My game computer has 8 cores + 1 expensive GPU + 32GB RAM for me alone. We don't have that per customer.

avidiax•5h ago

It's also about revenue.

Uber could run the complete global rider/driver flow from a single server.

It doesn't, in part because all of those individual trips earn $1 or more each, so it's perfectly acceptable to the business to be more inefficient and use hundreds of servers for this task.

Similarly, a small website taking 150ms to render the page only matters if the lost productivity costs more than the engineering time to fix it, and even then, fixing it only makes sense if that engineering time isn't more productively used to add features or reliability.

oivey•5h ago

This is again a problem understanding that computers are fast. A toaster can run an old 3D game like Quake at hundreds of FPS. A website primarily displaying text should be way faster. The reasons websites often aren’t have nothing to do with the user’s computer.

paulryanrogers•4h ago

That's a dedicated toaster serving only one client. Websites usually aren't backed by bare metal per visitor.

oivey•4h ago

Right. I’m replying to someone talking about their personal computer.

Aeolun•1h ago

If your websites take less than 16ms to serve, you can serve 60 customers per second with that. So you sorta do have it per customer?

vlovich123•32m ago

That’s per core assuming the 16ms is CPU bound activity (so 100 cores would serve 100 customers). If it’s I/O you can overlap a lot of customers since a single core could easily keep track of thousands of in flight requests.
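A minimal sketch of the distinction, assuming a 16 ms request. The CPU-bound case scales with core count, while the I/O-bound case follows Little's law (throughput = requests in flight / latency). The figure of 1000 in-flight requests is an assumption for illustration:

```python
def cpu_bound_rps(cores: int, cpu_seconds_per_request: float) -> float:
    """Each core can only work on one CPU-bound request at a time."""
    return cores / cpu_seconds_per_request


def io_bound_rps(in_flight: int, latency_seconds: float) -> float:
    """Little's law: throughput = concurrency / latency."""
    return in_flight / latency_seconds


# One core, 16 ms of pure CPU work per request.
print(cpu_bound_rps(1, 0.016))     # 62.5 requests per second
# One core juggling 1000 requests that each spend 16 ms waiting on I/O.
print(io_bound_rps(1000, 0.016))   # 62500.0 requests per second
```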

azornathogron•5h ago

Pedantic nit: At 60 fps the per frame time is 16.66... ms, not 30 ms. Having said that a lot of games run at 30 fps, or run different parts of their logic at different frequencies, or do other tricks that mean there isn't exactly one FPS rate that the thing is running at.
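The arithmetic, as a small sketch:

```python
def frame_time_ms(fps: float) -> float:
    """Time budget for a single frame at a given frame rate."""
    return 1000.0 / fps


print(f"{frame_time_ms(60):.2f} ms")  # 16.67 ms
print(f"{frame_time_ms(30):.2f} ms")  # 33.33 ms
```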

Negitivefrags•5h ago

The CPU part happens on one frame, the GPU part happens on the next frame. If you want to talk about the total time for a game to render a frame, it needs to count two frames.

wizzwizz4•4h ago

Computers are fast. Why do you accept a frame of lag? The average game for a PC from the 1980s ran with less lag than that. Super Mario Bros had less than a frame between controller input and character movement on the screen. (Technically, it could be more than a frame, but only if there were enough objects in play that the processor couldn't handle all the physics updates in time and missed the v-blank interval.)

Negitivefrags•3h ago

If Vsync is on, which was my assumption in my previous comment, then with a fast enough computer you might be able to run the CPU and GPU work entirely in a single frame by using Reflex to delay when simulation starts, which lowers latency. Regardless, you still have a total time budget of 1/30th of a second for all your combined CPU and GPU work to get to 60fps.

hinkley•5h ago

Lowered expectations come in part from people giving up on theirs. Accepting versus pushing back.

antonymoose•5h ago

I have high hopes and expectations, unfortunately my chain of command does not, and is often an immovable force.

hinkley•4h ago

This is a terrible time to tell someone to find a movable object in another part of the org or elsewhere. :/

I always liked Shaw’s “The reasonable man adapts himself to the world: the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man.”

jesse__•5h ago

Broadly agree.

I'm curious, what're the profilers you know of that tried to be better? I have a little homebrew game engine with an integrated profiler that I'm always looking for ideas to make more effective.

hinkley•5h ago

Clinic.js tried and lost steam. I have a recollection of a profiler called JProfiler that represented space and time as a graph, but also a recollection they went under. And there is a company selling a product of that name that has been around since that time, but doesn’t quite look how I recalled and so I don’t know if I was mistaken about their demise or I’ve swapped product names in my brain. It was 20 years ago which is a long time for mush to happen.

The common element between attempts is new visualizations. And like drawing a projection of an object in a mechanical engineering drawing, there is no one projection that contains the entire description of the problem. You need to present several and let the brain synthesize the data missing in each individual projection into an accurate model.

nitinreddy88•1d ago

The other way to look at it is to ask why adding the NS label causes such a large memory footprint in Kubernetes. Wouldn't fixing that (which could be a much bigger design change) benefit the whole Kube community?

bstack•12h ago

Author here: yeah that's a good point. tbh I was mostly unfamiliar with Vector so I took the shortest path to the goal but that could be an interesting followup. It does seem like there's a lot of bytes per namespace!

stackskipton•1h ago

You mentioned in the blog article that it's doing list-watch. List-watch registers with the Kubernetes API to get a list of all objects AND to get a notification when anything changes in the objects you have registered for. A bunch of Vector pods saying "Hey, send me a notification when anything with namespaces changes" and poof goes your memory, keeping track of who needs to know what.
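A toy model of that list-then-watch pattern, to show how cached state multiplies with the number of watchers. `ToyApiServer`, `ToyWatcher`, and `total_cached_entries` are hypothetical names made up for this sketch (Python 3.9+); none of this is the Kubernetes or Vector API:

```python
from dataclasses import dataclass, field


@dataclass
class ToyWatcher:
    # Each watcher keeps its own full local copy of what it watches.
    cache: dict = field(default_factory=dict)


@dataclass
class ToyApiServer:
    objects: dict = field(default_factory=dict)
    watchers: list = field(default_factory=list)

    def list_watch(self, watcher: ToyWatcher) -> None:
        # LIST: hand over a snapshot of every current object.
        watcher.cache = {name: dict(obj) for name, obj in self.objects.items()}
        # WATCH: remember the watcher so later changes are pushed to it.
        self.watchers.append(watcher)

    def upsert(self, name: str, obj: dict) -> None:
        self.objects[name] = obj
        for watcher in self.watchers:
            watcher.cache[name] = dict(obj)  # every watcher stores a copy


def total_cached_entries(watchers: list) -> int:
    return sum(len(w.cache) for w in watchers)


server = ToyApiServer()
for ns in ("team-a", "team-b", "team-c"):
    server.upsert(ns, {"labels": {"owner": ns}})

watchers = [ToyWatcher() for _ in range(4)]
for w in watchers:
    server.list_watch(w)

print(total_cached_entries(watchers))  # 12: 4 watchers x 3 namespaces
```

In this model the total cached state grows as watchers times objects, so one watcher per node in a large cluster with many namespaces adds up quickly.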

At this point, I wonder if instead of relying on daemonsets, you just gave every namespace a vector instance that was responsible for that namespace and pods within. ElasticSearch or whatever you pipe logging data to might not be happy with all those TCP connections.

Just my SRE brain thoughts.

fells•34m ago

>you just gave every namespace a vector instance that was responsible for that namespace and pods within.

Vector is a daemonset, because it needs to tail the log files on each node. A single vector per namespace might not reside on the nodes that each pod is on.

hinkley•7h ago

Keys require O(log n) space per key, or O(n log n) for the entire data set, simply to avoid key collisions. But human friendly key spaces grow much, much faster and I don’t think many people have looked too hard at that.
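A quick sketch of that gap. `min_key_bits` computes the information-theoretic minimum for n distinct keys, and the label string is a made-up example of a human-friendly key:

```python
def min_key_bits(n: int) -> int:
    """Smallest number of bits that can distinguish n distinct keys."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


# One million distinct keys need only 20 bits each.
print(min_key_bits(1_000_000))  # 20

# A hypothetical human-friendly key, for comparison.
readable_key = "kubernetes_namespace=production-us-east-1"
readable_bits = len(readable_key.encode("utf-8")) * 8
print(readable_bits > min_key_bits(1_000_000))  # True
```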

There were recent changes to the NodeJS Prometheus client that eliminate tag names from the keys used for storing the tag cardinality for metrics. The memory savings weren’t reported but the CPU savings for recording data points was over 1/3. And about twice that when applied to the aggregation logic.
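A sketch of the general idea only, written in Python rather than taken from the NodeJS client: if a metric's label names are fixed and stored once, the per-series key can hold just the values in a known order. `LABEL_NAMES`, `key_with_names`, and `key_values_only` are hypothetical names:

```python
# Label names are fixed per metric, so they can be stored once.
LABEL_NAMES = ("method", "route", "status")


def key_with_names(labels: dict) -> str:
    """Key that repeats every label name for every series."""
    return ",".join(f"{name}:{labels[name]}" for name in LABEL_NAMES)


def key_values_only(labels: dict) -> str:
    """Key that relies on the fixed name order and stores only values."""
    return ",".join(str(labels[name]) for name in LABEL_NAMES)


labels = {"method": "GET", "route": "/users", "status": 200}
print(key_with_names(labels))   # method:GET,route:/users,status:200
print(key_values_only(labels))  # GET,/users,200
```

Shorter keys mean less string building and hashing on every recorded data point, which is one plausible source of the CPU savings described.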

Lookups are rarely O(1), even in hash tables.

I wonder if there’s a general solution for keeping names concise without triggering transposition or reading comprehension errors. And what the space complexity is of such an algorithm.

vlovich123•3m ago

Why aren’t keys just 128-bit UUIDs? Those are guaranteed to be globally unique and don’t require so much space.
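For reference, a small sketch of what a UUID costs in each form, using Python's standard `uuid` module. Note that random (version 4) UUIDs are unique with overwhelming probability rather than by strict guarantee:

```python
import uuid

key = uuid.uuid4()

binary_size = len(key.bytes)  # raw form: 16 bytes, i.e. 128 bits
text_size = len(str(key))     # canonical hyphenated form: 36 characters

print(binary_size, text_size)  # 16 36
```

The saving only materializes if the key is stored in its binary form; the textual form is longer than many short human-readable names.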

Aeolun•1h ago

I read this and I have to wonder, did anyone ever think it was reasonable that a cluster that apparently needed only 120 GB of memory was consuming 1.2 TB just for logging (or whatever Vector does)?
