You want to compare the _maximum_ memory under load - the high-water mark. If you combine Wozz with a benchmark to drive heavy traffic, you could more confidently say any unused memory was truly wasted.
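Something like this is what I have in mind, sampling kubectl top while the load test runs and keeping the peak (assumes metrics-server is installed, memory is reported in Mi, and the pod name is a placeholder):

    # Poll usage every 5s during the load test and record the high-water mark; Ctrl-C to stop
    peak=0
    while sleep 5; do
      cur=$(kubectl top pod my-app-pod --no-headers | awk '{print $3}' | sed 's/Mi$//')
      (( cur > peak )) && peak=$cur
      echo "current=${cur}Mi  peak=${peak}Mi"
    done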
Note: the tool calls out to Microsoft, but the docs don't say what data is shared ("telemetry"). Better make sure your firewalls are functioning before running this tool.
Not sure what the source license story is.
Yeah, defaults are the enemy here. Most of the waste I'm seeing in the data comes from generic Spring Boot apps running with out-of-the-box settings, where the JVM assumes it owns the entire node.
    curl -o wozz-audit.sh https://wozz.io/audit.sh
    cat wozz-audit.sh
    bash wozz-audit.sh
That definitely doesn't work on `curl 8.7.1 (x86_64-apple-darwin25.0) libcurl/8.7.1 (SecureTransport)` because you return a 307 with 'Redirecting...' in the body, so `wozz-audit.sh` is just the string 'Redirecting...'. I have to pass curl the `-L` flag to follow the redirect. Then everything works.

In the old days you could tune IIS + the garbage collector manually to get similar behaviour, but it was usually not worth it. Time was better spent optimizing other things in the pipe and living with the GC freezes. I suspect GC hiccups should now be much smaller/shorter too, as the overall load will be lower with the new GC.
Curl-bash without hash-checking from a four-month-old domain with full k8s cluster access? All the identities connected are brand new, including the brand-new HN account that posted this. There are four commits in the repo, and three are back-dated to exactly 1 year ago.
As for the newness: I just open-sourced this from my personal scripts collection this week, so yes, the org and account are new. It runs entirely locally using your active kubeconfig; it doesn't collect credentials or send secrets anywhere. You can cat the script to verify that it's just a wrapper around kubectl top and kubectl get.
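If you're wary of curl | bash, the cautious flow is roughly this (note the -L, which matters because the URL currently answers with a redirect):

    # Download (following redirects), read the script, then decide whether to run it
    curl -fsSL https://wozz.io/audit.sh -o wozz-audit.sh
    less wozz-audit.sh
    bash wozz-audit.sh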
> Memory waste: request - actual usage [0]
Memory "requests" are hints to the kube-scheduler for placement, not a target for expected usage.
> # Memory over-provisioning: limit > 2x request [1]
Memory limits are for enforcement, typically deciding when to invoke the OOM killer.
Neither placement hints nor OOM-killing limits should have anything to do with normal operating parameters.
> The memory request is mainly used during (Kubernetes) Pod scheduling. On a node that uses cgroups v2, the container runtime might use the memory request as a hint to set memory.min and memory.low [2]
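To make the distinction concrete, a minimal sketch (placeholder names, nothing to do with the tool) sets the request near the steady-state working set and keeps the limit only as an enforcement ceiling:

    # Request ~= steady-state working set (scheduling hint); limit = OOM-kill ceiling
    kubectl set resources deployment/my-app \
      --requests=memory=512Mi \
      --limits=memory=1Gi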
By choosing to label the delta between these two as "waste", you will absolutely suffer from Goodhart's law: you will teach your dev team not just to request memory but to allocate it and never free it, so that they fit inside this invalid metric's assumptions.
It is going to work against the more reasonable goals: having developers set their limits as low as possible without negative effects, protecting the node and the pod from memory leaks, and still gaining the advantages of over-provisioning, which is where the big gains are to be made.
[0] https://github.com/WozzHQ/wozz/blob/main/scripts/wozz-audit....
[1] https://github.com/WozzHQ/wozz/blob/main/scripts/wozz-audit....
[2] https://kubernetes.io/docs/concepts/configuration/manage-res...
If I request 8GB for a pod that uses 1GB, the autoscaler spins up nodes to accommodate that 8GB reservation. That 7GB gap is capacity the company is paying for but cannot use for other workloads.
Valid point on Goodhart's law, though the goal shouldn't be to fill the RAM, but rather to lower the request to match the working set so we can bin-pack tighter.
A 64GB instance in Azure is $156.28/month reserved. That's $2.44/GB/month.
Let's say you use an extra 4GB of RAM per replica as a safety margin (or out of laziness), and you have 50 replicas. That's $488.38 per month 'wasted', or about $5,860/year.
You'll never recoup the money it takes to get those replicas perfectly sized. Just give every app 4GB more RAM than you think it needs, and move on with your life.
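(Back-of-the-envelope check of those figures, taking the instance price at face value:)

    awk 'BEGIN {
      per_gb  = 156.28 / 64        # $/GB/month for the 64GB reserved instance
      monthly = per_gb * 4 * 50    # 4GB of slack across 50 replicas
      printf "per GB: $%.2f   monthly: $%.2f   yearly: $%.2f\n", per_gb, monthly, monthly * 12
    }'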
wozzio•5h ago
I wrote a simple CLI tool (bash wrapper around kubectl) to automate diffing kubectl top metrics against the declared requests in the deployment YAML.
I ran it across ~500 pods in production. The average "waste" (allocated vs. used) by language was interesting:
Python: ~60% waste (Mostly sized for startup spikes, then idles empty).
Java: ~48% waste (Devs seem terrified to give the JVM less than 4Gi).
Go: ~18% waste.
The tool is called Wozz. It runs locally, installs no agents, and just uses your current kubecontext to find the gap between what you pay for (Requests) and what you use (Usage).
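The core comparison boils down to something like this (simplified sketch, not the actual script; assumes metrics-server is installed):

    # For each pod: declared memory request vs. live usage reported by metrics-server
    NS="${1:-default}"   # namespace as first argument, otherwise "default"
    for pod in $(kubectl get pods -n "$NS" --no-headers -o custom-columns=:metadata.name); do
      req=$(kubectl get pod "$pod" -n "$NS" \
            -o jsonpath='{.spec.containers[0].resources.requests.memory}')
      use=$(kubectl top pod "$pod" -n "$NS" --no-headers | awk '{print $3}')
      echo "$pod  request=${req:-none}  usage=$use"
    done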
It's open source. Feedback welcome.
(Note: The install is curl | bash for convenience, but the script is readable if you want to audit it first).
rvz•5h ago
Correct.
Developers keep over-provisioning because they need enough memory for the app to continue running as demand scales up. Since these languages have their own runtimes and GCs to manage memory, they already pre-allocate lots of RAM before the app does any real work, adding to the bloat.
Part of the problem is not only technical (the language may be bloated and inefficient); it is also psychological, as developers are scared of their app hitting an out-of-memory error in production.
As you can see, the languages with the most waste are the ones that are inefficient in both time (speed) and space (memory), take up the most memory and run slower (Python and Java), and cost a lot of money to keep maintaining.
I got downvoted over questioning the microservice cargo cult [0] with Java being the darling of that cult. If you imagine a K8s cluster with any of these runtimes, you can see which one will cost the most as you scale up with demand + provisioning.
Languages like Go and Rust are the clear winners if you want to save lots of money and are looking for efficiency.
[0] https://news.ycombinator.com/item?id=44950060
PaulKeeble•1h ago
The company wasting 20% extra memory, on the other hand, is still selling and copes with the slower transaction speed just fine.
Not sure over-provisioning memory is really just waste when we have languages with dynamic memory management, which is all modern languages outside real-time safety-critical environments.
gopher_space•1h ago
I'm thinking more like getting a junior to do efficiency passes on a year's worth of data.
cogman10•1h ago
Java will happily eat all the memory you throw at it. The fact that it isn't doing so here means they are probably relying on the default JVM settings. Those are too conservative inside a container, especially on older JVM versions.
What you'll find is that the JVM will constantly show "waste" due to its default configuration. The question is whether it ends up throwing out-of-memory errors due to the lower memory limit.
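The usual mitigation is to size the heap relative to the container's memory limit rather than trusting the defaults; a minimal sketch, assuming a reasonably recent JDK (10+):

    # Cap the heap as a fraction of the container's memory limit instead of the node's RAM
    export JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=50.0"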
If your devs haven't looked at the heap allocation of a running application, there's no way to know if this is too much or too little memory.
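One quick way to get that signal, assuming a JDK ships in the image and the JVM runs as PID 1 (adjust for your setup):

    # Print live heap usage for the JVM inside a running pod (pod name is a placeholder)
    kubectl exec -it my-java-pod -- jcmd 1 GC.heap_info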
Go/python much more readily give memory back to the OS.
karolinepauls•1h ago
I understand we're talking about CPU in case of Python and memory for Java and Go. While anxious overprovisioning of memory is understandable, doing the same for CPU probably means lack of understanding of the difference between CPU limits and CPU requests.
Since I've been out of DevOps for a few years, is there ever a reason not to give each container the ability to spike up to 100% of 1 core? Scheduling of mass container startup should be a solved problem by now.
cogman10•1h ago
Your limit should roughly be "what should this application use if it goes full bore" and your request should be "what does this use at steady state".
At least at my company, the cluster barely uses any CPU even during the busiest hours. We are fairly over provisioned there because a lot of devs are keeping limit and request the same.
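A sketch of that split (placeholder names): request the steady-state CPU, set no CPU limit so the container can burst, and keep only a memory limit as the backstop:

    # Sets a CPU request and a memory limit; no CPU limit is set, so the pod can burst
    kubectl set resources deployment/my-app \
      --requests=cpu=250m,memory=512Mi \
      --limits=memory=1Gi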
wozzio•25m ago
That's why the Python numbers look so bad here: devs set the request high enough to cover that initial model-loading spike so they don't crash during a rollout, even if they idle at 10% usage afterwards.
stefan_•1h ago
This is truly one of the dumbest outcomes of the whole "resume-driven cloud deployment" era. Clouds "market segmenting" with laughable memory, resume engineers wanting their line item but not the recurring cost, engineers wasting weeks investigating and working around out-of-memory issues that come purely from some guy provisioning services with 2003 levels of memory. It's all a giant waste of time.