How about, don't use Kubernetes? The lack of control over where the workload runs is a problem caused by Kubernetes. If you deploy an application as e.g. systemd services, you can pick the optimal host for the workload, and it will not suddenly jump around.
This project literally sets the affinity. That's precisely the control you claim is missing.
You can specify just about anything, including exact nodes, for Kubernetes workloads.
This is just injecting some of that automatically.
I'm not knocking systemd, it's just not relevant.
Being able to move workloads around is kinda the point. The need exists irrespective of what you use to deploy your app.
Any AWS service whose hostname can start resolving to a different AZ (for whatever reason) can benefit from this.
Fine-grained control over workload scheduling is one of K8s's core features?
Affinity, anti-affinity, priority classes, node selectors, scheduling gates - all of which affect scheduling for different use cases, and all under the operator's control.
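As a minimal illustration (not part of this project), pinning a Pod to an exact zone or even an exact node is a one-line nodeSelector; the label keys below are Kubernetes' standard well-known labels, the values are made up.

```python
# Illustrative only: a Pod spec fragment showing manual, fine-grained placement
# control via nodeSelector, using Kubernetes' well-known node labels.
pod_spec = {
    "nodeSelector": {
        # pin to a specific availability zone ...
        "topology.kubernetes.io/zone": "eu-west-1a",
        # ... or even to one exact node (hostname is a placeholder)
        "kubernetes.io/hostname": "ip-10-0-12-34.eu-west-1.compute.internal",
    }
}
```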
> To gather zone information, use this command ...
Why couldn't most of this information be gathered by the lookup service itself? A point could be made about excessive IAM permissions, but the simple case of an RDS reader residing in a given AZ could easily be handled by listing the subnets and finding which one a given IP belongs to.
This service is published more as a concept to build on than as a complete solution.
You wouldn't even need IAM rights to read RDS information, only subnet information. Since subnets are zonal, it doesn't matter whether the service is RDS or Redis/ElastiCache: the IP returned from the hostname lookup, at the time your Pod is scheduled, determines which AZ that Pod should (optimally) be deployed to.
This solution was created in a multi-account AWS environment, where doing describe-subnets API calls across multiple accounts is a hassle. It was "good enough" to have a static mapping of subnets, as they didn't change frequently.
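A minimal sketch of that static approach, using only the standard library (the CIDRs and AZ names are made up, not from the project): resolve the endpoint's hostname and match the IP against a hard-coded subnet-to-AZ table.

```python
import socket
import ipaddress

# Hypothetical static subnet -> AZ mapping; in a real setup this would mirror
# your VPC subnet layout (subnets are zonal, so CIDR membership implies the AZ).
SUBNET_TO_AZ = {
    "10.0.0.0/19": "eu-west-1a",
    "10.0.32.0/19": "eu-west-1b",
    "10.0.64.0/19": "eu-west-1c",
}

def zone_for_endpoint(hostname: str) -> str | None:
    """Resolve a hostname (e.g. an RDS endpoint) and return the AZ of the subnet its IP falls in."""
    ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    for cidr, zone in SUBNET_TO_AZ.items():
        if ip in ipaddress.ip_network(cidr):
            return zone
    return None

# e.g. zone_for_endpoint("mydb.xxxx.eu-west-1.rds.amazonaws.com") -> "eu-west-1b"
```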
The Kyverno requirement makes it limited. There is no "automatic-zone-placement-disabled" option for temporarily disabling zone placement without removing the label. How do we handle the RDS zone changing after the workload is scheduled? There is no automatic lookup of IPs and zones. What if we only have one node in a specific zone? Are we willing to handle an EC2 failure, or should we trigger a scale-out?
You don't have to use Kyverno. You could use a standard mutating webhook, but you would have to generate your own certificate and mutate on every Pod CREATE operation. Not really a problem but, it depends.
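For anyone curious, this is roughly what the plain-webhook route looks like, sketched as a manifest-shaped Python dict (service name, namespace and webhook name are placeholders). The caBundle field is the certificate management you otherwise get for free from Kyverno.

```python
# Sketch of a standard MutatingWebhookConfiguration, the Kyverno alternative
# mentioned above. Names and namespaces are placeholders, not the project's.
webhook_config = {
    "apiVersion": "admissionregistration.k8s.io/v1",
    "kind": "MutatingWebhookConfiguration",
    "metadata": {"name": "automatic-zone-placement"},
    "webhooks": [{
        "name": "azp.example.com",
        "admissionReviewVersions": ["v1"],
        "sideEffects": "None",
        "failurePolicy": "Ignore",  # don't block Pod creation if the webhook is down
        "clientConfig": {
            "service": {"namespace": "azp", "name": "azp-webhook", "path": "/mutate"},
            "caBundle": "<base64 CA cert you have to provision and rotate yourself>",
        },
        # Fires on every Pod CREATE in the cluster; the webhook itself must then
        # decide (e.g. based on an annotation) whether to inject affinity or do nothing.
        "rules": [{
            "apiGroups": [""],
            "apiVersions": ["v1"],
            "resources": ["pods"],
            "operations": ["CREATE"],
        }],
    }],
}
```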
> There is no "automatic-zone-placement-disabled"
True. That's why I chose preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution. In my case, where this solution originated, the cluster was already multi-AZ with at least one node in each AZ. It was nice if the Pod could be scheduled into the same AZ, but it was not a hard requirement.
> No automatic look up of IPs and Zones.
Yup, it would generate a lot of extra "stuff" to mess with: IAM roles, and how to look up IP/subnet information in a multi-account AWS setup with VPC peerings. In our case the static approach was "good enough"; the subnet/network topology didn't change frequently enough to justify another layer of complexity.
> What if we only have one node in specific zone?
That's why we defaulted to preferredDuringSchedulingIgnoredDuringExecution and not required.
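To make the "preferred, not required" point concrete, this is roughly the shape of the affinity the webhook would inject, shown as a Python dict (the zone value is whatever the lookup returned; the weight and label key are the common defaults, the project's actual patch may differ).

```python
# Roughly the nodeAffinity a mutating webhook would add to the Pod spec.
# "preferred" means the scheduler tries the matching zone first, but will still
# place the Pod elsewhere if that zone has no capacity (or no node at all).
injected_affinity = {
    "nodeAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [{
            "weight": 100,
            "preference": {
                "matchExpressions": [{
                    "key": "topology.kubernetes.io/zone",
                    "operator": "In",
                    "values": ["eu-west-1b"],  # AZ returned by the lookup service
                }]
            },
        }]
    }
}
```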
toredash•6d ago
I wanted to share something I've been working on to solve a gap in Kubernetes: its scheduler has no awareness of the network topology of external services that workloads communicate with. If a Pod talks to a database (e.g. AWS RDS), K8s does not know it should schedule it in the same AZ as the database. Placed in the wrong AZ, it generates unnecessary cross-AZ network traffic, adding latency (and costing $).
I've made a tool I've called "Automatic Zone Placement", which automatically aligns Pod placements with their external dependencies.
Testing shows that placing the Pod in the same AZ resulted in a ~175-375% performance increase, measured with small, frequent SQL requests. It's not really that strange: same-AZ latency is much lower than cross-AZ latency, and lower latency means increased performance.
The tool has two components:
1) A lightweight lookup service: A dependency-free Python service that takes a domain name (e.g., your RDS endpoint) and resolves its IP to a specific AZ.
2) A Kyverno mutating webhook: this policy intercepts Pod creation requests. If a Pod has a specific annotation, the webhook calls the lookup service and injects the required nodeAffinity to schedule the Pod onto a node in the correct AZ.
The goal is to make this an automatic process; the alternative is to manually add a nodeAffinity spec to your workloads. But resources move between AZs, e.g. during maintenance events for RDS instances. I built this with AWS services in mind, but the concept is generic enough to be used in on-premise clusters to make scheduling decisions based on rack, row, or data center properties.
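As a usage sketch (the annotation key below is made up for illustration; check the repo for the real one): you opt a workload in by annotating its pod template with the external endpoint, and the policy turns that into the injected affinity.

```python
# Hypothetical opt-in: the pod template carries an annotation naming the external
# endpoint; the Kyverno policy asks the lookup service for that hostname's AZ and
# injects the matching preferred nodeAffinity before the Pod reaches the scheduler.
pod_template_metadata = {
    "annotations": {
        # annotation key is illustrative, not the project's actual key
        "azp.example.com/endpoint": "mydb.cluster-xxxx.eu-west-1.rds.amazonaws.com",
    }
}
```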
I'd love some feedback on this, happy to answer questions :)
toredash•3h ago
If yes, that's a simple update of the manifest to run 3 replicas with an affinity setting that spreads them across AZs (see the sketch below). Kyverno would use the internal Service object this service provides to have an HA endpoint to send queries to.
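A rough sketch of that manifest change, as a dict; a topologySpreadConstraint on the zone label is one way to do it (podAntiAffinity would also work), and the app label is a placeholder.

```python
# One way to spread 3 replicas of the lookup service across AZs:
# a topologySpreadConstraint keyed on the zone label.
deployment_patch = {
    "spec": {
        "replicas": 3,
        "template": {
            "spec": {
                "topologySpreadConstraints": [{
                    "maxSkew": 1,
                    "topologyKey": "topology.kubernetes.io/zone",
                    "whenUnsatisfiable": "ScheduleAnyway",
                    "labelSelector": {"matchLabels": {"app": "azp-lookup"}},  # placeholder label
                }]
            }
        },
    }
}
```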
If we are not talking about this AZP service, I don't understand what we are talking about.
toredash•2h ago
I would create a similar policy where Kyverno, at intervals, checks the Deployment spec to see if the endpoint has changed, and updates the affinity rules accordingly. It would then be a normal update of the Deployment spec to reflect the desire to run in another AZ, if that makes sense?