A dead server is much better for a distributed system than a misbehaving one. The latter can bring down your entire application.
If your busy 1000-node production environment runs so close to the edge that 2000 heartbeat messages per second push it into overload, that's impressive resource scheduling.
Really, setting the interval balances the cost of slow detection against the cost of reacting to a momentary interruption. If the node actually dies, you'd like to react as soon as possible; but if it's something like a link flap or a system pause (GC or otherwise), most applications would prefer to wait and not transition state. Some applications, like live broadcast, are better served by moving very rapidly, and 500 ms might be too long.
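To make the tradeoff concrete, here's a minimal sketch of a missed-heartbeat detector; the interval and threshold names are illustrative assumptions, not anything from the article. Worst-case detection latency is roughly interval × threshold, so shrinking either one speeds up reaction to a real death, but also to a GC pause or link flap:

    import time

    HEARTBEAT_INTERVAL = 0.5   # seconds between heartbeats (assumed value)
    MISSED_THRESHOLD = 3       # heartbeats missed before suspecting the peer (assumed value)

    class HeartbeatMonitor:
        def __init__(self, interval=HEARTBEAT_INTERVAL, missed=MISSED_THRESHOLD):
            self.timeout = interval * missed
            self.last_seen = time.monotonic()

        def record_heartbeat(self):
            self.last_seen = time.monotonic()

        def is_suspect(self):
            # Detection latency ~= interval * missed. Any pause shorter than
            # the timeout (GC, link flap) never triggers a state transition.
            return time.monotonic() - self.last_seen > self.timeout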
Re: network partitioning, the author left out the really fun splits. Say you have servers in DC, TX, and CA. If there's a damaged (but not severed) link between TX and CA, there's a good chance that DC can talk to everyone but TX and CA can't communicate. You can have that inside a datacenter too: maybe each node can only reach 75% of the other nodes, and the fact that A can reach B and B can reach C does not mean A can reach C. Lots of fun times there.
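A toy illustration of that non-transitivity, with made-up link states rather than anything measured:

    # Hypothetical link states: the TX<->CA link is down, both links to DC are fine.
    GOOD_LINKS = {("DC", "TX"), ("DC", "CA")}

    def can_talk(a, b):
        # Reachability is a property of each link, not of the node set.
        return (a, b) in GOOD_LINKS or (b, a) in GOOD_LINKS

    for pair in [("DC", "TX"), ("DC", "CA"), ("TX", "CA")]:
        print(pair, can_talk(*pair))
    # DC sees everyone; TX and CA each see only DC. "A reaches B and B reaches C"
    # tells you nothing about A reaching C.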
paulsutter•1h ago
500 milliseconds is a very long interval, on a CPU timescale. Funny how we all tend to judge intervals based on human timescales
Of course the best way to choose heartbeat intervals is based on metrics like transaction failure rate or latency
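A rough sketch of what that could look like, deriving the timeout from observed heartbeat latency instead of a hand-picked constant; the percentile and safety factor are assumptions, not from the comment:

    import statistics

    def suggest_timeout(latencies_ms, safety_factor=4):
        # Use a high percentile of observed latency so normal jitter doesn't
        # trip the detector; the safety factor is an assumed tuning knob.
        p99 = statistics.quantiles(latencies_ms, n=100)[98]
        return p99 * safety_factor

    print(suggest_timeout([12, 15, 11, 40, 13, 18, 90, 14, 16, 17]))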