A dead server is much better for a distributed system than a misbehaving one. The latter can bring down your entire application.
If your busy 1000-node production environment runs so close to the edge that 2000 heartbeat messages per second push it into overload, that's impressive resource scheduling.
Really, setting the interval balances the cost of slow detection against the cost of reacting to a momentary interruption. If the node actually dies, you'd like to react as soon as possible; but if it's something like a link flap or a system pause (GC or otherwise), most applications would prefer to wait and not transition state. Some applications, like live broadcast, are better served by moving very rapidly, and 500 ms might be too long.
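To make the tradeoff concrete, here's a minimal sketch of a missed-heartbeat detector; the interval and threshold names are illustrative assumptions, not anything from the article. Worst-case detection latency is roughly interval × threshold, so shrinking either one speeds up reaction to a real death, but also to a GC pause or link flap:

    import time

    HEARTBEAT_INTERVAL = 0.5   # seconds between heartbeats (assumed value)
    MISSED_THRESHOLD = 3       # heartbeats missed before suspecting the peer (assumed value)

    class HeartbeatMonitor:
        def __init__(self, interval=HEARTBEAT_INTERVAL, missed=MISSED_THRESHOLD):
            self.timeout = interval * missed
            self.last_seen = time.monotonic()

        def record_heartbeat(self):
            self.last_seen = time.monotonic()

        def is_suspect(self):
            # Detection latency ~= interval * missed. Any pause shorter than
            # the timeout (GC, link flap) never triggers a state transition.
            return time.monotonic() - self.last_seen > self.timeout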
Re: network partitioning, the author left out the really fun splits. Say you have servers in DC, TX, and CA. If there's a damaged (but not severed) link between TX and CA, there's a good chance that DC can talk to everyone but TX and CA can't communicate. You can have that inside a datacenter too: maybe each node can only reach 75% of the other nodes, and the fact that A can reach B and B can reach C does not mean A can reach C. Lots of fun times there.
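A toy illustration of that non-transitivity, with made-up link states rather than anything measured:

    # Hypothetical link states: the TX<->CA link is down, both links to DC are fine.
    GOOD_LINKS = {("DC", "TX"), ("DC", "CA")}

    def can_talk(a, b):
        # Reachability is a property of each link, not of the node set.
        return (a, b) in GOOD_LINKS or (b, a) in GOOD_LINKS

    for pair in [("DC", "TX"), ("DC", "CA"), ("TX", "CA")]:
        print(pair, can_talk(*pair))
    # DC sees everyone; TX and CA each see only DC. "A reaches B and B reaches C"
    # tells you nothing about A reaching C.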
paulsutter•1h ago
500 milliseconds is a very long interval, on a CPU timescale. Funny how we all tend to judge intervals based on human timescales
Of course the best way to choose heartbeat intervals is based on metrics like transaction failure rate or latency
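A rough sketch of what that could look like, deriving the timeout from observed heartbeat latency instead of a hand-picked constant; the percentile and safety factor are assumptions, not from the comment:

    import statistics

    def suggest_timeout(latencies_ms, safety_factor=4):
        # Use a high percentile of observed latency so normal jitter doesn't
        # trip the detector; the safety factor is an assumed tuning knob.
        p99 = statistics.quantiles(latencies_ms, n=100)[98]
        return p99 * safety_factor

    print(suggest_timeout([12, 15, 11, 40, 13, 18, 90, 14, 16, 17]))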