It's like the quote attributed to Don Knuth: "Beware of bugs in the above code; I have only proved it correct, not tried it."
But one tool (like TLA+) can’t realistically support all formalisms for all types of analyses ¯\_(ツ)_/¯
I only believe in formal methods where we always have a machine-validated path from model to implementation.
> Distributed systems are notoriously hard to get right (i.e., guaranteeing correctness) as the programmer needs to reason about numerous control paths resulting from the myriad interleaving of events (or messages or failures). Unsurprisingly, programmers can easily introduce subtle errors when designing these systems. Moreover, it is extremely difficult to test distributed systems, as most control paths remain untested, and serious bugs lie dormant for months or even years after deployment.
> The P programming framework takes several steps towards addressing these challenges by providing a unified framework for modeling, specifying, implementing, testing, and verifying complex distributed systems.
It was last posted on HN about 2 years ago [1].
In large, distributed systems the best we're looking for is statistically acceptable. You can always tailor a workload that will break a guarantee in the real world.
So you engineer with techniques that make it likely that the workloads you have characterized as realistic can be handled with headroom, and you worry about graceful degradation under oversubscription (i.e. maintaining "good-put"). In my experience, that usually comes down to good load-balancing, auto-scaling and load-shedding.
Virtually all of the truly bad incidents I've seen in large-scale distributed systems are caused by an inability to recover back to steady-state after some kind of unexpected perturbation.
If I had to characterize problem number one, it's bad subscriber-service request patterns that don't handle back-pressure appropriately: subscribers that don't know how to back off properly, and services that don't provide back-pressure. The classic example is a subscriber that retries requests on a static schedule and gives up on requests that have been in flight "too long", coupled with services that continue to accept requests when oversubscribed.
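For illustration, a minimal sketch of the subscriber-side half of that fix (`call_service` is a hypothetical callable standing in for the actual request): capped exponential backoff with full jitter, plus an overall deadline so retries stop instead of piling up.

```python
import random
import time

def call_with_backoff(call_service, deadline_s=10.0, base_s=0.1, cap_s=5.0):
    """Retry with capped exponential backoff and full jitter, giving up once
    the overall deadline has passed instead of retrying on a fixed schedule."""
    start = time.monotonic()
    attempt = 0
    while True:
        try:
            return call_service()          # hypothetical request function
        except Exception:
            attempt += 1
            # Full jitter: sleep a random amount up to the capped exponential step.
            sleep_s = random.uniform(0, min(cap_s, base_s * 2 ** attempt))
            if time.monotonic() + sleep_s - start > deadline_s:
                raise                      # budget exhausted: surface the failure
            time.sleep(sleep_s)
```

The mirror image on the service side is refusing work when oversubscribed rather than accepting it and timing out later.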
I personally couldn't care less about proving that an endpoint always responds in less than 100 ms, say, but I care very much about understanding where the various saturation points are in my systems, what values I should set for limits like database connections, what the effect of sporadic timeouts is, etc. I think that's more the point of this post (which you see him talk about in other posts on his blog).
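As one concrete example of that kind of reasoning (the numbers below are made up), Little's Law (mean concurrency = arrival rate × mean time in system) is often enough to ballpark a connection-pool limit:

```python
# Little's Law: mean concurrency L = arrival rate λ × mean time in system W.
# Ballpark a database connection pool size; numbers are illustrative only.
peak_qps = 2000          # requests per second hitting the database
mean_query_s = 0.015     # 15 ms average query time, including network
headroom = 1.5           # margin for bursts and slow queries

connections = peak_qps * mean_query_s * headroom
print(f"pool size ≈ {connections:.0f} connections")   # ≈ 45
```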
I think it often boils down to "know when you're going to start queuing, and how you will design the system to bound those queues". If you're not applying that principle at the design stage, then I think you're already cooked.
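A minimal sketch of that principle, assuming a simple thread-per-worker service (`handle` is a placeholder): a bounded queue in front of the workers, and an immediate rejection when the bound is hit instead of unbounded buffering.

```python
import queue
import threading

REQUESTS = queue.Queue(maxsize=100)    # the bound is an explicit design decision

def handle(request):
    """Placeholder for the real request handler."""

def accept(request) -> bool:
    """Admission control: reject immediately when the queue is full so the
    caller gets back-pressure now rather than a timeout later."""
    try:
        REQUESTS.put_nowait(request)
        return True
    except queue.Full:
        return False                   # translate into HTTP 429 / "retry later" at the edge

def worker():
    while True:
        handle(REQUESTS.get())
        REQUESTS.task_done()

for _ in range(8):                     # the worker pool size is another explicit limit
    threading.Thread(target=worker, daemon=True).start()
```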
I think simulation is definitely a promising direction.
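For example, a toy single-server queue simulation (stdlib only, assuming exponential arrivals and service times) already makes the saturation knee visible without touching production:

```python
import random

def simulate(arrival_rate, service_rate, n_requests=50_000, seed=0):
    """Toy single-server FIFO queue with exponential interarrival and service
    times. Returns the mean time in system (waiting + service) in seconds."""
    rng = random.Random(seed)
    clock = 0.0          # arrival time of the current request
    free_at = 0.0        # time the server next becomes free
    total = 0.0
    for _ in range(n_requests):
        clock += rng.expovariate(arrival_rate)
        start = max(clock, free_at)
        free_at = start + rng.expovariate(service_rate)
        total += free_at - clock       # time in system for this request
    return total / n_requests

# Sweep offered load; latency blows up as utilization approaches capacity.
for utilization in (0.5, 0.7, 0.9, 0.95, 0.99):
    latency = simulate(arrival_rate=utilization * 100, service_rate=100)
    print(f"utilization {utilization:.0%}: mean latency {latency * 1000:.1f} ms")
```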
> In large, distributed systems the best we're looking for is statistically acceptable. You can always tailor a workload that will break a guarantee in the real world.
This is called "soft" realtime. I don't disagree with you that it's a realtime problem; I do, however, think that "just" is doing a lot of work there.
Agreed though, "just" is hiding quite a deep rabbit hole.
But in practice, I've spent just as much time on issues introduced by perf/scalability limitations. And the post's thesis is correct: we don't have great tools for reasoning about this. This has been pretty much all I've been thinking about recently.
Tools like Lean and Rocq can do arbitrary math — the limit is your time and budget, not the tool.
These performance questions can be mathematically defined, so it is possible.
And the seL4 kernel has latency guarantees based on similar proofs (at considerable cost):
https://arxiv.org/abs/2205.15211
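As a toy illustration of what "mathematically defined" can look like (a sketch only, not how seL4's timing work or any real cost analysis models time), one can define a cost function next to the code and machine-check an upper bound on it in Lean:

```lean
-- Toy cost model: a recursive function plus a machine-checked upper bound.
-- Real WCET / resource analyses model hardware and memory far more carefully.
def cost : Nat → Nat
  | 0     => 1
  | n + 1 => cost n + 2

theorem cost_le (n : Nat) : cost n ≤ 2 * n + 1 := by
  induction n with
  | zero => simp [cost]
  | succ n ih => simp only [cost]; omega
```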
Already we have Resource Aware ML which
> automatically and statically computes resource-use bounds for OCaml programs