This is fucking amazing
[0] https://www.hytradboi.com/2025/c222d11a-6f4d-4211-a243-f5b7f...
Basically, certain system clocks are rolled back and incremented in a deterministic manner to emulate real issues recorded from production bugs. I had a lot of fun building this and seeing the effect of messing with different system clocks by intercepting system calls.
Am I correct in thinking that this would require a fork of the main Go implementation, in order to make a deterministic (and controllable) goroutine scheduler and network stack?
Does one also need to run on a particularly predictable OS?
Yes, because the Go runtime actually leans explicitly and deliberately in the opposite direction. Additional non-determinism is deliberately added to the system: most notably, a "select" with multiple ready channels picks among them pseudorandomly, and map iteration order is scrambled on each traversal. It's not entirely random, and as such not suitable for pretty much any randomness requirement you may have, but it's enough to generally prevent accidental dependencies on deterministic iteration.
This has the effect of "spreading out" concurrency issues, hopefully reducing the sort of problem where you only discover a major issue when something else in the environment changes three years later. It's a nice default, but nothing like a proof, of course.
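To illustrate both behaviors, here's a minimal standalone sketch (nothing here is specific to the project under discussion):

    package main

    import "fmt"

    func main() {
        // Both channels are ready on every iteration; which select case
        // fires is chosen pseudorandomly by the runtime.
        a, b := make(chan string, 1), make(chan string, 1)
        for i := 0; i < 5; i++ {
            a <- "a"
            b <- "b"
            select {
            case v := <-a:
                fmt.Println("select picked", v)
                <-b // drain the other channel
            case v := <-b:
                fmt.Println("select picked", v)
                <-a
            }
        }

        // Map iteration order is also deliberately scrambled, so code that
        // accidentally relies on a fixed order tends to break quickly.
        m := map[string]int{"x": 1, "y": 2, "z": 3}
        for k := range m {
            fmt.Println("map key", k)
        }
    }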
You can also go the "predictable OS" route, since the entire Go runtime sits on top of it. Nothing can non-deterministically run on top of a fully deterministic OS image. But a fully deterministic OS would probably want some integration with the Go runtime to be able to more deeply control it through mechanisms other than varying inputs to the random function, which is a very indirect way to go down a code path.
But `synctest.Wait` won't work for non-durably blocked goroutines (such as one in the middle of a TCP syscall), so it requires in-memory implementations of e.g. net.Conn (I've plugged in https://pkg.go.dev/google.golang.org/grpc/test/bufconn with good success)
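Roughly what the bufconn wiring looks like on its own (a minimal sketch; how you combine it with synctest depends on your setup):

    package conntest

    import (
        "io"
        "testing"

        "google.golang.org/grpc/test/bufconn"
    )

    func TestInMemoryConn(t *testing.T) {
        // bufconn gives you a net.Listener / net.Conn pair backed by an
        // in-memory pipe, so nothing ever touches a real socket.
        lis := bufconn.Listen(1 << 20)
        defer lis.Close()

        go func() {
            conn, err := lis.Accept()
            if err != nil {
                return
            }
            _, _ = conn.Write([]byte("pong")) // "server" side replies in memory
        }()

        conn, err := lis.Dial() // net.Conn with no real networking underneath
        if err != nil {
            t.Fatal(err)
        }
        defer conn.Close()

        buf := make([]byte, 4)
        if _, err := io.ReadFull(conn, buf); err != nil {
            t.Fatal(err)
        }
        if string(buf) != "pong" {
            t.Fatalf("got %q, want %q", buf, "pong")
        }
    }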
In fact I seem to be a bit iconoclastic on this matter, but I'm not even a fan for testing purposes. Even non-proof-based testing needs the ability to test that goroutines may execute out of order. Nothing prevents a goroutine scheduled for 2ms from now from running all the way to completion before a goroutine scheduled for 1ms from now even starts, but AFAIK this approach doesn't let you test that case. Such logic is not valid in a multithreaded system; the expected ordering is at most the most likely progression, not the only possible one.
But since we live in a world where the question is more whether anyone will write a concurrency test at all, moaning about not having a perfect one is missing the point, I suppose. I won't deny that a deterministic test is generally better than a non-deterministic one.
It's a concurrency testing framework for Java. It also does deterministic simulation.
(Disclosure: I am an author.)
I also gave a talk overview of it at Strange Loop back in 2015, but don’t have the youtube link handy.
If you’re interested in trying DST, we have a new company that aims to make it a much easier lift, especially for existing projects that can’t be retrofitted onto one of the many simulation frameworks: https://antithesis.com
Happy to answer any questions about the approach, here or over email.
For example, the first implementations of AWS's internal lock service (Alf) used DST as a key part of the testing strategy, sometime around 2009. Al Vermeulen was influential in introducing it at AWS, and I believe it built on some things he'd worked on before.
Still, Antithesis is super cool, and I really admire how you all are changing the conversation around systems correctness. So this is a minor point.
As another interesting historical side note, I have wondered about the similarities between Antithesis and "Corensic", a startup spun out of UW around a similar time period (circa 2009). Apparently, they "built a hypervisor that could on-demand turn a guest operating system into deterministic mode" (see "Deterministic Multiprocessing" at https://homes.cs.washington.edu/~oskin/). My impression is that their product was not a significant commercial success, and the company was acquired by F5 Networks in 2012 (https://comotion.uw.edu/startups/corensic/).
Overall, I don't over-index on novelty, and think it is generally good for ideas to be recycled/revived/re-explored, with updated, modern perspectives. I believe that most rigorous systems designers/engineers likely converge to similar ideas (model checking, value of determinism, etc.) after dealing with these types of complex systems for long enough. But, it is nevertheless interesting to trace the historical developments.
Deterministic hypervisors are by no means new. Somebody once told me that VMWare used to support a deterministic emulation mode (mostly used for internal debugging). Apparently they lost the capability some time ago.
This. If you take nothing else away from the article (which has a lot) take this: fail well, don’t fail poorly.
What is the system state when it does error?
What is the best possible recovery from each error state?
What can the user/caller expect for an error?
E.g. if you're in Python and raise a ValueError when an API is rate limited, someone downstream from you is going to have a bad time.
I have a lot of Go testing shims that look like:
    type UserGetter interface {
        GetUser(userID string) (User, error)
    }

    type TestUsers struct {
        Users map[string]User // canned users returned by GetUser
        Error error           // if set, every GetUser call fails with it
    }

    func (t TestUsers) GetUser(userID string) (User, error) {
        if t.Error != nil {
            return User{}, t.Error
        }
        user, have := t.Users[userID]
        if !have {
            return User{}, ErrUserNotFound
        }
        return user, nil
    }
This allows easily testing errors upon retrieving users and ensuring the correct thing happens. I'm not a 100% maniacal "get 100% test coverage in everything" kind of guy, but on the flip side, if your test coverage only lights up the "happy path", your testing is not good enough, and as you scale up, the probability that your system will do something very wrong when an error occurs rapidly approaches 1.

It's more complicated when you have something like a byte stream where you want to simulate a failure at arbitrary points in the stream, but similar techniques can get you as close as you like, depending on how close that is.
From there, in terms of "how do you handle non-fatal errors", there really isn't a snap rule to give. Quite often you just propagate the error because there isn't anything else to do. Sometimes you retry some bounded number of times, maybe with backoff. Sometimes you log it and move on. Sometimes you have a fallback you can try. It just depends on your needs. I write a lot of network code, and I find that once my systems mature, rather a lot of the errors in the system get some sort of handling beyond "just propagate it up", but it's hard for me to guess in advance what that handling will be. It's a lot easier to mentally design all the happy paths than to figure out all the ways the perverse external universe will screw them up and how to mitigate the issues.
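For instance, the "retry a bounded number of times with backoff" case might look something like this against the UserGetter interface above (the helper name and limits are made up for illustration; assumes "fmt" and "time" are imported):

    func getUserWithRetry(g UserGetter, userID string) (User, error) {
        var lastErr error
        backoff := 50 * time.Millisecond
        for attempt := 0; attempt < 3; attempt++ {
            user, err := g.GetUser(userID)
            if err == nil {
                return user, nil
            }
            lastErr = err
            time.Sleep(backoff) // wait before the next attempt
            backoff *= 2        // exponential backoff
        }
        return User{}, fmt.Errorf("giving up after retries: %w", lastErr)
    }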
Typical example would be processing an event that you can't handle from a message queue. You don't want to crashloop, so you'd probably have to throw it away on a dead letter queue and continue processing. But then, is your system still correct? What happens if you later receive another event relating to the same entity, which depends on the first event? Or sometimes you can't even tell which entity the malformed or bug-triggering event relates to, and then it's a real problem.
> In 2014, Yuan et al. found that 92% of catastrophic failures in tested distributed systems were triggered by incorrect handling of nonfatal errors.
To the point I've seen it cause real bugs in production.
The problem in Node is that you can either throw to produce a 4xx or you can return a 4xx, so downstream there are two things to check.
https://www.cs.princeton.edu/~appel/papers/fpcc.pdf
I'll also add that mutation testing has found specification errors, too.
The real thing though, is that if you have a verified formal model and a buggy implementation, then _you know_ your problem is at that model <-> implementation level.
Could be the implementation is wrong. Could be that the "bug" is in fact not a bug according to your model. But your model isn't "wrong". If your model says that A happens in condition B, then that is what should happen!
You can avoid second guessing a lot of design patterns with models, and focus on this transitional layer.
If someone came up to you and said "I built a calculator and with this calculator 1/0 becomes 0" you don't say "oh... maybe my model of division is wrong". You think the calculator is wrong in one way or another.
Maybe the calculator's universe is consistent in some way but in that case its model is likely not the same model of division you're thinking of. This eliminates entire classes of doubt.
The papers will describe the strengths and weaknesses of those tools.
You can have a function that returns the anagram, and testing will prove it correct for some pairs of words. But proving it for all words requires formal verification. And that's when you catch some tricky errors due to undefined behavior or library bugs, because you can't prove their semantics.
"Good" is doing a lot of heavy lifting here. The point of TLA+/PlusCal is to have a proof of the soundness of the design.
It models N threads non-atomically incrementing a shared counter, with the property "the counter eventually equals the number of threads in the model". When checked in TLA+, it finds a race condition where one thread overwrites another's value. I've written implementations of the buggy design, and on my computer they race on less than 0.1% of executions, so testing for it directly would be very hard.
Most TLA+ specs are for significantly more complex systems than this, but this is a good demo because the error is relatively simple.
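For reference, an implementation of the buggy design might look roughly like this in Go (a sketch; the spec itself is language-agnostic, and the race here is intentional):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        const n = 8
        counter := 0
        var wg sync.WaitGroup
        for i := 0; i < n; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                tmp := counter // read the shared counter
                tmp++          // increment the local copy
                counter = tmp  // write back; can silently overwrite another goroutine's update
            }()
        }
        wg.Wait()
        fmt.Println(counter) // usually n, occasionally less -- the lost-update race
    }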
Any good open source libraries that do this in a language-agnostic way? Seems doable: spin up a container with some tools inside it; the tools need some middleware to know when a test is about to run, and when it runs they make certain things (networking, storage, etc.) "deterministic" in the context of the test run.
This is more or less what Antithesis does, but I haven't seen anything open source yet.
You could, of course, write your tests well, such that you can stub out I/O, but that's work, and not everyone will write their tests well anyway (you should still do this, but it's nicer, IMO, if this determinism lives in a layer above the application).
As a slight sidebar: I'm not really bullish on AI, but I think testing is one of the things where AI will hopefully shine, because the feedback loop during prompting can be driven by your actual application requirements. The test implementation (driven by AI), the requirements (driven by you as the prompt), and the "world" (driven by the actual code being tested) can hopefully drive all three toward some theoretical ideal. If AI gives us anything, I'm hoping it can make software a more rigorous discipline by making formal verification more doable.
(1) Previously, you had to build your entire system around one of the simulation frameworks (and then not take any dependencies).
(2) It’s way too easy to fool yourself with weak search/input generation, which makes all your tests look green when actually you aren’t testing anything nontrivial.
As you say, Antithesis is trying to solve both of these problems, but they are very challenging.
I don’t know of anybody else who has a reliable way of retrofitting determinism onto arbitrary software. Facebook’s Hermit project tried to do this with a deterministic Linux userspace, but is abandoned. (We actually tried the same thing before we wrote our hypervisor, but found it didn’t work well).
A deterministic computer is a generically useful technology primitive beyond just testing. I’m sure somebody else will create one someday, or we will open-source ours.
The problem with coupled tooling is that no one will use it. That's what's cool about Antithesis: if they're able to complete their goal, that's basically what will be achieved.
I once built a much, much simpler thing that allowed running multiple JVM threads in a particular kind of lockstep, by stubbing and controlling I/O operations and the advance of the system time. With that, I could run several asynchronously connected components in particular interaction patterns, covering not just different thread activation orders but also I/O failures, retries, etc. It was manageable, and it helped uncover a couple of nasty bugs before the code ever ran in prod.
But that was only possible because I went with drastic simplifications, controlling not the whole system but only particular synchronization points. It won't detect a generic data race where explicit synchronization was simply forgotten.
But now it is no longer used to generate production code?
I asked that question here, which I think was the same question as in a talk: https://news.ycombinator.com/item?id=34284557
It seems like if the generated code is used in a kernel, it could also be used in a cloud, which is less resource-constrained
[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...
We have used Coyote/P# not just for model checking an abstract design (which no doubt is very useful) but testing real implementations of production services at Microsoft.
I thought I read that somewhere, but now I can't find the claim
I’m curious which one is actually more impressive in general.
Some previous discussion https://news.ycombinator.com/item?id=36900147
Reminded of the following comment from not too long ago.
There is no "basis" other than my gut feeling. Unless you can get quantified metrics to compare, that's all we've got. For example, if you had lines of code for both, or average IQ, either would lead toward the "basis" which neither you nor I have.
But object stores are embarrassingly parallel, so if such a migration should be possible somewhere without down time, then it is definitely object stores.
That is, at what layer of the stack do you start migrating some stuff to the new strongly consistent system on the live service?
You can't really do it on a per-bucket basis, since existing buckets already have data in the old system.
You can't do it at the key-prefix level for the same reason.
Can't run both systems in parallel and try the new one, falling back to the old one if the key isn't in it, because that opens up violations of the consistency rules you're trying to add.
Seems trickier than one might think.
Likely they don't have to physically move object data; instead, the layer that handles writes and reads coordinates based on some versioning guarantees (in database land, MVCC is a prominent paradigm). They'd need a distributed transactional KV store that tells every reader what the latest version of the object is and where to read it from.
An object write is only acknowledged as finished once the data is written and the KV store is updated with the new version.
They could do this bucket by bucket in parallel since buckets are isolated from each other.
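A toy sketch of that versioned-write idea (types and names are purely illustrative, not anything S3 actually does; assumes "sync" is imported):

    // versionEntry is what a reader looks up before fetching object bytes.
    type versionEntry struct {
        Version  int64
        Location string // where the bytes for this version live
    }

    // objectIndex stands in for the distributed, transactional KV store.
    type objectIndex struct {
        mu      sync.Mutex
        entries map[string]versionEntry // key -> latest committed version
    }

    // commit records a new version only after the data is already durable,
    // so a write is acknowledged only when both steps have happened.
    func (ix *objectIndex) commit(key, location string) {
        ix.mu.Lock()
        defer ix.mu.Unlock()
        prev := ix.entries[key]
        ix.entries[key] = versionEntry{Version: prev.Version + 1, Location: location}
    }

    // latest tells a reader which version to fetch and from where.
    func (ix *objectIndex) latest(key string) (versionEntry, bool) {
        ix.mu.Lock()
        defer ix.mu.Unlock()
        e, ok := ix.entries[key]
        return e, ok
    }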
I can confidently say that as impressive as S3 is from the outside, it's at least that impressive internally, both in implementation, and organizational structure.
Also, S3 is not better than GCS or Azure Blob.
S3 is likely an order of magnitude larger than those others - it's had a lot longer to grow.
TLA+ takes some time to learn to wield effectively, but it pays off in spades.
Ok, I get how property-based testing and fuzzing have a relationship to formal methods (the thing being checked looks like part of a formal specification, and in some sense these are a subset of the checks that model checking confirms), but calling runtime monitoring a "semi-formal approach" seems like a real stretch.
I'm not trying to draw the ire of the Microsoft fan boys, and there are certainly smart people working on that, but it's just not going to happen for most people.
Had this been in golang, or maybe java, I'm sure many more hands would be digging in! Having said that, I hope this helps bring correctness and validation more into the mainstream. I've been casually following the project for a while now.
My long-term goal is to integrate model validation into https://github.com/purpleidea/mgmt/ so if this is an area of interest to you, please let me know!
EGreg•1d ago
Today I went to his website and discovered a lot about TLA+ and PlusCal. He still maintains it: https://lamport.azurewebsites.net/tla/peterson.html?back-lin...
I must say ... it would make total sense for a guy like that, who brought mathematics to programming and was an OG of concurrent systems, to create a systems design language that's used at AWS and other industrial places that need to serve people.
I wish more people who build distributed systems would use what he made. Proving correctness is crucial in large systems.
lopatin•1d ago
I've found multiple bugs in personal Rust projects like this (a Snake game that allowed the snake to do a 180-degree turn), and have verified some small core C++ components at work with it as well (a queue that has certain properties around locking and liveness).
I tried other models but kept getting issues with syntax and spec logic with anything else besides Opus.
dosnem•1d ago
lopatin•1d ago
Here's how the bug was caught: it should be impossible for the snake representation to be fewer than 2 points. So I told Opus to model the behavior of my snake, and also to write a TLA+ invariant that the snake length should never be under 2. TLA+ then basically simulates it and finds the exact sequence of steps (turns) that causes that invariant not to hold. In this case it was quite trivial: I had never thought to prevent the snake from making turns that aren't 90 degrees.
Jtsummers•1d ago
skydhash•1d ago
> Also how is TLA+ spec determined from the code?
You start from the initial state, which is always known (or is fixed). Then you model the invariants for each line.
> Won’t it implicitly model incorrect bugs as correct behaviour since it’s an existing state in the system
Invariants will conflict with each other in this case.
> Also when using TLA from the start, can it be applied to implementations?
Yes, by fully following the spec and handling possible incorrect states that may happen in practice. If your initial state in the TLA+ spec says that it only includes natural numbers between 1 and 5, you add assertions in your implementation (or throw exceptions) to check that, since the Int type in many type systems does not fully guarantee that constraint. It's even more work when using a dynamic language.
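In Go, for example, that might look something like this (a hypothetical sketch; the 1-to-5 range is just the constraint from the example above, and it assumes "fmt" is imported):

    // Counter mirrors a spec variable whose TLA+ definition only ever
    // holds natural numbers between 1 and 5.
    type Counter struct{ value int }

    // NewCounter turns the spec constraint into a runtime check, since an
    // ordinary int type can't express "between 1 and 5" on its own.
    func NewCounter(v int) (Counter, error) {
        if v < 1 || v > 5 {
            return Counter{}, fmt.Errorf("value %d outside the spec's range [1, 5]", v)
        }
        return Counter{value: v}, nil
    }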
k__•1d ago
skydhash•1d ago
It could be good in smaller, but critical and widely used utilities like SSH and terminals.
oblio•1d ago
rthnbgrredf•1d ago
oblio•12h ago
belter•1d ago
You can't do that...
The model checker says the specification satisfies the properties you wrote within the finite state space you explored...
amw-zero•19h ago