The complexity here is the complete opposite of the simple toy examples. What are the edge cases of an optimizing compiler? How do you even approach them, if they're buried deep in a chain of transformations?
The properties are simple things like "the compiler shouldn't crash, the compiled code shouldn't crash, and code compiled with different optimization levels should do the same thing." This assumes the randomly generated code doesn't touch undefined behavior in the language spec.
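A rough sketch of that differential property, with `generate_random_program` and `compile_and_run(source, opt_level)` as hypothetical stand-ins for the real tooling:

    import random

    def check_optimization_levels(generate_random_program, compile_and_run, runs=1000):
        # Differential property: a UB-free random program must not crash the
        # compiler, must not crash at runtime, and must behave identically at
        # every optimization level.
        for _ in range(runs):
            source = generate_random_program(seed=random.randrange(2**32))
            baseline = compile_and_run(source, opt_level=0)
            for level in (1, 2, 3):
                assert compile_and_run(source, opt_level=level) == baseline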
Here's a recent example of a bug found by this approach. The Common Lisp code triggering the bug has been automatically minimized: https://bugs.launchpad.net/sbcl/+bug/2109837 with the bug fix https://sourceforge.net/p/sbcl/sbcl/ci/1abebf7addda1a43d6d24...
Seems like a unit test tests a specific input and property testing tests many inputs.
Also, if you never look at the test data, it might give you false confidence about how thoroughly your code is being tested.
Set the expectations, model the expected behavior, and verify that the system matches that expectation. The approach works at all testing scales.
With property based testing, it actually CAN be 9 passes and 1 failure, because that single failure can be hitting an edge case the other runs just aren't. In fact, seeing only a few failures is more likely than seeing every run fail.
I can tell you from experience that random failures cause loss of trust. People learn to ignore failures and just keep hitting rebuild until the tests pass. People will not investigate test failures in code that they don't understand.
But yeah, probabilistic testing isn't perfect.
Allowing reproducibility for a given change set.
Didn't even enter my mind.
Hypothesis has a nice option to pick the seed for each property by hashing that property's code. It's a nice idea, but it relies on Python's highly dynamic nature, so it may not be easy or possible in other languages (especially compiled ones).
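Presumably this is the `derandomize` setting; a minimal sketch of how it's used:

    from hypothesis import given, settings, strategies as st

    @settings(derandomize=True)  # deterministic: data derived from the test itself rather than a random seed
    @given(st.text(), st.text())
    def test_concat_length(a, b):
        assert len(a + b) == len(a) + len(b)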
The co-worker should add the test to a new branch / test it on main. If it fails, that's a new ticket (with the great side effect of having a failing test). If it passes, it's a problem in their branch. If not, it's the same as having a broken main, which happens anyway and you deal with that as you usually do.
You can have a PRNG seeded from something like the commit hash and have PRNGs generate the test cases. That can still fail on 1 in 10 tests for a particular run.
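A toy sketch of that setup, assuming the test harness can shell out to git:

    import random
    import subprocess

    # Seed a PRNG from the current commit hash so every run against the same
    # change set generates the same test cases.
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    rng = random.Random(commit)
    test_inputs = [rng.randint(-2**31, 2**31 - 1) for _ in range(100)]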
The most memorable discussion I had around PBT was with a colleague (a skip report) who saw "true" randomness as a net benefit and that reproducibility was not a critical characteristic of the test suite (I guess the reasoning was then it could catch things at a later date?). To be honest, it scared the hell out of me and I pushed back pretty hard on them and the broader team.
I have no issue with a pseudo-random set of test cases that are declaratively generated. That makes sense if that is what is meant by PBT, since it is just a more efficient way of testing (and you would assume it would allow you to cast a wider net).
The idea is you have random testing and the test failures are added as explicit tests that then always get run.
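In Hypothesis, for instance, those discovered failures can be pinned with the `@example` decorator so they always run alongside the random cases; a small sketch with a made-up round-trip property:

    from hypothesis import example, given, strategies as st

    @given(st.floats(allow_nan=False))
    @example(0.0)       # previously-found random failures, now pinned
    @example(-1e308)    # so they run on every build
    def test_float_string_round_trip(x):
        assert float(str(x)) == x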
Is that so different from someone else testing?
The main downside is that you stumble across a new issue in an unrelated branch, but that's not wildly different from doing so while using your application.
I've been there. It takes many hours to try to guess where the system went wrong to produce the undesirable result, and then you still might not be sure if you are looking at the right place, and then there are always environment issues you aren't sure of. So you don't know who to blame.
Only very simple systems will reproduce pathological results 100% of the time given some initial conditions. The bane of complex systems is the timeouts. They are usually very hard to justify and are easy to blame for undesirable behavior.
To test an entire system with reproducible failures, you probably need something more heavyweight like Antithesis. Property tests are more useful for unit tests.
It seems to be almost totally forgotten, since the only link I could find is an excerpt from a PalmOS programming book:
https://www.oreilly.com/library/view/palm-programming-the/15...
In bigger programs this is an outright necessity because pure random fuzzing would basically be a lottery.
I've always felt that unit testing frameworks and libraries and even parameterized testing were missing this kind of functionality.
IntelliJ is able to run my tests and figure out the code coverage, but why isn't it closing the loop and auto-fuzzing/auto-discovering how to mutate tests to cover more?
And don't point me at AI, none of this requires AI and nothing should have to "think" to do this.
It's crazy to me that the vast majority of code running all the time is not exhaustively tested through almost all of its possible state space with most of its possible input space. It's not like we are lacking the CPU bandwidth to do it.
Why can't I write a new function and have something tell me within ten minutes "this input param causes an exception" without any effort from me? Instead all those extra cores in my CPU just run javascript trash and crowdstrike scanners
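The closest off-the-shelf approximation today is a plain "doesn't throw" property; a sketch, where the function under test is passed in as a parameter:

    from hypothesis import given, strategies as st

    def check_no_crash(fn, inputs=st.text()):
        # Minimal "auto-fuzz": assert only that fn doesn't raise for any generated input.
        @given(inputs)
        def prop(x):
            fn(x)  # any uncaught exception fails, and Hypothesis shrinks the input
        prop()

    # e.g. check_no_crash(my_new_function)   # hypothetical function under test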
https://developer.android.com/studio/test/other-testing-tool...
In practice, property based testing fails because the organization is not actually interested in delivering correct code. "This bug will never happen in practice so we won't fix it." "If we fix this, we may change some incorrect behavior some customer is depending on." And once that happens, PBT is useless, because it will keep finding that "don't fix" bug over and over.
In that case, we update the properties to reflect the new spec.
It found so many bugs: file corruption, crashes, memory leaks, pathological performance issues. The kind of issues that standard unit testing doesn't find.
> Without complex input spaces, there's no explosion of edge cases, which minimizes the actual benefit of PBT.

The real benefits come when you have complex input spaces. Unfortunately, you need to be good at PBT to write complex input strategies. I wrote a bit about it here...
Here's the link: https://www.hillelwayne.com/post/property-testing-complex-in...
If it's something valid for a user to do, you can make a list of those actions and have the test run them in sequence.
I had this for a UI library. It could call the functions to create and add to the UI and then afterwards would move through it. It was for the BBC, so it ran on TVs and could move up/down/left/right - the logic was: regardless of the UI, if you moved right and the focus changed, then moving left should bring you back to where you were (the same for up/down, etc.).
That's tricky, yet being able to write
FOR ANY ui a person can construct
FOR ANY path a user takes through it
WHEN a user presses R
AND the focus changes
AND the user then presses L
THEN the user is on the item they were on before
was actually quite easy and yet insanely powerful.
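A rough Hypothesis-style sketch of that property (here `uis` is assumed to be a strategy that builds an arbitrary UI for the library under test, and `ui.move()` / `ui.focused()` are hypothetical APIs):

    from hypothesis import given, strategies as st

    moves = st.sampled_from(["up", "down", "left", "right"])

    @given(uis, st.lists(moves))               # uis: assumed strategy for "any ui a person can construct"
    def test_right_then_left_returns_focus(ui, path):
        for move in path:                      # any path a user takes through it
            ui.move(move)
        before = ui.focused()
        ui.move("right")                       # the user presses R
        if ui.focused() != before:             # and the focus changes
            ui.move("left")                    # then pressing L...
            assert ui.focused() == before      # ...must put the user back where they were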
The one that really convinced me on PBT was one in this library where it found a bug that had an explicit test and was explicitly covered in the spec - but the spec was inconsistent! The spec was broken, and nobody had noticed.
Another property that drove out a lot of bugs was similar: regardless of how many UI changes we made and how many movements the user made, we always had something in focus.
Anyway, the big thing I want to stress here is making a series of API calls and asserting something at the end, or all the way through.
Side note - oh my this is so long ago, 15 years ago building a new PBT tool in actionscript
JSON.parse(JSON.stringify(randomObject)) === randomObject
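The Python/Hypothesis analogue of that round-trip, restricted to values JSON can actually represent:

    import json
    from hypothesis import given, strategies as st

    # null, booleans, integers and strings, nested arbitrarily in lists and objects
    json_values = st.recursive(
        st.none() | st.booleans() | st.integers() | st.text(),
        lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    )

    @given(json_values)
    def test_json_round_trip(value):
        assert json.loads(json.dumps(value)) == value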
That often works. What also often works is generating the expected output and constructing the input from it. For example, a `stripPrefix` function that removes a known prefix from a string, e.g. `stripPrefix("foo", "foobar") === "bar"`. Property test: stripPrefix(randomPrefix, randomPrefix + randomSuffix) === randomSuffix
Note, we "go backwards" and generated the expected output `randomSuffix` directly and then construct the input from it `randomPrefix + randomSuffix`.Reference implementation based properties also work very often. For example, we've been developing a JavaScript rich text editor. That requires a bunch of utility functions on DOM trees that are analogous to standard string functions. For example, on a standard string you can get a char at an index with `"foo bar".charAt(3)` and on a rich text DOM tree we would need something like `treeCharAt(<U><B>foo</B> bar</U>, 3)`. The string functions can serve as a reference implementation for the more complex tree functions:
treeCharAt(randomTree, randomIndex) === extractStringContent(randomTree).charAt(randomIndex)
The same can be done with all string functions like `slice`, `indexOf`, `trim`, ...

- Despite the terrible tutorial examples, PBT isn't about running one function on an arbitrary input, then trying to think of assertions about the result. Instead, focus on ways that different parts of your production code fit together, what assumptions are being made at each point, etc.
- You don't need to plug random inputs directly into the code you're testing. There are usually very few things to say regarding truly arbitrary inputs, like `forAll(x) { foo(x) }`; but lots more to say about e.g. "inputs which don't contain Y" (so run the input through a filter first), or "inputs which don't overlap" (so remove any overlapping region first), and so on.
- Don't focus on the random inputs; the whole idea is that they're irrelevant to the statement you're asserting (it's meant to hold regardless of their value). Likewise, if your unit test contains some irrelevant details, use PBT to generate those parts instead.
- It's often useful in business-type software to think of a "sequence of actions" (which could be method calls, REST endpoints, DB queries, or whatever). For example, "any actions taken as User A will not affect the data for User B". Come up with a simple datatype to represent the actions you care about, and write a function which "interprets" those actions (i.e. a `switch` to actually call the method, or trigger the endpoint, or submit the query, or whatever). Then we can write properties which take a list of actions as input. Remember, we don't need to run truly arbitrary lists: a property might filter certain things out of the list, prepend/append some particular actions, etc.
- Once we have some assertion, look for ways to generalise it; for example by looking for places to stick extra things which should be irrelevant.
As a simple example, say we have a function like `store(key, value)`; it's hard to say much about the result of that on its own, but we can instead say how it relates to other functions, like `lookup(key)`:
forAll(key, value) {
  store(key, value);
  assertEqual(lookup(key), Some(value))
}
Yet we don't really care about lookups happening immediately after stores; we want to make a more general statement about values being persisted:

forAll(key, value, pre, suf) {
  runActions(pre)                          # Storing shouldn't be affected by anything before it
  store(key, value)
  runActions(suf.filter(notIsStore(key)))  # Do anything except storing the same key
  assertEqual(lookup(key), Some(value))
}
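A runnable Hypothesis version of that last property, with a toy in-memory `Store` standing in for the real system under test:

    from hypothesis import given, strategies as st

    class Store:
        # Toy key/value store; the real target could be a service, a DB, etc.
        def __init__(self):
            self._data = {}
        def store(self, key, value):
            self._data[key] = value
        def lookup(self, key):
            return self._data.get(key)

    keys = st.text(min_size=1)
    values = st.integers()
    # Actions are plain data: ("store", key, value) or ("lookup", key).
    actions = st.lists(st.one_of(
        st.tuples(st.just("store"), keys, values),
        st.tuples(st.just("lookup"), keys),
    ))

    def run_actions(s, acts):
        # Interpret each action against the store under test.
        for a in acts:
            if a[0] == "store":
                s.store(a[1], a[2])
            else:
                s.lookup(a[1])

    @given(keys, values, actions, actions)
    def test_stored_value_persists(key, value, pre, suf):
        s = Store()
        run_actions(s, pre)                      # anything before shouldn't matter
        s.store(key, value)
        run_actions(s, [a for a in suf           # anything after is fine...
                        if not (a[0] == "store" and a[1] == key)])  # ...except re-storing the key
        assert s.lookup(key) == value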
That last test style you describe can be done with Hypothesis. I've had some good success testing both Python programs and programs written in other languages that could be driven from Python with it. Like a server using gRPC (or CORBA once) as an interface, driven by tests written in Python imitating client behavior.
- "Metamorphic testing" is where analyze how code changes with changing inputs. For example, adding more filters to a query should return a strict subset of the results, or if a computer vision system recognizes a person, it should recognize the same person if you tilt the image.
- Creating a simplified model of the code, and then comparing the code implementation to the model, a la https://matklad.github.io/2024/07/05/properly-testing-concur... or https://johanneslink.net/model-based-testing
There's also this paper, which I haven't read yet but seems intriguing: https://andrewhead.info/assets/pdf/pbt-in-practice.pdf
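A small sketch of the metamorphic idea from the first bullet, with a toy `search` function standing in for a real query endpoint:

    from hypothesis import given, strategies as st

    def search(records, min_age=None, name_contains=None):
        # Toy query: filter a list of records by optional criteria.
        out = [r for r in records if min_age is None or r["age"] >= min_age]
        return [r for r in out if name_contains is None or name_contains in r["name"]]

    records = st.lists(st.fixed_dictionaries({
        "age": st.integers(min_value=0, max_value=120),
        "name": st.text(),
    }))

    @given(records, st.integers(min_value=0, max_value=120), st.text(max_size=3))
    def test_extra_filter_narrows_results(rs, min_age, fragment):
        broad = search(rs, min_age=min_age)
        narrow = search(rs, min_age=min_age, name_contains=fragment)
        # Metamorphic relation: adding a filter can only remove results.
        assert all(r in broad for r in narrow)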
“Civilization advances by extending the number of important operations which we can perform without thinking about them.”
We can do property based testing with less thinking than if we try to prove correctness. We are exploiting the ability of the computer to run millions or even billions of tests. It's the enormous power of today's computers that enables this to work.
The theorem prover approach would work only if it could be automatic: press a button, get the proof (after some acceptable delay). Otherwise, look at all the expensive manual effort you just signed up for.
You might think that as computers get faster, theorem proving becomes easier. But testing becomes easier also. It's not clear testing will ever lose this race.
Proving code is only as good as the requirements, which are often garbage - the customer often doesn't know what they even want. Even if you put in the effort, requirements in the form a proof needs are often far removed from the customer's requirements, so your program can be proved correct but still be wrong because it doesn't do what the customer really wanted. In any complex program it is reasonable to state that several requirements are wrong, and thus even if you prove your code correct it will be wrong. Often the problem itself cannot even be formally defined - a spell checker cannot be proved correct because human languages are not formally defined; it's not that you can't prove one, just that whatever you prove will be wrong.
Many systems are very complex. You can (should!) prove simple algorithms, but put everything together and a proof is not something we can do at all. There are too many halting-problem-like things in large programs.
Tests solve some of the above problems: they can (do not confuse this with what they actually do!) be a simple example of "yes, when the inputs are exactly x, y, z then I expect this result". A bunch of simple examples that make sense can often be close enough.
We do a lot more theorem proving than most people realize. Types, which many languages have, are a form of formal proof. They don't cover everything, but even in C++ they cover a lot of issues.
I think the best answer is a combination: prove the things we know how to prove, and test the rest and hope that between the two we have covered enough to prevent bugs.
But "is commutative" is just an example here (one of the topics of this post is how simplistic examples can mislead people as to the usefulness of a given verification technique).
The general point of software verification is to ensure the software "does what I want". But in a very large proportion of cases, people aren't clear on precisely what they want. They could not use a formal method because they could not write a formal specification. A nice thing about unit tests is that you can work through your expectations iteratively and incrementally, broadening and deepening your understanding of exactly what the software should do, capturing each insight along the way in a reusable way.
Here's a few reasons to use testing strategies:
1. Your developers might not be mathematicians.
2. Your system or its properties might be hard to specify mathematically. One can often design a test for such properties.
3. Your functions that are easy to model mathematically might also have side effects or environmental dependencies due to other requirements (eg performance, legacy).
4. Your specifications and code might get out of sync at some point. If it does, people will think the code has properties that it doesn't. That can poison the verification all the way up the proof chain.
5. Mathematical modeling or proof might take much, much longer to find the bug than a code review or testing. That is, it's a waste of money.
6. Your mathematical tools might have errors that cause a false claim of correctness. Diverse assurance methods catch errors like this. Also, testing often uses the most widely-used parts of a programming language. Those constructs are highly likely to be compiled correctly, unlike the esoteric methods or tools used in formally-proven systems.
7. Automated testing that, in some way, searches through your execution paths can find problems your team never thought of. Fuzzing is the most common technique. However, there are many methods of automated test generation.
Your best bet is to use code reviews, Design-by-Contract, static analyzers for common problems, contract/property-based generation of tests, fuzzing with contracts as runtime checks, and manual tests for anything hard to specify.
Don't waste time on formal verification at all unless it's a high-value asset that's worth it. If you do, first attempt it with tools like SPARK Ada and Frama-C, which have high automation. Also, if you fail at full correctness, you might still be able to prove the absence of certain categories of runtime errors.
The funny thing is that the parsing library was correct and it was the test property that was wrong—but I still learned about an edge case I had never considered!
This has been a common pattern for "simpler" property-based tests I've written: I write a test, it fails right away, and it turns out that my code is fine, but my property was wrong. And this is almost always useful; if I write an incorrect property, it means that some assumption I had about my code is incorrect, so the exercise directly improves my conceptual model of whatever I'm doing.
And what do they do about it? They add more emoji.
Besides, even if it's justified, it's still sections 7.1-A to 7.3-D of hell.
1) lightweight, because most of our test suites run on production infrastructure and we can’t afford to run them constantly
2) "creative", to find bugs we hadn’t considered before
Probabilistic test scenarios allow us to increase the surface we're testing without needing to exhaustively test every scenario.
It's been said (I think by Hughes?) that the causes of property failures tend to be spread equally between buggy code, buggy property and buggy generator.
For then it's not clear how one would derive the answer from the generated inputs; that is what the code is for.
But PBT can be great for pruning out crashes you don't expect while parsing.
Not quite sure what you mean by "only one answer per input" (that it's a function, i.e. a 1:1 mapping?), but there are lots of properties that aggregations might typically need to satisfy, e.g. off the top of my head:
# Identity element
forAll(pre, post) {
  assertEqual(
    agg(pre ++ [agg([])] ++ post),
    agg(pre ++ post)
  )
}

# Invariant to order
forAll(elems, seed) {
  assertEqual(
    agg(elems),
    agg(permute(elems, seed))
  )
}

# Left-associative
forAll(xs, ys, zs) {
  assertEqual(
    agg([agg(xs ++ ys)] ++ zs),
    agg(xs ++ ys ++ zs)
  )
}

# Right-associative
forAll(xs, ys, zs) {
  assertEqual(
    agg(xs ++ [agg(ys ++ zs)]),
    agg(xs ++ ys ++ zs)
  )
}
(FYI, these are typical properties of a (commutative) monoid, which is an algebraic structure that describes many "aggregation-like" operations.)

An aggregate on discrete values may have the property that the output is one of the elements of the input.
It may also have a no-NaN property, or maybe no-NaN unless there is a NaN in the input.
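For instance, a sketch of those two properties against a toy `median` aggregate:

    import math
    from hypothesis import given, strategies as st

    def median(xs):
        # Toy aggregate under test: middle element of the sorted input.
        return sorted(xs)[len(xs) // 2]

    @given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
    def test_median_properties(xs):
        m = median(xs)
        assert m in xs            # the output is one of the input elements
        assert not math.isnan(m)  # no NaN when the input has none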
There's a bunch of these kinds of patterns (the above was inspired by [0]) that are useful in practice, but unfortunately rarely talked about. I suppose that's because most people end up replicating their TDD workflows and just throwing more randomness at it, instead of throwing more properties at their code.
[0] https://fsharpforfunandprofit.com/posts/property-based-testi...
In practice sum is a sufficiently well-understood function that this property will only catch the edge cases people know about up-front (integer overflow, floating point issues...). But for more complex cases, this kind of property will catch problems you didn't think about. And even if you decide that the bug is not important—sometimes we have no real choice but to live with these edge cases—at least you'll know about them explicitly and be able to document them.
I don't find this to be true at all. Most bugs I find are business scenarios that I didn't consider or mismatches in API expectations etc. Rarely does a bug, for me, come from not considering edge cases like minimum and maximum values for integers, floats, etc.