The complexity here is the complete opposite of the simple toy examples. What are the edge cases of an optimizing compiler? How do you even approach them, if they're buried deep in a chain of transformations?
The properties are simple things like "the compiler shouldn't crash, the compiled code shouldn't crash, and code compiled with different optimization levels should do the same thing." This assumes the randomly generated code doesn't touch undefined behavior in the language spec.
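A rough sketch of that differential property, with `generate_random_program` and `compile_and_run(source, opt_level)` as hypothetical stand-ins for the real tooling:

    import random

    def check_optimization_levels(generate_random_program, compile_and_run, runs=1000):
        # Differential property: a UB-free random program must not crash the
        # compiler, must not crash at runtime, and must behave identically at
        # every optimization level.
        for _ in range(runs):
            source = generate_random_program(seed=random.randrange(2**32))
            baseline = compile_and_run(source, opt_level=0)
            for level in (1, 2, 3):
                assert compile_and_run(source, opt_level=level) == baseline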
Here's a recent example of a bug found by this approach. The Common Lisp code triggering the bug has been automatically minimized: https://bugs.launchpad.net/sbcl/+bug/2109837 with the bug fix https://sourceforge.net/p/sbcl/sbcl/ci/1abebf7addda1a43d6d24...
Seems like a unit test tests a specific input and property testing tests many inputs.
Also, if you never look at the test data, it might give you false confidence about how thoroughly your code is being tested.
Set the expectations, model the expected behavior, and verify that the system matches that expectation. The approach works at all testing scales.
With property based testing, it actually CAN be 9 passes and 1 failure, because that single failure can be hitting an edge case the other runs just aren't. In fact, seeing only a few failures is more likely than seeing every run fail.
I can tell you from experience that random failures cause loss of trust. People learn to ignore failures and just keep hitting rebuild until the tests pass. People will not investigate test failures in code that they don't understand.
But yeah, probabilistic testing isn't perfect.
Allowing reproducibility for a given change set.
Didn't even enter my mind.
Hypothesis has a nice option to pick the seed for each property by hashing that property's code. It's a nice idea, but it relies on Python's highly dynamic nature, so it may not be easy or possible in other languages (especially compiled ones).
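Presumably this is the `derandomize` setting; a minimal sketch of how it's used:

    from hypothesis import given, settings, strategies as st

    @settings(derandomize=True)  # deterministic: data derived from the test itself rather than a random seed
    @given(st.text(), st.text())
    def test_concat_length(a, b):
        assert len(a + b) == len(a) + len(b)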
The co-worker should add the test to a new branch / test it on main. If it fails, that's a new ticket (with the great side effect of having a failing test). If it passes, it's a problem in their branch. If not, it's the same as having a broken main, which happens anyway and you deal with that as you usually do.
You can have a PRNG seeded from something like the commit hash and have PRNGs generate the test cases. That can still fail on 1 in 10 tests for a particular run.
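A toy sketch of that setup, assuming the test harness can shell out to git:

    import random
    import subprocess

    # Seed a PRNG from the current commit hash so every run against the same
    # change set generates the same test cases.
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    rng = random.Random(commit)
    test_inputs = [rng.randint(-2**31, 2**31 - 1) for _ in range(100)]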
The most memorable discussion I had around PBT was with a colleague (a skip report) who saw "true" randomness as a net benefit and that reproducibility was not a critical characteristic of the test suite (I guess the reasoning was then it could catch things at a later date?). To be honest, it scared the hell out of me and I pushed back pretty hard on them and the broader team.
I have no issue with a pseudo-random set of test cases that are declaratively generated. That makes sense if that is what is meant by PBT, since it is just a more efficient way of testing (and you would assume it would allow you to cast a wider net).
The idea is you have random testing and the test failures are added as explicit tests that then always get run.
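In Hypothesis, for instance, those discovered failures can be pinned with the `@example` decorator so they always run alongside the random cases; a small sketch with a made-up round-trip property:

    from hypothesis import example, given, strategies as st

    @given(st.floats(allow_nan=False))
    @example(0.0)       # previously-found random failures, now pinned
    @example(-1e308)    # so they run on every build
    def test_float_string_round_trip(x):
        assert float(str(x)) == x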
Is that so different from someone else testing?
The main downside is that you stumble across a new issue in an unrelated branch, but that's not wildly different from doing so while using your application.
I've been there. It takes many hours to try to guess where the system went wrong to produce the undesirable result, and then you still might not be sure if you are looking at the right place, and then there are always environment issues you aren't sure of. So you don't know who to blame.
Only very simple systems will reproduce pathological results 100% of the time given some initial conditions. The bane of complex systems is the timeouts. They are usually very hard to justify and are easy to blame for undesirable behavior.
To test an entire system with reproducible failures, you probably need something more heavyweight like Antithesis. Property tests are more useful for unit tests.
It seems to be almost totally forgotten, since the only link I could find is an excerpt from a PalmOS programming book:
https://www.oreilly.com/library/view/palm-programming-the/15...
In bigger programs this is an outright necessity because pure random fuzzing would basically be a lottery.
I've always felt that unit testing frameworks and libraries and even parameterized testing were missing this kind of functionality.
IntelliJ is able to run my tests and figure out the code coverage, but why isn't it closing the loop and auto-fuzzing/auto-discovering how to mutate tests to cover more?
And don't point me at AI, none of this requires AI and nothing should have to "think" to do this.
It's crazy to me that the vast majority of code running all the time is not exhaustively tested through almost all of its possible state space with most of its possible input space. It's not like we are lacking the CPU bandwidth to do it.
Why can't I write a new function and have something tell me within ten minutes "this input param causes an exception" without any effort from me? Instead all those extra cores in my CPU just run javascript trash and crowdstrike scanners
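The closest off-the-shelf approximation today is a plain "doesn't throw" property; a sketch, where the function under test is passed in as a parameter:

    from hypothesis import given, strategies as st

    def check_no_crash(fn, inputs=st.text()):
        # Minimal "auto-fuzz": assert only that fn doesn't raise for any generated input.
        @given(inputs)
        def prop(x):
            fn(x)  # any uncaught exception fails, and Hypothesis shrinks the input
        prop()

    # e.g. check_no_crash(my_new_function)   # hypothetical function under test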
https://developer.android.com/studio/test/other-testing-tool...
In practice, property based testing fails because the organization is not actually interested in delivering correct code. "This bug will never happen in practice so we won't fix it." "If we fix this, we may change some incorrect behavior some customer is depending on." And once that happens, PBT is useless, because it will keep finding that "don't fix" bug over and over.
In that case, we update the properties to reflect the new spec.
It found so many bugs: file corruption, crashes, memory leaks, pathological performance issues. The kind of issues that standard unit testing doesn't find.
> Without complex input spaces, there's no explosion of edge cases, which minimizes the actual benefit of PBT.

The real benefits come when you have complex input spaces. Unfortunately, you need to be good at PBT to write complex input strategies. I wrote a bit about it here...
Here's the link: https://www.hillelwayne.com/post/property-testing-complex-in...
If it's something valid for a user to do, you can make a list of those actions and have the test run them in sequence.
I had this for a UI library. It could call the functions to create and add to the UI and then afterwards would move through it. It was for the BBC, so it ran on TVs and could move up/down/left/right - the logic was: regardless of the UI, if you moved right and the focus changed, then moving left should bring you back to where you were (the same for up/down, etc.).
That's tricky, yet being able to write
FOR ANY ui a person can construct
FOR ANY path a user takes through it
WHEN a user presses R
AND the focus changes
AND the user then presses L
THEN the user is on the item they were on before
was actually quite easy and yet insanely powerful.
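A rough Hypothesis-style sketch of that property (here `uis` is assumed to be a strategy that builds an arbitrary UI for the library under test, and `ui.move()` / `ui.focused()` are hypothetical APIs):

    from hypothesis import given, strategies as st

    moves = st.sampled_from(["up", "down", "left", "right"])

    @given(uis, st.lists(moves))               # uis: assumed strategy for "any ui a person can construct"
    def test_right_then_left_returns_focus(ui, path):
        for move in path:                      # any path a user takes through it
            ui.move(move)
        before = ui.focused()
        ui.move("right")                       # the user presses R
        if ui.focused() != before:             # and the focus changes
            ui.move("left")                    # then pressing L...
            assert ui.focused() == before      # ...must put the user back where they were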
The one that really convinced me on PBT was one in this library where it found a bug that had an explicit test and was explicitly covered in the spec - but the spec was inconsistent! The spec was broken, and nobody had noticed.
Another property that drove out a lot of bugs was similar: regardless of how many UI changes we made and how many movements the user made, we always had something in focus.
Anyway, the big thing I want to stress here is making a series of API calls and asserting something at the end, or all the way through.
Side note - oh my this is so long ago, 15 years ago building a new PBT tool in actionscript
JSON.parse(JSON.stringify(randomObject)) === randomObject
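The Python/Hypothesis analogue of that round-trip, restricted to values JSON can actually represent:

    import json
    from hypothesis import given, strategies as st

    # null, booleans, integers and strings, nested arbitrarily in lists and objects
    json_values = st.recursive(
        st.none() | st.booleans() | st.integers() | st.text(),
        lambda children: st.lists(children) | st.dictionaries(st.text(), children),
    )

    @given(json_values)
    def test_json_round_trip(value):
        assert json.loads(json.dumps(value)) == value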
That often works. What also often works is generating the expected output and constructing the input from it. For example, a `stripPrefix` function that removes a known prefix from a string, e.g. `stripPrefix("foo", "foobar") === "bar"`. Property test: stripPrefix(randomPrefix, randomPrefix + randomSuffix) === randomSuffix
Note, we "go backwards" and generated the expected output `randomSuffix` directly and then construct the input from it `randomPrefix + randomSuffix`.Reference implementation based properties also work very often. For example, we've been developing a JavaScript rich text editor. That requires a bunch of utility functions on DOM trees that are analogous to standard string functions. For example, on a standard string you can get a char at an index with `"foo bar".charAt(3)` and on a rich text DOM tree we would need something like `treeCharAt(<U><B>foo</B> bar</U>, 3)`. The string functions can serve as a reference implementation for the more complex tree functions:
treeCharAt(randomTree, randomIndex) === extractStringContent(randomTree).charAt(randomIndex)
The same can be done with all string functions like `slice`, `indexOf`, `trim`, ...

- Despite the terrible tutorial examples, PBT isn't about running one function on an arbitrary input, then trying to think of assertions about the result. Instead, focus on ways that different parts of your production code fit together, what assumptions are being made at each point, etc.
- You don't need to plug random inputs directly into the code you're testing. There are usually very few things to say regarding truly arbitrary inputs, like `forAll(x) { foo(x) }`; but lots more to say about e.g. "inputs which don't contain Y" (so run the input through a filter first), or "inputs which don't overlap" (so remove any overlapping region first), and so on.
- Don't focus on the random inputs; the whole idea is that they're irrelevant to the statement you're asserting (it's meant to hold regardless of their value). Likewise, if your unit test contains some irrelevant details, use PBT to generate those parts instead.
- It's often useful in business-type software to think of a "sequence of actions" (which could be method calls, REST endpoints, DB queries, or whatever). For example, "any actions taken as User A will not affect the data for User B". Come up with a simple datatype to represent the actions you care about, and write a function which "interprets" those actions (i.e. a `switch` to actually call the method, or trigger the endpoint, or submit the query, or whatever). Then we can write properties which take a list of actions as input. Remember, we don't need to run truly arbitrary lists: a property might filter certain things out of the list, prepend/append some particular actions, etc.
- Once we have some assertion, look for ways to generalise it; for example by looking for places to stick extra things which should be irrelevant.
As a simple example, say we have a function like `store(key, value)`; it's hard to say much about the result of that on its own, but we can instead say how it relates to other functions, like `lookup(key)`:
forAll(key, value) {
  store(key, value);
  assertEqual(lookup(key), Some(value))
}
Yet we don't really care about lookups happening immediately after stores; we want to make a more general statement about values being persisted:

forAll(key, value, pre, suf) {
  runActions(pre)                          # Storing shouldn't be affected by anything before it
  store(key, value)
  runActions(suf.filter(notIsStore(key)))  # Do anything except storing the same key
  assertEqual(lookup(key), Some(value))
}
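A runnable Hypothesis version of that last property, with a toy in-memory `Store` standing in for the real system under test:

    from hypothesis import given, strategies as st

    class Store:
        # Toy key/value store; the real target could be a service, a DB, etc.
        def __init__(self):
            self._data = {}
        def store(self, key, value):
            self._data[key] = value
        def lookup(self, key):
            return self._data.get(key)

    keys = st.text(min_size=1)
    values = st.integers()
    # Actions are plain data: ("store", key, value) or ("lookup", key).
    actions = st.lists(st.one_of(
        st.tuples(st.just("store"), keys, values),
        st.tuples(st.just("lookup"), keys),
    ))

    def run_actions(s, acts):
        # Interpret each action against the store under test.
        for a in acts:
            if a[0] == "store":
                s.store(a[1], a[2])
            else:
                s.lookup(a[1])

    @given(keys, values, actions, actions)
    def test_stored_value_persists(key, value, pre, suf):
        s = Store()
        run_actions(s, pre)                      # anything before shouldn't matter
        s.store(key, value)
        run_actions(s, [a for a in suf           # anything after is fine...
                        if not (a[0] == "store" and a[1] == key)])  # ...except re-storing the key
        assert s.lookup(key) == value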
That last test style you describe can be done with Hypothesis. I've had some good success testing both Python programs and programs written in other languages that could be driven from Python with it. Like a server using gRPC (or CORBA once) as an interface, driven by tests written in Python imitating client behavior.
- "Metamorphic testing" is where analyze how code changes with changing inputs. For example, adding more filters to a query should return a strict subset of the results, or if a computer vision system recognizes a person, it should recognize the same person if you tilt the image.
- Creating a simplified model of the code, and then comparing the code implementation to the model, a la https://matklad.github.io/2024/07/05/properly-testing-concur... or https://johanneslink.net/model-based-testing
There's also this paper, which I haven't read yet but seems intriguing: https://andrewhead.info/assets/pdf/pbt-in-practice.pdf
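A small sketch of the metamorphic idea from the first bullet, with a toy `search` function standing in for a real query endpoint:

    from hypothesis import given, strategies as st

    def search(records, min_age=None, name_contains=None):
        # Toy query: filter a list of records by optional criteria.
        out = [r for r in records if min_age is None or r["age"] >= min_age]
        return [r for r in out if name_contains is None or name_contains in r["name"]]

    records = st.lists(st.fixed_dictionaries({
        "age": st.integers(min_value=0, max_value=120),
        "name": st.text(),
    }))

    @given(records, st.integers(min_value=0, max_value=120), st.text(max_size=3))
    def test_extra_filter_narrows_results(rs, min_age, fragment):
        broad = search(rs, min_age=min_age)
        narrow = search(rs, min_age=min_age, name_contains=fragment)
        # Metamorphic relation: adding a filter can only remove results.
        assert all(r in broad for r in narrow)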
“Civilization advances by extending the number of important operations which we can perform without thinking about them.”
We can do property based testing with less thinking than if we try to prove correctness. We are exploiting the ability of the computer to run millions or even billions of tests. It's the enormous power of today's computers that enables this to work.
The theorem prover approach would work only if it could be automatic: press a button, get the proof (after some acceptable delay). Otherwise, look at all the expensive manual effort you just signed up for.
You might think that as computers get faster, theorem proving becomes easier. But testing becomes easier also. It's not clear testing will ever lose this race.
Proving code is only as good as the requirements, which are often garbage - the customer often doesn't know what they even want. Even if you put in the effort, requirements in the form a proof needs are often far removed from the customer's requirements, so your program can be proved correct but still be wrong because it doesn't do what the customer really wanted. In any complex program it is reasonable to state that several requirements are wrong, and thus even if you prove your code correct it will be wrong. Often the problem itself cannot even be formally defined - a spell checker cannot be proved correct because human languages are not formally defined; it's not that you can't prove one, just that whatever you prove will be wrong.
Many systems are very complex. You can (should!) prove simple algorithms, but put everything together and a proof is not something we can do at all. There are too many halting-problem-like things in large programs.
Tests solve some of the above problems: they can (do not confuse this with what they actually do!) be a simple example of "yes, when the inputs are exactly x, y, z then I expect this result". A bunch of simple examples that make sense can often be close enough.
We do a lot more theorem proving than most people realize. Types, which many languages have, are a form of formal proof. They don't cover everything, but even in C++ they cover a lot of issues.
I think the best answer is a combination: prove the things we know how to prove, and test the rest and hope that between the two we have covered enough to prevent bugs.
But "is commutative" is just an example here (one of the topics of this post is how simplistic examples can mislead people as to the usefulness of a given verification technique).
The general point of software verification is to ensure the software "does what I want". But in a very large proportion of cases, people aren't clear on precisely what they want. They could not use a formal method because they could not write a formal specification. A nice thing about unit tests is that you can work through your expectations iteratively and incrementally, broadening and deepening your understanding of exactly what the software should do, capturing each insight along the way in a reusable way.
Here's a few reasons to use testing strategies:
1. Your developers might not be mathematicians.
2. Your system or its properties might be hard to specify mathematically. One can often design a test for such properties.
3. Your functions that are easy to model mathematically might also have side effects or environmental dependencies due to other requirements (eg performance, legacy).
4. Your specifications and code might get out of sync at some point. If it does, people will think the code has properties that it doesn't. That can poison the verification all the way up the proof chain.
5. Mathematical modeling or proof might take much, much longer to find the bug than a code review or testing. That is, it's a waste of money.
6. Your mathematical tools might have errors that cause a false claim of correctness. Diverse assurance methods catch errors like this. Also, testing often uses the most widely-used parts of a programming language. Those constructs are highly likely to be compiled correctly, unlike the esoteric methods or tools used in formally-proven systems.
7. Automated testing that, in some way, searches through your execution paths can find problems your team never thought of. Fuzzing is the most common technique. However, there are many methods of automated test generation.
Your best bet is to use code reviews, Design-by-Contract, static analyzers for common problems, contract/property-based generation of tests, fuzzing with contracts as runtime checks, and manual tests for anything hard to specify.
Don't waste time on formal verification at all unless it's a high-value asset that's worth it. If you do, first attempt it with tools like SPARK Ada and Frama-C, which have high automation. Also, if you fail at full correctness, you might still be able to prove the absence of certain categories of runtime errors.
The funny thing is that the parsing library was correct and it was the test property that was wrong—but I still learned about an edge case I had never considered!
This has been a common pattern for "simpler" property-based tests I've written: I write a test, it fails right away, and it turns out that my code is fine, but my property was wrong. And this is almost always useful; if I write an incorrect property, it means that some assumption I had about my code is incorrect, so the exercise directly improves my conceptual model of whatever I'm doing.
And what do they do about it? They add more emoji.
Besides, even if it's justified, it's still sections 7.1-A to 7.3-D of hell.
1) lightweight, because most of our test suites run on production infrastructure and we can’t afford to run them constantly
2) "creative", to find bugs we hadn’t considered before
Probabilistic test scenarios allow us to increase the surface we're testing without needing to exhaustively test every scenario.
It's been said (I think by Hughes?) that the causes of property failures tend to be spread equally between buggy code, buggy property and buggy generator.
For then it's not clear how one would derive the answer from the generated inputs; that is what the code is for.
But PBT can be great for pruning out crashes you don't expect while parsing.
Not quite sure what you mean by "only one answer per input" (that it's a function, i.e. a 1:1 mapping?), but there are lots of properties that aggregations might typically need to satisfy, e.g. off the top of my head:
# Identity element
forAll(pre, post) {
  assertEqual(
    agg(pre ++ [agg([])] ++ post),
    agg(pre ++ post)
  )
}

# Invariant to order
forAll(elems, seed) {
  assertEqual(
    agg(elems),
    agg(permute(elems, seed))
  )
}

# Left-associative
forAll(xs, ys, zs) {
  assertEqual(
    agg([agg(xs ++ ys)] ++ zs),
    agg(xs ++ ys ++ zs)
  )
}

# Right-associative
forAll(xs, ys, zs) {
  assertEqual(
    agg(xs ++ [agg(ys ++ zs)]),
    agg(xs ++ ys ++ zs)
  )
}
(FYI, these are typical properties of a (commutative) monoid, which is an algebraic structure that describes many "aggregation-like" operations.)

An aggregate on discrete values may have the property that the output is one of the elements of the input.
It may also have a no-NaN property, or maybe no-NaN unless there is a NaN in the input.
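For instance, a sketch of those two properties against a toy `median` aggregate:

    import math
    from hypothesis import given, strategies as st

    def median(xs):
        # Toy aggregate under test: middle element of the sorted input.
        return sorted(xs)[len(xs) // 2]

    @given(st.lists(st.floats(allow_nan=False, allow_infinity=False), min_size=1))
    def test_median_properties(xs):
        m = median(xs)
        assert m in xs            # the output is one of the input elements
        assert not math.isnan(m)  # no NaN when the input has none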
There's a bunch of these kinds of patterns (the above was inspired by [0]) that are useful in practice, but unfortunately rarely talked about. I suppose that's because most people end up replicating their TDD workflows and just throwing more randomness at it, instead of throwing more properties at their code.
[0] https://fsharpforfunandprofit.com/posts/property-based-testi...
In practice sum is a sufficiently well-understood function that this property will only catch the edge cases people know about up-front (integer overflow, floating point issues...). But for more complex cases, this kind of property will catch problems you didn't think about. And even if you decide that the bug is not important—sometimes we have no real choice but to live with these edge cases—at least you'll know about them explicitly and be able to document them.
I don't find this to be true at all. Most bugs I find are business scenarios that I didn't consider or mismatches in API expectations etc. Rarely does a bug, for me, come from not considering edge cases like minimum and maximum values for integers, floats, etc.