An issue I have with a lot of unit tests is they are too strongly coupled to the implementation. What that means is any change to the implementation ultimately means you have to change tests.
IMO, good tests are relatively immutable. You should be able to have multiple valid implementations. You should add new tests to describe the new functionality of that implementation, but the old tests should remain relatively untouched.
If it turns out that a single change to an implementation requires you to change and update 20 tests, those are bad tests.
What I want as a dev is to immediately think "I must have broken something" when a test fails, not "I need to go fix 20 tests".
For example, let's say you have a method which sorts data.
A bad test will check "did you call this `swap` function 5 times". A good test will say "I gave the method this unsorted data set, is the data set sorted?". Heck, a good test can even say something like "was this large data set sorted in under x time". That's more tricky to do well, but still a better test than the "did you call swap the right number of times" or even worse "Did you invoke this sequence of swap calls".
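To make the sorting example concrete, here is a rough sketch in Python. `my_sort` is a hypothetical stand-in (a bubble sort, purely for illustration) and `is_sorted` is a helper made up for the test. The test only looks at the output, so swapping in a different algorithm shouldn't require touching it.

```python
from collections import Counter


def my_sort(items):
    """Hypothetical implementation under test: a simple bubble sort."""
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        for i in range(end):
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
    return result


def is_sorted(items):
    """True if every element is <= its successor."""
    return all(a <= b for a, b in zip(items, items[1:]))


def test_sorts_unsorted_data():
    data = [5, 3, 8, 1, 9, 2, 3]
    result = my_sort(data)
    # Behavioural checks only: the output is ordered and is a
    # permutation of the input. Nothing about how many swaps happened.
    assert is_sorted(result)
    assert Counter(result) == Counter(data)
```

Replacing the body of `my_sort` with a merge sort, or a call to the built-in `sorted`, would leave this test unchanged, which is the property being argued for.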
Taken to the extreme, this would mean getting rid of unit tests altogether in favor of functional and/or end-to-end testing. Which is... a strategy. I don't know if it is a good or bad strategy, but I can see it being viable for some projects.
Kent C. Dodds agrees with me. [1]
This isn't to say I see no value in unit tests, just that they should tend towards describing the function of the code under test, not the implementation.
[1] https://kentcdodds.com/blog/the-testing-trophy-and-testing-c...
If you haven't seen those problems with yours, unit tests would be useless.
The dirty little secret in CS is that unit, functional, and end-to-end tests are all the exact same thing. Watch next time someone tries to come up with definitions to separate them and you'll soon notice that they didn't actually find a difference or they invent some kind of imagined way of testing that serves no purpose and nobody would ever do.
Regardless, even if you want to believe there is a difference, the advice above isn't invalidated by any of them. It is only saying test the visible, public interface. In fact, the good testing frameworks out there even enforce that — producing compiler errors if you try to violate it.
Really it all comes down to agreeing to what terms mean within the context of a conversation. Unit, functional, and end-to-end are all weasel words, unless defined concretely, and should raise an eyebrow when someone uses them.
I agree that the boundaries may be blurred in practice, but I still think that there is distinction.
> visible, public interface
Visible to whom? A class can have public methods available to other classes, a module can have public members available to other modules, a service can have a public API that other services can call over the network, etc.
I think that the difference is the level of abstraction we operate on:
unit -> functional -> integration -> e2e
Unit is the lowest level of abstraction and e2e is the highest.
The user. Your tests are your contract with the user. Any time there is a user, you need to establish the contract with the user so that it is clear to all parties what is provided and what will not randomly change in the future. This is what testing is for.
Yes, that does mean any of classes, network services, graphical user interfaces, etc. All of those things can have users.
> Unit is the lowest level of abstraction and e2e is the highest.
There is only one 'abstraction' that I can see: Feed inputs and evaluate outputs. How does that turn into higher or lower levels?
That's made easier when writing (mostly) pure code, since the output is all we have (we're not mutating anything, or triggering other processes, etc. that would need extra checking).
I also think it's important to make sure we're checking the values we actually care about; since those might not be the literal return value of the "function under test". For example, if we're testing that some function correctly populates a table cell, I would avoid comparing the function's result against a hard-coded table, since that's prone to change over time in ways that are irrelevant. Instead, I would compare that cell of the result against a hard-coded value. (Rather than thinking about the individual values, I like to think of such assertions as relating one piece of code to another, e.g. that the "get_total" function is related to the "populate_total" function, in this way...).
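A small sketch of that table-cell idea in Python. The names `get_total` and `populate_total` come from the comment above, but their signatures and the dict-shaped table are assumptions made up for illustration.

```python
def get_total(rows):
    """Hypothetical helper: sum the 'amount' field of each row."""
    return sum(row["amount"] for row in rows)


def populate_total(table, rows):
    """Hypothetical function under test: copy the table, fill in 'total'."""
    updated = dict(table)
    updated["total"] = get_total(rows)
    return updated


def test_total_cell_matches_get_total():
    rows = [{"amount": 10}, {"amount": 32}]
    table = {"title": "Q3 report", "currency": "EUR", "total": None}

    result = populate_total(table, rows)

    # Only the cell we care about is checked, so renaming the title or
    # adding an unrelated column would not break this test.
    assert result["total"] == 42
    # The assertion can also be phrased as a relation between two
    # pieces of code rather than a literal value.
    assert result["total"] == get_total(rows)
```

Comparing `result` against a full hard-coded table would also pass today, but it would start failing as soon as any incidental cell changed.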
The reason I find this important, is that breaking a test requires us to figure out what it's actually trying to test, and hence whether it should have broken or not; i.e. is it a useful signal that requires us to change our approach (the table should look like that!), or is it noise that needs its incidental details updated (all those other bits don't matter!). That can be hard to work out many years after the test was written!
Green meaning "to the best of our knowledge, everything is good with the software" is well understood.
Using green to mean "we know that this doesn't work at all" is incredibly poor UI (EDITED from "beyond idiotic" due to feedback, my bad).
And whilst flaky tests are the most problematic for a CI system, that's because they often pass (in my experience, most flaky tests are flaky because they model situations that don't usually happen in production), so the builds are often still viable for deployment, with a caveat. If anything, tests that are known to be problematic should be marked orange.
"beyond idiotic" -> "misleading | poor UX"
(I agree it's a terrible choice, but civility matters, and strengthens your case.)
The outcome still isn't the same. CI, even when everything passes, enables other developers to build on top of your partially-built work as it becomes available. This is the real purpose of CI. Test automation is necessary, but only to keep things sane amid you continually throwing in fractionally-complete work.
I think people can get frustrated at CI when it fails, so they're explaining that that's the whole purpose of it and why it's actually a good thing.
I would personally frame it slightly differently than the author. Non-flaky CI errors: your code failed CI. Flaky CI errors: CI failed. Just to be clear, that's more precise but would never catch on, because people would simplify "your code failed CI" to "CI failed" over time, but I don't think that stops it from being an interesting way to frame it.
Or you have a concurrency issue in your production code?
If you just rerun and don't go to find out what exactly caused CI to fail, you end up at the author's conclusion:
> (but it could also just have been flaky again).
We had a codebase where about 15% of CI runs were flaky. Instead of fixing the root causes (mostly race conditions in tests and one service that would intermittently timeout), the team just added auto-retry. Three attempts before it actually reported failure. So now a genuinely broken build takes 3x longer to tell you it's broken, and the flaky stuff just gets swept under the rug.
The article's right that failure is the point, but only if someone actually investigates the failure instead of clicking retry.
> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.
> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.
One of the things I have (unwillingly!) specialized in at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they some unreliable bogeyman that can't be understood.
The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests, in the same order, makes most flakes instantly reproduce locally. Probably the next biggest reproducer is "what was the time the test ran?" and/or running it in UTC.
But once you get from "it's flakey" (and fails "seeming" "at" "random") to "it fails 100% of the time on my laptop when run this way" then it becomes easier to debug, b/c you can re-run it, attach a debugger, etc. Database sort issues (SQL is not deterministically ordered unless you ORDER BY), issues with database IDs (e.g., test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), timezones — those are probably the biggest categories of "flakes".
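For the database sort category, a minimal sketch using Python's built-in `sqlite3` module. The `users` table and the `fetch_names` and `make_db` functions are invented for the example. The idea is that the query under test states the order it promises, so the test is not relying on whatever order the engine happens to return.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database with rows inserted out of order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [("carol",), ("alice",), ("bob",)],
    )
    return conn


def fetch_names(conn):
    """Hypothetical query under test. Without ORDER BY, row order is unspecified."""
    rows = conn.execute("SELECT name FROM users ORDER BY name").fetchall()
    return [name for (name,) in rows]


def test_fetch_names_is_ordered():
    conn = make_db()
    assert fetch_names(conn) == ["alice", "bob", "carol"]
```

If the order genuinely doesn't matter to callers, the alternative is to leave the query alone and compare `sorted(...)` results (or sets) in the test instead.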
While I know what people express with "flake", "flake" as a word is usually "failure mode I don't understand".
(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)
(¹there are a lot of reasons people don't have deterministically ordered CI runs. Parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own changes to the order, as sometimes we rebalance one test to a different shard as devs add or remove tests.)
²this isn't usually because the ID is hardcoded, it is usually b/c, in the test, someone is doing `assert Foo.id == Bar.id`, unknowingly. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion, and it's basically weakly-typed IDs in langs where all IDs are just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
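As a rough idea of what distinct `FooId` and `BarId` types could look like, here is a sketch in Python using frozen dataclasses. Python is an assumption here (the language in question isn't named), and this only gives a runtime distinction. A static checker would be needed to reject the mix-up before the code runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FooId:
    value: int


@dataclass(frozen=True)
class BarId:
    value: int


def ids_match(a, b):
    """Compare two IDs; wrappers of different types never compare equal."""
    return a == b


# Raw integers: a Foo's ID and a Bar's ID "match" whenever the numbers
# happen to coincide, which is the accidental pass described above.
raw_foo_id, raw_bar_id = 3, 3
accidental_match = ids_match(raw_foo_id, raw_bar_id)

# Wrapped IDs: same number, different type, so the comparison is False
# regardless of which rows happened to be created first.
typed_match = ids_match(FooId(3), BarId(3))
same_type_match = ids_match(FooId(3), FooId(3))
```

Here `accidental_match` is `True`, `typed_match` is `False`, and `same_type_match` is `True`, so a test that compares the wrong pair of IDs fails consistently instead of passing most of the time.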
Interesting. In my experience, it is always either a concurrency issue in the program under test or PBTs finding some extreme edge case that was never visited before.
We do also have some genuinely flaky tests, but it's pretty tempting to hit the big "just retry" button when there's all this flakiness we can't control mixed in there.
Solid writeup. Definitely keeping in my personal notes.
I always feel obliged to point out that we can have 100% coverage without making a single assertion (beware Goodhart's law)