The purpose of Continuous Integration is to fail

https://blog.nix-ci.com/post/2026-02-05_the-purpose-of-ci-is-to-fail

44•Norfair•2d ago

Comments

jgbuddy•2h ago

This is of course true as a blanket "gotcha" headline, although I wouldn't call a failed test the CI itself failing. A real failure would be a false positive, a pass where there wasn't coverage, or a failure when there was no breaking change. Covering all of these edge cases can become as tiresome as maintaining the application in the first place (of course, this is a generalization).

chriswarbo•1h ago

> a pass where there wasn't coverage

I always feel obliged to point out that we can have 100% coverage without making a single assertion (beware Goodhart's law)
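A minimal sketch of that point, assuming a hypothetical `clamp` function (not taken from the article or any library). The first test runs every line and branch, so a coverage tool would report 100%, yet it cannot fail on a wrong result. The second test uses the same inputs and actually checks them.

```python
def clamp(value, low, high):
    """Hypothetical function under test: restrict value to [low, high]."""
    if value < low:
        return low
    if value > high:
        return high
    return value


def test_clamp_coverage_only():
    # Executes every branch, so line and branch coverage are 100%,
    # but nothing is asserted: a wrong return value goes unnoticed.
    clamp(-1, 0, 10)
    clamp(11, 0, 10)
    clamp(5, 0, 10)


def test_clamp_with_assertions():
    # Same inputs, but each result is compared against an expectation.
    assert clamp(-1, 0, 10) == 0
    assert clamp(11, 0, 10) == 10
    assert clamp(5, 0, 10) == 5
```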

n2d4•1h ago

True, but you can't have complete tests without 100% coverage. It's a necessary, but not a sufficient condition; as long as it doesn't become the sole goal, it's still a useful metric.

SAI_Peregrinus•33m ago

100% coverage is an EXPTIME problem.

chriswarbo•1h ago

I agree. The same can be said for testing too: their main purpose is to find mistakes (with secondary benefits of documenting, etc.). Whenever I see my tests fail, I'm happy that they caught a problem in my understanding (manifested either as a bug in my implementation, or a bug in my test statement).

cogman10•1h ago

This ultimately is what shapes my view of what a good test is vs a bad test.

An issue I have with a lot of unit tests is they are too strongly coupled to the implementation. What that means is any change to the implementation ultimately means you have to change tests.

IMO, good tests are relatively immutable. You should be able to have multiple valid implementations. You should add new tests to describe the new functionality of that implementation, however, the old tests should remain relatively untouched.

If it turns out that a single change to an implementation requires you to change and update 20 tests, those are bad tests.

What I want as a dev is to immediately think "I must have broken something" when a test fails, not "I need to go fix 20 tests".

For example, let's say you have a method which sorts data.

A bad test will check "did you call this `swap` function 5 times". A good test will say "I gave the method this unsorted data set, is the data set sorted?". Heck, a good test can even say something like "was this large data set sorted in under x time". That's more tricky to do well, but still a better test than the "did you call swap the right number of times" or even worse "Did you invoke this sequence of swap calls".
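As a rough sketch of the "good test" described above, assuming a hypothetical `my_sort` function: the test only looks at the output, so it keeps passing if the insertion sort below is swapped for any other correct algorithm. Nothing here counts or inspects swaps.

```python
from collections import Counter


def my_sort(items):
    """Hypothetical function under test; any correct algorithm would do."""
    result = list(items)
    # Insertion sort, chosen only for illustration.
    for i in range(1, len(result)):
        j = i
        while j > 0 and result[j - 1] > result[j]:
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result


def test_sorts_unsorted_data():
    data = [5, 3, 8, 1, 3]
    result = my_sort(data)
    # Behavioural check 1: the output is in non-decreasing order.
    assert all(a <= b for a, b in zip(result, result[1:]))
    # Behavioural check 2: the output holds exactly the same elements.
    assert Counter(result) == Counter(data)
```

Replacing the body of `my_sort` with `return sorted(items)` would leave this test untouched, which is the property being argued for.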

vova_hn2•1h ago

> IMO, good tests are relatively immutable. You should be able to have multiple valid implementations. You should add new tests to describe the new functionality of that implementation, however, the old tests should remain relatively untouched.

Taken to extreme this would mean getting rid of unit tests altogether in favor of functional and/or end-to-end testing. Which is... a strategy. I don't know if it is a good or bad strategy, but I can see it being viable for some projects.

cogman10•1h ago

If you can't tell, I actually think functional tests have a lot more value than most unit tests :)

Kent C. Dodds agrees with me. [1]

This isn't to say I see no value in unit tests, just that they should tend towards describing the function of the code under test, not the implementation.

[1] https://kentcdodds.com/blog/the-testing-trophy-and-testing-c...

marcosdumay•56m ago

The goal of unit tests is to circumvent problems with performance or specificity from functional tests.

If you haven't seen those problems with yours, unit tests would be useless.

9rx•1h ago

> Taken to extreme this would mean getting rid of unit tests altogether in favor of functional and/or end-to-end testing.

The dirty little secret in CS is that unit, functional, and end-to-end tests are all the exact same thing. Watch next time someone tries to come up with definitions to separate them and you'll soon notice that they didn't actually find a difference or they invent some kind of imagined way of testing that serves no purpose and nobody would ever do.

Regardless, even if you want to believe there is a difference, the advice above isn't invalidated by any of them. It is only saying test the visible, public interface. In fact, the good testing frameworks out there even enforce that — producing compiler errors if you try to violate it.

bluejellybean•1h ago

Yep, the 'unit' is whatever size one chooses to use. The exact same thing happens when trying to discuss microservices vs. monoliths.

Really it all comes down to agreeing to what terms mean within the context of a conversation. Unit, functional, and end-to-end are all weasel words, unless defined concretely, and should raise an eyebrow when someone uses them.

vova_hn2•1h ago

> The dirty little secret in CS is that unit, functional, and end-to-end tests are all the exact same thing.

I agree that the boundaries may be blurred in practice, but I still think that there is a distinction.

> visible, public interface

Visible to whom? A class can have public methods available to other classes, a module can have public members available to other modules, a service can have a public API that other services can call over the network, etc.

I think that the difference is the level of abstraction we operate on:

unit -> functional -> integration -> e2e

Unit is the lowest level of abstraction and e2e is the highest.

9rx•1h ago

> Visible to whom?

The user. Your tests are your contract with the user. Any time there is a user, you need to establish the contract with the user so that it is clear to all parties what is provided and what will not randomly change in the future. This is what testing is for.

Yes, that does mean any of classes, network services, graphical user interfaces, etc. All of those things can have users.

> Unit is the lowest level of abstraction and e2e is the highest.

There is only one 'abstraction' that I can see: Feed inputs and evaluate outputs. How does that turn into higher or lower levels?

skydhash•1h ago

It took me a bit of time (and two or three different views) to finally get this. That is mostly why I hardcode my values in the tests: it makes them simpler. If something fails, either the values are wrong or the algorithm of the implementation is wrong.

chriswarbo•1h ago

Comparing actual outputs against expected ones is the ideal situation, IMHO. My own preference is for property-checking; but hard-coding a few well-chosen values is also fine.
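A hand-rolled sketch of both styles, using only the standard library and a hypothetical `dedupe` function. Dedicated property-testing libraries (Hypothesis, for example) generate and shrink inputs far more cleverly; the loop below only illustrates the idea of asserting properties over many generated inputs, next to one hard-coded case.

```python
import random


def dedupe(items):
    """Hypothetical function under test: drop repeats, keep first occurrences."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def check_dedupe_properties(trials=200, seed=0):
    rng = random.Random(seed)  # fixed seed so a failure can be replayed
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        result = dedupe(data)
        # Property 1: no value appears twice in the output.
        assert len(result) == len(set(result))
        # Property 2: no value is lost or invented.
        assert set(result) == set(data)
        # Property 3: running it again changes nothing (idempotence).
        assert dedupe(result) == result


def test_dedupe_hard_coded():
    # One well-chosen literal case alongside the generated ones.
    assert dedupe([3, 1, 3, 2, 1]) == [3, 1, 2]
```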

That's made easier when writing (mostly) pure code, since the output is all we have (we're not mutating anything, or triggering other processes, etc. that would need extra checking).

I also think it's important to make sure we're checking the values we actually care about; since those might not be the literal return value of the "function under test". For example, if we're testing that some function correctly populates a table cell, I would avoid comparing the function's result against a hard-coded table, since that's prone to change over time in ways that are irrelevant. Instead, I would compare that cell of the result against a hard-coded value. (Rather than thinking about the individual values, I like to think of such assertions as relating one piece of code to another, e.g. that the "get_total" function is related to the "populate_total" function, in this way...).
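A small sketch of the table-cell example, with `get_total` and `populate_total` written here as hypothetical stand-ins (the comment names them but does not show them). The first test checks one cell against a hard-coded value; the second states the assertion as a relation between the two functions.

```python
def get_total(prices):
    """Hypothetical helper: sum of the line-item prices."""
    return sum(prices)


def populate_total(table, prices):
    """Hypothetical function under test: returns a copy of the table
    with the 'total' cell filled in."""
    updated = dict(table)
    updated["total"] = get_total(prices)
    return updated


def test_total_cell_only():
    table = {"title": "Invoice 7", "currency": "EUR", "total": None}
    result = populate_total(table, [10, 20, 12])
    # Check only the cell this test is about; the title and currency
    # can change later without breaking this test.
    assert result["total"] == 42


def test_total_relates_to_get_total():
    prices = [10, 20, 12]
    result = populate_total({"total": None}, prices)
    # The assertion as a relation between two pieces of code.
    assert result["total"] == get_total(prices)
```

Comparing `result` against a complete hard-coded table would also pass today, but it would break whenever an unrelated cell changed.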

The reason I find this important, is that breaking a test requires us to figure out what it's actually trying to test, and hence whether it should have broken or not; i.e. is it a useful signal that requires us to change our approach (the table should look like that!), or is it noise that needs its incidental details updated (all those other bits don't matter!). That can be hard to work out many years after the test was written!

skydhash•1h ago

Also agree. There are also diminishing returns with test cases, which is why I focus mainly on what I do not want to fail. The goal is not really to prove that my code works (formal verification is the tool for that), but to verify that certain failure cases will not happen. If one does, the code is not merged in.

yrjrjjrjjtjjr•1h ago

The purpose of a car's crumple zone is to crumple.

ralferoo•1h ago

The premise of the article has some weight, but the final conclusion with the suggestion to change the icons seems completely crazy.

Green meaning "to the best of our knowledge, everything is good with the software" is well understood.

Using green to mean "we know that this doesn't work at all" is incredibly poor UI (EDITED from "beyond idiotic" due to feedback, my bad).

And whilst flaky tests are the most problematic for a CI system, that's because they often pass (in my experience, most flaky tests model situations that don't usually happen in production), so the builds are often potentially viable for deployment with a caveat. If anything, tests that are known to be problematic should be marked orange.

chrisweekly•1h ago

Good insights but I'd suggest

"beyond idiotic" -> "misleading | poor UX"

(I agree it's a terrible choice, but civility matters, and strengthens your case.)

ralferoo•1h ago

Fair point, updated my wording.

Norfair•1h ago

Hey, author here: I completely agree, that's why I also haven't used those strange colours for https://nix-ci.com. I just thought they would make for a cool visual representation of the point of the blog post.

9rx•1h ago

> When it passes, it's just overhead: the same outcome you'd get without CI.

The outcome still isn't the same. CI, even when everything passes, enables other developers to build on top of your partially-built work as it becomes available. This is the real purpose of CI. Test automation is necessary, but only to keep things sane amid you continually throwing in fractionally-complete work.

cestith•1h ago

It also allows for much better record keeping than just spinning up new versions in production without the pipeline.

1970-01-01•1h ago

Oversimplified clickbait. The purpose never changed from catching bad bugs before they are sent to prod. The goal of CI is to prevent the resulting problems from doing damage and requiring emergency repairs.

bilkow•24m ago

I don't really understand the point you're trying to make. I don't see anything in the post or the title claiming the purpose changed, and the title is directly related to the content. In fact, it seems like you are just agreeing with the post.

I think people can get frustrated at CI when it fails, so they're explaining that that's the whole purpose of it and why it's actually a good thing.

I would personally frame it slightly differently than the author. Non-flaky CI errors: your code failed CI. Flaky CI errors: CI failed. Just to be clear, that's more precise but would never catch on, because people would simplify "your code failed CI" to "CI failed" over time, but I don't think that stops it from being an interesting way to frame it.

zelos•1h ago

> Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.

Or you have a concurrency issue in your production code?

cestith•1h ago

It's possibly something else nondeterministic, which may be even more subtle from an external look than a race condition. That should be rare, but it’s been known to happen.

fxwin•1h ago

I thought that line was kind of funny: When a CI run fails, you don't rerun it and wait for the result, you rerun it and check why the original run failed in the meantime. Is it flaky? Is it a pipeline issue? Connectivity issue? Did some key expire?

If you just rerun and don't go to find out what exactly caused CI to fail, you end up at the author's conclusion:

> (but it could also just have been flaky again).

Norfair•1h ago

Then the test is still flaky. If there's a bug you want the test to consistently fail, not just sometimes.

9rx•52m ago

The parent is talking about when the implementation is flaky, not the test. When you go to fix the problem under that scenario there is no reason for you to modify the test. The test is fine.

rkangel•27m ago

What you're describing is the everyday reality, but what you WANT, if your implementation has a race condition, is a test that detects the race condition 100% of the time (rather than 1% of the time).

throw_await•50m ago

But also a flaky test is a bug by itself.

globular-toast•1h ago

I think this can be generalised into saying that the purpose of tests is to fail. I've seen far too many tests that are written to pass. You need to write tests to fail.

mihir_kanzariya•1h ago

The biggest problem I've seen with CI isn't the failing part, it's what teams do when it fails. The "just rerun it" culture kills the whole point.

We had a codebase where about 15% of CI runs were flaky. Instead of fixing the root causes (mostly race conditions in tests and one service that would intermittently time out), the team just added auto-retry. Three attempts before it actually reported failure. So now a genuinely broken build takes 3x longer to tell you it's broken, and the flaky stuff just gets swept under the rug.

The article's right that failure is the point, but only if someone actually investigates the failure instead of clicking retry.

flowerbreeze•56m ago

The "just retry" approach is truly bothersome. I think it is at least partly an organizational issue, because it happens far more often when QA is a separate team.

ant6n•33m ago

If a build fails 10% of the time, it actually takes 100x longer before it fails in the 10%x10%x10% case.
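One way to read that arithmetic, as a sketch under two assumptions that are not stated in the thread: each attempt fails independently with probability 10%, and a failure is reported only when all three attempts fail (the auto-retry setup described in the parent comment).

```python
from fractions import Fraction

flake_rate = Fraction(1, 10)  # assumed chance that a single attempt fails
attempts = 3                  # failure is reported only if all attempts fail

# Probability that a flaky failure survives every retry: 10% x 10% x 10%.
reported = flake_rate ** attempts

# Expected number of CI runs before the problem is reported once.
runs_without_retry = 1 / flake_rate
runs_with_retry = 1 / reported

slowdown = runs_with_retry / runs_without_retry

print(reported)  # 1/1000
print(slowdown)  # 100
```

Under those assumptions the flaky failure surfaces once per 1,000 runs instead of once per 10, which is the 100x figure.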

jbstack•19m ago

I don't understand this at all. Why not just skip CI altogether if you're not interested in the results?

deathanatos•49m ago

> One dreaded and very common situation is when a failing CI run can be made to pass by simply re-running it. We call this flaky CI.

> Flaky CI is nasty because it means that a CI failure no longer reliably indicates that a mistake was caught. And it is doubly nasty because it is unfixable (in theory); sometimes machines just explode.

> Luckily flakiness can be detected: Whenever a CI run fails, we can re-run it. If it passes the second time, we are sure it was flaky.

One of the specialties that I have (unwillingly!) specialized in at my current company is CI flakes. Nearly all flakes, well over 90% of them, are not "unfixable", nor are they even really some bogeyman unreliable thing that can't be understood.

The single biggest change I think we made that helped was having our CI system record the order¹ in which tests are run. Rerunning the tests, in the same order, makes most flakes instantly reproduce locally. Probably the next biggest reproducer is "what was the time the test ran?" and/or running it in UTC.

But once you get from "it's flakey" (and fails "seeming" "at" "random") to "it fails 100% of the time on my laptop when run this way" then it becomes easier to debug, b/c you can re-run it, attach a debugger, etc. Database sort issues (SQL is not deterministically ordered unless you ORDER BY), issues with database IDs (e.g., test expects row ID 3, usually gets row ID 3, but some other test has bumped us to row ID 4²), timezones — those are probably the biggest categories of "flakes".
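A toy reproduction of the row-ID case, assuming a hypothetical in-memory `FakeDb` and a tiny runner in place of a real database and test framework. The fragile check passes or fails purely depending on which test ran before it, so replaying the recorded order reproduces the "flake" every time.

```python
import itertools


class FakeDb:
    """Hypothetical in-memory stand-in for a database with auto-increment IDs."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.rows = {}

    def insert(self, name):
        row_id = next(self._ids)
        self.rows[row_id] = name
        return row_id


def creates_users(db):
    db.insert("alice")
    db.insert("bob")


def order_id_fragile(db):
    row_id = db.insert("order-1")
    # Fragile: passes only if exactly two rows were inserted before it.
    assert row_id == 3


def order_id_robust(db):
    row_id = db.insert("order-1")
    # Robust: uses the ID the insert returned, whatever it is.
    assert db.rows[row_id] == "order-1"


def run_in_order(tests):
    """Run tests against one shared database, in the given (recorded) order."""
    db = FakeDb()
    results = {}
    for test in tests:
        try:
            test(db)
            results[test.__name__] = "pass"
        except AssertionError:
            results[test.__name__] = "fail"
    return results
```

With the order `[creates_users, order_id_fragile]` the fragile check passes; with `[order_id_fragile, creates_users]` it fails on every run, while `order_id_robust` passes in either position.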

While I know what people express with "flake", "flake" as a word is usually "failure mode I don't understand".

(Excluding truly transitory issues like a network failure interfering with a docker image pull, or something.)

(¹there are a lot of reasons people don't have deterministically ordered CI runs. Parallelism, for example. Our order is deterministic, b/c we made a value judgement that random orderings introduce too much chaos. But we still shard our tests across multiple VMs, and that sharding introduces its own changes to the order, as sometimes we rebalance one test to a different shard as devs add or remove tests.)

²this isn't usually because the ID is hardcoded, it is usually b/c, in the test, someone is doing `assert Foo.id == Bar.id`, unknowingly. (The code is usually not straightforward about what the ID is an ID to.) I call this ID type confusion, and it's basically weakly-typed IDs in langs where all IDs are just some i32 type. FooId and BarId types would be better, and if I had a real type system in my work's lang of choice…
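A sketch of the `FooId` / `BarId` idea in Python, with both types defined here as hypothetical frozen dataclasses. This is only a runtime check; a language with a stricter static type system could reject the comparison before the test ever runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FooId:
    value: int


@dataclass(frozen=True)
class BarId:
    value: int


# With bare integers, two unrelated IDs that happen to share a number
# compare equal, so a confused assertion passes by coincidence.
raw_foo_id, raw_bar_id = 3, 3
assert raw_foo_id == raw_bar_id

# With distinct wrapper types, the same numbers no longer compare equal:
# the generated __eq__ only compares instances of the same class.
assert FooId(3) != BarId(3)
assert FooId(3) == FooId(3)
```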

win311fwg•34m ago

> those are probably the biggest categories of "flakes".

Interesting. In my experience, it is always either a concurrency issue in the program under test or PBTs finding some extreme edge case that was never visited before.

pm215•16m ago

A fairly large category of the flaky CI jobs I see is "dodgy infrastructure". For instance one recurring type for our project is one I just saw fail this afternoon, where a gitlab CI runner tries to clone the git repo from gitlab itself and gets an HTTP 502 error. We've also had issues with "the s390 VM that does CI job running is on an overloaded host, so mostly it's fine but occasionally the VM gets starved of CPU and some of the tests time out".

We do also have some genuinely flaky tests, but it's pretty tempting to hit the big "just retry" button when there's all this flakiness we can't control mixed in there.

IshKebab•46m ago

This is stupidly obvious but you'd be surprised how many people have the attitude that competent developers should have tested their code manually before making PRs so you shouldn't need CI.

lgunsch•17m ago

Some of the other practices of CI are also important. Not explicitly mentioned by the article, but perhaps implied. CI is a lot more than just running tests on pull request. It's a whole suite of practices enabling teams to perform and ship better. Some of which include keeping branches short lived by merging back to main early and often. Keeping code ready for deployment at any time by using strategies like feature switches. This keeps the cost of shipping a feature as low as possible, avoiding issues like spending lots of time rebasing and merging long lived feature branches.

jquaint•11m ago

https://github.com/srid/nixci Is this the project or is this a completely different Nix-based CI/CD tool? I can't find a GitHub link or anything on the website.

stego-tech•11m ago

I’m one of today’s lucky 10k, because this judo-threw me with how I (didn’t) understand CI/CD. My experience with it has largely been a cumbersome add-on to existing processes that are often incredibly fragile and impossible to amend; turns out, that’s kind of the point. Understanding that it’s the equivalent of doing rocket tests on kit you expect to fail and using that to build better rockets suddenly makes its value far more recognizable, at least to my eyes.

Solid writeup. Definitely keeping in my personal notes.
