I came across this HN post that perfectly illustrates the pattern: https://news.ycombinator.com/item?id=46967724
Retries, quarantining, adding waits - these aren't fixes. They're letting CI skip errors temporarily. The real problem: there's a growing stock of broken code feeding your pipeline, and no mechanism exists to make the person who introduced it feel the pain.
Two reinforcing loops start running:
RED PIPELINE --> <RETRY> --> GREEN BUILD
^ |
| (Long-Term) |
| v
MORE FLAKINESS <---- HIDDEN BUGS ACCUMULATE
R1 "The Addiction": Every retry makes the light green. But hidden bugs accumulate underneath, making the system flakier, forcing more retries tomorrow. This is textbook "Shifting the Burden" from Donella Meadows.R2 "The Erosion": Because you don't trust CI signal, you lower standards. Because you lower standards, worse code gets merged. Signal becomes even less trustworthy. Repeat until your pipeline is a decoration.
The original post asked how QA and engineering should split responsibility. Wrong question. The right question: how do you make the pain of instability felt by the person who introduced it?
I HIT THE SAME WALL
I built CI to port 973 ROS 2 packages onto two non-officially-supported Linux distros (openEuler + openKylin, RISC-V). Zero upstream support.
v1 - Brute-force probe. Pull all 973 packages, let them break. 597 built, 214 dependency gaps and 151 failures mapped. The pipeline wasn't meant to pass. It was meant to make every hidden stock visible.
v2 - Verification engine. Probe the environment first, identify gaps before building. Stop feeding garbage into the pipeline. Build attempts dropped, success rate went up.
v3 - Incremental stock management. Isolate small batches of problems, resolve them one group at a time. Subtraction, not addition.
MY SYSTEM IS ADDICTED TOO
Here's where I punch myself in the face. My own CI has the same pattern. Virtual environments to bypass dependency conflicts. Masquerade rules that spoof package identities. Band-aids.
But I know they're band-aids. Most teams don't. They think retries are solutions. Being aware of the addiction and being consumed by it are two very different things.
I can identify the stock poisoning your pipeline. I can design the feedback loop that makes the right person feel the pain. What I can't do is force an organization to care. That's usually the real bottleneck - not the flaky tests, but the system's refusal to let anyone feel the consequences.
Repo: https://github.com/Sebastianhayashi/the_adaptive_verification_engine