If you read between the lines, the underlying problem in most of the discussion is GitHub's dominance of the code hosting space coupled with its less-than-ideal CI integration - which, while getting better, is stuck with baggage from past missteps and general API frailty.
That's why the OpenStack community built Zuul on top of Gerrit: it added a real gating system that could speculatively test multiple commits in a queue and only merge them if CI passed together. In other words, Zuul was Gerrit's version of a merge queue.
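To make the idea concrete, here's a rough Python sketch of speculative gating - not Zuul's actual code; `merge` and `run_ci` are hypothetical stand-ins for a git merge and a CI trigger:

```python
# Rough sketch of speculative gating in the spirit of Zuul - not its real implementation.
# merge() and run_ci() are hypothetical stand-ins for a git merge and a CI trigger.

def gate(queue, main_tip, merge, run_ci):
    """Test each queued change on top of every change ahead of it, so CI sees
    the exact tree that main would contain once all the earlier merges land."""
    base = main_tip
    passed = []
    for change in list(queue):
        candidate = merge(base, change)   # main + all changes ahead in the queue + this one
        if run_ci(candidate):
            passed.append(change)
            base = candidate              # later changes build on this speculative result
        else:
            queue.remove(change)          # evict the failing change...
            return gate(queue, main_tip, merge, run_ci)  # ...and retest the rest without it
    return passed                         # these can land together and main stays green
```

The key property is that a change is never tested against a stale main: it is tested against the state main will actually be in when it merges.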
> February 2 2014, 22:25 […] Thirteen years ago I worked at Cygnus/RedHat […] Ben, Frank, and possibly a few other folks on the team cooked up a system [with a] simple job: automatically maintain a repository of code that always passes all the tests.
One trickier problem is when you don't know until later that a past change was bad: perhaps slow-running performance tests show a regression, or flaky tests turn out to have been showing a real problem, or you just want the additional velocity of pipelined landings that don't wait for all the tests to finish. Or perhaps you don't want to test every change, but then when things break you need to go back and figure out which change(s) caused the issue. (My experience is at Mozilla where all these things are true, and often.) Then you have to deal with backouts: do you keep an always-green chain of commits by backing up and re-landing good commits to splice out the bad, and only fast-forwarding when everything is green? Or do you keep the backouts in the tree, which is more "accurate" in a way but unfortunate for bisecting and code archaeology?
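When a regression only surfaces after several untested or batched changes have landed, the "figure out which change caused it" step is essentially a bisect over that range. A minimal sketch, assuming a deterministic `is_good` check (flaky tests, as mentioned above, make a single pass/fail signal unreliable in practice):

```python
# Minimal bisection over a range of landed commits to find the first bad one.
# is_good(commit) is a hypothetical helper that checks out the commit and runs
# the slow test or performance check that exposed the regression.

def first_bad(commits, is_good):
    """commits is ordered oldest to newest, with commits[0] known good and
    commits[-1] known bad; returns the first commit where the test fails."""
    lo, hi = 0, len(commits) - 1          # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```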
The angle we took in the blog post focused on what was widely documented and accessible to the community (open-source tools like Bors, Homu, Bulldozer, Zuul, etc.), because those left a public footprint that other teams could adopt or build on.
It's a great reminder that many companies were solving the "keep main green" problem in parallel (some with pretty sophisticated tooling), even if it didn't make it into OSS or blog posts at the time.
You'd join the queue, and then you'd have to wait for something like 12 other people in front of you, each of whom would spend up to a couple of hours trying to get their merge branch to go green so it could go out. You couldn't really look away because it could be your turn out of nowhere - and you had to react to it like being on call, because the whole deployment process was frozen until your turn ended. Often that just meant clicking "retry" on parts of the CI process, but it was complicated; there were dependencies between sections of tests.
My opinion is that merge skew happens rarely enough not to be a major problem. And personally, I think that instead of the merge queues described in the article, it would be more beneficial overall to invest in tooling that automatically reverts broken commits on your main branch. Merging the PR into a temporary branch and running tests is a good thing, but it is overly strict to require your main branch to be fast-forwarded. You can generally set a time limit of one day or so: as long as tests pass when merging the PR onto a main branch less than one day old, you can just merge it.
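A minimal sketch of that freshness rule, approximating "main less than one day old" by the age of the PR's merge base with main (assumes git is on the PATH; the branch names and 24-hour cutoff are just illustrative):

```python
# Sketch of the "merge base no older than a day" rule suggested above.
# Assumes git is available; branch names and the 24h cutoff are illustrative.
import subprocess
import time

MAX_AGE_SECONDS = 24 * 60 * 60

def merge_base_is_fresh(pr_branch, main_branch="main"):
    """True if the PR's merge base with main was committed less than a day ago,
    i.e. the PR has been tested against a reasonably recent main."""
    base = subprocess.run(
        ["git", "merge-base", main_branch, pr_branch],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    committed_at = int(subprocess.run(
        ["git", "show", "-s", "--format=%ct", base],
        capture_output=True, text=True, check=True,
    ).stdout.strip())
    return (time.time() - committed_at) < MAX_AGE_SECONDS
```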
oftenwrong•2h ago
This blog post about choosing which commit to test is also relevant and may be of interest: https://sluongng.hashnode.dev/bazel-in-ci-part-1-commit-unde...