Forking the ecosystem of actions to plug in your cache backend isn't a good long-term solution.
We saw the same thing at Vercel. Back when we were still doing docker-as-a-service, we used k8s for both internal services and user deployments. The latter led to master deadlocks and all sorts of SRE nightmares (literally).
So I was tasked with writing a service scheduler from scratch to replace k8s. By the time we got to manhandling IP address allocations, deep into the rabbit hole, we had already written our own Redis-backed DHCP implementation and needed to insert those IPs into the firewall tables ourselves, since Docker couldn't really do much of anything concurrently.
Iptables was VERY fragile. Aside from the fact that it didn't even have a stable programmatic interface, it was also a race-condition nightmare: rules were strictly ordered, there was no system for composition or destruction-free updates (namespacing, layering, etc.), and it was just all around the worst tool for the job.
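(For what it's worth, the usual mitigation nowadays is to stop doing read-modify-write with individual iptables calls and instead take the xtables lock and apply rule sets atomically; much of this tooling didn't exist back then. A rough sketch, with a made-up address:)

```sh
# Racy: two deployers can both pass the -C check and insert duplicates,
# or interleave inserts so the final rule order is nondeterministic.
iptables -C FORWARD -d 10.0.3.7 -j ACCEPT 2>/dev/null || \
  iptables -I FORWARD 1 -d 10.0.3.7 -j ACCEPT

# Less racy: wait on the xtables lock (--wait) and apply the rules in one
# atomic commit via iptables-restore --noflush.
iptables-restore --wait 5 --noflush <<'EOF'
*filter
-A FORWARD -d 10.0.3.7 -j ACCEPT
COMMIT
EOF
```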
Unfortunately, not much else existed at the time, and given that we didn't have time to spend implementing our own kernel modules for this system, and that Docker itself had a slew of ridiculous behaviors, we ended up scrapping the project.
Learned a lot though! We were almost done, until we weren't :)
We ended up using Docker Swarm. Painless afterward
Back in the day, one of the best things about Linux was actually how good the docs were. Comprehensive man pages, stable POSIX standards, projects and APIs that have been in use since 1970, so every little quirk has been documented inside and out.
Now it seems like the entire OS has been rewritten by freedesktop, and if I'm lucky I might find some two-year-out-of-date information on the ArchLinux wiki. If I'm even luckier, that behaviour won't have been completely broken by a commit from @poettering in a minor point release.
I actually think a lot of the new stuff is really fantastic once I reverse engineer it enough to understand what it's doing. I will defend to the death that systemd is, in principle, a lot better than the ad hoc mountain of distro-specific shell scripts it replaces. PulseAudio does a lot of important things that weren't possible before, etc. But honestly it feels like nobody wants to write any docs because it's all changing too frequently, and then everything just constantly breaks, because it turns out that changing complex systems rapidly without any documentation leads to weird bugs that nobody understands.
[0] https://www.rwx.com/blog/retry-failures-while-run-in-progres...
There are ways to refactor your technology so that you don't have to suffer so much at integration and deployment time. For example, the use of containers and hosted SQL where neither is required can instantly 10x+ the complexity of deploying your software.
The last few B2B/SaaS projects I worked on had CI/CD built into the actual product. Writing a simple console app that polls SCM for commits, runs dotnet build and then performs a filesystem operation is approximately all we've ever needed. The only additional enhancement was zipping the artifacts to an S3 bucket so that we could email the link out to the customer's IT team for install in their secure on-prem instances.
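Roughly the kind of thing I mean -- a sketch only, with a made-up branch, bucket, and paths, assuming git, the .NET SDK, zip, and the AWS CLI are on the box:

```sh
#!/usr/bin/env bash
# Poll the repo, build on new commits, zip the output, push it to S3.
set -euo pipefail
REPO_DIR=/srv/build/myapp            # hypothetical checkout
BUCKET=s3://example-release-bucket   # hypothetical bucket
LAST=""
while true; do
  git -C "$REPO_DIR" fetch --quiet origin main
  HEAD=$(git -C "$REPO_DIR" rev-parse origin/main)
  if [[ "$HEAD" != "$LAST" ]]; then
    git -C "$REPO_DIR" checkout --quiet "$HEAD"
    dotnet build "$REPO_DIR" -c Release -o "$REPO_DIR/out"
    zip -r "/tmp/myapp-$HEAD.zip" "$REPO_DIR/out"
    aws s3 cp "/tmp/myapp-$HEAD.zip" "$BUCKET/"
    LAST="$HEAD"
  fi
  sleep 60
done
```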
I would propose a canary: if your proposed CI/CD process is so complicated that you couldn't write a script by hand to replicate it in an afternoon or two, you should seriously question bringing the rest of the team into that coal mine.
So, to begin with, testing is rarely prioritized. But most developer orgs eventually realize that centralized testing is necessary, or else everyone is stuck in permanent "works on my machine!" mode. When deciding to switch to automated CI, eng management is left with the build-vs-buy decision. Buy is very attractive for something that is not seriously valued anyway and that is often given away for free. There is also industry consensus pressure, which has converged on GitHub (even though GitHub is objectively bad on almost every metric besides popularity -- to be fair, the other large players are also generally bad in similar ways).

This is when the lock-in begins. What starts as a simple build file begins expanding outward. Well-intentioned developers will want to do things idiomatically for the CI tool and will start putting logic in the CI tool's DSL. The more they do this, the more invested they become and the more costly switching becomes.

The CI vendor is rarely incentivized to make things truly better once you are captive. Indeed, that would threaten their business model, where they typically sell you one of two things or both: support or CPU time. Given that business model, it is clear that they are incentivized to make their system as inefficient and difficult to use (particularly at scale) as possible while still retaining just enough customers to remain profitable.
The industry has convinced many people that it is too costly/inefficient to build your own test infrastructure, even while burning countless man-hours and CPU hours on the awful solutions presented by that same industry.
Companies like Blacksmith are smart to address the clear shortcomings in the market, though personally I find life too short to spend on GitHub Actions in any capacity.
At what point does the line between CPU time in GH Actions and CPU time in the actual production environment lose all meaning? Why even bother moving to production? You could just create a new GH action called "Production" that gets invoked at the end of the pipeline and runs perpetually.
I think I may have identified a better canary here. If the CI/CD process takes so much CPU time that we are consciously aware of the resulting bill, there is definitely something going wrong.
CPU time is cheaper than an engineer's time; you should be offloading formatting/linting/testing checks to CI on PRs. It does add up, though, when multiplied by hundreds or thousands, so it isn't a good canary.
That sounds like the biggest yikes.
On the flip side, I did something that might break Blacksmith: I used append blobs instead of block blobs. Why? ... Because it was simpler. For block blobs you have to construct this silly XML payload with the block list or whatever. With append blobs you can just keep appending chunks of data and then seal the blob when you're done. I have always wondered whether being responsible for some of GitHub Actions Cache using append blobs would ever come back to bite me, but as far as I can tell from the Azure PoV it makes very little difference; pricing seems the same, at least. But either way, they probably need to support append blobs now. Sorry :)
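For anyone who hasn't touched the raw Blob REST API, the difference looks roughly like this (a sketch only; it assumes a pre-signed SAS URL in $SAS_URL and elides error handling and most header details):

```sh
# Block blob: upload each chunk as a block, then commit an XML block list.
curl -X PUT --data-binary @chunk1 \
  "$SAS_URL&comp=block&blockid=$(printf 'block-000001' | base64)"
curl -X PUT --data-binary \
  '<?xml version="1.0" encoding="utf-8"?><BlockList><Latest>YmxvY2stMDAwMDAx</Latest></BlockList>' \
  "$SAS_URL&comp=blocklist"

# Append blob: create once, append chunks in order, then seal.
curl -X PUT -H "x-ms-blob-type: AppendBlob" -H "Content-Length: 0" "$SAS_URL"
curl -X PUT --data-binary @chunk1 "$SAS_URL&comp=appendblock"
curl -X PUT --data-binary @chunk2 "$SAS_URL&comp=appendblock"
curl -X PUT -H "x-ms-version: 2019-12-12" "$SAS_URL&comp=seal"  # seal needs API version 2019-12-12+
```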
(If you are wondering why not use the Rust Azure SDK, as far as I can tell the official Rust Azure SDK does not support using signed URLs for uploading. And frankly, it would've brought a lot of dependencies and been rather complex to integrate for other Rust reasons.)
(It would also be possible, by setting env variables a certain way, to get virtually all workflows to behave as if they're running under GitHub Enterprise, and get the old REST API. However, Azure SDK with its concurrency features probably yields better performance.)
Note that GitHub Actions Cache v2 is actually very good in terms of download/upload speed right now, when running from GitHub managed runners. The low speed Blacksmith was seeing before is just due to their slow (Hetzner?) network.
I benchmarked most providers (I maintain RunsOn) with regards to their cache performance here: https://runs-on.com/benchmarks/github-actions-cache-performa...
https://cirrus-runners.app/blog/2024/04/23/speeding-up-cachi...
* Perf: don't use "install X" (Node, .Net, Ruby, Python, etc.) tasks. Create a container image with all your deps and use that instead.
* Perf: related to the last, keep multiple utility container images around, of varying degrees of complexity. For example, in our case I decided on PowerShell because we have some devs on Windows and it's the easiest to get working across Linux+Windows - so my simplest container has pwsh and some really basic tools (git, curl, etc.). I build another container on that which has .Net deps. Then each .Net repo uses that to:
* Perf: don't use the cache action at all. Run a nightly job that pulls your code into a container, restores/installs to warm the cache, then deletes the code. `RUN --mount` is a good way to avoid creating a layer with your code in it.
* Maintainability: don't write big scripts in your workflow file. Create scripts as files that can also be executed on your local machine. Keep the "glue code" between GHA and your script in the workflow file. I slightly lie here: I do source a single utility script that reads GHA env vars and has functions to set CI variables and so forth (and that does sensible things when run locally).
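A rough sketch of what that kind of utility script can look like (the function names are made up; GITHUB_ACTIONS, GITHUB_OUTPUT, and GITHUB_ENV are the variables Actions itself provides):

```sh
# ci-lib.sh -- sourced by workflow steps and by humans running locally.

# True only when running under GitHub Actions.
is_ci() { [[ "${GITHUB_ACTIONS:-false}" == "true" ]]; }

# Set a step output: append to $GITHUB_OUTPUT in CI, just print locally.
set_output() {
  if is_ci; then
    echo "$1=$2" >> "$GITHUB_OUTPUT"
  else
    echo "[local] output $1=$2"
  fi
}

# Make a variable visible to later steps: $GITHUB_ENV in CI, export locally.
set_ci_env() {
  if is_ci; then
    echo "$1=$2" >> "$GITHUB_ENV"
  else
    export "$1=$2"
  fi
}
```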
Our CI builds are stupid fast. Comparatively speaking.
For the OP (I just sent your pricing page to my manager ;) ): having a colocated container registry for these types of things would be super useful. I would say you don't need to expose it to the internet, but sometimes you do need to be able to `podman run` into an image for debug purposes.
[1]: https://docs.github.com/en/actions/how-tos/writing-workflows...
aayushshah15•7h ago
This is on our radar! The primitives mentioned in this blog post are fairly general and allow us to support various types of artifact storage and caching protocols.
AOE9•36m ago
* Your per-minute billing is double Blacksmith's.
* RWX is a proprietary format(?), vs. Blacksmith's one-line change.
* No fallback option: Blacksmith goes down and I can revert back to GitHub temporarily.