In the world of VC-powered growth, the race for a bigger and bigger chunk of the market seems to be the only thing that matters. You don't optimize your software; you throw money at the problem and get more VMs from your cloud provider. You don't work on fault tolerance; you add a retry on the frontend. You don't carefully plan and implement security; you create a bug bounty.
It sucks and I hate it.
I find valgrind on Linux and ktrace(1) on OpenBSD easy to use. I do not spend much time on them, and I find that testing my programs on Linux, OpenBSD and NetBSD tends to find most issues without a lot of work and time.
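To make that concrete, here is a minimal sketch of the kind of defect those tools catch. The file name, the build command and the bug itself are all invented for illustration:

    /* leak.c -- deliberately buggy toy program, the kind of thing valgrind flags.
     * Build:   cc -g -o leak leak.c
     * Linux:   valgrind --leak-check=full ./leak
     * OpenBSD: ktrace ./leak && kdump   (syscall trace, a different view) */
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *buf = malloc(16);
        if (buf == NULL)
            return 1;
        /* 16 characters plus the terminating NUL: a one-byte heap overflow */
        strcpy(buf, "hello, hardening");
        /* buf is never freed, so memcheck also reports a definite leak */
        return 0;
    }

valgrind's memcheck flags both the one-byte overflow and the unfreed allocation; ktrace(1) plus kdump(1) instead give the syscall trace for the same run, which surfaces a different class of issues.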
You can do a decent hardening job without too much effort if you follow some basic guidelines. You just have to be conscientious enough.
I would love to say that this was an exception during almost 20 years of my professional career, but it wasn't. It was certainly the worst, but it was also much closer to the average experience than it should have been.
Source: most of the companies I worked or consulted for in the past 20 years.
Maybe you trigger a load test, or run a soak test or whatever; while that runs you do something else, then pause and check results, metrics, logs, whatever.
If something is funky, you fix something, try again, go back to your other task, and so on.
It's messy, and keeping track of that would add significant cognitive load for little gain.
The tricky part would be measuring the time spent on hardening and making the business decision on how to weigh product features vs. reliability (which I know is a false tradeoff, because of the time spent fixing bugs, but it still applies at a hand-wavy level).
1) Putting NULL pointer checks (that result in early returns of error codes) in every damn function. Adds a sizable amount of complexity for little gain.
2) Wrapping every damn function that can fail with a “try 10 times and hope one works” retry loop. It quickly becomes problematic and unscalable. An instantaneous operation becomes a “wait 5 minutes to get an error” just because the failure isn’t transient (so why retry?).
It also quickly becomes absurd once retries nest (gee, the TCP connect failed, so let's retry the entire HTTP request and connect 10 more times per attempt… gee, the HTTP request failed, so let's redo the larger operation too!).
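For contrast, a minimal sketch of the more restrained version of pattern 2: retry in exactly one layer, only for errors that are plausibly transient, with a small bound and backoff. do_connect() and its 0 / -errno convention are hypothetical stand-ins for whatever operation is being wrapped:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Stub standing in for the real operation: fails transiently twice, then
     * succeeds. Returns 0 on success, -errno on failure (assumed convention). */
    static int do_connect(void) {
        static int calls = 0;
        return (++calls < 3) ? -ECONNRESET : 0;
    }

    /* Assumed classification: only these errno values are worth retrying. */
    static bool is_transient(int err) {
        return err == EAGAIN || err == ETIMEDOUT || err == ECONNRESET;
    }

    /* Retry in exactly one place, only for transient errors, with a bound. */
    static int connect_with_retry(void) {
        int err = 0;
        for (int attempt = 0; attempt < 3; attempt++) {
            err = do_connect();
            if (err == 0)
                return 0;              /* success */
            if (!is_transient(-err))
                return err;            /* permanent failure: fail fast, no pointless waiting */
            sleep(1u << attempt);      /* 1s, 2s, 4s backoff between attempts */
        }
        return err;                    /* still failing after bounded retries */
    }

    int main(void) {
        printf("connect_with_retry() -> %d\n", connect_with_retry());
        return 0;
    }

The key design choice is that the retry lives in one place; a caller that gets an error back treats it as final instead of wrapping this in yet another retry loop.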
esafak•4mo ago
https://www.wiley.com/en-us/How+to+Measure+Anything+in+Cyber...
mathattack•4mo ago
It's bad to say "Let's give it to folks who are underutilized or have capacity" because those are rarely the people who can do it well.
All I can come up with is that the hardening % should be proportional to how catastrophic a failure would be, while keeping some faith that well-done hardening ultimately pays for itself (a rough expected-loss sketch is below).
Philip Crosby wrote about this in manufacturing as "Quality is Free" https://archive.org/details/qualityisfree00cros
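As a toy illustration of that proportionality idea (not anything from Crosby's book), here is a back-of-envelope sketch that weights hardening effort by expected annual loss, i.e. the probability of a failure times its cost; all the numbers are made up:

    /* Back-of-envelope sketch of "hardening in proportion to how catastrophic
     * a failure is", using expected annual loss = probability per year times
     * cost per incident. Every number here is invented for illustration. */
    #include <stdio.h>

    struct failure_mode {
        const char *name;
        double prob_per_year;   /* assumed likelihood of one occurrence per year */
        double cost_per_event;  /* assumed cost of one occurrence, in dollars */
    };

    int main(void) {
        const struct failure_mode modes[] = {
            { "data loss",         0.02, 5000000.0 },
            { "multi-hour outage", 0.20,  200000.0 },
            { "cosmetic UI bug",   2.00,    1000.0 },
        };
        const int n = sizeof modes / sizeof modes[0];

        double total = 0.0;
        for (int i = 0; i < n; i++)
            total += modes[i].prob_per_year * modes[i].cost_per_event;

        /* Give each area a share of hardening effort proportional to its
         * share of the total expected annual loss. */
        for (int i = 0; i < n; i++) {
            double loss = modes[i].prob_per_year * modes[i].cost_per_event;
            printf("%-18s expected loss $%9.0f  hardening share %5.1f%%\n",
                   modes[i].name, loss, 100.0 * loss / total);
        }
        return 0;
    }

Under these invented numbers the rare-but-catastrophic failure still dominates the suggested effort, which is roughly the intuition above.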
gregw2•4mo ago
There should be at least some large-company corporate incentive to measure "Bugs vs features"; the former is OpEx and the latter is CapEx, right?
(I've been at places where Finance and IT aligned to put 3 mandatory radio-button questions in JIRA, which Finance then used to approximate development expenditure as CapEx vs OpEx. As a manager, you were also invited to override the resulting percentages for your team once per period.)