It was by far the most fun, productive, and fulfilling week.
It went on to shape the course of our development strategy when I started my own company: regularly work on tech debt, and actively applaud it when others do it too.
It's not binary.
Obviously things take as long as they take. I've always tried to educate business leadership about this. In my experience, most business people truly have no freaking clue how a product gets built and code gets shipped.
Giving proactive updates (meaning not on the day it was expected to be done according to the last update) is an important part of a professional's working life. There's always a tension between business and engineers. Engineers just generally don't do that well with tension and try to minimize it, or complain about it.
It's just a predictable dance. You say something will take this long, then you find a bug. You point it out to the client, and they get mad at you because your estimate was off. They try to pressure you into fixing the bug for free, whether or not you were even around when it was introduced.
Eventually you just make a judgement call about bugs every time you run into them.
I don't mean to sound negative, I think it's a great idea. I do something like this at home from time to time. Just spend a day repairing and fixing things. Everything that has accumulated.
Places where you can move fast and actually do things are actually far better places to work. I mean the ones where you can show up, do 5 hours of really good work, and then slack off/leave a little early.
This kind of thing takes more than 2 days to fix, unless you're really good.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=217637
Or this one
https://security.stackexchange.com/questions/104845/dhe-rsa-...
I can find more of these that I've run into if I look. I've had tricky bugs in my team's code too, but those don't result in public artifacts, and I'm responsible for all the code that runs on my server, regardless of who wrote it... And I also can't crash client code, regardless of who wrote it, even if my code just follows the RFC.
Or just an hour or two. I can't find it anymore, but I've run into libraries where simple things with months didn't work, because like May only has three letters or July and June both start with Ju. That can turn into a big deal, but often it's easy, once someone notices it.
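To make that failure mode concrete, here's a minimal hypothetical sketch - not from any real library, the naive two-letter matcher is my own invention - of how that class of month bug happens and how small the fix usually is:

    # Hypothetical illustration of the month-abbreviation bug described above;
    # the names and the 2-letter matching are made up, not from any real library.
    MONTHS = ["January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December"]

    def parse_month_buggy(text):
        # Bug: a 2-letter prefix match cannot tell June and July apart,
        # so "July" resolves to whichever comes first in the list (June).
        for i, name in enumerate(MONTHS, start=1):
            if name[:2].lower() == text[:2].lower():
                return i
        raise ValueError(f"unknown month: {text}")

    def parse_month_fixed(text):
        # Fix: require at least 3 letters and exactly one match. "May" is
        # only 3 letters long, so the full name still works.
        matches = [i for i, name in enumerate(MONTHS, start=1)
                   if len(text) >= 3 and name.lower().startswith(text.lower())]
        if len(matches) != 1:
            raise ValueError(f"ambiguous or unknown month: {text}")
        return matches[0]

    print(parse_month_buggy("July"))  # 6 (June) -- wrong
    print(parse_month_fixed("July"))  # 7

Easy to miss until someone notices, and usually a one-line fix once they do.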
Our software isn't serving millions of people though, it's a cli tool with a few hundred end users.
The way I learned the trade, and usually worked, is that bug fixing always comes first!
You don't work on new features until the old ones work as they should.
This worked well for the teams I was on. Having a (AFAYK) bug free code base is incredibly useful!!
I've had some mix of luck and skill in finding these jobs. Working with people you've worked with before helps with knowing what you're in for.
I also don't really ask anyone, I just fix any bugs I find. That may not work in all organizations :)
code reviewing coworker: "This shouldn't be done on this branch!" (OK, at least this is easy to fix by doing it on a separate branch.)
Yes, a ticket takes 2 seconds. It also puts me off my focus :P But I guess measuring is more important than achieving.
One leader kind of listened. Sort of. I'm pretty sure I was lucky.
when i took lead of eng it was quite an easy path to making it clear stability was critical. slow everything down and actually do QA. customer became super happy because basically 3x releases went out with minimal bugs/tweaks required. “users don’t want broken changes immediately, they want working changes every so often” was my spiel etc etc.
unfortunately it was impossible to convince people about that until they screwed it all up. i still struggle to let things “get bad so they can get good”, but am aware of the lesson today at least.
tl;dr sometimes you gotta let people break things so badly that they become open to another way
You put effort into writing an unnecessary tldr on a short post, but couldn't be bothered to properly capitalize your sentences to ensure readability.
Weird.
Same source.
Don't trivialize my useful feedback.
If a person tries to communicate, but his stylistic choice of laziness (his own admission!) gets in the way of delivering his message, pointing that out is tangibly useful information, so that the writing effort can be better optimized for effect.
I wasn't even demanding/telling him what to do. I simply shared my observation, but it's up to him to decide if he wants to communicate better. Information and understanding is power.
https://dictionary.cambridge.org/dictionary/english/ostensib...
ostensible laziness => not actually laziness.
although yes it is a stylistic choice (which i wont be changing as the result of our interaction).
i changed my iphone settings to not auto-capitalise words
i put effort into my ostensible laziness
cat /dev/null .
The type that claims they're going to achieve zero known and unknown bugs is also going to be the type to get mad at people for finding bugs.
This is usually EMs in my experience.
At my last job, I remember reading a codebase that was recently written by another developer to implement something in another project, and found a thread safety issue. When I brought this up and how we’ll push this fix as part of the next release, he went on a little tirade about how proper processes weren’t being followed, etc. although it was a mistake anyone could have made.
There are also always bugs detected after shipping (usually in beta), which need to be accounted for.
Assuming it works as intended.
I've seen that very argument several times; it was even in the requirements on one occasion. In each instance it was incorrect: there were times when a second page was reached.
> Better to have a policy of always fixing bugs but be more flexible on what counts as a bug
I just disagree with this. It's entirely possible for something to not work correctly, but that fact be unimportant at the moment (or less important than something else).
In fact, that's exactly the mindset that "bugs first" is designed to prevent. If you have a mindset where a bug has to be more important than a feature in order to get prioritized, then you will breed a culture in which bugs are rarely prioritized, if ever. (Especially if fixing them would be time-consuming.)
This is for the simple reason that, in isolation, any individual feature can almost always be argued to be more important than any individual bug which could've been worked on instead. Yet, in the aggregate, once you've dumped 50 individual low-priority bugs into the backlog, they all add up to a horrendous experience for the user.
It's sort of like running a restaurant. Cooking food is how we make money, but you still have to clean the floors. If you keep putting it off to get the food out faster, eventually you're going to be knee-deep in shit.
Treating bugs as different than features and automatically pushing them to the front of the line likely leads to a non-parsimonious expenditure of effort and sets up some nasty fights with other parts of the company which will definitely figure out that something being a "bug" gets it prioritized. Obviously this can be done poorly, and why even have engineers if you aren't listening to their prioritization as well.
Most software is not formally specified, so it's not technically guaranteed that we can prove whether that expectation is correct or not. But, there is usually a collective understanding, reinforced by the software's own interface (e.g. "the button says Do X but I click it and X doesn't happen"), the documentation, and/or general technological norms (e.g. "it crashed" or "when I type text sometimes it disappears and I have to start over").
There are occasional ambiguous cases, but in practice these are uncommon in a well-run organization, and generally the job of a product manager is to have the final say on such matters via consultation with relevant stakeholders, contracts, etc.
I’ve seen very close to bug free backends (more early on in development). But every frontend code base ever just always seems to have a long list of low impact bugs. Weird devices, a11y things, unanticipated screen widths, weird iOS safari quirks and so on.
Also I feel like if this was official policy, many managers would then just start classifying whatever they wanted done as a bug (and the line can be somewhat blurry anyway). So curious if that was an issue that needed dealing with.
I do agree that it's rare, this is my first workplace where they actually work like that.
They weren't big enough to have "official policies". We talked to each other instead.
I did work a few years at big companies twice. That taught me to appreciate the simple life :)
And there's other bugs that don't really have any measurable impact, or only affect a small percentage of people, etc.
Also love the humble brag. "I've just closed my 12th bug" and later "12 was maximum number of bugs closed by one person"
> 1) no bug should take over 2 days
Is odd. It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.
That said, unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one.
Also, I tend to attack bugs by priority/severity, as opposed to difficulty.
Some of the most serious bugs are often quite easy to find.
Once I find the cause of a bug, the fix is usually just around the corner.
Good result == LLM + Experience.
The LLM just reduces the overhead.
That’s really what every “new paradigm” has ever done.
ChrisMarshallNY only said they fed the dump into the LLM. They said nothing about using the LLM to write the fix.
Is this something an LLM could help with? What exactly do you mean when you say you feed a dump to the prompt?
> I am getting occasional crashes on my iOS 17 or above UIKit program. Given the following stack trace, what problem do you think it might be?
I will attach the source file, if I think I know the general area, along with any symptoms and steps to reproduce. One of the nice things about an LLM, is that it's difficult to overwhelm with too much information (unlike people).
It will usually respond with a fairly detailed analysis. Usually, it has some good ideas to use as starting points.
I don't think "I have a bug. Please fix it." would work, though. It's likely to try, but caveat emptor.
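As a hedged sketch of what that could look like if scripted instead of pasted into a chat window (the OpenAI Python client is just one option; the model name, file paths, symptoms, and prompt wording below are placeholders I made up, not the parent's actual workflow):

    # Illustrative only: a scripted version of the "feed the dump to the prompt"
    # workflow described above. All paths, symptoms, and the model name are
    # placeholders.
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    stack_trace = open("crash_logs/stack_trace.txt").read()
    source_file = open("Sources/SuspectedViewController.swift").read()

    prompt = (
        "I am getting occasional crashes on my iOS 17 or above UIKit program. "
        "Given the following stack trace, what problem do you think it might be?\n\n"
        f"Stack trace:\n{stack_trace}\n\n"
        "Symptoms: crash when returning from background (placeholder).\n"
        "Steps to reproduce: background the app, wait, foreground it (placeholder).\n\n"
        f"Source file I suspect is involved:\n{source_file}"
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Treat the output as analysis and starting points, not as a finished fix.
    print(response.choices[0].message.content)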
I could see Apple or Microsoft, building it into their IDEs.
But, as was noted elsewhere, I think it’s only useful as an advisor. I think a lot of folks look at LLMs as some kind of programmer replacement.
Some of the code I get from Claude and ChatGPT is ... not so good.
i used to think options like -vvv or -loglevel panic were just someone being funny, but they do work when necessary. -loglevel sane, -loglevel unsane, -loglevel insane would be my take but am aware that most people would roll their eyes so we're lame using ERROR, WARNING, INFO, VERBOSE
But I'm talking about adding and removing logs per dev task. There's really no need to have sophisticated log levels and maintaining them as the app evolves and grows, because the LLM can "instantly" add and remove the logging it needs per granular task. This is much faster for me than maintaining logs and carefully selecting log levels and managing how logs can be filtered. That only made sense to me when it took actual dev effort to add or remove these logs.
That's the beauty of it - it's able to add and remove huge amounts of logging per task, so I never need to manage the scale and complexity of logging that outlasts the task it was purposefully added for. With typical development, adding logging takes time so we keep it around and maintain it.
What I want to say is that I've seen what happens in a team with a history of quick fixes and inadequate architecture design to support the complex features. In that case, a proper bugfix could require significant rework and QA.
But you hit on a point that seems to come up a lot. When a user story takes longer than the allotted points, I encourage my junior engineers to split it into two bugs. Exactly like what you say... One bug (or issue or story) describing what you did to typify the problem and another with a suggestion for what to do to fix it.
There doesn't seem to be a lot of industry best practice about how to manage this, so we just do whatever seems best to communicate to other teams (and to ourselves later in time after we've forgotten about the bug) what happened and why.
Bug fix times are probably a pareto distribution. The overwhelming majority will be identifiable within a fixed time box, but not all. So in addition to saying "no bug should take more than 2 days" I would add "if the bug takes more than 2 days, you really need to tell someone, something's going on." And one of the things I work VERY HARD to create is a sense of psychological safety so devs know they're not going to lose their bonus if they randomly picked a bug that was much more wicked than anyone thought.
Wish there were more like you, out there.
Knowing all of those aspects and where an issue lands makes it possible to prioritise it properly, but it also gives the developer the opportunity to hone their investigation and debugging skills without the pressure to solve it at the same time. A good write-up is great for knowledge sharing.
The joys of enterprise software. Searching for the cause of a bug lets you discover multiple "forgotten" servers, ETL jobs, and crons all interacting together. And no one knows why they do what they do, how they do it, because the people behind them left many years ago.
and you just look at this and think: one day, all of this is going to crash and it will never, ever boot again.
And then comes the "beginner's" mistake. They don't seem to be doing anything. Let's remove them, what could possibly go wrong?
Move fast and break things is also a managerial/cultural problem in certain contexts.
You can only say that with a straight face if you're not the one responsible for cleaning up after Musk or whatever CTO sharted across the chess board.
C-levels love the "shut it down and wait until someone cries up" method because it gives easy results on some arbitrary KPI metric without exposing them to the actual fallout. In the worst case the loss is catastrophic, requiring weeks worth of ad-hoc emergency mode cleanup across multiple teams - say, some thing in finance depends on that server doing a report at the end of the year and the C-level exec's decision was made in January... but by that time, if you're in real bad luck, the physical hardware got sold off and the backup retention has expired. But when someone tries to blame the C-level exec, said C-level exec will defend themselves with "we gave X months of advance warning AND 10 months after the fact no one had complained".
Also with enterprise software a simple bug can do massive damage to clients and endanger large contracts. That's often a good reason to follow the Chesterton's fence rule.
It's not in the C-level's job description to manage the daily operations of the company, they have business managers to do that. If there's an expensive asset in the company that's not (actively) owned by any business manager, that's a liability -- and it is in the C-level's job description to manage liabilities.
said C-level exec will defend themselves with "we gave X months of advance warning AND 10 months after the fact no one had complained"
And that's a perfectly valid defense, they're acting true to their role. The failure lies with the business/operations manager not being in control of their process tooling.
Of course you can get lost on the way but worst case is you learn the architecture.
Cost efficient for your team’s budget, sure, but a 1% chance of a 10+ million dollar issue is worth significant effort. That’s the thing with enterprise systems: the scale of minor blips can justify quite a bit. If 1 person operating for 3 months could figure out what something is doing, there are scales where that’s a perfectly reasonable thing to do.
Enterprise covers a whole range of situations; there are a lot more billion dollar orgs than trillion dollar orgs, so your mileage may vary.
In a reasonable organization only very minor systems can be undocumented enough to fall through the cracks.
Stuff that’s been working fine for years is easy for a team to forget about, especially when it’s a hidden dependency in some script that’s going to make some process quietly fail.
> Stuff that’s been working fine for years is easy for a team to forget about
That's why serious companies have a documentation system describing their processes, tools and dependencies.
Documentation on every possible system that could use the resource would need to be accurate and complete; someone would have to locate it, actually read it, remember it, and communicate it in a relevant meeting, which may be taking place multiple levels of management above the reader. And when a new manager shows up there are endless seemingly minor details, so even if they did encounter that information at some point, there’s nothing that particularly calls it out as worth remembering at the time.
That’s a lot of individual points of failure, which is why I’m saying that in the real world even well-run companies mess this stuff up.
1) I was (temporarily) the only one still at the company who knew why it was there
2) I only knew myself because I had reverse engineered it, because the person who put it there had left the company
Now, some of those things had indeed become unnecessary over time (and thus were removed). Some of them, however, have been important (and thus were documented). In aggregate, it's been well worth the effort to do that reverse engineering to classify things properly.
____
[0] https://open.substack.com/pub/lunduke/p/the-scream-test
(Obviously not the first description of the technique as you’ll read, but I like it as a clear example of how it works)
This is explained later in the post. The 2 day hard limit is applied not to the estimate but rather to the actual work: "If something is ballooning, cut your losses. File a proper bug, move it to the backlog, pick something else."
Once I find a bug, the fix is often negligible.
But I can get into a rabbithole, tracking down the root cause. I don’t know if I’ve ever spent more than a day, trying to pin down a bug, but I have walked away from rabbitholes, a couple of times. I hate doing that. Leaves an unscratchable itch.
Now I find that odd.
The timings we had in place worked for most chips, but they failed for a small % of chips in the field. The failure was always exactly identical, the same memory address got corrupted, so it looked exactly like an invalid pointer access.
It took multiple engineers months of investigating to finally track down the root cause.
Best case you trap on memory access to an address if your debugger supports it (ours didn't). Worst case you go through every pointer that is known to access nearby memory and go over the code very very carefully.
Of course it doesn't have to be a nearby pointer, it can be any pointer anywhere in the code base causing the problem, you just hope it is a nearby pointer because the alternative is a needle in a haystack.
I forget how we did find the root cause, I think someone may have just guessed bit flip in a pointer (vs overrun) and then un-bit-flipped every one of the possible bits one by one (not that many, only a few MB of memory so not many active bits for pointers...) and seen what was nearby (figuring what the originally intended address of the pointer was) and started investigating what pointer it was originally supposed to be.
Then after confirming it was a bit flip you have to figure out why the hell a subset of your devices are reliably seeing the exact same bit flipped, once every few days.
So to answer your question, you get a bug (memory is being corrupted), you do an initial investigation, and then provide an estimate. That estimate can very well be "no way to tell".
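For illustration, a rough sketch of that un-bit-flipping search, assuming nothing but a static allocation map and the corrupted address. Every name, address, and size below is invented:

    # Illustrative sketch of the un-bit-flipping search described above.
    # Assumes a firmware image with only static allocations, a symbol map of
    # (start_address, size, name) entries, and the faulting address from the
    # corruption reports. All names and addresses are made up.

    SYMBOL_MAP = [
        (0x2000_0000, 0x400, "rx_buffer"),
        (0x2000_0400, 0x100, "sensor_state"),
        (0x2000_0500, 0x040, "timer_config"),
        # ... rest of the static allocation map
    ]

    RAM_START, RAM_END = 0x2000_0000, 0x2010_0000  # a few MB of SRAM (assumed)

    def symbol_for(addr):
        # Return the static allocation containing addr, if any.
        for start, size, name in SYMBOL_MAP:
            if start <= addr < start + size:
                return name
        return None

    def candidate_original_pointers(corrupted_addr, pointer_bits=32):
        # Flip each bit of the corrupted address back and report which known
        # allocation the "repaired" address would point into.
        for bit in range(pointer_bits):
            candidate = corrupted_addr ^ (1 << bit)
            if not (RAM_START <= candidate < RAM_END):
                continue  # not a plausible RAM pointer, skip
            name = symbol_for(candidate)
            if name:
                yield bit, candidate, name

    # Usage: list the plausible "intended" targets, then go read the code that
    # owns those allocations to see which pointer could have been flipped.
    for bit, addr, name in candidate_original_pointers(0x2000_0404):
        print(f"bit {bit}: 0x{addr:08x} -> {name}")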
The principal engineer on this particular project (Microsoft Band) had a strict 0 user-impacting bugs rule. Accordingly, after one of my guys spent a couple weeks investigating, the principal engineer assigned one of the top firmware engineers in the world to track down this one bug and fix it. It took over a month.
It wouldn't have caught your issue in this case. But it would have eliminated a huge part of the search space your embedded engineers had to explore while hunting down the bug.
Since it was embedded, no malloc. Everything being static allocations made the search possible in the first place.
This wasn't the only HW bug we found, ugh.
Static analysis can cover all your code, though generally with a significant rate of false positives that you will need to analyse.
You seem to be in the first club, and the other poster in the second.
Working on drivers, a relatively recent example is when we started looking at a "small" image corruption issue in some really specific cases, which slowly spidered out to what was fundamentally a hardware bug affecting an entire class of possible situations; it was just that this one case happened to be noticed first.
There was even talk about a hardware ECO at points during this, though an acceptable workaround was eventually found.
I could never have predicted that when I started working on it, and it seemed every time we thought we'd got a decent idea about what was happening even more was revealed.
And then there's been many other issues when you fall onto the cause pretty much instantly and a trivial fix can be completed and in testing faster than updating the bugtracker with an estimate.
True there's probably a decent amount, maybe even 50%, where you can probably have a decent guess after putting in some length of time and be correct within a factor of 2 or so, but I always felt the "long tail" was large enough to make that pretty damn inaccurate.
I used to work for a Japanese company. When we'd have review meetings, each manager would have a small notebook on the table, in front of them.
Whenever a date was mentioned, they'd quickly write something down.
Those dates were never forgotten.
Anytime someone says that you absolutely know they will treat whatever you say as being a commitment written in blood!
Or not. A bug description can also be a ticket from a fellow engineer who knows the problem space deeply and has an initial understanding of the bug, its likely cause, and possible problems. As always, it depends, and IME the kind of bugs that end up in those "bugathons" are the annoying "yeah I know about it, we need to fix it at some point because it's a PITA" ones.
I can understand the "I don't do estimates" mantra for bigger projects, but ballpark estimations for bugs - even if you can be wrong in the end - should not be labelled as 100% impossible all the time.
I understand the urge to quantify something that is impossible to quantify beforehand. There is nothing wrong with making a guess, but people who don't understand my argument usually also don't understand the meaning of "guess". A guess is something based on my current understanding, and as that may change substantially, my guess may also change substantially.
I can make a guess right now on any bug I will ever encounter, based on my past experience: It will not take me more than a day to fix it. Happy?
And sometimes, the exact opposite happens.
It took longer than 2 days to fix.
They were damn cool. I seriously doubt that something like that, exists outside of a TSMC or Intel lab, these days.
(apart from the ones in the firmware, and the hardware glitches...)
In my experience there are two types of low-priority bugs (high-priority bugs just have to be fixed immediately no matter how easy or hard they are).
1. The kind where I facepalm and go “yup, I know exactly what that is”, though sometimes it’s too low of a priority to do it right now, and it ends up sitting on the backlog forever. This is the kind of bug the author wants to sweep for, they can often be wiped out in big batches by temporarily making bug-hunting the priority every once in a while.
2. The kind where I go “Hmm, that’s weird, that really shouldn’t happen.” These can be easy and turn into a facepalm after an hour of searching, or they can turn out to be brain-broiling heisenbugs that eat up tons of time, and it’s difficult to figure out which. If you wipe out a ton of category 1 bugs then trying to sift through this category for easy wins can be a good use of time.
And yeah, sometimes a category 1 bug turns out to be category 2, but that’s pretty unusual. This is definitely an area where the perfect is the enemy of the good, and I find this mental model to be pretty good.
The fact that something is high priority doesn't make it less work.
I often find the nastiest bugs are the quickest fixes.
I have a "zero-crash" policy. Crashes are never acceptable.
It's easy to enforce, because crashes are usually easy to find and fix.
$> ThreadingProblems has entered the chat
Race conditions in 3rd party services during / affected by very long builds and with poor metrics and almost no documentation. They only show up sometimes, and you have to wait for it to reoccur. Add to this a domain you’re not familiar with, and your ability to debug needs to be established first.
Stack two or three of these on top of each other and you have days of figuring out what’s going on, mostly waiting for builds, speculating how to improve debug output.
After resolving, don’t write any integration tests that might catch regressions, because you already spent enough time fixing it, and this needs to get replaced soon anyway (timeline: unknown).
1. The cause isn't immediately obvious. In this case, finding the problem is usually 90% of the work. Here it can't be known beforehand how long finding the problem will take, though I don't think bailing because it's taking too long is a good idea. If anything, it's in those really deep rabbit holes that the real gremlins hide.
2. The cause is immediately obvious, but is an architecture mistake, the fix is a shit-ton of work, breaks workflows, requires involving stakeholders, etc. Even in this case it can be hard to say how long it will take, especially if other people are involved and have to sign off on decisions.
I suppose it can also happen in low-trust sweatshops where developers held on such a tight leash they aren't able to fix trivial bugs they find without first going through a bunch of jira rigmarole, which is sort of low key the vibe I got from the post.
The initial description? "Touchscreen sometimes misses button presses".
I love hearing stories like this.
Other favourites include "Microsoft Structured Exception Handling sometimes doesn't catch segfaults", and "any two of these network devices work together but all three combined freak out".
I have encountered areas where the basic design was wrong (often comes from rushing in, before taking the time to think things through, all the way).
In these cases, we can either kludge a patch, or go back and make sure the design is fixed.
The longer I've been working, the less often I need to go back and fix a busted design.
The longer I work as a software engineer, the rarer it is that I get to work with bugs that take only a day to fix.
Nowadays, after some 17 years in the business, it's pretty much always intermittently and rarely occurring race conditions of different flavors. They might result in different behaviors (crashes, missing or wrong data, ...), but at the core of it, it's almost always race conditions.
The easy and quick to fix bugs never end up with me.
I tend to mostly work alone, these days (Chief Cook & Bottle-Washer).
All bugs are mine.
“Happens only once every 100k runs? Won’t fix”. That works until it doesn’t, then they come looking for the poor bastard that never fixes a bug in 2 days.
It was all about fixing bugs; often, terrifying ones.
That background came in handy, once I got into software.
Won’t fix doesn’t get accepted so well. Trying to work out what the hell happened from the charred remains isn’t so easy either.
I tend to work alone, so my scope is limited.
Some of the stuff I work on is quite involved, anyway.
I’ve been at this game awhile (coding for over 40 years), so I have learned a few tricks.
Of course, I “cheat.” I’ve learned to write software that doesn’t tend to have that many bugs, and I also don’t have to deal with other people’s code, so much. I write code for myself, which means that I don’t get to practice my debugging, so much, these days.
You can see for yourself. Much of my work is open-source, or source-available: https://github.com/ChrisMarshallNY
If you invest 2 days of work and still haven't found the root cause of a bug, you have the human desire to keep investing more work, because you already invested so much. At that point, however, it's best to re-evaluate and do something different instead, because it might have a bigger impact.
The likelihood that, after 2 days of not finding the problem, you won't find it after another 2 days is higher than if you start over with another bug, where on average you'll find the problem earlier.
Of course, if it's a difficult bug and you can just say 'fuck it' and bury it in the backlog forever that's fine, but in my experience the very complex ones don't get discovered or worked on at all unless it's absolutely critical or a customer complains.
You could tell them that 25% chance it's going to take 2 hours or less, 50% chance it's going to take 4 hours or less, 75% chance it's going to take 8 hours or less, 99% it's going to take 16 hours or less, to be accurate, but communication wise you'll win out if you just call items like those 10 hours or similar intuitively. Intuitively you feel that 10 hours seems safe with those probabilities (which are intuitive experience based too). So you probably would say 10 hours, unless something really unexpected (the 1%) happens.
Btw in reality, with the above probabilities, the actual average would be 5h - 6h with 1% of tasks potentially failing, but even your intuitive probability estimations could be off, so you likely want to say 10h.
But anyhow that's why story points are mostly used as well, because if you say hours they will naturally think it's more fixed estimation. Hours would be fine if everyone understood naturally that it implies a certain statistical average of time + reasonable buffer it would take over a large amount of similar tasks.
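As a quick back-of-envelope check of those numbers (the banding into uniform intervals and the midpoint interpolation are my assumptions; the percentiles are the ones quoted above):

    # Treat each percentile band as uniform and take its midpoint.
    bands = [
        (0.25, 0, 2),   # 25% of fixes finish within 2 hours
        (0.25, 2, 4),   # next 25% between 2 and 4 hours
        (0.25, 4, 8),   # next 25% between 4 and 8 hours
        (0.24, 8, 16),  # next 24% between 8 and 16 hours
        # remaining 1%: the unbounded tail, excluded from the average
    ]
    expected_hours = sum(p * (lo + hi) / 2 for p, lo, hi in bands)
    print(f"expected time, if it finishes at all: ~{expected_hours:.1f}h")  # ~5.4h
    # ...which is why quoting "10 hours" pads toward the 99th percentile
    # rather than the statistical average.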
Virtually everywhere I've ever worked has had an unwritten but widely understood informal policy of placing a multiple on predicted effort for both new code/features and bug fixing to account for Hofstadter's law.
There will always be those “only happens on the 3rd Tuesday every 6 months” issues that are more complicated but…if you can get all the small stuff out of the way it’s much easier to dedicate some time to the more complicated ones.
Maximizing the value of time is the real key to focusing on quicker fixes. If nobody can make a case why one is more important than other, then the best use of your time is the fastest fix.
At least it crashed at startup, if it was random it would have been hell.
oh sweet sweet summer child...
Learning how to better estimate how long tasks take is one of my biggest goals. And one I've yet to even figure out how to master
You mean starting after it has been properly tracked down? It can often take a whole lot of time to go from "this behavior is incorrect sometimes" to "and here's what needs to change".
I have found that really deep bugs are the result of bad design, on my part, and applying "band-aid" fixes often just kicks the can down the road, for a reckoning (that is now just a bit worse), later.
If it is not super-serious (small performance issues, for instance; which can involve moving a lot of cheese), I can often schedule a design review for a time when it's less critical, and maybe set up an exploration branch.
People keep bringing up threading and race conditions, which are legitimately nasty bugs.
In my experience, they are often the result of bad design, on my part. It's been my experience that "thread everything" can be a recipe for disaster. The OS/SDK will often do internal threading, and I can actually make things worse, by running my own threads.
I try to design stuff that will work fine, in any thread, which gives me the option to sequester it into a new thread, at a later time (I just did exactly that, a few days ago, in a Watch app), but don't immediately do that.
I don't get this. Either you give up on the bug after a day, or you throw out the entire codebase and start over?
Sure, if the bug is low severity and I don't have a reproduction, I will ignore it. But there are bad bugs that are not understood and can take a lot more than a day to look into, such as by adding telemetry to help track it down.
Yes, it is usually the case that tracking it down is harder than fixing. But there are also cases where the larger system makes some broad assumptions which are not true, and fixing is tricky. It is not usually an option to throw out the entire system and start over each time this happens in a project.
Nah. That’s called “catastrophic thinking.” This is why it’s important (in my experience) to back off, and calm down.
I’ll usually find a way to manage a smaller part of the codebase.
If I make decisions when I’m stressed, Bad Things Happen.
Really old software can be referred to as "Mature," as opposed to "Decrepit." It can be extremely well-documented, and well-understood. Many times, there are tools that grow up, alongside the main code.
I wrote stuff that was still in use, 25 years later, because the folks that took it over, did a really good job of maintaining it.
This is one part that is rarely properly implemented. We have our bug bash days too, but I noticed after the fact that maybe 1/3 of the bugs we solved were on a feature we are thinking of deprecating soon due to low usage.
How can we attack bugs better by priority?
Usually it's implicit, rather than explicit: Nobody tells you to limit work on bugs to 1-2 days, but if you spend an entire week debugging something difficult and don't accumulate any story points in Jira, a cadre of project manager, program managers, and other manager titles you didn't even know existed will descend upon you and ask why you're dragging the velocity down.
Lesson learned: Next time, avoid the hard bugs and give up early if something isn't going to turn into story points for hidden charts that are viewed by more people than you ever thought.
At some point one can't help but wonder: if almost everyone is "misusing" it, then maybe it's a problem with the methodology itself, and the people for whom it works would have worked just as well organically without it?
I understood it as the whole point of the 2 day hard limit - you start working on a bug that turns out to be bigger than expected, so you write down your findings and move on to the next one.
> In one of our early fixits, someone picked up what looked like a straightforward bug. It should have been a few hours, maybe half a day. But it turned into a rabbit hole. Dependencies on other systems, unexpected edge cases, code that hadn’t been touched in years.
> They spent the entire fixit week on it. And then the entire week after fixit trying to finish it. What started as a bug fix turned into a mini project. The work was valuable! But they missed the whole point of a fixit. No closing bugs throughout the week. No momentum. No dopamine hits from shipping fixes. Just one long slog.
> That’s why we have the 2-day hard limit now. If something is ballooning, cut your losses. File a proper bug, move it to the backlog, pick something else. The limit isn’t about the work being worthless - it’s about keeping fixit feeling like fixit.
Some things turn out to be surprisingly complex, but you can very often know that the simple thing is simple.
I think what they mean is that after 2 days of working on a bug you stop, regardless of the result, leaving a paper trail behind for the next person.
Code slows you down, always worth cleaning up. Yes, the business case is aligned with both the past bloat, and the current cleanup.
My preferred approach is to explicitly plan in 'keep the lights on' capacity into the quarter/sprint/etc in much the same way that oncall/incident handling is budgeted for. With the right guidelines, it gives the air cover for an engineer to justify spending the time to fix it right away and builds a culture of constantly making small tweaks.
That said, I totally resonate with the culture aspect - I think I'd just expand the scope of the week-long event to include enhancements and POCs like a quasi hackathon
What good and bad experiences have people had with software development metrics leaderboards?
However, I love the idea of an occasional team based leaderboard for an event. I've held bug and security hackathons with teams of 3-5 and have had no problem with them.
I do appreciate though that certain people, often very good detail oriented engineers, find large backlogs incredibly frustrating so I support fix-it weeks even if there isn't clear business ROI.
???
Basically any major software product accumulates a few issues over time. There's always a "we can fix that later" mindset and it all piles up. MacOS and Windows are both buggy messes. I think I speak for the vast majority of people when I say that I'd prefer they have a fix-it year and just get rid of all the issues instead of trying to rush new features out the door.
Maybe rushing out features is good for more money now, but someday there'll be a straw that breaks the camel's back and they'll need to devote a lot of time to fix things or their products will be so bad that people will move to other options.
>For iOS 27 and next year’s other major operating system updates — including macOS 27 — the company is focused on improving the software’s quality and underlying performance.
-via Bloomberg today
Overall, I think this kind of thing is very positive for the health of building software, and morale to show that it is a priority to actually address these things.
I don't mean to be too harsh on the author. They mean well. But I am saddened by the wider context, where a dev posts 'we fix bugs occasionally' and everyone is thrilled, because the idea of ensuring software continues to work well over time is now as alien to software dev as the idea of fair dealing is to used car salesmen.
We as an industry have taught people that broken products are acceptable.
In any other industry, unless people know from the start that they are getting something broken or low quality (flea market, 1 euro shop, or similar), they will return the product, ask for their money back, sue the company, whatever.
Example: (aftermarket) car headunit.
The real solution is to have individual software developers be licensed and personally liable for the damage their work does. Write horrible bugs? A licencing board will review your work. Make a calculated risk that damages someone? Company sued by the user, developer sued by the company. This correctly balances incentives between software quality and productivity, and has the added benefit of culling low quality workers.
Software isn't uniquely high stakes relative to other industries. Sure, if there's a data breach your data can't be un-leaked, but you can't be un-killed when a building collapses over your head or your car fails on the highway. The comparison with other industries works just fine - if we have high stakes, we should be shipping working products.
This is not the vibe I got from the post at all. I am sure they fix plenty of bugs throughout the rest of the year, but this will be balanced with other work on new features and the like and is going to be guided by wider businesses priorities. It seems the point in the exercise is focusing solely on bugs to the exclusion of everything else, and a lot of latitude to just pick whatever has been annoying you personally.
The name is just an indication that you can do it any day, but the idea is that on Friday, when you're in no position to start a big thing, you pick some small one you want to fix personally. Maybe a bug in the product, maybe the local dev setup.
Doing what you want to do instead of what you should be doing (hint: you should be busy making money).
Inability to triage and live with imperfections.
Not prioritizing business and democratizing decision making.
Also explains the casual mention of "estimation" on fixes. A real bug fix is even harder to estimate than already-brittle feature estimates.
Fixit weeks are a band-aid, and we also tried them. The real fix is being a good boss and trusting your coworkers to do their jobs.
(I run a small SaaS product - a micro-SaaS as some call it.)
We’ll stop work on a new feature to fix a newly reported bug, even if it is a minor problem affecting just one person.
Once you have been following a “fix bugs first” approach for a while, the newly discovered bugs tend to be few, and straightforward to reproduce and fix.
This is not necessarily the best approach from a business perspective.
But from the perspective of being proud of what we do, of making high quality software, and treating our customers well, it is a great approach.
Oh, and customers love it when the bug they reported is fixed within hours or days.
A lot of the internal behaviors ARE bugs that have been worked around, and become part of the arbitrary piles of logic that somehow serve customer needs. My own understanding of bugs in general has definitely changed.
Strangely the math looks such that they could hire nearly 1 FTE engineer that works full time only on "little issues" (40 weeks, given that people have vacations and public holidays and sick time that's a full year's work at 100%), and then the small issues could be addressed immediately, modulo the good vibes created by dedicating the whole group to one cause for one week. Of course nobody would approve that role...
I wonder if the janitor role could be rotated weekly or so? Then everyone could also reap the benefits of this role too, I can imagine this being a good thing for anyone in terms of motivation. Fixing stuff triggers a different positive response than building stuff
Unfortunately, that on-call was so overwhelmed that you were lucky to be able to handle all the alerts/crises, let alone having spare time to fix the root causes.
> The benefits of fixits
> For the product: craftsmanship and care
sorry, but this is not care when the priority system is so broken that it requires a full suspension, but only once a quarter
> A hallmark of any good product is attention to detail:
That's precisely the issue, taking 4 years to bring attention to detail, and only outside the main priority system.
Now, don't get me wrong, a fixit is better than nothing and better than having 4 year bugs turn into 40 year ones; it's just that this is not a testament to craftsmanship/care/attention to detail
I'm not sure I understand this line. The whole point of the fixit is to address the bugs which are considered "low priority" because they only appear in an edge case or are not quite 100% perfectly polished, but which still matter over the long tail of people using the product.
Or do you propose that every issue like this needs to be fixed before doing anything else?
1. Working on Feature A, stopped by management or by the customer because we need Feature B as soon as possible.
2. Working on Feature B, stopped because there is Emergency C in production due to something that you warned the customer about months ago but there was no time to stop, analyze and fix.
3. Deployed a workaround and created issue D to fix it properly.
4. Postponed issue D because the workaround is deemed to be enough, resumed Feature B.
5. Stopped Feature B again because either Emergency E or new higher priority Feature F. At this point you can't remember what that original Feature A was about and you get a feeling that you're about to forget Feature B too.
6. Working on whatever the new thing is, you are interrupted by Emergency G that happened because that workaround at step 3 was only a workaround, as you correctly assessed, but again, no time to implement the proper fix D so you hack a new workaround.
Maybe add another couple of iterations, but by this time every party is angry, or at least unhappy with every other party.
You have a feeling that the work of the last two or three months on every single feature has been wasted because you could not deliver any one of them. That means that the customer wasted the money they paid you. Their problem, but it can't be good for their business so your problem too.
The current state of the production system is "buggy and full of workarounds" and it's going to get worse. So you think that the customer would have been wiser to pause and fix all the nastier bugs before starting Feature A. We could have had a system running smoothly, no emergencies, and everybody happier. But no, so one starts thinking that maybe the best course of action is changing company or customer.
Yes, the issue is not you, it's a toxic workplace. Leave as soon as you can.
These places cannot and will not change. If you can, find employment elsewhere.
When you finally complete Feature B, the analysts look at it again, and realize that it actually wasn't necessary, and you should revert it.
It had the same spirit as a hackathon.
[1] https://westwing.fandom.com/wiki/Big_Block_of_Cheese_Day
At Meta we did "fix-it weeks", more or less every quarter. At the beginning I was thrilled: leadership that actually cares about fixing bugs!
Then reality hit: it's the worst possible decision for code and software quality. Basically this turned into: you are allowed to land all the possible crap you want, and then you have one week to "fix all the bugs". Guess what: most of the time we couldn't even fix a single bug because we were drowning in tech debt.
> That’s not to say we don’t fix important bugs during regular work; we absolutely do. But fixits recognize that there should be a place for handling the “this is slightly annoying but never quite urgent enough” class of problems.
So in their case, fixit week is mostly about smaller bugs, quality of life improvements and developer experience.
This is when the report comes in that your login form update from six months ago does not work on mobile Opera if you disable JavaScript. The fix isn’t obvious and will require research, potentially many hours or even days of testing and since it is a login form you will need the QA team to test it after you find another developer on your team to do a code review for you.
What exactly would you do in this case? Pull resources from a major project that has the full attention of the C suite to accommodate some tin foil Luddite a few weeks sooner or classify this as lower priority?
I'd document that mobile Opera with Javascript disabled is an unsupported config, and ask a team to make a help center doc asking mobile Opera users to enable JS.
Being able to think of simple, practical solutions like this is one of the hardest skills to develop as a team, IMO. Not everything needs to be perfect and not everything needs a product-level fix. Sometimes a "here's the workaround" is good enough and if enough people complain or your metrics show use friction in some journey, then prioritize the fix.
GP's example is so niche that it isn't worth fixing without evidence that the impact is significant.
- This bug genuinely sounds like low priority.
- This organization seems to operate assuming unforeseen problems will never pop up. That is unwise.
Instead you need to have a triage process and a planning process, which to some degree most software teams do. The problem is that most of these processes do not have a rigid way of dealing with really old low priority bugs. A bug fix week is one option for addressing that need.
If you only have a reasonable number of bugs, and fix them as you find them, it's just how you do work.
It may sound impossible, but I did work like this for two decades, and it worked well for those teams.
In most situations you have users who also find bugs and report them when they want, not when you are ready for them.
You can even see that your argument does not apply generally by the fact that bugs exist in software for years. If your way was both more efficient AND more aligned with human nature then everyone would be working like this but clearly almost nobody can afford to drop everything to fix a user’s random low priority bug the minute it is reported.
You have deadlines, velocity is a goal rather than a measurement, and probably several other (IMHO) process mistakes. In such systems, doing what is best for the organization can often be bad for your personal career. Still, that's probably the norm in much of the industry.
My view is that having bugs is costly. They cause problems in development, and alienates users. A bug free code base is an incredible asset to have!
You say it's inefficient to "switch context" and fix a bug the moment you find it. There is some truth there, but... (1) there are ways to work without huge context load, (2) I don't have to fix the bug that very minute. Usually, I make a note and get to it the next day or so. Also (3) the average bug fix in a well structured and tested code base is usually pretty quick.
> If your way was both more efficient AND more aligned with human nature then everyone would be working like this
This assumes the software industry is really well organized. After 40 years experience writing software, that is just hilarious! Though I probably also thought that before I got involved with much better organizations.
But yes I am aware of lots of parts of this industry where you do not need to rush a project no matter what. I worked at places that had a breakneck velocity and at places where it is much more chill. I prefer the latter but I can say that I still want to ship software which means goals and deadlines. Bugs should be fixed ASAP but priorities must also be respected.
After 20 years doing this as a career, I agree this industry is a bit of a mess :)
[0] See "Shop policies" near the bottom of https://www.etsy.com/shop/ForbiddenGlade vs last December: https://web.archive.org/web/20241215201533/https://www.etsy....
Adding regular fixits is how they fix the normal process.
This addition recognizes that most bug fixes do not move the metrics that are used to prioritize work toward business goals. But maintaining software quality, and in particular preventing technical debt from piling up, are essential to any long-running software development process. If quality goes to crap, the project will eventually die. It will bog down under its own weight, and soon no good developer will want to work on it. And then the software project will never again be able to meet business goals.
So: the normal process now includes fixits, dedicated time to focus on fixing things that have no direct contribution to business goals but do, over the long term, determine whether any business goals get met.
"...if you don't fix your bugs your new code will be built on buggy code and ensure an unstable foundation and if you check in buggy code someone else is going to be writing code based on your bad code and well you know you can imagine how wasteful that's going to be"
16:22 of "The Early Days of id Software: Programming Principles" by John Romero (Strange Loop 2022) https://www.youtube.com/watch?v=IzqdZAYcwfY&t=982s
If you'll allow me to project a lot of lived experience on to this story: A policy of fixing bugs immediately sounds like a policy software developers would come up with. A policy of deferring bug fixes to a neatly scheduled week on the calendar for bug fixes sounds like a policy some project managers would brainstorm as a way to keep velocity numbers high and get their tickets closed on schedule.
It's also patronizing to the devs. "Internal survey shows devs complain about software quality, let's give them a week every quarter and the other 11 we do whatever we want". What needs to change here is leadership being honest about business, as sometimes fixing bugs is simply not important. Sure sure it depends on the bug... I am talking about when devs complain about having a huge number of bugs in the backlog (most of them low impact) or whatever something that only affects a small percentage. Another strategy here would be to properly surface the impact of said bugs to users / customers... until you do this, nobody has a reason to care.
I question my life a lot when I'm reviewing code which appears to have been written incorrectly at first so that the author can land a follow up diff with the "fix"
It should be understood that there WILL be bugs, that is NOT a sign of incompetence, and so cleaning them up should be an ongoing task so they do not linger and collect (and potentially get worse by compounding with other bugs).
1) I agree that estimating a bug's complexity upfront is an error prone process. This is exactly why I say in the post that we encourage everyone to "feel out" non trivial issues and if it feels like the scope is expanding too much (after a few hours of investigation), to just pick something else after writing up their findings on the bug.
2) I use the word "bug" to refer to more traditional bugs ("X is wrong in product") but also feature requests ("I wish X feature worked differently"). This is just a companyism that maybe I should have called out in the post!
3) There's definitely a risk the fixit week turns into just "let's wait to fix bugs until that week". This is why our fixits are especially for small bugs which won't be fixed otherwise - it's not a replacement for technical hygiene (i.e. refactoring code, removing dead code, improving abstractions) nor a replacement for fixing big/important issues in a timely manner.
I'd also be curious to know the following: how many new errors or regressions were caused by the bug fixes?
Historically though, I would guess maybe 5-10% end up needing some followup fix which is itself usually smaller than the original (maybe a typo in some documentation or some edge case we spot when it hits prod etc).
The smaller the original fixes, the less likely you are to need followups so another reason to prefer working mainly on them!
It depends on the stage and size of your team and company of course, but for us the result was more predictable delivery and happier, more-engaged developers.
For anyone curious to learn more: https://basecamp.com/shapeup/2.2-chapter-08#cool-down
https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-s...
Some seem ridiculously obvious today, but weren't standard 25 years ago. Seriously! At the turn of the century, not everyone used a bug database or ticket tracker. Lots of places had complicated builds to production, with error-prone manual steps.
But question five is still relevant today: Do you fix bugs before writing new code?
Having every 3rd or 4th sprint being dev initiatives and bugs... Or having a long/short sprint cycle where short sprints are for bugs mostly... Basically every 3rd week is for meetings and bug work so you get a solid 2 weeks with reduced meetings.
It's hard to convince upper managers of the utility though.
1) Things that have existed in your product for decades and haven't been major strategic issues.
2) Things that arose recently in the wake of launches. This can be because it's hard to fix every corner case, or because of individuals throwing sloppy code over the wall to look like they "ship fast".
I try to hold the team to fix bugs (2) quickly while their memory is fresh as it points to unwanted regressions.
The bugs in (1) are more interesting. It's a bit sad that teams kinda have to "sneak that work in" with fixit weeks. I have known of products large enough to be able to A/B test the effects of a quarter's worth of "small fixes", and finding significant gains in key product metrics. That changed management's attitude with respect to "small fixes" - when you have a ton of them, they can produce meaningful impact worthy of strategic consideration, not just a week of giving the dev team free rein to scratch their itch.
With an average of 4 bugs fixed in 5 days and 150 bugs, we can assume 50 bugs with less than one day's effort were just lying around with no one daring to touch them.
1. Build features at all costs
2. Eventually a high profile client has a major issue during an event, costing them a ton of goodwill
3. Leadership pauses everything and the company only works on bugfixes and tech debt for a week or two
I onboarded during step 3. I should have taken that as a warning that that's how the company operated. If your company doesn't make time for bugfixes and getting out of its own way, that culture is hard to change.

I would also question why only 3 of 8 devs approve PRs. Even if that can't change more broadly all of the time, this kind of exercise seems like a perfect time to allow everyone to review PRs - twofold benefit: more fixes are reviewed, and it gives experience reviewing to others that don't get to do that regularly.
So yes, definitely still do PRs, and if that is problematic, consider whether that is an indication the PR process may itself need to be reviewed.
I advocate to never size/score bugs. Instead, if your process demands scores, call everything a 2 because over the course of all the bugs, that will be your average. You'll knock out 10 small ones and then get stuck on a big one. Bug-fixing efforts should be more Kanban than Scrum. Prioritize the most important/damaging/whatever ones, do them in order, and keep doing them until they are done or you run out of time.
inhumantsar•2mo ago
eg: My last company's system was layer after layer built on top of the semi-technical founder's MVP. The total focus on features meant engineers worked solo most of the time and gave them few opportunities to coordinate and standardize. The result was a mess. Logic smeared across every layer, modules or microservices with overlapping responsibilities writing to the same tables and columns. Mass logging all at the error or info level. It was difficult to understand, harder to trace, and nearly every new feature started off with "well first we need to get out of this corner we find ourselves painted into".
When I compare that experience with some other environments I've been in where engineering had more autonomy at the day-to-day level, it's clear to me that this company should have been able to move at least as quickly with half the engineers if they were given the space to coordinate ahead of a new feature and occasionally take the time to refactor things that got spaghettified over time.
lalitmaganti•2mo ago
To be clear, engineers have a lot of autonomy in my team to do what they want. People can and do fix things as they come up and are encouraged to refactor and pay down technical debt as part of their day to day work.
It's more that even with this autonomy, fixit bugs are underappreciated by everyone, even engineers. Having a week where we can address the balance does wonders.