Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't over-committed.
This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.
I had memory issues with my PC build which I fixed by reducing the speed to 2800 MHz, much lower than its advertised speed of 5600 MHz. Looking back at this, the board might have configured the speed incorrectly in the first place; reducing it to 2800 just happened to land on an even multiple of its base clock speed.
Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.
https://www.memtest86.com/blacklist-ram-badram-badmemorylist...
The most expensive memory failure I had was of this sort, and frustratingly came from accidentally unplugging the wrong computer.
After this I did buy some used memory from a recycling center that had the sorts of problems you described and was able to employ them by masking off the bad regions.
We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...
They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
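Translating those numbers into everyday terms is simple napkin math. The sketch below assumes a hypothetical 8 GiB machine (my figure, not the paper's) and uniform scaling of the quoted rate:

```python
# Convert the study's quoted rate (25,000-70,000 errors per billion
# device-hours per Mbit) into errors per hour for one machine.
# The 8 GiB of DRAM below is an illustrative assumption.

MBIT_PER_GIB = 8 * 1024            # 8192 Mbit in one GiB
ram_mbit = 8 * MBIT_PER_GIB        # hypothetical 8 GiB machine

for rate in (25_000, 70_000):
    errors_per_hour = rate / 1e9 * ram_mbit
    print(f"{rate:>6} err per 1e9 device-hours per Mbit -> ~{errors_per_hour:.1f} errors/hour")
```

That works out to roughly 1.6 to 4.6 error events per hour at the raw quoted rate, though the paper found errors heavily concentrated in a minority of bad DIMMs, so a typical healthy machine sees far fewer.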
At the time, they did not see an increase in this rate in "new" RAM technologies, which I think is DDR3 at that time. I wonder if there has been any change since then.
A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.
There are power supplies that are mildly defective but got past QC.
There are server designs where the memory is exposed to EMI and voltage variations that push it ever so slightly out of spec, in ways QC never caught.
Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.
There are a ton of causes for bitflips other than cosmic rays.
For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?
It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.
Unfortunately, not that many consumer platforms make this possible or affordable.
I wonder sometimes if we shouldn't be doing like NASA does and triple-storing values and comparing the calculations to see if they get the same results.
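The triple-store-and-vote idea (triple modular redundancy) is easy to sketch. This toy Python class is purely illustrative; real flight software also scrubs and re-syncs the copies:

```python
# Toy triple modular redundancy (TMR): keep three copies of a value and
# take a bitwise majority vote on read, so any single-copy bit flip is
# outvoted by the other two copies.

class TripleStored:
    def __init__(self, value: int):
        self.copies = [value, value, value]

    def read(self) -> int:
        a, b, c = self.copies
        # A bit is 1 in the result iff at least two copies agree on 1.
        return (a & b) | (a & c) | (b & c)

v = TripleStored(0b1010)
v.copies[1] ^= 0b0100      # simulate a bit flip in one copy
assert v.read() == 0b1010  # the vote still recovers the original value
```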
I find this impossible to believe.
If this were so, all devs for apps, games, etc. would be talking about it, but since this is the first time I'm hearing about it, I'm seriously doubting it.
>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.
Might be the case, but 10% is still huge.
There has to be something else going on, imo. Either their userbase/tracking is biased or something else...
Browsers, videogames, and Microsoft Excel push computers really hard compared to regular applications, so I expect they're more likely to cause these types of errors.
The original Diablo 2 game servers for battle.net, which were Compaq 1U servers, failed at astonishing rates due to their extremely high utilization and consequent heat-generation. Compaq had never seen anything like it; most of their customers were, I guess, banking apps doing 3 TPS.
The more RAM you have, the higher the probability that there will be some bad bits. And the more RAM a program uses, the more likely it will be using some that is bad.
Same phenomenon with huge hard drives.
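A rough model makes the scaling concrete. All the numbers below are made up for illustration:

```python
# Illustrative model: k bad bytes scattered uniformly over N bytes of
# RAM; a program touching M of them misses every bad byte with
# probability ~(1 - M/N)**k, so it hits at least one with the complement.

def p_hits_bad_byte(N: int, M: int, k: int) -> float:
    return 1 - (1 - M / N) ** k    # good approximation for k << N

GiB = 1 << 30
small = p_hits_bad_byte(N=16 * GiB, M=1 * GiB, k=2)   # ~0.12
big = p_hits_bad_byte(N=16 * GiB, M=2 * GiB, k=2)     # ~0.23
assert big > small   # a bigger footprint raises the odds of hitting a bad bit
```

Doubling either the program's footprint or the number of bad bytes roughly doubles the chance of touching one, matching the intuition above.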
A bit flip actually needs to be pretty "lucky" to result in a crash.
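One way to see why: a flip in plain data just yields a slightly wrong value, while the same flip in something pointer-like faults immediately. A toy Python illustration (the bit positions are arbitrary):

```python
import struct

# A flip in plain data silently changes a value and the program keeps
# running; only flips landing in pointers, lengths, or code tend to fault.

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float's 64-bit IEEE 754 representation."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (result,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return result

x = flip_bit(1.0, 3)       # data flip: a slightly wrong number, no crash
print(x)

buf = [0] * 8
idx = 4 ^ (1 << 30)        # the same single-bit flip in an index
try:
    buf[idx]               # "pointer" now far out of range
except IndexError:
    print("crashed")       # this is the unlucky case
```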
Granted, they're probably just as accurate as netcraft. /shrug
67k crashes / day
claim: "Given # of installs is X; every install must be crashing several times a day"
We'll translate that to: "every install crashes 5 times a day"
67k crashes/day ÷ 5 crashes/install/day ≈ 13k installs
Your claim is there's ~13k Firefox users? Lol
Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.
However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.
In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.
In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).
As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.
I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.
That's the thing. Bit flips impact everything memory-resident, and that includes program code. You have no way of telling what instruction was actually read when executing the line your instrumentation says corresponds to the MOV; or it may have been a legit memory operation, but the instrumentation is reporting the wrong offset. There are some ways around it, but, generically, if a system runs a program bigger than the processor cache and may have bit flips, the output is useless, including whatever telemetry you use (because the telemetry is itself code executed from RAM and will touch RAM).
Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!
We'd save the test result to the registry and include the result in automated bug reports.
The common causes we discovered for the problem were:
- overclocked CPU
- bad memory wait-state configuration
- underpowered power supply
- overheating due to under-specced cooling fans or dusty intakes
These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.
Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.
And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.
Sometimes I'm amazed that computers even work at all!
Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.
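A compute-and-compare self-test like the one described can be sketched in a few lines. The workload and constants here are my own stand-ins, not Guild Wars' actual test:

```python
import hashlib

# Run a deterministic, math-heavy pass over a fresh buffer and compare
# a digest of the result against a known-good value. Any mismatch means
# RAM or the CPU produced a wrong result somewhere along the way.

def workload(size: int = 1 << 18, seed: int = 0x9E3779B9) -> str:
    buf = bytearray(size)
    state = seed
    for i in range(size):
        # xorshift32 step: exercises the ALU and writes every buffer byte.
        state ^= (state << 13) & 0xFFFFFFFF
        state ^= state >> 17
        state ^= (state << 5) & 0xFFFFFFFF
        buf[i] = state & 0xFF
    return hashlib.sha256(buf).hexdigest()

# A shipped game would bake this in as a constant computed on known-good
# hardware; here we just capture it at startup for illustration.
KNOWN_GOOD = workload()

def hardware_self_test() -> bool:
    return workload() == KNOWN_GOOD
```

On failure, the approach described above saved the result and surfaced it to the player; a modern equivalent might attach the flag to crash telemetry.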
I imagine the largest volume of game memory consumption is media assets which if corrupted would really matter, and the storage requirement for important content would be reasonably negligible?
I don't think engineering effort should ever be put into handling literal bad hardware. But, the user would probably love you for letting them know how to fix all the crashing they have while they use their broken computer!
To counter that, we're LONG overdue for ECC in all consumer systems.
Funny you say this, because for a good while I was running OC'd RAM
I didn't see any instability, but Event Viewer was a bloodbath - reducing the speed a few notches stopped the entries (iirc 3800MHz down to 3600)
I eventually discovered with some timings I could pass all the usual tests for days, but would still end up seeing a few corrected errors a month, meaning I had to back off if I wanted true stability. Without ECC, I might never have known, attributing rare crashes to software.
From then on I considered people who think you shouldn't overclock ECC memory to be a bit confused. It's the only memory you should be overclocking, because it's the only memory where you can prove you don't have errors.
I found that DDR3 and DDR4 memory (on AMD systems at least) had quite a bit of extra “performance” available over the standard JEDEC timings. (Performance being a relative thing, in practice the performance gained is more a curiosity than a significant real life benefit for most things. It should also be noted that higher stated timings can result in worse performance when things are on the edge of stability.)
What I’ve noticed with DDR5, is that it’s much harder to achieve true stability. Often even cpu mounting pressure being too high or low can result in intermittent issues and errors. I would never overclock non-ECC DDR5, I could never trust it, and the headroom available is way less than previous generations. It’s also much more sensitive to heat, it can start having trouble between 50-60 degrees C and basically needs dedicated airflow when overclocking. Note, I am not talking about the on chip ECC, that’s important but different in practice from full fat classic ECC with an extra chip.
I hate to think of how much effort will be spent debugging software in vain because of memory errors.
This attitude is entirely corporate-serving cope from Intel to serve market segmentation. They wanted to trifurcate the market between consumers, business, and enthusiast segments. Critically, lots of business tasks demand ECC for reliability, and business has huge pockets, so that became a business feature. And while Intel was willing to sell product to overclockers[0], they absolutely needed to keep that feature quarantined from consumer and business product lines lest it destroy all their other segmentation.
I suspect they figured a "pro overclocker" SKU with ECC and unlocked multipliers would be about as marketable as Windows Vista Ultimate, i.e. not at all, so like all good marketing drones they played the "Nobody Wants What We Aren't Selling" card and decided to make people think that ECC and overclocking were diametrically opposed.
[0] In practice, if they didn't, they'd all just flock to AMD.
Only when AMD had better price/performance, not because of ECC. At best you have a handful of homelabbers who went with AMD for their NAS, but approximately nobody who cares about performance switched to AMD for ECC RAM, because ECC RAM also tends to be clocked lower.
I've read this decade ago... https://www.codeofhonor.com/blog/whose-bug-is-this-anyway
Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.
I also find that firefox crashes much more than chrome based browsers, but it is likely that chrome's superior stability is better handing of the other 90% of crashes.
If 50% of Chrome crashes were due to bit flips, and bit flips affect the two browsers at basically the same rate, that would indicate that Chrome experiences 1/5th the total crashes of Firefox, even though the bit-flip crashes happen at the same rate on both browsers.
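Spelled out with a hypothetical per-user bit-flip rate b (the 50% Chrome figure is the comment's hypothetical, not a measured number):

```python
# If bit flips hit both browsers at the same absolute rate b, but make
# up 10% of Firefox crashes vs. a hypothetical 50% of Chrome crashes:
b = 1.0                    # bit-flip crashes per user per period (arbitrary unit)
firefox_total = b / 0.10   # flips are 10% of the total -> total is 10b
chrome_total = b / 0.50    # flips are 50% of the total -> total is 2b
print(chrome_total / firefox_total)   # Chrome would have 1/5 of Firefox's crashes
```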
It would have been better news for firefox if the number of crashes due to faulty hardware were actually much higher! These numbers indicate the vast majority of firefox crashes are actually from buggy software : (
RAM flips are common. This kind of thing is old and has likely gotten worse.
IBM had data on this. DEC had data on this. Amazon/Google/Microsoft almost certainly had data on this. Anybody who runs a fleet of computers gets data on this, and it is always eye opening how common it is.
ZFS is really good at spotting RAM flips.
Hardware problems are just as good a potential explanation for those as anything else.
[1] https://www.corsair.com/us/en/explorer/diy-builder/memory/is...
CPU caches and registers: how exactly are they different from RAM on a SoC in this regard?
CPUs tend to be built to tolerate upsets, like having ECC and parity in arrays and structures, whereas the DRAM on a MacBook probably does not. But there is no objective standard for these things, and redundancy is not foolproof; it is just another lever to move the reliability equation with.
"I can't answer that question directly because crash reports have been designed so that they can't be tracked down to a single user. I could crunch the data to find the ones that are likely coming from the same machine, but it would require a bit of effort and it would still only be a rough estimate."
You can't claim any percentage if you don't know what you are measuring. Based on his hot take, I can run an overclocked machine, have Firefox crash a few hundred thousand times a day, and he'll use my data to support his position. Further, see below:
First: A pre-text: I use Firefox, even now, despite what I post below. I use it because it is generally reliable, outside of specific pain points I mention, free, open source, compatible with most sites, and for now, is more privacy oriented than chrome.
Second: On both corporate and home devices, Firefox has shown to crash more often than Chrome/Chromium/Electron powered stuff. Only Safari on Windows beats it out in terms of crashes, and Safari on Windows is hot garbage. If bit flips were causing issues, why are chromium based browsers such as edge and Chrome so much more reliable?
Third: Admittedly, I do not pay close enough attention to know when Firefox sends crash reports; however, what I do know is that it thinks it crashes far more often than it does. A `sudo reboot` on Linux, for example, will often make Firefox think it crashed on my machine. (It didn't; Linux just kills everything quickly, flushes IO buffers, and reboots... and Firefox often can't even recover the session after.)
Fourth: some crashes ARE repeatable (see above), which means bit flips aren't the issue.
Just my thoughts.
These are potential bitflips.
I found an issue only yesterday in firefox that does not happen in other browsers on specific hardware.
My guess is that the software is riddled with edge-case bugs.
It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.
If Firefox itself has so few bugs that it crashes very infrequently, it is not contradictory to what you are saying.
I wouldn't be surprised if 99% of crashes in my "hello world" script are caused by bit flips.
> That's one crash every twenty potentially caused by bad/flaky memory, it's huge! And because it's a conservative heuristic we're underestimating the real number, it's probably going to be at least twice as much.
So the data actually only supports 5% being caused by bitflips, then there's a magic multiple of 2? Come on. Let alone this conservative heuristic that is never explained - what is it doing that makes him so certain that it can never be wrong, and yet also detects these at this rate?
Has to be normalized, and outliers eliminated in some consistent manner.
He doesn't explain anything indeed but presumably that code is available somewhere.
Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].
There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.
edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.
[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...
[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568
[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...
[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...
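The "one bit away from a magic number" heuristic behind [2] can be sketched as a Hamming-distance check. The sentinel value below is an example constant, not Mozilla's actual one:

```python
# A corrupted sentinel that differs from the expected magic value in
# exactly one bit is far more likely a bit flip than a random
# overwrite, which would typically clobber many bits at once.

def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

MAGIC = 0xDEADBEEF   # example sentinel for illustration

def looks_like_bitflip(observed: int) -> bool:
    return hamming_distance(observed, MAGIC) == 1

assert looks_like_bitflip(MAGIC ^ (1 << 7))   # single flipped bit -> suspect a flip
assert not looks_like_bitflip(0x00000000)     # wholesale smash -> not a flip
```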