Anybody have sizable examples? Everything I can think of ends up on dedicated GPUs.
There are PoCs of corrupting memory _that the kernel uses to decide what that process can access_, but the process can't read that memory itself. It only knows that the kernel says yes where it used to say no (assuming it doesn't crash the whole machine first).
I'd expect all code to be strongly controlled in the former, and reasonably secured in the latter, where software/driver-level mitigations are possible and corrupting somebody else's desktop with rowhammer doesn't seem like a good investment anyway.
As another person mentioned (and maybe it's a wider usage than I thought), cloud GPU compute running custom code seems to be the only useful target. But I'm having a hard time coming up with a useful scenario. Maybe corrupting a SIEM's analysis and alerting of an ongoing attack?
Which is my point.
* Random aside: how is it legal for Colab compute credits to have a 90-day expiration? I thought California outlawed company currency that expires (a la gift cards)?
Basically, Google Colab credits are like buying a seasonal bus pass with X trips, or a monthly parking pass with X hours, rather than store cash that can be used for anything.
> In a proof-of-concept, we use these bit flips to tamper with a victim’s DNN models and degrade model accuracy from 80% to 0.1%, using a single bit flip.
There is a certain irony in doing this to probabilistic models, designed to mimic an inherently error-prone and imprecise reality.
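To see why a single flip can be that devastating: if it lands in the exponent field of an FP32 weight, the value jumps by dozens of orders of magnitude and poisons every activation it touches. A minimal sketch in C (the weight value here is made up):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float w = 0.05f;                 /* a plausible small DNN weight */
        uint32_t bits;
        memcpy(&bits, &w, sizeof bits);
        bits ^= 1u << 30;                /* flip the top bit of the exponent field */
        memcpy(&w, &bits, sizeof w);
        printf("weight after one bit flip: %g\n", w);  /* ~1.7e+37 */
        return 0;
    }

One flip turns 0.05 into roughly 1.7e37; a single weight like that is enough to swamp a layer's output.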
So it doesn't seem that wild to me that turning on ECC might require running at lower bandwidth.
A similar situation occurs with GDDR6, except Nvidia was too cheap to implement the extra traces and pay for the extra chip, so instead, they emulate ECC using the existing memory and memory bandwidth, rather than adding more memory and memory bandwidth like CPU vendors do. This causes the performance hit when you turn on ECC on most Nvidia cards. The only exception should be the HBM cards, where the HBM includes ECC in the same way it is done on CPU memory, so there should be no real performance difference.
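For a rough sense of what that emulation costs: textbook SECDED needs 8 check bits per 64 data bits, so storing them inline in ordinary GDDR eats about 11% of raw capacity, plus extra transfers to fetch them. A back-of-the-envelope check of that figure (Nvidia's actual inline layout may differ):

    #include <stdio.h>

    int main(void) {
        int m = 64, r = 0;
        while ((1 << r) < m + r + 1)     /* Hamming bound: 2^r >= m + r + 1 */
            r++;
        r += 1;                          /* extra overall-parity bit for double-error detection */
        printf("%d check bits per %d data bits -> %.1f%% of raw capacity is ECC\n",
               r, m, 100.0 * r / (m + r));
        return 0;
    }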
Frustratingly, it's only unregistered that's stuck in limbo; VCC makes a kit of registered 7200.
There is no technical reason why ECC UDIMMs cannot be overclocked to the same extent, and ECC actually makes them better for overclocking, since they can detect when overclocking is starting to cause problems. You might notice that non-ECC UDIMMs have pads and traces for an additional IC that is present on ECC UDIMMs. That is because ECC DIMMs and non-ECC DIMMs are made from the same things: they use the same PCBs and the same chips. The main differences are whether the extra chips to store ECC bits are populated, what the SPD says the module is, and what the sticker says. There might also be minor differences in which resistors are populated. Getting back to overclocking: if you are willing to go back to the days before premium pre-overclocked kits existed, you will likely find that a number of ECC UDIMMs can and will overclock with similar parameters. There is just no guarantee of that.
As for RDIMMs having higher transfer rates, consider the differences between a UDIMM, a CUDIMM and an RDIMM. The UDIMM connects directly to the CPU memory controller for the clock, address, control and data signals, while the RDIMM has a register chip that buffers the clock, address and control signals, although the data signals still connect to the memory controller directly. This improves signal integrity and lets more memory ICs be attached to the memory controller. A recent development is the CUDIMM, which is a hybrid of the two: its clock signal is buffered by a Client Clock Driver, which does exactly what the register chip does to the clock signal in RDIMMs. CUDIMMs are able to reach higher transfer rates than UDIMMs without overclocking because of the Client Clock Driver, and since RDIMMs also buffer the clock, they can similarly reach higher transfer rates.
That said, GDDR7 does on-die ECC, which gives immunity to this attack in its current form. There is no way to get information about corrected bitflips from on-die ECC, but it is better than nothing.
Worst-case scenario, someone pulls this off using WebGL and a website is able to corrupt your VRAM. They can't actually steal anything in that scenario (AFAIK), making it nothing more than a minor inconvenience.
You escape a closed virtual universe not by "breaking out" in the traditional sense, exploiting some bug in the VM hypervisor's boundary itself, but by directly manipulating the underlying physics of the universe on which the virtual universe is founded, just by creating a pattern inside the virtual universe itself.
No matter how many virtual digital layers there are, as long as you can impact the underlying analog substrate, this might work.
Makes you dream that there could be an equivalent for our own universe.
I’ve always considered that to be what’s achieved by the LHC: smashing the fundamental building blocks of our universe together at extreme enough energies to briefly cause ripples through the substrate of said universe
As an example of an alternative analogy: think of how many bombs need to explode in your dreams before the "substrate" is "rippled". How big do the bombs need to be? How fast does the "matter" have to "move"? I think "reality" is more along those lines. If there is a substrate - and that's a big if - IMO it's more likely to be something pliable like "consciousness". Not in the least "disturbed" by anything moving in it.
The LHC is extremely impressive from a human engineering perspective, but it's nowhere close to pushing the boundaries of what's going on every second in the universe at large.
Turns out this whole virtualized house abstraction is a sham
On a philosophical level I somewhat agree, but on a practical level I am sad as this likely means reduced performance again.
andyferris•9h ago
GPUs have always been squarely in the "get stuff to consumers ASAP" camp, rather than NASA-like engineering that can withstand cosmic rays and such.
I also presume an EM simulation would be able to spot it, but prior to rowhammer it's also possible no-one ever thought to check for it (or, more likely, that they'd check the simulation with random or typical data inputs, not a hitherto-unthought-of attack vector; that doesn't explain more modern hardware, though).
privatelypublic•9h ago
This is a huge theme for vulnerabilities. I almost said "modern", but looking back I've seen the cycle (disregard attacks as strictly hypothetical; get caught unprepared when somebody publishes something making them practical) happen more than a few times.
Palomides•9h ago
(personally I think all RAM in all devices should be ECC)
andyferris•9h ago
It's more of a tragedy-of-the-commons problem. Consumers don't know what they don't know and manufacturers need to be competitive with respect to each other. Without some kind of oversight (industry standards bodies or goverment regulation), or a level of shaming that breaks through to consumers (or e.g. class action lawsuits that impact manufacturers), no individual has any incentive to change.
userbinator•8h ago
RAM that doesn't behave like RAM is not RAM. It's defective. ECC is merely an attempt at fixing something that shouldn't have made it to market in the first place. AFAIK there is an RH variant that manages to flip bits undetectably even with ECC RAM.
nsteel•1h ago
Single Error Correction, Double Error Detection, Triple Error Chaos.
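A toy decoder makes the joke concrete: one flip gets corrected, two get detected, and three can get silently "corrected" into garbage. A sketch using the small Hamming(8,4) SECDED code rather than the Hamming(72,64) code real DIMMs use:

    #include <stdio.h>

    /* Encode 4 data bits as an 8-bit SECDED word: Hamming(7,4) plus an
       overall parity bit at position 0. */
    static unsigned char encode(unsigned d) {
        int c[8] = {0};
        c[3] = d & 1; c[5] = (d >> 1) & 1; c[6] = (d >> 2) & 1; c[7] = (d >> 3) & 1;
        c[1] = c[3] ^ c[5] ^ c[7];   /* parity over positions with bit 0 set */
        c[2] = c[3] ^ c[6] ^ c[7];   /* parity over positions with bit 1 set */
        c[4] = c[5] ^ c[6] ^ c[7];   /* parity over positions with bit 2 set */
        c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7];  /* overall parity */
        unsigned char w = 0;
        for (int i = 0; i < 8; i++) w |= (unsigned char)(c[i] << i);
        return w;
    }

    static void decode(unsigned char w) {
        int s = 0, p = 0;            /* syndrome and overall parity */
        for (int i = 0; i < 8; i++)
            if ((w >> i) & 1) { s ^= i; p ^= 1; }
        if (s == 0 && p == 0) printf("0x%02x: clean\n", w);
        else if (p == 1)      printf("0x%02x: 'single' error at bit %d, corrected\n", w, s);
        else                  printf("0x%02x: double error detected, uncorrectable\n", w);
    }

    int main(void) {
        unsigned char w = encode(0xA);
        decode(w);          /* clean                                      */
        decode(w ^ 0x08);   /* 1 flip: genuinely corrected                */
        decode(w ^ 0x28);   /* 2 flips: detected, machine can halt        */
        decode(w ^ 0xE0);   /* 3 flips: "corrects" the wrong bit -> chaos */
        return 0;
    }

The last case is the chaos: the decoder sees odd parity and a nonzero syndrome, confidently flips a fourth, innocent bit, and hands corrupted data back as "corrected".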
ryao•7h ago
It should be considered unethical to sell machines with non-ECC memory in any real volume.
userbinator•8h ago
It was known as "pattern sensitivity" in the industry for decades, basically ever since the beginning, and considered a blocking defect. Here's a random article from 1989 (don't know why first page is missing, but look at the references): http://web.eecs.umich.edu/~mazum/PAPERS-MAZUM/patternsensiti...
Then some bastards like these came along...
https://research.ece.cmu.edu/safari/thesis/skhan_jobtalk_sli...
...and essentially said "who cares, let someone else be responsible for the imperfections while we can sell more crap", leading to the current mess we're in.
The flash memory industry took a similar dark turn decades ago.
MadnessASAP•7h ago
Nothing is perfect, everything has its failure conditions. The question is where do you choose to place the bar? Do you want your component to work at 60, 80, or 100C? Do you want it to work in high radiation environments? Do you want it to withstand pathological access patterns?
So in other words, there isn't a market for rowhammer-resilient GPUs at double the $/GB of RAM sufficient to justify manufacturing them.
sroussey•2h ago
The positive part of the original rowhammer report was that it gave us a new tool to validate memory (it caused failures much faster than other validation methods).
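The core of such a test is tiny. A sketch of the kind of hammer loop the public x86 PoCs use (finding addresses that map to physically adjacent DRAM rows is the hard part, and is assumed already done here):

    #include <stdint.h>
    #include <emmintrin.h>   /* _mm_clflush */

    /* Classic double-sided hammer kernel. Repeatedly activating two
       aggressor rows can flip bits in the victim row between them. */
    static void hammer(volatile uint8_t *a, volatile uint8_t *b, long iters) {
        for (long i = 0; i < iters; i++) {
            (void)*a;                        /* read -> row activation */
            (void)*b;
            _mm_clflush((const void *)a);    /* evict so the next read hits DRAM */
            _mm_clflush((const void *)b);
        }
    }

Run that for a few hundred thousand iterations per row pair, then scan the victim rows for bits that no longer match what was written; marginal cells show up far faster than under a uniform memtest pattern.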