Why don’t PCs use error correcting RAM? “Because Intel,” says Linus


We've been enjoying a kinder, gentler Linus Torvalds for the past couple of years… but that doesn't mean he stopped having opinions.

This Monday, Linux kernel creator Linus Torvalds went on a frustrated rant about the lack of Error Correcting Code (ECC) RAM in consumer PCs and laptops.

… the misguided and arse-backwards policy of “consumers don't need ECC”, [made] the market for ECC memory go away.

The arguments against ECC were always complete and utter garbage. Now even the memory manufacturers are starting to do ECC internally because they finally owned up to the fact that they absolutely have to.

If you're not familiar with ECC RAM, it's probably because you don't build or spec dedicated servers using server-grade CPUs and motherboards—which, unfortunately, is about the only place you actually find ECC. In a nutshell, ECC RAM includes a tiny amount of extra memory used for detection and correction of errors.

Memory errors and probability

In most modern implementations, this means for every 64-bit word stored in RAM, there are eight checking bits. A single bit error—a 0 flipped to 1, or a 1 flipped to 0—can be both detected and corrected automatically. Two bits flipped in the same word can be detected but not corrected. Three or more bits flipped in the same word will probably be detected, but detection is not guaranteed.
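To make the "eight checking bits per 64-bit word" arithmetic concrete, here is a minimal Python sketch of a single-error-correcting, double-error-detecting code. It assumes an extended Hamming (72,64) layout: seven Hamming parity bits plus one overall parity bit. Real memory controllers may use a different code, and the names `encode` and `decode` are illustrative, not taken from any actual hardware or library.

```python
from typing import List, Optional, Tuple

CODE_LEN = 72                               # 64 data bits + 8 check bits
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]  # 7 Hamming parity bits
# Position 0 holds the overall parity bit; the other 64 positions hold data.
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in PARITY_POSITIONS]


def encode(word: int) -> List[int]:
    """Encode a 64-bit integer into a 72-bit codeword (list of 0/1)."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    for p in PARITY_POSITIONS:
        # Parity bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[i] for i in range(1, CODE_LEN) if i & p) % 2
    bits[0] = sum(bits) % 2                 # make the total number of 1s even
    return bits


def decode(codeword: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, word); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(codeword)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    overall_ok = sum(bits) % 2 == 0

    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok and syndrome < CODE_LEN:
        # Odd number of flips: assume exactly one, located at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity still even but syndrome nonzero (two flips), or the
        # syndrome points outside the codeword: detected, not fixable.
        return "uncorrectable", None

    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, word
```

Flipping any one element of `encode(word)` and passing the result to `decode` returns `"corrected"` along with the original word. Flipping two returns `"uncorrectable"`. Three or more flips can occasionally look like a single-bit error and be "corrected" to the wrong value, which is why detection at that point is likely but not guaranteed.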

Bit flips can happen for many reasons, beginning with cosmic-ray impact or simple hardware failure. A large-scale study of Google servers found that roughly 32 percent of all servers (and 8 percent of all DIMMs) in Google's fleet experience at least one memory error per year. But the vast majority of these are single-bit errors—and since Google is using server CPUs and ECC RAM, this means the machines in question keep right on trucking.

In consumer machines, even these single-bit errors—which are over 40 times more likely to occur than multiple-bit errors, according to Google's data—go undetected and can introduce instability into systems and corruption into data.
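As a rough illustration of how a per-DIMM error rate adds up to a per-machine rate, the Python sketch below takes the 8 percent per-DIMM figure cited above and assumes that each DIMM fails independently of the others. That independence is a simplification made for this example, not a finding of the Google study, and real errors may well be correlated. The DIMM counts and the function name `p_machine_error` are likewise hypothetical.

```python
def p_machine_error(p_dimm: float, dimms: int) -> float:
    """Chance that at least one of `dimms` modules sees an error in a year,
    ASSUMING each module errs independently with probability `p_dimm`."""
    return 1.0 - (1.0 - p_dimm) ** dimms


# 0.08 is the per-DIMM annual figure quoted in the article; the DIMM
# counts below are made-up configurations for illustration only.
for n in (1, 2, 4, 8):
    print(f"{n} DIMM(s): {p_machine_error(0.08, n):.1%} chance of an error per year")
```

Under this simplified model, a machine with four or five DIMMs lands in the same general range as the roughly 32 percent per-server figure. The practical point is that more installed memory means more chances for a bit to flip.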

Bit flips aren’t always accidental

Not every RAM error is the result of a hardware failure or unintentional EMF problem. In recent years, researchers have developed increasingly practical physics-based side channel attacks. These use controlled, rapid bit flips in areas of RAM accessible to one application to deduce or modify the values of data in adjacent areas of RAM that the application shouldn't be able to access.
Although ECC RAM can't mitigate RAMBleed-style attacks that deduce the values of adjacent memory, it can generally stop Rowhammer attacks, in which rapidly flipping bits in one area of RAM causes bits in an adjacent area to change.

Even when ECC can't actively prevent a Rowhammer attack from having an impact on the system (for example, when the attack flips multiple bits in one word), it can at least alert the system to the problem and, in most cases, prevent the attack from doing anything other than causing downtime. (Most ECC systems are configured to halt the entire machine if an uncorrectable error is detected.)

Torvalds blames Intel

And the memory manufacturers claim it's because of economics and lower power. And they are lying bastards—let me once again point to row-hammer about how those problems have existed for several generations already, but these f*ckers happily sold broken hardware to consumers and claimed it was an “attack,” when it always was “we're cutting corners.”

How many times has a row-hammer like bit-flip happened just by pure bad luck on real non-attack loads? We will never know. Because Intel was pushing shit to consumers.

Torvalds takes the bold position that the lack of ECC RAM in consumer technology is Intel's fault due to the company's policy of artificial market segmentation. Intel has a vested interest in pushing deeper-pocketed businesses toward its more expensive—and profitable—server-grade CPUs rather than letting those entities effectively use the necessarily lower-margin consumer parts.

Removing support for ECC RAM from CPUs that aren't targeted directly at the server world is one of the ways Intel has kept those markets strongly segmented. Torvalds' argument here is that Intel's refusal to support ECC RAM in its consumer-targeted parts—along with its de facto near-monopoly in that space—is the real reason that ECC is nearly unavailable outside the server space.

The usual argument around why ECC isn't present in consumer tech revolves around cost, but we suspect Torvalds has the right of it here. Despite ECC RAM being essentially a hard-to-find specialty part, it typically only costs about 20 percent more per DIMM than non-ECC does at retail. The real problem is that without motherboards and CPUs which support it, it won't do you any good.
