Keeping Spectre secret

When Graz University of Technology researcher Michael Schwarz first reached out to Intel, he thought he was about to ruin the company’s day. He had found a problem with their chips, together with his colleagues Daniel Gruss, Moritz Lipp, and Stefan Mangard. The vulnerability was both profound and immediately exploitable. His team finished the exploit on December 3rd, a Sunday afternoon. Realizing the gravity of what they’d found, they emailed Intel immediately.

It would be nine days until Schwarz heard back. But when he got on the phone with someone from Intel, Schwarz got a surprise: the company already knew about the CPU problems and was desperately figuring out how to fix them. Moreover, the company was doing its best to make sure no one else found out. They thanked Schwarz for his contribution, but told him what he had found was top secret, and gave him a precise day when the secret could be revealed.

The flaw Schwarz — and, he learned, many others — had discovered was potentially devastating: a design-level chip flaw that could slow down every processor in the world, with no perfect fix short of a gut redesign. It affected almost every major tech company in the world, from Amazon’s server farms to the chipmakers like Intel and ARM. But Schwarz had also come up against a secondary problem: how do you keep a flaw this big a secret long enough for everyone involved to fix it?

Disclosure is an old problem in the security world. Whenever a researcher finds a bug, the custom is to give vendors a few months to fix the problem before it goes public and bad guys have a chance to exploit it. But as those bugs affect more companies and more products, the dance becomes more complex. More people need to be told and kept in confidence as more software needs to be quietly developed and pushed out. With Meltdown and Spectre, that multi-party coordination broke down and the secret spilled out before anyone was ready.

That early breakdown had consequences. After the release, basic questions of fact became muddled, like whether AMD chips are vulnerable to Spectre attacks (they are), or whether Meltdown is specific to Intel. (ARM chips are also affected.) Antivirus systems were caught off guard, unintentionally blocking many of the crucial patches from being deployed. Other patches had to be stopped mid-deployment after crashing machines. One of the best tools available for dealing with the vulnerability has been Retpoline, developed by Google’s incident response team and initially planned for release alongside the bug itself. But while the Retpoline team says they weren’t caught off guard, the code for the tool wasn’t made public until the day after the official announcement of the flaw, in part because of the haphazard break in the embargo.

Perhaps most alarming, some crucial outside response groups were left out of the loop entirely. The most authoritative alert about the flaw came from Carnegie Mellon’s CERT division, which works with Homeland Security on vulnerability disclosures. But according to senior vulnerability analyst Will Dormann, CERT wasn’t aware of the issue until the Meltdown and Spectre websites went live, which led to even more chaos. The initial report recommended replacing the CPU as the only solution. For a processor design flaw, the advice was technically true, but only stoked panic as IT managers imagined prying out and replacing the central processor for every device in their care. A few days later, Dormann and his colleagues decided the advice wasn’t actionable and changed the recommendation to simply installing patches.

“I would have liked to have known,” Dormann says. “If we’d known about it earlier, we would have been able to produce a more accurate document, and people would have been more educated right off the bat, as opposed to the current state, where we’ve been testing patches and updating the document for the past week.”

Still, maybe that damage was inevitable? Even Dormann isn’t sure. “This happens to be the largest multi-party vulnerability we’ve ever been part of,” he told me. “With a vulnerability of this magnitude, there’s no way that it’s going to come out cleanly and everyone’s going to be happy.”


The first step in the Meltdown and Spectre disclosures came six months before Schwarz’s discovery, with a June 1st email from Google Project Zero’s Jann Horn. Sent to Intel, AMD and ARM, the message laid out the flaw that would become Spectre, with a demonstrated exploit against Intel and AMD processors and troubling implications for ARM. Horn was careful to give just enough information to get the vendors’ attention. He had reached out to the three chipmakers on purpose, calling on each company to figure out its own exposure and notify any other companies that might be affected. At the same time, Horn warned them not to spread the information too far or too fast.

“Please note that so far, we have not notified other parts of Google,” Horn wrote. “When you notify other parties about this issue, please don’t share information unnecessarily.”

Figuring out who was affected would prove difficult. There were chipmakers to start, but soon it became clear that operating systems would need to be patched, which meant looping in another round of researchers. Browsers would be implicated, too, along with the massive cloud platforms run by Google, Microsoft, and Amazon, arguably the most tempting targets for the new bug. By the end, dozens of companies from every corner of the industry would be compelled to issue a patch of some kind.

Project Zero’s official policy is to offer only 90 days before going public with the news, but as more companies joined, Zero seems to have backed down, more than doubling the patch window. As months ticked by, companies began deploying their own patches, doing their best to disguise what they were fixing. Google’s Incident Response Team was notified in July, a month after the initial warning from Project Zero. The Microsoft Insiders program sent out a quiet, early patch in November. (Intel CEO Brian Krzanich was making more controversial moves during the same period, arranging an automated stock sell-off in October to be executed on November 29th.) On December 14th, Amazon Web Services customers got a warning that a wave of reboots on January 5th might affect performance. Another Microsoft patch was compiled and deployed on New Year’s Eve, suggesting the security team was working through the night. In each case, the reasons for the change were vague, leaving users with little clue as to what was being fixed.

Still, you can’t rewrite the basic infrastructure of the internet without someone getting suspicious. The strongest clues came from Linux. Powering most of the cloud servers on the internet, Linux had to be a big part of any fix for Spectre and Meltdown. But because Linux is an open-source system, any changes had to be made in public. Every update was posted to a public Git repository, and all official communications took place on a publicly archived listserve. When kernel patches started to roll out for a mysterious “page table isolation” feature, close observers knew something was up.

The biggest hint came on December 18th, when Linus Torvalds merged a late-breaking patch that changed the way the Linux kernel interacts with x86 processors. “This, besides helping fix KASLR leaks (the pending Page Table Isolation (PTI) work), also robustifies the x86 entry code,” Torvalds explained. The most recent kernel release had come just one day earlier. Normally a patch would wait to be bundled into the next release, but for some reason, this one was too important. Why would the famously cranky Torvalds include an out-of-band update so casually, especially one that seemed likely to slow down the kernel?

It seemed even stranger when month-old emails turned up suggesting that the patch would be applied to old kernels retroactively. Taking stock of the rumors on December 20th, Linux veteran Jonathan Corbet said the page table issue “has all the markings of a security patch being readied under pressure from a deadline.”

Still, they only knew half the story. Page Table Isolation is a way of separating kernel space from user space, so clearly the problem was some kind of leak in the kernel. But it still wasn’t clear how the kernel was breaking or how far the mysterious bug would reach.

The next break came from the chipmakers themselves. Under the new patch, Linux listed all x86-compatible chips as vulnerable, including AMD processors. Since the patch tended to slow down the processor, AMD wasn’t thrilled about being included. The day after Christmas, AMD engineer Tom Lendacky sent an email to the public Linux kernel listserve explaining exactly why AMD chips didn’t need a patch.

“The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault,” Lendacky wrote.

That might sound technical, but for anyone trying to suss out the nature of the bug, it rang out like a fire alarm. Here was an AMD engineer, who surely knew the vulnerability from the source, saying the kernel problem stemmed from something processors had been doing for nearly 20 years. If speculative references were the problem, it was everyone’s problem — and it would take much more than a kernel patch to fix.

“That was the trigger,” says Chris Williams, US bureau chief for The Register. “No one had mentioned speculative memory references up to that point. It was only when that email came out that we realized it was something really serious.”

Once it was clear this was a speculative memory problem, public research papers could fill in the rest of the picture. For years, security researchers had looked for ways to crack the kernel through speculative execution, with Schwarz’s team from Graz publishing a public mitigation paper as recently as June. Anders Fogh had published an attempt at a similar attack in July, although he’d ultimately come away with a negative result. Just two days after the AMD email, a researcher who goes by “brainsmoke” presented related work at the Chaos Communication Congress in Leipzig, Germany. None of those resulted in an exploitable bug, but they made it clear what an exploitable bug would look like, and it looked very, very bad.

(Fogh said it was clear from the beginning that any workable bug would be disastrous. “When you start looking into something like this, you know already that it’s really bad if you succeed,” he told me. After the Meltdown and Spectre releases and the ensuing chaos, Fogh has decided not to publish any of his further research on the topic.)

In the week that followed, rumors of the bug started to filter downstream through Twitter, listserves, and message boards. A casual benchmark shared on the PostgreSQL listserve found a 17 percent decline in performance — a terrifying number for anyone waiting to patch. Other researchers wrote informal posts rounding up what they knew, careful to present everything they knew as just a rumor. “[This post] mostly represents guesswork until such times as the embargo is lifted,” one recap wrote. “Many fireworks and much drama is likely when that day arrives.”

By New Year’s Day, the rumors had become impossible to ignore. Williams decided it was time to write something. On January 2nd, The Register published its piece on what they called an “Intel processor design flaw.” The piece laid out what had happened on the Linux listserve, the ominous AMD email, and all the early research. “It appears, from what AMD software engineer Tom Lendacky was suggesting above, that Intel’s CPUs speculatively execute code potentially without performing security checks,” the piece read. “That would allow ring-3-level user code to read ring-0-level kernel data. And that is not good.”

Publishing the piece would prove to be a controversial decision. Everyone in the industry assumed there was an embargo to give companies time to patch. Spreading the news early cut into that time, giving criminals more of a chance to exploit the vulnerabilities before patches were in place. But Williams maintains that by the time The Register published, the secret was already out. “I thought we had to give people a heads up that, when the patches come out, these are patches you should really install,” Williams says. “If you’re smart enough to exploit this bug, you probably could have worked it out without us.”

In fact, the embargo would only hold for one more day. The official release had been planned for January 9th, in line with Microsoft’s Patch Tuesday cycle and square in the middle of the Consumer Electronics Show, which might dampen the bad news. But the combination of wild rumors and available research made the news impossible to contain. Reporters flooded researchers’ inboxes, and anyone involved had to do their best to keep quiet as it seemed less and less likely that the secret would keep for another week.

The tipping point was brainsmoke himself. One of the few kernel researchers who wasn’t subject to the developer embargo, brainsmoke took the rumors as a roadmap and set out to find the bug. The morning after The Register’s story, he found it, tweeting out a screenshot of his terminal as proof of concept. “No page faults required,” he wrote in a follow-up tweet. “Massaging everything in/out-of the right cache seems to be the crux”

Once researchers saw that tweet, the jig was up. The Graz team was determined not to spill the beans before Google or Intel, but after the public proof of concept spread, word came from Google that the embargo would lift that day, January 3rd, at 2PM PT. At zero hour, the full research went live at two branded websites, complete with pre-arranged logos for each bug. Reports flooded in from ZDNet, Wired, and The New York Times, often with information that had been gathered only hours before. After more than seven months of planning, the secret was finally out.


It’s still hard to know how much that early breakdown cost. Patches are still being deployed, and benchmarks are still tallying up the ultimate damage from the fixes. Would things have gone more smoothly with an extra week to prepare? Or would it have only delayed the inevitable?

There are plenty of formal documents telling you how a vulnerability announcement like this should happen, whether from the International Organization for Standardization (ISO), the US Department of Commerce, or CERT itself, although they offer few hard answers for a case as sprawling as this one. Experts have been struggling with these questions for years, and the most experienced have given up looking for a perfect answer.

Katie Moussouris helped write Microsoft’s playbook for these events, along with the ISO standards and countless other guides through the multi-party disclosure mess. When I asked her to rate this week’s response, she was kinder than I expected.

“This is probably the best that could have been done,” Moussouris told me. “The ISO standards will tell you what to consider, but they won’t tell you what to do in the heat of that moment. It’s like reading the instructions and running a couple of fire drills. It’s good to have a plan, but when your building is on fire, the way you act will not be according to plan.”

The stranger thought is that, as technology becomes more centralized and interconnected, this kind of five-alarm fire may be harder to avoid. As software libraries like OpenSSL spread, they raise the risk of a massively multi-party bug like Heartbleed, the internet version of a monocrop blight. This week showed the same effect in hardware. Speculative execution became an industry standard before we had time to secure it. With most of the web running on the same chips and the same cloud services, that risk multiplies even further. When a vulnerability finally surfaced, the result was an almost impossible disclosure task.

As messy as it is, that scramble has become hard to avoid whenever a core technology breaks. “In the ‘90s we used to think one-vulnerability, one-vendor, and that was the majority of the vulnerabilities you saw. Now, almost everything has some multi-party coordination element,” says Moussouris. “This is just what multi-party disclosure looks like.”

Comments

This is a fabulous article. Well done.

Great article!

Speculative execution became an industry standard before we had time to secure it.

Like about every technology released ever. The pressure to be first, or not to be last, to market causes this. Meltdown/Spectre is bad but have we already forgotten the IoT botnets? A billion devices connected to the net before anyone thought to require changing their default admin passwords?

And kudos to the author for one of the most cogent technical articles I’ve read in the past year!

I’d argue they were different. The IoT botnets were gross negligence. They knew what they were doing.

And that same pressure applies to AI development, which is quite a scary thought.

This article provides the perspective that was missing from Tom Warren’s previous accusatory article regarding this subject.

Tom’s article was based on the limited information available at the time; Russell has had the benefit of time and detailed research. There’s a need for both – websites can’t ignore the existence of a game-changing bug in the wild that the mainstream media are reporting on.

I thought this article was just going to acknowledge that NDAs exist. I read it, and it was very informative and well written. Thanks!

If anything, the current fallout suggests that 6 months is probably not even enough. The public expects the same response every time, even though the level of effort required to address each issue can vary greatly. Meltdown/Spectre is squarely in the near-worst-case scenario, "holy uck everything needs to be changed" level of fix.

I’m glad Google Project Zero is wise enough to realize not all bugs are the same, and should not be treated the same way.

In those other companies’ defense, even Google themselves were not completely ready when the news broke.
So unless Google was also slacking, that 90 day policy would have bit them in the ass also.

Heck, even AMD was denying their processors had a flaw 7 months after they were notified by Google that they do.

AMD have denied they are vulnerable to Meltdown, and nothing has come out to prove them wrong. They have said they are vulnerable to variant 1 of Spectre, and less vulnerable to variant 2 (near zero but still not zero). Please read: https://www.amd.com/en/corporate/speculative-execution

Nothing they’ve said seems inaccurate. What seems to be inaccurate is reporting on it, especially from tech sites/blogs that are owned by Purch (which is basically Intel’s media arm). Even the most technically accurate analyses from these sites seem to time or frame their comments to do the least damage to Intel and involve AMD’s name in the matter.

The patch being discussed on lkml in that thread was specifically the Meltdown patch, KPTI (or KAISER as it was called earlier). AMD’s architecture is and remains immune to the exploit being targeted by that patch, by design. The AMD engineer was totally accurate, and no further developments have contradicted what he said.

Maybe you should go back and re-read everything. The fact that you misread this basic piece of chronology should tell you how biased other publications are towards Intel.

Why should the engineer have said there are three variants (breaking embargo) when the patch they were discussing dealt specifically with Meltdown? Not very logical no? He was not addressing the public, he was addressing the linux kernel development team. Everyone he was talking to would have known about the issue in great detail.

Intel got the most blowback because they deserved the most blowback. AMD and most ARM designs are not vulnerable to Meltdown, the most serious of the three. AMD is only provably vulnerable to one variant, while it is patching for the second just to be cautious. Why should all CPU makers suffer the exact same consequences, when their mistakes aren’t the same? And they would have gotten much more if it weren’t for the savvy diversion of their Purch media arm, several of whom quoted Intel’s media releases verbatim immediately, while only telling AMD’s side of the story days after, when everything had calmed down.

The timeline you linked is a classic case of the tech media, either through incompetence or malice, taking Intel’s side. AMD gave a clear security update on the 3rd of January, yet TheVerge doesn’t talk about it till more than a week later. No headlines mention AMD until there is ammunition to attack it. And instead of quoting AMD verbatim like it does Intel’s media release, it uses selective quotes calculated at making AMD look as bad as possible. AMD never said that all the vulnerabilities were a "near zero risk", they specifically only said that of Spectre Variant 2. But that is not the impression an uneducated reader would get from reading TheVerge’s hit piece, so mission accomplished.

AMD was aware of all the bugs before their engineer talked on that Linux patch discussion…

The engineer just said that the patch shouldn’t include AMD processors because that patch is only there to mitigate Meltdown (and AMD is immune to that). Because yeah, if that patch was deployed as it is, it would have impacted AMD for nothing. So he gave just enough information to explain that to the kernel developers.

That’s all. AMD was still on embargo also, why would the guy talk about Spectre which was still on a non-disclosed note and who had no connection at all with that patch? Except breaking the secret it was useless.

I want an insecurity to go public once it no longer is an insecurity.

Then it will never go public.

I dunno. The article talks about how Linux, Microsoft etc were patching over Nov and December. MS released fixes this week that likely would have arrived in time for the intended embargo also. Likewise with Apple.

I think it’s interesting Intel themselves seem to be a bit behind schedule though.

I think this shows Google’s hypocrisy when it comes to their 90-day policy. Google has no problem revealing Microsoft’s security flaw when it makes Microsoft look bad. But when there’s a potential to slow down their own server, they wait and wait and wait and wait.

Yes, let’s just pretend Android doesn’t exist.
