When Graz University of Technology researcher Michael Schwarz first reached out to Intel, he thought he was about to ruin the company’s day. Together with his colleagues Daniel Gruss, Moritz Lipp, and Stefan Mangard, he had found a problem with the company’s chips. The vulnerability was both profound and immediately exploitable. His team finished the exploit on December 3rd, a Sunday afternoon. Realizing the gravity of what they’d found, they emailed Intel immediately.
It would be nine days until Schwarz heard back. But when he finally got someone from Intel on the phone, he got a surprise: the company already knew about the CPU problems and was desperately figuring out how to fix them. Moreover, the company was doing its best to make sure no one else found out. They thanked Schwarz for his contribution, but told him what he had found was top secret, and gave him a precise day when the secret could be revealed.
The flaw Schwarz — and, he learned, many others — had discovered was potentially devastating: a design-level chip flaw that could slow down every processor in the world, with no perfect fix short of a gut redesign. It affected almost every major tech company in the world, from Amazon’s server farms to chipmakers like Intel and ARM. But Schwarz had also come up against a secondary problem: how do you keep a flaw this big a secret long enough for everyone involved to fix it?
How do you keep a flaw this big a secret long enough for everyone involved to fix it?
Disclosure is an old problem in the security world. Whenever a researcher finds a bug, the custom is to give vendors a few months to fix the problem before it goes public and bad guys have a chance to exploit it. But as those bugs affect more companies and more products, the dance becomes more complex. More people need to be told and kept in confidence as more software needs to be quietly developed and pushed out. With Meltdown and Spectre, that multi-party coordination broke down and the secret spilled out before anyone was ready.
That early breakdown had consequences. After the release, basic questions of fact became muddled, like whether AMD chips are vulnerable to Spectre attacks (they are), or whether Meltdown is specific to Intel. (ARM chips are also affected.) Antivirus systems were caught off guard, unintentionally blocking many of the crucial patches from being deployed. Other patches had to be stopped mid-deployment after crashing machines. One of the best mitigations available has been Retpoline, a tool developed by Google’s incident response team that was initially planned for release alongside the bug itself. But while the Retpoline team says they weren’t caught off guard, the code for the tool wasn’t made public until the day after the official announcement of the flaw, in part because of the haphazard break in the embargo.
The early breakdown had consequences
Perhaps most alarming, some crucial outside response groups were left out of the loop entirely. The most authoritative alert about the flaw came from Carnegie Mellon’s CERT division, which works with Homeland Security on vulnerability disclosures. But according to senior vulnerability analyst Will Dormann, CERT wasn’t aware of the issue until the Meltdown and Spectre websites went live, which led to even more chaos. The initial report recommended replacing the CPU as the only solution. For a processor design flaw, the advice was technically true, but only stoked panic as IT managers imagined prying out and replacing the central processor for every device in their care. A few days later, Dormann and his colleagues decided the advice wasn’t actionable and changed the recommendation to simply installing patches.
“I would have liked to have known,” Dormann says. “If we’d known about it earlier, we would have been able to produce a more accurate document, and people would have been more educated right off the bat, as opposed to the current state, where we’ve been testing patches and updating the document for the past week.”
“I would have liked to have known.”
Still, maybe that damage was inevitable. Even Dormann isn’t sure. “This happens to be the largest multi-party vulnerability we’ve ever been part of,” he told me. “With a vulnerability of this magnitude, there’s no way that it’s going to come out cleanly and everyone’s going to be happy.”
The first step in the Meltdown and Spectre disclosures came six months before Schwarz’s discovery, with a June 1st email from Google Project Zero’s Jann Horn. Sent to Intel, AMD, and ARM, the message laid out the flaw that would become Spectre, with a demonstrated exploit against Intel and AMD processors and troubling implications for ARM. Horn was careful to give just enough information to get the vendors’ attention. He had reached out to the three chipmakers on purpose, calling on each company to figure out its own exposure and notify any other companies that might be affected. At the same time, Horn warned them not to spread the information too far or too fast.
“Please note that so far, we have not notified other parts of Google,” Horn wrote. “When you notify other parties about this issue, please don’t share information unnecessarily.”
Figuring out who was affected would prove difficult. There were chipmakers to start, but soon it became clear that operating systems would need to be patched, which meant looping in another round of researchers. Browsers would be implicated, too, along with the massive cloud platforms run by Google, Microsoft, and Amazon, arguably the most tempting targets for the new bug. By the end, dozens of companies from every corner of the industry would be compelled to issue a patch of some kind.
“With a vulnerability of this magnitude, there’s no way that it’s going to come out cleanly.”
Project Zero’s official policy is to offer only 90 days before going public with the news, but as more companies joined, Project Zero seems to have backed down, more than doubling the patch window. As months ticked by, companies began deploying their own patches, doing their best to disguise what they were fixing. Google’s Incident Response Team was notified in July, a month after the initial warning from Project Zero. The Microsoft Insiders program sent out a quiet, early patch in November. (Intel CEO Brian Krzanich was making more controversial moves during the same period, arranging an automated stock sell-off in October to be executed on November 29th.) On December 14th, Amazon Web Services customers got a warning that a wave of reboots on January 5th might affect performance. Another Microsoft patch was compiled and deployed on New Year’s Eve, suggesting the security team was working through the night. In each case, the reasons for the change were vague, leaving users with little clue as to what was being fixed.
Still, you can’t rewrite the basic infrastructure of the internet without someone getting suspicious. The strongest clues came from Linux. Powering most of the cloud servers on the internet, Linux had to be a big part of any fix for Spectre and Meltdown. But as an open-source system, any changes had to be made in public. Every update was posted to a public Git repository, and all official communications took place on a publicly archived listserve. When kernel patches started to roll out for a mysterious “page table isolation” feature, close observers knew something was up.
The biggest hint came on December 18th, when Linus Torvalds merged a late-breaking patch that changed the way the Linux kernel interacts with x86 processors. “This, besides helping fix KASLR leaks (the pending Page Table Isolation (PTI) work), also robustifies the x86 entry code,” Torvalds explained. The most recent kernel release had come just one day earlier. Normally a patch would wait to be bundled into the next release, but for some reason, this one was too important. Why would the famously cranky Torvalds include an out-of-band update so casually, especially one that seemed likely to slow down the kernel?
You can’t rewrite the basic infrastructure of the internet without someone getting suspicious
It seemed even stranger when month-old emails turned up suggesting that the patch would be applied to old kernels retroactively. Taking stock of the rumors on December 20th, Linux veteran Jonathan Corbet said the page table issue “has all the markings of a security patch being readied under pressure from a deadline.”
Still, outside observers only knew half the story. Page Table Isolation is a way of separating kernel space from user space, so clearly the problem was some kind of leak in the kernel. But it still wasn’t clear how the kernel was breaking or how far the mysterious bug would reach.
The next break came from the chipmakers themselves. Under the new patch, Linux listed all x86-compatible chips as vulnerable, including AMD processors. Since the patch tended to slow down the processor, AMD wasn’t thrilled about being included. The day after Christmas, AMD engineer Tom Lendacky sent an email to the public Linux kernel listserve explaining exactly why AMD chips didn’t need a patch.
“The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault,” Lendacky wrote.
That might sound technical, but for anyone trying to suss out the nature of the bug, it rang out like a fire alarm. Here was an AMD engineer, who surely knew the vulnerability from the source, saying the kernel problem stemmed from something processors had been doing for nearly 20 years. If speculative references were the problem, it was everyone’s problem — and it would take much more than a kernel patch to fix.
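For readers who want the mechanics, here is a toy sketch of the idea — purely illustrative Python with invented names like `victim_speculative_read`, not a real exploit. An actual attack races the CPU’s out-of-order engine and measures real cache-line timings, which no high-level code can do; this simulation only shows why a speculative privileged read that faults can still leave a recoverable trace behind.

```python
# Toy model of a Meltdown-style leak. A real attack needs out-of-order
# execution and nanosecond cache timing; here the "cache" is just a set,
# so the side channel can be shown deterministically.

SECRET = 42          # a privileged byte the attacker may not read directly
PROBE_LINES = 256    # one probe slot per possible byte value

def victim_speculative_read(cache):
    # The CPU speculatively uses the privileged byte as an index into
    # attacker-controlled memory, warming one cache line...
    cache.add(SECRET)
    # ...then the privilege check catches up and the architectural result
    # is discarded, but the cache line stays warm.
    raise PermissionError("privileged access faulted")

def attacker():
    cache = set()  # simulated cache: which probe lines are warm
    try:
        victim_speculative_read(cache)
    except PermissionError:
        pass       # the fault is expected; the trace already exists
    # "Reload" step: in a real attack, the warm line is the fast one to read.
    for value in range(PROBE_LINES):
        if value in cache:
            return value
    return None

print(attacker())  # → 42: the secret recovered without a permitted read
```

The real proof-of-concept attacks follow the same three steps — flush, speculatively access, reload — just against the hardware cache instead of a Python set.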
“That was the trigger,” says Chris Williams, US bureau chief for The Register. “No one had mentioned speculative memory references up to that point. It was only when that email came out that we realized it was something really serious.”
“It was only when that email came out that we realized it was something really serious.”
Once it was clear this was a speculative memory problem, public research papers could fill in the rest of the picture. For years, security researchers had looked for ways to crack the kernel through speculative execution, with Schwarz’s team from Graz publishing a public mitigation paper as recently as June. Anders Fogh had published an attempt at a similar attack in July, although he’d ultimately come away with a negative result. Just two days after the AMD email, a researcher who goes by “brainsmoke” presented related work at the Chaos Computer Congress in Leipzig, Germany. None of those resulted in an exploitable bug, but they made it clear what an exploitable bug would look like — and it looked very, very bad.
(Fogh said it was clear from the beginning that any workable bug would be disastrous. “When you start looking into something like this, you know already that it’s really bad if you succeed,” he told me. After the Meltdown and Spectre releases and the ensuing chaos, Fogh has decided not to publish any of his further research on the topic.)
In the week that followed, rumors of the bug started to filter downstream through Twitter, listserves, and message boards. A casual benchmark shared on the PostgreSQL listserve found a 17 percent decline in performance — a terrifying number for anyone waiting to patch. Other researchers wrote informal posts rounding up what they knew, careful to present it all as just rumor. “[This post] mostly represents guesswork until such times as the embargo is lifted,” one recap read. “Many fireworks and much drama is likely when that day arrives.”
“Many fireworks and much drama is likely when that day arrives.”
By New Year’s Day, the rumors had become impossible to ignore. Williams decided it was time to write something. On January 2nd, The Register published its piece on what they called an “Intel processor design flaw.” The piece laid out what had happened on the Linux listserve, the ominous AMD email, and all the early research. “It appears, from what AMD software engineer Tom Lendacky was suggesting above, that Intel’s CPUs speculatively execute code potentially without performing security checks,” the piece read. “That would allow ring-3-level user code to read ring-0-level kernel data. And that is not good.”
Publishing the piece would prove to be a controversial decision. Everyone in the industry assumed there was an embargo to give companies time to patch. Spreading the news early cut into that time, giving criminals more of a chance to exploit the vulnerabilities before patches were in place. But Williams maintains that by the time The Register published, the secret was already out. “I thought we had to give people a heads up that, when the patches come out, these are patches you should really install,” Williams says. “If you’re smart enough to exploit this bug, you probably could have worked it out without us.”
In fact, the embargo would only hold for one more day. The official release had been planned for January 9th, in line with Microsoft’s patch Tuesday cycle and square in the middle of the Consumer Electronics Show, which might dampen the bad news. But the combination of wild rumors and available research made the news impossible to contain. Reporters flooded researchers’ inboxes, and anyone involved had to do their best to keep quiet as it seemed less and less likely that the secret would keep for another week.
The tipping point was brainsmoke himself. One of the few kernel researchers who wasn’t subject to the developer embargo, brainsmoke took the rumors as a roadmap and set out to find the bug. The morning after The Register’s story, he found it, tweeting out a screenshot of his terminal as proof of concept. “No page faults required,” he wrote in a follow-up tweet. “Massaging everything in/out-of the right cache seems to be the crux”
Once researchers saw that tweet, the jig was up. The Graz team was determined not to spill the beans before Google or Intel, but after the public proof of concept spread, word came from Google that the embargo would lift that day, January 3rd, at 2PM PT. At zero hour, the full research went live at two branded websites, complete with pre-arranged logos for each bug. Reports flooded in from ZDNet, Wired, and The New York Times, often with information that had been gathered only hours before. After more than seven months of planning, the secret was finally out.
It’s still hard to know how much that early breakdown cost. Patches are still being deployed, and benchmarks are still tallying up the ultimate damage from the fixes. Would things have gone more smoothly with an extra week to prepare? Or would it have only delayed the inevitable?
There are plenty of formal documents telling you how a vulnerability announcement like this should happen, whether from the International Organization for Standardization, the US Department of Commerce, or CERT itself, although they offer few hard answers for a case as sprawling as this one. Experts have been struggling with these questions for years, and the most experienced have given up looking for a perfect answer.
Katie Moussouris helped write Microsoft’s playbook for these events, along with the ISO standards and countless other guides through the multi-party disclosure mess. When I asked her to rate this week’s response, she was kinder than I expected.
“When your building is on fire, the way you act will not be according to plan.”
“This is probably the best that could have been done,” Moussouris told me. “The ISO standards will tell you what to consider, but they won’t tell you what to do in the heat of that moment. It’s like reading the instructions and running a couple of fire drills. It’s good to have a plan, but when your building is on fire, the way you act will not be according to plan.”
The more unsettling thought is that, as technology becomes more centralized and interconnected, this kind of five-alarm fire may be harder to avoid. As protocols like OpenSSL spread, they raise the risk of a massively multi-party bug like Heartbleed, the internet version of a monocrop blight. This week showed the same effect in hardware: speculative execution became an industry standard before we had time to secure it. With most of the web running on the same chips and the same cloud services, that risk multiplies even further. When a vulnerability finally surfaced, the result was an almost impossible disclosure task.
As messy as it is, that scramble has become hard to avoid whenever a core technology breaks. “In the ‘90s, we used to think one vulnerability, one vendor, and that was the majority of the vulnerabilities you saw. Now, almost everything has some multi-party coordination element,” says Moussouris. “This is just what multi-party disclosure looks like.”