CPU Vulnerability Security with Matt Linton and Paul Turner

Bringing you a special second episode this week, with Matt Linton and Paul Turner sharing insights with Mark and Melanie about the CPU vulnerabilities, Spectre & Meltdown, and how Google coordinated and managed security with the broader community. We talked about how there has been minimal to no performance impact for GCP users, and how GCP's Live Migration helped deploy patches and mitigations without requiring maintenance downtime.

Due to the special nature of this episode, there are no cool things or question of the week included on this podcast.

About Matt Linton

Matt is an Incident Manager (aka Chaos Specialist) for Google, which means his team is on call to handle suspected security incidents and other major urgent issues.

About Paul Turner

Paul is a Software Engineer specializing in operating systems, concurrency, and performance.

Interview
  • Protecting our Google Cloud customers from new vulnerabilities without impacting performance blog
  • What Google Cloud, G Suite and Chrome customers need to know about the CPU vulnerability blog
  • Google Security Blog, Today’s CPU vulnerability: what you need to know blog
  • Project Zero News and Updates by Jann Horn blog
  • Spectre Attack paper
  • Meltdown Paper paper
  • Intel Security Center site
  • Intel Analysis of Speculative Side Channels site
  • An Update on AMD Processor Security site
  • ARM Processor Security Update site
  • GCP Compute Engine Live Migration docs
  • GCP Security Overview site

Patch your operating systems and all the things. Keep updated.

MARK: Hi, and welcome to a very special episode of the Google Cloud Platform Podcast. I am Mark Mandel, and I am here with my colleague, Melanie Warrick.

MELANIE: Hey, Mark.

MARK: Hello, Melanie. How you doing?

MELANIE: It's like our after school special.

MARK: It is our after school special, yes. You convinced me to do a midweek episode.

MELANIE: I did. There was some news that broke in the last couple of weeks around the CPU vulnerabilities, known as Meltdown and Spectre. And so we were able to get in touch with the security guys on our end, who provided some great content around the blog posts that are out there and how this has impacted GCP. So we got them on the podcast so that they could talk about the blogs and give us some additional insight, and we're going to jump straight into that for everybody. We're going to skip our Cool Things of the Week, and we're going to skip the question at the end, because we are focused on making sure that we get this information out for everybody to have access to.

MARK: Excellent. Let's go talk to Matt and Paul.

MELANIE: All right, on this week's podcast, we have some of our security experts joining us. We have Matt Linton, who is our chaos specialist, and we have Paul Turner, who is a software engineer. And we are going to talk about the CPU vulnerabilities that were discussed in the last couple of weeks, and the good content that has come out around how to protect yourself. So before we dive into that, can you guys tell us a little more about who you are and what you do?

PAUL: My name is Paul, and I work on the operating system [INAUDIBLE] team here. I was involved from a fairly early point in resolving this vulnerability, trying to find mitigations, ways to address the problem, and how to do it in a way that really had minimal impact.

MATT: And my name's Matt. I'm an incident manager on Google's security and privacy team. My team is the 24-hour escalation point for suspected security issues and other things that urgently need security engineers' attention.

MELANIE: Great, thank you. So I know you both co-wrote a blog post that was helping to outline what these vulnerabilities are, but can you tell us a little bit more about them.

MATT: Sure, so essentially, the core of the vulnerability is something that's been latent in CPUs for a long time, probably close to 20 years, and it exists in sort of an efficiency trick that CPUs have been pulling to get faster and faster. What happens is, when you make a request of the CPU and there are a couple of different potential answers that may result, sometimes the CPU will start executing all of the instructions and preparing the results, so that when you ask it for the specific answer you wanted, it's already got the answer ready to give to you instead of having to wait for you to instruct it. And it turns out that there are ways that someone can take advantage of this by tricking the CPU into executing branches against areas of memory that would not ordinarily be accessible to that person, and then retrieving the results from what are called the speculated branches.

PAUL: Yeah, I just wanted to add that I thought the way that Matt describes this as tricking the CPU is a really good way to think about it. Generally speaking, the CPU is trying to do what it thinks should happen next in your program, but these attacks discovered new ways of tricking it into thinking that something should happen, which wouldn't normally happen during correct execution. And the way that occurs allows some protection demands to be bypassed.
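To make the trick Matt and Paul are describing concrete, here is a minimal C sketch of the variant 1 ("bounds check bypass") pattern, modeled on the example in the Spectre paper linked in the show notes. The names and sizes are illustrative only; this is the vulnerable shape of the code, not a working exploit:

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];          /* in-bounds data; x indexes into this         */
uint8_t array2[256 * 4096];  /* probe array used as a cache side channel    */
size_t  array1_size = 16;
uint8_t temp;                /* global sink so the load isn't optimized out */

void victim_function(size_t x)
{
    /* The CPU may predict this branch as "taken" and speculatively run the
     * body even when x is out of bounds. The out-of-bounds byte array1[x]
     * then selects which cache line of array2 gets loaded, leaving a
     * footprint an attacker can recover later by timing reads of array2. */
    if (x < array1_size) {
        temp &= array2[array1[x] * 4096];
    }
}
```

The results of the mispredicted branch are thrown away architecturally, but the cache state it leaves behind is not, and that cache state is the side channel.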

MELANIE: So what are things that are vulnerable to this type of attack?

MATT: Well, unlike a lot of really dangerous attacks that we've seen before, the good news first is that this vulnerability is a read-only vulnerability. The core danger here is that someone can trick the CPU into giving them access to areas of system memory. Now, that's very important, because we have a lot of secrets in system memory that you want to keep secret. You have passwords and encryption keys and pretty much--

PAUL: Credit card numbers.

MATT: Credit card numbers. Any kind of data that you would normally be processing and care about will touch memory. But unlike previous vulnerabilities, such as Heartbleed, an attacker can't just reach out, touch you, and you're done. There already has to be a way for the attacker to get onto the system in order to exploit this vulnerability to steal secrets. Now, I don't want to minimize that, because this is the security industry, and every good pentester knows there are hundreds of ways to get onto a system, but it's not once and you're dead, like Heartbleed was.

MELANIE: Well, and I read that this is impacting personal computers, mobile phones, as well as cloud services. Is that accurate?

PAUL: Yes, there are three major variants of this attack that were discovered. Each of them applies to various subcases of how we use computers today. Variant one is particularly important relative to the encapsulation that we use for things like JavaScript engines.

Variant two is particularly important with respect to how we isolate virtualized machines. You can think of cloud hosting. And variant three is also really important in that it's the most direct way, on a consumer device, to steal somebody's data if you are able to arrange for local execution.

MARK: So that's actually really interesting, because I've noticed, as a consumer, I've got information to patch stuff on my cloud provider as well as upgrade my browser, and I think I've seen stuff about upgrading the driver in my GPU. Is that because there are just a variety of ways of taking advantage of this vulnerability?

PAUL: Exactly. When you see suggestions that you should patch your operating system, many of those really are looking at the third variant, where you can attack local system memory. Whereas for some of these other variants, variant one and variant two, commonly referred to in the media as Spectre, you're seeing providers applying the protections. You're seeing browsers adding protections for variant one. You're seeing cloud providers adding protections for variants one and two. Like always in security, protection means fixing everything, and some of those things are intended to be employed by consumers.

MELANIE: Then actually, when you mentioned Spectre, I wanted to ask you-- we have heard of the terms Meltdown and Spectre, but I also know the blog post that you guys provided doesn't necessarily mention those explicitly. Matt, you wanted to mention a little bit about that?

MATT: Yeah, so there's been a decent amount of confusion out there over Spectre and Meltdown and all of the other variants. The background behind this is that there was a little bit of a co-discovery going on with this vulnerability. Jann Horn from Google's Project Zero discovered the vulnerability and kind of teased it out into three distinct variants that we call variant one, two, and three. A few months later, another research team, who was also looking into the same sort of way to trick CPUs, discovered the vulnerabilities as well, and they chose to give them names.

Naming a vulnerability is something that, in the security industry, is kind of a matter of preference. If you like to name and logo your vulnerabilities, you'll do it, and if you find it to be uncouth, you don't. So Project Zero chose not to give names to the vulnerabilities, but the other researchers did. And so they named them Spectre, for variants one and two, and Meltdown, for variant three.

And the reason we ended up with two names that cover three vulnerabilities is that the team that named Spectre hadn't discovered all of the things that make variants one and two unique from one another. And so they thought that they had two bugs, but they really had three. So we ended up out here with this naming confusion, where Spectre covers variant one, which is most exploitable in JITs and browsers and things, and variant two, which is most exploitable as a hypervisor bypass. And then there's the whole separate variant three that they call Meltdown, which is most exploitable in terms of interacting with the operating system itself.

PAUL: And the thing that's important to know there is, where Matt said variants one and two apply to the browser and the hypervisor, and variant three applies to the host operating system-- for all consumers out there, the patches for Linux and Windows are already being included in those operating systems. So as long as they upgrade to use the latest versions, they will be covered.

MELANIE: And that means MacBooks as well as those who use the Linux operating system?

PAUL: That's correct. Apple has also published patches for this one early. I should have included them in the previous list.

MELANIE: No worries.

MARK: So this is interesting, too. You said that the vulnerability is at a CPU level, but a lot of these patches are coming at what is essentially a software level. How do these fixes work, then? How does that help people?

PAUL: The fundamental thing that all the fixes are trying to do is arrange for the execution to occur in a way so that the tricks that are being played no longer apply. In the case of Meltdown, for example, you are tricking the CPU into looking up memory that it shouldn't normally be able to access, and the way that was mitigated was by, instead of using the regular access protections on that memory, completely unmapping it in those contexts where it should not be accessed. This means that, even if speculative execution were to try to access it, it doesn't actually know how. Variants one and two have similar kinds of approaches, where we're trying to find ways to restructure the program control flow so that these ways you can influence or bias the CPU when it's guessing what to do next are constrained or eliminated.
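To sketch what "restructuring the control flow" can look like for variant 1 (these are widely published mitigation patterns, not necessarily the exact changes Google deployed), here are two common approaches in C, applied to the hypothetical victim_function above: an explicit speculation barrier, and branchless index masking in the spirit of the Linux kernel's array_index_nospec() helper:

```c
#include <stddef.h>
#include <stdint.h>

extern uint8_t array1[16];
extern uint8_t array2[256 * 4096];
extern size_t  array1_size;
extern uint8_t temp;

/* Approach 1: a serializing instruction (x86-specific). LFENCE keeps
 * speculative execution from racing ahead of the bounds check. */
void victim_with_barrier(size_t x)
{
    if (x < array1_size) {
        __asm__ __volatile__("lfence" ::: "memory");
        temp &= array2[array1[x] * 4096];
    }
}

/* Approach 2: branchless index masking. The mask is all-ones only when
 * x is in bounds, so even a mispredicted branch cannot form an
 * out-of-bounds address. Assumes x and size are well below SIZE_MAX/2,
 * and an arithmetic right shift on signed values (true of the usual
 * compilers, though formally implementation-defined). */
static inline size_t index_mask(size_t x, size_t size)
{
    return (size_t)((intptr_t)(x - size) >> (sizeof(size_t) * 8 - 1));
}

void victim_with_masking(size_t x)
{
    if (x < array1_size) {
        x &= index_mask(x, array1_size);  /* x is unchanged when in bounds */
        temp &= array2[array1[x] * 4096];
    }
}
```

Variant 2 mitigations work at a different layer, for example the retpoline construct that replaces indirect branches with a call sequence speculation cannot follow, and the Meltdown mitigation Paul describes, unmapping kernel memory from user contexts, shipped in Linux as kernel page-table isolation (KPTI).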

MATT: I think there's a good opportunity here, too, to talk about how the words fix and mitigate tend to be used very intentionally by security engineers to mean specific things that other people might not really distinguish. When we say that a vulnerability is fixed, we mean that the actual root vulnerability is no longer present in the device. A lot of the software mitigations that you're seeing come out, these patches for Windows and things, they're not fixing the vulnerabilities so much as mitigating them by making it impossible for someone to take advantage of them.

And the analogy that I like to use for that is, if you have a broken window, you can fix the window by replacing the window glass. You can also put a board over it to mitigate the hole in the house. You haven't actually fixed the window yet. You've just made the hole in the house impossible for someone to come in through.

MELANIE: So in terms of the fixes and the mitigations, especially from the standpoint of the actual processors, I know at CES, the CEO of Intel, Brian Krzanich, had mentioned that they have basically patched-- or they have provided a patch for 90% of the vulnerabilities affecting their processors, or they will have by the end of the week, and that they are expecting to handle the remaining 10% by the end of January, without requiring a recall of any chips. Did you guys hear about that? Did you have any additional comments that you wanted to make on that one?

PAUL: I would extend the analogy that Matt just made. It's clear that, for the current generation of CPUs, we will only be able to develop mitigations and not fixes. So we're going to be living with boards on the windows for a long time. We do expect that, in future generations, these problems will be fixed at a hardware level. However, we are also optimistic that, for all current CPUs, the mitigations can be deployed so that we do not need to replace or update the hardware.

MELANIE: So what are some of the things we've done for GCP and G Suite, and also for the other internal Google products and services that are provided?

PAUL: There, we've deployed, and in many cases developed, the mitigations that are being used across the industry. We have to talk about them on a case-by-case basis, because there are three variants, and they each have their own mitigations or fixes. However, in all cases, we have deployed fixes and mitigations that we believe cover all known classes of the current vulnerabilities and ensure that customer data remains protected.

MATT: Yeah, it's difficult to talk about this distinctly, in cloud especially, because this vulnerability affects so many things, and cloud by nature affects people at different levels. You have software as a service, where somebody's buying the software and access to it from us, like Drive, Docs, Gmail. We're providing everything from the software on down, and it's very easy for us to say, having mitigated our infrastructure that runs those products, and having checked the software for vulnerabilities, we're confident we've protected people there.

But then there are things like GCE, which is infrastructure as a service. People are paying us to run everything from the hardware up for them and manage scalability and networking and all those things, but they also want to run their own operating system inside a container. It's more complicated there, because we've protected all of the things that they rely on us to protect for them, but you can't simply say everything's fixed, because if a customer runs their own operating system, and they're worried about attacks occurring from within their OS against their own data stored in it, they also have a little bit of work to do patching for these variants as well.

PAUL: A good example of this would be a multitenant customer, where someone might be doing re-hosting: they're purchasing hosting from a cloud provider such as Google or Amazon or Microsoft, and then they're internally releasing that space to their own users. Against the attacks that Matt just described, they are, in many ways, a hosting provider themselves. And so the infrastructure that they've contributed, which runs on top of the underlying hosting providers such as Google, Microsoft, and Amazon, must also be protected, and that's something that they must do in their own development and deployment.

MARK: I want to go back a little bit. We were talking about mitigations and fixes, and that's actually a really interesting turn of phrase. And maybe to extend the metaphor, if I have a boarded-up window, it's nowhere near as pretty as, say, a fixed window, but it sort of works the same, even if I don't have as nice a house. Since we're talking about mitigations for this vulnerability, are there any side effects or consequences of these mitigations that may end up out in the wild, that may be necessary, but that people should look out for?

PAUL: That's a really good question. Obviously, as you say, a boarded-up window is not something that's going to add value to a property, so we've been very careful in how we developed these mitigations, to try and make sure that their effects were not observable. It's definitely true that, in early versions, when we were prototyping and developing internally, we were seeing overhead from many of these fixes or mitigations. But as we evolved and found newer and better ways to add those protections, we really were able to minimize the overhead that we saw. In many cases, we actually see no perceptible overhead for both our own and customer workloads, while still providing these protections.

MATT: Yeah, we boarded up the window with clear plexiglass that's mostly indistinguishable from the old window.

MELANIE: And in terms of GCP, in regards to what was being updated and boarded up, was there any maintenance downtime?

PAUL: There was not.

MELANIE: You've described the guest and host level mitigations that have been taking place. What do you need to do to get those host level mitigations applied to your GCP instances?

PAUL: So when you say host here, what you're really talking about is the Google level mitigations. When we say host, we mean the infrastructure that's running the virtual machines we provide to our users. There, we're actually very fortunate in that we have a technology we call Live Migration, which allows us to move a running guest between host instances without requiring a reboot or otherwise interfering with the guest. We were able to use this to transparently deploy all of our variant one and variant two mitigations for our infrastructure, with no customer reported downtime.

MARK: But this does mean-- I just want to hit this point-- people do need to upgrade their own operating systems if they're managing them themselves on something like GCE or GKE or something like that?

PAUL: Definitely. As a matter of best practice, users should be patching their own operating systems. The vulnerability that's most important there is variant three, also known as Meltdown. While that attack will not allow information to escape between the hypervisor or our infrastructure and their operating system, it is possible that local attacks from within their VMs on their own data may succeed without those patches. This is really a matter of best practice.

MELANIE: Something I keep hearing from people is: make sure you update your software. Whether it's the OS on your phone or the operating system on your own laptop, you always want to make sure, when you see new patches come through, to keep updating. Do you guys have any additional recommendations for best practices around security?

MATT: Yeah, I would say that anytime a complex vulnerability gets discussed, a lot of people turn to wondering, well, what should I do about it? And in security, we kind of have consistent messaging out that most of the time, the answer is keep yourself updated with software patches from the vendor, and make sure that you're using two factor authentication and other protective layers on anything you care deeply about. And the reason we always come back to hammering on that is, most of the time, those are the two most effective things you can do to protect yourself not just against this vulnerability, but in general against vulnerabilities that you know about, vulnerabilities that you don't know about, tomorrow's big discovery that may come out. It's all going to end up boiling down to take care of the things you're running. Keep them updated, and keep your authentication safe.

PAUL: It really does come down to best practice. The guidelines that we're proposing here are uniform within the security industry, with the idea, as Matt said, that they leave you in the best position for unknown vulnerabilities that you may not know about. And because often, when a vulnerability like this is discovered, you have to be able to update your software quickly, and if you have not been keeping up to date with other patches or security improvements in the infrastructure, you could have a large technical debt, which you might have to mitigate in a very quick fashion when something like this comes out. So those practices are both preventative and in making sure that you can be as agile as possible in responding to these issues when they arise.
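As a practical footnote for Linux users: kernels from 4.15 on report mitigation status through sysfs, under /sys/devices/system/cpu/vulnerabilities. Here is a small C sketch that reads that interface (the sysfs path is a real kernel feature; the program itself is just a convenience):

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *dir_path = "/sys/devices/system/cpu/vulnerabilities";
    DIR *dir = opendir(dir_path);
    if (!dir) {
        /* Older kernels don't expose this directory at all. */
        printf("%s not found; kernel predates vulnerability reporting\n",
               dir_path);
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;

        char path[512];
        char status[256] = "(unreadable)";
        snprintf(path, sizeof(path), "%s/%s", dir_path, entry->d_name);

        FILE *f = fopen(path, "r");
        if (f) {
            if (fgets(status, sizeof(status), f))
                status[strcspn(status, "\n")] = '\0';
            fclose(f);
        }
        /* Typical output lines look like:
         *   meltdown     Mitigation: PTI
         *   spectre_v2   Vulnerable
         */
        printf("%-12s %s\n", entry->d_name, status);
    }
    closedir(dir);
    return 0;
}
```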

MARK: So you're not saying that there should be any change in the way that people are handling their security best practices? Really, it's the tried and true methods stay the same way they are?

PAUL: Yes, generally speaking, variant three, or Meltdown, is not externally exploitable. So users are only really vulnerable from their own software because we've already patched our exposure to variant one and two, and their exposure to variant one and two from our infrastructure. But again, we really do believe this is a best practice. We strongly encourage all of our users to apply these fixes just as they would any other security fix to ensure that, as new ways to attack both this and other parts of the system are discovered, they can continue to be protected as quickly as possible.

MELANIE: Can you guys comment a little bit, too, on the fact-- I know in the blog post, and with what Project Zero released, they were saying how this was identified, and then there were coordinated efforts to do the fixing and the patching before it was going to be announced publicly. Would you be able to comment a little bit more about that process and what that looks like, in terms of coordination, for security?

MATT: To an extent. This is one of those things that is really difficult and always gets handled on a case-by-case basis. You have a giant bug. You have a lot of work you need to do to fix it. Other people you know also have a lot of work that needs to happen to fix it. And then you end up with interdependencies.

If Google were to fix this bug all on our own without contacting anyone else, then other operating system vendors like Apple and Microsoft wouldn't be protected, and our users would be left at risk. So for something of this scale, there are always decisions that need to be made around what other companies need to know. Do they need to know to protect their users? Do they need to know so we can protect our users, so we can protect ourselves? And sort of an ad hoc group ends up forming of people who are affected, who believe they have something to contribute to the fix, and who are all giving and receiving advice and technical data from one another, to try to put all the users on the internet in as good a footing as possible when we're ready to talk about the issue, without creating such a large group that it becomes impossible to keep the issue contained.

PAUL: And this really was an industry wide effort. Our own attempts at mitigation began at the same time as other industry players. We weren't trying to provide a headstart for ourselves or any other advantage. We really saw this as an industry wide problem that everybody would have to try to contribute, both in terms of finding ways to mitigate these complicated and difficult vulnerabilities, and in making sure that everyone would have the chance to deploy the protection before disclosure occurred.

MATT: Yeah, there's a little bit of a thing called supply chain, which plays into this a lot too, and supply chain is a security industry term which references the fact that you inherit vulnerabilities from those you trust to provide you with things. And so I'm speaking to you right now from a MacBook Pro. I have a supply chain dependency on Apple to make sure Apple is secure if I am to do my job and keep Google secure. So in a lot of ways, these sort of multi industry partnerships, for something this size, are brought about by necessity from the fact that the supply chain forces us all to adopt at least a little bit of each other's problems.

MARK: You said something kind of interesting that I don't know if people who aren't familiar with how security research goes would pick up on-- you mentioned that there was a point at which it was going to be public, or disclosed to the outside world. Can you talk a little about how that research process goes, about finding vulnerabilities and then finally making them public, and the steps that go through that, and why those steps exist?

MATT: Sure, those steps exist as sort of a compromise that has slowly developed over time in the IT industry itself. There's a process these days often called coordinated disclosure: a security researcher finds a vulnerability, and they report it to whoever it is that actually makes the product the vulnerability is in. And then the vendor and the researcher agree upon a period of time in which the vendor gets a head start on fixing it, to actually protect people before the researcher gets to publish it. That's in contrast to a behavior called full disclosure, which is when researchers find a vulnerability and immediately disclose it to the world.

There's a lot of pros and cons, and I don't think this is a good place to go into them unless you have a couple extra hours. But I can tell you what Project Zero does: they have a standard process where they find a vulnerability, they notify the maintainer in whose product the vulnerability exists, and they give them 90 days to remediate it before Project Zero discloses it. And if the vendor is able to remediate it earlier, then they disclose it earlier, in coordination with the vendor. That's sort of a compromise that has been reached between the pros and cons that hit you when you go full disclosure and the pros and cons that hit you when you allow the vendor to completely control disclosure. So much of the security research industry right now lives in that compromise zone of coordinated disclosure with a defined time period.

MELANIE: And this got released a little earlier than expected. Is that right?

PAUL: Yeah, but relative to what Matt was saying, I think the more interesting point is that this was an example of an issue that was so pervasive and affected so many users that we actually extended that standard 90-day policy, because it was clear that all of the supply chain level mitigations that users would require would not be possible to deploy within that window. So while it's true that, ultimately, the disclosure happened a few days before we had originally intended, that's really almost after the fact, relative to the discussions of how that time was being managed.

MATT: So part of the decision-making process here, in terms of disclosure, is that there's a really difficult and constantly evolving decision to be made: is keeping this issue on the down low and giving people time to fix it better for users and better for individuals than telling the world about it? In the early stages of this, and in a lot of the mitigation phases-- where everyone had to develop patches, we had to performance test patches, we had to validate that they worked-- obviously keeping it on the down low and giving people time to fix it, so that when the world learned about it, everyone would be prepared to defend against it, was better for everyone. But at a certain point, people began to realize that there was a lot of fixing going on without a lot of talking about it, and they began to put two and two together. And very close to the end, when we were almost prepared to announce it anyway, some very clever people started putting together public examples of both the flaw and how you might use it to attack user data.

And at the point where there are public examples of someone being able to exploit an attack, that equation has flipped. It's no longer better for users and better for people to have a secret vulnerability out there that they don't know how to defend against if someone else knows how to attack it. And so that kind of forced our hand into talking about the issue a little bit earlier than we were originally planning to.

MELANIE: But you guys were pretty close to getting out there anyways with this information?

MATT: So close.

PAUL: I think the two important things to be aware of there are, first, that it really was only a matter of days before the original disclosure date. And the other thing, calling back to some of Matt's earlier comments: the things people were seeing that led them to ask these questions and start probing these areas were those supply chain level mitigations. They'd seen Linux deploying patches. They'd seen Microsoft deploying patches.

Apple deployed patches. This kind of coordinated improvement doesn't happen by accident, so that was immediately setting off people's alarm bells that something was going on. But the fact that all of those vendors had the time to develop those patches was why the disclosure was delayed. So even though people were seeing those things happen in the supply chain, when disclosure did occur, even though it was actually a few days early, there were protections available and, in many cases, deployed on all of the major computing platforms that users depended on.

MARK: Well, I think we're slightly running out of time here, but before we wrap up, is there anything that you want to make sure you mentioned? Maybe something we haven't covered, or if there's extra information that maybe people should go look at or listen to or watch, that we haven't managed to cover so far?

PAUL: Sure. I think, generally speaking, this was a really good example of a lot of people in the industry coming together to look at a really tough problem, and I expect that, in many ways, we may have opened a Pandora's box here. There are going to be new ways to attack these hardware features. People are really going to start looking at other parts of the CPU and how they can potentially be exploited in a similar fashion. And while that obviously presents a potentially rich new vein of exploits for malicious users, I'm really encouraged by the fact that, as an industry, we seem to be moving in a good direction, to be able to prevent them and protect users to the best of our ability.

MATT: Yeah, I don't think I really have too much to add on top of what Paul said. I think there's a lot of research in this area that's going to go forward into the future, and I think, like it or not, the next 10 years of the industry is still going to be a constant cycle of finding out new and interesting things that are unexpected about our software and then rolling out ways to make them behave the way we originally thought they would behave. And I think that's-- I mean if that's not an integral part of the computer industry at this point, I don't know what is.

MELANIE: Well, great. Thank you both very much for joining us and sharing this information. We appreciate it, and then that's all we got. Thank you.

PAUL: Thanks, guys.

MATT: Yeah, thank you. Have a good one.

MARK: Thanks.

MELANIE: All right, thanks, Matt and Paul. We appreciate that. And, Mark, I think that's going to be it for today's episode because we're keeping it a special episode.

MARK: Sure.

MELANIE: But we'll definitely do our questions and cool things for the next one.

MARK: Yeah, see you on Wednesday, then.

MELANIE: See you then.

Hosts

Mark Mandel and Melanie Warrick

Continue the conversation

Leave us a comment on Reddit