Thoughts about Meltdown and Spectre fallout and dealing with it


I’m pissed. Quite angry.

While I see a lot of users saying “wow, this Spectre attack is genius. And it’s been there for 20 years, hidden in plain sight!”, there’s a minority of people saying “I f–king told you”. I belong to the latter group.

I was under the impression that branch predictors and speculative execution were designed to protect against these types of attacks. How? I thought implementations would roll back the caches, or keep a hidden cache used during speculative execution that is later copied to the real cache, or something like that. Of course, I also thought such approaches weren’t easy: keeping a hidden cache, for example, means you now have two authoritative places that say which value is the correct one, which becomes a synchronization nightmare in a multicore environment. And the final copy, once a prediction turns out to be correct, would take time. It’s not free. But I just assumed HW engineers had figured this stuff out. They hadn’t. Oh boy, I was so naive.

Side channels exploiting effects of branch predictors and cache effects are not new. The idea of control flow side channel attacks [1] is not new, cache timing attacks [2] are also not new [3], and neither are branch prediction side channel attacks [4]. Put them together and voilà, Spectre.

Cache Missing for Fun and Profit [5], from 2005, is also a good read on a side channel attack, this time specifically targeting Hyper-Threading.
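For anyone who hasn’t read those papers, the primitive they all build on is embarrassingly simple: time a memory load to tell whether the line was already in the cache. A rough FLUSH+RELOAD-style sketch (illustrative only; the threshold has to be calibrated per machine, and this is nowhere near a working exploit):

/* Cache timing probe: flush a line, let other code run, then time a load.
   A fast load means somebody touched that cache line in the meantime. */
#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_lfence, __rdtscp */

static uint64_t time_load(const volatile uint8_t *addr)
{
    unsigned aux;
    _mm_lfence();                        /* keep earlier work out of the measurement */
    uint64_t t0 = __rdtscp(&aux);
    (void)*addr;                         /* the load being timed */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

/* Usage sketch (CACHE_HIT_THRESHOLD is a made-up, machine-specific constant):
     _mm_clflush(probe);                         // evict the line
     ...run the code being spied on...
     int hit = time_load(probe) < CACHE_HIT_THRESHOLD;
*/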

If you were like me, aware of all of these things separately and working from a vague, guess- and rumour-based mental model of how CPUs worked, it didn’t take much brain power to conclude that speculative execution had to be resilient against timing attacks. I was wrong and naive.

In fact, I infer the problem had to be at least partially known to AMD, because of Meltdown: Meltdown is a subtype of Spectre. It uses speculative execution and a cache timing attack to access an illegal area of memory. AMD chips are smart enough to either serialize the memory load and the access check, or roll back the value loaded into the cache if the access check fails. Either way, that renders them invulnerable to Meltdown. You wouldn’t do that unless you were aware that you can ‘read’ from the cache even if you don’t have direct access to it, because not doing it gives you higher performance.

This tells me that at least someone in AMD’s HW department did take cache timing attacks during speculative execution into account in their chip design. I wasn’t so naive after all! Kudos to them!

Spectre is just a broader type of attack, and the variant that works on all HW is the one targeting legal memory accesses (for example, a JavaScript webpage reading all of the browser’s process memory).

 

I am sure of two things:

  1. Had Intel designed against Meltdown like AMD did, this issue wouldn’t be anywhere near as huge as it is right now. Spectre is just more of the same we’ve seen over the decades. Probably the biggest fuss would’ve been web browsers taking some countermeasures against it, and VMs taking some of the hit.
  2. Had AMD been providing decent competition against Intel’s offerings (i.e. had they developed Ryzen a lot sooner), this issue also wouldn’t be that big of a deal. Right now most datacenters and server farms are running Intel chips. It wouldn’t have been the same if only half of these servers were affected. Who would’ve guessed: having competition is healthy.

 

Dealing with the fallout

OK, enough pointing fingers, blaming, and saying “I told you”. The reality is that something like 70% or more of our infrastructure runs on Intel chips vulnerable to Meltdown (and 100% of it is vulnerable to Spectre), and it had to take a big performance hit to work around it.

Some took it fine. Some are being hit hard.

So how do we deal with it?

Reevaluating API decisions

One of the areas most affected by the Meltdown fixes has been IO. IO requires syscalls, thus IO-heavy applications (particularly those dealing with lots of small files, and database operations) are paying a big toll.

Something that has struck me as odd is that IO APIs haven’t changed in decades. We’ve been relying a lot on speculative execution and branch prediction to make them fast, ignoring the most obvious flaw: these APIs are inefficient.

Let’s take git for example. Git needs to stat every file to check for changes. Right now we have to call stat() on every single file. Wait. What? Why? If I’ve got 1000 files, we’ll perform 2000 kernel<->user transitions (one entry and one exit per syscall). Why can’t we make just 2?

I come from a graphics API background. Statting multiple files at once should be obvious:

int stat_n(const char **pathnames, struct stat **statbufs, size_t num_files);
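To be clear, nothing like this exists in POSIX today; the name and signature above are made up. A user-space emulation would just loop over stat(), which saves nothing, but it shows the contract a kernel could implement behind a single user<->kernel round trip:

/* Sketch of a user-space fallback for the hypothetical stat_n().
   A real implementation would live in the kernel and cross the
   user<->kernel boundary once for the entire batch. */
#include <stddef.h>
#include <sys/stat.h>

int stat_n(const char **pathnames, struct stat **statbufs, size_t num_files)
{
    for (size_t i = 0; i < num_files; ++i)
        if (stat(pathnames[i], statbufs[i]) != 0)
            return -1;   /* a real API would likely report per-file errors */
    return 0;
}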

The same goes for opening and closing file handles. Why does readdir work on 1 folder at a time?

There’s barely any notion of batching IO commands, other than whatever buffering the lower-level implementation can do on its own.

Batching is a very common pattern in DX and GL APIs to reduce the number of API calls (and kernel transitions where applicable). It sounds like IO API writers should read the Batch! Batch! Batch! presentation.

Git isn’t the only example. Compilation is another contender.

Compilers could, for example, benefit from a server process that keeps .cpp files resident in memory while an IDE / text editor is working on them, accessible via shared memory. When a file changes, the server updates the RAM cache. When compilation is triggered, the compiler reads the .cpp file from shared memory. No need to perform IO calls at all during compilation.
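A minimal sketch of that idea using POSIX shared memory (the segment name and the publish/consume protocol here are made up for illustration; a real build server would need versioning and synchronization):

/* Build-server side: publish the latest contents of a source file into a
   shared memory segment, so compiler processes can mmap it and never issue
   a read() syscall for it during compilation. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int publish_source(const char *shm_name,      /* e.g. "/buildsrv.Foo.cpp" (made-up name) */
                   const char *contents, size_t len)
{
    int fd = shm_open(shm_name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { close(fd); return -1; }
    memcpy(mem, contents, len);   /* the editor/server refreshes this on every change */
    munmap(mem, len);
    close(fd);
    return 0;   /* the compiler side does shm_open(shm_name, O_RDONLY, 0) + mmap(PROT_READ) */
}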

Compilers can generate a precompiled header to speed up parsing of header files, but rely on the OS to cache this file in RAM every time it gets opened by a new compiler instance. Why do we do this? Because we assumed it was good enough.

Now, OS engineers should be asking database developers (e.g. MySQL, Postgres, SQLite, Mongo, etc.) what API changes they could use that would take advantage of batching (while ensuring atomicity). I’m not an expert in that field.

What graphics development has taught me is that you can go to great lengths to work with the tools you are given. But if we have the chance to change the tools, we can go much further. For decades, database implementations have been given only a hammer, so naturally everything looks like a nail.

Asynchronous file operations are supported on both Linux and Windows, but there is no standard, unified interface either.
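As a reminder of what that fragmentation looks like, this is roughly the POSIX flavour (Linux additionally has its own io_submit interface, and Windows has OVERLAPPED/IOCP, all mutually incompatible). A minimal, blocking-for-simplicity sketch:

/* Sketch: queue one asynchronous read with POSIX AIO, then (for brevity)
   spin until it completes. Real code would do useful work in the meantime. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

ssize_t read_async(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { close(fd); return -1; }
    while (aio_error(&cb) == EINPROGRESS)
        ;                                  /* do other work here instead of spinning */
    ssize_t got = aio_return(&cb);
    close(fd);
    return got;
}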

Filesystems lack “block-based operations”. For example, a compiler will produce thousands of files, with output like the following:

  1. FileA.obj (2 MBs)
  2. FileB.obj (360kb)
  3. FileC.obj (1.2MB)
  4. FileD.obj (600kb)
  5. FileE.obj (120kb)
  6. FileF.obj (59kb)
  7. etc..

Which have the following characteristics:

  • It is hard to know their exact size beforehand. However, it is possible to predict a fairly accurate approximate file size based on past compilation iterations
  • The order in which they’re generated is known beforehand or can be controlled
  • They get fragmented a lot due to heavy IO in small increments

I never understood why we have to dump all these object files, with their large variance in file size, into the same folder, when we could store them in chunks (e.g. 40MB file blocks) and load them in order. Want to load/write FileABA.obj, FileABB.obj and FileABC.obj? Let’s load blocks 30 and 31, and defragment them on the CPU, in user space. Keep these blocks in RAM because we know we’ll be loading FileABD and FileABE soon.

While an application-level solution can be crafted by just embedding the files into another file (e.g. storing the .obj files inside multiple tar files), this solution would have to be reinvented multiple times (each software package would end up with its own tar-like format), and the individual .obj files would get obscured inside cryptic .tar files that not all existing tools know how to read.

If the OS and applications worked together, it should be possible to load and save these files in blocks (and fragment/defragment them on the CPU) instead of doing it on an individual basis. OSes already apply a lot of heuristics to perform a similar job, but why not lend them a hand by giving them a little more information about our usage patterns? We already have HW-based NCQ (Native Command Queuing); why don’t we collaborate further?
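Nothing like this exists in mainstream filesystem APIs today, so treat the following purely as a strawman of the kind of interface I’m picturing; every name, type and parameter below is made up:

/* Hypothetical block-store API sketch. A "block" is a large (e.g. 40MB)
   container file that packs many small outputs and is read/written with
   one big sequential IO request instead of thousands of tiny ones. */
#include <stddef.h>

typedef struct blockstore blockstore_t;   /* opaque handle */

blockstore_t *blk_open(const char *path, size_t block_size);

/* Append a named entry (e.g. "FileABA.obj") into the currently open block. */
int blk_write(blockstore_t *bs, const char *name, const void *data, size_t len);

/* Hint: these entries will be read soon; fetch their whole blocks into RAM. */
int blk_prefetch(blockstore_t *bs, const char **names, size_t num_names);

/* Read an entry back; the library "defragments" it out of its block in user space. */
int blk_read(blockstore_t *bs, const char *name, void **data, size_t *len);

int blk_close(blockstore_t *bs);          /* flushes any partially filled block */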

We’ve been assuming computers “got fast enough”, letting these (now insecure) HW-based optimizations sweep our dirt under the rug. It’s time we revisit these 30-year-old API designs and reclaim the performance that has just been stolen from us as 2017’s parting gift.

 

Other stuff I want to speak out about

This is unrelated to Meltdown and Spectre. But considering that what happened was partly the result of negligence and naivety, it’s time I speak up about a few other concerns I have:

 

UEFI is a security nightmare. A simple kernel bug was able to render a computer unbootable. And it happened twice. Intel took something as simple as loading a base system with minimal video output and user input, loading a bootloader from a hard drive, and handing off control to the OS, and turned it into a full-blown operating system that may be bigger than the one on your hard drive. To make it worse, it has configuration settings that are saved into flash memory, meaning there is no “clear” or “restore” jumper. Once bricked, it’s dead forever (unless you remove the BIOS chip and manually flash it with specialized equipment).

Of course, UEFI boots faster when compared to a legacy BIOS. But that’s just because a legacy BIOS has to boot in 386 compatibility mode. BIOSes could have been modernized without blowing things out of proportion.

The biggest fear is that malware could one day exploit a 0-day vulnerability that gets it into UEFI, becoming permanently present regardless of the OS running, and impossible to remove through any normal means.

 

Intel ME (and AMD’s equivalent, the Platform Security Processor, based on ARM TrustZone). Another security nightmare. Unauditable CPUs that have access to virtually everything in the machine, are invisible to the system, and cannot be controlled. As with UEFI, the biggest fear is that malware could one day exploit a 0-day vulnerability that gets it into the IME, becoming impossible to remove; it could then launch DDoS attacks on a chosen target, steal information, shut down entire datacenters, or damage hardware beyond repair. Bad people could create a disaster. Last year WannaCry made a lot of fuss, but WannaCry was just a monkey with a nuke developed by the NSA. Imagine if this weapon had been used by the Joker instead… an evil mastermind who just likes to watch the world burn.

There’s barely any value added by the IME, and lots of holes. The IME wouldn’t be so bad if:

  1. Its code were open source and auditable
  2. It could easily be flashed externally (to flush out malware that got in)
  3. It weren’t invisible to the underlying systems

Instead we’re relying on security through obscurity, and “it’s secure, trust me” speeches.

 

Reliance on SSL for everything. You think Spectre is bad? Imagine waking up tomorrow in a world where somebody, somewhere, discovered an extremely efficient solution to the problem of factoring very large numbers, breaking RSA. We rely on it for banking, secure authentication, paying bills, taxes (and yes, cryptocurrencies). We’d be rolling back 20 years.

Oh yeah, and what happens to military applications relying on it? Like… drones? If crypto ever gets broken, that multi-million dollar drone becomes as hackable as an RC toy plane. I hope that besides asymmetric keys for authentication, these bad boys also rely on exchanging symmetric, randomly salted, hashed passwords.

Does anybody know what the contingency plan is if this ever happens?

 

Adding computers to cars. I understand computers can improve the engine, but they have no place in, e.g., remote-controlled parking. These computers have no place in steering, braking and wheel control systems. If they do need to exist, they must require difficult physical access, and must be isolated from the rest of the systems.

 

Relying on the Cloud (storage, ‘live’ software and services). This is kind of obvious now that Spectre, which directly affects Cloud services, is out. But still, the rule of thumb is: if it’s not on a computer you own and control, it’s not really yours.

 

Updated 2018/01/18:

BadUSB. There is a hidden microcontroller that controls the USB ports. For example, things go very badly when a USB device reports a device name that goes out of bounds or contains funny Unicode characters (I found out about this by accident when a faulty keyboard managed to work correctly but reported a garbage name. This is a serious unpatched bug. Apparently it’s not new; go to page 7). It’s not just the driver or the HW interface that needs to be protected, but basically anything that will display the name of the USB adapter!

 

AVX2 severely throttles the CPU. See CloudFlare’s On the dangers of Intel’s frequency scaling. This problem is two-fold:

  1. It causes biases when profiling new code. When profiling new code that uses a specific AVX path to optimize a specific routine, we rarely account for how it affects the rest of the system. The problem with AVX2 is that a routine that is 4x faster in AVX than regular C, in code that takes 40% of the CPU time, can cause the whole system to actually run slower, because the overall frequency went down and thus the other 60% of CPU time now takes significantly longer (see the back-of-the-envelope numbers after this list).
  2. It allows a userspace process to control, with fine granularity and low latency, the frequency at which the CPU operates. Just lock a thread to a single core, run an idle spinloop at 100%, and use an occasional sequence of AVX2 instructions to lower the frequency at will. This smells like a side channel attack opportunity all over the place, again. I just can’t imagine yet how to exploit it, but I do know that having such power over frequency scaling from userspace is extremely dangerous. The good news is that it is easy to spot a process taking 100% of a single core. The bad news is that changing the power scheme / CPU governor to “performance” usually doesn’t require root privileges, so you could control the frequency at will without even having to run an idle spinloop. Fortunately, that can be addressed with a simple OS patch.
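To put rough numbers on point 1 (these are made-up, illustrative figures, not measurements from the CloudFlare article): suppose the AVX2 routine is 4x faster and covers 40% of total CPU time, so the amount of work shrinks to 0.4/4 + 0.6 = 0.7 of the original. If running AVX2-heavy code also drops the sustained clock for everything on that core by 35%, wall time becomes 0.7 / 0.65 ≈ 1.08, i.e. about 8% slower overall despite the local 4x win. In this toy model the break-even point is a 30% frequency drop; anything beyond that turns the “optimization” into a net loss.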

 

 

 

[1] David Molnar, Matt Piotrowski, David Schultz, and David Wagner; The Program Counter Security Model: Automatic Detection and Removal of Control-Flow Side Channel Attacks; https://people.eecs.berkeley.edu/~daw/papers/pcmodel-long.pdf
[2] Mehmet Sinan Inci, Berk Gulmezoglu, Gorka Irazoqui, Thomas Eisenbarth, Berk Sunar; Cache Attacks Enable Bulk Key Recovery on the Cloud; https://eprint.iacr.org/2016/596.pdf
[3] Yuval Yarom, Katrina Falkner; FLUSH+RELOAD: a High Resolution, Low Noise, L3 Cache Side-Channel Attack; https://eprint.iacr.org/2013/448.pdf
[4] Onur Acıiçmez, Jean-Pierre Seifert, and Çetin Kaya Koç; Predicting Secret Keys via Branch Prediction; https://eprint.iacr.org/2006/288.pdf
[5] Colin Percival; Cache Missing for Fun and Profit; http://www.daemonology.net/papers/htt.pdf