Brussels / 31 January & 1 February 2015

Interview with David Chisnall
The CHERI CPU. RISC in the age of risk

David Chisnall will give a talk about ‘The CHERI CPU. RISC in the age of risk’ at FOSDEM 2015.
Q: Could you briefly introduce yourself?

I’ve been involved with a variety of open source projects for a while. I’ve been hacking on LLVM since 2008, on GNUstep for longer, and on a few other projects before that. I wrote the GNUstep Objective-C implementation, though I don’t use Objective-C much anymore. I’m in my second term as a member of the FreeBSD Core Team.

By day, I work at the University of Cambridge on a big project that spans operating systems, compilers, and computer architecture, with a focus on security. I also have a few other responsibilities within the university: I teach the masters’ compiler module and am Director of Studies for Computer Science at Murray Edwards, one of the women-only colleges in the University of Cambridge.

Q: What will your talk be about, exactly? Why this topic?

I will describe the CHERI research processor developed by the University of Cambridge and SRI International. For some background, my colleague Jon Woodruff is giving a talk on Saturday about BERI, which is the base processor for CHERI. BERI implements an ISA roughly compatible with the R4K in Bluespec SystemVerilog, a high-level HDL that makes it very easy to make significant changes. We usually synthesise it to run at 100MHz on an Altera FPGA, with 1GB of attached DRAM. This gives a system that is useable at interactive speeds (it even runs X.org with an HDMI monitor and USB keyboard and mouse!).

CHERI extends this base with a capability model for (virtual) memory. CHERI’s memory capabilities are very carefully designed to be useable as pointers in C, but can also be used at a much coarser granularity. At any given point in time, a thread can only access the subset of the process’s memory for which it has valid capabilities. This means that you can do object-granularity bounds checking, or fine-grained in-process compartmentalisation, with the same hardware primitive.
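As a rough illustration of the idea in the previous paragraph, here is a small software model of a memory capability. It is only a sketch: the names `Capability`, `restrict`, `load` and `store` are invented for this example, and real CHERI enforces these checks in hardware, which a Python object cannot do. It shows one primitive giving both object-granularity bounds and reduced rights over the same bytes.

```python
from dataclasses import dataclass

LOAD = 1   # permission bit: may read through this capability
STORE = 2  # permission bit: may write through this capability


class CapabilityFault(Exception):
    """Raised where real hardware would trap (hypothetical name)."""


@dataclass(frozen=True)
class Capability:
    base: int    # first address the holder may touch
    length: int  # number of bytes reachable from base
    perms: int   # bitmask of LOAD / STORE

    def restrict(self, offset, length, perms):
        """Derive a narrower capability; rights can shrink but never grow."""
        if offset < 0 or length < 0 or offset + length > self.length:
            raise CapabilityFault("derived bounds exceed parent bounds")
        if perms & ~self.perms:
            raise CapabilityFault("derived permissions exceed parent permissions")
        return Capability(self.base + offset, length, perms)

    def _check(self, offset, perm):
        """Return the absolute address, or fault if the access is not allowed."""
        if not (self.perms & perm):
            raise CapabilityFault("permission denied")
        if not (0 <= offset < self.length):
            raise CapabilityFault("out of bounds")
        return self.base + offset


def load(memory, cap, offset):
    return memory[cap._check(offset, LOAD)]


def store(memory, cap, offset, value):
    memory[cap._check(offset, STORE)] = value


memory = bytearray(64)                       # stand-in for a process address space
whole = Capability(0, len(memory), LOAD | STORE)
obj = whole.restrict(16, 8, LOAD | STORE)    # object-granularity bounds
read_only = obj.restrict(0, 8, LOAD)         # same bytes, fewer rights

store(memory, obj, 0, 42)                    # allowed: in bounds, STORE held
```

In this model, `load(memory, obj, 8)` and `store(memory, read_only, 0, 1)` both raise `CapabilityFault`: the first is one byte past the object, the second lacks the store permission.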

Q: What’s the history of the CHERI project? Why did it start and how did it evolve? What were the biggest challenges you have encountered?

For the full history of CHERI, you have to go all of the way back to Multics. Peter Neumann is one of the project leaders and a lot of the ideas in CHERI are evolutions of his ideas from Multics and PSOS. Robert Watson proposed the project after encountering limitations of conventional MMUs when creating Capsicum.

Security in computer architecture has gone through lots of changes. On the big machines where systems like Multics were developed, there were lots of complex security features, but many of these were lost in the transition to minicomputers and then microcomputers. Some didn’t fit well in pipelined designs, some weren’t seen as important.

In particular, the threat model has changed a lot over the last decade. Modern memory management units and protection rings are designed to allow operating systems to protect users of a computer from each other. Nothing that I do should be able to seriously impact the experience of other users of a computer. That made a lot of sense when most computers were big multiuser systems. Now, I am the only user of my laptop, of my phone, of my tablet, and so on. Each of these devices runs a web browser, which contains multiple security domains. If you visit a webmail site, for example, and view an image in an attachment, then the image is decoded by something like libpng (which doesn’t have the best security track record imaginable), usually in the same process as the rest of the page. Even if tabs are isolated from each other, a malicious image that exploits libpng can compromise your email account and, for example, send an HTTP request to the server that will send an email containing your login credentials. Process-based isolation can’t scale to the kind of granularity that you need to protect against these threats.

I joined the CHERI project about two years after the start. At that point, a first version of the CHERI processor existed. The team had written a small microkernel (Deimos) with a lot of hand-written assembly that ran some graphical demos, but there was no compiler that could use the capability features. I tried to port LLVM and found a number of limitations in the instruction set as a C target that we fixed with CHERIv2. That version was presented at the International Symposium on Computer Architecture (ISCA) last year. We then refined it to fully support the C abstract machine. It turns out that mostly kind-of supporting C is a lot easier than completely supporting all of the weird and wonderful things that people want to do with pointers. We’re presenting this version at ASPLOS (the International Conference on Architectural Support for Programming Languages and Operating Systems) later this year.

Q: So CHERI is an outgrowth of the Capsicum project, which explored hybrid capability models in the context of UNIX operating system design. Which limitations to current CPU designs did Robert Watson discover that made application compartmentalisation tricky and how did it limit Capsicum?

It’s important to differentiate between isolation and compartmentalisation. The first isolates something so that all of its interactions with the outside world must go via something that makes policy decisions. Compartmentalisation involves splitting things apart. Capsicum is an isolation technology: it lets you create a process that can only interact with the outside world via file descriptors that it has been delegated and can’t touch the global namespace.

For compartmentalisation, you need an isolation technology to provide the underlying substrate (compartments that aren’t isolated are good software engineering, but don’t add any security). If you use Capsicum for this, then you’re using one process per compartment. There are some scalability issues from the OS here, in that you’re taking up kernel memory for the process control block and you’re adding another scheduler entry. There are more significant limitations in the hardware, most specifically in the translation lookaside buffer (TLB). This is a small associative structure that maps from virtual to physical addresses and is accessed for every load or store instruction. If a virtual address range isn’t in the TLB, you need to walk the page table to find the correct mapping, which incurs a big performance hit. Because every process has its own set of virtual to physical mappings, increasing the number of processes increases TLB pressure. If two processes have the same physical pages mapped, they still need their own TLB entries for the shared page, so a shared page has the same TLB penalty as an unshared page.
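The point about shared pages can be made concrete with a toy model. The sketch below assumes a fully associative TLB with LRU replacement whose entries are tagged by an address-space identifier; the `TLB` class and its fields are invented for illustration and do not describe any particular CPU.

```python
from collections import OrderedDict


class TLB:
    """Toy fully associative TLB with LRU replacement, tagged by address space."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (asid, virtual page) -> physical page
        self.misses = 0

    def translate(self, asid, vpage, page_table):
        key = (asid, vpage)
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        self.misses += 1                       # stands in for a page-table walk
        ppage = page_table[key]
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[key] = ppage
        return ppage


# Two processes (address spaces 1 and 2) map the same physical page 7.
page_table = {(1, 0): 7, (2, 0): 7}
tlb = TLB(capacity=4)
tlb.translate(1, 0, page_table)
tlb.translate(2, 0, page_table)
print(len(tlb.entries), tlb.misses)
```

Although both lookups resolve to the same physical page, the model ends up holding two entries and taking two misses, one per process, which is the cost the paragraph above describes.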

The obvious question to then ask is: given that we’re going to want more processes, why don’t we just make the TLB a lot bigger? There are two answers to this. The first is that the TLB is associative, so the probability of collisions is subject to the birthday paradox. Doubling the size of the TLB does not halve the probability of collisions. The second is that doubling the TLB is expensive. It must be powered all of the time and the area (and therefore power) requirements of associative structures do not scale linearly with their capacity. Worse, increasing the size can increase the latency, which has a significant performance impact.
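The birthday-paradox argument can be checked with a short calculation. This is a deliberately simplified model, not a description of real TLB indexing: it assumes each page lands uniformly at random in one of `slots` positions, and the sizes (64 and 128 slots, 16 pages) are arbitrary illustrative values.

```python
def collision_probability(slots, pages):
    """Chance that at least two of `pages` land in the same one of `slots`."""
    no_collision = 1.0
    for i in range(pages):
        # The (i+1)-th page must avoid the i slots already taken.
        no_collision *= (slots - i) / slots
    return 1.0 - no_collision


small = collision_probability(slots=64, pages=16)
large = collision_probability(slots=128, pages=16)
print(f"64 slots: {small:.2f}, 128 slots: {large:.2f}")
```

Under these assumptions the collision probability for the doubled structure stays well above half of the original figure, so doubling the capacity buys less than a halving of collisions.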

With process-based compartmentalisation, you can easily scale to around 20 compartments on a modern CPU. Beyond that, you start to see significant performance problems. This isn’t limited to Capsicum.

CHERI is designed to allow applications to cheaply compartmentalise their virtual address space. Sharing is very cheap: you use the same TLB entries and the same cache lines for both compartments and there is no associative lookup, just an offset from a register.

Q: In contrast to other capability models, CHERI takes a hybrid approach combining the traditional page-based protection mechanism with a capability model. Why this choice?

The simple answer: It turns out that there’s a lot of software in the world.

We could design a pure capability machine and say ‘here’s this amazing CPU, all that you have to do is rewrite your OS and all of your software and it will be secure!’ A few other people have tried that. It tends not to work. Even the last part is difficult. Java was backed by Sun/Oracle/IBM and friends, C# was backed by Microsoft. Both had lots of money behind them, pushing people to write new code in typesafe languages. According to Open Hub (which is slightly biased, as it only tracks open source code), there is still more new code being written in C or C++ (individually) than Java and C# combined. In terms of existing legacy code, both C and C++ dwarf Java and C# by an order of magnitude or more.

Even if you do write your new code in Java, a typical JVM has around a million lines of C libraries linked in to provide core Java library functionality. A single pointer error in any of that code can compromise the integrity of the entire Java environment. Do you want to bet that they’ve managed to find a million lines of C code with no pointer bugs?

The CHERI model makes every process a virtual capability machine. How much you use it is up to that process. Incremental adoption is very important to us. We’ve experimented with libraries compiled to expose the same public ABI, but to run sandboxed. Big, security critical, applications like Chromium can afford to spend a lot of time on a compartmentalisation model that works, but there’s a much bigger win if every library can provide isolation. Last year Google found about 300 vulnerabilities in the ffmpeg libraries. These are performance-critical libraries that have high data throughput, so wouldn’t be good candidates for process-based isolation, but could be compartmentalised relatively cheaply with a CHERI-like model. Then, any application that linked to libavcodec (for example) would gain the benefit. Imagine never having to worry about bugs in CODECs, because no matter how buggy they are, the worst that a malicious video file can do is write bad data to the buffers containing the uncompressed image and audio (something that is much easier to do by just putting that data in a well-formed video file).

Q: Which software changes are needed to make use of CHERI’s capability model? Compilers, libraries, applications, …?

It depends on how you want to make use of it. If all that you want is memory safety, then you just need to recompile. I’ve made some fairly invasive (though not very large) changes to LLVM to teach it that not all pointers are integers, and with this you can take ordinary C code (C++ is lots of engineering work to add, but not conceptually difficult) and have a completely memory-safe version. If your code includes any assembly, then that may need some tweaking, but that’s all.

Alternatively, if you want to use just the compartmentalisation, then you don’t need to modify your libraries at all, you just need to write some wrappers for your public API. Currently this is quite a lot of work, but it’s relatively easy to automate. You need to add a bit of policy about what things are sandboxed and what permissions should be available to memory that’s shared with a sandbox, but then it should be possible to automate.

You can, of course, use both.

Compartmentalisation is possible from both ends. An application author can decide that they don’t especially trust a given library (or a part of their own code that is written with speed as the sole design goal) and run it in a sandbox. Alternatively, library authors can implement isolation within their own libraries.

Q: What’s the impact of CHERI’s capability approach on the processor’s performance? How big is the overhead typically?

Our current implementation uses 256-bit capabilities and adds 32 capability registers. For a production implementation, we’d expect fewer capability registers and 128-bit capabilities. If you’re using capabilities for memory safety, then the overhead is similar to the jump from 32-bit to 64-bit pointers: you use more data cache for pointers. Exactly what the overhead is depends a lot on what proportion of your total data contains pointers. For many normal applications, this is typically around 5-10%, so the performance overhead is difficult to measure. For graph algorithms that involve a lot of pointer chasing, it can be around 20-30%. One thing that we’re currently exploring is whether it can give a speedup: various parts of the memory controller can benefit from being able to differentiate pointers from integers and from being able to tell that a particular pointer will only be used to load or to store.
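A back-of-the-envelope way to see why the pointer proportion matters is sketched below. It is not the project's measurement method. It assumes that the percentages quoted above refer to the share of data bytes occupied by 64-bit pointers, and it estimates only how much the data footprint grows when those pointers are widened; growth in footprint is not the same as slowdown. The function name `footprint_growth` is made up for this example.

```python
def footprint_growth(pointer_fraction, old_bits, new_bits):
    """Fractional growth in data size when only the pointers get wider."""
    return pointer_fraction * (new_bits / old_bits - 1)


for fraction in (0.05, 0.10, 0.30):
    to_128 = footprint_growth(fraction, old_bits=64, new_bits=128)
    to_256 = footprint_growth(fraction, old_bits=64, new_bits=256)
    print(f"{fraction:.0%} pointers: +{to_128:.0%} at 128-bit, +{to_256:.0%} at 256-bit")
```

The model makes the gap between the two capability sizes visible: each pointer doubles in size at 128 bits but quadruples at 256 bits, so the same pointer-heavy workload pays three times the footprint growth with the wider format.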

Q: Do you expect that we’ll see a hybrid capability approach such as that from CHERI in production-ready CPUs in the near future? Are chip companies interested in this type of safety?

CPU makers are rapidly becoming aware that, for a lot of people, raw speed is no longer particularly important. For a lot of the domains where speed really does matter, an on-die accelerator will give far bigger speedups. For everyone else, CPUs got fast enough some time ago. Power efficiency gives one way to differentiate your products, security is another.

I very much hope that we’ll see something CHERI-derived in future CPUs. CPU vendors are aware that some of the problems that CHERI addresses are real problems that their customers care about. CHERI isn’t the only possible solution to these problems (though, in my highly biased opinion, it is the best), so there’s no guarantee that it’s the one that they’ll pick, but I’d be very surprised if we didn’t see mainstream CPUs with support for bounds checking and fine-grained compartmentalisation in the next 10 years (that’s soon, when it comes to getting CPUs from design to production).

Our prototype extends a MIPS ISA and so is relatively easy to port to other RISC ISAs (anything with a load-store architecture). We’ve pondered how to apply it to x86. Reusing the segment registers as capabilities and retaining the same implicit offsetting for various instruction types could provide a nice adoption path. x86 also has the advantage of having an effectively infinite opcode space.

Q: Have you enjoyed previous FOSDEM editions?

I’ve not been able to attend FOSDEM since 2012 due to scheduling conflicts, so I’m looking forward to returning. This is my third main-track talk and a relatively relaxing year: my record for FOSDEM is four talks in one conference (one main track, three in devrooms). I expect this year to be much more relaxed. I’m also looking forward to using the FOSDEM app to avoid my normal experience of realising that a talk I wanted to go to is at the other end of the campus and finished 10 minutes ago.

Creative Commons License

This interview is licensed under a Creative Commons Attribution 2.0 Belgium License.