I would ordinarily regard this as a rather obscure piece of esoterica, too technical to blog about (and merely a technical report, besides) but:

  1. My beloved readers seem to like topics more technical than I would have thought, and (more importantly)
  2. It's my piece of obscure esoterica.

So, tech report or no, it gets a blog post.

Here's the main question: Suppose there were some magic technology that allowed your computer to determine whether another computer is 'good' or 'bad,' in the sense of having been subverted by malware or not. How do you then get programmers to incorporate this technology into their products (database clients and servers, operating systems, browsers, etc.)?

But let's back up a step. Would this magic technology be useful? Obviously. First of all, it would benefit you as a consumer. Your web browser could magically detect when a web server has been compromised (possibly now serving malware) and refuse to connect. Before your mail client even shows you some spam, it could confirm that the mail-source has not been subverted into being part of a botnet. And so on. Rainbows and puppies result.

But this technology would be of even bigger benefit to enterprises, like corporations or government agencies. Those types of entities are desperate for any sort of solution to the 'laptop problem': laptops that go home with a worker, get infected, and then bring that infection back to the enterprise with them. Enterprises would love to be able to detect when a laptop has been compromised, and they would love to be able to detect this when the laptop attempts to connect to the enterprise's internal network.

Okay, so this technology would be useful. So useful that it actually has a name: 'attestation.' But does it exist? Partially, but not entirely. At the moment, we have things called Trusted Platform Modules (TPMs). Roughly speaking, a TPM is a little processor that sits on your computer's motherboard. It doesn't really do anything except:

  1. Monitor the state of the actual computer, and
  2. Upon request, issue a (digitally-signed) statement of the form "My computer is in state X. Sincerely, TPM number 34349582345239888."

So, is this attestation? Not quite. Note that the TPM will just notarize a computer's current state, not judge whether it is 'good' or 'bad'. But the TPM is a good start, and it seems like it should be possible to build a house of cards on top of the TPM to actually achieve the magic technology we want.1
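To make the TPM's role concrete, here is a minimal Python sketch of what issuing and checking such a signed statement might look like. Everything here is illustrative, not the real TPM API: a real TPM signs PCR measurements with an asymmetric key (so verifiers never hold a secret), while this toy uses HMAC as a stand-in.

```python
import hashlib
import hmac

# Stand-in for the key held by TPM number 34349582345239888.
# (A real TPM uses asymmetric signatures; this symmetric key is a toy.)
TPM_KEY = b"secret-key-of-tpm-34349582345239888"

def tpm_quote(state: str) -> dict:
    """Issue a signed statement: 'My computer is in state X. Sincerely, the TPM.'"""
    message = f"My computer is in state {state}".encode()
    signature = hmac.new(TPM_KEY, message, hashlib.sha256).hexdigest()
    return {"message": message, "signature": signature}

def verify_quote(quote: dict) -> bool:
    """Check the signature. Note what this does NOT do: it tells us the
    statement is authentic, not whether the state is 'good' or 'bad'."""
    expected = hmac.new(TPM_KEY, quote["message"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, quote["signature"])

quote = tpm_quote("X")
print(verify_quote(quote))   # True: the statement is genuine
quote["message"] = b"My computer is in state Y"
print(verify_quote(quote))   # False: a tampered statement is caught
```

The gap the sketch makes visible is exactly the one in the text: `verify_quote` authenticates the claimed state but says nothing about whether that state is trustworthy.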

But that's actually the research of a different Herzog. The work described here (co-written with Jonathan Millen, Brian O'Hanlon, John D. Ramsdell, and Ariel Segall) blithely assumes that some Very Smart People will figure out how to build attestation on top of the TPM and asks: then what? What should we do with it? Or to put it slightly differently: how do we package it so that non-security system developers will actually use it-- and use it correctly?

This is less of a ridiculous question than it sounds. It turns out to be really, really hard to get security mechanisms actually adopted in commercial products, much less adopted correctly. There usually isn't a business case for actual security, and so all developers see is the potential downside of the security mechanism failing and pissing off the user. Furthermore, it's very difficult to implement security mechanisms correctly: doing so requires obsessive attention to detail and a certain sociopathic fascination with how one could break things, an attitude less common than you would think. History is littered with well-meant security mechanisms rendered completely worthless by a simple arithmetic error in the code.

This leads to our answer to the above question. How do we get programmers to use attestation? We don't. Instead, we do it for them. Specifically, we give them a 'cryptographer in a box': a tool that will build attestation into their system correctly and automatically. They build their system as they would otherwise, completely ignoring issues of attestation. But once they're done, they pass their system through our tool and get out a functionally-identical system which automatically protects itself from subverted machines. (This could mean several different things. I'll explain what I mean here in a second.)

What we are specifically talking about is building the cryptographer into the compiler. That is, programmers write their systems in high-level source code. Computers actually execute commands written in binary. A compiler is a tool which translates the former to the latter, and modern compilers do a lot of optimizations and such along the way. In the paper described here, we describe how compilers might make otherwise normal programs (written in an attestation-oblivious way) into binary executables that have attestation built in.
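The flavor of that compiler pass can be shown in miniature. The sketch below (Python's `ast` module standing in for a real compiler front-end; the name `attested_send` is hypothetical) rewrites the programmer's network calls into attestation-checking ones, without the source ever mentioning attestation:

```python
import ast

class AttestationTransform(ast.NodeTransformer):
    """A toy compiler pass: rewrite every call to send() into a call to
    attested_send(), an imagined wrapper that refuses to talk to peers
    failing attestation. The programmer's source never mentions it."""

    def visit_Call(self, node):
        self.generic_visit(node)  # transform nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "send":
            node.func.id = "attested_send"
        return node

source = "send(peer, 'hello')"            # what the programmer wrote
tree = AttestationTransform().visit(ast.parse(source))
print(ast.unparse(tree))                  # what actually gets compiled
```

Running it prints `attested_send(peer, 'hello')`. The real work does this kind of rewriting at the Erlang compiler level, but the principle is the same: the attestation machinery is spliced in mechanically, so the programmer can't get it wrong.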

I mention this because our compiler-based approach is not the only path we might have taken. One can imagine other ways to make attestation both automatic and hidden from the programmer. For example, one might build attestation into the execution environment itself (the kernel, say, or the JVM for Java programs). But it actually doesn't matter very much. Both approaches, it turns out, will run into the same problem: what does it mean for this kind of thing to work? Specifically:

  • What does it mean for a system to protect itself from other subverted machines?
  • How do we ensure that our attestation-handling machinery doesn't disrupt the programmer's actual system?

Our answers are:

  • An honest computer should never communicate with a subverted one. A subverted machine should be completely shunned.
  • There is no way to hide the fact that communication with a given (subverted) machine is being cut off. So we don't try. Instead, we rely on the fact that modern networking applications already deal with communication cut-offs. All our machinery does, in fact, is make it look like a subverted machine has crashed. After that, we let the programmer's system handle this (expected, normal) form of failure in its own way.
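The second answer can be sketched in a few lines of Python. Everything here is hypothetical (the real machinery lives in compiled Erlang code, not a helper function), but it shows the essential move: a peer that fails attestation produces exactly the failure the application already handles for a crashed peer.

```python
class PeerDown(Exception):
    """The failure mode the programmer's system already handles."""

def attested_connect(peer, attestation_ok, network):
    """Connect to a peer, but only if it passes attestation.

    attestation_ok is a stand-in for the magic good/bad verdict.
    A subverted peer isn't reported as 'subverted'; it is simply
    made to look crashed, so existing failure handling takes over.
    """
    if not attestation_ok(peer):
        raise PeerDown(peer)   # don't explain, don't negotiate: shun
    return network[peer]

network = {"db-server": "connection-object"}
honest = lambda peer: peer == "db-server"

print(attested_connect("db-server", honest, network))
try:
    attested_connect("botnet-box", honest, network)
except PeerDown as peer:
    print("peer appears down:", peer)
```

From the application's point of view, `PeerDown` for a subverted machine is indistinguishable from `PeerDown` for a machine that lost power, which is the whole point.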

The work is much more formal and rigorous than this, of course. It talks about distributed algorithms instead of general programs, for example, and it has a much more mathematical definition of success:

Suppose a distributed algorithm D is running in a system where participants P1, P2, P3, ... Pn are Byzantine (subverted), but honest participants are separated from the network by [our machinery]. Then the experience of the honest participants is exactly as it would be in a system without [our machinery] but where network-connections to participants P1, P2, P3, ... Pn are down.

Thus, the combination of (algorithm D + our machinery) is resilient to Byzantine (subverted) participants to exactly the same extent that algorithm D is resilient (on its own) to link-crashes.
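A finite toy can at least illustrate the shape of that statement (the paper's actual result is a bisimulation over all executions, not an experiment like this). Here a trivial algorithm D has each honest participant record which peers answered its ping, and we compare a world where Byzantine peers are shunned by the machinery against a world where their links are simply down. All names are illustrative.

```python
def run(honest, byzantine, shunned=frozenset(), links_down=frozenset()):
    """Toy algorithm D: each honest participant records which peers answered."""
    view = {}
    for p in honest:
        answered = set()
        for q in (honest | byzantine) - {p}:
            if q in shunned or q in links_down:
                continue   # a shunned peer looks exactly like a dead link
            answered.add(q)
        view[p] = answered
    return view

honest = {"A", "B", "C"}
byzantine = {"M1", "M2"}

world_a = run(honest, byzantine, shunned=byzantine)      # machinery present
world_b = run(honest, byzantine, links_down=byzantine)   # links simply down
print(world_a == world_b)   # the honest participants can't tell the worlds apart
```

The honest participants' views are identical in the two worlds, which is the equivalence the theorem pins down rigorously.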

We even demonstrate this result by implementing our machinery in Erlang.2 Erlang is a wonderful programming language designed for networked applications, and so it makes sense to try our approach out on it first. Since it takes care of so much grunt-work for us (particularly the networking aspects), it allows us to really focus on the core of our approach. Its compiler is also amenable to this approach, having general support for user-supplied parse-transforms which can change the programmer's code in interesting ways before compiling it.

All in all, I'm very proud of this work and it was a lot of fun to do. Not only did I get to learn a lot about distributed algorithms and Erlang, but I got to contribute to an important result. From a purely academic viewpoint, the paper has a lot going for it. (The 'bisimulation' result paraphrased above, for example, as well as a new and intriguing definition of a 'Byzantine' participant.) But more importantly, this is my first publication in a while that might actually make the world a better place. (A little. Some day. Eventually. If a number of external factors all line up correctly. But hey-- I'll take what I can get.)

  1. Actually, more of an inverted ziggurat of cards, where the structure gets bigger and more precarious as it goes. And by the way, I've already glossed over another inverted ziggurat of cards when I say that the TPM 'monitors the state of the computer.' What it actually does is much more complicated and a lot more convoluted. 

  2. To the extent that we can. We don't actually have attestation working yet, and won't until the aforementioned Very Smart People get it working. But we can build everything but attestation, and can even use a simple placeholder protocol for the mock-attestation.