Quick: what is the following text about?

... the result of the collapse of large portions of the three provinces to have a syntax which can be found in the case of Canada and the UK, for the carriage of goods were no doubt first considered by the British, and the government, and the Soviet Union operated on the basis that they were...

Give up? It's about pwning your computer, actually. That's not 'real' English text, there, but a cleverly-disguised attack on your computer.

To explain why, let me back up a bit. Many attacks on computer vulnerabilities take advantage of the following computer-science truism: code is data, data is code. What that means, specifically, is that the computer really doesn't make any actual distinction between code (like the sequence of executable commands that is Microsoft Word) and data (like your Word files). The same byte, like 01000111, can mean any of:

  • The letter 'G',
  • The number 71, and
  • The instruction 'increment the value in the register %edi'.

More generally, the same byte can mean any of several different things, depending on how it is interpreted by the computer. This is exploited by several attacks, which all do basically the same thing: write some bytes to your computer's memory as data, then trick your computer into executing them as code. The bytes might, for example, be provided by a subverted webserver as part of a web page. The computer stores this page in memory in order to display it on the screen, but is tricked (through something called a 'buffer overflow') into executing the page as if it were instructions. Presto-- the webserver has caused your computer to execute malicious code.
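To make the 'one byte, several meanings' point concrete, here is a minimal Python sketch that reads the byte with value 71 three ways. The function name and the tiny opcode table are my own illustration, not a real disassembler: the table covers only the one-byte 'inc' encodings (0x40 through 0x47) of 32-bit Intel code.

```python
# One byte, three readings. INC_REGISTERS is a hand-written illustration of
# the 32-bit x86 one-byte "inc <register>" opcodes (0x40..0x47), nothing more.
INC_REGISTERS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def interpretations(byte):
    """Return (letter, number, instruction) readings of a single byte value."""
    as_letter = chr(byte)      # read as an ASCII character
    as_number = byte           # read as an unsigned integer
    if 0x40 <= byte <= 0x47:   # read as a 32-bit x86 opcode
        as_instruction = "inc %" + INC_REGISTERS[byte - 0x40]
    else:
        as_instruction = None  # outside this sketch's tiny table
    return as_letter, as_number, as_instruction

print(interpretations(0b01000111))  # prints ('G', 71, 'inc %edi')
```

Nothing in the byte itself says which reading is the right one; that depends entirely on what the computer is doing when it gets to that byte.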

Okay, so how do you prevent this from happening? One approach is to try to detect it while it's happening so that you can shut it down. This seems reasonable: this kind of malicious code, called shellcode, looks nothing like actual data. A real webpage looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">

(If you want to see more, you can use your browser's 'View Source' command.) Shellcode, on the other hand, looks very different. Most of it, for example, can't even be printed as a character. (That's why I'm not providing an example-- it may break your browser.) So, if these two things look different, can you detect when some incoming data--ostensibly a web page--is actually shellcode?
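A naive version of that detection idea might look like the Python sketch below: count how many incoming bytes are printable ASCII and flag anything that falls under a threshold. The function names and the 0.95 threshold are assumptions of mine for illustration. Real intrusion-detection systems are more sophisticated, and the 'binary' sample is just a run of arbitrary high bytes, not actual shellcode.

```python
import string

# Byte values that Python considers printable ASCII (letters, digits,
# punctuation, and whitespace).
PRINTABLE = set(string.printable.encode("ascii"))

def printable_fraction(data: bytes) -> float:
    """Fraction of bytes in `data` that are printable ASCII."""
    if not data:
        return 1.0
    return sum(b in PRINTABLE for b in data) / len(data)

def looks_like_text(data: bytes, threshold: float = 0.95) -> bool:
    """Naive check: treat the data as text if almost every byte is printable.

    The threshold is an arbitrary value chosen for this sketch.
    """
    return printable_fraction(data) >= threshold

web_page = b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">'
binary_blob = bytes(range(0x80, 0x90))  # stand-in for unprintable bytes

print(looks_like_text(web_page))     # prints True
print(looks_like_text(binary_blob))  # prints False
```

Note what the check actually tests: whether the bytes are printable, not whether they are harmless.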

All of that is background for the paper English Shellcode (PDF) by J. Mason, S. Small, F. Monrose, and G. MacManus. Their answer is simply, 'no'. The way they justify that answer, however, is really clever. If code is data, and English text is a kind of data, then can code be represented as English text? Yes, and they show how. Remember how I said that most shellcode is unprintable? Well, some of it can be printed. On the Intel processor, 27 letters can be (mis)interpreted as commands, along with 14 two-letter combinations. Good enough. They show how any piece of shellcode can be represented using those 41 commands, and how the resulting sequence of commands can be turned into valid English text. The steps are very technical, but two details are worth mentioning:

  • One of those commands, 'r', means 'skip the next X letters' (where X can be chosen at will). This means that the shellcode commands don't need to be consecutive-- they can use 'r' to insert non-command letters into their sequence so as to make it English text.
  • They build their text word by word. Specifically, they take the last four words in their text so far, use Google to find all instances of those four words in Wikipedia or Project Gutenberg, and then look at all words which come next. This gives them a set of candidates for the next word in the sequence, always ensuring that the text continues to scan as valid English. (As long as you focus on only five words at a time, at least.)
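The word-by-word step in the second bullet can be sketched in a few lines of Python. This is only my illustration of the idea as described above, not the authors' implementation: a one-sentence in-memory corpus stands in for Wikipedia and Project Gutenberg, the function names are hypothetical, and the sketch ignores the other half of the problem (choosing, among the candidates, a word whose letters also encode the needed commands).

```python
from collections import defaultdict

def build_model(text, context_size=4):
    """Map each run of `context_size` words to the set of words seen after it."""
    words = text.split()
    model = defaultdict(set)
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        model[context].add(words[i + context_size])
    return model

def next_word_candidates(model, text_so_far, context_size=4):
    """Words that followed the last `context_size` words somewhere in the corpus."""
    context = tuple(text_so_far.split()[-context_size:])
    return model.get(context, set())

# Toy corpus, made up for this sketch.
corpus = "the carriage of goods by sea and the carriage of goods by rail"
model = build_model(corpus)

print(sorted(next_word_candidates(model, "for the carriage of goods by")))
# prints ['rail', 'sea']
```

Every five-word window of text built this way has appeared somewhere in the corpus, which is why the output scans as English locally even when the whole passage means nothing.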

All in all, it's very clever work, and it demonstrates that it will never be possible to completely detect shellcode on the wire. If the adversary uses the methods of this paper, he/she can always hide their shellcode in valid-looking text (well, valid-looking to a computer, anyway) and it'll always get through.

So that's the paper. Let me pivot from it to a more general consideration: what is this work? Is it mathematics? It does have the feel of a counter-example to some mathematical hypothesis, but it's not even close to being as rigorous as mathematics. Is it science? No, it's not even attempting to measure some naturally occurring phenomenon. Is it engineering? Close, but no: it's not attempting to evaluate the suitability of some system for some purpose. So what is it?

The best I can articulate is that it is a counter-argument in a large, field-wide ongoing argument about what might actually work. Everyone agrees that there's a problem. Someone proposes a possible solution. Someone else points out that the adversary can get around the solution in this way. Someone else points out that they can defeat that counter-measure in this way. Yet another person points out that yes, but they can defeat that counter-counter-measure in this way... and so on. This paper is an example of that, showing that the adversary can defeat some defense in a particular way.

Given this, I expect someone to suggest a way to detect their computer-generated text, and someone else to refine their method to slip past that detection method, and someone else to suggest... and so on. We're not engineering new systems in this field right now. We're casting around for systems that might work, and debating whether they would be worth the effort to develop.

I think there's an analogy to be made here: that computer security is like air warfare. How did the art of air warfare develop? An action-reaction cycle: One side develops bombers. The other side develops radar. The first side develops jammers. The second side develops anti-jamming technology. The first side develops stealth fighters. The second side develops... well, we'll see.

But the same cycle is playing itself out in computer security-- just much earlier on. I've heard other people, making the same analogy, say that the bad guys have developed airplanes and we haven't yet invented radar. That sounds right to me. And you can see in the field that we're debating exactly what radar we want to develop-- mostly by playing out the next few rounds of the action-reaction cycle to see how robust the proposed radar system would be.

Will it work? I don't know. But at some point, we're going to have to put some sort of radar in place, no matter how badly it plays out in the debate. (In fact, we have. This is pretty much the state of enterprise intrusion-detection systems currently being fielded.) But any radar, once put in place, will do two things: it will buy us time, and make the bad guys commit to some specific counter-attack. Instead of trying to counter all of their possible reactions, then, we can focus on the ones they actually do.

Also, if the computer-security field is really like air warfare, we (the defenders) may have to make a paradigm shift. Right now, we sneer at 'security through obscurity' as rank foolishness perpetrated by idiots who don't know any better. But security through obscurity had an important role in air warfare-- where it was called 'classified information.' If the adversary can defeat any particular defensive measure but not all of them simultaneously, then it makes sense to keep your specific counter-measures secret. Make them guess, because sometimes they will guess wrong. So while it makes sense for us academicians to engage in public debate about the likely effectiveness of our proposals, we should expect actual enterprises to be very secretive about the measures they actually take.