Robust signatures for kernel data structures

First, I should warn my non-CS readers: I had originally decided that this paper was too technical to blog about. But then I received a request that I actually discuss this specific paper, and who am I to say 'no' to a beloved reader? So be aware that this one is going to be a bit more esoteric than even usual, but I also will go somewhere with it at the end.

With that out of the way, what is this paper all about? Let me provide a little context. Suppose that some bit of malware has made it on to your system. One of the primary things that malware would like to do is hide itself as best it can. The better it hides itself, the more likely it will escape detection and the longer it can sit there sniffing your passwords and sending out spam for its evil overlords. But what does that mean, for a program to 'hide itself'?

To explain that, I first need to explain that your computer is doing a lot of things at once-- it's running the operating system (which is always running), it's running the web browser or RSS reader you're using to read this right now, it's running the music-player you probably have going, it's running any other programs you happen to have open right now to do your actual job (remember that?), and so on. Roughly speaking, each running program corresponds to a process, which is CS-speak for 'running program.'¹ Many programs are written to take advantage of multi-processor and multi-core systems, and so a given process can have started multiple threads of computation.² Threads are the things that are actually executed by the chips/cores, but they are really low-level-- much lower than the concept of 'program.' For that, you need to pop back up and talk about 'processes' again.

Okay, so your computer has all of these processes running-- lots and lots, all the time. (Right now, as I type this, I have 145 of them. Most of them are parts of the operating system.) To keep track of them all, the operating system creates an internal data structure for each of them and keeps those structures in memory. (For non-CS people, imagine that by 'data structure' I mean a standard form the operating system fills out for each process: putting in the name, the user who started it, when it was started, etc.) And by 'operating system' here, I really mean a very core piece called the kernel. Thus, these things are 'kernel data structures'.

(Congratulations on making it this far. We're about halfway there.)

To keep track of all of these data structures, the kernel keeps them in a special place in memory. Malware designers know this. So to keep their malware from being detected, they have the malware go into that special location, find the structure that represents the malware, and move it to a random location in memory.³ The malware can't really delete the structure without deleting itself in the process (the structure in question part of the malware's life- support system, in essence) but it can make the structure hard to find by throwing it somewhere in a very, very, very large memory-space. (But how does the malware keep running if we hide the structure? Remember that these structures represent processes, and it is threads that actually get executed. The malware won't mess with the threads.)

So what is the end result for us honest users? If we want to check for malware running on our system, we can't rely on any of the standard built-in tools of the operating system. They weren't built for this, and simply look through all the structures in the 'special place.' (That phrase is becoming more and more ridiculous every time I type it.) So to find malware, we need to scan the entire memory space of the computer and look for these kernel data structures.

Can we do this? Sure, but first we need to answer an important question: how will we know it if we find one? Put another way, what could these structures look like? The malware writers know that we will come looking for these structures, and so they take pains to camouflage them best they can. They will change everything about they can change without hurting the ability of the malware to run. So what does that leave? Or put another way, how much can we mess with these structures without crashing either the malware or the whole computer?

At last, we have reached the core question of this paper. The authors devise an experimental apparatus for answering that exact question. Specially, the apparatus runs an instance of a virtual machine (sort of a computer within a computer). It starts a process on the virtual machine, freezes the virtual machine, randomly messes with the relevant data structure, lets the virtual machine resume, and watches to see if either the process or the virtual machine dies. By doing this repeatedly, they can begin to build up experimentally-derived rules (called 'signatures') for what about these structures cannot be changed. These signatures can then be built into memory scanners, which use them to scan the memory-space for such hidden structures and find the malware. And these signatures will be robust in the sense that the malware writers can't evade them without hurting their own malware. Pretty slick.

But also completely ridiculous. Why on Earth is it necessary to do this? Computers are human-crafted machines, and their behavior is completely determined by the code they are running. Can't we just look at the code and see what will and will not crash the system?

Given that this was a highly-acclaimed paper at CCS 2009, the answer is obviously 'no.' Modern computers are just too complex for any one human to really understand, and so no, we can't just read the code. We're not smart enough to read and understand that much code well enough to predict what will and will not make it crash. We need to determine that experimentally.

...Which is the first real evidence I've seen that Computer Science deserves that second word. That is, this is the first paper I've read that made me consider the possibility that Computer Science really is a science. Up until now, I had been conflating two seperate definitions of 'science':

An effort to measure some 'natural' phenomenon, and
Determining things experimentally.

In my mind, these two things implied each other. The only way to measure a 'natural' phenomenon is experimentally, right? And if it's not a 'natural' phenomenon, then it's human-made. And if it's human made, wouldn't it be easier to understand it through non-experimental methods?

Obviously not. My mistake was thinking that any human-generated system would be comprehensible to humans. And when I put it that way, I feel stupid for having believed it (even subconsciously). We humans make all kinds of things with simple descriptions but complex and unpredictable behaviors. Societies. Pinball games. And now, operating systems. Given how complex computers really are, the only way to characterize their behavior seems to be through experiment. I guess we are a science.

And if that's the case, we might want to collectively think about taking a class in experimental methods. Right now, unfortunately, we kind of suck. (But that's another post entirely.)

Technically, the word 'process' has a specific technical meaning (which is not worth giving here) and a given program can correspond to more than one process. But every running program must have spawned at least one process. ↩
As in the previous footnote, a process must have started at least one thread. ↩
These are lies, but reasonable ones. For CS people who know the lingo, the kernel does not really keep these structures in a 'special place' but holds them in a dynamic doubly-linked list. So they actually start out scattered all over in memory. The malware simply splices the relevant list-element out of the list, leaving it at its original location but making the kernel 'forget' where it is. ↩