For ten days at the beginning of 2009, a team of computer-security researchers managed to take control of a live, real-world, criminal botnet. Over those days, they observed (and recorded) the botnet harvest over 70GB of stolen data (passwords, bank-account numbers, etc.) from almost two hundred thousand subverted machines. Why did they do this? Simple curiosity, probably. But that's not nearly as interesting as how they did it, what they found, and what this means about the field of computer security.

First, a quick overview of how the Torpig botnet works (abridged version). Computers become part of the botnet when they visit malicious websites-- mostly pornographic, but a few 'legitimate' ones. The webserver exploits some vulnerabilities in Windows browsers to embed some shellcode on the victim machine. The shellcode scours the infected system, harvesting name/password pairs from various places. It also waits until the victim machine visits a bank website and (this is especially evil) adds its own content to the bank's webpages. That is, you (the victim user) will get the usual pages that you expect to see from the bank, and the browser tells you that the connection is secure, but the Torpig infection has added some additional stuff-- telling you (for example) that you need to re-enter your password/social-security number/mother's maiden name/etc. That doesn't go to the bank, however. That goes to the malware.

Okay, so the malware on the victim computer has collected all of this sensitive data. That doesn't do the criminals any good, however, until your computer sends it to a central collection point. However, that central collection point can't stay in one place long-- otherwise the law will shut it down. So the Torpig malware on the victim machine will turn the current date into a pseudo-random sequence of characters and send the harvested data to (pseudo-random sequence).com. Then the central collection-point can change every day without ever being out of touch.1
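To make the date-to-domain trick concrete, here is a minimal sketch in Python. It is not Torpig's actual algorithm (which this post doesn't describe); the hash function, the 12-letter name length, and the names domain_for_date and domains_for_range are all assumptions chosen for illustration. The only point is that the output depends on nothing but the date, so anyone who knows the recipe can compute tomorrow's name today.

```python
import hashlib
from datetime import date, timedelta
from typing import List


def domain_for_date(day: date, length: int = 12, tld: str = "com") -> str:
    """Map a calendar date to a pseudo-random-looking domain name.

    Illustrative only: SHA-256 and the 12-letter length are assumptions,
    not the scheme used by the real malware.
    """
    # Hash the ISO form of the date (e.g. "2009-01-25") to get stable bytes.
    digest = hashlib.sha256(day.isoformat().encode("ascii")).digest()
    # Turn the first `length` bytes into lowercase letters a-z.
    letters = "".join(chr(ord("a") + b % 26) for b in digest[:length])
    return f"{letters}.{tld}"


def domains_for_range(start: date, days: int) -> List[str]:
    """Compute the domain for each of `days` consecutive dates from `start`."""
    return [domain_for_date(start + timedelta(days=i)) for i in range(days)]


if __name__ == "__main__":
    # Both the infected machine and the collection point can run this
    # independently and agree on the same name for any given day.
    for name in domains_for_range(date(2009, 1, 25), 3):
        print(name)
```

Because the function is deterministic, the infected machine and the collection point never have to communicate in order to agree on a name. The same property cuts both ways: anyone else who recovers the recipe can compute the list too.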

Here's where the computer-security researchers come in. As they describe in Your Botnet is My Botnet: Analysis of a Botnet Takeover (by Brett Stone-Gross, Marco Cova, Lorenzo Cavallaro, Bob Gilbert, Martin Szydlowski, Richard Kemmerer, Chris Kruegel, and Giovanni Vigna, all from UCSB), our intrepid researchers reverse-engineered the method by which dates get turned into domain names. And then they noticed that the botnet owners hadn't yet registered any of the domain names that would be used after Jan 24th, 2009. So they registered three weeks' worth of names, effectively becoming the botnet owners from Jan 25 to Feb 15, 2009.

It's probably worth the time to be very, very clear about what they did:

  • They registered a bunch of domain names. These names were unregistered, and didn't 'belong' to anyone yet.
  • They caused each of these names to 'point' at a webserver, housed off-campus.
  • They waited for the computers subverted by the Torpig botnet to initiate connections to their server. The researchers took no action to cause these connections to occur-- the subverted computers did this all on their own. (Interesting note: the researchers expected these connections to start on January 25, since that was the 'earliest' name they had registered. But they actually started getting connections right away, probably because some of the subverted machines had their clocks incorrectly set.)
  • When a subverted computer contacted the researchers' server, the server acknowledged the connection with an "ok" message, meaning "okay, go ahead".
  • The subverted computer would then send to the server all of the data it had collected since the last such connection.
  • The researchers logged all of this data (in encrypted form).
  • During this whole project, the researchers worked with the FBI in some undisclosed capacity, and made efforts to contact those financial institutions with hijacked accounts. (Apparently this was a very frustrating part of the project. I'm not surprised-- the banks don't have much incentive to make it easy to report fraud.)

In case you're wondering: the researchers really had complete control over all the subverted machines. They could have patched them all (removing the infection) but they did not. Why? They don't mention this, but I suspect that it would have been illegal to actually use the botnet in any way, no matter how good their intentions. But their stated reason is also legitimate: any change to a subverted machine runs the risk of damaging or crashing it, and they didn't want to risk the possibility of crashing some critical machine (such as might be found in a hospital).

Even though the researchers registered names for three weeks, they were themselves shut down after only 10 days. And they have since learned that the botnet was updated in a way that makes it much harder to regain control of it.2 But in those 10 days, they collected 70GB of stolen data. I'm sure you're hungry for details, so here are some from the paper:

  • They were in a much better position to estimate the size of the Torpig botnet than previous researchers. In particular, they believe the botnet to be one-tenth the size previously estimated.
  • They estimated that the botnet grew by over 49,000 new infections over the 10 days they controlled it.
  • In 10 days, they got the credentials to 8,300 financial accounts at 410 different institutions. The biggest target: PayPal, with 1,770 accounts. About 40% of the 8,300 credentials were stolen from the password-managers of web browsers, not actual login sessions.
  • They also got 1,660 credit-card numbers. They usually got one per machine, but there was one case where they got 30 credit-card numbers from a small merchant.
  • They estimate (roughly) that they got stolen data worth $10K to $1M each and every day they had control. (It turns out to be really hard to put a price on stolen data. All told, they estimate the value of the collected data at anywhere from $83K to $8.3 million.)
  • From a scientific point of view, the most interesting aspect of their data concerns passwords. It's really, really hard to collect good data on how people choose and manage their passwords, so this is an extremely valuable opportunity to analyze some real-world data. They collected almost 300K credentials (name/password/service triples), containing 173K individual passwords, from 52.5K machines. They found:
    • 28% of victims re-used name/password pairs across services. This roughly confirms previous estimates.
    • John the Ripper, a password-cracking tool, was able to recover about 40% of the individual passwords in about 75 minutes. (This gives a rough estimate of how weak the average password is.)
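For a sense of how a figure like the re-use rate could be computed from name/password/service triples, here is a small Python sketch. It is not the researchers' code: the victim_id field, the reuse_rate function, and the sample rows are hypothetical, and the definition of "re-use" (the same name/password pair appearing on more than one service for the same victim) is my reading of the bullet above.

```python
from collections import defaultdict
from typing import Iterable, Tuple

# (victim_id, service, username, password); victim_id is a hypothetical
# stand-in for however one tells victims apart.
Credential = Tuple[str, str, str, str]


def reuse_rate(credentials: Iterable[Credential]) -> float:
    """Fraction of victims who used one name/password pair on 2+ services."""
    services_by_pair = defaultdict(set)  # (victim, name, password) -> services
    victims = set()
    for victim, service, username, password in credentials:
        victims.add(victim)
        services_by_pair[(victim, username, password)].add(service)

    # A victim counts as a re-user if any of their pairs spans several services.
    reusers = {
        victim
        for (victim, _name, _pw), services in services_by_pair.items()
        if len(services) > 1
    }
    return len(reusers) / len(victims) if victims else 0.0


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = [
        ("v1", "mail.example", "alice", "hunter2"),
        ("v1", "shop.example", "alice", "hunter2"),   # same pair, new service
        ("v2", "mail.example", "bob", "pw-one"),
        ("v2", "shop.example", "bob", "pw-two"),      # different password
        ("v3", "bank.example", "carol", "s3cret"),
        ("v4", "mail.example", "dave", "letmein"),
    ]
    print(f"{reuse_rate(sample):.0%} of victims re-used a pair")
```

In the made-up sample, one victim out of four re-uses a pair, so the rate is 25%. On real data the hard part is deciding what counts as "the same victim", which this sketch simply assumes away.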

Unfortunately, neither the paper nor I have any context for these numbers (aside from the first point). Are they large? Are they small? I don't know. But they're actually much less interesting to me than how this research was done, and the fact that this research was done at all.

First: how in hell did they get this past their IRB? I'm going to give the researchers the benefit of the doubt here and assume that they did, in fact, consult their IRB. They were collecting human-subjects data as (presumably) employees of the State of California-- and benefiting from a criminal enterprise as well (albeit in an intangible way). I certainly hope they protected themselves by getting IRB approval.

But that just raises another question: what are the ethical guidelines for doing research like this? Was this research ethical to do? Were the researchers ethically obligated to follow up with the relevant financial institutions, as they did? Are they ethically obligated to contact every actual human with data in their logs? The paper does cite another document regarding the ethics of this kind of research3 but I have to report that I've been in this field for over a decade and (like many of my colleagues) have no idea what the ethics are here.

Also, let us ask this: what is the primary scientific contribution of this work? The statistics quoted above are nice, but the real value of this work is that it managed to collect a corpus of data. As mentioned above, it's really hard to get real-life data of any sort in this field, so the 70 GB of actual botnet data collected by these researchers is huge (in many senses of that word). But let me pose this hypothetical: suppose I wanted to do some sort of follow-on study on passwords-- or even to reproduce the result presented by the original researchers. Can I have their corpus of data? Are they obligated to withhold that data to protect the unwitting victims of the botnet? Or are they obligated to release the data in the name of scientific progress and honesty?

Again, I have no idea. But the questions really say something about the field. We have come to realize, over the past few years, that we can no longer ignore the human component of the system. If we want people to use our systems in a secure way, we need to study how humans use computers. (See also my previous post on secret questions.) But if we are going to study humans, then we are becoming a field of social science and we need to recognize that fact. Not only do we need to learn and adopt the research methods of social science (which is another rant of its own) but we need to adopt their code(s) of ethics. They have managed to figure out a workable set of guidelines which allow them to both (a) study private aspects of people's lives and (b) stay on ethically-solid ground. If we computer-security researchers are going to do (a), then we need to follow the social scientists' lead vis-a-vis (b).

  1. My description of Torpig is already greatly simplified. It also has the capacity to update itself, download new modules of malicious content, and so on. It's really a very sophisticated architecture. 

  2. In particular, the subverted machines now not only use the date to compute the name of the central collection point, but also the current day's most popular topic on Twitter. This makes it much, much harder to register domain names in advance like these researchers did. 

  3. A. Burstein. Conducting Cybersecurity Research Legally and Ethically. In USENIX Workshop on Large-Scale Exploits and Emergent Threats, 2008.