A Simple Vulnerability Found in… Lots of Stuff

(Since this blog is read by a lot of poker players, it’s probably a good idea to mention right away that I don’t know of any way that this vulnerability could affect poker client software.)

My friend Thomas Bakker recently showed me an amazing (and hour-long) video from the recent Chaos Communication Congress, which is essentially a conference in which various hackers get together and laugh at the current state of computer security implementations. These types of things are very helpful for people trying to improve security, but of course, they’re also absolutely terrifying. Most of the scary stuff discusses specific vulnerabilities that are unique to a device or piece of software. (Here, for example, is a short video showing a man-in-the-middle attack that works against every iPhone 3G.)

Those things are freaky enough. However, Alexander Klink and Julian Wälde (a penetration tester and a student respectively) demonstrated a much much more general attack. Instead of targeting the specific security implementations of one particular piece of software or device, their attack targets the programming languages that almost all major websites use: PHP, Java, ASP.NET, and Python amongst others. Basically every modern website on the internet uses these, including, for example, Facebook, Twitter, Two Plus Two, Subject: Poker, the New York Times website, YouTube, this blog, Wikipedia, Gmail, whitehouse.gov, and cia.gov.

Before describing the vulnerability, I’ll need to explain two basic concepts: denial-of-service attacks and hash tables. Feel free to skip these next sections if you already know what those are.

Denial of Service

Denial-of-service attacks are a very simple way to take down a website temporarily. The basic idea is this: Any website is run on a computer (or a large network of computers if it’s a large website) that, like your computer, has finite resources; it can only do so many computations in a specific time period. So, if you don’t want people to be able to access a website, a simple strategy is to try to overwhelm its computational resources. If a DoS is successful, the target will be too busy dealing with the attack to do anything else, e.g. to send users content. (This can happen by accident as well; Subject: Poker’s servers have gone down a few times after we published stories that generated particularly large amounts of traffic.)

A common way to do this is to simply send a lot of requests to the target: Get a lot of computers together and have each one access the website a lot of times. You can do this by having a lot of friends with computers (like Anonymous) or by having a botnet of computers infected by a virus that lets you control them (like LulzSec). These types of attacks have taken down tons of websites and are quite common. Some high-profile victims include the CIA and MasterCard and Visa. (Good xkcd comic about the incident)

There’s a less common method, though, that’s more relevant to this discussion: You can try to send the target a much smaller number of requests that take it a very long time to answer. This doesn’t require a botnet or thousands of friends; it just requires a trick to make the target computer(s) do a lot of useless work. For those of you who were around when AOL was incredibly common (i.e., the late nineties, when I was in middle school), you might remember getting sent “punters” that would crash your computer. Most of those worked by exploiting some flaws in AOL to get your computer to run code that would put it in an infinite loop, fill its memory, and/or otherwise overwhelm it. (The ones that just knocked you offline worked differently.) A more famous example was the ping of death, which could crash any computer that was connected to the internet and used a vulnerable operating system.

This second kind of DoS attack doesn’t happen much anymore because software and hardware developers take basic precautions against them.

Hash Tables

Imagine you’re in a really poorly designed library. It has, say, one million unique books organized in the order in which the library got them. You have absolutely no clue what order the library got its books in, so if you want to pick up a copy of Thomas Bakker‘s Analytical No-Limit Hold ’em (You owe me, Thomas. I’m not even an affiliate.), the only thing that you can do is to look at the books one-by-one until you find it. On average, you’ll have to look through 500,000 books if the library has it and all one million books if it doesn’t. So, if it takes you about a tenth of a second to look at each book, you’re likely to be in there for roughly the whole day.

Luckily, libraries tend to organize their books. In an organized library, you might ask someone who works there where such a book would be. You then go to the appropriate section and look for his book amongst only the books in that section. Your time to find Thomas’s book is just the time that it takes you to ask the employee and go to the section plus the time it takes to look through however many books are in that section. No matter how many books are in the library, you won’t have to look very hard as long as the library is broken up into enough sections.

Computers have a similar problem, and programmers use a similar solution–a hash table. I’m not sure how Amazon.com actually organizes its data, but for the sake of sticking with the analogy, let’s pretend they use a hash table. Assuming that this is the case, when you ask Amazon’s server for information about Thomas’s book, it computes a hash function on the title. (This shouldn’t be confused with a cryptographic hash function, which is different in some important ways.) This function takes in any string of abritrary length (like a small number or a book title or the entire contents of the Oxford English Dictionary) and returns a number less than some fixed value that I’ll call k. The hash function acts like the library employee in my analogy–you tell it what you’re looking for and it responds with where it might be. The hash table is a chunk of k slots in the server’s memory that each contains a (possibly empty) list of books. The memory locations are analogous to the library sections–The ith memory location contains a list of books whose hash values equal i.

So, if applying the hash function to “Analytical No-Limit Hold’em” returns the number 42, Amazon’s server will check the 42nd memory location of the hash table. If there’s nothing there, it can be sure that it doesn’t have the book, so it will tell you that. If it finds a list of books in the 42nd memory location, it goes through the list to look for Thomas’s book, and it will either find it and give you the information you were looking for or fail to find it and tell you that it’s not there.

If Amazon has n books that are evenly distributed into k memory locations, then the time to look for a book will be at most the time that it takes to compute the hash function plus the time that it takes to look through a list of n/k books. (It doesn’t take a modern computer any significant amount of time to “go to” a specific memory location, so there’s no equivalent of the time that it takes to walk to a specific section in the library example. This means that we can build huge libraries without having to worry about getting sore feet.) Typical hash tables use very large values of k (often 2^32, or about 4 billion), in which case the time to look for a given book is usually assumed to be independent of the number of books. In other words, the server just computes the hash function and sees if there’s anything in the memory slot. most of the time, a memory location will have at most a few books, so the time to look up a book will typically be less than or equal to the time to compute a hash function plus the time to look through a few books, regardless of how many books Amazon has in its database. Obviously, programmers tend to choose hash functions that are quick to compute, and a computer can certainly check if a few values are equal in a very short amount of time. So, Amazon’s server doesn’t break a sweat looking up a book for you, even though Amazon has a very large library.

The situation is similar for when a book is added to the library: First, the server computes the hash function. Then, it goes through the list of books in the corresponding memory location and checks if it’s the same as the book it’s trying to insert. If it’s equal, it stops and doesn’t add the book. (There’s no point in having a hash table with two different entries for the same thing.) If it gets to the end of the list, it inserts the book there. As long as the list is short, which will almost always be true if the books are evenly distributed and k is large, this will take very little time.

However, if the distribution is not at all even, things fall apart. If it happens that Amazon has, for example, all its books in the same memory location, then we’re back to where we started–Looking for a book in that memory location is like looking for a book in a completely disorganized library. Then, the time that it takes to find a book or add a new one is proportional to the amount of time that it would take the server to look through every book in Amazon’s library. Since Amazon’s got a lot of books, that’s not good.

So, hash functions are designed so that they tend to give roughly even distributions on typical data. Computer scientists say that hash functions are designed to minimize collisions. (Again, this is related to but different from the cryptographic concept of collision resistance.)

(I’ve glossed over two things: In reality, hash tables typically don’t use k slots of memory if they don’t have to. Instead, they often use a much smaller number of memory slots, with each slot representing many different possible hash values. They then increase the number of slots once the table has gotten a certain number of items in it. Also, memory locations don’t actually contain a full list of books, since they have a finite amount of space and the list could be arbitrarily large. Instead, each contains the address to one book (or an empty address if no books have the corresponding hash value), and each book has the address of another book with the same hash value (or an empty address, if it is the last book to have been added with that value). This allows the hash table to contain arbitrarily many books without having to preemptively reserve an arbitrarily large chunk of memory. I’ll ignore these facts in the rest of this post, since they don’t affect my discussion, but I didn’t want to mislead anyone.)

The Vulnerability

It would be an understatement to say that hash tables are ubiquitous. Almost any time a programmer wants the ability to look up something by name or any other property, she uses a hash table. Most modern programming languages will build them automatically. For example, here’s some simple PHP code:

$bookList = array('Thomas Bakker'=>'Analytical No-Limit Hold \'em',
    'Isaac Newton'=>'Philosophiæ Naturalis Principia Mathematica',
    'Kurt Vonnegut'=>'Sirens of Titan') ;
echo $bookList['Thomas Bakker'] ; #prints "Analytical No-Limit Hold 'em"

This code makes a hash table called bookList without forcing me to choose a hash function or to write the code to compute it or anything like that. All of that stuff is under the hood. (In fact, I think it’s fair to say that programmers rarely bother to think of this as a hash table.)

PHP even automatically builds a hash table that stores user-generated data on every webpage that uses PHP–even if the actual webpage doesn’t use it and the programmer didn’t ask for it. This table is called $_POST, and it stores POST data–information that the user has provided to the server. For example, when you fill out a form on a website with, say, your first name and last name, that is typically sent to the server as POST data. The server may then show the values back to you on the next page with this simple PHP code:

echo 'Your name is ' . $_POST['first_name'] . ' ' . $_POST['last_name'] ;

That’s wonderfully convenient, but it’s a huge problem if there’s a flaw in the implementation; then every single website that used PHP (i.e. almost all modern websites) would have the same bug. Klink and Wälde discovered such a flaw.

Their idea is quite simple: First, they figured out the hash function that PHP uses. They then figured out a systematic way to find a massive collision for this hash functions–a large number of strings that get mapped to the same value under the hash function. As it turns happens, PHP’s hash function maps ‘Ez’ and ‘FY’ to the same value, and it also maps any sequence of ‘Ez’s and ‘FY’s of the same length to the same value. For example ‘EzEzEzFY’ and ‘FYFYEzEz’ have the same hash value. This allowed them to easily generate 2^n strings of length 2n that collide. (Just take every non-negative binary integer up to 2^n-1 and replace 0’s with ‘Ez’ and 1’s with ‘FY’. I think in practice they actually had three different two-character strings and generated 3^n colliding strings of length 2n.)

Next, they simply sent those strings as POST data to a website that uses PHP. Even on a website that doesn’t request such data and has no use for it, PHP will dutifully accept it and put it into a hash table. But, since the strings all have the same hash value, they all land in the same bucket. So, every time PHP does this, it computes the hash function, finds that it leads it to a very full bucket, and cycles through all previous strings to make sure that the new string isn’t a duplicate. So, when it receives the ith string, it will check equality $(i-1)$ times before inserting it. If Klink and Wälde send 2^n strings as POST data, the target server will check string equality just under $2^{2n-1}$ times. Once their computer is done generating and sending all those strings, they can simply have it start the process over again, each time causing the server to check string equality $2^{2n-1}$ times.

Effectively, Klink and Wälde are telling the server “Hey, we came here from a page that asked us a lot of questions. Here are our answers: value1 = ‘EzEzEzEzEz….’, value2 = ‘FyEzEzEzEz…’, … value(2^n) = ‘FyFyFyFyFy…’.” PHP responds by storing all this in a hash table in case there happens to be code on the page that uses it. (Note that the 2^n strings are being sent in one request as 2^n distinct pieces of data that all get placed in a hash table. Starting again starts a new request and a new hash table.)

PHP does take some precautions against such things–It limits the amount of time that a single request can take to either 30 or 60 seconds, depending on the implementation, or a server-defined value. That effectively limits the number of strings that an attacker can send in one request. However, Klink and Wälde discovered that, even with the lower default of 30 seconds, a modem of just 100 kbits/second or so was sufficient to continuously overwhelm an Intel i7 processor. In other words, if you run a website off of a decent laptop, someone with a dial-up connection could shut it down indefinitely. This scales linearly, so someone with a 10 Mbit/second connection (e.g., a 4G smartphone in a good location) could shut down a website that’s running on 100 parallel state-of-the-art processors, and someone with a 1 Gbit/second connection could shut down a website running on servers with a total of 10,000 processors.

(I have absolutely no idea how many processors a gigantic website like Facebook and Google have at their disposal, so I’m not sure if this makes it easy to take down all websites or just the vast majority of websites. If anyone knows, please let me know in the comments.)

This problem wasn’t unique to PHP; Klink and Wälde discovered effectively the same vulnerability in Java, Python, ASP.net, v8, and some versions of Ruby. Sites that use at least one of these languages basically get all of the traffic on the web.

The Solution

The correct solution isn’t too hard to come up with: Instead of using the same hash function every single time, languages like PHP could have a large list of distinct hash functions at their disposal and randomly select a hash function from this list for each hash table. In practice, this is quite easy to do: Take any hash function, $hash(\mathrm{abcd})$ , and for each string i in some set of strings, define $hash_i(\mathrm{abcd}) = hash(i\mathrm{abcd})$ . This is called adding a salt. If languages implemented this, then it would be impossible to find collisions without knowing the value of i. The cost for this is totally negligible–The server need only generate one random number when a hash table is created and append a constant prefix before calculating a hash.

Ruby implemented the correct fix for versions that didn’t already have it when Klink and Wälde contacted them, and Perl had used randomized hash functions since 2003. Java refused to patch the vulnerability when contacted, but the very popular server software Tomcat, which runs on Java, implemented a work-around by limiting the number of values that can be sent through POST data in one request. (Does anyone really need to send 2^32 values to a website?) PHP has proposed doing something similar. Klink and Wälde suggest that people who run websites manually make this change. (They say that a 10,000 variable limit is sufficiently small.)

Of course, even if all languages get patched properly, tons of sites will still be vulnerable to this attack. This site, for example, still runs on PHP 5.2.17, a version that is no longer supported and will likely remain permanently susceptible to this simple denial-of-service attack.

Lessons Learned?

I suppose the main lesson to take away from this is that computer security vulnerabilities are everywhere–They’re even embedded in many of our programming languages. Obviously, that’s scary, and it’s important that people like Klink and Wälde, who warned the companies that distribute this software and then made this public, find these things before people who will exploit them.

Actually, one particularly scary fact about this is that K+W didn’t find it first; the vulnerability was discovered in 2003 by Professors Crosby and Wallach of Rice University. They focused on the Perl programming language, and as a result, Perl randomized its hash tables in 2003. What’s terrifying is that only CRuby followed Perl’s example; the companies behind the other major programming languages were either unaware of Crosby and Wallach’s work AND Perl’s patch or they were aware and chose not to do anything. If known vulnerabilities aren’t patched nine years after they’re discovered, then we’re in trouble.

However, there’s also a silver lining here: As far as we know, nobody actually exploited this in spite of the fact that it was known publicly for over eight years and recently became very well-known. We live in a world with tons of security vulnerabilities and tons of really smart people with the will and the ability to exploit them (e.g., LulzSec), but somehow a lot of them don’t get exploited.

This is a strange theme in computer security that I don’t think people address enough, perhaps for fear of waking a sleeping giant. There’s no explanation that I can think of. In particular, law enforcement isn’t stopping anyone–It would be trivially easy to execute this particular attack and many others in a way that’s untraceable. Lack of knowledge doesn’t explain it either, since hackers do many things using complicated and difficult methodology for which there is a better method available that’s less obscure and difficult.

Anyway, that’s today’s rant. If you’d like to read more stuff like this, follow me on Twitter or subscribe to my RSS feed.

Solipsist's Log

Life inside the confused and curious mind of a nerd.