11 Oct 2003 raph   » (Master)

How to defeat Google's PageRank

I've been noticing a lot more evil spam results from Google searches. The easiest way to see them is to try a somewhat dodgy search query, such as "snes pokemon rom". Obviously, I find this interesting, because PageRank is supposed to be an attack-resistant trust metric, just like here at Advogato. If someone has succeeded in attacking it, it would be interesting to find out why.

As far as I was able to figure out, these spam sites use a handful of techniques to achieve high Google ranking. Some are related to PageRank, and then there's the generation of random, Markov-chain text to fake out the relevance scores. For example, the top hit, www.jrcrush.com/pc_pokemon_game/ pokemon/pokemon_snes_rom.asp, shows up with this context:

... true to life heaviness. Another toughest pokemon snes rom passionately downloads the evolution for a battle. When you see an avariciousness ...

But this isn't the result you get when you actually visit the page; it seems to be custom generated just for search engines. I've seen other pages that seem to be dynamically generated based on the query in the referer URL. Serving visitors different content than what is served to search engines causes many problems, not the least of which is that it's the best way to get around Google's otherwise solid policy of not returning porn pages for non-adult searches. I'm no prude, but I don't think the average person searching for "pokemon snes roms" ought to be served porn ads.

But this is just relevance. To get to the top of a search, a site has to have good relevance and a high PageRank score. How did such an obvious spam site achieve such a good score? The answer, not surprisingly, is abuse of DNS. In the case of jrcrush.com, it used to be the web site for the Columbus Crush, a junior hockey league based in Ohio. Then, the domain lapsed and got parked at Go Daddy. Within a few months, a scammer took it over. In the meantime, plenty of pages still link to it, even though the link has rotted. There's also evidence that it was listed directly in the Yahoo directory until recently.
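To see why a lapsed domain is worth taking over, here is a minimal sketch of the textbook PageRank power iteration, not Google's production ranking. The node names, the damping factor of 0.85, and the iteration count are all assumptions chosen for illustration. The point is that the score attaches to the inbound links, so whoever controls the domain inherits it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each node to the list of nodes it links to.
    Nodes without outgoing links spread their score evenly over all nodes.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its score uniformly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three legitimate pages still link to a lapsed domain,
# while a freshly registered spam domain has no inbound links at all.
links = {
    "fan-page": ["old-team-domain"],
    "league-directory": ["old-team-domain"],
    "local-news": ["old-team-domain", "league-directory"],
    "old-team-domain": [],
    "fresh-spam-domain": [],
}

ranks = pagerank(links)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{node:20s} {score:.3f}")
```

In this toy graph the lapsed domain comes out on top, and nothing in the computation depends on what content it serves. A scammer who registers the old name gets that score for free, whereas a fresh domain would have to earn links from scratch.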

Even though Google is showing itself to be vulnerable, the theory of attack resistance is holding up well. According to my analysis, in an attack-resistant system, there should be a near-linear relationship between the "cost" of the attack and the amount of damage done. Quantifying the cost is tricky, of course, because no abstract model will precisely capture real-world cost. The way I do it is to divide up all nodes (in the case of PageRank, a node is roughly equal to a webpage) into bad and otherwise. The latter category is further divided into "good" and "confused". A confused node is one that has a link to a bad node, for whatever reason. My quantification of attack cost is simply the number of confused nodes.
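The cost measure above can be sketched in a few lines. This is only an illustration under assumed names: the graph, the node labels, and the helper functions `classify_nodes` and `attack_cost` are hypothetical, and the graph is stored as a plain dictionary of outgoing links.

```python
def classify_nodes(links, bad):
    """Split nodes into (bad, confused, good).

    links maps each node to the nodes it links to.
    bad is the set of nodes controlled by the attacker.
    A confused node is a non-bad node with at least one link to a bad node.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    bad = set(bad) & nodes
    confused = {
        node
        for node in nodes - bad
        if any(target in bad for target in links.get(node, ()))
    }
    good = nodes - bad - confused
    return bad, confused, good


def attack_cost(links, bad):
    """Attack cost, quantified as the number of confused nodes."""
    _, confused, _ = classify_nodes(links, bad)
    return len(confused)


# Hypothetical graph: two pages link to a team site, one does not.
links = {
    "fan-page": ["team-site"],
    "league-directory": ["team-site", "news"],
    "news": ["league-directory"],
    "team-site": [],
}

print(attack_cost(links, bad=set()))          # no bad nodes yet
print(attack_cost(links, bad={"team-site"}))  # team-site domain taken over
```

With no bad nodes the cost is 0. Once "team-site" changes hands, both pages that link to it become confused without having changed a single byte, so the cost jumps to 2 even though the attacker only did one thing.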

And now we see that by subverting DNS, an attacker can, in one fell swoop, exploit a potentially large number of "confused" nodes. In any situation involving security, the attacker will always go after the most vulnerable link. DNS has many great strengths (without it, URLs, and thus the Web, would have been infinitely more painful), but it sits in a position where all Internet users are forced to trust it, and it has not earned that trust.

There are any number of ways to fix the attack outlined above (and I'm sure Google is working on it), but, long term, the best way is to fix DNS itself. It's clearly broken, but it's not obvious how best to fix it. To me, it's clear that people need to be building research prototypes for a better DNS-like service. Naturally, I think that trust needs to be baked in, but others may have even better ideas.

Another letter quality display

As I've pointed out before, the real movement in high-resolution displays these days is in very small devices. Fujitsu is developing a 250 dpi 4" display, and recently showed a prototype at a Japanese trade show. Still a while before it'll be at your local Fry's, but you can get 216 dpi in Japan now.
