Which led to yet another anti-patent idea: why not let the users deal with the problem? These spam sites are deeply intertwined with each other, so they are in all likelihood fairly easy to tease out topologically: you have a big nest of sites that all link to each other, and link out to real sites, but the nest receives very few links in from outside. Find one such spam site in the network, and it's not that hard to come up with a lot more candidate sites that are likely to be spam.
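A minimal sketch of that expansion step, using a toy link graph standing in for a real crawl (the site names and graph data here are made up for illustration). The spam signal it follows is the reciprocal link: spam nests link to each other, while real sites rarely link back into the nest.

```python
# Toy link graph: site -> set of sites it links out to.
# (Entirely hypothetical data; a real system would use crawl output.)
LINKS = {
    "spam-a": {"spam-b", "spam-c", "real-1"},
    "spam-b": {"spam-a", "spam-c", "real-2"},
    "spam-c": {"spam-a", "spam-b"},
    "real-1": {"real-2"},
    "real-2": {"real-1"},
}

def expand_cluster(seed):
    """Grow a candidate spam cluster from one known spam site by
    following mutual (reciprocal) links - the hallmark of a link nest."""
    cluster = {seed}
    frontier = [seed]
    while frontier:
        site = frontier.pop()
        for neighbor in LINKS.get(site, set()):
            # Only follow the link if the neighbor links back into the nest.
            if neighbor not in cluster and site in LINKS.get(neighbor, set()):
                cluster.add(neighbor)
                frontier.append(neighbor)
    return cluster

def external_inlinks(cluster):
    """Links into the cluster from outside it - a true spam nest has few."""
    return {(src, dst) for src, outs in LINKS.items() if src not in cluster
            for dst in outs if dst in cluster}
```

Seeding with `"spam-a"` pulls in `spam-b` and `spam-c` but not the real sites, and the resulting cluster has no inbound links from outside it, which is exactly the topological signature described above.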
So who finds the spam site? End users, of course. Just employ the same Kill All the Scum instincts that make Wikipedia bearable. People *hate* spammers and abusers, so if you give them the opportunity to say "This site is Spam" or "This site is Real", lots of people will happily do so. And this question is nicely binary: not the vague accuracy-checking of Wikipedia, but an unsubtle "is this site autogenerated garbage, or not?".
Of course, the spammers will immediately begin to abuse the system by putting in false "votes". But again, the meta-level is going to show some distinct topological effects. Real users will tend to agree with other real users, to a *very* high degree of accuracy, I suspect. Spammers will disagree with the real users, at least frequently. (The smart ones will mix real data with junk data, but the junk data should still stand out.) A modest number of spot-checks among those who disagree ought to produce a pretty good Web of Trust from a fairly small amount of work.
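The agreement idea can be sketched in a few lines. The vote data below is invented for illustration: honest raters cluster around the per-site majority, so a voter whose agreement rate falls below some threshold is the one you spend a spot-check on.

```python
from collections import Counter

# Hypothetical ballots: user -> {site: "spam" or "real"}.
VOTES = {
    "alice":   {"s1": "spam", "s2": "real", "s3": "spam"},
    "bob":     {"s1": "spam", "s2": "real", "s3": "spam"},
    "carol":   {"s1": "spam", "s2": "real"},
    "spammer": {"s1": "real", "s2": "real", "s3": "real"},  # junk mixed with real
}

def consensus(votes):
    """Majority label per site across all ballots."""
    tallies = {}
    for ballots in votes.values():
        for site, label in ballots.items():
            tallies.setdefault(site, Counter())[label] += 1
    return {site: c.most_common(1)[0][0] for site, c in tallies.items()}

def agreement(votes):
    """Fraction of each user's votes that match the consensus -
    low scores are the candidates for a manual spot-check."""
    majority = consensus(votes)
    return {user: sum(label == majority[site]
                      for site, label in ballots.items()) / len(ballots)
            for user, ballots in votes.items()}
```

Here the honest users all score 1.0 while the spammer's mixed-in real votes only get him to one in three, so he stands out even though he didn't vote junk across the board.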
There are lots of handwaves in there, of course: working out the actual topological algorithms is real work, and some careful thought would need to be put into the Web of Trust mechanism. But there's nothing in there that shouldn't be implementable on a startup's budget. (Much less Google's, if they wanted to play with it.) I'm actually a bit surprised this mechanism isn't already in use. Or maybe it is, and I just haven't noticed it yet...