Justin du Coeur (jducoeur) wrote,

Cooperative search filtering?

So I was reading an article in Wired (rather old, of course -- I'm even further behind in my magazine reading than in LJ), complaining about the way that Spam Blogs -- "Splogs" -- were going to cause The Death of the Internet (news at 11). Basically, these interlinked webs of junk were beginning to corrupt the search indexes, by filling them up with garbage pages all linked to each other. And the article made a big deal about the way that these pages were designed to look highly real to search engine spiders, despite looking like obvious garbage to users.

Which led to yet another anti-patent idea: so, why not let the users deal with the problem? These spam sites are deeply intertwined with each other, so they are in all likelihood fairly easy to tease out topologically: you have a big nest of sites that all link to each other and link out to real sites, but that have very few links coming into the nest from outside. Find one such spam site in the network, and it's not that hard to come up with a lot more candidate sites that are likely to be spam.
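To make the "tease out topologically" part a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `links` dictionary, the site names, the `find_spam_candidates` function, and the rule itself (follow mutual links outward from a known spam site, then keep only sites that almost nothing outside the nest links to). A real index would need something far more robust.

```python
# Hypothetical link graph: site -> set of sites it links out to.
links = {
    "spam-a": {"spam-b", "spam-c", "real-news"},
    "spam-b": {"spam-a", "spam-c", "real-shop"},
    "spam-c": {"spam-a", "spam-b"},
    "real-news": {"real-shop"},
    "real-shop": {"real-news"},
    "real-blog": {"real-news", "real-shop"},
}


def find_spam_candidates(links, seed, max_outside_inlinks=1):
    """Return sites that look like part of the same nest as `seed`.

    Assumed heuristic: grow a cluster along mutual links (A links to B and
    B links back to A), then keep members with at most
    `max_outside_inlinks` inbound links from outside the cluster.
    """
    # Reverse index: site -> set of sites that link to it.
    inbound = {}
    for src, dsts in links.items():
        for dst in dsts:
            inbound.setdefault(dst, set()).add(src)

    # Grow the cluster from the seed, following mutual links only.
    cluster = {seed}
    frontier = [seed]
    while frontier:
        site = frontier.pop()
        for other in links.get(site, set()):
            if other in cluster:
                continue
            if site not in links.get(other, set()):
                continue  # one-way link, e.g. spam pointing at a real site
            cluster.add(other)
            frontier.append(other)

    # Keep members that few outsiders link to; the seed itself is excluded.
    candidates = set()
    for site in cluster - {seed}:
        outside = inbound.get(site, set()) - cluster
        if len(outside) <= max_outside_inlinks:
            candidates.add(site)
    return candidates


print(sorted(find_spam_candidates(links, "spam-a")))
```

In this toy graph, seeding with one spam site pulls in the rest of the nest, while seeding with a real site yields nothing, because real sites are linked to from outside their own little cluster.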

So who finds the spam site? End users, of course. Just employ the same Kill All the Scum instincts that make Wikipedia bearable. People *hate* spammers and abusers, so if you give them the opportunity to say "This site is Spam" or "This site is Real", lots of people will happily do so. And this question is nicely binary: not the vague accuracy-checking of Wikipedia, but an unsubtle "is this site autogenerated garbage, or not?".

Of course, the spammers will immediately begin to abuse the system, by putting in false "votes". But again, the meta-level is going to show some distinct topological effects. Real users will tend to agree with other real users, to a *very* high degree of accuracy I suspect. Spammers will disagree with the real users, at least frequently. (The smart ones will mix real data with junk data, but the junk data should still stand out.) A modest number of spot-checks among those who disagree ought to produce a pretty good Web of Trust from a fairly small amount of work.
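The "real users agree with real users" effect can be sketched the same way. This is only a toy under stated assumptions: the `votes` structure, the voter names, the simple majority as a stand-in for consensus, and the 0.8 threshold are all made up for illustration, and a proper Web of Trust would need to be much harder to game than a raw majority.

```python
from collections import Counter

# Hypothetical votes: voter -> {site: True if the voter says "Spam"}.
votes = {
    "alice": {"spam-a": True, "spam-b": True, "real-news": False},
    "bob": {"spam-a": True, "spam-b": True, "real-news": False},
    "carol": {"spam-a": True, "real-news": False},
    "mallory": {"spam-a": False, "spam-b": False, "real-news": False},
}


def agreement_scores(votes):
    """Return each voter's fraction of votes that match the majority."""
    # Tally "Spam" / "Real" votes per site.
    tally = {}
    for ballot in votes.values():
        for site, is_spam in ballot.items():
            tally.setdefault(site, Counter())[is_spam] += 1

    # Majority verdict per site; a tie counts as "Real" in this sketch.
    consensus = {site: c[True] > c[False] for site, c in tally.items()}

    scores = {}
    for voter, ballot in votes.items():
        if not ballot:
            continue  # no votes, nothing to score
        agreed = sum(1 for site, v in ballot.items() if consensus[site] == v)
        scores[voter] = agreed / len(ballot)
    return scores


def voters_to_spot_check(votes, threshold=0.8):
    """Voters who disagree with the majority often enough to merit a look."""
    scores = agreement_scores(votes)
    return sorted(voter for voter, s in scores.items() if s < threshold)


print(voters_to_spot_check(votes))
```

Note that the voter who mixes one honest vote in with the junk still scores low overall, which is the point: only the handful of low scorers need a human spot-check, rather than every vote.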

There are lots of handwaves in there, of course: working out the actual topological algorithms is real work, and some careful thought would need to be put into the Web of Trust mechanism. But there's nothing in there that shouldn't be implementable on a startup's budget. (Much less Google's, if they wanted to play with it.) I'm actually a bit surprised this mechanism isn't already in use. Or maybe it is, and I just haven't noticed it yet...
Tags: ideas, technology