Justin du Coeur (jducoeur) wrote,

Robots, robots everywhere

Got a startling email from my ISP this morning, saying that jducoeur.org was running close to its bandwidth limits for the month, and did we want to upgrade? Now granted, it's a very old plan, and the bandwidth limits are modest (10 GB/month), but still -- the site's not exactly the center of the universe. What's going on here?

So I downloaded a log, and I'm doing some spot-checking. Far as I can tell, the problem is simply that one of my intuitive expectations about the Web is now wrong.
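That kind of spot-checking is easy to script. A minimal sketch, assuming an Apache-style "combined" log format, where the response size is followed by the quoted referer and user-agent fields at the end of each line (the log path in the usage comment is hypothetical):

```python
import re
from collections import Counter

# In combined log format, each line ends with:
#   <bytes-sent> "<referer>" "<user-agent>"
# where <bytes-sent> may be "-" for responses with no body.
LINE_RE = re.compile(r'(?P<size>\d+|-) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$')

def tally_agents(lines):
    """Sum bytes served per user-agent string."""
    bytes_by_agent = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that don't match the expected format
        size = m.group('size')
        bytes_by_agent[m.group('agent')] += 0 if size == '-' else int(size)
    return bytes_by_agent

# Usage (path is hypothetical):
# with open('/var/log/apache2/access.log') as f:
#     for agent, nbytes in tally_agents(f).most_common(10):
#         print(f'{nbytes:>12}  {agent}')
```

Sorting the tally by bytes rather than by hit count is what makes the bandwidth hogs, as opposed to merely the chattiest bots, stand out.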

The thing is, our site is crammed full of old stuff. Why not? Most of it is things that were interesting briefly, but have long since turned into simple historical curiosities: old versions of the Carolingian website, wikis and websites for old LARPs, stuff like that. Since most of it is of only slight current interest, it should be taking up (cheap) disk space, but no significant bandwidth.

The problem, though, is all those searchbots. The site is *old*, reasonably well-linked, and therefore known to all the bots. So the result is that *every* searchbot appears to be hitting *every* page on the site, pretty frequently. This utterly swamps all other traffic. In the log I'm looking at, "Yahoo! Slurp" is the single worst offender, clicking on everything in every wiki page. But lots of engines are showing up -- some from MSN (although they're actually more measured than most) and Ask Jeeves, lots from Yanga and Baidu and searchme and other lesser engines, even a few I've never heard of like Kosmix. And of course, the expected Googlebot.

The result, of course, is that the total bandwidth for the site is an m*n*o operation, where m is the total size of the site, n is the number of bots out there, and o is the frequency with which each of them re-crawls. Of those terms, m is the only one I control directly. The multiplication of search engines out there -- the n term -- is really the heart of the problem. When it was just Google hitting us every now and then, it was a non-issue, but with robots swarming all over now, they are dragging us down.
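It's easy to put back-of-the-envelope numbers to that multiplication. All the figures below are hypothetical, just to show how quickly it adds up:

```python
def monthly_bot_bandwidth(site_size_gb, num_bots, crawls_per_month):
    """Worst case: every bot re-fetches the whole site on every crawl."""
    return site_size_gb * num_bots * crawls_per_month

# A quarter-gigabyte site, 10 active bots, each re-crawling 4 times
# a month: that alone is the entire 10 GB monthly allowance.
print(monthly_bot_bandwidth(0.25, 10, 4))  # -> 10.0
```

The point of writing it out is that doubling the number of bots or the crawl frequency doubles the bill, even if the site itself never changes.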

(I do find myself curious what fraction of the Web's total traffic is now bots. I suspect it's a small fraction for the big sites, but an enormous one for smaller sites like ours -- and sites like ours make up most of the Web.)

Not quite sure offhand what I'm going to do about this. It's not quite a crisis yet, but it's not far off: the robots are driving us to about 80% of our limits. I do want to show up on the search sites, so I'm leery of using robots.txt in too blunt a way, but I'd really like to throttle these damned things down a bit. What I don't know yet is whether there is an easy way to do so...
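One less-blunt option than blocking bots outright is the Crawl-delay directive in robots.txt. It's an unofficial extension, but Yahoo's Slurp, msnbot, and Ask's crawler all honor it; Googlebot ignores it (Google sets crawl rate through its Webmaster Tools instead). A sketch, with the delay values picked arbitrarily:

```
# Ask the big non-Google crawlers to wait between requests
User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10

# Engines that send no useful traffic can be shut out entirely
User-agent: Baiduspider
Disallow: /
```

This keeps the site in the indexes that matter while cutting the re-crawl frequency -- the o term -- for the worst offenders.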
Tags: technology
