As if I didn't hate domain thieves enough already...
During today's massive update of the Period Games Homepage, I'm discovering a new horror. Many of the sites I point to are now dead, which isn't a surprise. Many of them have been taken over by domain thieves, which also isn't a surprise.

What *is* a surprise is that many of those thieves have turned on robots.txt files that wind up blocking the Wayback Machine from producing results: it appears that archive.org respects robots.txt a little *too* much. The result is that a large number of useful pages are just plain inaccessible -- I can't even get at their archived versions. Grr...

(BTW, time for another reminder that archive.org is one of the most important and unsung sites on the Web -- the Wayback Machine is the only really good archive of the Web's history, and is often invaluable. I've given them another donation today...)

"it appears that archive.org respects robots.txt a little *too* much"

There's a little about that in Wikipedia.

*Sigh*; I was hoping I was misinterpreting, but that exactly matches the behaviour I observed. I can even understand it from a legal perspective: archive.org never has enough money, so lawsuits are undoubtedly hard for them to deal with. But it does mean that these domain thieves can do even more damage than usual. (Probably intentionally, since their goal is often to blackmail the previous site owner.)

On a moral and ethical level, it clearly is *not* appropriate to respect robots.txt in this case, and even on a legal level it's probably clear. I do wonder if there's a practical way to recognize this case appropriately without falling prey to the legal danger...

They might be able to do it by looking at the domain's whois data. Hard to say just what algorithm they could use, though; the simple ones I've thought of all have edge cases where they'd incorrectly believe the domain had changed hands. For example, if the domain lapses and gets hijacked, the domain creation date will reset, so don't apply robots.txt to anything before the creation date -- but the same might happen if the domain lapses and then the original owner recreates it.
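
For what it's worth, here is a rough Python sketch of that creation-date rule. It assumes the whois creation date has already been looked up some other way, and every name in it (`robots_applies`, `visible_snapshots`, the sample dates) is made up for illustration; it is not anything archive.org is known to do.

```python
from datetime import date


def robots_applies(snapshot_date: date, domain_created: date) -> bool:
    """True if the site's current robots.txt should govern this snapshot.

    Assumption: domain_created is the creation date from the current whois
    record, fetched elsewhere. Snapshots older than that are treated as
    belonging to a previous registrant.
    """
    return snapshot_date >= domain_created


def visible_snapshots(snapshots, domain_created, robots_blocks_archive):
    """Filter snapshot dates down to the ones the archive would still show."""
    return [
        d for d in snapshots
        if not (robots_blocks_archive and robots_applies(d, domain_created))
    ]


# Hypothetical history: two captures before the domain was re-registered,
# one capture after.
snapshots = [date(2003, 5, 1), date(2006, 8, 15), date(2009, 2, 10)]
created = date(2008, 1, 20)  # hypothetical creation date from whois

print(visible_snapshots(snapshots, created, robots_blocks_archive=True))
```

With a blocking robots.txt in place, only the captures from before the creation date stay visible. The sketch has exactly the edge case mentioned above: if the original owner lets the domain lapse and then registers it again, their own robots.txt would stop covering their older pages.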
