
I am new to web crawling and have been testing my crawlers on various sites. I forgot about the robots.txt file during my tests.

I just want to know what will happen if I don't follow the robots.txt file, and what is the safe way to crawl?

– user1858027
  • You don't want to get on the bad side of a sysadmin looking after a site. It's not difficult to spot a crawler and feed it crap. – Martin York Dec 20 '12 at 18:22

3 Answers


The Robots Exclusion Standard is purely advisory; it's completely up to you whether you follow it, and if you aren't doing anything nasty, chances are nothing will happen if you choose to ignore it.

That said, when I catch crawlers not respecting robots.txt on the various websites I support, I go out of my way to block them, regardless of whether they are troublesome or not. Even legitimate crawlers can bring a site to a halt with too many requests to resources that weren't designed to handle crawling. I'd strongly advise you to reconsider and adjust your crawler to fully respect robots.txt, as in the sketch below.
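
A minimal sketch of the polite approach in Python, using the standard library's `urllib.robotparser`. The site URL, example paths, and user-agent string are placeholders, not anything taken from this answer:

```python
import time
import urllib.robotparser
from urllib.request import Request, urlopen

USER_AGENT = "MyTestCrawler/0.1"  # placeholder: identify your bot honestly

# Fetch and parse the site's robots.txt once per host.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Honor Crawl-delay if the site declares one; otherwise pick a modest default.
delay = rp.crawl_delay(USER_AGENT) or 1.0

for url in ["https://example.com/", "https://example.com/private/page"]:
    if rp.can_fetch(USER_AGENT, url):
        page = urlopen(Request(url, headers={"User-Agent": USER_AGENT}))
        # ... process page.read() here ...
        time.sleep(delay)  # throttle so the server isn't hammered
    else:
        print("robots.txt disallows", url)  # skip it
```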

– yannis

Most sites won't impose any repercussions.

However, some sites set up crawler traps: links hidden from normal users but plainly visible to crawlers.

These traps can get your crawler's IP address blocked, or do just about anything else to thwart it. A rough way to dodge the most naive traps is sketched below.
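
For illustration only, here is a heuristic sketch in Python (it assumes BeautifulSoup, installed via `pip install beautifulsoup4`) that skips anchors hidden with inline CSS, one common way trap links are concealed. Real traps also use CSS classes, off-screen positioning, or JavaScript, so this is far from a complete defence:

```python
from bs4 import BeautifulSoup

def visible_links(html: str) -> list[str]:
    """Collect hrefs whose anchor (and its ancestors) aren't hidden by inline CSS."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        node, hidden = a, False
        # Walk up the tree looking for inline styles that hide the element.
        while node is not None and hasattr(node, "get"):
            style = (node.get("style") or "").replace(" ", "").lower()
            if "display:none" in style or "visibility:hidden" in style:
                hidden = True  # a human visitor would never see this link
                break
            node = node.parent
        if not hidden:
            links.append(a["href"])
    return links
```

Respecting robots.txt also helps here: trap links are frequently placed under paths the site explicitly disallows, precisely to catch crawlers that ignore the file.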

– ratchet freak

There are no legal repercussions that I'm aware of. If a webmaster notices you crawling pages that they told you not to crawl, they might contact you and tell you to stop, or even block your IP address, but that's a rare occurrence. It's possible that one day new laws will add legal sanctions, but I don't think this will become a very big factor. So far, internet culture has preferred to solve things technically, with "rough consensus and running code", rather than asking lawmakers to step in. It would also be questionable whether any law could work very well, given the international nature of IP connections.

(In fact, my own country is in the process of creating new legislation specifically targeted at Google for re-publishing snippets of online news! The newspapers could easily bar Google from spidering them via robots.txt, but that's not what they want: they want to be crawled, because that brings page hits and ad money; they just want Google to pay them royalties on top. So you see, sometimes even serious, money-grubbing businesses are more upset about not being crawled than about being crawled.)

– Kilian Foth
  • Be careful with legal advice like this: it's *certainly* not applicable all around the world. In some places this behaviour might easily fall under an "unauthorized use of computer resources" law (if it's also on a big enough scale, for example). – Joachim Sauer Dec 20 '12 at 08:28
  • You are completely right, you *can* be sued for doing it. But I have grown tired of "this is not legal advice" disclaimers, because the theory and practice of law about online things have both become so insane that no one can understand them anymore, *not even* experts. Technically, where I live you could be indicted just for possessing an `nmap` binary. And anybody can be sued today for anything, and will lose unless they have enough money to defend themselves. Therefore I think that knowing what is commonly to be expected is now much more important than what is theoretically legal or not. – Kilian Foth Dec 20 '12 at 21:46