84

I'm going to be developing some functionality that will crawl various public web sites and process/aggregate the data on them. Nothing sinister like looking for e-mail addresses - in fact it's something that might actually drive additional traffic to their sites. But I digress.

Other than honouring robots.txt, are there any rules or guidelines, written or unwritten, that I ought to be following in order to (a) avoid appearing malicious and potentially being banned, and (b) not cause any problems for the site owners/webmasters?

Some examples I can think of which may or may not matter:

  • Number of parallel requests
  • Time between requests
  • Time between entire crawls
  • Avoiding potentially destructive links (don't want to be the Spider of Doom - but who knows if this is even practical)

That's really just spit-balling, though; is there any tried-and-tested wisdom out there that's broadly applicable for anybody who intends to write or utilize a spider?

dimo414
Aaronaught
    While the responses below provide a great answer on how to respectfully crawl content, please bear in mind acceptable use of said content once you have crawled it. Republishing it, either in full or in part, may be a violation of the owner's copyright. – Gavin Coates Jun 03 '15 at 10:25

7 Answers

85

Besides obeying robots.txt, obey nofollow and noindex in <meta> elements and links:

  • There are many who believe robots.txt is not the proper way to block indexing, and because of that viewpoint they have instructed many site owners to rely on the <meta name="robots" content="noindex"> tag to tell web crawlers not to index a page.

  • If you're trying to make a graph of connections between websites (anything similar to PageRank), rel="nofollow" on a link (and <meta name="robots" content="nofollow"> at the page level) is supposed to indicate that the source site doesn't trust the destination site enough to give it a proper endorsement. So while you can index the destination site, you ought not store the relation between the two sites. (A parsing sketch covering both directives follows this list.)
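
A minimal sketch of how both directives might be checked before indexing a page or recording a link edge, using Python with the BeautifulSoup library; the function names are illustrative, not part of any standard:

    from bs4 import BeautifulSoup

    def robots_directives(html):
        """Return the directives from <meta name="robots">, lower-cased."""
        soup = BeautifulSoup(html, "html.parser")
        meta = soup.find("meta", attrs={"name": "robots"})
        if meta is None:
            return set()
        return {token.strip().lower() for token in meta.get("content", "").split(",")}

    def outgoing_links(html):
        """Yield (href, endorsed) pairs; endorsed is False for rel="nofollow"
        links or when the page carries a page-level nofollow directive."""
        soup = BeautifulSoup(html, "html.parser")
        page_nofollow = "nofollow" in robots_directives(html)
        for a in soup.find_all("a", href=True):
            link_nofollow = "nofollow" in [r.lower() for r in (a.get("rel") or [])]
            yield a["href"], not (page_nofollow or link_nofollow)

    # Skip indexing entirely when "noindex" is present, and skip recording the
    # source -> destination edge whenever endorsed is False.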

SEO is more of an art than a real science, and it's practiced by a lot of people who know what they're doing, and a lot of people who read the executive summaries of people who know what they're doing. You're going to run into issues where you'll get blocked from sites for doing things that other sites found perfectly acceptable due to some rule someone overheard or read in a blog post on SEOmoz that may or may not be interpreted correctly.

Because of that human element, unless you are Google, Microsoft, or Yahoo!, you are presumed malicious unless proven otherwise. You need to take extra care to act as though you are no threat to a web site owner, and act in accordance with how you would want a potentially malicious (but hopefully benign) crawler to act:

  • stop crawling a site once you detect you're being blocked: 403/401s on pages you know work, throttling, time-outs, etc.
  • avoid exhaustive crawls in relatively short periods of time: crawl a portion of the site, and come back later on (a few days later) to crawl another portion. Don't make parallel requests.
  • avoid crawling potentially sensitive areas: URLs with /admin/ in them, for example (a sketch of these checks follows this list)
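
A sketch of what those checks could look like around a single fetch, in Python with the requests library; the status codes and the /admin/ filter are just the examples above, and the give-up threshold is an arbitrary assumption:

    from urllib.parse import urlparse

    import requests

    BLOCK_STATUSES = {401, 403, 429}             # signs that the site is pushing back
    SENSITIVE_PATH = "/admin/"                   # just the example from above

    def polite_fetch(session, url, refusals, give_up_after=3):
        """Fetch one URL, or return None if the URL looks sensitive or the host
        appears to be blocking us. give_up_after is an arbitrary threshold."""
        if SENSITIVE_PATH in url.lower():
            return None                          # don't poke at admin areas at all
        host = urlparse(url).netloc
        if refusals.get(host, 0) >= give_up_after:
            return None                          # this host keeps refusing us: stop
        try:
            response = session.get(url, timeout=10)
        except requests.Timeout:
            refusals[host] = refusals.get(host, 0) + 1
            return None
        if response.status_code in BLOCK_STATUSES:
            refusals[host] = refusals.get(host, 0) + 1
            return None
        return response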

Even then, it's going to be an uphill battle unless you resort to black-hat techniques like UA spoofing or purposely masking your crawling patterns: many site owners, for the same reasons above, will block an unknown crawler on sight rather than take the chance that its operator isn't trying to "hack their site". Prepare for a lot of failure.

One thing you could do to combat the negative image an unknown crawler is going to have is to make it clear in your user-agent string who you are:

Aarobot Crawler 0.9 created by John Doe. See http://example.com/aarobot.html for more information.

Where http://example.com/aarobot.html explains what you're trying to accomplish and why you're not a threat. That page should have a few things:

  • Information on how to contact you directly
  • Information about what the crawler collects and why it's collecting it
  • Information on how to opt-out and have any data collected deleted

That last one is key: a good opt-out is like a Money Back Guarantee™ and scores an unreasonable amount of goodwill. It should be humane: one simple step (either an email address or, ideally, a form) and comprehensive (there shouldn't be any "gotchas": opt-out means you stop crawling without exception).
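
Declaring that identity in code is a one-liner; a minimal sketch with Python's requests library, reusing the placeholder identity string and URL from above:

    import requests

    session = requests.Session()
    session.headers["User-Agent"] = (
        "Aarobot Crawler 0.9 created by John Doe. "
        "See http://example.com/aarobot.html for more information."
    )
    # Every request made through this session now carries the identifying string,
    # so a site owner checking their logs can find out who you are before
    # deciding whether to block you.
    response = session.get("http://example.com/")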

32

While this doesn't answer all of your questions, I believe it will be of help to you and to the sites you crawl.

Similar to the technique used to brute-force websites without drawing attention: if you have a large enough pool of sites to crawl, don't crawl the next page on a site until you have crawled the next page of all the other sites. Well, modern servers allow HTTP connection reuse, so you might want to fetch more than one page per connection to minimise overhead, but the idea still stands. Do not crawl one site to exhaustion before moving on to the next. Share the love.

At the end of the day you can still have crawled just as many pages, but the average bandwidth usage on any single site will be much lower.
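
A sketch of that interleaving, assuming each site's pending pages sit in their own queue; the data structures here are illustrative, not the only way to do it:

    from collections import deque

    def round_robin(site_queues):
        """Yield URLs one site at a time (one page from each host per pass),
        so no single host absorbs a burst of requests."""
        rotation = deque(site_queues.items())        # {host: deque of pending URLs}
        while rotation:
            host, urls = rotation.popleft()
            if urls:
                yield urls.popleft()
                rotation.append((host, urls))        # rotate back only if work remains

    # Usage:
    # queues = {"example.org": deque([...]), "example.net": deque([...])}
    # for url in round_robin(queues):
    #     fetch(url)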

If you want to avoid being the spider of doom, there is no sure-fire method. If someone wants to stick beans up their nose, they will, and they'll probably do so in ways you could never predict. Having said that, if you don't mind missing the occasional valid page, keep a blacklist of words that stop you from following a link that contains them. For example:

  • Delete
  • Remove
  • Update
  • Edit
  • Modify

Not foolproof, but sometimes you just cannot prevent people from having to learn the hard way ;)
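
A sketch of that filter; the word list is only the examples above, and checking both the URL and the link text is an assumption about how destructive links tend to be labelled:

    DESTRUCTIVE_WORDS = {"delete", "remove", "update", "edit", "modify"}

    def looks_destructive(href, link_text=""):
        """Crude guard: skip links whose URL or anchor text mentions a
        state-changing action. Not foolproof, as noted above."""
        haystack = (href + " " + link_text).lower()
        return any(word in haystack for word in DESTRUCTIVE_WORDS)

    # Only enqueue a link when looks_destructive(href, text) is False.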

Dan McGrath
20

My one bit of advice is to listen to what the website you are crawling is telling you, and dynamically change your crawl in reaction to that.

  • Is the site slow? Crawl slower so you don't DDOS it. Is it fast? Crawl a bit more, then!

  • Is the site erroring? Crawl less so you're not stressing out a site already under duress. Use exponentially increasing retry times, so you retry less the longer the site keeps erroring (see the sketch after this list). But remember to come back later, eventually, so you can pick up anything you missed due to, say, a week-long error on a specific URL path.

  • Getting lots of 404s? (remember, our fancy 404 pages take server time too!) Avoid crawling further URLs with that path for now, as perhaps everything there is missing; if file001.html through file005.html are not there, I bet you dollars to donuts file999.html isn't either! Or perhaps turn down the percentage of time you retrieve anything in that path.
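
One way to wire those reactions together is a per-host delay that shrinks while responses are fast and clean and grows exponentially on errors; a sketch in Python, where all the constants are arbitrary assumptions rather than recommendations:

    import time

    class AdaptiveDelay:
        """Per-host politeness delay that reacts to what the server tells us."""

        def __init__(self, base=1.0, minimum=0.5, maximum=600.0):
            self.delay = base
            self.minimum = minimum
            self.maximum = maximum

        def record(self, status_code, elapsed_seconds):
            if status_code >= 500:
                self.delay = min(self.delay * 2, self.maximum)    # error: back off exponentially
            elif elapsed_seconds > 2.0:
                self.delay = min(self.delay * 1.5, self.maximum)  # slow responses: ease off
            else:
                self.delay = max(self.delay * 0.9, self.minimum)  # healthy: speed up a little

        def wait(self):
            time.sleep(self.delay)

    # Call record() after every response and wait() before the next request
    # to the same host.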

I think this is where a lot of naive crawlers go deeply wrong, by having one robotic strategy that they execute the same regardless of the signals they're getting back from the target site.

A smart crawler is reactive to the target site(s) it is touching.

Jeff Atwood
19

Others have mentioned some of the mantras; let me add a few more.

Pay attention to file type and size; don't pull huge binaries.

Optimize for the typical webserver "directory listing" pages. In particular, they allow sorting by size, date, name, permissions, and so on; don't treat each sort order as a separate root for crawling.
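
A sketch of one way to canonicalise such URLs, assuming Apache-style mod_autoindex listings where the sort order travels in C= and O= query parameters; other servers use different parameters, so treat the pattern as an example only:

    import re
    from urllib.parse import urlsplit, urlunsplit

    # Apache mod_autoindex sort links look like "?C=N;O=D" (column and order);
    # this pattern is specific to that server and assumed here, adjust for others.
    SORT_QUERY = re.compile(r"^C=[NMSD](?:;|&)O=[AD]$", re.IGNORECASE)

    def canonical_listing_url(url):
        """Map every sort order of a directory listing back to the bare listing URL."""
        parts = urlsplit(url)
        if SORT_QUERY.match(parts.query):
            parts = parts._replace(query="")
        return urlunsplit(parts)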

Ask for gzip (compression on the fly) whenever available.

Limit depth or detect recursion (or both).

Limit page size. Some pages implement tarpits to thwart email-scraping bots: a page that loads at a snail's pace and is terabytes long.
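
A sketch covering both this tip and the file-type one above, in Python with requests: stream the body and abandon anything that isn't HTML or grows past a cap. The one-megabyte cap is an arbitrary assumption.

    import requests

    MAX_BYTES = 1_000_000                 # arbitrary cap; tune it to what you actually need

    def fetch_html(session, url):
        """Download a page, but bail out on non-HTML content and on
        oversized or tarpit-style bodies."""
        with session.get(url, stream=True, timeout=10) as response:
            if "text/html" not in response.headers.get("Content-Type", ""):
                return None               # skip binaries and other non-HTML types
            chunks, total = [], 0
            for chunk in response.iter_content(chunk_size=8192):
                total += len(chunk)
                if total > MAX_BYTES:
                    return None           # tarpit or runaway page: give up
                chunks.append(chunk)
            return b"".join(chunks)

(requests also sends Accept-Encoding: gzip, deflate on every request by default, which covers the compression suggestion above.)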

Do not index 404 pages. Engines that boast the biggest indexes do this, and receive well-deserved hate in exchange.

This may be tricky, but try to detect load-balancing farms. If v329.host.com/pages/article.php?99999 returns the same as v132.host.com/pages/article.php?99999, don't scrape the complete list of servers from v001.host.com up to v999.host.com.
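
A content-hash check is one hedged way to avoid re-scraping the same article from every server in such a farm; it catches any byte-identical duplicate, not just load-balanced mirrors:

    import hashlib

    seen_bodies = {}                      # body hash -> URL where it was first seen

    def is_duplicate(url, body):
        """Skip bodies already seen under another URL, e.g. the same article
        served by both v132.host.com and v329.host.com."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen_bodies:
            return True
        seen_bodies[digest] = url
        return False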

SF.
4

I'll just add one little thing.

Copyright & other legal issues: I know you write they are public websites, so there might not be any copyright, but there might be other legal issues with storing the data.

This will of course depend on which country's data you are storing (and where you are storing it). As a case in point, consider the problems with the US Patriot Act versus the EU's Data Protection Directive. An executive summary of the problem is that US companies have to hand their data to e.g. the FBI if asked, without informing the users, whereas the Data Protection Directive states that users have to be informed of this. See http://www.itworld.com/government/179977/eu-upset-microsoft-warning-about-us-access-eu-cloud

Holger
    "I know you write they are public websites, so there might not be any copyright". Every website on the internet is public, and every website is copyright, unless it explicitly states otherwise. – Gavin Coates Jun 03 '15 at 10:23
3

Call your web crawler either "crawler" or "spider" as part of its name, along with your own name. This is important: analytics engines and the like look for those terms to classify you as a ... spider. ;)

The way I've seen that done is via the User-Agent request header.

jcolebrand
2
  • Preserve cookies, when required, to prevent the web site from creating unnecessary sessions.
  • Implement link-parsing behaviour as close to a browser's as possible; our live site reports a lot of 404s due to bot requests for missing files. (A sketch of both follows.)
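
A small sketch of both points with Python's requests and BeautifulSoup: a Session object keeps cookies across requests, and resolving relative links against the page's final URL (and any <base href>) mimics what a browser does. The helper name is illustrative.

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()          # cookies set by the site are sent back,
                                          # so it doesn't open a new session per hit

    def browser_like_links(page_url):
        response = session.get(page_url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        # Resolve hrefs the way a browser would: against the final (post-redirect)
        # URL and any <base href> the page declares.
        base = soup.find("base", href=True)
        base_url = urljoin(response.url, base["href"]) if base else response.url
        return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
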
Valera Kolupaev