84

I'm going to be developing some functionality that will crawl various public web sites and process/aggregate the data on them. Nothing sinister like looking for e-mail addresses - in fact it's something that might actually drive additional traffic to their sites. But I digress.

Other than honouring robots.txt, are there any rules or guidelines, written or unwritten, that I ought to be following in order to (a) avoid appearing malicious and potentially being banned, and (b) not cause any problems for the site owners/webmasters?

Some examples I can think of which may or may not matter:

  • Number of parallel requests
  • Time between requests
  • Time between entire crawls
  • Avoiding potentially destructive links (don't want to be the Spider of Doom - but who knows if this is even practical)

That's really just spit-balling, though; is there any tried-and-tested wisdom out there that's broadly applicable for anybody who intends to write or utilize a spider?

dimo414
Aaronaught
    While the responses below provide a great answer on how to respectfully crawl content, please bear in mind acceptable use of said content once you have crawled it. Republishing it, either in full or in part, may be a violation of the owner's copyright. – Gavin Coates Jun 03 '15 at 10:25

7 Answers

85

Besides obeying robots.txt, obey nofollow and noindex in <meta> elements and links:

  • There are many who believe robots.txt is not the proper way to block indexing, and because of that viewpoint they have instructed many site owners to rely on the <meta name="robots" content="noindex"> tag to tell web crawlers not to index a page.

  • If you're trying to make a graph of connections between websites (anything similar to PageRank), rel="nofollow" on a link (and <meta name="robots" content="nofollow"> at the page level) is supposed to indicate that the source site doesn't trust the destination site enough to give it a proper endorsement. So while you can index the destination site, you ought not store the relation between the two sites. (A parsing sketch covering both directives follows this list.)
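
A minimal sketch of how both directives might be checked before indexing a page or recording a link edge, using Python with the BeautifulSoup library; the function names are illustrative, not part of any standard:

    from bs4 import BeautifulSoup

    def robots_directives(html):
        """Return the directives from <meta name="robots">, lower-cased."""
        soup = BeautifulSoup(html, "html.parser")
        meta = soup.find("meta", attrs={"name": "robots"})
        if meta is None:
            return set()
        return {token.strip().lower() for token in meta.get("content", "").split(",")}

    def outgoing_links(html):
        """Yield (href, endorsed) pairs; endorsed is False for rel="nofollow"
        links or when the page carries a page-level nofollow directive."""
        soup = BeautifulSoup(html, "html.parser")
        page_nofollow = "nofollow" in robots_directives(html)
        for a in soup.find_all("a", href=True):
            link_nofollow = "nofollow" in [r.lower() for r in (a.get("rel") or [])]
            yield a["href"], not (page_nofollow or link_nofollow)

    # Skip indexing entirely when "noindex" is present, and skip recording the
    # source -> destination edge whenever endorsed is False.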

SEO is more of an art than a real science, and it's practiced by a lot of people who know what they're doing, and a lot of people who read the executive summaries of people who know what they're doing. You're going to run into issues where you'll get blocked from sites for doing things that other sites found perfectly acceptable due to some rule someone overheard or read in a blog post on SEOmoz that may or may not be interpreted correctly.

Because of that human element, unless you are Google, Microsoft, or Yahoo!, you are presumed malicious unless proven otherwise. You need to take extra care to act as though you are no threat to a web site owner, and act in accordance with how you would want a potentially malicious (but hopefully benign) crawler to act:

  • stop crawling a site once you detect you're being blocked: 403/401s on pages you know work, throttling, time-outs, etc.
  • avoid exhaustive crawls in relatively short periods of time: crawl a portion of the site, and come back later on (a few days later) to crawl another portion. Don't make parallel requests.
  • avoid crawling potentially sensitive areas: URLs with /admin/ in them, for example (a sketch of these checks follows this list)
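
A sketch of what those checks could look like around a single fetch, in Python with the requests library; the status codes and the /admin/ filter are just the examples above, and the give-up threshold is an arbitrary assumption:

    from urllib.parse import urlparse

    import requests

    BLOCK_STATUSES = {401, 403, 429}             # signs that the site is pushing back
    SENSITIVE_PATH = "/admin/"                   # just the example from above

    def polite_fetch(session, url, refusals, give_up_after=3):
        """Fetch one URL, or return None if the URL looks sensitive or the host
        appears to be blocking us. give_up_after is an arbitrary threshold."""
        if SENSITIVE_PATH in url.lower():
            return None                          # don't poke at admin areas at all
        host = urlparse(url).netloc
        if refusals.get(host, 0) >= give_up_after:
            return None                          # this host keeps refusing us: stop
        try:
            response = session.get(url, timeout=10)
        except requests.Timeout:
            refusals[host] = refusals.get(host, 0) + 1
            return None
        if response.status_code in BLOCK_STATUSES:
            refusals[host] = refusals.get(host, 0) + 1
            return None
        return response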

Even then, it's going to be an uphill battle unless you resort to black-hat techniques like UA spoofing or purposely masking your crawling patterns: many site owners, for the same reasons above, will block an unknown crawler on sight rather than take the chance that its operator isn't trying to "hack their site". Prepare for a lot of failure.

One thing you could do to combat the negative image an unknown crawler is going to have is to make it clear in your user-agent string who you are:

Aarobot Crawler 0.9 created by John Doe. See http://example.com/aarobot.html for more information.

Where http://example.com/aarobot.html explains what you're trying to accomplish and why you're not a threat. That page should have a few things:

  • Information on how to contact you directly
  • Information about what the crawler collects and why it's collecting it
  • Information on how to opt-out and have any data collected deleted

That last one is key: a good opt-out is like a Money Back Guarantee™ and scores an unreasonable amount of goodwill. It should be humane: one simple step (either an email address or, ideally, a form) and comprehensive (there shouldn't be any "gotchas": opt-out means you stop crawling without exception).
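
Declaring that identity in code is a one-liner; a minimal sketch with Python's requests library, reusing the placeholder identity string and URL from above:

    import requests

    session = requests.Session()
    session.headers["User-Agent"] = (
        "Aarobot Crawler 0.9 created by John Doe. "
        "See http://example.com/aarobot.html for more information."
    )
    # Every request made through this session now carries the identifying string,
    # so a site owner checking their logs can find out who you are before
    # deciding whether to block you.
    response = session.get("http://example.com/")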

32

While this doesn't answer all of your questions, I believe it will be of help to you and to the sites you crawl.

Similar to the technique used to brute-force websites without drawing attention: if you have a large enough pool of sites to crawl, don't crawl the next page on a site until you have crawled the next page of all the other sites. Well, modern servers allow HTTP connection reuse, so you might want to fetch more than one page per connection to minimise overhead, but the idea still stands. Do not crawl one site to exhaustion before moving on to the next. Share the love.

At the end of the day you can still have crawled just as many pages, but the average bandwidth usage on any single site will be much lower.
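
A sketch of that interleaving, assuming each site's pending pages sit in their own queue; the data structures here are illustrative, not the only way to do it:

    from collections import deque

    def round_robin(site_queues):
        """Yield URLs one site at a time (one page from each host per pass),
        so no single host absorbs a burst of requests."""
        rotation = deque(site_queues.items())        # {host: deque of pending URLs}
        while rotation:
            host, urls = rotation.popleft()
            if urls:
                yield urls.popleft()
                rotation.append((host, urls))        # rotate back only if work remains

    # Usage:
    # queues = {"example.org": deque([...]), "example.net": deque([...])}
    # for url in round_robin(queues):
    #     fetch(url)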

If you want to avoid being the spider of doom, there is no sure-fire method. If someone wants to stick beans up their nose, they will, and they'll probably do so in ways you could never predict. Having said that, if you don't mind missing the occasional valid page, keep a blacklist of words that stop you from following a link that contains them. For example:

  • Delete
  • Remove
  • Update
  • Edit
  • Modify

Not foolproof, but sometimes you just cannot prevent people from having to learn the hard way ;)
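
A sketch of that filter; the word list is only the examples above, and checking both the URL and the link text is an assumption about how destructive links tend to be labelled:

    DESTRUCTIVE_WORDS = {"delete", "remove", "update", "edit", "modify"}

    def looks_destructive(href, link_text=""):
        """Crude guard: skip links whose URL or anchor text mentions a
        state-changing action. Not foolproof, as noted above."""
        haystack = (href + " " + link_text).lower()
        return any(word in haystack for word in DESTRUCTIVE_WORDS)

    # Only enqueue a link when looks_destructive(href, text) is False.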

Dan McGrath
20

My one bit of advice is to listen to what the website you are crawling is telling you, and dynamically change your crawl in reaction to that.

  • Is the site slow? Crawl slower so you don't DDOS it. Is it fast? Crawl a bit more, then!

  • Is the site erroring? Crawl less so you're not stressing out a site already under duress. Use exponentially increasing retry times, so you retry less the longer the site keeps erroring (see the sketch after this list). But remember to come back later, eventually, so you can pick up anything you missed due to, say, a week-long error on a specific URL path.

  • Getting lots of 404s? (remember, our fancy 404 pages take server time too!) Avoid crawling further URLs with that path for now, as perhaps everything there is missing; if file001.html through file005.html are not there, I bet you dollars to donuts file999.html isn't either! Or perhaps turn down the percentage of time you retrieve anything in that path.
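
One way to wire those reactions together is a per-host delay that shrinks while responses are fast and clean and grows exponentially on errors; a sketch in Python, where all the constants are arbitrary assumptions rather than recommendations:

    import time

    class AdaptiveDelay:
        """Per-host politeness delay that reacts to what the server tells us."""

        def __init__(self, base=1.0, minimum=0.5, maximum=600.0):
            self.delay = base
            self.minimum = minimum
            self.maximum = maximum

        def record(self, status_code, elapsed_seconds):
            if status_code >= 500:
                self.delay = min(self.delay * 2, self.maximum)    # error: back off exponentially
            elif elapsed_seconds > 2.0:
                self.delay = min(self.delay * 1.5, self.maximum)  # slow responses: ease off
            else:
                self.delay = max(self.delay * 0.9, self.minimum)  # healthy: speed up a little

        def wait(self):
            time.sleep(self.delay)

    # Call record() after every response and wait() before the next request
    # to the same host.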

I think this is where a lot of naive crawlers go deeply wrong, by having one robotic strategy that they execute the same regardless of the signals they're getting back from the target site.

A smart crawler is reactive to the target site(s) it is touching.

Jeff Atwood
19

Others have mentioned some of the mantras; let me add a few more.

Pay attention to file type and size; don't pull huge binaries.

Optimize for the typical webserver "directory listing" pages. In particular, they allow sorting by size, date, name, permissions, and so on; don't treat each sort order as a separate root for crawling.
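
A sketch of one way to canonicalise such URLs, assuming Apache-style mod_autoindex listings where the sort order travels in C= and O= query parameters; other servers use different parameters, so treat the pattern as an example only:

    import re
    from urllib.parse import urlsplit, urlunsplit

    # Apache mod_autoindex sort links look like "?C=N;O=D" (column and order);
    # this pattern is specific to that server and assumed here, adjust for others.
    SORT_QUERY = re.compile(r"^C=[NMSD](?:;|&)O=[AD]$", re.IGNORECASE)

    def canonical_listing_url(url):
        """Map every sort order of a directory listing back to the bare listing URL."""
        parts = urlsplit(url)
        if SORT_QUERY.match(parts.query):
            parts = parts._replace(query="")
        return urlunsplit(parts)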

Ask for gzip (compression on the fly) whenever available.

Limit depth or detect recursion (or both).

Limit page size. Some pages implement tarpits to thwart email-scraping bots: a page that loads at a snail's pace and is terabytes long.
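
A sketch covering both this tip and the file-type one above, in Python with requests: stream the body and abandon anything that isn't HTML or grows past a cap. The one-megabyte cap is an arbitrary assumption.

    import requests

    MAX_BYTES = 1_000_000                 # arbitrary cap; tune it to what you actually need

    def fetch_html(session, url):
        """Download a page, but bail out on non-HTML content and on
        oversized or tarpit-style bodies."""
        with session.get(url, stream=True, timeout=10) as response:
            if "text/html" not in response.headers.get("Content-Type", ""):
                return None               # skip binaries and other non-HTML types
            chunks, total = [], 0
            for chunk in response.iter_content(chunk_size=8192):
                total += len(chunk)
                if total > MAX_BYTES:
                    return None           # tarpit or runaway page: give up
                chunks.append(chunk)
            return b"".join(chunks)

(requests also sends Accept-Encoding: gzip, deflate on every request by default, which covers the compression suggestion above.)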

Do not index 404 pages. Engines that boast the biggest indexes do this, and receive well-deserved hate in exchange.

This may be tricky, but try to detect load-balancing farms. If v329.host.com/pages/article.php?99999 returns the same as v132.host.com/pages/article.php?99999, don't scrape the complete list of servers from v001.host.com up to v999.host.com.
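
A content-hash check is one hedged way to avoid re-scraping the same article from every server in such a farm; it catches any byte-identical duplicate, not just load-balanced mirrors:

    import hashlib

    seen_bodies = {}                      # body hash -> URL where it was first seen

    def is_duplicate(url, body):
        """Skip bodies already seen under another URL, e.g. the same article
        served by both v132.host.com and v329.host.com."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in seen_bodies:
            return True
        seen_bodies[digest] = url
        return False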

SF.
4

I'll just add one little thing.

Copyright & other legal issues: I know you write they are public websites, so there might not be any copyright, but there might be other legal issues with storing the data.

This will of course depend on which country's data you are storing (and where you are storing it). As a case in point, consider the problems with the US Patriot Act versus the EU's Data Protection Directive. An executive summary of the problem is that US companies have to hand their data to e.g. the FBI if asked, without informing the users, whereas the Data Protection Directive states that users have to be informed of this. See http://www.itworld.com/government/179977/eu-upset-microsoft-warning-about-us-access-eu-cloud

Holger
    "I know you write they are public websites, so there might not be any copyright". Every website on the internet is public, and every website is copyright, unless it explicitly states otherwise. – Gavin Coates Jun 03 '15 at 10:23
3

Call your web crawler either "crawler" or "spider" as part of its name, along with your own name. This is important: analytics engines and the like look for those terms to classify you as a ... spider. ;)

The way I've seen that done is via the User-Agent request header.

jcolebrand
2
  • Preserve cookies, when required, to prevent the web site from creating unnecessary sessions.
  • Implement link-parsing behaviour as close to a browser's as possible; our live site reports a lot of 404s due to bot requests for missing files. (A sketch of both follows.)
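
A small sketch of both points with Python's requests and BeautifulSoup: a Session object keeps cookies across requests, and resolving relative links against the page's final URL (and any <base href>) mimics what a browser does. The helper name is illustrative.

    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()          # cookies set by the site are sent back,
                                          # so it doesn't open a new session per hit

    def browser_like_links(page_url):
        response = session.get(page_url, timeout=10)
        soup = BeautifulSoup(response.text, "html.parser")
        # Resolve hrefs the way a browser would: against the final (post-redirect)
        # URL and any <base href> the page declares.
        base = soup.find("base", href=True)
        base_url = urljoin(response.url, base["href"]) if base else response.url
        return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
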
Valera Kolupaev