Questions tagged [web-crawler]
22 questions
84
votes
7 answers
How to be a good citizen when crawling web sites?
I'm going to be developing some functionality that will crawl various public web sites and process/aggregate the data on them. Nothing sinister like looking for e-mail addresses - in fact it's something that might actually drive additional traffic…

Aaronaught
- 44,005
- 10
- 92
- 126
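A rough illustration of the politeness this question asks about: pace your requests per host rather than hammering a server. The delay value below is an illustrative assumption, not a universal standard.

```python
# A sketch of a per-host rate limiter that enforces a minimum delay
# between requests to the same host. The 2-second default is an
# illustrative choice; many sites publish their own Crawl-delay.
import time
from urllib.parse import urlparse

class HostRateLimiter:
    """Enforce a minimum interval between requests to each host."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last_request = {}  # host -> timestamp of last request

    def wait(self, url: str) -> None:
        """Block until it is polite to fetch `url`, then record the fetch."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request[host] = time.monotonic()
```

Call `limiter.wait(url)` before each fetch; alongside pacing, good citizens also send a descriptive User-Agent with contact information and honour robots.txt.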
29
votes
3 answers
What will happen if I don't follow robots.txt while crawling?
I am new to web crawling and I am testing my crawlers. I have been doing tests on various sites. I forgot about the robots.txt file during my tests.
I just want to know what will happen if I don't follow the robots.txt file and what is the…

user1858027
- 409
- 1
- 4
- 5
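For context, following robots.txt can be sketched with Python's standard-library parser; the user-agent string here is an illustrative assumption.

```python
# A sketch of checking robots.txt before fetching, using the stdlib
# parser. In a real crawler you would fetch http://host/robots.txt
# once per host and cache the parsed rules.
import urllib.robotparser

def allowed(robots_txt: str, url: str, agent: str = "MyBot") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

Ignoring robots.txt has no built-in technical penalty, but it commonly leads to IP blocking and complaints from site operators.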
8
votes
3 answers
Looking for good books about the theory behind search engines
I am working on a project that requires that I understand different techniques used by search engines for the web.
I have a strong scientific and development background, so I am not afraid of highly technical information.
I am looking for all forms…

sebpiq
- 365
- 1
- 3
- 8
7
votes
2 answers
How to find a 'good' seed page for a web crawler?
I started building a web crawler and read somewhere that it's a very hard problem to find a good seed page for the crawler. Can anyone explain whether there is any predefined procedure/guideline for finding a good seed page? Or how you say that a…

Hemant
- 1,473
- 2
- 10
- 6
6
votes
4 answers
Development of a bot/web crawler detection system
I am trying to build a system for my company that checks for unusual/abusive usage patterns (mainly web scrapers).
Currently, the logic I have implemented parses the HTTP access logs and takes into account the following parameters to…

bilkulbekar
- 161
- 1
- 4
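One of the simplest signals such a system can start from is request volume per client IP in an access-log slice. The log format (Common Log Format-style) and the threshold below are illustrative assumptions.

```python
# A sketch of flagging clients whose request count in a log slice
# exceeds a threshold. Real detectors combine many signals
# (user-agent, timing regularity, URL patterns), not just volume.
import re
from collections import defaultdict

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

def requests_per_ip(log_lines):
    """Count requests per client IP from Common Log Format-style lines."""
    counts = defaultdict(int)
    for line in log_lines:
        m = LOG_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def suspicious_ips(log_lines, threshold=100):
    """IPs whose request count in this slice exceeds the threshold."""
    return {ip for ip, n in requests_per_ip(log_lines).items() if n > threshold}
```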
5
votes
1 answer
Can I whitelist user agents that will execute JavaScript?
I'm building a SPA (single page application), so when a browser requests a page from my server, it only receives a small HTML file and a big JavaScript app that then requests the appropriate data from the server, renders the HTML locally and generally…

Pablo Fernandez
- 313
- 1
- 9
4
votes
1 answer
How much processing to do in the crawler? - good crawling practices
I am currently working on a pet project in Python with scrapy that scrapes several eBay-like sites for real-estate offers in my area. The thing is that some of the sites do seem to provide more structured data in their web pages (i.e. presenting a…

nikitautiu
- 143
- 4
4
votes
0 answers
IRLBot Paper DRUM Implementation - Why keep key, value and auxiliary buckets separate?
Repost from here as I think it may be more suited to this exchange.
I'm trying to implement DRUM (Disk Repository with Update Management) as per the IRLBot paper (relevant pages start at 4), but as a quick summary it's essentially just an efficient…

Isaac
- 183
- 6
4
votes
1 answer
What is the way to go to extract data from websites?
I've been thinking about a side project that involves web data scraping.
Ok, I read the Getting data from a webpage in a stable and efficient way question and the discussion gave me some insights.
In the discussion Joachim Sauer stated that you can…

salaniojr
- 49
- 1
- 1
- 3
3
votes
1 answer
Is it considered bad practice to crawl through the mobile version of a site?
I am building a web spider to crawl through several different sites, but one of them uses JavaScript buttons instead of links for several functions. And while I could learn to follow them, it adds an extra layer of complexity I would rather avoid if…

Devon M
- 522
- 1
- 4
- 14
3
votes
1 answer
Ways of Gathering Event Information From the Internet
What are the best ways of gathering information on events (any type) from the internet?
Keeping in mind that different websites will present information in different ways.
I was thinking 'smart' web crawlers, but that can turn out to be extremely…

J86
- 297
- 2
- 8
2
votes
1 answer
How to reverse engineer URL routes from a bulk of HTTP requests/responses
I am building a web application crawler that crawls for HTTP requests (GET, PUT, POST, ...). It is designed for one specific purpose; bug bounty hunting. It enables pentesters to insert exploit payloads at specific parts of the HTTP…

Tijme
- 31
- 5
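A common first step for recovering route templates from a pile of observed URLs is to replace identifier-looking path segments with placeholders and group URLs by the result. The placeholder syntax and the id heuristics below are illustrative assumptions.

```python
# A sketch of grouping observed request URLs into route templates:
# segments that look like numeric or hex identifiers are collapsed
# into a {id} placeholder.
import re
from collections import defaultdict
from urllib.parse import urlparse

ID_SEGMENT = re.compile(r"^(\d+|[0-9a-f]{8,})$", re.IGNORECASE)

def route_template(url: str) -> str:
    """Normalize a URL path into a route template like /users/{id}."""
    path = urlparse(url).path
    parts = [("{id}" if ID_SEGMENT.match(p) else p) for p in path.split("/")]
    return "/".join(parts)

def group_by_route(urls):
    """Group URLs that share the same route template."""
    groups = defaultdict(list)
    for url in urls:
        groups[route_template(url)].append(url)
    return dict(groups)
```

Grouping like this lets a pentesting tool fuzz each route once instead of once per concrete URL.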
2
votes
2 answers
Patterns for creating adaptive web crawler throttling
I'm running a service that crawls many websites daily. The crawlers run as jobs processed by a bunch of independent background worker processes that pick up the jobs as they get enqueued.
Currently I'm doing throttling "in-process", meaning…

Niels Kristian
- 181
- 6
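One adaptive pattern worth sketching here: let server feedback drive the per-domain delay, backing off multiplicatively on 429/503 responses and recovering gradually on success. The constants are illustrative assumptions.

```python
# A sketch of feedback-driven throttling: 429/503 responses double
# the per-domain delay, healthy responses shrink it slowly back
# toward a floor. All constants are illustrative.
class AdaptiveThrottle:
    def __init__(self, initial_delay=1.0, min_delay=0.5, max_delay=60.0):
        self.delay = initial_delay
        self.min_delay = min_delay
        self.max_delay = max_delay

    def record_response(self, status: int) -> None:
        if status in (429, 503):
            # Server is pushing back: back off multiplicatively.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Healthy response: recover a little at a time.
            self.delay = max(self.delay - 0.1, self.min_delay)
```

With independent workers as described in the question, this state would have to live in a shared store (e.g. Redis, keyed by domain) rather than in-process, so all workers see the same delay.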
2
votes
5 answers
Website Country Detection
I have a web crawler, and I'm looking for hints that will help me automatically detect a website's country of origin.
And by country of origin I generally mean the country the website is targeting. For example:
http://www.spiegel.de/ ->…

Filipe Miguel Fonseca
- 161
- 5
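The most obvious first-pass signal for the spiegel.de example is the country-code TLD. It is only a heuristic (many country-targeted sites use .com), so real systems combine it with Content-Language headers, declared locale, and server-IP geolocation; the mapping below is a small illustrative subset.

```python
# A sketch of a ccTLD-based country guess. Returns None when the
# TLD is not in the (illustrative, incomplete) mapping.
from urllib.parse import urlparse

CCTLD_COUNTRY = {
    "de": "Germany",
    "fr": "France",
    "uk": "United Kingdom",
    "pt": "Portugal",
}

def country_from_tld(url: str):
    """Guess the targeted country from the URL's top-level domain."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_COUNTRY.get(tld)
```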
1
vote
0 answers
What are the best practices for picking selectors for web scrapers?
The following is an example using https://github.com/GoogleChrome/puppeteer
'use strict';
const puppeteer = require('puppeteer');
(async() => {
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
const browser =…

alex
- 383
- 1
- 8
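One heuristic often suggested for this question is to prefer selectors tied to stable attributes (ids, data-* hooks) over brittle positional chains. The scoring rules below are illustrative assumptions, not an established standard.

```python
# A sketch of ranking candidate CSS selectors by likely robustness:
# ids and data-* attributes score up, deep positional chains score
# down. The weights are arbitrary illustrative choices.
def selector_score(selector: str) -> int:
    score = 0
    if selector.startswith("#"):
        score += 3                         # ids are usually the most stable hook
    if "[data-" in selector:
        score += 2                         # data-* attributes are meant for tooling
    score -= selector.count(">")           # deep positional chains break easily
    score -= selector.count(":nth-child")  # layout-dependent, very brittle
    return score

def pick_selector(candidates):
    """Choose the candidate selector judged most robust."""
    return max(candidates, key=selector_score)
```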