Questions tagged [web-crawler]
22 questions
84
votes
7 answers
How to be a good citizen when crawling web sites?
I'm going to be developing some functionality that will crawl various public web sites and process/aggregate the data on them. Nothing sinister like looking for e-mail addresses - in fact it's something that might actually drive additional traffic…

Aaronaught
- 44,005
- 10
- 92
- 126
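A rough illustration of the politeness this question asks about: pace your requests per host rather than hammering a server. The delay value below is an illustrative assumption, not a universal standard.

```python
# A sketch of a per-host rate limiter that enforces a minimum delay
# between requests to the same host. The 2-second default is an
# illustrative choice; many sites publish their own Crawl-delay.
import time
from urllib.parse import urlparse

class HostRateLimiter:
    """Enforce a minimum interval between requests to each host."""

    def __init__(self, min_interval: float = 2.0):
        self.min_interval = min_interval
        self._last_request = {}  # host -> timestamp of last request

    def wait(self, url: str) -> None:
        """Block until it is polite to fetch `url`, then record the fetch."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request[host] = time.monotonic()
```

Call `limiter.wait(url)` before each fetch; alongside pacing, good citizens also send a descriptive User-Agent with contact information and honour robots.txt.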
29
votes
3 answers
What will happen if I don't follow robots.txt while crawling?
I am new to web crawling and I am testing my crawlers. I have been doing tests on various sites. I forgot about the robots.txt file during my tests.
I just want to know what will happen if I don't follow the robots.txt file and what is the…

user1858027
- 409
- 1
- 4
- 5
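For context, following robots.txt can be sketched with Python's standard-library parser; the user-agent string here is an illustrative assumption.

```python
# A sketch of checking robots.txt before fetching, using the stdlib
# parser. In a real crawler you would fetch http://host/robots.txt
# once per host and cache the parsed rules.
import urllib.robotparser

def allowed(robots_txt: str, url: str, agent: str = "MyBot") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

Ignoring robots.txt has no built-in technical penalty, but it commonly leads to IP blocking and complaints from site operators.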
8
votes
3 answers
Looking for good books about the theory behind search engines
I am working on a project that requires that I understand different techniques used by search engines for the web.
I have a strong scientific and development background, so I am not afraid of highly technical information.
I am looking for all forms…

sebpiq
- 365
- 1
- 3
- 8
7
votes
2 answers
How to find a 'good' seed page for a web crawler?
I started building a web crawler and read somewhere that it's a very hard problem to find a good seed page for the crawler. Can anyone explain whether there is any predefined procedure/guideline for finding a good seed page? Or how you say that a…

Hemant
- 1,473
- 2
- 10
- 6
6
votes
4 answers
Development of a bot/web crawler detection system
I am trying to build a system for my company that checks for unusual/abusive usage patterns (mainly web scrapers).
Currently, the logic I have implemented parses the HTTP access logs and takes into account the following parameters to…

bilkulbekar
- 161
- 1
- 4
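One of the simplest signals such a system can start from is request volume per client IP in an access-log slice. The log format (Common Log Format-style) and the threshold below are illustrative assumptions.

```python
# A sketch of flagging clients whose request count in a log slice
# exceeds a threshold. Real detectors combine many signals
# (user-agent, timing regularity, URL patterns), not just volume.
import re
from collections import defaultdict

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)')

def requests_per_ip(log_lines):
    """Count requests per client IP from Common Log Format-style lines."""
    counts = defaultdict(int)
    for line in log_lines:
        m = LOG_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def suspicious_ips(log_lines, threshold=100):
    """IPs whose request count in this slice exceeds the threshold."""
    return {ip for ip, n in requests_per_ip(log_lines).items() if n > threshold}
```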
5
votes
1 answer
Can I whitelist user agents that will execute JavaScript?
I'm building a SPA (single page application), so when a browser requests a page from my server, it only receives a small HTML file and a big JavaScript app that then requests the appropriate data from the server, renders the HTML locally and generally…

Pablo Fernandez
- 313
- 1
- 9
4
votes
1 answer
How much processing to do in the crawler? - good crawling practices
I am currently working on a pet project in Python with scrapy that scrapes several eBay-like sites for real-estate offers in my area. The thing is that some of the sites do seem to provide more structured data in their web pages (i.e. presenting a…

nikitautiu
- 143
- 4
4
votes
0 answers
IRLBot Paper DRUM Implementation - Why keep key, value and auxiliary buckets separate?
Repost from here as I think it may be more suited to this exchange.
I'm trying to implement DRUM (Disk Repository with Update Management) as per the IRLBot paper (relevant pages start at 4), but as a quick summary it's essentially just an efficient…

Isaac
- 183
- 6
4
votes
1 answer
What is the way to go to extract data from websites?
I've been thinking about a side project that involves web data scraping.
Ok, I read the Getting data from a webpage in a stable and efficient way question and the discussion gave me some insights.
In the discussion Joachim Sauer stated that you can…

salaniojr
- 49
- 1
- 1
- 3
3
votes
1 answer
Is it considered bad practice to crawl through the mobile version of a site?
I am building a web spider to crawl through several different sites, but one of them uses JavaScript buttons instead of links for several functions. And while I could learn to follow them, it adds an extra layer of complexity I would rather avoid if…

Devon M
- 522
- 1
- 4
- 14
3
votes
1 answer
Ways of Gathering Event Information From the Internet
What are the best ways of gathering information on events (any type) from the internet?
Keeping in mind that different websites will present information in different ways.
I was thinking 'smart' web crawlers, but that can turn out to be extremely…

J86
- 297
- 2
- 8
2
votes
1 answer
How to reverse engineer URL routes from a bulk of HTTP requests/responses
I am building a web application crawler that crawls for HTTP requests (GET, PUT, POST, ...). It is designed for one specific purpose; bug bounty hunting. It enables pentesters to insert exploit payloads at specific parts of the HTTP…

Tijme
- 31
- 5
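A common first step for recovering route templates from a pile of observed URLs is to replace identifier-looking path segments with placeholders and group URLs by the result. The placeholder syntax and the id heuristics below are illustrative assumptions.

```python
# A sketch of grouping observed request URLs into route templates:
# segments that look like numeric or hex identifiers are collapsed
# into a {id} placeholder.
import re
from collections import defaultdict
from urllib.parse import urlparse

ID_SEGMENT = re.compile(r"^(\d+|[0-9a-f]{8,})$", re.IGNORECASE)

def route_template(url: str) -> str:
    """Normalize a URL path into a route template like /users/{id}."""
    path = urlparse(url).path
    parts = [("{id}" if ID_SEGMENT.match(p) else p) for p in path.split("/")]
    return "/".join(parts)

def group_by_route(urls):
    """Group URLs that share the same route template."""
    groups = defaultdict(list)
    for url in urls:
        groups[route_template(url)].append(url)
    return dict(groups)
```

Grouping like this lets a pentesting tool fuzz each route once instead of once per concrete URL.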
2
votes
2 answers
Patterns for creating adaptive web crawler throttling
I'm running a service that crawls many websites daily. The crawlers run as jobs processed by a bunch of independent background worker processes that pick up the jobs as they get enqueued.
Currently I'm doing throttling "in-process", meaning…

Niels Kristian
- 181
- 6
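One adaptive pattern worth sketching here: let server feedback drive the per-domain delay, backing off multiplicatively on 429/503 responses and recovering gradually on success. The constants are illustrative assumptions.

```python
# A sketch of feedback-driven throttling: 429/503 responses double
# the per-domain delay, healthy responses shrink it slowly back
# toward a floor. All constants are illustrative.
class AdaptiveThrottle:
    def __init__(self, initial_delay=1.0, min_delay=0.5, max_delay=60.0):
        self.delay = initial_delay
        self.min_delay = min_delay
        self.max_delay = max_delay

    def record_response(self, status: int) -> None:
        if status in (429, 503):
            # Server is pushing back: back off multiplicatively.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Healthy response: recover a little at a time.
            self.delay = max(self.delay - 0.1, self.min_delay)
```

With independent workers as described in the question, this state would have to live in a shared store (e.g. Redis, keyed by domain) rather than in-process, so all workers see the same delay.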
2
votes
5 answers
Website Country Detection
I have a web crawler, and I'm looking for hints that will help me automatically detect a website's country of origin.
And by country of origin I generally mean the country the website is targeting. For example:
http://www.spiegel.de/ ->…

Filipe Miguel Fonseca
- 161
- 5
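The most obvious first-pass signal for the spiegel.de example is the country-code TLD. It is only a heuristic (many country-targeted sites use .com), so real systems combine it with Content-Language headers, declared locale, and server-IP geolocation; the mapping below is a small illustrative subset.

```python
# A sketch of a ccTLD-based country guess. Returns None when the
# TLD is not in the (illustrative, incomplete) mapping.
from urllib.parse import urlparse

CCTLD_COUNTRY = {
    "de": "Germany",
    "fr": "France",
    "uk": "United Kingdom",
    "pt": "Portugal",
}

def country_from_tld(url: str):
    """Guess the targeted country from the URL's top-level domain."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_COUNTRY.get(tld)
```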
1
vote
0 answers
What are the best practices for picking selectors for web scrapers?
The following is an example using https://github.com/GoogleChrome/puppeteer
'use strict';
const puppeteer = require('puppeteer');
(async() => {
// const browser = await puppeteer.launch();
// const page = await browser.newPage();
const browser =…

alex
- 383
- 1
- 8
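One heuristic often suggested for this question is to prefer selectors tied to stable attributes (ids, data-* hooks) over brittle positional chains. The scoring rules below are illustrative assumptions, not an established standard.

```python
# A sketch of ranking candidate CSS selectors by likely robustness:
# ids and data-* attributes score up, deep positional chains score
# down. The weights are arbitrary illustrative choices.
def selector_score(selector: str) -> int:
    score = 0
    if selector.startswith("#"):
        score += 3                         # ids are usually the most stable hook
    if "[data-" in selector:
        score += 2                         # data-* attributes are meant for tooling
    score -= selector.count(">")           # deep positional chains break easily
    score -= selector.count(":nth-child")  # layout-dependent, very brittle
    return score

def pick_selector(candidates):
    """Choose the candidate selector judged most robust."""
    return max(candidates, key=selector_score)
```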