Highest Voted 'web-scraping' Questions - Software Engineering Stack Exchange

84

votes

7 answers

How to be a good citizen when crawling web sites?

I'm going to be developing some functionality that will crawl various public web sites and process/aggregate the data on them. Nothing sinister like looking for e-mail addresses - in fact it's something that might actually drive additional traffic…

web-scraping web-crawler

asked Jul 11 '11 at 01:25

Aaronaught

44,005
10
92
126

29

votes

3 answers

What will happen if I don't follow robots.txt while crawling?

I am new to web crawling and I am testing my crawlers. I have been running tests on various sites. I forgot about the robots.txt file during my tests. I just want to know what will happen if I don't follow the robots.txt file and what is the…

web-scraping web-crawler

asked Dec 20 '12 at 07:48

user1858027

409
1
4
5

10

votes

4 answers

Patterns and practices for Web Scraping in .Net (C#)

I will be putting together an application to automate an external web site/application. In some instances I will need to navigate the site as a user would (some links I need to follow cannot be predicted and must be parsed from a response). I am…

c# .net html web-scraping

asked Jul 11 '11 at 16:45

jlnorsworthy

1,266
2
10
18

6

votes

3 answers

Preferred approach to mock a site to test a scraper

Subj. At the moment I'm using Selenium and Python, but the same applies to any other scraping solution. I'm wondering: which of the options outlined below are optimal/recommended/best practices, if there are existing solutions/helper libraries, which…

acceptance-testing web-scraping

asked Feb 05 '18 at 10:35

ivan_pozdeev

583
3
15

4

votes

2 answers

How to make a webdriver run reliably in Selenium?

I have been having quite a time getting this to work reliably for hundreds of thousands of terms and potentially millions of pages per source, and to ETL the resulting data into a database in an automated fashion. I need to run the tasks in Mesos on a…

java scala selenium web-scraping selenium-webdriver

asked Oct 19 '16 at 19:29

user3916597

197
2
8

4

votes

1 answer

How much processing to do in the crawler? - good crawling practices

I am currently working on a pet project in Python with Scrapy that scrapes several eBay-like sites for real-estate offers in my area. The thing is that some of the sites do seem to provide more structured data in their web pages (i.e. presenting a…

web-scraping web-crawler

asked Jun 17 '16 at 16:58

nikitautiu

143
4

4

votes

1 answer

What is the way to go to extract data from websites?

I've been thinking about a side project that involves web data scraping. OK, I read the question "Getting data from a webpage in a stable and efficient way" and the discussion gave me some insights. In the discussion Joachim Sauer stated that you can…

architecture python web-scraping web-crawler

asked May 23 '13 at 12:21

salaniojr

49
1
1
3

3

votes

4 answers

Which language is the most flexible for scraping websites?

I'm new to programming. I know a little Python and a little Objective-C, and I've been going through tutorials for each. Then it occurred to me, I need to know which language is more flexible (Python, Objective-C, something else) for screen scraping a…

python objective-c web-scraping

asked May 09 '11 at 19:59

MSe

183
1
1
7

3

votes

1 answer

How to approach a large number of multiple, parallel HttpClient requests?

I have a website which offers pages in the format of https://www.example.com/X where X is a sequential, unique number increasing by one every time a page is created by the users and never reused even if the user deletes their page. Since the site…

c# asynchronous-programming parallelism parallel-programming web-scraping

asked Feb 23 '20 at 17:33

nicktheone

41
1
3

3

votes

1 answer

Android App with Ruby Backend Server

I'm working on a personal project to help me branch out and learn some new/different technologies. I'm a .NET programmer but I want to learn Ruby and how to develop Android apps. I have developed pieces already but need the whole system to tie…

android ruby web-scraping web-servers

asked Sep 11 '15 at 20:16

AXG1010

171
5

3

votes

1 answer

Scraping data from website and passing into Office - a lot of restrictions

Recently, I was asked to help with a side optimization project at our company, and I've done a fair amount of research. I'm still not 100% sure this is the most efficient way to do it. Problem: scraping over a dozen different pieces of information from a…

architecture web-applications websites web-scraping

asked Jul 08 '15 at 19:18

kuba

133
6

2

votes

1 answer

Is it possible to layer an API (REST, GraphQL, etc.) in front of data that is currently only accessible via an enterprise desktop GUI?

Currently, my thoughts are that GET requests would be feasible by using the concept of screen scraping combined with a cron job that runs at a set interval to scrape data from the GUI and sync to my own database. However, I'm not quite sure how I…

architecture legacy automation desktop-application web-scraping

asked Jan 04 '19 at 01:23

J. Munson

137
3

2

votes

1 answer

Preventing crawler from interfering with user tracking

I'm scraping text from various webshops (no images/videos or other data). I'm no expert on user tracking, so I'd like to know if there's a way for me to write my crawler so it won't interfere with the webshop owners' tracking. Perhaps this is already…

node.js web-scraping

asked Mar 19 '13 at 17:03

Jan Sommer

170
8

2

votes

1 answer

Improving performance for web scraping code

I have a website whose code scrapes other websites to get accurate data. The code works well, but there is a noticeable lag in performance because it first downloads the HTML stream from various sites (sometimes 9 websites),…

performance web-scraping

asked Mar 25 '12 at 05:22

Pankaj Upadhyay

5,060
11
44
60

2

votes

5 answers

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (XML, HTML, CSV). Over the last three years I've built 20 or so C# console applications that go out, download the data and re-format it into a database. But I'm curious what other people are…

c# .net web-scraping

asked Nov 05 '11 at 16:54

JAS

21
3
