Questions tagged [web-scraping]

Web scraping is automated information extraction from web sites.

44 questions
84
votes
7 answers

How to be a good citizen when crawling web sites?

I'm going to be developing some functionality that will crawl various public web sites and process/aggregate the data on them. Nothing sinister like looking for e-mail addresses - in fact it's something that might actually drive additional traffic…
Aaronaught
29
votes
3 answers

What will happen if I don't follow robots.txt while crawling?

I am new to web crawling and I am testing my crawlers. I have been running tests on various sites. I forgot about the robots.txt file during my tests. I just want to know what will happen if I don't follow the robots.txt file and what is the…
user1858027
10
votes
4 answers

Patterns and practices for Web Scraping in .Net (C#)

I will be putting together an application to automate an external web site/application. In some instances I will need to navigate the site as a user would (some links I need to follow cannot be predicted and must be parsed from a response). I am…
jlnorsworthy
6
votes
3 answers

Preferred approach to mock a site to test a scraper

See the title. At the moment I'm using Selenium and Python, but the same applies to any other scraping solution. I'm wondering: which of the options outlined below are optimal/recommended/best practices if there are existing solutions/helper libraries, which…
ivan_pozdeev
4
votes
2 answers

How to make a webdriver run reliably in Selenium?

I have been having quite a time getting this to work reliably for hundreds of thousands of terms and potentially millions of pages per source, and to ETL the resulting data into a database in an automated fashion. I need to run the tasks in Mesos on a…
user3916597
4
votes
1 answer

How much processing to do in the crawler? - good crawling practices

I am currently working on a pet project in Python with Scrapy that scrapes several eBay-like sites for real-estate offers in my area. The thing is that some of the sites do seem to provide more structured data in their web pages (i.e. presenting a…
nikitautiu
4
votes
1 answer

What is the way to go to extract data from websites?

I've been thinking about a side project that involves web data scraping. OK, I read the question "Getting data from a webpage in a stable and efficient way" and the discussion gave me some insights. In the discussion Joachim Sauer stated that you can…
salaniojr
3
votes
4 answers

Which language is the most flexible for scraping websites?

I'm new to programming. I know a little Python and a little Objective-C, and I've been going through tutorials for each. Then it occurred to me, I need to know which language is more flexible (Python, Objective-C, something else) for screen scraping a…
MSe
3
votes
1 answer

How to approach a large number of multiple, parallel HttpClient requests?

I have a website which offers pages in the format of https://www.example.com/X where X is a sequential, unique number increasing by one every time a page is created by the users and never reused even if the user deletes their page. Since the site…
3
votes
1 answer

Android App with Ruby Backend Server

I'm working on a personal project to help me branch out and learn some new/different technologies. I'm a .NET programmer but I want to learn Ruby and how to develop Android apps. I have developed pieces already but need the whole system to tie…
AXG1010
3
votes
1 answer

Scraping data from website and passing into Office - a lot of restrictions

Recently, I was asked to help with a side optimization project at our company, and I've done a fair amount of research. I'm still not 100% sure if this is the most efficient way to do this. Problem: Scraping over a dozen different pieces of information from a…
kuba
2
votes
1 answer

Is it possible to layer an API (REST, GraphQL, etc.) in front of data that is currently only accessible via an enterprise desktop GUI?

Currently, my thoughts are that GET requests would be feasible by using the concept of screen scraping combined with a cron job that runs at a set interval to scrape data from the GUI and sync to my own database. However, I'm not quite sure how I…
2
votes
1 answer

Preventing crawler from interfering with user tracking

I'm scraping text from various webshops (no images/videos or other data). I'm no expert on user tracking, so I'd like to know if there's a way for me to write my crawler so it won't interfere with the webshop owners' tracking. Perhaps this is already…
Jan Sommer
2
votes
1 answer

Improving performance for web scraping code

I have a website whose code scrapes other websites to get accurate data. The code works well, but there is a noticeable lag in performance because it first downloads the HTML stream from various sites (sometimes 9 websites),…
Pankaj Upadhyay
2
votes
5 answers

Data Scraping - One application or multiple?

I have 30+ sources of data I scrape daily in various formats (XML, HTML, CSV). Over the last three years I've built 20 or so C# console applications that go out, download the data and re-format it into a database. But I'm curious what other people are…
JAS