
I've been thinking about a side project that involves web data scraping.

OK, I read the question "Getting data from a webpage in a stable and efficient way", and the discussion gave me some insights.

In the discussion, Joachim Sauer stated that you can contact the owners of the sites and work out some way for them to provide the data I want. The problem I see is that the websites are generally badly built and apparently seldom change their HTML (I don't think the owners will help me), but the data is relevant. I have suffered a lot using those sites, so I would like to aggregate the data and present it in a better way.

So, is scraping, specifically with Scrapy (for Python), a problematic approach? I read that parse.ly uses scraping (Python and Scrapy), but in another context.

Given my context, is there a better approach than scraping? If I go with scraping, how do I deal with changes in a website's structure?
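
For concreteness, this is roughly the kind of minimal Scrapy spider I have in mind (the URL and CSS selectors are made up); part of my worry is what happens when those selectors break after a layout change:

```python
import scrapy


class ListingSpider(scrapy.Spider):
    """Hypothetical spider; the site URL and selectors below are placeholders."""
    name = "listing"
    start_urls = ["http://example.com/listings"]

    def parse(self, response):
        for row in response.css("table.listing tr"):
            item = {
                "title": row.css("td.title::text").get(),
                "price": row.css("td.price::text").get(),
            }
            # If the site's HTML changes, these selectors return None;
            # logging here at least makes the breakage visible immediately.
            if not all(item.values()):
                self.logger.warning("Possible layout change: %r", item)
                continue
            yield item
```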

salaniojr
  • I don't think that we really have enough information to fully answer the question, as we have no clue exactly what data you want from the sites (HTML, images, Flash applets, databases, etc.) and don't know what you are then going to do with the data. Perhaps if you reworded the question to account for those two details, we would be able to point you in the right direction for scraping or crawling a site. – Macslayer May 25 '13 at 22:14

1 Answer


Downloading the contents of a website can cause a wide range of problems for the website owners. For example, you might:

  • bottleneck the server by consuming all of its available resources to feed your script's requests;
  • make a mistake and send requests that look like an attack;
  • get stuck in what is called a robot trap and keep downloading the same page because the URL constantly changes;
  • ignore the robots.txt file and access parts of the website the owners don't want you to.

It's best practice to use a proper web crawling tool. Using the right tool for the job ensures that you respect the performance, security and intended usage of the web server. Simple Python/PHP scraping scripts do nothing but harm to the servers they ambush with thousands of web requests in an uncontrolled manner.
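
To give an idea of what "respecting the server" means in practice, here is a rough sketch of politeness settings for Scrapy, since the question mentions it; the numbers are placeholders to tune per site, not recommendations:

```python
# Sketch of "polite" crawler settings for a Scrapy project's settings.py.
ROBOTSTXT_OBEY = True                  # honour the site's robots.txt
USER_AGENT = "my-side-project (contact: me@example.com)"  # identify yourself
DOWNLOAD_DELAY = 2.0                   # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1     # never hammer a single server in parallel
AUTOTHROTTLE_ENABLED = True            # back off automatically when responses slow down
DEPTH_LIMIT = 5                        # limit crawl depth to avoid robot traps
```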

You should use a web crawler like Heritrix to download the website to an archive file. Once the archive file is created, you can process it with Python/PHP all you want. Since it's stored locally on your hard drive, there is no harm in reading it as many times as you like.
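
For example, Heritrix writes its crawl output as WARC archive files; a minimal sketch of the local processing step might look like this (assuming the warcio and BeautifulSoup libraries, and a made-up archive filename):

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio
from bs4 import BeautifulSoup                       # pip install beautifulsoup4

# "crawl.warc.gz" is a placeholder name for an archive produced by the crawler.
with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()
        # Scrape locally: no extra load on the website, re-run as often as needed.
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.string if soup.title else None
        print(url, title)
```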

The ethics and legal issues of using content from another website are a completely different matter. I'm not even going to go there, because that's between you and the website owner. What I don't want to see is people hammering websites needlessly as they try to download the content. Be respectful and crawl with the same rules that companies like Google, Bing and Yahoo follow.

Reactgular
  • Most websites that are worth scraping are backed by a database, and without that database your local copy is useless. – Lukei May 23 '13 at 14:05
  • @Templar all one can do is download `HTML`. There are polite ways of doing it, and nasty ways of doing it. I'd rather see the OP do it politely. Why he's doing it isn't something we can change. – Reactgular May 23 '13 at 14:07
  • Well, what I can get out of this discussion is that the biggest problem with scraping is the legal issues. Anyway, I don't see the difference between scraping and crawling in relation to your answer. I must read more about it, but my understanding is that scraping gives you more structured data from a crawl. So I don't see the point of Heritrix being less harmful than Scrapy, for instance. – salaniojr May 23 '13 at 15:37
  • @salaniojr I think you're using the terms incorrectly. `Scraping` refers to the process of extracting data from, say, `html`, and `crawling` refers to the process of downloading the `html`. Heritrix allows you to download the `html` in a polite manner to an archive. You then perform the `scraping` on the archive that is on your hard drive. Scraping is a brute-force iteration over a collection of html to extract the data, which is harmful when done against a live web server. – Reactgular May 23 '13 at 15:44