
I am currently working on a pet project in Python with Scrapy that scrapes several eBay-like sites for real-estate offers in my area. The thing is that some of the sites seem to provide more structured data in their web pages (e.g. presenting a table of all the utilities the apartment has), while others don't. Therefore I have to do some parsing on the data, which I do using the pipeline mechanism of the library.
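For context, here is a minimal sketch of the kind of pipeline step I mean; the `utilities_raw` field and the set of known utilities are just illustrative, not what the sites actually provide.

```python
# Rough sketch of one of my item pipelines; field names are made up.
class UtilityNormalisationPipeline:
    """Turn a free-text utilities blurb into a list of known keywords."""

    KNOWN_UTILITIES = {"water", "gas", "electricity", "heating", "internet"}

    def process_item(self, item, spider):
        raw = item.get("utilities_raw")
        if not raw:
            # Some sites simply don't expose this data; keep the item anyway.
            item["utilities"] = []
            return item
        tokens = {t.strip().lower() for t in raw.replace(";", ",").split(",")}
        item["utilities"] = sorted(tokens & self.KNOWN_UTILITIES)
        return item
```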

My question is, however, how much processing should a crawler actually do? Should it just extract raw chunks of text based on some XPath expressions, so as not to waste processing power on them, and let that data be parsed further down the line by some other worker, or should it do some of the parsing itself?

There seem to be a lot of guidelines online on good web-crawling practices, but I haven't really found any on good crawler design. Any suggestions or rules of thumb?

nikitautiu

1 Answer


When I have worked with crawlers in the past, we used a much more decoupled approach, described below.

The basic content retriever reads from a database of page URLs that it is due to crawl, fetches the page data, and stores it in a database keyed by that URL, following a farmer/worker model. Once content for a page has been added to the database, a message is pushed to a pub/sub topic for the data parsers.
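As a rough sketch of that retriever loop (assuming Redis as the pub/sub backend; `claim_next_url` and `store_page` are hypothetical helpers over whatever database you use):

```python
# Farmer/worker retriever: fetch due URLs, persist the raw page, notify parsers.
import requests
import redis

broker = redis.Redis()

def retriever_loop(claim_next_url, store_page):
    while True:
        url = claim_next_url()            # next URL that is due to be crawled
        if url is None:
            break                         # nothing left to do for now
        response = requests.get(url, timeout=30)
        store_page(url, response.text)    # raw content goes into the page database
        broker.publish("pages", url)      # tell subscribers new content is available
```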

The first of these subscribers is a URL scraper. It parses the fetched page for URLs, adding any that are new to the basic content retriever's URL database.
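That subscriber might look something like this (again assuming Redis pub/sub; `load_page` and `enqueue_url` are hypothetical wrappers around your two databases):

```python
# URL-scraper subscriber: pull every link out of newly stored pages.
import redis
from urllib.parse import urljoin
from bs4 import BeautifulSoup

broker = redis.Redis()

def url_scraper_loop(load_page, enqueue_url):
    pubsub = broker.pubsub()
    pubsub.subscribe("pages")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        url = message["data"].decode()
        soup = BeautifulSoup(load_page(url), "html.parser")
        for anchor in soup.find_all("a", href=True):
            enqueue_url(urljoin(url, anchor["href"]))   # stored only if not already known
```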

Any number of additional parsers can then be built to subscribe to the pub/sub topic and extract whatever data you need.
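Each of those parsers is just another subscriber with the same skeleton; for example, a price extractor could look like the following (the XPath and the `store_price` helper are hypothetical and would differ per site):

```python
# Example domain parser: pull an asking price out of each stored page.
import re
import redis
from lxml import html

broker = redis.Redis()

def price_parser_loop(load_page, store_price):
    pubsub = broker.pubsub()
    pubsub.subscribe("pages")
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        url = message["data"].decode()
        tree = html.fromstring(load_page(url))
        for text in tree.xpath("//span[@class='price']/text()"):   # site-specific selector
            digits = re.sub(r"\D", "", text)
            if digits:
                store_price(url, int(digits))
```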

In my opinion this approach gives you a great deal of extensibility and scalability. There are, of course, considerations around message visibility and around ensuring that multiple workers of the same type don't pick up the same job, but solutions to those problems are well documented.

Andy