I am currently working on a pet project in Python with Scrapy that scrapes several eBay-like sites for real-estate offers in my area. The problem is that some of the sites provide fairly structured data in their pages (e.g. a table of all the utilities the apartment has), while others don't. Because of that I have to do some parsing of the data, which I currently do with Scrapy's item pipeline mechanism.
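To make that concrete, here is roughly what my current setup looks like. This is only a sketch: the site URL, the CSS classes in the XPaths, and the field names are made up, and the parsing in the pipeline is deliberately naive.

```python
import scrapy


class ListingSpider(scrapy.Spider):
    """Hypothetical spider for one of the real-estate sites."""
    name = "listings"
    start_urls = ["https://example-realestate-site.test/offers"]  # placeholder URL

    def parse(self, response):
        # The selectors below are assumptions about the page layout.
        for offer in response.xpath("//div[@class='offer']"):
            yield {
                "title": offer.xpath(".//h2/text()").get(),
                "price_raw": offer.xpath(".//span[@class='price']/text()").get(),
                # Some sites expose a utilities table, others just a blob of text.
                "utilities_raw": offer.xpath(".//table[@class='utilities']//td/text()").getall(),
            }


class NormalizePipeline:
    """Item pipeline that turns the raw strings into structured fields."""

    def process_item(self, item, spider):
        if item.get("price_raw"):
            # e.g. "1 250 EUR / month" -> 1250 (very naive parsing)
            digits = "".join(ch for ch in item["price_raw"] if ch.isdigit())
            item["price"] = int(digits) if digits else None
        item["utilities"] = [u.strip().lower() for u in item.get("utilities_raw", [])]
        return item
```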
My question is: how much processing should a crawler actually do? Should it just extract raw chunks of text based on a few XPaths, so that no processing power is wasted during the crawl, and leave the parsing to some other worker further down the line, or should it do some of the parsing itself?
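In other words, the alternative I'm considering is a "dumb" crawler that only ships raw fragments and lets a separate process parse them later. Again just a sketch with made-up selectors:

```python
import scrapy


class RawChunkSpider(scrapy.Spider):
    """The 'dumb crawler' alternative: emit raw fragments, parse elsewhere."""
    name = "raw_listings"
    start_urls = ["https://example-realestate-site.test/offers"]  # placeholder URL

    def parse(self, response):
        for offer in response.xpath("//div[@class='offer']"):
            yield {
                "url": response.url,
                # No cleaning at all: hand the raw HTML fragment to a downstream worker.
                "raw_html": offer.get(),
            }
```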
There seem to be plenty of guidelines online on good web-crawling practices, but I haven't really found any on good crawler design. Any suggestions or rules of thumb?