As per the subject line: at the moment I'm using Selenium with Python, but the same applies to any other scraping solution.
I'm wondering:
- which of the options outlined below are optimal/recommended/best practices
- if there are existing solutions/helper libraries, which keywords I should look them up by.
To stay objective, "optimal/recommended/best practices" means "widely used and/or promoted/endorsed by high-profile projects in the niche."
I couldn't find any Selenium-related or general-purpose material on this topic, despite spending about a day of net time searching around, which probably means I'm lacking some critical piece(s) of information.
The basic operations when scraping are (a minimal Selenium sketch follows the list):
- searching for an element (by CSS selector/XPath, or by hand for things those aren't capable of)
- interacting with an element (inputting text, clicking)
- reading element data
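For concreteness, this is roughly what those three operations look like in Selenium 4 (the URL and selectors are placeholders, not a real site):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/search")  # placeholder URL

# 1. Search for an element (CSS selector / XPath).
box = driver.find_element(By.CSS_SELECTOR, "input[name='q']")

# 2. Interact with it (input text, click).
box.send_keys("test query")
driver.find_element(By.XPATH, "//button[@type='submit']").click()

# 3. Read element data.
link = driver.find_element(By.CSS_SELECTOR, "a.result")
print(link.text, link.get_attribute("href"))

driver.quit()
```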
And the call chain goes like this:
(Test code ->) User code -> Framework (Selenium) -> Browser (WebDriver) -> Site
So, there are three hops here that I could mock. Each one poses its own challenges:
- Mock the site: launch a local HTTP server and direct the browser there (see the first sketch after this list)
  - I'd have to reimplement the scraped site's interface, in web technologies
- Mock the browser: e.g. populate HtmlUnit (an in-process browser engine) with predefined HTML at the appropriate moments
  - much simpler, but I'd still need to emulate state transitions/reactions to actions somehow
- Mock the framework calls (see the second sketch after this list)
  - the truest to the unit-testing philosophy, and the least work
  - I'm worried, however, that it's too restrictive: I can find the same element by various means, while a mock object can only accept one very specific course of action, since it lacks the sophistication to check whether e.g. some other selector would produce the same result
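To make the first and third options concrete, here are two minimal sketches. First, mocking the site: a canned page served by Python's built-in HTTP server on a random free port; `scrape_title` is a hypothetical stand-in for whatever scraper function is under test.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned response standing in for the real site's page.
PAGE = b"<html><body><a class='result'>expected</a></body></html>"

class FakeSite(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

def test_scraper_against_fake_site():
    server = HTTPServer(("127.0.0.1", 0), FakeSite)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        assert scrape_title(url) == "expected"  # hypothetical code under test
    finally:
        server.shutdown()
```

Second, mocking the framework hop with the standard library's `unittest.mock`, assuming the code under test receives the driver as a parameter (`extract_title` is again a hypothetical stand-in):

```python
from unittest.mock import MagicMock

def test_scraper_with_mock_driver():
    element = MagicMock()
    element.text = "expected"

    driver = MagicMock()
    driver.find_element.return_value = element  # answers *any* locator

    assert extract_title(driver) == "expected"  # hypothetical code under test
    driver.find_element.assert_called()
```

Note that a plain `MagicMock` returns the same element for any locator, which sidesteps (rather than solves) the equivalent-selectors concern: any selector "works", but none is verified to be correct.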
There are also two options for what content to provide -- either:
- provide the site's original content as produced for a test query, compiled into some sort of self-contained package
  - labor-intensive and error-prone; or
- provide the bare minimum to satisfy the tested algorithm (see the sketch after this list)
  - much simpler, but it would fail other possible algorithms that would succeed against the real site
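For the "bare minimum" flavor, the fixture can even skip the server entirely and be fed to a real browser as a `data:` URL (the HTML and selector below are made up):

```python
from urllib.parse import quote

from selenium import webdriver
from selenium.webdriver.common.by import By

# Just enough HTML to satisfy the algorithm under test.
MINIMAL_HTML = "<html><body><a class='result'>expected</a></body></html>"

def test_parses_title():
    driver = webdriver.Chrome()
    try:
        driver.get("data:text/html," + quote(MINIMAL_HTML))  # no server needed
        assert driver.find_element(By.CSS_SELECTOR, "a.result").text == "expected"
    finally:
        driver.quit()
```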
One last concern is the fact that a site is effectively a state machine. I'm not sure which will be more useful:
- implement the complete state machine, probably as some kind of specification, and set/check its states in the tests (see the sketch after this list)
  - very labor-intensive without some kind of library that reduces the work to writing a formal specification; or
- simply validate the action sequences
  - which doesn't seem to actually test the code against anything -- it merely reiterates what the code does
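A rough idea of what such a specification could look like as plain data: states map to pages, and (state, action) pairs map to successor states. All the names here are invented for illustration; a fake server like the one sketched earlier could serve `machine.page`, and tests could assert on `machine.state`.

```python
# Pages the fake site shows in each state (truncated placeholders).
PAGES = {
    "home":    "<form id='search'>...</form>",
    "results": "<a class='result'>expected</a>",
}

# Legal transitions: (current state, action) -> next state.
TRANSITIONS = {
    ("home", "submit_search"): "results",
    ("results", "go_home"): "home",
}

class FakeSiteStateMachine:
    def __init__(self):
        self.state = "home"

    def act(self, action):
        # A KeyError here means the scraper attempted an illegal action.
        self.state = TRANSITIONS[(self.state, action)]

    @property
    def page(self):
        return PAGES[self.state]
```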
Update to address an expressed concern:
I'm scraping a 3rd-party site, which can and will change without notice one day. So I'm fine with testing against "the site's interface as it was at the time of writing" -- the goal is to quickly check whether a code change broke the scraper's internal logic.
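Given that goal, one workable compromise might be snapshot fixtures: capture the live page once, commit the result, and replay it from disk in the tests. One caveat: `page_source` captures the rendered DOM only, not scripts or assets, so dynamic behavior isn't replayed. The paths and `scrape_title_from` below are hypothetical.

```python
from pathlib import Path

from selenium import webdriver

def capture_snapshot(url, path):
    """Run once, manually, against the live site; commit the result."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        Path(path).write_text(driver.page_source, encoding="utf-8")
    finally:
        driver.quit()

def test_against_snapshot():
    snapshot = Path("tests/fixtures/search_page.html").resolve()
    driver = webdriver.Chrome()
    try:
        driver.get(snapshot.as_uri())  # replay as a file:// URL
        assert scrape_title_from(driver) == "expected"  # hypothetical code under test
    finally:
        driver.quit()
```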