At the moment I'm using Selenium and Python, but the same applies to any other scraping solution.

I'm wondering:

  1. which of the options outlined below are optimal/recommended/best practices
  2. whether there are existing solutions/helper libraries, and which keywords to look them up by.

To stay objective, "optimal/recommended/best practices" means "widely used and/or promoted/endorsed by high-profile projects in the niche."

I couldn't find any Selenium-related or general-purpose material on this topic, having spent about a day of net time searching around, which probably means I'm lacking some critical piece(s) of information.


The basic operations when scraping are:

  • searching for an element (by CSS selector/XPath, and/or by hand for things those aren't capable of)
  • interacting with an element (inputting text, clicking)
  • reading element data
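
For concreteness, a sketch of those elementary operations in Selenium/Python (the URL and selectors are hypothetical):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/login")  # hypothetical page

# search for an element (CSS selector / XPath)
field = driver.find_element(By.CSS_SELECTOR, "input#username")

# interact with an element
field.send_keys("testuser")
driver.find_element(By.XPATH, "//button[@type='submit']").click()

# read element data
status = driver.find_element(By.ID, "status").text
```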

And the call chain goes like this:

(Test code ->) User code -> Framework (selenium) -> Browser (web driver) -> Site


So, there are 3 hops here that I could mock. Each one poses challenges:

  • Mock the site: launch a local HTTP server and direct the browser there
    • Have to reimplement the scraped site's interface in web technologies
  • Mock the browser (e.g. populate HtmlUnit (an in-process browser engine) with predefined HTML at appropriate moments)
    • much simpler but still need to emulate state transitions/action reactions somehow
  • Mock the framework calls
    • The truest to the unit testing philosophy, the least work
    • I'm worried, however, that it's too restrictive. The same element can be found by various means, but a mock object can only accept one specific course of action, since it lacks the sophistication to check whether, say, some other selector would produce the same result. (See the sketches after this list.)
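
To make the last option concrete, here's a sketch of mocking the framework calls with Python's standard `unittest.mock` (the module and function under test are hypothetical):

```python
from unittest.mock import MagicMock

from myscraper import scrape_title  # hypothetical module under test

def test_scrape_title_reads_heading():
    driver = MagicMock()
    driver.find_element.return_value.text = "Expected Title"

    assert scrape_title(driver) == "Expected Title"
    # The restrictiveness concern in action: this pins down the exact
    # call the code made; an equivalent selector would fail the check.
    driver.find_element.assert_called_once()
```

And a sketch of the "mock the site" option, serving recorded pages from a local HTTP server using only the standard library (Python 3.7+):

```python
import functools
import http.server
import threading

def serve_fixtures(directory, port=8000):
    """Serve saved page snapshots so the browser can be pointed at
    http://localhost:8000/ instead of the live site."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.ThreadingHTTPServer(("localhost", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server  # call .shutdown() in the test teardown
```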

There are also two options for what content to provide -- either

  • provide the site's original content that it produced for a test query, compiling it into some sort of self-contained package
    • labor-intensive and error-prone, or
  • provide the bare minimum to satisfy the tested algorithm
    • much simpler but would fail for other possible algorithms that would succeed with the real site
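
A sketch of the "bare minimum" option: the fixture only needs whatever the tested algorithm actually touches (element names hypothetical):

```python
# A minimal fixture page, e.g. written to fixtures/login.html and served
# to the browser: only the elements the tested code actually selects.
MINIMAL_LOGIN_PAGE = """
<html><body>
  <input id="username">
  <button type="submit">Log in</button>
  <div id="status"></div>
</body></html>
"""
```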

One last concern is the fact that a site is effectively a state machine. I'm not sure which will be more useful:

  • implement the complete state machine, probably as some kind of specification, and set/check its states in the tests
    • very labor-intensive without some kind of library that reduces the work to writing a formal specification; or
  • simply validate the action sequences
    • which doesn't seem to actually test the code against anything -- it merely reiterates what the code does
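
For illustration, "validating the action sequence" against a mocked driver would look roughly like this (the function under test is hypothetical), which shows why it merely restates the code:

```python
from unittest.mock import MagicMock, call
from selenium.webdriver.common.by import By

from myscraper import log_in  # hypothetical function under test

def test_login_action_sequence():
    driver = MagicMock()
    log_in(driver, "user", "secret")

    # Each expected call mirrors a line of the implementation, so the
    # test restates what the code does rather than checking an outcome.
    assert driver.find_element.call_args_list == [
        call(By.ID, "username"),
        call(By.ID, "password"),
        call(By.XPATH, "//button[@type='submit']"),
    ]
```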

Update to address an expressed concern:

I'm scraping a 3rd-party site -- which can and will change without notice one day. So, I'm fine with testing against "the site's interface as it was at the time of writing" -- to quickly check if a code change broke the scraper's internal logic.

ivan_pozdeev

3 Answers


You can go crazy mocking every detail, but that's probably not feasible once you have many complicated test cases.

Where possible, it is better to record a complete real-world test input, redact it to get rid of irrelevant details, and then run it through your complete scraping engine. How to represent the site and how to replay it depends on the fidelity you need for these tests. E.g. this might be very difficult if the site is expected to make Ajax requests to multiple domains.

For example, you might get away with simply storing the HTML of a page you want to scrape. In extreme cases, you would want to log and replay all HTTP requests the site makes, i.e. the site makes a request and you replay a response recorded from the live site.
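
For the plain-HTTP case (no real browser), this record-and-replay style is often called "cassette" testing; in Python, the VCR.py library records real responses on the first run and replays them afterwards (the scraper entry point below is hypothetical):

```python
import vcr

from myscraper import scrape_search  # hypothetical scraper entry point

@vcr.use_cassette("fixtures/search_page.yaml")
def test_scrape_search_results():
    # First run: performs real HTTP requests and records them to the
    # cassette file. Later runs: replays the recording, no network needed.
    results = scrape_search("selenium")
    assert results[0]["title"] == "Expected first hit"
```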

In all cases, the thing you assert is that your scraper extracted the correct data from the page. How it gets there is going to be secondary.

The advantage of these high-level testing methods is that

  • they are quite realistic,
  • and once the test suite is set up, adding another testcase doesn't require much effort.

The disadvantage is that these tests are somewhat slow – much faster than doing actual requests to live sites, but still slower than targeted unit tests of your scraper.

Over time, you can grow a corpus of realistic test cases. If a test case stops being useful (e.g. because the live site changed), you can always throw it out.

amon
  • You're simply reiterating the options and their pros and cons that I outlined in the question. I'm specifically interested in whether practice showed that some of the ways are more practical than others, and what kinds of helper tools already exist (so I don't have to reinvent the wheel) to implement them. – ivan_pozdeev Feb 11 '18 at 04:55

I'm not sure what you mean by scraper here. When I develop UI automation using Selenium/Java, I define unit tests. Say I have a web application and I write Selenium tests to verify successful login, invalid login, etc. Then I'm really testing the existing, running web application, to be alerted when something goes down, breaks, or goes out of sync with what we expect.

On the other hand, I have written scrapers that weren't necessarily Selenium-based. I used basic HTTP calls to visit pages, get HTML back, extract data using regular expressions, and store it as CSV/JSON for later processing, say to collect hotel or shop prices.

Mocks in Java are for simulating dependencies so that unit tests run in isolation. Say we have an ATM operation: instead of inserting a real card and showing a real balance, we create a mock object for bankService that returns a predefined balance, insert a "mock" card with predefined data into the ATM, etc. The purpose of mocks is to unit test without delays. I don't understand what mocking a scraper's site means. The site will be built, but isn't in place yet? Then you write Selenium tests, define the proper element names, and enable the tests when the site is up and running.
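
In Python terms, that ATM example is roughly this (all names are hypothetical):

```python
from unittest.mock import Mock

from bank import Atm  # hypothetical class under test

def test_atm_shows_predefined_balance():
    bank_service = Mock()
    bank_service.get_balance.return_value = 100.0  # predefined mocked balance

    atm = Atm(bank_service)
    atm.insert_card(card_id="0000")  # the "mock" card with predefined data
    assert atm.show_balance() == 100.0
```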

I don't understand the purpose of a mock site, because developers usually run a local instance of the server for development. There are dev, QA, and production environments.

Working in a QA team, we were basically either covering what is there, so the tests check predefined scenarios and alert us when something goes down or is broken by a dev, or using them for test-driven development: you define a bunch of business rules, and as you implement them, more and more tests go green.

I think with Selenium you just test your UI part. It's an automated version of what manual testers do.

The server side is covered with unit tests and integration tests, with mocks in the unit tests to verify back-end behavior, but that has no relation to Selenium. Selenium is for UI tests.

Flamaker2018
  • [Web scraping](https://en.wikipedia.org/wiki/Web_scraping) is, by definition, extracting information from 3rd-party web sites. I do not need to test the website, I need to test the scraping code. For that, I need to provide the code with some mockup of the site since I can't rely on a site I have no control over to always return the same result for the same query. – ivan_pozdeev Feb 06 '18 at 01:17
  • I think scrapers are always updated according to site changes, because sites rename elements and move things around. So you might have a scheduled, maybe even nightly, "smoke test". "Unit tests" for the scraper will use saved predefined HTML files, and "integration tests" will be those that call the real site. If the integration tests fail, you will know that the site changed and what to update. If it can't find the element with id "username" for the login scenario, then you know they renamed it: you go download the file, save it as login.html for example, update your unit test, and you're fine again. – Flamaker2018 Feb 06 '18 at 03:17
  • @ivan_pozdeev: I don't see the point of unit testing a scraper at all. You will be unit testing against a fictitious page; as soon as the real page changes, you'll have to change the scraper anyway. If you write scrapers a lot, you will be relying heavily on libraries like HTML Agility Pack and custom methods you write yourself to do the heavy lifting; those are the methods that should be unit tested (if you so choose). The remainder of the scraper is unremarkable stuff like regexes and xpaths which are tested against the actual page *while you write the scraper.* – Robert Harvey Feb 06 '18 at 03:50
  • @RobertHarvey The point is _"to quickly check if a code change broke the scraper's internal logic"_. E.g. when I move chunks of code around. The logic is not straightforward, there are loops, branching, chunks common to a line of similar sites... – ivan_pozdeev Feb 06 '18 at 08:10

An idea given by @RobertHarvey in a comment above is to not mock the site at all.

If the goal is rather to test the scraper's internal logic, test exactly that:

  • Split off the code that directly implements elementary page operations into subroutines and mock those.
    • The idea is to make these subroutines as simple as possible (effectively, a glorified selector/XPath) so that they can do without tests of their own.

In the scheme above, that's one step up from the "User code -> Framework" hop: User code -> Elementary page operations -> Framework. A sketch follows.
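
A minimal sketch of that layering, which is essentially Selenium's Page Object pattern (all names are hypothetical):

```python
from unittest.mock import MagicMock

from selenium.webdriver.common.by import By

# Elementary page operations: thin wrappers, each a "glorified selector".
class LoginPage:
    def __init__(self, driver):
        self.driver = driver

    def enter_username(self, name):
        self.driver.find_element(By.ID, "username").send_keys(name)

    def submit(self):
        self.driver.find_element(By.XPATH, "//button[@type='submit']").click()

    def status_text(self):
        return self.driver.find_element(By.ID, "status").text

# User code under test talks to the page-operations layer only...
def log_in(page, name):
    page.enter_username(name)
    page.submit()
    return page.status_text()

# ...so the unit test mocks that layer instead of Selenium itself.
def test_log_in_returns_status():
    page = MagicMock()
    page.status_text.return_value = "Welcome"
    assert log_in(page, "alice") == "Welcome"
```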

ivan_pozdeev