I am a lecturer for a post-graduate module where I expect my students to write Python code that replicates examples from the textbook. The project has been running for a couple of years and this year I want to introduce more automated testing. The problem is writing tests for the functions which primarily produce a matplotlib figure. Note that we are only trying to replicate the figures approximately, so binary comparison with a target image won't be very useful.
The short question is therefore: what strategies can I use to test programs whose primary graphical output cannot be compared against a strict reference image?
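For concreteness, the kind of non-pixel check I can imagine writing asserts on the Figure object itself rather than on rendered pixels. A minimal sketch, where `plot_example_3_2` is a hypothetical student function that is assumed to return the Figure it draws:

```python
# Sketch of structural checks on a returned Figure instead of pixel comparison.
# `examples.chapter3.plot_example_3_2` is hypothetical, not part of the code base.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: nothing pops up on screen

import numpy as np


def test_plot_example_3_2_structure():
    from examples.chapter3 import plot_example_3_2  # hypothetical module

    fig = plot_example_3_2()

    # One Axes containing one plotted line, as in the textbook figure.
    assert len(fig.axes) == 1
    ax = fig.axes[0]
    assert len(ax.lines) == 1

    # Check properties of the plotted data rather than the rendered image.
    x, y = ax.lines[0].get_data()
    assert len(x) == len(y) > 0
    assert np.all(np.isfinite(y))

    # Axis labels are part of replicating the figure "approximately".
    assert ax.get_xlabel() != ""
```

Checks like these verify structure (number of axes and lines, labels, finite data) without caring about exact appearance, but they do require the plotting functions to return their figures.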
Some of the problems with the existing code base that prevent automated testing in this case:
- Producing the graph stops execution in many cases (see the sketch after this list for how I think this could be handled in the test harness).
- It is not practical to reproduce the textbook figure exactly, with its annotations and so on, so algorithmic image comparison is unlikely to be the answer.
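The blocking problem in the first bullet looks like something I could work around in the test harness rather than in the student code, by forcing a non-interactive backend and stubbing out `plt.show`. A minimal `conftest.py` sketch (the fixture is my own invention; nothing like it exists in the code base yet):

```python
# conftest.py (sketch): neutralise blocking show() calls during test runs.
import matplotlib
matplotlib.use("Agg")  # choose a non-interactive backend before pyplot loads

import matplotlib.pyplot as plt
import pytest


@pytest.fixture(autouse=True)
def non_blocking_figures(monkeypatch):
    """Make plt.show() a no-op so every script runs to completion."""
    monkeypatch.setattr(plt, "show", lambda *args, **kwargs: None)
    yield
    plt.close("all")  # don't let figures accumulate between tests
```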
To clarify my goals:
- I would like to be able to execute all the code in the codebase to check that it actually runs, even if that means throwing away the output. This would catch regressions where a change to a function stops the code from running (see the smoke-test sketch after this list).
- Rather than investing heavily in fuzzy matching of the graphical output against the target, I believe a visual check of the generated image against the reference image is probably the simplest approach. That check should be deferred to a single pass at the end of the run rather than happening during the run (see the gallery sketch after this list).
- Since this is a collaborative project, I don't have to assume that the students are going to be adversarial. Mistakes will be good-faith rather than perverse.
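For the first goal, the simplest thing I can think of is a parametrised smoke test that runs every example headlessly and only asserts that it completes. A sketch, assuming the examples live as scripts under an `examples/` directory; that layout is my assumption, not a description of the current repository:

```python
# test_smoke.py (sketch): run every example script headlessly and discard the
# output; each test only checks that the script completes without raising.
import pathlib
import runpy

import matplotlib
matplotlib.use("Agg")  # render off-screen so nothing blocks or pops up

import matplotlib.pyplot as plt
import pytest

EXAMPLE_DIR = pathlib.Path(__file__).parent / "examples"  # assumed layout
SCRIPTS = sorted(EXAMPLE_DIR.glob("**/*.py"))


@pytest.mark.parametrize("script", SCRIPTS, ids=lambda p: p.name)
def test_example_script_runs(script):
    runpy.run_path(str(script), run_name="__main__")
    plt.close("all")  # throw the figures away; we only care that it ran
```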
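For the deferred visual check, my current idea is for each test to park its figure in a gallery directory and then, once at the end of the pytest session, write a single HTML page that shows each generated figure next to its textbook reference so everything can be eyeballed in one pass. Another sketch (the directory names and the `save_for_review` helper are hypothetical):

```python
# conftest.py (sketch): collect figures during the run, review them once at the end.
import pathlib

GALLERY = pathlib.Path("gallery")        # figures generated during this run
REFERENCES = pathlib.Path("references")  # scanned target figures from the textbook


def save_for_review(fig, name):
    """Called from individual tests to park a figure for later human inspection."""
    GALLERY.mkdir(exist_ok=True)
    fig.savefig(GALLERY / f"{name}.png", dpi=100)


def pytest_sessionfinish(session, exitstatus):
    """After the whole run, write one side-by-side page for a single visual pass."""
    rows = []
    for generated in sorted(GALLERY.glob("*.png")):
        reference = REFERENCES / generated.name
        ref_tag = (
            f'<img src="{reference.as_posix()}" width="45%">'
            if reference.exists()
            else "(no reference image)"
        )
        rows.append(
            f"<h2>{generated.stem}</h2>\n"
            f'<img src="{generated.as_posix()}" width="45%"> {ref_tag}'
        )
    pathlib.Path("review.html").write_text("\n".join(rows))
```

The reviewer then opens `review.html` once after the run, instead of dismissing figures one by one while it is still going.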