I am a lecturer for a post-graduate module where I expect my students to write Python code that replicates examples from the textbook. The project has been running for a couple of years and this year I want to introduce more automated testing. The problem is writing tests for the functions which primarily produce a matplotlib figure. Note that we are only trying to replicate the figures approximately, so binary comparison with a target image won't be very useful.
The short question is therefore: what strategies can I use to test programs whose primary graphical output cannot be compared against a strict reference image?
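For concreteness, the kind of non-pixel check I can imagine writing asserts on the Figure object itself rather than on rendered pixels. A minimal sketch, where `plot_example_3_2` is a hypothetical student function that is assumed to return the Figure it draws:

```python
# Sketch of structural checks on a returned Figure instead of pixel comparison.
# `examples.chapter3.plot_example_3_2` is hypothetical, not part of the code base.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: nothing pops up on screen

import numpy as np


def test_plot_example_3_2_structure():
    from examples.chapter3 import plot_example_3_2  # hypothetical module

    fig = plot_example_3_2()

    # One Axes containing one plotted line, as in the textbook figure.
    assert len(fig.axes) == 1
    ax = fig.axes[0]
    assert len(ax.lines) == 1

    # Check properties of the plotted data rather than the rendered image.
    x, y = ax.lines[0].get_data()
    assert len(x) == len(y) > 0
    assert np.all(np.isfinite(y))

    # Axis labels are part of replicating the figure "approximately".
    assert ax.get_xlabel() != ""
```

Checks like these verify structure (number of axes and lines, labels, finite data) without caring about exact appearance, but they do require the plotting functions to return their figures.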
Some of the problems with the existing code base that prevent automated testing in this case:
- Producing the graph stops execution in many cases (see the sketch after this list for how I think this could be handled in the test harness).
- It is not practical to reproduce the textbook figure exactly, with its annotations and so on, so algorithmic image comparison is unlikely to be the answer.
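The blocking problem in the first bullet looks like something I could work around in the test harness rather than in the student code, by forcing a non-interactive backend and stubbing out `plt.show`. A minimal `conftest.py` sketch (the fixture is my own invention; nothing like it exists in the code base yet):

```python
# conftest.py (sketch): neutralise blocking show() calls during test runs.
import matplotlib
matplotlib.use("Agg")  # choose a non-interactive backend before pyplot loads

import matplotlib.pyplot as plt
import pytest


@pytest.fixture(autouse=True)
def non_blocking_figures(monkeypatch):
    """Make plt.show() a no-op so every script runs to completion."""
    monkeypatch.setattr(plt, "show", lambda *args, **kwargs: None)
    yield
    plt.close("all")  # don't let figures accumulate between tests
```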
To clarify my goals:
- I would like to be able to execute all the code in the codebase to check that it actually runs, even if that means throwing away the output. This would catch regressions where a change to a function stops the code from running (see the smoke-test sketch after this list).
- Rather than investing heavily in fuzzy matching of the graphical output against the target, I believe a visual check of the generated image against the reference image is probably the simplest approach. That check should be deferred to a single pass at the end of the run rather than happening during the run (see the gallery sketch after this list).
- Since this is a collaborative project, I don't have to assume that the students are going to be adversarial. Mistakes will be good-faith rather than perverse.
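For the first goal, the simplest thing I can think of is a parametrised smoke test that runs every example headlessly and only asserts that it completes. A sketch, assuming the examples live as scripts under an `examples/` directory; that layout is my assumption, not a description of the current repository:

```python
# test_smoke.py (sketch): run every example script headlessly and discard the
# output; each test only checks that the script completes without raising.
import pathlib
import runpy

import matplotlib
matplotlib.use("Agg")  # render off-screen so nothing blocks or pops up

import matplotlib.pyplot as plt
import pytest

EXAMPLE_DIR = pathlib.Path(__file__).parent / "examples"  # assumed layout
SCRIPTS = sorted(EXAMPLE_DIR.glob("**/*.py"))


@pytest.mark.parametrize("script", SCRIPTS, ids=lambda p: p.name)
def test_example_script_runs(script):
    runpy.run_path(str(script), run_name="__main__")
    plt.close("all")  # throw the figures away; we only care that it ran
```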
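For the deferred visual check, my current idea is for each test to park its figure in a gallery directory and then, once at the end of the pytest session, write a single HTML page that shows each generated figure next to its textbook reference so everything can be eyeballed in one pass. Another sketch (the directory names and the `save_for_review` helper are hypothetical):

```python
# conftest.py (sketch): collect figures during the run, review them once at the end.
import pathlib

GALLERY = pathlib.Path("gallery")        # figures generated during this run
REFERENCES = pathlib.Path("references")  # scanned target figures from the textbook


def save_for_review(fig, name):
    """Called from individual tests to park a figure for later human inspection."""
    GALLERY.mkdir(exist_ok=True)
    fig.savefig(GALLERY / f"{name}.png", dpi=100)


def pytest_sessionfinish(session, exitstatus):
    """After the whole run, write one side-by-side page for a single visual pass."""
    rows = []
    for generated in sorted(GALLERY.glob("*.png")):
        reference = REFERENCES / generated.name
        ref_tag = (
            f'<img src="{reference.as_posix()}" width="45%">'
            if reference.exists()
            else "(no reference image)"
        )
        rows.append(
            f"<h2>{generated.stem}</h2>\n"
            f'<img src="{generated.as_posix()}" width="45%"> {ref_tag}'
        )
    pathlib.Path("review.html").write_text("\n".join(rows))
```

The reviewer then opens `review.html` once after the run, instead of dismissing figures one by one while it is still going.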