
This question is different from "What are best practices for testing programs with stochastic behavior?" because it is specifically about regression testing.

We are building a chat bot, and for regression testing we plan to run the new system through all the conversational data we have received to date. We will then compare the output from the new system against the output from the old system, to see whether the new system has stopped answering any questions it was answering before.

We were having a discussion in which I argued that standard regression testing, where I define unit tests targeting features/capabilities of the product, will not work for applications that are non-deterministic in nature, especially when new variations keep coming in all the time.

Another approach would be to hand-craft test data corresponding to whatever model enhancements were made.

How is regression testing done in such cases, where a machine learning model is involved? What would be the ideal way of doing regression testing?

daltonfury42
  • What makes you think that other question is not about regression testing? – Doc Brown Sep 05 '19 at 12:28
  • Whether it's deterministic or not, your system (I hope) still has definite requirements. Testing must ensure that those requirements are met. If the spec is probabilistic, the test conditions will be probabilistic, but that doesn't change the nature and principles of testing. – Kilian Foth Sep 05 '19 at 13:14

1 Answer


If you have a stochastic system, you can still test it

  • (a) by removing the source of randomness, e.g. using an RNG with a fixed seed, or
  • (b) by using statistical tests, e.g. ensuring that the system produces the expected output at least 95% of the time (both options are sketched below).
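
As a minimal sketch of both options (in Python; the reply function is a hypothetical stand-in for the chatbot, not part of any real system):

    import random
    import unittest

    def reply(message, rng):
        # Hypothetical stand-in for the chatbot: picks one of several
        # acceptable greetings at random.
        return rng.choice(["Hi there!", "Hello!", "Hey!"])

    class StochasticReplyTest(unittest.TestCase):
        def test_fixed_seed_is_deterministic(self):
            # (a) With a fixed seed the run is reproducible, so an exact
            # expected output can be recorded once and asserted thereafter.
            first = reply("hello", random.Random(42))
            second = reply("hello", random.Random(42))
            self.assertEqual(first, second)

        def test_acceptable_reply_rate(self):
            # (b) Statistical test: an acceptable greeting must come back
            # in at least 95% of 1000 randomized runs.
            acceptable = {"Hi there!", "Hello!", "Hey!"}
            hits = sum(reply("hello", random.Random()) in acceptable
                       for _ in range(1000))
            self.assertGreaterEqual(hits / 1000, 0.95)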

Your proposed testing approach of running a large corpus of historical data through the system is pretty good, with two caveats:

  • This will take a very long time, making it unsuitable for quick tests during development; it is better suited to a pre-release QA stage.
  • You cannot reasonably verify that the correct output is produced. If the output doesn't match the recorded data, the new output might be better or worse, but you won't know which without human review.
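
One way to make that second caveat manageable is to treat the corpus run as producing a worklist of diffs for human review rather than a pass/fail result. A minimal sketch, assuming a JSON-lines corpus file and hypothetical old_bot/new_bot callables:

    import json

    def corpus_diff(corpus_path, old_bot, new_bot):
        # Replay every recorded message through both systems and collect
        # the cases where the answers diverge. A diff is not automatically
        # a regression -- the new answer may be better -- so the result is
        # a worklist for human review.
        diverging = []
        with open(corpus_path) as f:
            for line in f:
                record = json.loads(line)
                old = old_bot(record["message"])
                new = new_bot(record["message"])
                if old != new:
                    diverging.append((record["message"], old, new))
        return diverging

    # Typical use at the pre-release QA stage, not on every commit:
    # for msg, old, new in corpus_diff("conversations.jsonl", old_bot, new_bot):
    #     print(repr(msg), "was:", repr(old), "now:", repr(new))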

Therefore, explicitly testing specific features might be very sensible. If a feature of your chatbot is that it can tell you what the weather is, do write explicit tests that ask the chatbot for the weather. For these feature-level tests you will likely want to mock external data sources so that the test is repeatable. Such isolated tests can run fairly quickly, and have a complementary goal to corpus-based tests.
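
Such a feature-level test might look like the following sketch; the Chatbot class and its weather dependency are hypothetical stand-ins, included only to make the example self-contained:

    import unittest
    from unittest.mock import Mock

    class Chatbot:
        # Hypothetical minimal bot: looks up the weather for a hard-coded
        # location, just to keep the sketch runnable.
        def __init__(self, weather):
            self.weather = weather

        def reply(self, message):
            if "weather" in message.lower():
                w = self.weather.current("Anchorage")
                return f"The weather in Anchorage is {w['temperature']}°{w['units']}"
            return "Sorry, I don't understand."

    class WeatherFeatureTest(unittest.TestCase):
        def test_weather_question(self):
            # Mock the external data source so the test is repeatable
            # regardless of the real weather.
            weather_service = Mock()
            weather_service.current.return_value = {"temperature": 12, "units": "C"}

            bot = Chatbot(weather=weather_service)
            answer = bot.reply("What's the weather in Anchorage?")

            # Assert on the essentials, not the exact phrasing.
            self.assertIn("12", answer)
            weather_service.current.assert_called_once_with("Anchorage")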

You may also find that verifying exact responses is too fragile. If your system has a structured intermediate representation of the responses, verifying that representation might be a better idea. For example, instead of checking the natural-language sentence The weather in Anchorage is 12°C, you might check a JSON object {"type": "weather_response", "time": "now", "location": "Anchorage", "temperature": 12, "temperature_units": "C"}. Of course, this kind of technique is not applicable for ML approaches that do not produce such an intermediate representation.
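
A test against that intermediate representation could look like this sketch, where parse_intent is a hypothetical, stubbed entry point into the pipeline before the natural-language generation step:

    def parse_intent(message):
        # Stub that always returns one fixed representation, just to make
        # the sketch runnable; the real system would analyze the message.
        return {
            "type": "weather_response",
            "time": "now",
            "location": "Anchorage",
            "temperature": 12,
            "temperature_units": "C",
        }

    def test_weather_intent():
        result = parse_intent("What is the weather in Anchorage right now?")
        # Assert on structured fields instead of the generated sentence;
        # rewording the reply template does not break this test.
        assert result["type"] == "weather_response"
        assert result["location"] == "Anchorage"
        assert result["temperature_units"] == "C"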

amon