130

I frequently work with heavily numerical / mathematical programs, where the exact result of a function is difficult to predict in advance.

In trying to apply TDD with this kind of code, I often find writing the code under test significantly easier than writing unit tests for that code, because the only way I know to find the expected result is to apply the algorithm itself (whether in my head, on paper, or by the computer). This feels wrong, because I am effectively using the code under test to verify my unit tests, instead of the other way around.

Are there known techniques for writing unit tests and applying TDD when the result of the code under test is difficult to predict?

A (real) example of code with difficult to predict results:

A function weightedTasksOnTime that, given an amount of work done per day workPerDay in range (0, 24], the current time initialTime > 0, and a list of tasks taskArray, each with a time-to-complete property time > 0, a due date due, and an importance value importance, returns a normalized value in range [0, 1] representing the importance of tasks that can be completed before their due date if each task is completed in the order given by taskArray, starting at initialTime.

The algorithm to implement this function is relatively straightforward: iterate over the tasks in taskArray. For each task, add its time, scaled by the inverse of workPerDay, to the running time (starting at initialTime). If the new time is less than the task's due date, add its importance to an accumulator. Before returning the accumulator, divide it by the sum of all task importances to normalize.

function weightedTasksOnTime(workPerDay, initialTime, taskArray) {
    let simulatedTime = initialTime;
    let accumulator = 0;
    for (const task of taskArray) {
        // A task's duration in calendar time grows as the work done per day shrinks.
        simulatedTime += task.time * (24 / workPerDay);
        if (simulatedTime < task.due) {
            accumulator += task.importance;
        }
    }
    // totalImportance sums the importance of every task in the array (helper not shown).
    return accumulator / totalImportance(taskArray);
}

I believe the above problem can be simplified, while maintaining its core, by removing workPerDay and the normalization requirement, to give:

function weightedTasksOnTime(initialTime, taskArray) {
    let simulatedTime = initialTime;
    let accumulator = 0;
    for (const task of taskArray) {
        simulatedTime += task.time;
        if (simulatedTime < task.due) {
            accumulator += task.importance;
        }
    }
    return accumulator;
}

This question addresses situations where the code under test is not a re-implementation of an existing algorithm. If code is a re-implementation, it intrinsically has easy to predict results, because existing trusted implementations of the algorithm act as a natural test oracle.

Robert Harvey
JustinLovinger
  • 4
    Can you provide a simple example of a function whose result is difficult to predict? – Robert Harvey Oct 07 '18 at 22:22
  • 1
    I have added an example based on a real project. It is not the simplest example, and I may be able to reduce it by removing some of the aspects and arguments while still maintaining a function with difficult to predict results, but I worry about oversimplifying because this question is specifically about hard to test code. – JustinLovinger Oct 07 '18 at 23:01
  • 62
    FWIW you aren't testing the algorithm. Presumably that is correct. You are testing the implementation. Working out by hand is often fine as a parallel construction. – Kristian H Oct 08 '18 at 00:45
  • 11
    Possible duplicate of [How do you write unit tests when you need the implementation to come up with examples?](https://softwareengineering.stackexchange.com/questions/312132/how-do-you-write-unit-tests-when-you-need-the-implementation-to-come-up-with-exa) – Doc Brown Oct 08 '18 at 06:11
  • 3
    @DocBrown The linked question and most of its answers appear to address the specific case of re-implementing an existing algorithm. I would argue that code based on an existing algorithm intrinsically has easy to predict results, because existing trusted implementations of the algorithm act as a natural test oracle. – JustinLovinger Oct 08 '18 at 07:16
  • 3
    NP problems cannot be easily verified by definition. You would literally need to brute force the optimal solution and confront it with what your algorithm returns for you to guarantee optimal results (don't do this, just demonstrating the absurdity of it). However you *can* check what you *know* is wrong if you saw it. – Neil Oct 08 '18 at 07:56
  • 3
    It seems you misunderstood the top answer in the linked question: it suggests to **create a separate implementation for the purpose of testing**, using a different toolset like a spreadsheet (or even better: let someone else do it), and compare the results of both implementations. That is IMHO a really good answer to your question, better than "just compare degenerate cases" or "cases you can calculate manually". – Doc Brown Oct 08 '18 at 08:36
  • ... and I am pretty sure, for the example you gave in your question, that would work pretty well. – Doc Brown Oct 08 '18 at 08:43
  • @DocBrown Sorry for the misunderstanding. I didn't mean to imply that none of the answers in the linked question are relevant to this question. I only meant that, because the linked question provides the specific example of re-implementing an existing algorithm, most (but not all) of the answers are framed in terms of re-implementing an existing algorithm. – JustinLovinger Oct 08 '18 at 08:46
  • @DocBrown Regarding the specific suggestion of implementing the algorithm twice. While comparing multiple implementations can serve to test code, it is very non-rigorous and is time consuming. It is certainly an option, but I do not think it is the end-all and be-all of the problem of testing code with difficult to predict results. – JustinLovinger Oct 08 '18 at 08:51
  • 7
    There are situations where an algorithm cannot be reasonably unit tested - for example if its execution time is multiple days/months. This may happen when solving NP Problems. In these cases, it *may* be more feasible to provide a formal *proof* that the code is correct. – Hulk Oct 08 '18 at 09:13
  • 3
    "it is very non-rigorous and is time consuming" - well, you asked for a hard problem, don't expect an easy answer ;-) If you have code where the results are difficult to predict, you have to work hard on getting at least some predictable results. You can do this using "pencil and paper" (which is even more time consuming),or stick to degenerate cases (which is even less rigorous), or use tools like a spreadsheet to recalculate at least some input sets. That is what that other answer is about. – Doc Brown Oct 08 '18 at 11:50
  • 12
    Something I've seen in very tricky numeric code is to treat unit tests only as regression tests. Write the function, run it for several interesting values, validate results manually, then write the unit test to catch regressions from the expected result. Coding horror? Curious what others think. – Chuu Oct 08 '18 at 13:36
  • 1
    I once worked on a system where the results were ALWAYS wrong. Simple things like average number or min/max were always off by 1 (code was translated from BASIC to C++ without altering bounds). The code could not be fixed as customers results from the previous 10 years would change, and that's not good for their record keeping. So, to unit test that, all you can do is make sure that the results you get today are the same ones you got yesterday. Run the tests, note down the results and that's what you test for. – Neil Oct 08 '18 at 14:53
  • 1
    @Neil It is easier, by definition, to verify the result of NP-hard problems than to solve them, and writing test code to do that is a viable solution. The test code, of course, should itself be verified, but you can usually do that with cases small enough to be tractable. – sdenham Oct 08 '18 at 15:47
  • 2
    The only way to verify correctness is to use something other than the implementation to confirm correctness of results. There's no getting around that. Whether you automate verification or manually verify doesn't matter. PreCompute or PostCompute results doesn't matter. If you use results of your implementation to verify the implementation then you aren't testing anything as you don't know if the implementation results are correct or not. They may appear correct, but it doesn't mean they are without verification. – Dunk Oct 08 '18 at 22:27
  • ...which means your 'feeling that it is wrong' that using the code under test to verify your unit tests is absolutely correct. All you are doing by taking that approach is verifying that your implementation is giving the results that your implementation says is the answer, regardless of what the correct answer really is. It doesn't verify correctness at all. – Dunk Oct 08 '18 at 22:35
  • 1
    "I often find writing the code under test significantly easier than writing unit tests for that code" - this is generally the case. Writing a function to do X is often relatively easy. However, code is deterministic and its result should be predictable. If there are random influences or time dependencies you may need to refactor your code to allow it to be tested in a repeatable way. For example, if you have a function whose behavior changes based on time-of-day, then the time of day may need to be an input parameter. – Bob Jarvis - Слава Україні Oct 09 '18 at 18:31
  • @DocBrown Having **someone else** do the second implementation is a good way to find ambiguities in the specification. – kutschkem Oct 10 '18 at 15:05
  • not sure if someone's mentioned it already, but https://hypothesis.works/ is meant to be a specialised suite intended to help with 'unit testing' of numerical / scientific code. – Tasos Papastylianou Oct 10 '18 at 15:43
  • 1
    @TasosPapastylianou More generally known as property-based testing. Most languages have a library or two. – JustinLovinger Oct 11 '18 at 10:31

15 Answers

257

There are two things you can test in difficult-to-test code. First, the degenerate cases. What happens if you have no elements in your task array, or only one, or two but one is past the due date, etc. Anything that is simpler than your real problem, but still reasonable to calculate manually.

The second is the sanity checks. These are the checks you do where you don't know if an answer is right, but you definitely would know if it's wrong. These are things like time must move forward, values must be in a reasonable range, percentages must add up to 100, etc.

Yes, this isn't as good as a full test, but you'd be surprised how often messing up on the sanity checks and degenerate cases reveals a problem in your full algorithm.
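
For instance, against the simplified two-argument weightedTasksOnTime from the question, a few degenerate cases and one sanity check might look like this (a minimal sketch using Node's built-in assert; the specific task values are made up):

const assert = require('assert');

// Degenerate case: no tasks, so nothing can accumulate.
assert.strictEqual(weightedTasksOnTime(1, []), 0);

// Degenerate case: a single task that easily finishes before its due date.
assert.strictEqual(weightedTasksOnTime(1, [{ time: 1, due: 10, importance: 5 }]), 5);

// Degenerate case: a single task that is already past its due date at the start.
assert.strictEqual(weightedTasksOnTime(100, [{ time: 1, due: 10, importance: 5 }]), 0);

// Sanity check: whatever the input, the result can never exceed the total importance.
const tasks = Array.from({ length: 50 }, () => ({
    time: Math.random() * 10 + 0.1,
    due: Math.random() * 100,
    importance: Math.random() * 5
}));
const total = tasks.reduce((sum, t) => sum + t.importance, 0);
const result = weightedTasksOnTime(1, tasks);
assert.ok(result >= 0 && result <= total);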

Karl Bielefeldt
  • 54
    Think this is very good advice. Start by writing these sort of unit tests. As you develop the software, if you find bugs or incorrect answers - add those as unit tests. Do the same, to some extent, when you find definitely correct answers. Build them up over time, and you (eventually) will have a very complete set of unit tests despite starting out not knowing what they were going to be ... – Algy Taylor Oct 08 '18 at 11:05
  • 21
    Another thing that might be helpful in some cases (though perhaps not this one) is to write an inverse function and test that, when chained, your input and output are the same. – Cyberspark Oct 08 '18 at 11:47
  • 2
    And you test the inverse function by writing the inverse inverse function ? – iFlo Oct 08 '18 at 11:49
  • 7
    sanity check often make good targets for property based tests with something like QuickCheck – jk. Oct 08 '18 at 12:45
  • 10
    the one other category of tests I'd recommend are a few to check for unintentional changes in output. You can 'cheat' on these by using the code itself to generate the expected result since the intent of these is to help maintainers by flagging that something intended as an output neutral change unintentionally did affect algorithmic behavior. – Dan Is Fiddling By Firelight Oct 08 '18 at 12:58
  • 5
    @iFlo Not sure if you were joking, but the inverse inverse already exists. Worth realizing that the test failing might be a problem in the inverse function though – lucidbrot Oct 08 '18 at 14:55
  • Regarding inverse, basically all mathematical problems where you are looking for *a* solution (rather than an *optimal* solution) have a much easier way to check whether a given solution is valid than actually finding that solution. For an optimal solution, turn to metamorphic relations (take a problem, and a slightly modified problem known to have a simpler solution or known not to have the original solution). – Jan Hudec Oct 10 '18 at 20:11
  • Also, you may find that after you get these basic tests down that you can say "okay let's take this one test and make its test case just a little more complicated." Remember, tests don't exist for the sake of having tests. They exist so you can quickly catch bugs! – corsiKa Oct 11 '18 at 21:52
81

I used to write tests for scientific software with difficult-to-predict outputs. We made a lot of use of Metamorphic Relations. Essentially there are things you know about how your software should behave even if you don't know exact numerical outputs.

A possible example for your case: if you decrease the amount of work you can do each day then the total amount of work you can do will at best stay the same, but likely decrease. So run the function for a number of values of workPerDay and make sure the relation holds.
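
A sketch of such a metamorphic test against the question's original three-argument function (assuming the totalImportance helper it relies on exists; the task values are arbitrary):

const assert = require('assert');

const tasks = [
    { time: 2, due: 10, importance: 3 },
    { time: 5, due: 20, importance: 1 },
    { time: 1, due: 12, importance: 4 }
];

// Metamorphic relation: doing less work per day can never get *more* importance done on time.
let previous = weightedTasksOnTime(24, 1, tasks);
for (const workPerDay of [16, 8, 4, 2, 1]) {
    const current = weightedTasksOnTime(workPerDay, 1, tasks);
    assert.ok(current <= previous, `result increased when workPerDay dropped to ${workPerDay}`);
    previous = current;
}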

  • 32
    Metamorphic relations a specific example of [property-based testing](https://hypothesis.works/articles/what-is-property-based-testing/), which is in general a useful tool for situations like these – Dan Oberlam Oct 08 '18 at 22:29
38

The other answers have good ideas for developing tests for edge or error cases. For the remaining cases, using the algorithm itself is not ideal (obviously) but still useful.

It will detect if the algorithm (or data it depends on) has changed

If the change is an accident, you could roll back a commit. If the change was deliberate, you need to revisit the unit test.
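
Concretely, that usually means pinning the current output for a few representative inputs, after checking each pinned value by hand once (a sketch against the simplified two-argument version from the question):

const assert = require('assert');

// Pinned ("golden") value: produced by the current implementation and reviewed by hand once.
// If this fails, either a regression was introduced or the change was deliberate and the
// pinned value needs to be re-reviewed and updated.
const tasks = [
    { time: 2, due: 5, importance: 3 },   // finishes at 3, on time
    { time: 4, due: 9, importance: 2 },   // finishes at 7, on time
    { time: 3, due: 8, importance: 5 }    // finishes at 10, late
];
assert.strictEqual(weightedTasksOnTime(1, tasks), 5);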

user949300
  • 6
    And for the record, these kind of tests are often called "regression tests" as per their purpose, and are basically a safety net for any modification / refactoring. – Pac0 Oct 13 '18 at 14:16
21

The same way you write unit tests for any other kind of code:

  1. Find some representative test cases, and test those.
  2. Find edge cases, and test those.
  3. Find error conditions, and test those.

Unless your code involves some random element or is not deterministic (i.e. it won't produce the same output given the same input), it is unit testable.

Avoid side-effects, or functions that are affected by outside forces. Pure functions are easier to test.

Robert Harvey
  • The question is 'how' do you test those cases without using the algorithm itself to find the expected output? There are also situations where edge cases could be difficult to find. – JustinLovinger Oct 07 '18 at 23:12
  • 1
    Maybe it is possible to verify the output after calculating it, or to establish properties that the output should have and check that these properties are there. – gnasher729 Oct 08 '18 at 00:22
  • @gnasher729 I suspect that testing properties instead of directly testing output is the general solution to this problem. However, that has the analogous problem of testing code with difficult to discover properties, which I find tends to be the case (at least for me) when a function reduces its arguments (such as in the example above, where an array is reduced to a single number). – JustinLovinger Oct 08 '18 at 00:32
  • 2
    For non-deterministic algorithms you can save the seed of the RNG or mock it, either using a fixed sequence or a low-discrepancy deterministic series, e.g. a [Halton sequence](https://en.wikipedia.org/wiki/Halton_sequence) – wondra Oct 08 '18 at 05:49
  • 14
    @PaintingInAir If it's impossible to verify the algorithm's output, can the algorithm even _be_ incorrect? – WolfgangGroiss Oct 08 '18 at 09:04
  • 5
    `Unless your code involves some random element` The trick here is to make your random number generator an injected dependency, so you can then replace it for a number generator which gives the exact result you want it to. This enables you to accurately test again - counting the generated numbers as input parameters as well. `not deterministic (i.e. it won't produce the same output given the same input)` Since a unit test should start from a _controlled_ situation, it can only be non-deterministic if it has a random element - which you can then inject. I can't think of other possibilities here. – Flater Oct 08 '18 at 09:06
  • @WolfgangGroiss Difficult to verify is not the same as impossible. This question is largely about efficient development and minimizing the time required to test and implement code. – JustinLovinger Oct 08 '18 at 09:06
  • @PaintingInAir: I'm not saying efficiency doesn't matter at all; but you should place **correctness** (= reliability of the test) over efficiency. In other words, it's better to first do a slow-but-steady test and then see _if_ it can be optimized, rather than compromising the reliability of the test because of premature optimization. – Flater Oct 08 '18 at 09:09
  • @wondra: Saving seeds is dangerous - if your implementation changes at some point (e.g. refactoring), the order of calls may change, which will likely fail the test even though the new implementation is still correct according to the requirements. It's not guaranteed to change/break, but I see no way to reliably avoid it changing/breaking either. – Flater Oct 08 '18 at 09:11
  • @Flater Sorry for the misunderstanding. By efficient development, I meant writing code quickly, not writing code that runs fast. As in, it is better to write clean code that works in 3 hours, than it is to write the same code in 12 hours. And good testing techniques can help. – JustinLovinger Oct 08 '18 at 09:16
  • 3
    @PaintingInAir: Either or. My comment applies to both fast execution or fast test writing. If it takes you three days to calculate a single example by hand (let's assume you use the fastest method available that is not using the code) - then three days is what it shall take. If you instead based your expected test outcome on the actual code itself, then the test is compromising itself. That's like doing `if(x == x)`, it's a pointless comparison. You need your two outcomes (**actual**: comes from the code ; **expected**: comes from your external knowledge) to be independent of one another. – Flater Oct 08 '18 at 09:20
  • 2
    It is still unit testable even if not deterministic, providing it complies with specifications and that compliance can be measured (e.g. distribution and spread for random) It may just require a great many samples to eliminate the risk of anomaly. – mckenzm Oct 09 '18 at 02:46
  • @Flater Strongly disagree. Making things injected doesn't magically make everything testable. If your function randomly selects elements from a list, you don't want your test to use fakes to hard code expectations about what the RNG returns. Ideally, you would invert the problem and make the selection an *argument* to whatever thing processes the selection and test that thing, but then you don't need an RNG or a fake to test and you still can't test the selection itself in any way that would actually verify a single output completely. – jpmc26 Oct 11 '18 at 00:10
  • @jpmc26 That would require custom backdoors to the code for every tested method with a random component. That parameter's existence would be pointless in a non-testing runtime and is thus irrelevant - we should be testing the code as is, not develop a secondary path meant for testing. Dependency injection, on the other hand, is a simple and reusable pattern that allows you the same option without having to tailor your backdoors every time. Given a predetermined list and a predetermined "random" number, the outcome of the random pick can be deterministically confirmed. – Flater Oct 11 '18 at 06:48
  • @Flater "Dependency injection, on the other hand, is a simple..." LOL! No. Pervasive DI is spaghetti code and leads to propagation of multiple references to the same state. When you use it as the go-to solution to separating concerns, it leads to endless boiler plate hierarchies that divide logic that is much simpler when kept together. No, a more *functionally minded* procedural approach that leverages parameters makes for much easier to follow code and leads away from objects that hold persistent state. (You can still put those methods in stateless singleton objects to get late binding.) – jpmc26 Oct 11 '18 at 07:15
  • @Flater Also, DI is by definition a "back door" since the method calls the dependency behind the scenes, while parameterization is a front door that lets you just pass in an intermediate result directly. Notably, the parameter doesn't require any extra boiler plate interfaces, meaning you jump through fewer hoops to reuse it in different situations. – jpmc26 Oct 11 '18 at 07:21
  • @jpmc26 Generating a random number is a different (and reusable) responsibility and should not be bound to (the responsibility of) a particular method that for some reason needs a random number. What you're telling me is that if you have multiple classes which all rely on RNG, that you're going to develop the RNG separately in each class? That is _clearly_ avoiding reusability. – Flater Oct 11 '18 at 07:24
  • @jpmc26 My issue wasn't with it being a back door in and of itself. I agree that DI is a form of backdoor. My issue is that it (1) adds a parameter that needs to be used during testing and should not be used in production, which means you're going to end up in all sorts of conflicts and (2) is custom-made for this particular implementation, thus requiring effort to tailor the backdoor to the specific thing you want to test. (3) Also, when you pass the object that needs to be selected, you are cutting out the actual selection process which is the thing you're supposed to be testing. – Flater Oct 11 '18 at 07:25
  • @Flater Firstly, we're talking primarily about segregating the random generation from whatever post-processing you do on the result of it. Secondly, the odds of me having to change to a different random number generator are beyond slim. Thirdly, the RNG itself is one of those rare instances where you *want* the entire application to be using the same one to maximize randomness. It should be a static, global singleton that's accessed directly so devs aren't tempted to pass in a different instance; otherwise, you'll *lose* randomness because different devs will new up different ones. – jpmc26 Oct 11 '18 at 07:28
  • @jpmc26 You clearly don't like mocking objects. The subsequent discussion is futile. I cannot tersely explain all the issues with globalizing your objects, omitting that singletons can still be injected, writing secondary code meant to run the test which does not actually test the original code, and trying to avoid reusability and instead write code that needs to be tailor made to every situation yet provide no improvement on the reusable approach. This is a rabbithole of bad decisions, each one trying to cover for the last one's issues. – Flater Oct 11 '18 at 07:31
  • @Flater And I'm saying you can nearly always avoid all that by ensuring the logic that needs to be unit tested is separated from those dependencies by having it take results as an input and hoisting the coordination between the two pieces up a level to a caller. 100% *unit* coverage is a fools' errand and actively makes your code worse. My dislike for DI and mocking objects comes from actual experience seeing them go horribly wrong and twisting the code base into a highly coupled system that's difficult to reuse anything from, not pie in the sky idealism that has no basis in reality. – jpmc26 Oct 21 '18 at 23:52
17

Update due to posted comments

The original answer was removed for brevity's sake - you can find it in the edit history.

PaintingInAir commented: "For context: as an entrepreneur and academic, most of the algorithms I design are not requested by anyone other than myself. The example given in the question is part of a derivative-free optimizer to maximize the quality of an ordering of tasks. In terms of how I described the need for the example function internally: 'I need an objective function to maximize the importance of tasks that are completed on time'. However, there still seems to be a large gap between this request and the implementation of unit tests."

First, a TL;DR to avoid an otherwise lengthy answer:

Think of it this way:
A customer enters McDonald's, and asks for a burger with lettuce, tomato and hand soap as toppings. This order is given to the cook, who makes the burger exactly as requested. The customer receives this burger, eats it, and then complains to the cook that this is not a tasty burger!

This is not the cook's fault - he's only doing what the customer explicitly asked. It's not the cook's job to check whether the requested order is actually tasty. The cook simply creates what the customer orders. It's the customer's responsibility to order something that they find tasty.

Similarly, it's not the developer's job to question the correctness of the algorithm. Their only job is to implement the algorithm as requested.
Unit testing is a developer's tool. It confirms that the burger matches the order (before it leaves the kitchen). It does not (and should not) try to confirm that the ordered burger is actually tasty.

Even if you are both the customer and the cook, there is still a meaningful distinction between:

  • I did not prepare this meal properly, it was not tasty (= cook error). A burnt steak is never going to taste good, even if you like steak.
  • I prepared the meal properly, but I don't like it (= customer error). If you don't like steak, you'll never like eating steak, even if you cooked it to perfection.

The main issue here is that you're not making a separation between the customer and the developer (and the analyst - though that role can be represented by a developer as well).

You need to distinguish between testing the code, and testing the business requirements.

For example, the customer wants it to work like [this]. However, the developer misunderstands, and he writes code that does [that].

The developer will therefore write unit tests that test if [that] works as expected. If he developed the application correctly, his unit tests will pass even though the application doesn't do [this], which the customer was expecting.

If you want to test the customer's expectations (the business requirements), that needs to be done in a separate (and later) step.

A simple development workflow to show you when these tests should be run:

  • The customer explains the problem they want to solve.
  • The analyst (or developer) writes this down in an analysis.
  • The developer writes code that does what the analysis describes.
  • The developer tests his code (unit tests) to see if he followed the analysis correctly
  • If the unit tests fail, the developer goes back to developing. This loops indefinitely, until the unit tests all pass.
  • Now having a tested (confirmed and passed) code base, the developer builds the application.
  • The application is given to the customer.
  • The customer now tests if the application he is given actually solves the problem that he sought to solve (QA tests).

You may wonder what the point is of doing two separate tests when the customer and developer are one and the same. Since there is no "hand off" from developer to customer, the tests are run one after the other, but they are still separate steps.

  • Unit tests are a specialized tool that helps you verify whether your development stage is finished.
  • QA tests are done by using the application.

If you want to test whether your algorithm itself is correct, that is not part of the developer's job. That is the customer's concern, and the customer will test this by using the application.

As an entrepreneur and academic, you might be missing an important distinction here, which highlights the different responsibilities.

  • If the application doesn't adhere to what the customer had initially asked, then the subsequent changes to the code are usually done free of charge; since it's a developer error. The developer made a mistake and must pay the cost of rectifying it.
  • If the application does what the customer had initially asked, but the customer has now changed his mind (e.g. you've decided to use a different and better algorithm), the changes to the code base are charged to the customer, since it's not the developer's fault that the customer asked for something different than what they now want. It's the customer's responsibility (cost) to change their mind and therefore have the developers spend more effort to develop something that was not previously agreed to.
Flater
  • I would be happy to see more elaboration on the "If you came up with the algorithm yourself" situation, as I think this is the situation most likely to present problems. Especially in situations where no "if A then B, else C" examples are provided. (p.s. I am not the downvoter) – JustinLovinger Oct 08 '18 at 08:56
  • @PaintingInAir: But I can't really elaborate on this as it depends on your situation. If you decided to create this algorithm, you obviously did so to provide a particular feature. Who asked you to do so? How did they describe their request? Did they tell you what they needed to have happen in certain scenarios? (this information is what I refer to as "the analysis" in my answer) **Whatever explanation you received** (that led you to create the algorithm) can be used to test if the algorithm works as requested. In short, **anything but the code/self-created algorithm** can be used. – Flater Oct 08 '18 at 09:00
  • For context: as an entrepreneur and academic, most of the algorithms I design are not requested by anyone other than myself. The example given in the question is part of a derivative-free optimizer to maximize the quality of an ordering of tasks. In terms of how I described the need for the example function internally: "I need an objective function to maximize the importance of tasks that are completed on time". However, there still seems to be a large gap between this request and the implementation of unit tests. – JustinLovinger Oct 08 '18 at 09:12
  • 2
    @PaintingInAir: It's dangerous to tightly couple the customer, analyst and developer; as you're prone to skipping essential steps like **defining the problem outset**. I believe that's what you're doing here. You seem to want to test the _correctness_ of the algorithm, rather than whether it was implemented correctly. But that's not how you do it. Testing the implementation can be done using unit tests. Testing the algorithm itself is a matter of **using** your (tested) application and fact-checking its results - this actual test is out of scope of your codebase (as it **should be**). – Flater Oct 08 '18 at 09:15
  • @PaintingInAir: In short, a unit test does not tell you if the application does what the customer wants it to do. The unit test simply tells you if the developer implemented the algorithm the way they intended to implement it. If they e.g. misunderstood the customer but wrote good code, the unit tests will succeed even though the application doesn't actually do what the customer wants it to do. That's a very important distinction which I think you're not making. **Unit testing does not replace QA testing!** They each answer a different question. – Flater Oct 08 '18 at 09:24
  • @PaintingInAir As an illustration of Flater's point, suppose you devise an algorithm, but then implement it incorrectly - and the results are very good, because the algorithm you *accidentally implemented* is better than the one you *invented*. Think about how different types of testing would disentangle that situation. (The difference between the two algorithms might be small - for example, the exact way that you use the "old" or "new" values of quantities in an iterative algorithm.) – alephzero Oct 08 '18 at 09:33
  • @PaintingInAir: I updated my answer given your posted comments - including what I hope is a clear analogy. – Flater Oct 08 '18 at 09:43
  • 4
    This answer is already enormous. Highly recommend trying to find a way to reformulate the original content so you can just integrate it into the new answer if you don't want to throw it away. – jpmc26 Oct 08 '18 at 20:06
  • 7
    Also, I disagree with your premise. Tests can and absolutely should reveal when the code generates an incorrect output according to the specification. It's valid for tests to validate the outputs for some known test cases. Also, the cook should know better than to accept "hand soap" as a valid burger ingredient, and the employer has almost certainly educated the cook on what ingredients are available. – jpmc26 Oct 08 '18 at 20:10
  • @jpmc26: I think we're in agreement here. Tests should indeed confirm whether the code is built _according to the specification_; and test should not try to confirm whether the specification itself was correct to begin with. Also, the hand soap thing was just an extreme example - you're reading more into it than was intended (the fact that the cook had hand soap to put on the burger indirectly implies that he doesn't throw a `IngredientNotAvailableException` or similar). I could've said onion and then had to specify that the customer hates onion. Same principle, but requires more explanation. – Flater Oct 09 '18 at 06:10
  • "this is not a tasty burger" I see what you did there. thanks for sneaking that little gem in. – AnoE Oct 09 '18 at 15:43
  • @AnoE I genuinely don't know the reference you've spotted. – Flater Oct 09 '18 at 20:42
  • @Flater, oh... https://www.youtube.com/watch?v=QJ61MBX95tw or https://www.youtube.com/watch?v=dBP0Mbc7VFw for context. Pulp Fiction. – AnoE Oct 09 '18 at 21:13
  • "Similarly, it's not the developer's job to question the correctness of the algorithm." This is a terrible philosophy that leads to bad software. – Iron Gremlin Oct 10 '18 at 16:53
  • @IronGremlin If the customer defines the algorithm, it's not the developer's job to change the algorithm. That doesn't mean the developer is not allowed to suggest improvements, just that they cannot be _required_ to do so. You cannot hold the developer responsible for the quality of something they had no hand in creating. – Flater Oct 10 '18 at 18:36
  • @Flater - A good developer should know the context in which their work is taking place and be capable of suggesting improvements. This is absolutely a part of the job. Blaming the client when the project fails because the client didn't understand the project is asinine. Sometimes it's necessary, but it should be necessary because the client is obstinate, not because you're negligent. – Iron Gremlin Oct 10 '18 at 19:36
  • @IronGremlin Good luck arguing with your customers why you're not going to give them what they explicitly ask you for. – Flater Oct 10 '18 at 19:42
  • @IronGremlin The same person can wear more than one hat. Or not. – gnasher729 Oct 10 '18 at 20:48
  • @Abigail Keep in mind that I am not talking about any and all algorithms (which would effectively be all code), but an algorithm that is effectively _the business requirement given by the customer_. Developing an application according to the specifications of the customer does not entail second-guessing the customer's specifications - not unless the customer explicitly asks for it. This is the equivalent of requiring the developer of your restaurant's website to make sure the food on the menu is as tasty as possible. – Flater Oct 11 '18 at 06:37
9

Property Testing

Sometimes mathematical functions are better served by "Property Testing" than by traditional example-based unit testing. For example, imagine you're writing unit tests for something like an integer "multiply" function. While the function itself may seem very simple, if it's the only way to multiply, how do you test it thoroughly without re-using the logic of the function itself? You could use giant tables of expected inputs/outputs, but this is limited and error-prone.

In these cases, you can test known properties of the function, instead of looking for specific expected results. For multiplication, you may know that multiplying a negative number and a positive number should result in a negative number, and that multiplying two negative numbers should result in a positive number, etc. Using randomized values and then checking that these properties are preserved for all test values is a good way to test such functions. You generally need to test for more than one property, but you can often identify a finite set of properties that together validate the correct behavior of a function without necessarily knowing the expected result for every case.
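
A hand-rolled sketch of the idea (multiply here is a stand-in for whatever implementation is under test; property-based libraries such as QuickCheck, Hypothesis or fast-check automate the input generation and shrink failing cases, but the core is just this):

const assert = require('assert');

const multiply = (a, b) => a * b;  // stand-in for the implementation under test

for (let i = 0; i < 1000; i++) {
    const a = Math.floor(Math.random() * 201) - 100;  // random integer in [-100, 100]
    const b = Math.floor(Math.random() * 201) - 100;

    assert.ok(multiply(a, b) === multiply(b, a), 'must be commutative');
    assert.ok(multiply(a, 1) === a, '1 must be the identity');
    assert.ok(multiply(a, b + 1) === multiply(a, b) + a, 'must distribute over addition');
    if (a < 0 && b > 0) {
        assert.ok(multiply(a, b) < 0, 'negative times positive must be negative');
    }
}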

One of the best introductions to Property Testing that I've seen is this one in F#. Hopefully the syntax is not an obstacle to understanding the explanation of the technique.

  • 1
    I would suggest perhaps adding something a bit more specific in your example re multiplication, such as generating random quartets (a,b,c,d) and confirming that (a-b)(c-d) yields (ac-ad)-(bc-bd). A multiply operation could be pretty broken and still uphold the (negative times negative yields positive) rule, but the distributive rule predicts specific results. – supercat Oct 11 '18 at 21:47
4

It is tempting to write the code and then see if the result "looks right", but, as you rightly intuit, it's not a good idea.

When the algorithm is hard you can do a number of things to make the manual calculation of the result easier.

  1. Use Excel. Set up a spreadsheet that does some or all of the calculation for you. Keep it simple enough so that you can see the steps.

  2. Split your method up into smaller testable methods, each with their own tests. When you are sure the smaller parts work, use them to manually work through the next step.

  3. Use aggregate properties to sanity-check. For example, say you have a probability calculator; you might not know what the individual results should be, but you know they all have to add up to 100%.

  4. Brute force. Write a program that generates all possible results, and check that none are better than what your algorithm generates.
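
As a sketch of point 4 for this question's domain: suppose (hypothetically) there is also an optimizer bestTaskOrder(initialTime, tasks) that is supposed to maximize the simplified weightedTasksOnTime; for a handful of tasks you can enumerate every ordering and check that none beats the one the optimizer picked.

const assert = require('assert');

// Enumerate every ordering of a small array of tasks.
function permutations(items) {
    if (items.length <= 1) return [items];
    const result = [];
    items.forEach((item, i) => {
        const rest = items.slice(0, i).concat(items.slice(i + 1));
        for (const perm of permutations(rest)) {
            result.push([item, ...perm]);
        }
    });
    return result;
}

const tasks = [
    { time: 2, due: 4, importance: 3 },
    { time: 1, due: 2, importance: 1 },
    { time: 3, due: 9, importance: 2 }
];

// Brute force: the best score any ordering can achieve.
const bruteForceBest = Math.max(
    ...permutations(tasks).map(order => weightedTasksOnTime(1, order))
);

// The optimizer's choice must not be beaten by anything brute force found.
const chosen = weightedTasksOnTime(1, bestTaskOrder(1, tasks));
assert.ok(chosen >= bruteForceBest - 1e-9, 'optimizer missed a better ordering');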

Peter Mortensen
Ewan
  • For 3., do allow for some rounding errors here. It's possible that your total amounts to 100.000001% or similarly close-but-not-exact figures. – Flater Oct 08 '18 at 06:23
  • 2
    I'm not quite sure about 4. If you're able to generate the optimal outcome for all possible input combinations (which you then use for testing confirmation), then you're inherently already capable of calculating the optimal outcome and therefore don't need this _second_ piece of code that you're trying to test. At that point, you'd be better off using your existing optimal outcome generator as it's already proven to work. (and if it's not yet proven to work, then you can't rely on its outcome to fact-check your tests to begin with). – Flater Oct 08 '18 at 06:25
  • 6
    @flater usually you have other requirements as well as correctness that brute force doesn't meet, e.g. performance. – Ewan Oct 08 '18 at 10:37
  • @flater also, only allow for rounding errors if your requirements allow them. – Ewan Oct 08 '18 at 10:39
  • Performance is a subjective line you draw (regardless of whether it's brute force or not). But the comment I made logically defeats the intention of the code that needs to be tested. If you're able to brute force check for optimal solutions, that means you already know all optimal solutions, and therefore already possess code that is able to do the job that you're trying to test for now. It's putting the cart before the horse. You can't brute force in any case here, as the ability to do so precludes the need for actually testing the code. – Flater Oct 08 '18 at 10:49
  • 1
    @flater I'd hate to use your sort, shortest path, chess engine, etc. if you believe that. But I'd totally gamble in your rounding-error-allowed casino all day – Ewan Oct 08 '18 at 10:55
  • Your comment is imprecise in ways I cannot accurately explain here (and are completely off-topic). A chess engine uses forward pruning because it's currently impossible to brute force (it would require give or take 10^43 states to be evaluated which is **33 orders of magnitude more than contemporarily available RAM** if you assume that a single board state is 1 bit in size (which it obviously isn't - thus making it even worse than this ballpark estimate)); which means you're comparing **apples and oranges**. – Flater Oct 08 '18 at 11:00
  • Regardless, your brute force suggestion is defeated by its own intention. To do so, you need to know the optimal answer to all possible permutations in order to check them. If you already have those answers, then the code you intend to test is **irrelevant** as it's created for the sole purpose of finding the optimal answer, something you claim to already have anyway (since you're relying on these optimal answers to perform a brute force test). Note that OP's code intends to find the **optimal** answer, not the **most performant calculation**. You'd only be correct in the latter case. – Flater Oct 08 '18 at 11:03
  • 3
    @flater do you resign when you get to a king pawn end game? Just because the entire game can't be brute forced doesn't mean an individual position can't. Just because you brute force the correct shortest path in one network doesn't mean you know the shortest path in all networks – Ewan Oct 08 '18 at 11:09
  • Number 4 often also goes by the name of test oracle and is quite useful. You could e.g. also have a slower but proven version of using algorithm x (maybe recursive) in language A (e.g. Haskell) but need to implement it using algorithm y (non-recursive) in language B (e.g. C). – René Oct 10 '18 at 09:59
3

Other answers are good, so I'll try to hit on some points they've collectively missed so far.

I have written (and thoroughly tested) software to do image processing using Synthetic Aperture Radar (SAR). It's scientific/numerical in nature (there's a lot of geometry, physics, and math involved).

A couple of tips (for general scientific/numerical testing):

1) Use inverses. What's the fft of [1,2,3,4,5]? No idea. What's ifft(fft([1,2,3,4,5]))? Should be [1,2,3,4,5] (or close to it, floating point errors might come up). Same goes for the 2D case.

2) Use known asserts. If you write a determinant function, it might be hard to say what the determinant is of a random 100x100 matrix. But you do know that the determinant of the identity matrix is 1, even if it's 100x100. You also know that the function should return 0 on a non-invertible matrix (like a 100x100 full of all 0s).

3) Use rough asserts instead of exact asserts. I wrote some code for said SAR processing that would register two images by generating tie points that create a mapping between the images and then doing a warp between them to make them match. It could register at a sub-pixel level. A priori, it's hard to say anything about what the registration of two images might look like. How can you test it? Things like:

EXPECT_TRUE(register(img1, img2).size() <= min(img1.size(), img2.size()))

since you can only register on overlapping parts, the registered image must be smaller or equal to your smallest image, and also:

scale = 255
EXPECT_PIXEL_EQ_WITH_TOLERANCE(reg(img, img), img, .05*scale)

since an image registered to itself should be CLOSE to itself, but you might experience a bit more than floating point errors due to the algorithm at hand, so just check each pixel is within +/- 5% of the range the pixels can take on (0-255 is greyscale, common in image processing). Result should at least be the same size as input.

You can even just smoke test (i.e. call it and make sure it doesn't crash). In general, this technique is better for larger tests where the end result can't be (easily) calculated a priori to running the test.

4) Use OR STORE a random number seed for your RNG.

Runs do need to be reproducible. It is false, however, that the only way to get a reproducible run is to provide a specific seed to a random number generator. Sometimes randomness testing is valuable. I've seen/heard about bugs in scientific code that crop up in degenerate cases that were randomly generated (in complicated algorithms it can be hard to see what the degenerate case even is). Instead of always calling your function with the same seed, generate a random seed, and then use that seed, and log the seed's value. That way every run has a different random seed, but if you get a crash, you can re-run the result by using the seed you've logged to debug. I've actually used this in practice and it squashed a bug, so I figured I'd mention it. Admittedly this has only happened once, and I'm positive it's not always worth doing, so use this technique with prudence. Random with the same seed is always safe, though. Downside (as opposed to just using the same seed all the time): You have to log your test runs. Upside: Correctness and bug nuking.
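
A sketch of the logged-seed approach for the question's simplified function (makeRng is a stand-in seedable generator, since Math.random() itself cannot be seeded):

const assert = require('assert');

// Tiny seedable PRNG (a linear congruential generator); fine for generating test data.
function makeRng(seed) {
    let state = seed >>> 0;
    return function () {
        state = (state * 1664525 + 1013904223) >>> 0;
        return state / 4294967296;
    };
}

// Fresh seed every run, but logged, so a failing run can be replayed exactly.
const seed = Date.now() >>> 0;
console.log('weightedTasksOnTime random test seed:', seed);
const rng = makeRng(seed);

const tasks = Array.from({ length: 20 }, () => ({
    time: rng() * 10 + 0.1,
    due: rng() * 100 + 0.1,
    importance: rng() * 5 + 0.1
}));

// Rough assert: the result never drops below zero or exceeds the total importance.
const total = tasks.reduce((sum, t) => sum + t.importance, 0);
const result = weightedTasksOnTime(1, tasks);
assert.ok(result >= 0 && result <= total);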

Your particular case

1) Test that an empty taskArray returns 0 (known assert).

2) Generate random input such that task.time > 0, task.due > 0, and task.importance > 0 for all tasks, and assert the result is greater than 0 (rough assert, random input). You don't need to go crazy and generate random seeds, your algorithm just isn't complex enough to warrant it. There's about 0 chance it would pay off: just keep the test simple.

3) Test if task.importance == 0 for all tasks, then result is 0 (known assert)

4) Other answers touched on this, but it might be important for your particular case: If you're making an API to be consumed by users outside of your team, you need to test the degenerate cases. For instance, if workPerDay == 0, make sure you throw a lovely error that tells the user that's invalid input. If you're not making an API, and it's just for you and your team, you can probably skip this step, and just refuse to call it with the degenerate case.
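
A sketch of 1)-3) against the simplified two-argument version of the function (Node's assert assumed; the due dates in 2) are deliberately generated far enough out that at least the first task is always on time, so the result must be strictly positive):

const assert = require('assert');

// 1) Known assert: an empty task array accumulates nothing.
assert.strictEqual(weightedTasksOnTime(1, []), 0);

// 2) Rough assert on random input: all times, due dates and importances are positive,
//    and every due date is beyond the first task's finish time, so the result is > 0.
const randomTasks = Array.from({ length: 20 }, () => ({
    time: Math.random() * 10 + 0.1,
    due: Math.random() * 100 + 200,
    importance: Math.random() * 5 + 0.1
}));
assert.ok(weightedTasksOnTime(1, randomTasks) > 0);

// 3) Known assert: if every importance is 0, the result is 0 regardless of the schedule.
const worthlessTasks = randomTasks.map(t => ({ ...t, importance: 0 }));
assert.strictEqual(weightedTasksOnTime(1, worthlessTasks), 0);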

HTH.

2

TL;DR

Head to "comparative testing" section for advice that's not in other answers.


Beginnings

Start by testing the cases that should be rejected by the algorithm (zero or negative workPerDay, for example) and the cases that are trivial (e.g. empty tasks array).

After that, you want to test the simplest cases first. For the tasks input, we need to test different lengths; it should be sufficient to test 0, 1 and 2 elements (2 belongs to the category "many" for this test).

If you can find inputs that can be mentally calculated, that's a good start. A technique I sometimes use is to start from a desired result and work back (in the spec) to inputs that should produce that result.

Comparative testing

Sometimes the relation of the output to the input isn't obvious, but you have a predictable relationship between different outputs when one input is changed. If I've understood the example correctly, then starting later (increasing initialTime while leaving everything else unchanged) can never increase the importance completed on time, so we can create a test that calls the function twice - once with the original start time and once with a later one - and asserts the inequality between the two results.
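
For the question's simplified two-argument function, that comparative test might look like this (the task values and the later start time are arbitrary):

const assert = require('assert');

const tasks = [
    { time: 2, due: 6, importance: 3 },
    { time: 4, due: 11, importance: 2 },
    { time: 3, due: 12, importance: 5 }
];

// Comparative test: starting later can never get more importance done on time.
const started = weightedTasksOnTime(1, tasks);
const startedLater = weightedTasksOnTime(5, tasks);
assert.ok(startedLater <= started);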

Fallbacks

Sometimes I've had to resort to a long comment showing a hand-computed result in steps corresponding to the spec (such a comment is usually longer than the test case). The worst case is when you have to maintain compatibility with an earlier implementation in a different language or for a different environment. Sometimes you just have to label the test data with something like /* derived from v2.6 implementation on ARM system */. That's not very satisfying, but may be acceptable as a fidelity test when porting, or as a short-term crutch.

Reminders

The most important attribute of a test is its readability - if the inputs and outputs are opaque to the reader, then the test has very low value, but if the reader is helped to understand the relationships between them, then the test serves two purposes.

Don't forget to use an appropriate "approximately-equals" for inexact results (e.g. floating-point).

Avoid over-testing - only add a test if it covers something (such as a boundary value) that's not reached by other tests.

Toby Speight
2

There is nothing very special about this kind of hard-to-test function. The same applies to code that uses external interfaces (say, a REST API of a 3rd party application which is not under your control and certainly not something your test suite should be exercising; or a 3rd party library where you are unsure of the exact byte format of return values).

It is a quite valid approach to simply run your algorithm for some sane input, see what it does, make sure that the result is correct, and encapsulate the input and the result as a test case. You can do this for a few cases and thus get several samples. Try to make the input parameters as different as possible. In the case of an external API call, you would do a few calls against the real system, trace them with some tool, and then mock them into your unit tests to see how your program reacts - which is the same as just picking a few runs of your task planning code, verifying them by hand, and then hardcoding the result in your tests.

Then, obviously, bring in edge cases like (in your example) an empty list of tasks; things like that.

Your test suite will maybe not be as great as for a method where you can easily predict results; but still 100% better than no test suite (or just a smoke test).

If your problem, though, is that you find it hard to decide whether a result is correct, then that is an altogether different problem. For example, say you have a method which detects whether an arbitrarily large number is prime. You can hardly throw any random number at it and then just "look" if the result is correct (assuming you cannot decide the prime-ness in your head or on a piece of paper). In this case, there is indeed little you can do - you'd need to either get known results (i.e., some large primes), or implement the functionality with a different algorithm (maybe even a different team - NASA seems to be fond of that) and hope that if either implementation is buggy, at least the bug does not lead to the same wrong results.

If this is a regular case for you, then you have to have a good hard talk with your requirements engineers. If they cannot formulate your requirements in a way that makes it easy (or at all possible) for you to check them, then how will you know when you are finished?

AnoE
1

Incorporate assertion testing into your unit test suite for property-based testing of your algorithm. In addition to writing unit tests which check for specific output, write tests designed to fail by triggering assertion failures in the main code.

Many algorithms rely for their correctness proofs on maintaining certain properties throughout the stages of the algorithm. If you can sensibly check these properties by looking at the output of a function, unit testing alone is enough to test your properties. Otherwise, assertion-based testing lets you test that an implementation maintains a property every time the algorithm assumes it.

Assertion-based testing will expose algorithm flaws, coding bugs, and implementation failures due to issues such as numerical instability. Many languages have mechanisms to strip assertions at compile time or before the code is interpreted, so that when run in production mode the assertions do not incur a performance penalty. If your code passes unit tests but fails on a real-life case, you can turn the assertions back on as a debugging tool.
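
For example, the question's simplified function implicitly relies on task durations being positive, so that simulated time only moves forward; a sketch of asserting those properties inside the implementation (console.assert is just a stand-in for whatever assertion facility your build can disable in production):

function weightedTasksOnTime(initialTime, taskArray) {
    console.assert(initialTime > 0, 'initialTime must be positive');
    let simulatedTime = initialTime;
    let accumulator = 0;
    for (const task of taskArray) {
        // Property the algorithm relies on: durations are positive, so time moves forward.
        console.assert(task.time > 0, 'task.time must be positive', task);
        const previousTime = simulatedTime;
        simulatedTime += task.time;
        console.assert(simulatedTime > previousTime, 'simulated time must move forward');
        if (simulatedTime < task.due) {
            accumulator += task.importance;
        }
    }
    return accumulator;
}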

Tobias Hagge
1

Some of the other answers here are very good:

  • Test base, edge, and corner cases
  • Perform sanity checks
  • Perform comparative tests

... I'd add a few other tactics:

  • Decompose the problem.
  • Prove the algorithm outside of code.
  • Test that the [externally proven] algorithm is implemented as-designed.

Decomposition allows you to ensure the components of your algorithm do what you expect them to do. And a "good" decomposition lets you also ensure they're glued together properly. A great decomposition generalizes and simplifies the algorithm to the extent that you can predict the results (of the simplified, generic algorithm(s)) by hand well enough to write thorough tests.

If you can't decompose to that extent, prove the algorithm outside of code by whatever means is sufficient to satisfy you and your peers, stakeholders, and customers. And then, just decompose enough to prove your implementation matches the design.

svidgen
0

This might seem like somewhat of an idealistic answer but it helps to identify different kinds of testing.

If strict answers are important to the implementation, then examples and expected answers really should be provided in the requirements that describe the algorithm. These requirements should be group reviewed, and if you don't get the same results, the reason needs to be identified.

Even if you are playing the role of analyst as well as implementer you should actually create requirements and have them reviewed long before you are writing unit tests, so in this case you will know the expected results and can write your tests accordingly.

On the other hand, if the part you are implementing either is not part of the business logic or merely supports a business-logic answer, then it should be fine to run the test to see what the results are and then modify the test to expect those results. The end results are already checked against your requirements, so if they are correct then all the code feeding those end results must be numerically correct; at that point your unit tests exist more to detect edge-case failures and future refactoring regressions than to prove that a given algorithm produces correct results.

Bill K
0

I think it's perfectly acceptable on occasions to follow the process:

  • design a test case
  • use your software to get the answer
  • check the answer by hand
  • write a regression test so future versions of the software will continue to give this answer.

This is a reasonable approach in any situation where checking the correctness of an answer by hand is easier than computing the answer by hand from first principles.

I know people who write software for rendering printed pages, and have tests that check that exactly the right pixels are set on the printed page. The only sane way to do that is to write the code to render the page, check by eye that it looks good, and then capture the result as a regression test for future releases.

Just because you read in a book that a particular methodology encourages writing the test cases first, doesn't mean you always have to do it that way. Rules are there to be broken.

Michael Kay
0

Other answers already describe techniques for what a test looks like when the specific result cannot be determined outside the tested function.

What I do additionally, which I have not spotted in the other answers, is to auto-generate tests in some way:

  1. 'Random' inputs
  2. Iteration across ranges of data
  3. Construction of test cases from sets of boundaries
  4. All the above.

For example, if the function takes three parameters each with allowed input range [-1,1], test all combinations of each parameter, {-2,-1.01,-1,-0.99, -0.5,-0.01, 0,0.01,0.5,0.99,1,1.01,2, some more random in (-1,1)}
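
A sketch of that combination generator (f is a placeholder for the real three-parameter function under test; here it just rejects out-of-range input and returns something finite):

const assert = require('assert');

// Placeholder for the real function under test.
function f(a, b, c) {
    if ([a, b, c].some(v => v < -1 || v > 1)) {
        throw new RangeError('inputs must be in [-1, 1]');
    }
    return a * b + c;
}

// Boundary values straddling the allowed range [-1, 1], plus a few random interior points.
const values = [-2, -1.01, -1, -0.99, -0.5, -0.01, 0, 0.01, 0.5, 0.99, 1, 1.01, 2]
    .concat(Array.from({ length: 5 }, () => Math.random() * 2 - 1));

// Exercise every combination of the three parameters.
for (const a of values) {
    for (const b of values) {
        for (const c of values) {
            if ([a, b, c].every(v => v >= -1 && v <= 1)) {
                // In range: assert whatever property you can state about valid output.
                assert.ok(Number.isFinite(f(a, b, c)));
            } else {
                // Out of range: the function should reject the input.
                assert.throws(() => f(a, b, c));
            }
        }
    }
}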

In short: Sometimes poor quality can be subsidised by quantity.

Keith