50

I was asked how to run a suite of 65,000,000,000 (65 billion) tests, and I wonder whether it is normal for a project to have such a huge number of tests.

Have you worked on projects like this?

juanpavergara
  • 627
  • 5
  • 7
  • If you have worked on a project like this, what has been the technical approach? – juanpavergara Jul 10 '13 at 18:15
  • 32
    65 **Billion** (6.5 × 10^10) tests? Is this a practical problem or an interview question? –  Jul 10 '13 at 18:16
  • 3
    It is a practical problem, Michael. For real. I want to know whether this is as insane as I suspect. – juanpavergara Jul 10 '13 at 18:17
  • 4
    Dare I ask the percentage of code coverage this provides? – psr Jul 10 '13 at 18:17
  • 40
    I'd be very interested to know who wrote 65 billion tests and how many years it took them. – Rig Jul 10 '13 at 18:17
  • 2
    Is this a lot of permutations of input values? – psr Jul 10 '13 at 18:19
  • 10
    Is this "here are 1 million parameters to run on each of 1000 tests?" or "This is here is a test suite that has 5 terabytes of code (65 Billion * 80 bytes)?" –  Jul 10 '13 at 18:20
  • 2
    I didn't want to go into details but some of them are: it's a huge combination of variables which is the set of tests for a single functionality of the system. The functionality calculates the price of one of our products, which depends on a set of variables. The guys designing the calculation want to test a reduced set of the combinations. We think the tests themselves have to be auto-generated via some code generation. – juanpavergara Jul 10 '13 at 18:20
  • 46
    With 65 billion tests, if you can run 1000 tests/second, it will take about 2 years to run. 10,000 tests/second is a bit over two months. 100,000 tests/second will take about a week. This is describing some serious processing power to run the tests in a reasonable timeframe. –  Jul 10 '13 at 18:24
  • 4
    65 billion tests or 65 billion test cases? – whatsisname Jul 10 '13 at 18:30
  • 20
    I don't want to be the guy who writes the traceability matrix... – mouviciel Jul 10 '13 at 18:35
  • 3
    @whatsisname the same test executed 65 billion times with different combinations of parameters – juanpavergara Jul 10 '13 at 18:36
  • 1
    How did you arrive at 65 billion? How many things would end up really being a range, of which you'd only need to test critical points? – PixelArtDragon Jul 10 '13 at 18:42
  • 7
    Your question would be stronger if you provide a bit more about the specifics behind this scenario. At first glance, it's a bit unusual. But that's not to say it's impossible. –  Jul 10 '13 at 18:42
  • 5
    How do you test the code generator that creates the tests? – Dan Pichelman Jul 10 '13 at 18:47
  • 24
    @DanPichelman - Clearly, you need to write another half a billion tests to test that the test generator generates tests correctly. – Bobson Jul 10 '13 at 19:04
  • 5
    Let's see: the simplest case when this may happen is combinatorial explosion (36 parameters with 2 equivalence classes each). I'd say a function that has 36 parameters is seriously overweight and in need of decomposition. – Deer Hunter Jul 10 '13 at 19:19
  • I suppose they will be implemented in a random testing manner. Otherwise, quit your job before it is too late :) – py_script Jul 10 '13 at 20:16
  • @JuanPablo: do you have any information about how long one of those tests will run? 1 second? 1 millisecond? – Doc Brown Jul 10 '13 at 20:18
  • I'm not sure how much I agree with this presentation, but [Integration Tests are a Scam](http://www.infoq.com/presentations/integration-tests-scam) does provide a lot of very strong arguments against test suites that cannot be run quickly and frequently in their entirety. Per MichaelT's post, that probably isn't possible with 65 billion tests. – Brian Jul 10 '13 at 20:42
  • 4
    the real *wtf* of this task will be how to store the test results, especially the logs / stack traces of the failed runs... and then how to analyze them – woliveirajr Jul 11 '13 at 02:17
  • This sounds like a scalability question in disguise actually, not really about testing. –  Jul 11 '13 at 02:20
  • I'm guessing this isn't from an open source project. If it were... :) – joshin4colours Jul 11 '13 at 03:21
  • 1
    3 options, inversely ordered by size of budget: Run away.... Redesign..... Invade China/India/Russia (they have lots of people, many very clever, who could help run and collate the test results.) – mattnz Jul 11 '13 at 04:14
  • 2
    Normally I would say: "one at a time" but in this case I would definitely say: "In parallel". – Pieter B Jul 11 '13 at 06:52
  • I recently built a search function that performs a conditional search based on 1024 possible combinations of parameters. The solution was to add a conditional WHERE clause ONLY if the parameter was actually filled in. A similar approach might work to eliminate a few of the 65 billion cases. – Captain Kenpachi Jul 11 '13 at 07:56
  • Just wanted to ask: are they asking you how to run them because they were *never* run? If they were *never* run, what was their purpose until now? – Tallmaris Jul 11 '13 at 08:08
  • 14
    **Is this for a project of the US Federal Government?** – Paul Jul 11 '13 at 13:03
  • 1
    There is a phrase, 'testing the water', by which you dip a toe or a finger into the water to see if it is too hot or too cold before you fully submerge yourself. Testing in this case should take the same approach: don't throw yourself into the ocean, dip in a toe and write a much smaller set of tests which cover the important logical paths. If you test every permutation you may as well not write an application and instead hardcode every input and output –  Jul 11 '13 at 13:44
  • 1
    Assuming this is a real project, I would guess the 65B tests are to verify that a new implementation and an old implementation of the same business-level logic produce identical results, and that we're testing over a subset of possible inputs. I've done testing like this when we were forced to switch vendors or APIs. Granted, not 65B tests, but several million interesting cases plus a tester that generated random cases and ran 24/7. – Chuu Jul 11 '13 at 16:04
  • Rent one of google's server farms (all of it), and you might be able to finish it in a reasonable amount of time. – AJMansfield Jul 11 '13 at 16:51
  • @Chuu how did it go? Did you do it in parallel? – juanpavergara Jul 11 '13 at 22:49
  • 2
    Sounds like there's an error in the script that automatically generates the test scripts. – Joel B Jul 12 '13 at 02:11
  • 1
    if you look at the testing of Stockfish (an open source chess program), they have run about 14 million tests, and that took over 22 CPU years. If you don't have that kind of computing power, you should find a way to reduce the number of tests. – elyashiv Jul 12 '13 at 04:56

6 Answers

103

With 65 billion tests, it sounds like you're being asked to test all possible inputs. This is not useful--you'd essentially be testing that your processor functions correctly, not that your code is correct.

You should be testing equivalence classes instead. This will drastically reduce your range of test inputs.

Also consider whether you can subdivide your system into smaller pieces. Each piece will be easier to test in isolation, and then you can perform some integration tests which bring all the pieces together.

If you still want that reassurance that some of those input combinations work, perhaps you could try fuzz testing. You will get some of the benefits of testing lots of different inputs, but without running all 65 billion of them.
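
As a rough illustration, a property-based testing tool such as Hypothesis (Python) gives much of the benefit of fuzzing: it samples the input space and checks an invariant rather than enumerating every combination. This is only a sketch; `calculate_price`, its parameter ranges, and the invariant are hypothetical stand-ins for the real pricing logic.

```python
# Minimal property-based/fuzz-style sketch using the Hypothesis library.
# calculate_price and its bounds are hypothetical placeholders.
from hypothesis import given, settings, strategies as st

def calculate_price(base, quantity, discount_pct):
    # Placeholder implementation of the function under test.
    return base * quantity * (1 - discount_pct / 100.0)

@settings(max_examples=10_000)  # sample the input space instead of exhausting it
@given(
    base=st.floats(min_value=0.01, max_value=1e6),
    quantity=st.integers(min_value=1, max_value=10_000),
    discount_pct=st.integers(min_value=0, max_value=100),
)
def test_price_is_never_negative(base, quantity, discount_pct):
    assert calculate_price(base, quantity, discount_pct) >= 0
```

Raising `max_examples` buys more coverage at a linear cost in runtime, which is a far better lever than trying to enumerate the full input space.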

sourcenouveau
  • 6,460
  • 4
  • 29
  • 44
  • 12
    +1, especially for "you'd essentially be testing that your processor functions correctly" – Doc Brown Jul 11 '13 at 06:31
  • 4
    For simple enough functions (bit fiddling etc.) I *do* tend to test all possible values. It’s fool-proof and thus gives me much better confidence than testing (derived, and thus potentially erroneous) equivalence classes. Of course that doesn’t work any more when your possible inputs go into the billions. – Konrad Rudolph Jul 11 '13 at 07:14
39

If this is a real test suite, then you don't want to get anywhere near working on it.

The whole job of a tester is to strike a balance between testing thoroughly enough to be confident you've got the "right" results and writing few enough tests that they can be run in a reasonable amount of time.

Many tests can be abstracted into "equivalence classes", which means that rather than run 3 billion tests, you run 1 that gives you a reasonable level of confidence that all other tests in that equivalence class would run successfully, if you decided to waste the time running them.

You should tell whoever is thinking of running 65 billion tests that they need to do a better job abstracting tests into equivalence classes.
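
As a rough sketch of what that abstraction looks like in practice, the example below picks one representative value per equivalence class and crosses only the representatives, collapsing an astronomically large raw input space into a handful of cases. The parameter names, classes, and pricing rule are hypothetical.

```python
# Sketch: cross equivalence-class representatives (3 * 3 * 2 = 18 cases)
# instead of the full cross product of raw input values.
import itertools
import pytest

def calculate_price(quantity, customer, currency):
    # Placeholder for the real pricing logic under test.
    base = {"retail": 10.0, "wholesale": 8.0, "internal": 5.0}[customer]
    rate = {"USD": 1.0, "EUR": 0.9}[currency]
    return base * quantity * rate

QUANTITY_CLASSES = [1, 100, 10_000]                      # small, typical, bulk
CUSTOMER_CLASSES = ["retail", "wholesale", "internal"]
CURRENCY_CLASSES = ["USD", "EUR"]

CASES = list(itertools.product(QUANTITY_CLASSES, CUSTOMER_CLASSES, CURRENCY_CLASSES))

@pytest.mark.parametrize("quantity,customer,currency", CASES)
def test_price_for_each_equivalence_class(quantity, customer, currency):
    assert calculate_price(quantity, customer, currency) > 0
```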

dsw88
  • 1,859
  • 1
  • 13
  • 14
23

More than likely, you arrived at your figure of 65 billion tests by calculating all possible combinations of inputs into the system under test, or by computing the cyclomatic complexity and assuming a test must be written for each unique execution path.

This is not how real tests are written, because as other posters and commenters have indicated, the computing power required to execute 65 billion tests is staggering. This would be like writing a test that exercises a method to add two integers by plugging in every possible permutation of two 32-bit values and checking the result. It's utter insanity.

You have to draw the line and identify a subset of all possible test cases which, between them, ensure that the system will behave as expected throughout the range of inputs. For instance, you test adding a few "ordinary" numbers, you test a few negative-number scenarios, you test technical limits like overflow scenarios, and you test any scenarios that should result in an error. As was mentioned, these various types of tests exercise "equivalence classes"; they allow you to take a representative sample of the possible inputs, along with any known "outliers", and say with extremely high confidence that because these scenarios pass, all scenarios similar to them will pass.
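
As a rough illustration of that idea applied to the two-integer addition example, here is a minimal pytest sketch; `add32` and its overflow behavior are hypothetical stand-ins, not a real API.

```python
# Equivalence-class and boundary tests for "add two 32-bit integers",
# instead of 2^64 exhaustive combinations. add32 is a hypothetical
# function assumed to raise OverflowError outside the signed 32-bit range.
import pytest

INT32_MAX, INT32_MIN = 2**31 - 1, -2**31

def add32(a, b):
    # Placeholder implementation of the function under test.
    result = a + b
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError
    return result

@pytest.mark.parametrize("a,b,expected", [
    (2, 3, 5),                   # ordinary positive values
    (-4, -6, -10),               # ordinary negative values
    (-7, 7, 0),                  # mixed signs crossing zero
    (INT32_MAX, 0, INT32_MAX),   # upper boundary, no overflow
    (INT32_MIN, 0, INT32_MIN),   # lower boundary, no overflow
])
def test_representative_additions(a, b, expected):
    assert add32(a, b) == expected

def test_overflow_is_reported():
    with pytest.raises(OverflowError):
        add32(INT32_MAX, 1)
```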

Consider one of the basic code katas, the Roman Numeral Generator. The task, to be performed using TDD techniques in a "dojo" style, is to write a function that can accept any number from 1 to 3000 and produce the correct Roman numeral for that number value.

You don't solve this problem by writing 3000 unit tests, one at a time, and passing them in turn. That's lunacy; the exercise normally takes between one and two hours, and you'd be there for days testing each individual value. Instead, you get smart. You start with the simplest base case (`1 == "I"`), implement that using a "least-code" strategy (`return "I";`), and then look for how the code you have will behave incorrectly in another expected scenario (`2 == "II"`). Rinse and repeat; more than likely, you replaced your initial implementation with something that repeats the "I" character as often as necessary (like `return new String('I', number);`). That will obviously pass a test for III, so you don't bother; instead, you write the test for `4 == "IV"`, which you know the current implementation won't do correctly.
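
For concreteness, here is a minimal Python sketch of the test sequence that drives this kata; the greedy, table-based `to_roman` shown is one common end state (with the table truncated to the values exercised here), not the only correct implementation.

```python
# Incremental test sequence from the kata; each test was added only after
# the previous one forced a change in the implementation.
def to_roman(number):
    # Greedy table-based conversion, truncated for this sketch.
    numerals = [(10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    result = ""
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result

def test_one_is_I():
    assert to_roman(1) == "I"

def test_two_is_II():      # forced replacing `return "I"` with repetition
    assert to_roman(2) == "II"

def test_four_is_IV():     # forced the subtractive-pair handling
    assert to_roman(4) == "IV"
```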

Or, in a more analytical style, you examine each conditional decision that is made by the code (or needs to be), and write a test designed to enter the code for each possible outcome of each decision. If you have 5 if statements (each having a true and false branch), each of them fully independent of the other, you code 10 tests, not 32. Each test will be designed to assert two things about a particular possible decision; first that the correct decision is made, and then that the code entered given that condition is correct. You don't code a test for each possible permutation of independent decisions. If the decisions are dependent, then you have to test more of them in combination, but there are fewer such combinations because some decisions are only ever made when another decision had a particular outcome.
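
To make the counting concrete, here is a hedged sketch with three independent decisions: six branch-focused tests instead of 2^3 = 8 path combinations (the gap grows to 10 vs. 32 with the five decisions mentioned above). The shipping-fee rules are invented for illustration.

```python
# Branch-per-decision testing: one test per branch outcome of each
# independent decision, rather than every path combination.
def shipping_cost(weight_kg, express, insured, gift_wrap):
    cost = weight_kg * 2.0
    if express:      # decision 1
        cost += 10.0
    if insured:      # decision 2
        cost += 5.0
    if gift_wrap:    # decision 3
        cost += 3.0
    return cost

BASE = dict(express=False, insured=False, gift_wrap=False)

def test_express_true():
    assert shipping_cost(1.0, **{**BASE, "express": True}) == 12.0

def test_express_false():
    assert shipping_cost(1.0, **BASE) == 2.0

def test_insured_true():
    assert shipping_cost(1.0, **{**BASE, "insured": True}) == 7.0

def test_insured_false():
    assert shipping_cost(1.0, **BASE) == 2.0

def test_gift_wrap_true():
    assert shipping_cost(1.0, **{**BASE, "gift_wrap": True}) == 5.0

def test_gift_wrap_false():
    assert shipping_cost(1.0, **BASE) == 2.0
```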

KeithS
  • 21,994
  • 6
  • 52
  • 79
5

Is this "normal"?, no. Where "normal" is defined as the average or typical experience. Can't say I've ever had to work on a project like that, but I have been on a project where one in every few million bits would get flipped. Testing that one was ... a challenge.

Is it potentially required? Well, that depends on the guarantees and specifics of the project. It's a bit hard to believe at first, but your question is light on specifics.

As others (MichaelT) have pointed out, the time to complete this task with serial testing makes this impractical. So parallelization becomes your first consideration. How many test systems can you throw at this problem, and what support do you have for collating the results of those multiple systems?

What guarantees do you have that the device or algorithm you're testing is being reliably replicated? Software is pretty reliable in replication, but hardware devices (especially first generation) can have manufacturing issues. A false test failure in that case could indicate either a bad algorithm or the device didn't assemble correctly. Do you need to distinguish between those two cases?

You'll also need to consider how you're going to validate the testing systems themselves. Presuming a legitimate reason for that many test cases, you're going to need a lot of automation. That automation needs to be inspected to make sure it doesn't err in generating your test cases. Spot checks for errors would truly be the equivalent of finding a needle in the haystack.

This Ars Technica link may or may not shed some light on your testing considerations. GPU clusters are commonly used for brute-force password cracking. The one cited in the article can cycle through as many as 350 billion guesses per second, which puts your 65B tests in perspective. It's likely a different domain, but it shows how approaching the task from different angles may yield a viable solution.

3

I don't think it is feasible to maintain 6.5e+10 tests in the first place, so running them may be moot. Even the biggest projects, like Debian with all its packages, have only several hundred million SLOC in total.

But if you have to run a huge number of tests anyway, there are a few strategies.

  • Don't run them all. Most probably not every test depends on every code path. Define dependencies between subsystems and their tests, and between test suites, and you'll be able to run only the unit tests relevant to a particular change, only the integration tests that depend on those unit tests, etc.

  • Run them in parallel. With a code base that huge, you probably have a massive build farm (back at JetBrains, a relatively small operation, we used to have 40-50 build agents running on the IDEA continuous build / integration farm alone). Since unit tests are independent, and integration tests can reuse already-built code, tests are relatively easy to parallelize; a rough sketch of sharding cases across workers follows this list.

  • Stop running early. If you know that a particular test suite depends, for its reasonable functioning, on the correctness of another test suite, you can cut the whole chain once you see one link fail.
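
A rough sketch of the parallelization point, sharding auto-generated cases across worker processes and failing fast within each shard; the case generator and the pass/fail check are hypothetical placeholders.

```python
# Shard generated test cases across worker processes; each shard stops at
# its first failure and reports it back.
import itertools
from concurrent.futures import ProcessPoolExecutor

def generate_cases():
    # Placeholder: yields parameter combinations for the function under test.
    yield from itertools.product(range(10), range(10), range(10))

def run_case(case):
    # Placeholder check: returns True when the case passes.
    a, b, c = case
    return (a + b + c) >= 0

def run_shard(cases):
    for case in cases:
        if not run_case(case):
            return case          # fail fast within this shard
    return None

if __name__ == "__main__":
    cases = list(generate_cases())
    n_workers = 8
    shards = [cases[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        failures = [f for f in pool.map(run_shard, shards) if f is not None]
    print("first failure per shard:", failures)
```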

Disclaimer: I'm not a professional testing engineer. Take the above with a grain of salt.

9000
  • 24,162
  • 4
  • 51
  • 79
  • 5
    ... Of course, at JetBrains, those build agents are free because they develop TeamCity and own it outright. Other "relatively small operations" would likely have a heart attack at the thought of a roughly $15,000 initial cost (just for the software; add 40-50 blademount units and other hardware, and even using a free Linux distro to host it all you're easily talking a senior dev's annual salary) and $6500 annual maintenance fees, plus the time and skill of the IT staff needed to keep the build farm humming along. – KeithS Jul 10 '13 at 19:41
0

Although there have been several good suggestions here on how to try to sneak by with fewer tests, I seriously doubt your system has only 65 billion input combinations. That is less than 36 bits of input. Let's assume you have already taken all the advice given above.

If each test takes about a millisecond to run and you distribute the tests across only 10 processors (one normal PC), the tests will run in about 75 days. That is a while, but not completely unreasonable. Distribute them across 100 processors (a dozen normal PCs or one reasonable server) and the tests will complete in roughly a week. You could run them every week to check for regressions.
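
Those figures can be sanity-checked with a few lines of arithmetic; the assumption of roughly one millisecond per test comes from the paragraph above.

```python
# Back-of-the-envelope check: 65 billion tests at ~1 ms each, spread
# over N processors.
TESTS = 65e9
SECONDS_PER_TEST = 0.001

for processors in (10, 100):
    days = TESTS * SECONDS_PER_TEST / processors / 86_400
    print(f"{processors} processors: ~{days:.0f} days")
# 10 processors: ~75 days
# 100 processors: ~8 days
```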