51

I'm writing some test code for a feature that processes PDF files. The basic idea behind the tests is that I point them at some specially selected PDFs, they process them, and I check that the output is what I expect.

My question is: where should I be storing these large-ish PDFs? Should I check them into version control along with the code? Or put them somewhere else? Obviously, the test code is useless without the PDFs (or even with different PDFs) but still, putting them into our repository feels wrong.

Swiftheart
  • 651
  • 6
  • 7
  • 2
    possible duplicate of [Should unit tests be stored in the repository?](http://programmers.stackexchange.com/questions/128492/should-unit-tests-be-stored-in-the-repository) – user Oct 02 '14 at 20:30
  • 20
    @MichaelKjörling: `Tests != Test Data` – Robert Harvey Oct 02 '14 at 20:42
  • 6
    @RobertHarvey True, but if the test data is required for the test to work, I feel it should be considered a part of the test. That is also the approach taken by all three answers so far, as I understand them. – user Oct 04 '14 at 13:38

4 Answers

88

Your version control system should contain everything needed to build, compile, test, and package an application for distribution (e.g. MSI, RPM). I would also argue that build configurations and other scripts belong in version control.

I should be able to check out a project and have a complete compile, build, and test environment.

There are two approaches to checking in test data. First, you can check in the test data itself (PDFs in this case). Second, where applicable, you can check in source data from which the test data can be generated. This could be a SQL script that loads test data into a blank database, or perhaps a text-based file that can be compiled into a PDF or other file.
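
If you take the second approach, a minimal sketch of the idea (Python, assuming the third-party reportlab package is available; the file names and helper are made up for illustration) might regenerate a small PDF fixture from a committed text source like this:

```python
# Sketch only: regenerate a small PDF fixture from a committed text file,
# so the plain-text source lives in version control instead of the binary PDF.
# Assumes the third-party reportlab package; paths are illustrative.
from pathlib import Path
from reportlab.pdfgen import canvas


def make_fixture_pdf(source_txt: Path, target_pdf: Path) -> None:
    """Render each line of the committed text file onto a one-page PDF."""
    c = canvas.Canvas(str(target_pdf))
    y = 750                               # start near the top of a letter page
    for line in source_txt.read_text().splitlines():
        c.drawString(72, y, line)         # 72pt left margin, step down the page
        y -= 14
    c.showPage()
    c.save()


if __name__ == "__main__":
    make_fixture_pdf(Path("tests/fixtures/invoice.txt"),
                     Path("tests/fixtures/invoice.pdf"))
```

A generation step like this would typically run as part of the build or test setup, so only the small text source needs to be versioned.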

Others may disagree with checking everything into version control, but in my professional experience it has been critical to being able to rebuild a complete environment from scratch.

  • 21
    Yes. Absolutely yes. It's 2014, there is no justification whatsoever for using revision control that doesn't handle binary files seamlessly. – Kilian Foth Oct 02 '14 at 17:38
  • 100% agree. VCS is the system of record for your development effort. – Chris McCall Oct 02 '14 at 18:00
  • 4
    I agree, but you definitely want to avoid checking in junk items as well. For example, if the test data includes an "output" folder that contains all the PDF files generated by the tests, then you will want to exclude that from the repository. But I do agree the tests themselves should be part of the repo, as well as any packages needed to run them. – Kenneth Garza Oct 02 '14 at 19:37
  • 1
    @KennethGarza It isn't hard, really. As a rule of thumb, any original content (source code, test source code, test data, media, [real] documentation, third party libraries, build scripts, tooling scripts, conversion scripts, etc.) should be included, while all data that can be generated in reasonable time from the original data should not be. Besides, given those are the test outputs, they probably only make sense *after* running the tests yourself, otherwise you are not testing your program, you are testing the VCS software's ability to preserve the integrity of your files :) – Thomas Oct 02 '14 at 20:09
  • @KennethGarza to expand on Thomas' point, it should be fairly trivial to instruct your VCS to ignore generated files. For example, with SVN, I will check in e.g. an `output` folder with `svn:ignore` set to `*`. That way it is not possible to check in intermediate or generated files without purposefully jumping through hoops, and the project will not show up as dirty after building. –  Oct 02 '14 at 20:11
  • I agree, although I find that it may be wise to have a separate repository for tests if using a DVCS. Sometimes test suites can get pretty big, and having that all in the main repository can make things pretty obnoxious. – whatsisname Oct 02 '14 at 20:26
  • @whatsisname I put my test suite in the main repo with a DVCS (Git), and generally so should you. If your VCS can't handle that size of repo, get a better VCS. – Marnen Laibow-Koser May 18 '18 at 00:44
  • 2
    @MarnenLaibow-Koser: a project I worked on to detect electrical failure in implanted pacemaker leads had a test suite of over 40GB. There isn't a VCS in existence where dealing with that isn't obnoxious. Having two repos is an administration hassle of its own, but it sometimes can be the better choice. – whatsisname May 18 '18 at 02:40
  • @whatsisname: 40GB? Wow. I'll admit it never occurred to me that a test suite could be of that order of magnitude. OTOH, what benefit is achieved by splitting it into a separate repo? Don't you still have to deal with the 40GB whether it's one repo or two? – Marnen Laibow-Koser May 18 '18 at 04:55
  • @MarnenLaibow-Koser the benefit of splitting the repo is that a developer does not need to run integration or performance tests on their local machine, so they do not need to deal with those 40GB of data. I also strongly agree that it is "critical to being able to rebuild a complete environment from scratch." But binaries (libraries) should be in a different repo than the code, IMHO. – user482745 Nov 08 '18 at 15:25
  • @user482745 How do developers run integration tests, then, if not on their local machines? If you have the integration tests in a separate repo, do you at least have some sort of dependency management tool showing what version of the tests the code was developed against? If not...well, I wouldn’t want to use a medical device with that level of testing accountability, at any rate. (I agree that generated binaries and libraries should usually be in separate repos from source, but that’s an entirely separate issue.) – Marnen Laibow-Koser Nov 21 '18 at 21:17
  • 1
    @MarnenLaibow-Koser you got it. Integration tests are in a separate repo, and if a user wants to run them locally, dependency management will fetch the zip file and decompress it. Usually a Continuous Integration server/farm is tasked with running the integration tests and will prevent merging the feature branch until they pass. – user482745 Nov 22 '18 at 13:59
15

If the tests are useless without the setup files that you have prepared, then it makes sense to include the files in your VCS along with the test code.

While the files used in the test aren't code, you can view them as a dependency that the code relies upon. So there is merit in keeping everything together.


As a counterpoint, some VCSs don't handle large binary files well, and some people are strongly opposed to including any sort of binary file in a VCS at all. If either of those cases applies to you, then storing the test files in a well-known location that is easily accessed would also make sense.

I would also consider putting a comment in the test code that says "relies upon foo.pdf in order to run all tests."
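
Going a step further than a comment, that dependency could be made executable, so a missing fixture skips the tests with a clear message rather than failing obscurely. A minimal sketch using Python's unittest (the foo.pdf path and test names are illustrative):

```python
# Sketch only: make the dependency on foo.pdf explicit in the test code.
# If the fixture is missing, the tests are skipped with a clear message.
import unittest
from pathlib import Path

FIXTURE = Path(__file__).parent / "fixtures" / "foo.pdf"   # illustrative path


@unittest.skipUnless(FIXTURE.exists(), f"requires test fixture {FIXTURE}")
class PdfProcessingTests(unittest.TestCase):
    def test_processes_sample_pdf(self):
        data = FIXTURE.read_bytes()                 # the checked-in sample PDF
        self.assertTrue(data.startswith(b"%PDF"))   # stand-in for real assertions


if __name__ == "__main__":
    unittest.main()
```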

  • I don't see anything wrong with having the tests check for the test data, if not found then trying to get it (eg. from a URL) and failing if neither worked. Relying on the network is a bad idea *because* it makes tests slower and more fragile; but trying is less fragile than not, and automatically getting (and caching locally) the right data is quicker than manually reading docs/comments, getting it and putting it in place. – Warbo Oct 02 '14 at 23:26
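
A rough sketch of the fetch-and-cache approach described in that comment (Python standard library only; the URL and cache location are hypothetical): check for a local copy first, download and cache it if missing, and fail with a clear message if neither works.

```python
# Sketch only: prefer a locally cached fixture, otherwise try to download it
# once, and raise a clear error if neither is available.
import urllib.request
from pathlib import Path

FIXTURE_URL = "https://example.com/test-data/foo.pdf"        # hypothetical URL
CACHE_PATH = Path(__file__).parent / "cache" / "foo.pdf"     # local cache


def get_fixture() -> Path:
    if CACHE_PATH.exists():
        return CACHE_PATH                         # fast path: already cached
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    try:
        urllib.request.urlretrieve(FIXTURE_URL, str(CACHE_PATH))
    except OSError as exc:                        # network or filesystem failure
        raise RuntimeError(
            f"Test fixture {CACHE_PATH.name} is missing and could not be "
            f"downloaded from {FIXTURE_URL}") from exc
    return CACHE_PATH
```
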
7

If it's static data, then yes, put it in version control. Those files won't really change once they're checked in; they'll either get removed if that functionality is no longer needed, or new test files will be added alongside them. Either way, you don't need to worry about poor binary diffs taking up space.

If you're generating test data, e.g. randomly, then you should automatically save it when a test fails, but discard it otherwise. Any data saved this way should be turned into regular regression tests, so that those edge cases are definitely tested in the future rather than relying on the luck of the draw.
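
As an illustration of that workflow, here is a minimal sketch (Python; random_pdf_like_bytes and the process argument are made-up placeholders for the real generator and the code under test) that records the seed of any failing run so the exact case can be replayed and pinned as a regression test:

```python
# Sketch only: generate inputs from a known seed so any failure is reproducible
# and can later be committed as a fixed regression test.
import random


def random_pdf_like_bytes(rng: random.Random, size: int = 256) -> bytes:
    """Hypothetical fuzz-style input: a PDF header followed by random bytes."""
    return b"%PDF-1.4\n" + bytes(rng.randrange(256) for _ in range(size))


def check_random_input(process, seed=None):
    """Run `process` (the code under test) on one randomly generated input."""
    seed = seed if seed is not None else random.randrange(2**32)
    rng = random.Random(seed)
    data = random_pdf_like_bytes(rng)
    try:
        process(data)
    except Exception:
        # Record the seed so this exact case can be replayed, then pinned
        # as a regular regression test with a fixed seed.
        print(f"failing seed: {seed}")
        raise
```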

Warbo
  • 1,205
  • 7
  • 11
  • 2
    If you're generating test data randomly, then you should really go out and buy a book about writing reproducible automated tests. – Dawood ibn Kareem Oct 02 '14 at 22:30
  • 1
    @DavidWallace I dunno; QuickCheck's my current favourite test framework, and that has a load of free resources online (eg. Real World Haskell) – Warbo Oct 02 '14 at 22:38
  • 1
    OK, my point is - random test data leads to a world of hurt. Never do it. – Dawood ibn Kareem Oct 02 '14 at 22:41
  • 5
    @DavidWallace So you're saying entire fields like fuzz testing, property checking and statistical software testing are not only wrong, but harmful? – Warbo Oct 02 '14 at 23:17
  • 5
    @DavidWallace random != unreproducible. – congusbongus Oct 03 '14 at 02:19
  • @congusbongus well, if one wanted to be pedantic, one could argue that true randomness is not reproducible, but pseudorandomness is. I'm guessing you were talking about the latter though (and I agree with your point). – David Z Oct 03 '14 at 03:09
  • Actually, @congusbongus, "unreproducible" is EXACTLY what random means. – Dawood ibn Kareem Oct 03 '14 at 03:26
  • 1
    @DavidWallace https://en.wikipedia.org/wiki/Fuzz_testing#Reproduction_and_isolation – congusbongus Oct 03 '14 at 03:29
  • If your system is recording input data, and your test is recycling that data, then it's not random. Random means unpredictable, therefore unreproducible. You can't change the meaning of an English word by using a new testing technique. @congusbongus – Dawood ibn Kareem Oct 03 '14 at 03:38
  • 6
    @DavidWallace you may call it whatever you want then. Random test data, record inputs, recycle if necessary, ergo reproducible. Doesn't lead to a world of hurt. – congusbongus Oct 03 '14 at 04:00
  • 1
    Well, your experience may differ. But I've lost count of how many times I've seen bugs sneak through testing because a developer has just used a "random" approach to test data, instead of stopping to think about which test cases are actually needed. Moreover, in such cases, if on one lucky test run, the random data does happen to highlight a bug, it can be well-nigh impossible to identify and fix the bug, because you can't reproduce the data that found the bug; which makes the whole test a waste of time. So, in MY experience, tests that generate random data lead both to bugs in ... – Dawood ibn Kareem Oct 03 '14 at 04:05
  • ... production, and to situations where the developer knows that a bug exists, but can't "find it again". If your experience differs, then count yourself lucky. My advice to anyone reading this is to avoid this kind of test like the plague, because it WILL bite you. @congusbongus – Dawood ibn Kareem Oct 03 '14 at 04:06
  • 2
    @DavidWallace "instead of stopping to think about which test cases are actually needed" doesn't mean "no random testing", it means "not *only* random testing". As for "you can't reproduce the data that found the bug", did you actually read the answer you're commenting on? ;) – Warbo Oct 03 '14 at 07:05
  • 4
    @DavidWallace "unreproducible is EXACTLY what random means" - ludicrous. It's trivial to record and replay that data later, as Warbo points out. And also as he points out, random input testing makes an excellent *complement to* regular testing, not a replacement for it. Running specific, thought-out test cases is a great practice which isn't going anywhere, but being human, we will err and miss some. – Chris Hayes Oct 03 '14 at 07:11
  • 1
    @DavidWallace Here's some copypasta from a script I'm working on: QC generated "Pandoc" values, the 18th caused a failure, it "shrunk" that value 4 times (retesting with truncated lists, strings, etc. to simplify the counterexample) then reported it. This value can be copypasted into a regression test in VCS: `*Main Test.QuickCheck> quickCheck pNoUnwrapRemain *** Failed! Falsifiable (after 18 tests and 4 shrinks): Pandoc (Meta {unMeta = fromList []}) [CodeBlock ("\158M4\161l\FS\171\CAN",["VR\134x\131&\FSh-&\170f3?\STX","!T\237\vW\199>$9\247\ENQl"],[]) "\SYN \178qs\232Rb-X6UEv/c"]` – Warbo Oct 03 '14 at 07:22
  • 3
    @David If you really think that fuzz testing is useless you obviously don't work in any field that is security sensitive. Fuzz testing is one of the best ways to find serious bugs in critical software. It's used to test compilers, browsers, OSes and many other things. – Voo Oct 03 '14 at 21:20
  • 2
    @Dawood I use random test data extensively. I have never wound up in a "world of hurt" because of it. My rule of thumb, in fact, is to use random test data *as much as possible*. I feel dirty when I have to hard-code test data (unless I'm testing a particular value). – Marnen Laibow-Koser May 18 '18 at 00:47
0

Definitely include that data with your tests and your main application code. It helps to have a really well-organised test suite: if you're testing PDF extraction (and you have that code nicely encapsulated), then you should be able to construct a path to your test data based on the path to the app code. That has always worked for me.

With Git you can set up a .gitignore file to prevent any temporary output or test logs from polluting your repo.
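
As a sketch of that path construction (Python standard library only; the data directory name is illustrative), the test suite can locate its fixtures relative to the test module itself, so it works from any checkout:

```python
# Sketch only: resolve checked-in test data relative to the test module,
# so the suite runs the same way regardless of the working directory.
from pathlib import Path

TEST_DATA_DIR = Path(__file__).resolve().parent / "data"    # illustrative layout


def fixture(name: str) -> Path:
    """Return the absolute path of a checked-in test PDF, e.g. fixture('simple.pdf')."""
    path = TEST_DATA_DIR / name
    if not path.exists():
        raise FileNotFoundError(f"expected test fixture at {path}")
    return path
```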

NickJHoran
  • 101
  • 1