
I've come across some apparently-conflicting patterns of behaviour. I'd like to understand why they each exist. I'll call them "conventions" for the sake of simplicity, though I'm not sure the term is a great fit for all of them.

Convention 1: we do not commit generated artefacts (especially binaries) to source control. I've been aware of this creed for almost as long as I've known about source control. Reasons for this convention include: (1) it's better (more reliable, safer from viruses) to use a build server than to deploy locally-built binaries into production; (2) there could be differences between the source code and the binary; (3) what whatsisname says in the comments.

Convention 2: in my organisation (I'm not sure if it's common practice), we commit the .js and .js.map files generated from .ts files to source control. I'm a late arrival at the TypeScript party, and this makes me feel like I brought the wrong drinks.
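
For contrast, the pattern I expected (Convention 1 applied to TypeScript) would be to commit only the .ts sources and let the compiler produce the rest. A minimal sketch, assuming the compiler writes its output to a dist directory:

```
# .gitignore (sketch) -- commit only the .ts sources
# compiler output (outDir in tsconfig.json)
dist/
# generated source maps
*.js.map
```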

My Googling has led to plenty of discussion about the "what", but nothing compelling about the "why".

What is the fundamental thinking behind each?


Side note: there is also Convention 3: we commit code (not binaries) generated by Entity Framework from a database schema to source control. I think this gets its definitive exemption from Convention 1 because it means you don't have to have the source schema available everywhere you need to build the binaries (e.g. on the build server). There are some convenience benefits too, but I think you could still argue for Convention 1 against most of those if you felt like it.

OutstandingBill
  • One reason for convention 1 is that it forces you to have the entire source in the repository in order to build the software from a fresh checkout. If you have binaries in there, it is easy to "whoops" and not notice for a significant amount of time that some piece is not getting built or that its sources are missing. – whatsisname May 21 '21 at 04:02
    **Worth noting:** when we include libraries using a library manager like NuGet, there can be large swaths of code that never get recompiled, because they're in library binaries. But we don't include those binaries in source control either. Rather, we rely on the library writer to build those binaries for us and merely *reference* them in source control. – Robert Harvey May 21 '21 at 20:32
  • I haven't encountered committing .js and source map files to source control in TS codebases - I'm not sure it's as common as you think? Shipping them in the published NPM artifact is quite common, though. This has the benefit of not forcing consuming applications to use TypeScript builds or similar TS configurations. – Daniel May 23 '21 at 01:39
  • @Daniel, you might be right, it may not be the de-facto standard. I'm new to it, so I hope nobody would take my word for that. – OutstandingBill May 25 '21 at 00:24

3 Answers


What you listed under #2 and #3 are examples of generated source code files, so what you are actually asking is: when and why should generated files be put into source control (and when not)?

To answer this in general, you have to take a closer look at the files and their process of creation:

Reasons to check them in:

  • The generation process isn't (yet) part of a full build; it is a manual step. Hence, after a checkout, one would not be able to run a build immediately without the generated parts, which effectively makes an automated build on a build server impossible. (A sketch of the contrasting setup, where generation is part of the build, follows this list.)

  • Devs want to have a diff-able history of the generated files, which might help to trace errors.

  • The generated files contain manually added parts (hopefully in a way the generator does not overwrite when run again), and those would be lost if the files were not kept in source control.

  • The generation process takes a lot of time, and the repo serves as a kind of cache.
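
To illustrate the first point by contrast: once the generation step is wired into the build itself, that reason disappears. A minimal sketch using npm scripts and ts-node, where tools/generate-models.ts is a made-up placeholder for whatever generator a project uses:

```json
{
  "scripts": {
    "generate": "ts-node tools/generate-models.ts",
    "build": "npm run generate && tsc"
  }
}
```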

Reasons against it:

  • The build process is expected to generate the missing files automatically, so it should not be necessary to keep them in source control. Hence not checking them in serves as a test of the build process itself: it either fails or succeeds after a clean checkout.

  • The top reason is exactly what @whatsisname wrote: there is a certain risk of forgetting to put required files into source control (especially the generation sources). The automated build, including the generation step, serves as a quality gate to make sure nobody forgets to add new files: it will fail if the generation is not successful, or if code which relies on those generated files does not compile.

  • There could be undetected inconsistencies between the currently checked-in files and the files one gets when running the generation again.

  • Those files make the repo larger than necessary.

  • The generated files contain a generation time stamp or other ephemeral local information, so every time a new build ran, one would have to check the generated files in again, even if their sources did not change (see the sketch after this list).
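
As an illustration of the last point, a generator that stamps its output like this (a made-up example) produces a diff on every run, even when its inputs are unchanged:

```typescript
// models.generated.ts -- hypothetical generator output
// Generated by example-codegen on 2021-05-21T04:02:13Z  <- changes on every run
export interface Customer {
  id: number;
  name: string;
}
```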

I recommend you check the examples of .js and .js.map files and of code generated by EF against the lists above; then you will probably find which reasons apply to your case.

You can apply the above lists of reasons to generated binaries as well; when you do, it becomes apparent why they are usually not put under source control.


Doc Brown
  • I suppose a counter-argument to "Devs want to have a diff-able history of the generated files" might be that most of the time the generated code works fine, and devs don't want to be burdened unnecessarily with reviewing the changes. Thank you for the analysis - it's just what I was looking for. – OutstandingBill May 25 '21 at 00:20

To understand the reasons why, we should first ask what is desirable.

If you want to work on something, you ideally pull, build, and debug. Anything more you need to do to get up and running costs time and knowledge, and is thus undesirable.

If you have source code and you can build your dependencies, by all means do so. If you have dependencies that cannot be created on the fly by the build itself, it is probably a good idea to include them in the repository. The type of file is irrelevant here; what matters is whether you need them to get your project running.

Typically you need to set up an environment first. This should be a well-documented procedure that you perform just once, so you do not put your development tools in source control. Then there may be some dependencies that are impractical to include, like a large data set. It would not be bad, though, to include a small one along with some configurations if that helps bring down the time and knowledge needed to get rolling. It ultimately depends on the environment: what people find hard, what is a problem and what is not. So although there may be some rules of thumb, what some may consider wrong in one environment may be a good thing in another.

Martin Maat

In my opinion: if a new developer is hired and they have just unpacked a brand-new computer for development, they should be handed a sheet of paper with instructions for downloading everything from your source code control system. The download might then contain further instructions about which tools they need and how to get them, and after installing those tools they should be able to press a "build" button (or something like that) and build the product.

Everything needed for that, plus documentation, plus things like samples and test cases that they would want to use, should be under source code control. Anything that is created automatically as part of the build process shouldn't be.

In addition, build artefacts like compiled libraries that are not usually changed and may be time-consuming to build may be kept under source code control, especially if the build process for those libraries is complicated.

gnasher729