14

I am a beginner with make and I'm wondering about when to use make clean.

One colleague told me that incremental builds with make are based on file timestamps. So, if you check out an old version of a file in your VCS, it'll have an "old" timestamp and it'll be marked as "no need to recompile this file". Then, that file wouldn't be included in the next build.
According to that same colleague, it would be a reason to use make clean.

Anyway, I roughly got the answer to the question "when to use make clean" from other StackExchange questions but my other question then is:

Why do incremental builds using make rely on file timestamps and not on SHA-1, for example? Git, for instance, shows that we can successfully determine if a file was modified using the SHA-1.
Is it for speed issues?

filaton
  • 7
    `make` was created in the 70's. SHA-1 was created in the 90's. Git was created in 00's. The last thing you want is for some obscure builds that were working for 30 years to suddenly fail because somebody decided to go all modern with a tried and tested system. – Ordous May 24 '16 at 16:01
  • 2
    Hashing the files all the time is slow. I think git also uses filesystem metadata to optimize its checks for changed files. – CodesInChaos May 24 '16 at 16:05
  • 4
    The original solution based on file dates is very simple, it does not need any additional files for storing the hash codes, and it worked remarkably well over several decades. Why should someone replace a well working solution by a more complicated one? Moreover, AFAIK most VCS systems assign checked out files the "checkout date", so changed files will correctly cause a recompile without "make clean". – Doc Brown May 24 '16 at 16:15
  • @Ordous: Amusing, but is it relevant here? Software doesn't rust out; it gives out because someone changed something in the surrounding environment. Unless they didn't, in which case it should still work. – Robert Harvey May 24 '16 at 16:58
  • 1
    @RobertHarvey Of course it is! Sure, if you don't update your `make` then your software won't break, however `make` makes rather an effort to have backwards compatibility in new versions. Changing core behavior for no good reason is pretty much the opposite of that. And the dates show why it was not originally made to use SHA-1, or why it was not easy to retrofit it when it became available (`make` was already decades old by then). – Ordous May 24 '16 at 17:06
  • I'm not sure about other VCSs, but with git, checking out an old file will not give you an old timestamp. – Vaughn Cato May 25 '16 at 01:00
  • @Ordous: Even if SHA1 didn't exist when `make` was created, I am quite sure that using a checksum to verify if a file has been changed was already a known solution. As @Doc Brown has pointed out, the solution based on timestamp is simpler, so I wouldn't be surprised if using checksums had been considered and discarded by the creators of `make`. – Giorgio Mar 07 '19 at 18:24

3 Answers

8

An obvious (and arguably superficial) problem would be that the build system would have to keep a record of the hashes of the files that were used for the last build. While this problem could certainly be solved, it would require side storage, whereas the time-stamp information is already present in the file-system.

More seriously, though, the hash would not convey the same semantics. If you know that file T was built from dependency D with hash H1 and then find out that D now hashes to H2, should you re-build T? Probably yes, but it could also be that H2 actually refers to an older version of the file. Time-stamps define an ordering while hashes are only comparable for equality.
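To make the difference concrete, here is a minimal Python sketch of the two staleness rules. It is only an illustration under stated assumptions, not how make or any particular hash-based tool is implemented: the function names and the `recorded` dictionary (standing in for the side storage of hashes from the last build) are hypothetical.

```python
import hashlib
import os


def stale_by_mtime(target, deps):
    """make-style rule: rebuild if the target is missing or any dependency is newer."""
    if not os.path.exists(target):
        return True
    target_mtime = os.stat(target).st_mtime_ns
    # Time-stamps are ordered, so "newer than the target" is a meaningful test.
    return any(os.stat(d).st_mtime_ns > target_mtime for d in deps)


def file_sha1(path):
    """Return the SHA-1 hex digest of a file's contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def stale_by_hash(target, deps, recorded):
    """Hash-style rule: rebuild if any dependency's digest differs from the recorded one.

    `recorded` (hypothetical) maps dependency path -> digest seen at the last
    build. Digests can only be compared for equality, not ordered.
    """
    if not os.path.exists(target):
        return True
    return any(recorded.get(d) != file_sha1(d) for d in deps)
```

With these definitions, a dependency whose contents changed but whose modification time is older than the target is reported as up to date by `stale_by_mtime` and as stale by `stale_by_hash`, which is exactly the scenario from the question.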

A feature that time-stamps support is that you can simply update the time-stamp (for example, using the POSIX command-line utility touch) in order to trick make into thinking that a dependency has changed or – more interestingly – a target is more recent than it actually is. While playing with this is a great opportunity to shoot yourself in the foot, it is useful from time to time. In a hash-based system, you would need support from the build-system itself to update its internal database of hashes used for the last build without actually building anything.
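As a rough illustration of that trick, the following Python sketch mimics what touch does by calling `os.utime`. The helper names `touch` and `is_newer` are made up for this example, and the comparison is only a simplified stand-in for the check a time-stamp based tool performs.

```python
import os


def touch(path, when=None):
    """Create the file if needed and set its access/modification time.

    `when` is a time in seconds since the epoch; None means "now".
    """
    with open(path, "a"):
        pass  # opening in append mode creates the file without truncating it
    os.utime(path, None if when is None else (when, when))


def is_newer(a, b):
    """True if file `a` has a later modification time than file `b`."""
    return os.stat(a).st_mtime_ns > os.stat(b).st_mtime_ns
```

Touching a target after its dependency changed makes `is_newer(dependency, target)` false again, so a time-stamp based tool would treat the target as up to date even though nothing was rebuilt.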

While an argument could certainly be made for using hashes over time-stamps, my point is that they are not a better solution to achieve the same goal but a different solution to achieve a different goal. Which of these goals is more desirable might be open to debate.

5gon12eder
  • 1
    While the semantics differ between hashes and time stamps, it's normally irrelevant in this case as you most likely want a build based on the current files, no matter their age. – axl May 25 '16 at 06:07
  • Most of what you say is correct. However a well-implemented build system that uses hashes like Google blaze/bazel (the internal version of blaze, the open source one is bazel) beats the pants off of a timestamped system like Make. That said, you do have to put a *lot* of effort into repeatable builds so that it is always safe to use old build artifacts rather than rebuilding. – btilly May 25 '16 at 20:28
  • The mapping here isn't many to one, it's one to one. If `D` now hashes to `H2`, and you don't have some output `T2` built from `D@H2`, you need to produce and store it. Thereafter, regardless of what order `D` switches between the `H1` and `H2` states in, you will be able to use cached output. – Asad Saeeduddin Jun 07 '17 at 19:28
  • 1
    `Bazel`, `meson`, `please` - they all absolutely suck usability-wise. Their DSLs are 8 to 20 times as verbose as make's. They're also absurdly opinionated and impose crazy restrictions on where things can be located. If you want to adopt any of those for an existing big project, you will probably have to refactor the entire project structure inside out. GNU Make imposes no restrictions. It allows recipes to read from anywhere and write anywhere your user has permissions to, even using absolute paths. It is perfect BUT for lack of hash support. – Szczepan Hołyszewski Oct 08 '20 at 01:43
  • If your recipe can read from anywhere without sand boxing then it becomes increasingly likely that a mistake in your dependency encoding will appear over time. Either you take a dependency that is not really a dependency, leading to unnecessary rebuilds, or you forget a dependency, leading to incorrect builds. It also makes it harder to produce builds that will work on any machine, since it is easy to accidentally take a system dependency. These are some common failure modes of Make on large projects. – sdgfsdh Oct 08 '20 at 06:45
  • The main failure mode of Bazel is the impossibility of getting things done at all, or without throwing 20× work hours at it compared to Make. Build system development and maintenance for large projects should not be a separate career path. It is something programmers should be able to do by themselves without the company having to buy a second brain for each of them. Instead of disabling entire dimensions of possibilities in order to logically preclude missed dependencies, build tools should focus on detecting and warning about those situations **only when they actually occur**. – Szczepan Hołyszewski Oct 09 '20 at 00:43
6

A few points about hashes vs timestamps in build-systems:

  1. When you check out a file, the timestamp should be updated to the current time, which triggers a rebuild. What your colleague describes is not usually a failure mode of timestamp systems.
  2. Timestamps are marginally faster than hashes. A timestamp system only has to check the timestamp, whereas a hash system must check the timestamp and then potentially the hash.
  3. Make is designed to be lightweight and self-contained. To overcome (2), hash based systems will usually run a background process for checking hashes (e.g. Facebook's Watchman). This is counter to the design goals (and history) of Make.
  4. Hashes prevent unnecessary rebuilds when a timestamp has changed but not the contents. Often, this offsets the cost of computing the hash.
  5. Hashes enable artefact caches to be shared across projects and over a network. Again, this more than offsets the cost of computing hashes.
  6. Modern hash-based build-systems include Bazel (Google) and Buck (Facebook).
  7. Most developers should consider using a hash-based system, since they do not have the same requirements as those under which Make was designed.
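
Points 2 and 4 can be sketched in a few lines of Python. This is only an assumed, simplified model of "check the timestamp, then potentially the hash", not the actual logic of Bazel, Buck or any other tool; the `record` dictionary and the function names are hypothetical.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-1 hex digest of a file's contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def has_changed(path, record):
    """Decide whether `path` changed since `record` was taken.

    `record` is None or a dict with keys "mtime_ns", "size" and "sha1"
    stored after the previous build. Returns (changed, new_record).
    """
    st = os.stat(path)
    if (record is not None
            and record["mtime_ns"] == st.st_mtime_ns
            and record["size"] == st.st_size):
        # Fast path: metadata is unchanged, so the file is not read at all.
        return False, record
    # Slow path: metadata differs, so hash the contents to see if they changed.
    digest = file_digest(path)
    new_record = {"mtime_ns": st.st_mtime_ns, "size": st.st_size, "sha1": digest}
    changed = record is None or record["sha1"] != digest
    return changed, new_record
```

A file that was merely touched takes the slow path once, is reported as unchanged, and gets its record refreshed, so no rebuild is triggered for it.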
sdgfsdh
  • Bazel and Buck suck. They are absurdly opinionated and impose crazy restrictions on where things can be located. You try porting your rules one by one and you quickly realize that NOTHING CAN FIND ANYTHING because there's some kind of sandboxing going on and in order to pierce this stupid firewall you must write SCREENFULS of extra declarations in the build specs. And people clench their teeth and deal with it, because hashes. THAT is an indication of how useful this capability is. – Szczepan Hołyszewski Oct 08 '20 at 01:49
  • 2
    The sand boxing is actually orthogonal to the hashing. However, both are features that lead to maintainable and predictable build systems in large projects. I personally don’t find the sand boxing restrictive, and typically projects follow this convention anyway. Buck and Bazel declarations are very terse, being written in Python. The sand boxing trade offs are a bit like type-checking. It makes a few things more difficult but gives many more guarantees. – sdgfsdh Oct 08 '20 at 06:38
  • The user should be IN CONTROL of trade offs. – Szczepan Hołyszewski Oct 08 '20 at 17:25
  • 2
    If you allow users to easily break the sandbox then you lose correctness guarantees across the build. This would preclude features that the Bazel team wanted to prioritize, such as correctness, distributed caching, distributed execution, composability of projects, etc. Sometimes more freedom can actually lead to _fewer_ features. See https://www.youtube.com/watch?v=GqmsQeSzMdw for a good talk on this concept. – sdgfsdh Oct 08 '20 at 17:53
  • If more features lead to complete, ultimate and unworkaroundable impossibility of certain things, then that isn't a good tradeoff. I can accept making things harder in order to provide more features. I cannot accept making things outright impossible. 90% of the features that Bazel boasts are precluded from being effectively usable by the limitations imposed in order to make some _other_ features theoretically possible. Writing build systems should be roughly an entire order of magnitude easier than writing the software. With Bazel, it is HARDER. "Constraints are freedom" reads like Orwell. – Szczepan Hołyszewski Oct 09 '20 at 00:16
  • 1
    @SzczepanHołyszewski Perhaps you should open an issue for the problem you are having with Bazel. It certainly does not make building software impossible, as evidenced by the various companies that are leveraging it successfully. "Constraints are freedom" is just a catchy title, don't read too much into it :) – sdgfsdh Oct 09 '20 at 10:51
1

Hashing an entire project is very slow. You have to read every single byte of every single file. Git doesn't hash every file every time you run a git status either. Nor do VCS checkouts normally set a file's modification time to the original authored time. A backup restore would, if you take care to do so. The whole reason filesystems have timestamps is for use cases like these.

A developer typically runs make clean when a dependency not directly tracked by the Makefile changes. Ironically, this usually includes the Makefile itself. It usually also includes compiler versions. Depending on how well your Makefile is written, it could include external library versions.

These are the sorts of things that tend to get updated when you do a version control update, so most developers just get in the habit of running a make clean at the same time, so you know you're starting from a clean slate. You can get away without doing it a lot of the time, but it's really difficult to predict the times you can't.
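
If you would rather detect those untracked changes than guess, one possible approach is to fingerprint them and compare against a stamp file. The Python sketch below is purely hypothetical: the function names, the stamp file and the idea of passing the compiler version in as a string are all assumptions for illustration, and it covers only the Makefile and the compiler version, not external libraries.

```python
import hashlib


def toolchain_fingerprint(makefile_path, compiler_version):
    """Digest of the Makefile contents plus a compiler version string."""
    h = hashlib.sha1()
    with open(makefile_path, "rb") as f:
        h.update(f.read())
    h.update(b"\0")  # separator between the two inputs
    h.update(compiler_version.encode("utf-8"))
    return h.hexdigest()


def needs_clean(makefile_path, compiler_version, stamp_path):
    """Return True if the fingerprint differs from the one in the stamp file.

    A missing stamp file counts as "unknown state", so it also returns True.
    The stamp file is rewritten with the current fingerprint on every call.
    """
    current = toolchain_fingerprint(makefile_path, compiler_version)
    try:
        with open(stamp_path) as f:
            previous = f.read().strip()
    except FileNotFoundError:
        previous = None
    with open(stamp_path, "w") as f:
        f.write(current)
    return previous != current
```

The version string would have to come from somewhere, for example the first line your compiler prints when asked for its version; how you obtain it is left open here.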

Karl Bielefeldt
  • You can use filesystems like ZFS where the cost of hashing is amortized over the time when the files are being modified, rather than being paid all at once when you build. – Asad Saeeduddin Jun 07 '17 at 19:30