7

In the context of a modern filesystem such as btrfs or ZFS, both of which checksum every piece of data written, is there any additional value in a file format storing internal checksums?

I also note the case where a file is transferred across a network. TCP does its own checksums, so again, is it necessary for the file itself to contain a checksum?

Finally, in the case of backups and archives, it is usual for archive files (tarballs etc.) to be stored with a sidecar file containing a hash. Where the archive file is intended as a distribution method, a cryptographically secure sidecar hash file is required.
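For concreteness, the sidecar pattern described above usually amounts to something like the following minimal Python sketch. The .sha256 suffix and the "HASH  NAME" layout mirror the common sha256sum convention, but nothing here is tied to any particular tool, and the file names are only illustrative:

    import hashlib
    from pathlib import Path

    def write_sidecar(archive: str) -> None:
        # Store the archive's SHA-256 next to it, in "HASH  NAME" form.
        # (Reads the whole file into memory; fine for a sketch.)
        digest = hashlib.sha256(Path(archive).read_bytes()).hexdigest()
        Path(archive + ".sha256").write_text(f"{digest}  {Path(archive).name}\n")

    def verify_sidecar(archive: str) -> bool:
        # Recompute the hash and compare it against the stored sidecar value.
        stored = Path(archive + ".sha256").read_text().split()[0]
        actual = hashlib.sha256(Path(archive).read_bytes()).hexdigest()
        return stored == actual

    # write_sidecar("backup.tar.gz")
    # assert verify_sidecar("backup.tar.gz")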

So when should a file format do its own checksums?

jl6
  • Your question presumes that the file system can prevent every possible form of data corruption, which is not the case. – Robert Harvey Nov 18 '16 at 23:18
  • @Robert Harvey: I don't think I'm assuming that. I accept there are cases where a filesystem's own checksums cannot detect corruption (e.g. where the checksum and the data are altered in tandem so as to still match). The question is whether an additional checksum can catch any of those cases. – jl6 Nov 19 '16 at 01:40
  • I presume that the internal checksums you refer to in your question have to do with checking data packets in the file, not the entire file. – Robert Harvey Nov 19 '16 at 03:48

3 Answers

9

The other thing you haven't considered is that files typically don't just exist on disk:

  • They are copied across networks in various ways and under various circumstances.
  • They are copied from one storage medium to another, or even within a medium.

Each time a file is copied, the bits could get corrupted ...

Now some of these representation or data movement schemes have (or can have) mechanisms to detect corruption. But this doesn't apply to all of them, and someone receiving a file cannot tell whether previous storage / movement schemes that touched the file do error detection. Also, you don't know how good the error detection is. For example, will it detect 2 bits flipped?

Therefore, if the file content warrants error detection, including error detection as part of the file format is a reasonable thing to do. (Indeed, if you don't then you ought to use some kind of external checksumming mechanism, independent of the file system's error detection, etcetera.)
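As an illustration of what "error detection as part of the file format" can look like, here is a minimal Python sketch of a hypothetical container that stores a CRC32 of its payload and refuses to parse if the stored value no longer matches. The magic bytes, layout, and field sizes are all invented for this example:

    import struct, zlib

    MAGIC = b"MYF1"  # hypothetical 4-byte format identifier

    def pack(payload: bytes) -> bytes:
        # Layout: magic | 4-byte big-endian length | payload | 4-byte CRC32 of the payload.
        return (MAGIC + struct.pack(">I", len(payload)) + payload
                + struct.pack(">I", zlib.crc32(payload)))

    def unpack(blob: bytes) -> bytes:
        # Parse the container and refuse to hand back a payload whose checksum fails.
        if blob[:4] != MAGIC:
            raise ValueError("not a MYF1 file")
        (length,) = struct.unpack(">I", blob[4:8])
        payload = blob[8:8 + length]
        (stored,) = struct.unpack(">I", blob[8 + length:12 + length])
        if zlib.crc32(payload) != stored:
            raise ValueError("payload failed its embedded checksum")
        return payload

    blob = pack(b"important records")
    assert unpack(blob) == b"important records"
    corrupted = blob[:10] + bytes([blob[10] ^ 0x01]) + blob[11:]  # flip one payload bit
    # unpack(corrupted)  # raises ValueError: payload failed its embedded checksum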

The other thing to note is that while disks, networks, network protocols, file systems, RAM and so on often implement some kind of error detection, they don't always do this. And when they do, they tend to use a mechanism that is optimized for speed rather than high integrity. High integrity tends to be computationally expensive.
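The speed-versus-integrity trade-off is easy to see by checking the same buffer with a 32-bit CRC and a 256-bit cryptographic hash. The exact timings depend entirely on the machine, but the shape of the result is the point:

    import hashlib, os, time, zlib

    payload = os.urandom(16 * 1024 * 1024)     # 16 MiB of random data

    t0 = time.perf_counter()
    crc = zlib.crc32(payload)                  # 32-bit checksum, built for speed
    t1 = time.perf_counter()
    sha = hashlib.sha256(payload).hexdigest()  # 256-bit cryptographic hash, built for integrity
    t2 = time.perf_counter()

    print(f"CRC32   {crc:#010x}            computed in {t1 - t0:.4f}s")
    print(f"SHA-256 {sha[:16]}...  computed in {t2 - t1:.4f}s")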

A file format where integrity matters cannot assume that something else is taking care of the problem.

(Then there is the issue that you may want / need to detect deliberate file tampering. For that you need something more than simple checksums or even (just) cryptohashes. You need something like digital signatures.)

TL;DR - checksums in file formats are not redundant.

Stephen C
2

Checksums improve data quality on a statistical basis, so it depends on how much assurance you need for your data. You can never reach 100%, since a checksum can (however unlikely) be altered in tandem with the data it is meant to protect. There is just one rule: the more protection your data needs, the more algorithmic overhead you must add. It is like a sigmoid function: moving to the right you increase the algorithmic effort, but you never reach 100% security at the top.
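As a toy model of that statistical picture (assuming corruption is random and independent of the check, which real failures need not be), the chance that a random change slips past an n-bit check is about 2^-n, and stacking independent checks multiplies the odds down without ever reaching zero:

    # Toy model: a random, independent corruption passes an n-bit check with probability ~2**-n.
    def undetected_probability(*check_bits: int) -> float:
        p = 1.0
        for n in check_bits:
            p *= 2.0 ** -n
        return p

    print(undetected_probability(32))       # one CRC32:               ~2.3e-10
    print(undetected_probability(32, 32))   # two independent CRC32s:  ~5.4e-20
    print(undetected_probability(256))      # SHA-256:                 ~8.6e-78 (tiny, never exactly 0)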

(N.B. I never know whether the right word is safety or security, but you can probably guess what I mean.)

  • So I think you are saying that layering on additional checksums can help detect errors that might have passed a "lesser" degree of checksumming, but this scheme cannot attain perfect protection and it is therefore up to the user what counts as good enough. – jl6 Nov 19 '16 at 01:44
  • Exactly. There is no way to attain 100% security. –  Nov 19 '16 at 08:33
1

Reworked answer following discussion in comments

Checksums in file formats

The checksum in a file format has a different purpose than checksums in the file system. It aims at verifying the integrity of the data at the application level. It can detect:

  • accidental corruption of content (e.g. accidental bit flips in file I/O operations, on the storage device, or during network transfer)
  • potential inconsistencies (e.g. the file was edited manually or modified without sufficient knowledge of its structure)
  • intentional corruption and fraud (e.g. banking formats provide for more complex checksums to make it harder for a fraudster to hack in manual changes; a check-digit sketch follows below).

Checksums don't guarantee the authenticity of data (for that there are digital signatures), but they reduce the risk of altered application data.
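One concrete, well-known instance of an application-level check in a banking format (offered here as an illustration, not necessarily the scheme this answer has in mind) is the IBAN's mod-97 check digits, which catch most single-character edits:

    def iban_checksum_ok(iban: str) -> bool:
        # ISO 13616 mod-97 check: move the first four characters to the end,
        # map letters to numbers (A=10 ... Z=35), and the result mod 97 must be 1.
        s = iban.replace(" ", "").upper()
        rearranged = s[4:] + s[:4]
        digits = "".join(str(int(ch, 36)) for ch in rearranged)
        return int(digits) % 97 == 1

    print(iban_checksum_ok("GB82 WEST 1234 5698 7654 32"))  # True  (commonly cited example IBAN)
    print(iban_checksum_ok("GB82 WEST 1234 5698 7654 33"))  # False (one altered digit is caught)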

Checksums in file systems

At very large scale (e.g. in a datacenter), accidental corruption is not a question of if it happens, but when it happens:

  • Hard disks in 2013 had a failure rate of about 1 bit for every 10^16 bits read/written; RAM similarly suffers an uncorrected failure roughly every 10^14 bits (a rough scale calculation follows this list).
  • Silent data corruption can also occur due to cosmic radiation affecting the chips, electromagnetic waves that interfere with signal transmission, and other external physical phenomena.
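To put those rates in perspective, here is a back-of-the-envelope calculation; the daily volume below is an assumed workload, only the error rate comes from the figures above:

    # Back-of-the-envelope scale check using the quoted disk error rate.
    bytes_read_per_day = 10e15              # assume a site moves ~10 PB/day through its disks
    disk_bit_error_rate = 1 / 1e16          # ~1 undetected bit error per 10^16 bits (rate above)

    expected_bad_bits_per_day = bytes_read_per_day * 8 * disk_bit_error_rate
    print(expected_bad_bits_per_day)        # ~8 corrupted bits per day: "when", not "if"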

This explains the rationale for checksums in filesystems:

  • protect data at storage level against accidental corruption, independently of the content format:

    As an example, ZFS creator Jeff Bonwick stated that the fast database at Greenplum, which is a database software company specializing in large-scale data warehousing and analytics, faces silent corruption every 15 minutes
    Wikipedia article (link above)

  • protect file system metadata against accidental corruption (or tampering attempts), because the loss of critical information, such as references to i-nodes or other structures, could have an even more dramatic effect than corruption of data in individual files (e.g. the instant loss of thousands of files)

    Some file systems, such as Btrfs, HAMMER, ReFS, and ZFS, use internal data and metadata checksumming to detect silent data corruption. In addition, if a corruption is detected and the file system uses integrated RAID mechanisms that provide data redundancy, such file systems can also reconstruct corrupted data in a transparent way.
    Wikipedia article (link above)

Multilayer protection

The physical protection in the hardware layer (ECC, CRC, RAID...), the filesystem or network protocol checksums in the system layers, and the content-embedded checksum in the application layer complement each other; each protects against different phenomena (e.g. a filesystem checksum does not protect against an intentional write).

Christophe
  • I'm aware of error correcting codes, and I can see that having additional redundancy gives you options for recovering from corruption. But can you give an example of corruption that a filesystem wouldn't detect but an additional file format checksum would? – jl6 Nov 19 '16 at 01:27
  • @jl6 if you get a bit flip in a 5MB photo, you'll hardly ever notice it. However, if you have a bit flip in the filesystem, causing you to lose track of an i-node that refers to a directory, you could lose thousands of files as a consequence. – Christophe Nov 19 '16 at 01:32
  • @Christophe: If you have a bitflip in a file system, it is most likely inside a file containing a 5MB photo, and you'll hardly ever notice it. – gnasher729 Nov 19 '16 at 10:51
  • @gnasher729 when referring to the filesystem, I was talking about its metadata. Losing filesystem metadata has a much higher impact than a bit flip in any other file. – Christophe Nov 19 '16 at 11:15
  • @gnasher729 I have a nice reference book on my shelf about forensic filesystem analysis: it could be possible to recover from such a situation after some time-consuming allocation hunting. But noticing it too late would mean irrecoverable damage. And anyway, in a mass-data business like a datacenter, it's simply less expensive to prevent such problems with a couple of extra checksums. – Christophe Nov 19 '16 at 11:23
  • @jl6 Ok, in view of these comments, I've edited my answer to better focus on your interest. Hope this helps – Christophe Nov 20 '16 at 14:19