I'm designing an application that will be appending blobs to a file on disk (local filesystem) and I'm currently thinking of how to deal with consistency issues that could occur if:
- The application suddenly crashes
- The whole system stops, e.g. due to a power outage
The goal is that, when the file is later read, the application processing the blobs should be able to tell whether a blob has been corrupted (and thus avoid processing it).
My current idea is to write the following on disk for each blob and flush after each one:
[Size of blob] (4 bytes)
[CRC-32 of blob] (4 bytes, mostly there to detect issues as the file ages over time)
[Actual blob bytes]
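As a concrete illustration, here is a minimal sketch of that writer in Python (the file name `blobs.dat`, the function name, and the little-endian layout are just assumptions for the example; `struct`, `zlib`, and `os` are standard-library modules):

```python
import os
import struct
import zlib

def append_blob(f, blob: bytes) -> None:
    """Append one size-prefixed, CRC-protected record and flush it to disk."""
    record = struct.pack("<I", len(blob))          # 4-byte size prefix
    record += struct.pack("<I", zlib.crc32(blob))  # 4-byte CRC-32 of the blob
    record += blob                                 # the payload itself
    f.write(record)
    f.flush()             # push the application buffer to the OS
    os.fsync(f.fileno())  # ask the OS to push its cache to the device

# Usage: open the file in binary append mode.
with open("blobs.dat", "ab") as f:
    append_blob(f, b"example payload")
```

Note that `f.flush()` alone only moves data into the OS cache; `os.fsync()` is what asks the OS to reach the device, though the drive's own write cache may still delay or reorder the actual writes.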
Here come the questions:
- Does this guarantee that, should any of the above conditions occur, the file will contain either only valid data, or *n* valid blobs plus some extra bytes, where interpreting the first four of those bytes as the size will clearly indicate that there are not enough remaining bytes in the file for a complete blob (or there are fewer than 4 extra bytes, not even enough to hold the size)? (A reader sketch illustrating this check follows this list.)
- Could a power loss corrupt bytes that were previously written to disk inside the file?
- Could a power loss corrupt the file such that it appears much bigger than it should be (and thus contains arbitrary garbage at the end)?
- Can various filesystems lead to strange behaviors in this regard? The application will be cross-platform and I'm trying to avoid writing platform-specific code for this.
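For completeness, here is a sketch of how the reading side could apply the check described in the first question, under the same assumed record layout as the writer above: it stops at the first record that is truncated or fails its CRC, treating everything from there on as the leftover bytes of an interrupted write.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # 4-byte size + 4-byte CRC-32

def read_blobs(path: str):
    """Yield valid blobs; stop at the first truncated or corrupted record."""
    # Reading the whole file at once keeps the sketch simple; blobs are
    # small, but a streaming reader would work the same way.
    with open(path, "rb") as f:
        data = f.read()
    offset = 0
    while True:
        # Fewer than 8 bytes left: a partial header from an interrupted write.
        if len(data) - offset < HEADER.size:
            break
        size, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        # Not enough bytes left for the announced payload: truncated tail.
        if len(data) - start < size:
            break
        blob = data[start:start + size]
        # CRC mismatch: the record is corrupted; discard it and everything after.
        if zlib.crc32(blob) != crc:
            break
        yield blob
        offset = start + size
```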
Some other considerations:
- Blobs will be relatively small (around a few kB, < 100 kB)
- Losing a few blobs should a sudden stop occur is acceptable
- When the application is restarted, it will create a new, empty file, not append to an already existing one
- Only one thread of one process will be doing the appending
- Reading the file before it is closed will not be allowed
- Should a power outage occur, a filesystem consistency check will be performed after rebooting.