
I'm designing an application that will be appending blobs to a file on disk (local filesystem) and I'm currently thinking of how to deal with consistency issues that could occur if:

  • The application suddenly crashes
  • The whole system stops, e.g. due to a power outage

The goal is that, when the file is later read, the application processing the blobs should be able to tell whether a blob has been corrupted (and thus avoid processing it).

My current idea is to write the following on disk for each blob and flush after each one:

[Size of blob] (4 bytes) [CRC-32 checksum of blob] (4 bytes, there mainly to detect issues as the file ages over time) [actual blob bytes]
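To make the layout concrete, here is a rough Python sketch of the append path I have in mind (purely illustrative; the function name and the `flush`/`fsync` pair are just my reading of "flush after each one", not a finished implementation):

```python
import os
import struct
import zlib

def append_blob(f, blob: bytes) -> None:
    """Append one record as [size][CRC-32][payload], then flush it out."""
    header = struct.pack("<II", len(blob), zlib.crc32(blob))
    f.write(header + blob)
    f.flush()               # flush the application's buffer to the OS
    os.fsync(f.fileno())    # ask the OS to push its cache towards the device

# usage: with open("blobs.dat", "ab") as f: append_blob(f, b"...payload...")
```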

Here come the questions:

  • Does this guarantee that, should any of the above conditions occur, the file will contain either only valid data, or n valid blobs plus some extra bytes, where interpreting the first four extra bytes as a size clearly shows that there are not enough remaining bytes in the file for a proper blob (or there are fewer than 4 extra bytes, not even enough to hold the size)?
  • Could a power-loss corrupt bytes that have been previously written to disk inside the file?
  • Could a power-loss corrupt the file such that it would appear much bigger than it should be (and thus contain various trash at the end)?
  • Can various filesystems lead to strange behaviors in this regard? The application will be cross-platform and I'm trying to avoid writing platform-specific code for this.

Some other considerations:

  • Blobs will be relatively small (around a few kB, < 100 kB)
  • Losing a few blobs should a sudden stop occur is acceptable
  • When the application is restarted, it will create a new, empty file, not append to an already existing one
  • Only one thread of one process will be doing the appending
  • Reading the file before it is closed will not be allowed
  • Should a power-outage occur, a consistency check will be performed on the filesystem after rebooting.
D. Jurcau
  • This sounds like you're trying to write software to protect against hardware issues. While you should certainly be able to **detect** and cope with hardware failures (log the issues, provide the ability to failover to a redundant system, notify a system administrator, etc.) I would consider instead using protective systems and better IT policies. For example, UPS to protect against a power outage, RAID storage to protect against the failure of physical media, regular routine backup procedures to an off-site storage location, a redundant backup system for automatic failover, etc. – Ben Cottrell Apr 14 '17 at 11:03
  • Can't you try to use a logging system that will handle that for you? – Walfrat Apr 14 '17 at 12:39
  • I could use SQLite for example, which implements various features in this regard, but I think it's overkill for only storing blobs (as in event sourcing) and reading them in the order in which they were written. – D. Jurcau Apr 14 '17 at 12:48
  • Don't forget to account for the possibility that the underlying OS (or the underlying storage beneath that) may have write caching enabled. I would look to leverage an existing mature persistence layer rather than roll your own. All the various database engines will be able to store your data blobs using atomic transactions. – Thomas Carlisle Apr 14 '17 at 14:30

2 Answers


Unfortunately, your simple scheme does not protect you against disk failure or data corruption.

The weakness is the unprotected size field, which you need in order to read the file sequentially and find the next blob. If a write failure hits a size field, you can lose everything from the first bad size to the end of the file. This can happen in two variations (the reader sketch after them makes the failure mode concrete):

  • creation of a new blob: you write the data properly (say a blob of m bytes), and then eventually write several other blobs. Imagine that the OS writes the size to disk with an undetected corruption. When you later read the file, you'll find a wrong size n. There is a high probability that the CRC will flag the inconsistency, but it will do so on the n bytes that follow (even though the blob itself was fully correct). Worse, those n bytes will be discarded as a bad blob, and from then on your code will try to read the next blob at the wrong place (offset + n instead of offset + m).
  • file maintenance outside the application: for example, your blob file is copied from one server to a newer, more performant one. If a blob payload is corrupted during the transfer, only that blob is lost. However, if a size field gets corrupted during the copy, you lose all the subsequent blobs.
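To illustrate, here is a minimal Python sketch (the record format is the one from the question; the code itself is purely illustrative) of a reader that trusts the size field. Once a single size is corrupted, every subsequent record is read from the wrong offset:

```python
import struct
import zlib

def read_blobs(path):
    """Naive reader: trusts each size field to locate the next record."""
    blobs = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break                 # clean EOF, or trailing garbage shorter than a header
            size, crc = struct.unpack("<II", header)
            payload = f.read(size)
            if len(payload) < size:
                break                 # a corrupted size can run past the end of the file
            if zlib.crc32(payload) != crc:
                # The CRC flags the bad bytes, but the file offset is already wrong,
                # so every record after this point is lost as well.
                continue
            blobs.append(payload)
    return blobs
```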

In a similar way, errors can also affect data that has already been written. On an SSD, for example, a hardware error could cause a bit to flip; a hardware defect could also affect the disk cache memory (e.g. row-hammer-like effects); some filesystems (or even the hardware) try to rewrite an allocation unit if it appears to sit on a defective sector; etc. But these are issues that affect most data structures, not only yours. One way to reduce them is to read back the data you've written and cross-check its consistency.
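A sketch of that read-back cross-check (again Python and purely illustrative; it assumes the file was opened with open(path, "ab"), and note that the re-read may be served from the OS cache rather than the physical medium, so it catches software-level problems more reliably than media failures):

```python
import os
import struct
import zlib

def append_and_verify(f, blob: bytes) -> None:
    """Write one record, flush it, then read it back and compare before trusting it."""
    record = struct.pack("<II", len(blob), zlib.crc32(blob)) + blob
    f.write(record)
    f.flush()
    os.fsync(f.fileno())

    # Re-read the record through a second handle and cross-check it.
    offset = f.tell() - len(record)
    with open(f.name, "rb") as check:
        check.seek(offset)
        if check.read(len(record)) != record:
            raise IOError(f"read-back mismatch for record at offset {offset}")
```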

Christophe
  • if the blobs are small enough (< pipe_buf size) then the writes will be atomic and size corruption wouldn't happen, right? – d9ngle Jul 16 '20 at 12:28
  • By dump process, you mean the kernel? fsync should solve that, no? – d9ngle Jul 16 '20 at 16:19
  • @d9ngle Sorry, I thought I was on another question. Let me have a second thought in the right context ;-) – Christophe Jul 16 '20 at 16:30
  • @d9ngle Ok, I confirm that even for small blobs the problem persists. First, there is no atomic write for a blob, because blob means “binary large object”, and large is rarely compatible with atomic. You could hope for some level of atomicity if your data fully fits in a disk block (supposing you're using a block-mode device) and if you don't need to cross blocks. But even in a single block, a byte might be corrupted. This is why filesystems such as [zfs](https://en.m.wikipedia.org/wiki/ZFS) were created. – Christophe Jul 16 '20 at 16:49
  • @d9ngle The weakness is the size: if it is unprotected, it can be corrupted without you ever knowing, and you'll never access any subsequent block. If it is protected, it can still be corrupted, but at least you know that you've lost what follows. One way forward is to read back the data you've written and check that the size and the number of bytes read are OK before writing the next blob. Another way is to use fixed-size pages, with each page including the size. If you reach a corrupted blob, you can skip fixed-size blocks until you find a valid one (but this is a first thought: more analysis needed) – Christophe Jul 16 '20 at 16:56
  • I've been looking this up all day and it seems that in Linux, 4 kB appends are atomic, which fits my use case. If I get the size part correctly, you're talking about metadata... isn't that solved by journaling? My use case is inserting logs and I need to make sure my data is there; I can afford calling fsync as well, so it's a flat file, won't cross blocks (?), small size, ... I just need to make sure there are no failures on the software side (user or sys level) and I'm a bit surprised this is not a solved problem? – d9ngle Jul 16 '20 at 18:23
  • @d9ngle I was talking of the size field, which is part of the data written in the file. Cross blocks: imagine a block of 1024 bytes. Imagine a blob of 703 bytes. Imagine now writing 2 blobs in a row and you should get the idea. The file system has to expand the file, needs to find a free block, and only then write: this can't be atomic. – Christophe Jul 16 '20 at 18:31
  • still confused about this... any suggestion what approach to take for my use case? I just want to eliminate the chance of software-side corruption for a logging system. any link, pointer, etc much appreciated. – d9ngle Jul 16 '20 at 18:34
  • @d9ngle If you look at text logging, you just flush often. If something does get corrupted, the reader will still be able to read the log, except that some lines will look weird, and those are simply ignored. If you log blobs, I'd suggest doing likewise. The main problem here is corruption of the size, which can make you lose a lot of data. The approach with fixed page sizes allows you to do the same as with the weird lines; instead of looking for the next decent newline, you'd look for the next decent page. Another approach is to avoid some byte values in your blob and use a blob – Christophe Jul 16 '20 at 19:37
  • @d9ngle terminator at the end of the blob, instead of a size at the beginning. If the blob terminator gets corrupted, you only lose 1 or 2 blobs, not more. – Christophe Jul 16 '20 at 19:38

Hrm... could you simply write your blobs to a temp file first, then rename the file when it's ready? That would significantly limit your failure window on write, and you can easily see which files never made it through the rename.
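For illustration, a minimal Python sketch of that write-temp-then-rename pattern (the names are hypothetical; it relies on os.replace, which renames atomically within one filesystem, and on POSIX a fully durable rename also wants an fsync of the containing directory):

```python
import os

def publish_blob_file(data: bytes, final_path: str) -> None:
    """Write to a temporary file, flush it, then rename it into place."""
    tmp_path = final_path + ".tmp"       # hypothetical naming scheme
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())             # the contents should reach the disk before the rename
    os.replace(tmp_path, final_path)     # readers see either the old file or the complete
                                         # new one, never a half-written file
```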

unflores