So, unless you're doing a lot of file system operations (creating files, deleting old ones, moving them around) right before the power goes out, even on FAT you shouldn't regularly corrupt the whole file system. Since you're just appending to files, the end of the file being written might be missing after you repair the file system, but that should be about it.
So, I suspect something else is happening: what you did in the hope of enhancing write reliability actually wears down your storage media very rapidly. You touch upon this in your "longer or shorter blocks and frequencies": yes, I think buffering your writes and doing whole-block writes instead of 2 entries per write (I'm assuming an entry is not 2 kB, but more like 2 B) would help.
What happens inside a NAND flash device like a USB stick is this (simplified; this is for understanding the principle, not for reimplementation):
You want to write 4 bytes to address 0x10F004 (through 0x10F007) (really, just a random example). So, the controller inside the stick (a toy model in C of these steps follows the list)
1. figures out which logical block that address belongs to. Say your stick internally has a 4 kB block size (again, just a random example), so that means this data resides in logical block 271, at byte offsets 4 through 7.
2. It then looks at an internal table and figures out that logical block 271 is currently stored in physical block 5000.
3. It reads that block into its internal RAM,
4. applies the forward error correction that was stored with the data in the physical block to correct any errors (as far as possible),
5. changes the four bytes at offsets 4 through 7 of the decoded data in RAM,
6. looks up, in another table, a physical block that hasn't been written to in a long time. For the heck of it, let's say that's physical block 1234.
7. adds error-correction information to the data in RAM,
8. erases the whole physical block 1234 (setting all its bits to 1), so that data can be written to it,
9. writes that whole 4 kB of data plus error correction to physical block 1234,
10. changes the entry in the first table so that logical block 271 now points to physical block 1234 instead of 5000,
11. adds physical block 5000 to the end of the "blocks not written to in a while" table.
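Purely to illustrate that bookkeeping, here's a toy model of those steps in C. The block size, the table sizes, the ECC steps being comments and the table initialization being omitted are all simplifications of my own; this is not how any real controller is implemented:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE   4096u    /* the 4 kB from the example above               */
#define PHYS_BLOCKS  8192u    /* physical blocks actually present on the chip  */
#define LOG_BLOCKS   8000u    /* logical blocks advertised to the host         */

static uint8_t  nand[PHYS_BLOCKS][BLOCK_SIZE]; /* stand-in for the flash array    */
static uint32_t log_to_phys[LOG_BLOCKS];       /* step 2: logical -> physical map */
static uint32_t free_list[PHYS_BLOCKS];        /* steps 6/11: least recently written first */
static uint32_t free_head, free_tail;          /* (table setup at power-up omitted) */

/* Write a few bytes at an arbitrary byte address, the way the controller would.
 * Assumes the write does not cross a block boundary. */
void ftl_write(uint32_t addr, const uint8_t *data, uint32_t len)
{
    uint8_t  ram[BLOCK_SIZE];                  /* controller-internal RAM          */
    uint32_t lblk   = addr / BLOCK_SIZE;       /* step 1: which logical block...   */
    uint32_t offset = addr % BLOCK_SIZE;       /*         ...and where inside it   */
    uint32_t old    = log_to_phys[lblk];       /* step 2: current physical block   */

    memcpy(ram, nand[old], BLOCK_SIZE);        /* step 3: read the block into RAM  */
    /* step 4: ECC decoding would happen here                                      */
    memcpy(&ram[offset], data, len);           /* step 5: patch the few bytes      */

    uint32_t fresh = free_list[free_head];     /* step 6: pick a long-unused block */
    free_head = (free_head + 1) % PHYS_BLOCKS;

    /* step 7: ECC encoding would happen here                                      */
    memset(nand[fresh], 0xFF, BLOCK_SIZE);     /* step 8: erase = all bits to 1    */
    memcpy(nand[fresh], ram, BLOCK_SIZE);      /* step 9: program the new block    */

    log_to_phys[lblk] = fresh;                 /* step 10: update the mapping      */

    free_list[free_tail] = old;                /* step 11: recycle the old block   */
    free_tail = (free_tail + 1) % PHYS_BLOCKS; /* (skipped if the block looked worn) */
}
```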
The reason I'm walking you through this is that
- cost-effective NAND flash can only work on rather large blocks, not on individual bytes – that's due both to its addressing scheme and to the need for an error-correcting code that can fix the errors which are virtually guaranteed to appear¹.
- since writes degrade flash memory, you need to distribute writes "fairly" across the whole set of usable blocks. Thus, you don't write the modified data back to the same block as before (in flash you can't overwrite a 0 with a 1 anyway; you always have to erase the whole block to all 1s and then program the 0s) – you wear-level across as much of the chip as you can.
Now, in step 11 above I said that the former physical block is added back to the list of blocks available for writing – but that only happens if, during reading and error-correcting that block, nothing suspicious of wear showed up. If something did, block 5000 is not added back to the table from which the controller picks new write targets. That shrinks the pool of writable blocks – and the stick only keeps working as long as the number of used blocks plus the blocks in the write-to table is at least the nominal drive capacity divided by the block size. (For example, a stick that advertises 2,000,000 blocks but physically has 2,100,000 usable ones can retire roughly 100,000 worn blocks before it runs out of spares.) Once that's no longer the case, the controller has no place left to write new data to: your stick is broken and becomes read-only.
So, what to do?
- In any case, if deleting a file is part of the regular operation of your software, then make sure that the file system layer also tells the USB stick to `TRIM`/`DISCARD` that logical block (that allows these blocks to go through step 11, so you get new write-to blocks, and wear leveling works on all unused space)
- Turn off "modification time" for your filesystem – if you need to update the field in the FAT that says "this file was last written to at…" every time you write something, that instantly doubles your block write rate.
- As you hinted at, accumulating data before writing it sounds like a good idea.
- I don't know your FAT implementation, but maybe it does have write buffers?
- But even in that case, your `f_sync` would suppress that capability
- It would however require you to implement enough power stabilization (large capacitors? Supercaps? A backup battery?) and brownout detection to let you sync your buffered data to the stick once the power starts going out – see the sketch right after this sub-list.
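For instance, here's a minimal sketch of such a buffered writer, assuming a FatFs-style API (`f_write`, `f_sync`). The buffer size and the `log_brownout_flush()` hook called from your brownout handler are assumptions you'd adapt to your hardware:

```c
#include <stdint.h>
#include <string.h>
#include "ff.h"                  /* FatFs: FIL, FRESULT, UINT, f_write, f_sync */

#define LOG_BUF_SIZE 4096u       /* assumption: match the stick's internal block size */

static FIL     log_file;         /* opened elsewhere with f_open()                    */
static uint8_t log_buf[LOG_BUF_SIZE];
static size_t  log_fill;

/* Append a small record to the RAM buffer; the stick is only touched
 * once a whole block's worth of data has accumulated. */
FRESULT log_append(const void *data, size_t len)
{
    const uint8_t *p = data;
    FRESULT res = FR_OK;

    while (len > 0) {
        size_t chunk = LOG_BUF_SIZE - log_fill;
        if (chunk > len) chunk = len;
        memcpy(&log_buf[log_fill], p, chunk);
        log_fill += chunk;
        p        += chunk;
        len      -= chunk;

        if (log_fill == LOG_BUF_SIZE) {            /* full block: one big write   */
            UINT written;
            res = f_write(&log_file, log_buf, LOG_BUF_SIZE, &written);
            if (res == FR_OK) res = f_sync(&log_file);
            log_fill = 0;
            if (res != FR_OK) break;
        }
    }
    return res;
}

/* Call from the brownout/undervoltage handler, while the capacitors
 * still hold enough charge to finish one last write. */
FRESULT log_brownout_flush(void)
{
    FRESULT res = FR_OK;
    if (log_fill > 0) {
        UINT written;
        res = f_write(&log_file, log_buf, (UINT)log_fill, &written);
        if (res == FR_OK) res = f_sync(&log_file);
        log_fill = 0;
    }
    return res;
}
```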
- Avoid using 4 files to store 4 streams. That forces FAT to write to four different blocks! Instead, just interleave into one file.
- If you can: do a simple scheme that has a fixed number of bytes for each stream (i.e., instead of writing 2 `signed char`s to each of 4 files, you write a single `struct _foo { char a[2]; char b[2]; char c[2]; char d[2]; } foo;` to one file), as sketched right below.
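To illustrate, a minimal sketch of that single interleaved write, again assuming a FatFs-style `f_write` (the struct mirrors the example above and is otherwise arbitrary):

```c
#include "ff.h"                  /* FatFs: FIL, FRESULT, UINT, f_write */

/* One fixed-size record carrying one sample from each of the four streams. */
struct _foo {
    char a[2];
    char b[2];
    char c[2];
    char d[2];
};

/* One write call to one file, instead of four files that each force
 * their own block (and directory entry) updates. */
FRESULT write_record(FIL *log_file, const struct _foo *record)
{
    UINT written;
    return f_write(log_file, record, sizeof *record, &written);
}
```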
- Or, if there isn't always the same amount of data from each channel, a simple key-value pair (i.e., `enum _channel { a, b, c, d }; typedef enum _channel channel; struct _key_value { channel chan; char value[]; };`) works if the length of each channel's data is constant (though not necessarily the same across channels),
- or, if you can have variable-length data chunks, a type-length-value tuple (i.e., `struct _key_length_value { channel chan; unsigned char length; char value[]; };`), so that both the type of data and the amount of data can be recovered by the reader – see the sketch below.
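To make that concrete, here's a minimal sketch of writing and parsing such type-length-value records into/from a plain byte buffer. The function names and the buffer-based interface are my own assumptions; in your logger you'd feed the serialized bytes into whatever (preferably buffered) write path you use:

```c
#include <stddef.h>
#include <string.h>

enum _channel { a, b, c, d };
typedef enum _channel channel;

/* Serialize one type-length-value record into out[]; returns the number
 * of bytes written, or 0 if it doesn't fit. */
size_t tlv_write(unsigned char *out, size_t out_size,
                 channel chan, const unsigned char *value, unsigned char length)
{
    size_t needed = 2u + length;       /* 1 byte type + 1 byte length + payload */
    if (out_size < needed)
        return 0;
    out[0] = (unsigned char)chan;
    out[1] = length;
    memcpy(&out[2], value, length);
    return needed;
}

/* Parse one record starting at in[]; returns bytes consumed, or 0 if the
 * buffer is truncated (e.g., the tail of a file cut off by a power loss). */
size_t tlv_read(const unsigned char *in, size_t in_size,
                channel *chan, const unsigned char **value, unsigned char *length)
{
    if (in_size < 2u || in_size < 2u + (size_t)in[1])
        return 0;
    *chan   = (channel)in[0];
    *length = in[1];
    *value  = &in[2];
    return 2u + (size_t)in[1];
}
```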
- FAT sounds like the wrong file system altogether. What you describe as your outage scenario (sudden loss of power) actively screams "use a journal!", so a journaling file system, or an equivalent data structure written to the raw block device instead of going through a file system (why a file system at all if you only have a known number of data streams, optimally one?), would be worth investigating.
- Of course, using a different file system than FAT or even no file system at all will put software requirements on the reading end of your ecosystem. In embedded, that's often no problem (as you control which software needs to be available on the reader), but it can be a hassle or a showstopper (especially in consumer electronics).
- Also, fun fact, I know of not a single implementation of a major journaling file system (ext3/4, XFS, JFS, F2FS …) for microcontrollers, so err, this is advice easier given than implemented.
- But if you implement this as a data structure on your own (no matter whether on top of an existing file system or on a raw device), it's not so hard:
- say you append to your file in 1 kB blocks; you leave the first byte of each block as 0x00, and only after a complete block of data has been successfully written do you update that first byte to 0xFF. Only after that has been successfully written do you start writing the next block.
- On the reading end, you round down your file size to a whole multiple of 1 kB (everything beyond that must be broken), and if your last block doesn't start with 0xFF, it hasn't been written completely, so you ignore it as well.
- instead of a single 0x00 / 0xFF byte, you can also use timestamps (as long as these can't be all-0) so that each block contains info of when it was written.
- and to make things a bit more robust against random corruption, add a hash of the block's data right after the "canary" token (that 0x00/0xFF byte, or whatever you choose) at the beginning of each block. XXH32 from the xxHash suite is very fast to compute, only takes about 48 bytes of RAM to calculate, and uses just 32 bits = 4 B of your storage. That's a low price for being able to tell that your data is intact, with a probability of being wrong that's practically nonexistent. A minimal sketch of the whole block scheme follows.
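As a rough sketch of that block scheme, again on top of a FatFs-style API. The 1 kB block size and the field layout follow the description above; the checksum here is a trivial FNV-1a stand-in where you'd plug in XXH32:

```c
#include <stdint.h>
#include <string.h>
#include "ff.h"        /* FatFs: FIL, FRESULT, UINT, FSIZE_t, f_write, f_sync, f_lseek, f_tell */

#define JBLK_SIZE 1024u               /* assumption: 1 kB journal blocks             */

/* Block layout: [0]    canary (0x00 = in flight, 0xFF = committed),
 *               [1..4] 32-bit checksum of the payload,
 *               [5..]  payload. */

/* Stand-in checksum: replace with XXH32() from xxHash for real use. */
static uint32_t blk_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0x811C9DC5u;                   /* FNV-1a, placeholder only */
    for (size_t i = 0; i < len; i++)
        sum = (sum ^ data[i]) * 0x01000193u;
    return sum;
}

/* Append one journal block and mark it committed only once it is fully on the stick. */
FRESULT jblk_append(FIL *fp, const uint8_t payload[JBLK_SIZE - 5])
{
    uint8_t  block[JBLK_SIZE];
    UINT     written;
    FRESULT  res;
    FSIZE_t  block_start = f_tell(fp);
    uint32_t sum = blk_checksum(payload, JBLK_SIZE - 5);

    block[0] = 0x00;                              /* canary: "not yet committed"      */
    memcpy(&block[1], &sum, 4);
    memcpy(&block[5], payload, JBLK_SIZE - 5);

    res = f_write(fp, block, JBLK_SIZE, &written);
    if (res == FR_OK) res = f_sync(fp);           /* whole block is on the medium     */
    if (res != FR_OK) return res;

    block[0] = 0xFF;                              /* flip the canary to "committed"   */
    res = f_lseek(fp, block_start);
    if (res == FR_OK) res = f_write(fp, block, 1, &written);
    if (res == FR_OK) res = f_sync(fp);
    if (res == FR_OK) res = f_lseek(fp, block_start + JBLK_SIZE);
    return res;
}

/* Reader side: a block is valid only if the canary is 0xFF and the checksum matches. */
int jblk_is_valid(const uint8_t block[JBLK_SIZE])
{
    uint32_t stored;
    memcpy(&stored, &block[1], 4);
    return block[0] == 0xFF &&
           stored == blk_checksum(&block[5], JBLK_SIZE - 5);
}
```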
¹ Zero-error memory would be astronomically expensive, slow and power-hungry. Having more cheap memory, some of which you reserve for error-correction information, is much more affordable. In the USB stick market, you can be pretty sure that there are even significant parts of the flash memory that are simply never entered into the stick-internal table of usable blocks. Having a few bad blocks on a large wafer and hiding them from use, and thus selling all of that wafer, is much more cost-effective than throwing out every flash memory chip that contains a single non-working block.