23

For a project, I am looking to deliberately provoke data rot on a storage unit (e.g., a hard drive or a flash drive). I understand that most modern OSes and filesystems as well as hardware have countermeasures in place to prevent data corruption. How could I set up a simple system such that one could see visible corruption on, e.g., image files within a week?

I am trying to provoke actual physical data rot.

Peter Mortensen
  • 1,676
  • 3
  • 17
  • 23
ddkk
  • 359
  • 2
  • 7
  • Are you somewhere (a university with a decent physics dept) where you might be able to get your hands on a radiation source? Even my high school had some, but that was a while ago, lol! –  Feb 02 '21 at 14:16
  • 1
    @BrianDrummond thanks, I actually might get my hands on a range of smaller radiation sources. I was wondering about how strong such a source would need to be for data corruption to occur, as I would prefer not to be bound by security regulations – ddkk Feb 02 '21 at 15:05
  • Not sure, proximity to the source will help. Alpha probably won't make it through the case, gamma may not release enough charge, so without having tried it, I'd expect beta sources to be the best bet. But Neil's EPROM suggestion bypasses the issue ... I'd forgotten about EPROM! –  Feb 02 '21 at 15:19
  • 6
    Most filesystems actually don't do much to prevent silent data corruption. The notable exceptions are ZFS (BSDs and Linux) and btrfs (Linux). The drives themselves and other hardware do, though. Anyway, I do wonder what your project is about? Testing the error correction of some particular part of the system, or are you just doing it out of interest? – ilkkachu Feb 02 '21 at 19:12
  • 3
    If I want to simulate data rot, I would use [dd](https://en.wikipedia.org/wiki/Dd_(Unix)) =/ (well, I'd make a little dd-like script, but you get the idea) – hanshenrik Feb 02 '21 at 19:23
  • I think an older hard disk and a strong magnet is worth a try. – bb1950328 Feb 02 '21 at 20:31
  • 5
    @ddkk Make sure you won't suffer from fatal DNA bit rot before the DUT does if you're contemplating using radioactive sources. Or a visit from the (anti-)terrorist police. – Andrew Morton Feb 02 '21 at 21:13
  • 3
    Does it need to be bit rot on the device or are you ok with errors in transmission? You could intercept a SATA signal with something like an FPGA and do a bit flip there (or even just a very long cable). I suspect that a CRC error would just happen on the cable and a retransmit would occur. – Eric Johnson Feb 03 '21 at 02:25
  • 1
    Can't you use a modern PCIe 4 SSD that can read/write at 7 GB/s / 6.5 GB/s? They have a life of around 70 hours at maximum (write) speed. – Gizmo Feb 03 '21 at 07:50
  • To emulate it, you must use kernel functions to act on the filesystem. On a Unix-like system, since you will probably try to affect some particular files, learn their physical addresses with the help of the filefrag utility, which uses the FIEMAP ioctl. Use it with the -e option; the output is relative to the device the filesystem resides on, and the numbers are multiples of the block size. Then directly manipulate the device (/dev/sdXY) with an educated use of the dd utility (a minimal sketch of this approach appears after these comments). – Ayhan Feb 03 '21 at 16:17
  • 3
    If applicable, check out the Linux Device Mapper, which has modules for fault injection, including `flakey` and `error`. – chrylis -cautiouslyoptimistic- Feb 04 '21 at 01:27
  • @ilkkachu I didn't know that, thanks! It is out of pure interest. I have a physics background, maybe that explains the specificity of my request. – ddkk Feb 04 '21 at 09:18
  • 1
    @hanshenrik: Also things like [Simulate a faulty block device with read errors?](https://stackoverflow.com/q/1870696) - Linux device mapper, or other Linux fault-injection stuff for testing without actually messing up the underlying files. – Peter Cordes Feb 04 '21 at 21:19
  • Just write and erase the flash memory on an MCU until it fails. 10k writes go by fast if you do it in a loop. The same thing works for an SSD, but there's a good chance it will detect the corruption and refuse to write more data to a corrupted drive. – Navin Feb 25 '21 at 14:34
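
For reference, a minimal sketch of the filefrag + dd idea from the comments above might look like this in Python. The device path, block numbers, and block size are placeholders (take the real extents from `filefrag -e yourfile`), the filesystem should be unmounted or at least synced so the page cache doesn't hide the change, and this is of course destructive and needs root:

```python
import os
import random

DEVICE = "/dev/sdX1"             # hypothetical partition holding the target file
BLOCK_SIZE = 4096                # filesystem block size reported by filefrag
TARGET_BLOCKS = [81920, 81921]   # hypothetical physical block numbers from filefrag -e

fd = os.open(DEVICE, os.O_RDWR)
try:
    for blk in TARGET_BLOCKS:
        offset = blk * BLOCK_SIZE
        os.lseek(fd, offset, os.SEEK_SET)
        block = bytearray(os.read(fd, BLOCK_SIZE))
        # Flip one randomly chosen bit in the block and write it back in place.
        i = random.randrange(len(block))
        block[i] ^= 1 << random.randrange(8)
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, bytes(block))
    os.fsync(fd)
finally:
    os.close(fd)
```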

7 Answers

36

If you're happy to use a data storage unit this small, then a UV-EPROM would be very easy to rot. They are still available; online auction sites have them for not too much money. Even a 27C400, 4 Mbit, is only a few pounds.

Of course they don't store 'files', just raw data, so you'd need some form of external memory controller. Perhaps use an Arduino to address it and represent its data as a file. FWIW, some CircuitPython modules can be read as USB memory, so one could read the data from the EPROM and write a file representing it to the module's local storage area.
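
As a very rough illustration of that idea, a CircuitPython-flavoured sketch could look like the following. The pin assignments are hypothetical placeholders, a real 27C400 needs far more address lines plus /CE and /OE handling, and the CIRCUITPY filesystem has to be remounted writeable from boot.py (storage.remount) before code can write files to it; treat this as a sketch of the approach, not a working dumper.

```python
import board
import digitalio

# Hypothetical wiring: 8 address lines and an 8-bit data bus only, for brevity.
ADDR_PINS = [board.D2, board.D3, board.D4, board.D5,
             board.D6, board.D7, board.D8, board.D9]
DATA_PINS = [board.A0, board.A1, board.A2, board.A3,
             board.A4, board.A5, board.D10, board.D11]

def make_lines(pins, direction):
    lines = []
    for p in pins:
        d = digitalio.DigitalInOut(p)
        d.direction = direction
        lines.append(d)
    return lines

addr = make_lines(ADDR_PINS, digitalio.Direction.OUTPUT)
data = make_lines(DATA_PINS, digitalio.Direction.INPUT)

def read_byte(address):
    # Drive the address bus, then sample the data bus (EPROM /CE and /OE tied low).
    for bit, line in enumerate(addr):
        line.value = bool((address >> bit) & 1)
    value = 0
    for bit, line in enumerate(data):
        if line.value:
            value |= 1 << bit
    return value

# Dump the first 256 bytes so the host PC can diff successive snapshots.
dump = bytearray()
for a in range(256):
    dump.append(read_byte(a))
with open("/eprom_dump.bin", "wb") as f:
    f.write(dump)
```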

When I first came across UV-EPROMS early in my career, more decades ago than I care to remember, I was concerned about the recommended write and erase times, so programmed one, then verified it repeatedly, with one minute in the eraser between verifies. It took a few minutes for the first bit to drop out, then most went between 5 and 10 minutes, needing more than 15 minutes to erase the last bit. Quite a range.

Neil_UK
  • 158,152
  • 3
  • 173
  • 387
  • 33
    You can also model transient errors with a camera flash, as I found while demonstrating a prototype system with no labels over the EPROM windows, when somebody decided to take pictures! –  Feb 02 '21 at 15:15
  • 2
    Little known fact, this is why you see 'no flash photography' signs in your finer art museums. – Marc Bernier Feb 02 '21 at 19:56
  • 6
    @BrianDrummond like what happened to the [Raspberry Pi 2](https://www.raspberrypi.org/blog/xenon-death-flash-a-free-physics-lesson/), which restarted when photographed – Ciprian Tomoiagă Feb 03 '21 at 09:23
  • 1
    @CiprianTomoiagă very similar, yes! (The photo flash was exciting the junction of a power MOSFET, IIRC, just like it would allow isolated charges in memory cells to discharge) – Marcus Müller Feb 03 '21 at 10:33
  • 1
    In the case of the EPROM, every storage cell becomes a photodiode, and the MCU rebooted thanks to the watchdog. –  Feb 03 '21 at 11:28
  • 6
    @MarcBernier what, because of the sensitive electronic spy bugs hidden inside the art? – user253751 Feb 03 '21 at 22:49
  • 1
    @MarcBernier That's a joke right? But I am not fully sure. :) – curious_cat Feb 04 '21 at 15:51
  • 1
    @curious_cat, of course. Pretty sure EPROMs weren't around when the Mona Lisa was painted. At least not good ones. – Marc Bernier Feb 04 '21 at 17:35
20

Neil_UK's answer made me remember a similar method with more commonly available components. If you're building a system which directly interfaces with a storage device, you might as well just use SRAM, where you fill it with data and then gradually reduce its supply voltage below its recommended minimum. The advantage over EPROM (besides availability of the components) is that it's a lot faster, so you can do a lot more experiments quite quickly to figure out the voltage which produces the most useful cases for you.

I once used this with a microcontroller equipped with internal RAM, which had to detect a power loss and then save some data to non-volatile storage before the capacitors on the board went low enough to cause corruption in the RAM. I wasn't explicitly studying the "memory rot" itself (I was only interested in whether it happened or not), but I did notice that if I took too much time, the corruption seemed to be quite random.

vsz
  • 2,554
  • 1
  • 17
  • 32
12

Read the file, flip or set bits randomly according to the profile of error distribution and probability you want, and write the corrupted file back.
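
A minimal sketch of that in Python (the file name and bit error rate are arbitrary examples; work on a copy of the image, not the original):

```python
import random

PATH = "photo_corrupted.jpg"   # a copy of the image you want to "rot"
BIT_ERROR_RATE = 1e-5          # roughly one flipped bit per ~12 kB

with open(PATH, "rb") as f:
    data = bytearray(f.read())

# Walk every bit and flip it with the chosen probability.
flips = 0
for i in range(len(data)):
    for bit in range(8):
        if random.random() < BIT_ERROR_RATE:
            data[i] ^= 1 << bit
            flips += 1

with open(PATH, "wb") as f:
    f.write(data)

print(f"flipped {flips} of {len(data) * 8} bits")
```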

Marcus Müller
  • 88,280
  • 5
  • 131
  • 237
Justme
  • 127,425
  • 3
  • 97
  • 261
  • thanks for the suggestion! Due to the nature of the project, it should be physical data rot. – ddkk Feb 02 '21 at 11:04
  • 7
    I don't think you will be able to provoke errors on standard computer storage within a week. All modern storage media use wear leveling to spread the writes to the least-used parts of the storage area, and even if the data read out has errors, if the error correction can fix the data then it will automatically be rewritten to a safer location on the storage area. Only once the memory has really worn out will it fail, and then it won't give you garbage data back; it will say that the data is unreadable. I don't think getting files back with bits flipped is possible. – Justme Feb 02 '21 at 11:24
  • 5
    @ddkk What other kind of "rot" is there but physical rot? Data is stored in physical circuits using physical electrons. Perhaps you need to clarify what exactly you think "rot" is. – Elliot Alderson Feb 02 '21 at 13:26
  • @ElliotAlderson the above answer suggested simulating rot by actively flipping/setting bits, which gives the same result as physical data rot but is not the same process – ddkk Feb 02 '21 at 13:38
  • 1
    I like this idea, in part because it can be made repeatable using a PRNG so you can make accurate comparisons. If the random factors are spread over the entire device you'll see the effect of larger file sizes, and less effect when the daemon flips bits in the unused portions. – Spehro Pefhany Feb 03 '21 at 01:30
  • 6
    @ddkk do tell, what is the difference between artificial and physical rot? Certainly one could not reliably tell. – tuskiomi Feb 03 '21 at 07:07
  • @tuskiomi It's easy to *think* that, but I guarantee that when you get an *actual* failure there will be some difference that makes it break even though you tested it artificially. Maybe it breaks some internal bookkeeping data in the drive which makes a whole swath of data inaccessible. Maybe one of the address lines gets stuck. Maybe the same bit gets stuck in every word you read for a whole second and then goes back to normal. – user253751 Feb 03 '21 at 22:51
  • @user253751 I don't think that's data rot, I think that's called an invalid state. You're speaking in terms of memory onboard a CPU, no? – tuskiomi Feb 03 '21 at 23:52
  • @tuskiomi I do agree with you that one could not reliably tell the difference, but two things that appear the same are not necessarily the same. The project in question requires the physical process to be part of it, not a probabilistic emulation of the effect. There is no need to ask me to explain why I want to do something one way instead of the other when I explicitly state what I want to do. – ddkk Feb 04 '21 at 09:04
  • 1
    @ddkk but most SSD media you have available (hard drive, USB stick or SD card) will use error correction, and thus even if memory cells get weak, the error correction algorithms will fix the data for you, until too many bits are bad and they return no data at all, only a read error. With this kind of media, you have no access to the actual bits, as they go through the flash storage controller interface. – Justme Feb 04 '21 at 09:16
  • 1
    @Justme Absolutely. That's why I brought the question to the electronics board instead of a more IT-centred board. – ddkk Feb 04 '21 at 09:23
  • 2
    @tuskiomi On an actual magnetic hard disk, zeroes and ones aren't just written directly on the disk; the data is run through some variant of [RLL encoding](https://en.wikipedia.org/wiki/Run-length_limited) and stored as a series of transitions between two magnetization states. A read error, therefore, isn't just going to be a bit flip, it's going to be the result of the RLL decoding process seeing some change in magnetization (movement of a transition, maybe?). In order to generate the relevant probability distribution, you'd need to understand & simulate that entire process. – Gordon Davisson Feb 04 '21 at 17:37
  • @GordonDavisson and.... is that impossible? – tuskiomi Feb 04 '21 at 17:47
  • 2
    @tuskiomi Not impossible, just seems the amount of research involved would be more difficult than finding an actual flaky HD (or waving a degausser near one, or...). – Gordon Davisson Feb 04 '21 at 17:49
  • @GordonDavisson I'm rather sure this is the biggest placebo I've ever seen. – tuskiomi Feb 07 '21 at 00:31
12

Flash loses charge faster at higher temperatures. However, within a week you probably won't be able to trigger that "rot" at a temperature your flash memory (and the same goes for the magnetization of hard drives, by the way) can survive without combusting.

So you'll have to take a different route. As Justme says, you'll have to stress your medium.

The classical stress here would be write stress. A sensible test would go like this (a Python sketch of the loop follows the list):

  1. Use a pseudo-random number generator (PRNG) (e.g. xoroshiro128+, or really anything that takes a seed), and a random seed \$a\$.
  2. Seed your PRNG with \$a\$.
  3. Start a timer.
  4. Generate a block-sized multiple of random data (e.g. 4 MB) and write it straight to your storage device (not through a file system, but to the raw device). While that is writing, prepare the next block of random data (operating systems tend to buffer things, so you can continue working while the write is in flight).
  5. Repeat step 4 until your stick is full.
  6. Close the device and flush the buffers to the stick (this is OS-dependent, and easier on Linux than on Windows, for example). Note down the time on the timer and use it to compute the average write speed.
  7. Seed your PRNG with \$a\$ again.
  8. Start a timer.
  9. Read a block multiple (e.g. 4 MB) of data from your device.
  10. Generate random data with your PRNG and compare it to the device-read data.
  11. Repeat step 10 until you've checked your whole block, accumulating the bit error count.
  12. Repeat steps 9–11 until you've read your whole device.
  13. Close the device.
  14. Note down the time and use it to compute the average read speed. Note down the number of bit errors.
  15. Pick a new \$a\$.
  16. Go back to step 2.
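
A compressed sketch of this loop, using Python's seedable PRNG (`random.Random.randbytes`, Python 3.9+) in place of xoroshiro128+: the device path is a placeholder for your raw stick (everything on it will be destroyed, and the script needs root), and cache handling (O_DIRECT, dropping caches between passes) is glossed over, so the read speeds are only indicative.

```python
import os
import random
import time

DEVICE = "/dev/sdX"        # hypothetical raw device, NOT a mounted filesystem
BLOCK = 4 * 1024 * 1024    # 4 MiB per write/read

def stress_pass(seed):
    rng = random.Random(seed)
    written = 0
    t0 = time.monotonic()
    with open(DEVICE, "wb", buffering=0) as dev:      # steps 2-4: seed, time, write
        while True:
            buf = rng.randbytes(BLOCK)
            try:
                n = dev.write(buf)
            except OSError:                            # step 5: stop when the stick is full
                break
            if n < BLOCK:                              # partial write at the end of the device
                break
            written += BLOCK
        os.fsync(dev.fileno())                         # step 6: flush buffers
    write_time = time.monotonic() - t0

    rng = random.Random(seed)                          # step 7: re-seed with the same a
    bit_errors = 0
    t0 = time.monotonic()
    with open(DEVICE, "rb", buffering=0) as dev:       # steps 8-12: read back and compare
        for _ in range(written // BLOCK):
            got = dev.read(BLOCK)
            expected = rng.randbytes(BLOCK)
            for g, e in zip(got, expected):            # slow but simple bit-error count
                if g != e:
                    bit_errors += bin(g ^ e).count("1")
    read_time = time.monotonic() - t0
    return written, write_time, read_time, bit_errors

while True:                                            # steps 15-16: new seed, repeat
    seed = random.randrange(2**64)
    n, tw, tr, errs = stress_pass(seed)
    print(f"seed={seed:#x}: wrote {n} bytes in {tw:.0f} s, "
          f"read in {tr:.0f} s, {errs} bit errors")
```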

Depending on your device's quality, your luck, and the speed at which you can write, you should be seeing increasing error rates (i.e. memory cells rotting!) and decreasing read speeds.

Read speed reduction mainly stems from the fact that all modern mass memory employs internal checksums and/or error correcting codes. If they detect a broken memory word, error correction kicks in. Decoding erroneous words takes time, and gets more involved the more broken the codeword is.

The harsh truth is that at modern memory densities, physics isn't nice to anyone, and random bit flips will happen. That's not bad – that's why we have modern channel coding / error correcting codes (they are the same thing). Even a perfectly new storage medium will have some bit errors, but a user will never be subjected to them (or, to be precise, only with a probability below a threshold the user can ignore), because the ability to correct this inevitability of physics is built in.

By writing repeatedly, you're degrading the physical quality of your storage medium far more than a week of mere waiting ever could. That simply makes the number of these physical bit errors (which you don't see) higher. If all goes well, the storage controller is still able to correct them – but it will need to calculate more, and hence take more time, and hence be slower at reading. It might happen that you get more errors than the decoder can correct – and then you actually see a bit error.

It's actually not trivial to measure these, because you'll honestly be measuring the bit error rate of your error-corrected storage medium against the bit error rate of your RAM, which isn't error-corrected (unless you use ECC RAM). That's why step 10 generates small amounts of random data: it will stay in the CPU cache and hopefully not get written to external RAM, which tends to have higher error rates. If you just generated your whole stick's worth of data, wrote it to RAM, and then compared that, you'd be checking your RAM more than your storage medium.

Marcus Müller
  • 88,280
  • 5
  • 131
  • 237
8

Data retention on flash media depends strongly on temperature, both when written and when stored. The hotter the memory is when you write to it, the longer the data is retained; the colder it is when you store it, the longer the data is retained.

About five years ago, a presentation made the rounds that put some hard numbers on it: for a typical solid-state drive of the time, writing at 25 °C and storing at 55 °C would give a retention time of about a week. For faster data loss, you can push the temperatures further: write the data while the memory is in a freezer, then store at the highest temperature the datasheet permits.

Mark
  • 810
  • 6
  • 12
3

At some point above you did mention "it should be physical data rot" - but as far as IT systems are concerned, you wrap abstractions into other abstractions, with adapters and conversions, until some purely software layer hands you bits. You don't physically scribble on a rotating hard disk platter with a magnetic head yourself, and without reprogramming the drive's firmware you likely don't even have a means to.

Also, IT systems contain deliberately software-defined storage constructs, like RAID arrays inside a computer or block storage over iSCSI and networking. While it may be more apparent with a crashing storage server, note that besides bit rot on the media, each and every layer involved in transporting and interpreting your bits can introduce errors - all those little processors, memory buffers, firmwares, drivers, cables, connectors, etc., as well as the error-detection/correction algorithms they apply.

So, after that preface: if you are after investigating the errors your system sees on its media, many OSes have error-injection drivers that can present a block device or a NIC with a configured level of unreliability, one that occasionally lies to your higher software layers, stutters, or times out, so you can develop solutions that work well regardless of such errors. In any case, it should not be prohibitively hard to develop one yourself if the OS you work with does not provide such test beds: for networking it could be an extension of the ubiquitous tun/tap driver (as in OpenVPN), and for storage you can make a loopback device over a file and occasionally sprinkle random bits into that file, which then plays the role of the rotting medium.
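
A minimal sketch of that last loopback suggestion, assuming a Linux host: keep the filesystem image in a regular file, attach it as a loop device, and let a small "rot daemon" flip a few random bits in the backing file now and then. The file name, rates, and mount point are arbitrary examples, and because the flips race with the kernel's page cache, corruption may only become visible after a remount; it is a simulation aid, not real media rot.

```python
# Create and attach the image roughly like this (run as root):
#   truncate -s 1G rot.img
#   losetup --find --show rot.img        # prints e.g. /dev/loop0
#   mkfs.ext4 /dev/loop0 && mount /dev/loop0 /mnt/rot
# Then let this loop quietly corrupt the backing file underneath it.
import random
import time

IMAGE = "rot.img"        # hypothetical backing file of the loop device
FLIPS_PER_PASS = 8       # how many random bits to flip each pass
INTERVAL_S = 600         # corrupt a little every 10 minutes

while True:
    with open(IMAGE, "r+b") as img:
        img.seek(0, 2)                    # find the image size
        size = img.tell()
        for _ in range(FLIPS_PER_PASS):
            pos = random.randrange(size)
            img.seek(pos)
            byte = img.read(1)[0]
            img.seek(pos)                 # rewrite the byte with one bit flipped
            img.write(bytes([byte ^ (1 << random.randrange(8))]))
    time.sleep(INTERVAL_S)
```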

Jim Klimov
  • 131
  • 2
  • 2
    The reason to use real bit rot rather than synthetic errors is that the patterns are different. Sure, single-bit errors are effectively random, but more complicated errors (e.g. a stuck cell in MLC flash) are hard to imitate. – Mark Feb 03 '21 at 22:27
  • Please correct me if I am wrong, but as far as I am educated on the matter, data is stored in physical states and due to different effects this storage is ultimately unreliable, depending on environmental influences. It is specifically these effects that I am interested in and I am asking how I can promote these effects and circumvent the error correction systems in place. Still, thanks for your suggestion! – ddkk Feb 04 '21 at 09:15
  • @Mark Yes, but even physical errors have various modes. To actually trap them all you would have to test a bunch of physical devices. – curious_cat Feb 04 '21 at 15:54
  • One example where physical rot does not behave like random bit errors is CD/DVD. Damage to a CD/DVD is typically scratches, which take out multiple adjacent bits at a time. – slebetman Feb 05 '21 at 09:09
3

Cheap writeable optical discs are pretty notorious for rotting relatively quickly (within a few years), especially erasable discs like DVD+RW or DVD-RW. Even write-once discs like CD-R or DVD+R can decay somewhat quickly, depending on the brand and storage conditions. Double-layer writeable DVD discs are even more unreliable in my experience.

Of course, a scratched disc will have read errors, if you don't want to just wait. Even dust could provoke some, if you don't want to damage the media.

Or leave a disc in the sun for a while, especially dye-side up, especially not behind glass. (Even indoors behind glass should be sufficient; UV from direct unfiltered sunlight is probably not the most important factor. Probably some heating and some actual photon energy doing stuff to the dye chemicals.)

Some burners / readers let you read the raw media (bypassing the 2 layers of error correction), allowing you to see the error rate. So you'll be able to detect rot before it gets to the point of overwhelming the error-correction for a sector. (Audio CDs only use 1 layer of error correction, allowing slightly more space for uncompressed PCM data. Data CDs normally use 2 because a bit-error is more likely to make the whole thing unusable, although apparently the VCD (Video CD) format uses the same mode as audio CD.)

Initial error rate, when you read back a freshly-burned disc, will vary based on how well your burner likes that brand / model of optical disc, and on write speed / strategy selected by the drive's firmware. (There's usually a sweet spot for a given combo of burner and disc, not always the slowest speed but usually not the fastest supported.)

e.g. http://www.digitalfaq.com/guides/media/dvd-tests.htm shows sample output from a DVD-test program, with max vs. average errors per sector, and total, for each level of ECC.

Peter Cordes
  • 1,336
  • 11
  • 16