13

This strikes me as a programming-language-independent problem.

I have a file with the content

aaabddd

When I want to insert C after the b, my code has to rewrite ddd to get

aaabCddd

Why can I not just insert C at this position?

I cannot do this in Java, Python, ... . I cannot do this on Linux, Windows, ... . Am I right?

I do not understand why C can't simply be inserted without the additional writes. Would someone please explain why this is so?
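For concreteness, here is a minimal Python sketch of what my code ends up doing (assume a file data.txt that already contains aaabddd; the file name is just an example):

    # A minimal sketch: inserting C means rewriting everything after it.
    with open("data.txt", "r+b") as f:
        content = f.read()                 # b"aaabddd"
        pos = content.index(b"b") + 1      # insert right after the b
        f.seek(pos)
        f.write(b"C" + content[pos:])      # rewrites ddd as well as writing C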

User
    Think about what happens with the bits on the disk when you want to 'insert' something at byte 128 of a 2 gigabyte file. –  Apr 29 '14 at 18:00
  • You mean with no operating system and no file-system in between? Then it will not work. With the other two in place I have no idea why it can not work. – User Apr 29 '14 at 18:01
  • 12
    Take 500 dominos and lay them end to end in a line. Now try to insert one into that line without moving the others. – GrandmasterB Apr 29 '14 at 18:14
  • 3
    @MichaelT In my dream world, you *should* only have to insert another block into the string of blocks that make up the file and distribute the contents of the current first block onto the first two blocks. Granted, this would require the file system implementers to handle odd-sized blocks - but in the situations where you *do* need this operation, it would improve efficiency so much it's not even funny. – Kilian Foth Apr 29 '14 at 18:16
  • @GrandmasterB Sorry, I mean "Why inserting **must** not work when we have a file system and an operating system." – User Apr 29 '14 at 18:16
  • @User what do you think that the file system and operating system *can* do? –  Apr 29 '14 at 18:33
  • @MichaelT I thought about http://en.wikipedia.org/wiki/File_system_fragmentation and http://en.wikipedia.org/wiki/Ext4 – User Apr 29 '14 at 19:29
  • http://meta.stackexchange.com/questions/225645/what-to-do-when-a-user-radically-changes-their-own-question – gnat Apr 29 '14 at 19:38
  • 1
    @User the questions of filesystem fragmentation and how Ext4 work moves firmly into the realm of SuperUser. Please remember to *fully* specify your problem or they'll be asking about bytes again. You are asking about blocks and file systems and logical volume managers and the like. –  Apr 29 '14 at 19:40
  • @MichaelT and gnat, thanks for the tip with the SuperUser. I will have a retry there and clarify better what I mean. – User Apr 30 '14 at 07:56
  • @Kilian: It is not the "file system" that writes in blocks, it's the hardware. The "file system" may organize files in different-sized blocks than the hardware writes, but the data is still read and written in blocks at a time, not bytes/bits. Also, while in your dream world the seldom-used insert operation would improve efficiency so much for inserts, it would come at the cost of making EVERY read of that file slower, especially if that new odd-sized block ended up being written a good distance away from the block before or after it. – Dunk Apr 30 '14 at 19:05
  • @User Java and Python do the same as the disk, they just do it in memory. Python rewrites the whole string, as I expect Java does. In both cases it may be possible to replace a byte at a position (although the disk will rewrite the block). – BillThor Apr 30 '14 at 22:06

7 Answers

13

Given that most file systems store the contents of files in individual blocks that are not necessarily contiguous on the physical disk, but linked via pointer structures, it seems that such a mode - "inserting" rather than "appending" or "overwriting" - ought to be possible, and could certainly be made more efficient than what we have to do now: read the entire content, edit the stream of bytes and re-write the entire content.

However, for better or worse, the UNIX semantics of file systems were designed along the "rough and simple" paradigm in the 1970s: it allows you to do everything, but not necessarily in the most efficient way possible. Nowadays it is almost unthinkable to introduce a new file opening mode into the Virtual File System layer and have any hope of the major file systems adopting support for it. This is a pet peeve of mine, but unfortunately one unlikely to be resolved any time soon.

Kilian Foth
  • 3
    Building that might make an interesting side project for a while... – FrustratedWithFormsDesigner Apr 29 '14 at 18:14
  • 1
    Block level storage complicates the question a step further. Sticking with the OP's original example, the two versions of the string ought to fit within a single block. The bytes need to be written out sequentially and that's what necessitates shifting the tail of the string down by whatever the amount inserted. –  Apr 29 '14 at 18:22
  • It would only be efficient if you have to insert **exactly** the amount of data that can be stored in one block, in **exactly** the border between two existing blocks. – Idan Arye Apr 29 '14 at 19:20
  • Kilian Foth seems to be right. I asked a professor about this and he told me roughly the same: the "rough and simple" design allows portability and thus is more widely used. Not many file systems allow insertion, and even fewer operating systems expose it, in order to keep to a portable interface. @GlenH7 Two people who edited my question made it look like I was asking about bytes and reverted my clarification. The real question is about the interface we use. – User Apr 30 '14 at 10:53
  • Yes, the blocks are linked via pointers and thus the file contents don't have to be stored contiguously but when they are stored contiguously then the hardware can read block after block without having to slow down at all. If it had to follow pointer by pointer then the read head would constantly be moving. That's why defragmentation helps speed up your computer. It puts the block pointers for files in contiguous blocks. Then the command is not read block 1, read block 3, read block 9, read block n...it becomes read blocks 1 through n. The hardware can do that much, much more efficiently. – Dunk Apr 30 '14 at 19:31
  • As always, this is a trade-off between the dominating access patterns. If you read things much more often than you write, then optimizing for efficient reads makes sense. Me, I like to edit sound files ripped from the radio, and I cry a little inside whenever a tool makes me re-write 2GB of data to disk, just because I cut off a little bit from the beginning. – Kilian Foth May 01 '14 at 08:40
  • A thought: if you store your file in two linked lists of blocks instead of one, you can insert or delete arbitrary-length subsequences of bytes without copying data already in the file. The second list of blocks keeps track of the "pieces" of your file stored in the first list of blocks, but a file written sequentially would be all in one "piece". A file with insertions/deletions could be restored to one "piece" when you defragment the disk. It would be interesting to see how efficient this can be made. – David K May 05 '14 at 20:38
  • @IdanArye Actually it wouldn't have to be exactly at the border; having it not at the border means you have to read and write one extra block. – user253751 Oct 01 '20 at 18:09
13

Theoretically, you could implement a file that would allow this sort of thing. For maximum flexibility, though, you'd need to store a pointer to the next byte along with every byte in the file. Assuming a 64-bit pointer, that would mean that 8 of every 9 bytes of your file would be composed of internal pointers. So it would take 9000 bytes of space to store 1000 bytes of actual data. Reading the file would also be slow since you'd need to read each byte, read the pointer, follow the pointer to read the next byte, etc. rather than reading large, contiguous blocks of data from disk.

Obviously, this sort of approach isn't practical. You could, though, split the file up into, say, 32 KB blocks. That would make it relatively easy to add 32 KB of data at any 32 KB boundary in the file. It wouldn't make it any easier to add a single byte as the 5th byte of the file. If you reserve some free space in every block, though, you could allow small additions of data to take place that would only affect the data in that single block. You'd have a penalty in terms of file size, of course, but potentially a reasonable one. Figuring out how much space to reserve and how to split blocks, though, tends to be much easier for a particular application than for a general-purpose system - what works in one context may be very bad in another depending on the file access and modification characteristics.

In fact, many systems that spend a lot of time interacting with files implement something like what I've described above when they implement their particular file abstraction. Databases, for example, will generally implement some concept of a "block" as the smallest unit of I/O that they can work with and will generally reserve some amount of space for future growth so that updating a row in a table only affects the one block on which that data is stored rather than rewriting the entire file. Different databases, of course, have different implementations with different trade-offs.
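As an illustration of that block scheme, here is a hypothetical Python sketch; the block size, the reserved slack, and all the names are invented for the example:

    # Fixed-capacity blocks with slack space, so a small insert rewrites
    # one block instead of the whole file.
    BLOCK_SIZE = 32 * 1024            # capacity of a block
    FILL = BLOCK_SIZE - 4 * 1024      # bytes actually stored, leaving slack

    class BlockFile:
        def __init__(self):
            self.blocks = []          # each entry is a bytearray <= BLOCK_SIZE

        def append(self, data: bytes) -> None:
            for i in range(0, len(data), FILL):
                self.blocks.append(bytearray(data[i:i + FILL]))

        def insert(self, offset: int, data: bytes) -> None:
            for i, block in enumerate(self.blocks):
                if offset <= len(block):
                    block[offset:offset] = data    # touches this block only
                    if len(block) > BLOCK_SIZE:
                        # Slack exhausted: split the block in two, the
                        # file-level analogue of a database page split.
                        half = len(block) // 2
                        self.blocks.insert(i + 1, bytearray(block[half:]))
                        del block[half:]
                    return
                offset -= len(block)
            raise ValueError("offset beyond end of file")

        def read_all(self) -> bytes:
            return b"".join(self.blocks)

    bf = BlockFile()
    bf.append(b"aaabddd")
    bf.insert(4, b"C")            # only the one affected block is rewritten
    assert bf.read_all() == b"aaabCddd"

A real database page is far more elaborate, but the trade-off is the same: wasted slack space in exchange for localized writes.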

Justin Cave
  • 3
    I'd also mention the challenge of "seek to the block that is at 1 gigabyte of a 2 gigabyte file" could take a little bit of time with the linked list of bytes implementation. –  Apr 29 '14 at 18:35
  • The issue of what happens during insertions is one that causes a lot of consternation among people designing de-duplication for storage systems. – Blrfl Apr 29 '14 at 19:40
  • Thanks for understanding that I did not mean to talk about bytes but about the bigger picture. – User Apr 30 '14 at 08:38
  • You *absolutely would not* have to store a pointer to the next byte along with every byte. You would have to store a pointer along with every *arbitrarily large group* of bytes. – user253751 Oct 01 '20 at 18:09
7

The "problem" boils down to how files are written out to the storage medium in a byte by byte fashion.

In its most basic representation, a file is nothing more than a series of bytes written out to the disk (aka the storage medium). So your original string looks like:

Address  Value
0x00     `a`
0x01     `a`
0x02     `a`
0x03     `b`
0x04     `d`
0x05     `d`
0x06     `d`

And you want to insert C at position 0x04. That requires shifting bytes 0x04 through 0x06 down one position so you can insert the new value. If you don't, you're going to overwrite the value that's currently at 0x04, which is not what you want.

Address  Value
0x00     `a`
0x01     `a`
0x02     `a`
0x03     `b`
0x04     `C`
0x05     `d`
0x06     `d`
0x07     `d`

So the reason you have to re-write the tail of the file after you insert a new value is that there isn't any space within the file to accept the inserted value. Otherwise you would overwrite what was there.


Addendum 1: If you wanted to replace the value of b with C, then you would not need to re-write the tail of the string. Replacing a value with a like-sized value doesn't require a rewrite.

Addendum 2: If you wanted to replace the string ab with C then you would need to re-write the rest of the file as you've created a gap in the file.

Addendum 3: Block-level constructs were created to make large files easier to deal with. Instead of having to find 1 MB worth of contiguous space for your file, you now only need to find 1 MB worth of available blocks to write to.

In theory, you could construct a filesystem that did byte-by-byte linking similar to what blocks provide. Then you could insert a new byte by updating the to/from pointers at the appropriate point. I would hazard a guess that the performance of that would be pretty poor.
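For illustration, here is a toy Python version of that byte-by-byte linking (all names invented; no real filesystem works this way):

    # Every byte carries a pointer to the next one, so an insert is a single
    # pointer update, but reaching byte N means chasing N pointers.
    class Node:
        def __init__(self, byte, nxt=None):
            self.byte = byte
            self.next = nxt

    def from_bytes(data):
        head = None
        for b in reversed(data):
            head = Node(b, head)
        return head

    def seek(node, offset):
        for _ in range(offset):        # O(offset): the fatal flaw
            node = node.next
        return node

    def insert_after(node, byte):
        node.next = Node(byte, node.next)   # O(1): update the to/from pointers

    def to_bytes(head):
        out = bytearray()
        while head is not None:
            out.append(head.byte)
            head = head.next
        return bytes(out)

    head = from_bytes(b"aaabddd")
    insert_after(seek(head, 3), ord("C"))   # insert C right after the b
    assert to_bytes(head) == b"aaabCddd"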


As GrandmasterB suggested, use a picture of a line of dominoes to visually understand how the file is represented.

[image: a line of dominoes standing on end]

You can't insert another domino within the line of dominoes without causing everything to tumble over. You have to create the space for the new domino by moving the others down the line. Moving dominoes down the line is the equivalent of re-writing the tail of the file after the insertion point.

  • Suppose a, b, C and d are not characters but gigabytes of characters. Could you address this in your answer? I like the picture, but I also think people would approach inserting 1000 dominoes into 2000 dominoes differently than 1 domino into 6 dominoes. – User Apr 29 '14 at 19:22
  • @User - GB of characters instead of bytes fundamentally changes the nature of your question and now blocks for storage must be considered. At a simple level, the answer is the same. You can't insert something within a contiguous series of "whatevers" without creating space. –  Apr 29 '14 at 19:33
  • This information is simply wrong. Fragmentation is a thing. – user253751 Oct 01 '20 at 18:10
0

The most efficient way to insert a block of bytes in the middle of a file would be to:

  1. Map the file to memory
  2. Append the bytes at the end of the memory image of the file
  3. Rotate these bytes into place (with a standard algorithm available in the C++ Standard Library, for instance; see the sketch below)
  4. Let the OS take care of writing dirty blocks to disk
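A rough Python equivalent of these steps, with the rotation done by slice assignment on the memory map (the file name and offset are illustrative):

    import mmap, os

    def insert(path, pos, data):
        with open(path, "r+b") as f:
            old_size = os.path.getsize(path)
            f.truncate(old_size + len(data))           # make room at the end
            with mmap.mmap(f.fileno(), 0) as m:        # 1. map file to memory
                m[pos + len(data):] = m[pos:old_size]  # 2.+3. move tail into place
                m[pos:pos + len(data)] = data
            # 4. the OS writes the dirty pages back when the map is closed

    insert("data.txt", 4, b"C")                        # aaabddd -> aaabCddd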
-1

First you need to read everything after the insert point, then write it back shifted down by as much space as you're going to insert. Then you can write your "insert" data into the correct place. It is an extremely poorly performing operation, hence not natively supported.
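Here is a Python sketch of exactly that, moving the tail in chunks starting from the end of the file so nothing is overwritten before it has been read (the chunk size and names are arbitrary):

    import os

    CHUNK = 64 * 1024

    def insert(path, pos, data):
        with open(path, "r+b") as f:
            end = os.path.getsize(path)
            while end > pos:                   # walk the tail backwards
                start = max(pos, end - CHUNK)
                f.seek(start)
                chunk = f.read(end - start)
                f.seek(start + len(data))
                f.write(chunk)                 # every tail byte is rewritten once
                end = start
            f.seek(pos)
            f.write(data)                      # finally, the inserted bytes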

Brian Knoblauch
  • 1
  • What about an SSD with random access? Also, files are split into pieces by the file system. How does that relate to writing everything again? – User Apr 29 '14 at 18:08
  • @User sure you can *access* it randomly (though you're not doing bit level access, you're doing block level still)... but how do you say what byte comes next? –  Apr 29 '14 at 18:11
  • 1
    SSD still reads and writes a page at a time. So to write your 1 byte that you want to insert you'd have to write an entire page of data along with updating all the corresponding file system tables/pointers. I would not be surprised if the initial file systems did have an insert-like operation but they realized that it added far more overhead than it saved. – Dunk Apr 30 '14 at 19:17
-1

When you do direct access to a file, you are using a low-level facility that can be used to build more sophisticated structures. Consider building a database with your data that allows the types of access you need, including insertion.

It would be less expensive if you only need to iterate through the file rather than do random accesses to a specified offset. If you need random access by offset into the file, you will need to update the index for all bytes beyond the insertion point.

In general, you will pay in indexing data structures, memory to store the index, and extra disk accesses to update it.
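As a hypothetical illustration, SQLite can stand in for such a database: the file becomes ordered chunks, and an insertion is just a new row (the fractional sequence key is only one possible indexing scheme):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE chunks (seq REAL PRIMARY KEY, data BLOB)")

    def put(seq, data):
        db.execute("INSERT INTO chunks VALUES (?, ?)", (seq, data))

    put(1.0, b"aaab")
    put(2.0, b"ddd")
    put(1.5, b"C")        # the insertion: no existing row is rewritten

    content = b"".join(row[0] for row in
                       db.execute("SELECT data FROM chunks ORDER BY seq"))
    assert content == b"aaabCddd"

Random access by byte offset, though, still means summing chunk lengths up to the target, which is exactly the index-maintenance cost described above.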

-1

Insertion into a file is not implemented in most file systems because it is considered an "expensive" (time- and space-consuming) operation with potentially long-term "expensive" repercussions and additional failure modes.

A file system with insertion semantics would probably use either shift-and-insert (potentially very expensive when you insert at the front of a large file, but with no/few long-term side effects) or some sort of generalized heap allocation with variable-length allocation sizes (very ill-behaved performance in some cases - imagine the interactive users' faces if they try to save a file during a stop-the-world GC!).

If you want to experiment, you can easily build a file I/O abstraction in Java or Python that implements insertion. If you succeed and it has well-behaved performance characteristics, you have the basis for an excellent research paper. Good luck.
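One possible shape for such an abstraction, assuming the "piece list" idea from David K's comment above (all names are invented):

    # The logical file is a list of (offset, length) pieces into an
    # append-only backing store, so insertion never moves existing bytes.
    class InsertableFile:
        def __init__(self, data=b""):
            self.store = bytearray(data)       # append-only backing store
            self.pieces = [(0, len(data))] if data else []

        def insert(self, pos, data):
            piece = (len(self.store), len(data))
            self.store += data                 # old bytes are never rewritten
            acc = 0
            for i, (off, ln) in enumerate(self.pieces):
                if acc <= pos <= acc + ln:
                    k = pos - acc              # split the enclosing piece
                    self.pieces[i:i + 1] = [p for p in
                        [(off, k), piece, (off + k, ln - k)] if p[1] > 0]
                    return
                acc += ln
            self.pieces.append(piece)          # insertion at end of file

        def read_all(self):
            return b"".join(bytes(self.store[o:o + n]) for o, n in self.pieces)

    f = InsertableFile(b"aaabddd")
    f.insert(4, b"C")                          # no byte of "aaabddd" is copied
    assert f.read_all() == b"aaabCddd"

Whether this has well-behaved performance depends on how the piece list is indexed; a long editing session fragments it, which is where the defragmentation pass from that comment would come in.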

Scott Leadley
  • this does not seem to offer anything substantial over prior 6 answers – gnat Apr 29 '14 at 19:22
  • You can write all the software you want but it won't change the way the hardware works. Hardware works by reading/writing in blocks/pages. In a HDD, if that data is not contiguous then the read head has to move which slows file access time dramatically. Any insert operation would "by the very fact of it being an insert" have to be stored elsewhere and not contiguously. So sure, insert will possibly be faster (for very large files) but reading will be much slower. – Dunk Apr 30 '14 at 19:24
  • Well it wouldn't be expensive if they implemented it! – user253751 Oct 01 '20 at 18:10