LZ-based compression schemes work by finding and eliminating repeated strings of characters. As they compress a stream, they build up a dictionary of strings that have been encountered; when the same string is encountered again, they transmit the location of that string in the dictionary instead of re-transmitting the entire string.
In a typical case, the first few kilobytes of data actually expand a little, because the dictionary starts out (essentially[1]) empty. Only after a few kilobytes have been scanned and strings added to the dictionary do you start to get much compression.
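As a rough illustration (a toy LZW-style variant, not any particular production codec), the sketch below shows how such a dictionary grows only as input is scanned -- which is exactly why the first stretch of data compresses poorly. It also starts the dictionary with the 256 one-byte strings, as footnote [1] describes.

    def lzw_compress(data: bytes):
        """Toy LZW-style compressor: emits dictionary indices instead of raw bytes."""
        # Dictionary starts with only the 256 single-byte strings.
        dictionary = {bytes([i]): i for i in range(256)}
        next_code = 256
        current = b""
        codes = []
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in dictionary:
                # Keep extending the current match.
                current = candidate
            else:
                # Emit the code for the longest known string, then learn the new one.
                codes.append(dictionary[current])
                dictionary[candidate] = next_code
                next_code += 1
                current = bytes([byte])
        if current:
            codes.append(dictionary[current])
        return codes

Early on, almost every emitted code stands for a single byte (at 9+ bits each), so the output is briefly larger than the input; compression only kicks in once longer strings have been learned.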
To get such an algorithm to work decently on record-oriented data, you probably want to group your records into blocks of, say, 64 KB apiece. Reading a record then becomes a two-step process: first you find the block that contains the record, read it into memory, and decompress the whole block; then you find the record you care about in that decompressed data.
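Here's a minimal sketch of that block-oriented layout in Python, assuming fixed-size records, a fixed number of records per block, and zlib as the compressor (none of these are required -- any record framing and any LZ-style codec would do):

    import zlib

    BLOCK_RECORDS = 1000  # records per block; pick so a block lands near your target size

    def build_blocks(records):
        """Concatenate records into blocks and compress each block independently."""
        blocks = []
        for i in range(0, len(records), BLOCK_RECORDS):
            raw = b"".join(records[i:i + BLOCK_RECORDS])
            blocks.append(zlib.compress(raw, 6))
        return blocks

    def read_record(blocks, record_size, index):
        """Two-step read: decompress the containing block, then slice out the record."""
        block_no, slot = divmod(index, BLOCK_RECORDS)
        raw = zlib.decompress(blocks[block_no])
        return raw[slot * record_size:(slot + 1) * record_size]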
The block size you select is a compromise between compression efficiency and random access efficiency. A larger block generally improves compression, but (obviously enough) requires you to read more data to get to the records in a block. A smaller block size reduces the extra data you need to read to get to a particular record, but also reduces compression.
If you're willing to hand-roll your compression, you can do things rather differently. The general idea is to scan through a large quantity of data to build a (roughly LZ-like) dictionary of repeated strings, but not do on-the-fly compression the way LZ does. Instead, store the dictionary separately from the data, and once you've scanned through all the data, use the full dictionary to compress it. Storing the dictionary costs some space, but it means the dictionary is already built when you decompress. That greatly reduces the penalty for compressing each record separately, so when you read data you only need to read the one record you care about (plus the relevant parts of the dictionary -- though in practice, while the data is in use, most of the dictionary will probably be sitting in RAM anyway).
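You don't necessarily have to hand-roll all of it: zlib's preset-dictionary support gives much the same effect. The sketch below compresses each record individually against a shared dictionary supplied via zdict; the dictionary contents here are just a made-up example of common substrings you might have collected from a scan of the data.

    import zlib

    # Hypothetical shared dictionary built by scanning sample records for common
    # substrings; zlib will match against it, so even short records compress well.
    shared_dict = b'{"name": ""email": "@example.com", "status": "active"}'

    def compress_record(record: bytes) -> bytes:
        c = zlib.compressobj(level=9, zdict=shared_dict)
        return c.compress(record) + c.flush()

    def decompress_record(blob: bytes) -> bytes:
        # The same dictionary must be available (pre-built) at decompression time.
        d = zlib.decompressobj(zdict=shared_dict)
        return d.decompress(blob) + d.flush()

    record = b'{"name": "Alice", "email": "alice@example.com", "status": "active"}'
    packed = compress_record(record)
    assert decompress_record(packed) == record

The key property is the one described above: each record is independently decompressible, and the only shared state you need to read alongside it is the dictionary, which you can keep resident in memory.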
[1] In quite a few implementations, the dictionary starts out initialized with entries for the 256 possible byte values, but this still results in expansion -- each of those one-character strings is represented in the bit stream by a code of at least 9 bits. In other cases, those dictionary entries are "virtual" -- each is treated as being present at its proper position in the dictionary, but never actually stored.