
If all data is essentially just a bit string, then all data can be represented as a number. Since a compression algorithm, c(x), must shorten its input or leave it the same length, the compressed output must be smaller than or equal to the input (and greater than or equal to 0). This can be stated as 0 <= c(x) <= x. For compression to be useful there must also be a decode algorithm, d(x), which returns the original input.

Start with the number/data item 0. 0 <= c(0) <= 0, hence c(0) = 0 and d(0) = 0. Then 0 <= c(1) <= 1, and c(1) != 0 because d(0) = 0, so c(1) = 1. We can continue this: 0 <= c(2) <= 2, c(2) != 0 because d(0) = 0, c(2) != 1 because d(1) = 1, so c(2) = 2. This pattern repeats forever, showing that no compression algorithm can, without losing any data, produce output smaller than the original input.

I do understand how some compression algorithms work, such as run-length encoding (RLE), but I cannot see how they avoid this issue. One reason I can see why RLE or other algorithms might fail is the problem of how to transmit plain, uncompressed text. E.g., if x = 8888123, you can compress that to "four 8s, then 123", which is then transmitted as 48123. But then how can you transmit a literal 48123?

And if you then add a special delimiter signal to signify that the "8" is a repeat count, you are adding even more data. If you use a special signal such as 1111, then you would transmit 111148123 to represent 8888123, but then how do you transmit a literal 111148123, and so on. This is just an example of why a demonstration of RLE compressing data doesn't show that compression is actually useful.

Peter Mortensen
Mercury
  • 50
    You are correct, there is some input (often called "pathological") which frustrates a particular lossless compression algorithm. The "fix" is to simply not compress it, perhaps adding a few extra leading bytes to indicate this. There can be no guarantee that all data will be shorter, just the expectation that the great majority of typical data will be shorter. – user949300 Feb 07 '23 at 09:25
  • 10
    One aspect that wasn't mentioned explicitly in the answers is that real data isn't random – it follows some probability distribution, and we can create algorithms that produce good results on typical data, for example if heuristics like delta-encoding are appropriate, or if a string we're compressing contains English text. There's also the concept of [Kolmogorov complexity](https://en.wikipedia.org/wiki/Kolmogorov_complexity): some strings contain little actual information, and we can replace them by a shorter algorithm to reconstruct the input. The hard part is finding optimal algorithms :) – amon Feb 07 '23 at 13:16
  • 1
    Your examples are a little too basic. Consider this (fairly naive) compression: take all the words in your question and create a table of each unique word followed by all the positions at which it appears. I count the word 'compression' 5 times, which takes up 55 bytes in ASCII. In the table it would take 11 bytes (for the word) and then 5 bytes to show each location. We can therefore reduce the original 55 bytes to 17 bytes (1 byte for a record delimiter) for that word. A good place to start for some deeper understanding is the gzip algorithm. – JimmyJames Feb 07 '23 at 17:25
  • Sorry, you'd need two bytes for the location but it's still 55 bytes down to 22. – JimmyJames Feb 07 '23 at 18:17
  • 6
    Looks like the title should be "universal/ultimate vs. lossless compression", as the body seems to imply that "compression" means the ability to convert any data to a shorter representation of the same data... Unfortunately schools don't teach basics like the [pigeonhole principle](https://en.wikipedia.org/wiki/Pigeonhole_principle) that would stop such questions... along with "Universal code to reverse hash of values". – Alexei Levenkov Feb 07 '23 at 18:55
  • In `c(x)`, is `x` the actual number you're trying to compress? Because usually that's the number of bits. And usually we don't think of binary data as decimal numbers (although one can, if one wants to). If it's the number, then you can certainly have a compression scheme where `c(2) = 1` if you want. – NotThatGuy Feb 07 '23 at 18:55
  • 1
    You took an example where RLE probably wouldn't help, because there aren't *runs* with enough *length* in the input (not that RLE is that useful to begin with - it's just a simple intro to compression). The example on [Wikipedia](https://en.wikipedia.org/wiki/Run-length_encoding) is `WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW`, which converts to a much shorter `12W1B12W3B24W1B14W`, and you'd just transmit that. This assumes there aren't numbers in the input, but something like that does sometimes happen, and there are other solutions if it doesn't. – NotThatGuy Feb 07 '23 at 19:06
  • 1
    Imagine you have a 100 by 100 image full of perfectly white pixels, or a string of 50 "X"s and then 50 "Y"s, both of which would not fit into this comment. But you could still reconstruct both of these pieces of data perfectly. – Hobbamok Feb 08 '23 at 10:16
  • You ask about the content, but should focus the algorithm. If your compression algorithm is only able to compress the bible, you can compress it in a single bit: "In the beginning..." -> "1". To decompress, use the same algorithm: "1"->"In the beginning..." (lossless). Of course this is a stupid algorithm, but here goes another: there are ~172000 words. Each word has ~4.7 letters length (37.6 bits). If you encode each word as a set of bits, you just need 18 bits. So, you can compress any text to half the size, and get back lossless. Of course, better algorithms exist, those are just examples. – RodolfoAP Feb 08 '23 at 16:16
  • On a more philosophical level: True, we cannot compress reality (or any part of it) without losing information. What we do when we compress "lossless": We discard information we deem irrelevant and compress the remaining information; we create an *abstraction* and compress that. – Peter - Reinstate Monica Feb 08 '23 at 17:36
  • 3
    Because the title is asking about an alleged universal truth, I feel the need to fact check this: are you assuming that your uncompressed data is already as lean as possible and not rife with redundancies? Because I would argue that redundant data can very easily be compressed without loss of information. – Flater Feb 08 '23 at 22:09
  • 2
    Posted here because of an expectation of a bad reception at other Stack Exchange sites, like *[Computer Science](https://cs.stackexchange.com/tour)*? – Peter Mortensen Feb 08 '23 at 23:47
  • Compression is no problem. It's the _decompression_ where the issue is ... – davidbak Feb 09 '23 at 20:56
  • Posting as comment because I don't have enough rep to answer a highly active question. Your argument is flawed. If `c:B->B` (B is the set of all bit strings) is lossless then it's a bijection and you can, indeed, prove that `0 <= c(x) <= x` implies `c(x) = x` as you have done. However, what you should have stated is: `0 <= len(c(x)) <= len(x)`, where `len(x)` is the length of the message `x` (in bits, for instance). `len(x)` is clearly not a bijection, so your point does not stand. – Nacib Neme Feb 10 '23 at 03:03
  • 3
    Yes, this is how you prove that, if `0 <= c(x) <= x` holds for all x then c must be the identity function. However if you allow `c(x) > x` for some x, then you could have a subset A of x where `c(x) <= x` for x in A and not have a contradiction. – pygri Feb 10 '23 at 10:14

7 Answers


There can be no algorithm that losslessly compresses all inputs. But there are many algorithms that losslessly compress many inputs. And it turns out that most of the strings we like to operate on have enough internal redundancy that they compress rather well.

Think of it as a Robin Hood-style exchange: the algorithm maps an infinite number of possible inputs to an equally infinite number of possible outputs, so if it maps some strings to shorter strings, it must map some other strings to longer strings. But we can arrange things so that it takes from the dull strings to give to the interesting strings :-)
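
As a concrete illustration of that trade, here is a small Python sketch using the standard-library zlib module; the repetitive text and the random bytes are simply stand-ins for "interesting" and "dull" inputs:

import os
import zlib
# Highly redundant input ("interesting"): repetitive ASCII text.
text = b"the quick brown fox jumps over the lazy dog " * 100
# Structureless input ("dull"): uniformly random bytes of the same length.
noise = os.urandom(len(text))
for label, data in [("text", text), ("random", noise)]:
    compressed = zlib.compress(data, level=9)
    print(label, len(data), "->", len(compressed), "bytes")
# Typical result: the text shrinks to a small fraction of its size, while the
# random block comes out a few bytes larger than the input, because zlib has to
# add a header and bookkeeping it cannot win back on structureless data.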

Kilian Foth
  • 1
    So does that mean that on average, the compressed data is the same size? As in, half the time it's smaller and half the time it is larger, but the interesting strings are the ones which become smaller. How come the algorithm "knows" which strings are interesting, or to rephrase, what property do interesting strings have which dull strings don't? To the human eye some binary code can seem very random, and obviously you can reformat it into an image or something where it makes sense, but can you not reformat even random data in a way which "makes sense"? – Mercury Feb 07 '23 at 10:55
  • 11
    Just as it's impossible to shorten *every* string, it's impossible for the average compressed size to be smaller than the average input size. So unless the algorithm is perfect, it can only create output that is, on average, at least slightly *larger* than the input (surprising, but true!). – Kilian Foth Feb 07 '23 at 11:11
  • 4
    Different compressors can target very different strings. As gnasher describes below, audio compression might predict the next frame and then send only the difference to the prediction. This targets a rather different class of inputs than straightforward run-length encoding. – Kilian Foth Feb 07 '23 at 11:12
  • Ahh ok, that makes sense, thank you. – Mercury Feb 07 '23 at 11:47
  • 5
    "How come the algorithm "knows" which strings are interesting" - it's not so much that the algorithm "knows", it's that the human writing it targets a specific use case and exploits some property associated with it - and then you sort of need to read the "fine print" in order to know when it's appropriate to use the algorithm. For a more visual example, lossless image compression will work better on images that have large fields of exactly the same color, or have repeating patterns (like stylized graphics, logos, transparent icons), than on something like an image of a forest or a rocky cliff. – Filip Milovanović Feb 07 '23 at 12:21
  • 4
    *"So does that mean that on average, the compressed data is the same size?"* The average on - what? If you calculate the average text compression factor over the set of all HTML sites for a standard zip algorithm, I am sure the compressed data will be a lot smaller than the input data on average. – Doc Brown Feb 07 '23 at 12:56
  • 1
    @DocBrown when i said average i meant the average space for every possible binary input, not just every HTML site or image. – Mercury Feb 07 '23 at 13:09
  • 54
    @Mercury: In general, compression algorithms target not arbitrary random strings, but files produced by humans (or by programs written by humans). And those almost invariably tend to have redundant structures which compress well. The most simple example would be ASCII text: never uses the upper byte, lots of repetitions of words, some letters and combinations occur far more often than others. All things that can be exploited by a compression algorithm. Pretty much the only commonly occurring files without much redundancy are those which are already compressed in some way. – Michael Borgwardt Feb 07 '23 at 13:18
  • 29
    @Mercury: well I am sure that you meant this, but what I am trying here is to make you understand where your expectations and assumptions have gone wrong. No one uses compression algorithms for randomly generated binary input in reality. And no one reasonably expects "c(x)" to be shorter than x *for every possible input x* - where did you get this wrong definition from? – Doc Brown Feb 07 '23 at 14:05
  • 8
    @Mercury What is impossible to compress effectively is random. What compression takes advantage of is that humans have never been a good source of random. – candied_orange Feb 07 '23 at 15:37
  • 4
    See https://en.wikipedia.org/wiki/Lossless_compression: "By operation of the pigeonhole principle, no lossless compression algorithm can efficiently compress all possible data. For this reason, many different algorithms exist that are designed either with a specific type of input data in mind or with specific assumptions about what kinds of redundancy the uncompressed data are likely to contain. Therefore, compression ratios tend to be stronger on human- and machine-readable documents and code in comparison to entropic binary data (random bytes)." – Blackhawk Feb 07 '23 at 21:19
  • 12
    Also consider [timsort](https://en.wikipedia.org/wiki/Timsort). Tim shows that the algorithm achieves the same worst-case performance as mergesort, but by taking advantage of the partial order present in many (if not most) real-world datasets, the average case performance can be much better. – Blackhawk Feb 07 '23 at 21:27
  • @Mercury how does it know what's interesting -- A lot of times the compression just analyzes the thing you're trying to compress first. For example, running your post through a frequency analyzer even with just single character frequency https://www.dcode.fr/frequency-analysis shows that the space character is nearly 20% of all characters, and just 7 characters make up 50% of the whole post. Whereas /:Z? only appear one time each. So even with some overhead describing which values map to what, we could easily compress that. – user3067860 Feb 08 '23 at 14:27
  • 3
    @candied_orange More specifically, random data is usually not *interesting*. However, another class of data that shouldn't compress well is encrypted data (ideally the encryption algorithm should make it seem random). So you should compress first, then encrypt (unfortunately, this is not helpful if you're doing on-the-fly compression in a communication protocol). – Barmar Feb 08 '23 at 15:52
  • @barmar true. You've got me thinking you could test how good a job your encryption is doing at making your cypher text appear random by compressing it and seeing what kind of ratio you get. – candied_orange Feb 08 '23 at 16:03
  • 1
    Expanding Peter's compressed sentence with context and outside knowledge: "Most strings can be trivially compressed with the help of a dictionary. Typical natural language is terribly redundant. The Arabic language does not use vowels when written. They always write like(?) this." – Nayuki Feb 08 '23 at 16:35
  • @candied_orange I am not sure if that is the case; while I think you might get some useful information by comparing how well your ciphertext compresses compared to the plaintext, I don't think that tells you a lot about the security of the cryptographic system. AES-CBC (Cipher Block Chaining), which was invented to get rid of data patterns that leak through AES-ECB, doesn't compress well, so it might pass your compression test, but it has some major security flaws due to its design and implementation; while better than ECB it is not very secure. – Questor Feb 08 '23 at 18:33
  • 2
    @Questor A canary tweeting in a coal mine makes a poor substitute for a proper structural inspection. – candied_orange Feb 08 '23 at 19:15
  • 3
    @Nayuki Apparently, some heavy-handed moderator action has deleted my comment. Because your comment doesn't make much sense without it, I'll repeat myself: "Mst strngs cn b cmprssd wth th hlp f a dctnry. Typcl ntrl lngg s trrbly rdndnt. Th rbc lngg ds nt s vwls whn wrttn. Thy alwys wrt lk ths." – Peter - Reinstate Monica Feb 08 '23 at 23:16
  • 1
  • While it may make sense to talk of compression algorithms separately, a compressed file is often a container that can incorporate different compression algorithms. So, working backwards, a compressor can choose between algorithms (even just storing the original data as a worst case, as stated). Hence the average compression rate of lossless compression is not bound to a single algorithm. – Jayfang Feb 10 '23 at 15:33

Compression works by making more likely patterns shorter at the cost of less likely patterns

Suppose we have a data format which is a string of 4 possible values, we could represent it with 2 bits each:

ABAABAACADAADCAAABAAA

Normally, we could store this as 2 bits per character. This string would cost 42 bits in that format, like this:

000100000100001000110000111000000001000000

However, notice that there are more As than any other letter. We can exploit this and instead use a variable number of bits depending on the letter:

  • A as 1
  • B as 001
  • C as 010
  • D as 011

Notice that the code for A has gotten shorter, while every other letter needs a longer code so that it cannot be confused with A. Now, if we encode the string, we'd get this:

10011100111010101111011010111001111

Which is only 35 bits long. Overall, we've saved seven bits.

However, if we tried to encode the string DDDDDD in the compact encoding, its length would be 18 bits instead of the 12 we'd get uncompressed. So for this "less likely" string the compression actually made the string significantly longer.
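
The trade-off is easy to check with a small Python sketch of the code table above; nothing here is specific to any real compressor:

# Prefix code from above: 'A' is common, so it gets the shortest codeword.
CODES = {"A": "1", "B": "001", "C": "010", "D": "011"}
def encode(s: str) -> str:
    return "".join(CODES[ch] for ch in s)
for s in ("ABAABAACADAADCAAABAAA", "DDDDDD"):
    fixed = 2 * len(s)           # 2 bits per symbol in the plain encoding
    variable = len(encode(s))    # bits in the variable-length encoding
    print(s, fixed, variable)
# ABAABAACADAADCAAABAAA: 42 bits fixed vs. 35 bits variable (7 bits saved)
# DDDDDD:                12 bits fixed vs. 18 bits variable (6 bits lost)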

Not all compression is based on replacing specific characters. Some compression algorithms use run-length encoding, which requires extra bits whenever a sub-string is not repeated. Some include a table of replacements in the file itself, which allows the most flexibility but costs the size of the decoding table. Some compression systems have a bit flag that lets you turn off compression altogether, but even in that case you still pay the cost of the flag.

In practice, though, most files we compress have a lot of patterns. For example, most text on the internet is printable ASCII, even on foreign-language sites, because of HTML tags and markup. Thus, shortening printable ASCII characters at the cost of non-printables and higher Unicode code points is basically always a significant win. This is just one example of a very strong pattern that makes compression very effective in the real world.

mousetail
  • 2
    Your first line perfectly summarises what I've struggled to describe succinctly in my answer! – Jasmijn Feb 08 '23 at 10:51

I recommend learning about Shannon entropy, but the simple answer is: lossless compression can't compress all inputs, but it can compress some inputs, and in practice that is enough.

To demonstrate I'd like to amend your RLE algorithm: a 0 followed by another digit n indicates that the next n digits are literal and not compressed. The 0 was free, because we would never need to include 0 counts of anything. This means that 48123 can be either encoded as 1418111213 or 0548123, but we'll only care about the shortest encoding.
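
Here is a simplified Python sketch of that amended scheme (it only chooses between a pure run encoding and a pure literal encoding of the whole digit string, rather than mixing the two, so it is an illustration rather than an optimal encoder):

from itertools import groupby
def encode(digits: str) -> str:
    # Candidate 1: run pairs "<count><digit>", with counts capped at 9.
    runs = ""
    for digit, group in groupby(digits):
        n = len(list(group))
        while n > 9:
            runs += "9" + digit
            n -= 9
        runs += str(n) + digit
    # Candidate 2: literal blocks "0<n>" followed by n uncompressed digits.
    literal = ""
    for i in range(0, len(digits), 9):
        chunk = digits[i:i + 9]
        literal += "0" + str(len(chunk)) + chunk
    return min(runs, literal, key=len)
def decode(encoded: str) -> str:
    out, i = "", 0
    while i < len(encoded):
        if encoded[i] == "0":   # literal block: 0, length, then raw digits
            n = int(encoded[i + 1])
            out += encoded[i + 2:i + 2 + n]
            i += 2 + n
        else:                   # run pair: count, then the repeated digit
            out += encoded[i + 1] * int(encoded[i])
            i += 2
    return out
print(encode("48123"))      # 0548123 (the literal block beats the run pairs)
print(encode("000000000"))  # 90      (best case: a single nine-digit run)
assert decode(encode("48123")) == "48123"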

Now think of all the possible 9-digit inputs. In the best case, some of them are compressible to 2 digits: 9n, where n is whatever digit is repeated nine times. In the worst case, we need 11 digits to "compress" the sequence, with 2 extra digits to encode that we're not actually going to compress anything.

But what is the average case? If we had a million different inputs, how much storage space would we need in total?

That would depend on how common each input is. Imagine the input comes from detecting some relatively rare event, so that 90% of the inputs are all zeroes and in 99% of the remaining inputs only one of the digits is non-zero. Then the average output size would only be about 2.5 digits, even though some output values take up 11 digits.

So it's really the combination of two helpful facts:

  1. While we can't compress all inputs, we can limit how bad the worst case gets.
  2. Most data we care about compressing is relatively low in entropy, which means it is highly compressible.

As an aside, you know how if you're watching a streaming video on like Netflix or YouTube, and there's snow or other busy small details rapidly changing, the resolution drops and everything gets blurry? That's because those types of images have high entropy and are harder to compress, which means the only way to compress that part of the video without it stopping to buffer is to lose information. With lossless compression, we can't make a trade-off like that, which means it's not suitable for streaming video. If we only care about total storage space, then it doesn't matter if there are some parts that take more space, as long as there is enough compression on average.

Jasmijn
  • Thanks for your comment. Obviously in real-world applications, most data is not mainly just 0s, e.g. code for something. Code transformed to binary seems, to most people, indistinguishable from random noise; does it have some special property where there are many patterns which a computer can pick up on? – Mercury Feb 07 '23 at 11:20
  • 3
    Yes, that was just a very clear example, but most data has some kind of pattern. A lot of binary data does include text sections, images often contain patches of repeated patterns, a lot of files are structured in a predictable way, so different fields are always at the same offset no matter their value, and those values aren't uniformly distributed either. It really depends on the specifics, which is why modern lossless compression software (post 1980s IIRC) has different strategies it can try on the data, suitable for different types of redundancy. – Jasmijn Feb 07 '23 at 13:06
  • Ahh ok, that makes sense, thank you very much. – Mercury Feb 07 '23 at 13:10
  • @Mercury *"Obviously in real world applications, most data is not mainly just 0s"* You might be surprised at how often it is. (Often, when we expect that data will be particularly sparse, we'll use a representation tailored for that, but that's just another form of compression.) – Ray Feb 07 '23 at 19:27
  • 6
    @Mercury Actually, a lot of real-world data involves a _lot_ more redundancy than you might think, it’s just in the form of patterns and other structure, not long runs of empty space. As an illustrative example, try and import an executable or shared library as a raw image in GIMP, the structure is immediately obvious, and if you know enough about the executable file format you can even infer where specific parts are based on how things look. – Austin Hemmelgarn Feb 07 '23 at 21:08
  • 1
    Even within sections of an executable that are machine instructions rather than data, some instructions are much more commonly used than others, giving scope for compression. – Dan Is Fiddling By Firelight Feb 07 '23 at 21:24
  • 1
    @Mercury `Code transformed to binary to most people seems indistinguishable from random noise` -- this is obviously not true. Have you ever seen an image (or video) of random noise? It looks like static on TV (see https://commons.wikimedia.org/wiki/File:Tv_static.jpg). Fully random noise is what "no data" is about. Any data will depart from random, because data is some work done to represent something, which means it increases order and reduces noise. – slebetman Feb 08 '23 at 06:46
  • 2
    @Mercury If by "code transformed to binary" you mean compiled machine code, I'm afraid you're wrong, it is extremely distinguishable from random noise. It's not very pleasant/easy for a human to read (and certainly not if you try to look at it via a program designed to display ASCII or unicode text, like `cat`ing it to a terminal or opening it in Notepad), but it certainly isn't *random*. Machine code is a well-specified simple code with lots of redundancy (not every possible word is a valid operation). Binaries usually compress quite well. – Ben Feb 09 '23 at 01:12
  • @AustinHemmelgarn How can you open an executable in GIMP? For me it just says "Opening failed: unknown file type". – leo848 Feb 11 '23 at 20:57
  • 2
    @leo848 By default, GIMP tries to autodetect file type, which obviously fails for executables. You have to explicitly force the file type to ‘Raw Image Data’ in the open dialog by clicking on the ‘Select File Type’ thing at the bottom of the dialog and selecting ‘Raw Image Data’ from the list that that exposes. – Austin Hemmelgarn Feb 11 '23 at 21:03
  • 1
    @AustinHemmelgarn Wow, this is really interesting and visualizes the entropy really well. Thanks! – leo848 Feb 11 '23 at 21:14

For example, if you have a file with 16-bit audio, you can obviously represent it with 16 bits per sample. However, a simple lossless compressor will have an algorithm that predicts the next sample and only stores the difference between the real and the predicted next sample (predictor + corrector). The corrections usually need much less than 16 bits. Take a 1,000 Hz sine wave that hasn't been produced 100% correctly: your lossless compressor will predict a 1,000 Hz sine wave and store the tiny differences between its prediction and the real samples.

And if your audio file contains just random 16-bit values, then lossless compression won't work very well. Note: you can always get away with only one wasted bit. Have bit 0 of your "compressed" data mean "compressed" or "uncompressed", so for any sample that your lossless compression algorithm cannot shrink, you just set bit 0 to 1 and follow it with the original data.
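
Here is a toy Python sketch of that predict-and-correct idea, using the 1,000 Hz sine example above (the slight "imperfection" term and the second-order predictor are made up for illustration):

import math
RATE = 44100  # assumed sample rate in Hz
samples = [
    int(30000 * math.sin(2 * math.pi * 1000 * t / RATE)) + (t % 7 - 3)
    for t in range(1024)  # a slightly imperfect 1,000 Hz sine in 16-bit range
]
# Trivial predictor: assume the signal keeps changing at the same rate,
# predicted[n] = 2*samples[n-1] - samples[n-2]; store only the prediction error.
residuals = samples[:2]
for n in range(2, len(samples)):
    residuals.append(samples[n] - (2 * samples[n - 1] - samples[n - 2]))
print(max(abs(s) for s in samples))        # ~30000: needs about 16 bits
print(max(abs(r) for r in residuals[2:]))  # a few hundred: roughly 11 bits
# Decoding reverses the prediction, so the original samples come back exactly.
decoded = residuals[:2]
for n in range(2, len(residuals)):
    decoded.append(2 * decoded[n - 1] - decoded[n - 2] + residuals[n])
assert decoded == samples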

gnasher729
  • Specifically, audio data has already been filtered to eliminate sounds above human hearing range. Thus, we know the max amount of change between two audio samples, and that delta can normally be represented by much less than full resolution. – bta Feb 07 '23 at 18:10
  • 1
    @bta *Normally*, but not always. The delta between samples is a function of both frequency and amplitude, so we can represent *most* audible sounds using small deltas, but we'll lose information when the frequency and amplitude are both high -- for instance, transients like drums. But since most deltas are small, we could use a variable-length code that losslessly encodes small deltas using fewer bits and large deltas using more bits. – NobodyNada Feb 07 '23 at 18:57
  • You would also not just record the difference between samples, but possibly higher differences. Especially with lower frequencies that will give you a better prediction. And a 15,000 Hz sound won't have a huge amplitude, it would blow your speakers apart. – gnasher729 Feb 07 '23 at 19:42
  • 5
    The best part is the worst-case audio stream that cannot be compressed, a stream of random values, is basically white noise which is something no human would want to save as an audio file because it is completely uninteresting. – slebetman Feb 08 '23 at 06:39
  • You could record the differences between samples, which is usually an improvement already. But you could predict the change: Sample n+1 - sample n = sample n - sample n-1, and only record the error in this prediction. That will usually be smaller again. Or you find a better predictor than this trivial one. – gnasher729 Feb 09 '23 at 16:52
  • 2
    @slebetman as someone who works in the audio field I can assure you that people do sometimes save white noise to an audio file. It's sometimes useful e.g. for equipment calibration or noise masking. There's not much point in trying to compress it, though. – Jeremy Friesner Feb 10 '23 at 15:10

If our data were sequences of random values, with longer sequences less likely than shorter ones, then lossless compression would on average be worse than useless.

But our data is far from random sequences of values. Take for example a piece of English text encoded in ASCII.

The first observation we can make is that not all byte values are used, and of those that are, some occur far more frequently than others. By assigning shorter codes to frequently used characters and longer codes to less frequently used characters, we can encode the text in a smaller space.
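
A rough way to quantify that skew is to count byte frequencies and compute the Shannon entropy, the lower bound in bits per byte for any coder that treats bytes independently; this sketch simply uses the first sentence of the paragraph above as its sample text:

import math
from collections import Counter
text = (
    "The first observation we can make is that not all byte values are used, "
    "and of those that are, some occur far more frequently than others."
).encode("ascii")
counts = Counter(text)
total = len(text)
# Shannon entropy in bits per byte of this particular sample.
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(len(counts), "distinct byte values out of 256")
print(round(entropy, 2), "bits of entropy per byte, versus the 8 bits stored")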

Stepping out a level we can observe that the text is formed of words. Many combinations of letters simply never occur. Stepping out another level we find it's not just words, but combinations of words that are used repeatedly.

General-purpose lossless compression algorithms are designed to find and describe patterns in somewhat arbitrary data. They work well in cases where the patterns are easily distinguishable and the distance between repeats of a pattern is not too long.

Most compressors are also designed with an "escape hatch": if no patterns can be found, or if describing the patterns takes more space than the original data did, there is a mechanism for passing the data through with only a small amount of overhead.

General-purpose compressors tend not to do very well on things like photographs, audio, and video, though. The patterns in these types of data do exist, but they are often fuzzier, requiring some "correction" to be applied on top of the pattern. Another problem is that images and video are multi-dimensional: because video is cut into frames and an image is cut into lines, pixels that are next to each other in the image can end up at distant locations in the file. Special-purpose compressors can do better because they know what sorts of patterns to look for and because they can work in multiple dimensions.

Peter Green
  • Compression is all about finding the pattern. For example I can list 1000 digits or simply say, "the first 1000 digits of π". – candied_orange Feb 08 '23 at 15:38
  • Another aspect is that you have to be able to describe the pattern to your decompressor, which has to be able to recreate it. You could make the decompressor a virtual machine capable of running arbitrary algorithms but generally we don't. – Peter Green Feb 08 '23 at 19:04
  • A virtual machine capable of running arbitrary algorithms? You sound like you're describing the JVM. You need a bit more than that. You need to recognize the pattern, be able to encode and decode the pattern, and either save bits while doing it or give up and spit em out raw. What kinds of patterns you take on dictates what kind of compression you're doing. – candied_orange Feb 08 '23 at 19:23
  • Yes, on the compressor side you need to be able to recognise the pattern. Then you need to encode it and then the decompressor needs to recreate it. There are a couple of approaches to this, one is to tightly tie together the design of the compressor and decompressor. Another is to define the decompressor first and then a valid compressor is anything that generates valid input for the decompressor. – Peter Green Feb 08 '23 at 21:57
  • When taking the "decompressor defines the standard" approach, you have a decision to make as to how flexible you make the decompressor. Do you make the decompressor largely "fixed function" or do you allow the compressor to program the decompressor. There are pros and cons to both approaches. – Peter Green Feb 08 '23 at 21:58

Other answers are right: not all data can be compressed. I'm not aiming to replace the accepted answer, but to provide some other descriptive text that might also be useful.

I'd like to expand on the idea, though, that "we can arrange things so that it takes from the dull strings to give to the interesting strings".

It turns out that a lot of data you care about is actually "interesting".

Perhaps the clearest example I can provide is snow. No, I'm not talking about slightly melted ice; I'm talking about "visual static". See Wikipedia's page on "Noise (video)" for some examples.

Now, as it turns out, some people may find looking at such snow interesting. In particular, when old televisions didn't get a full visual signal but their antenna picked up something, people could sometimes make out vague shapes moving in what was mostly just a bunch of visual gibberish.

Okay, so maybe you think a bunch of random dots of different grey shades is interesting. However, here is the key question: is any one such screen of snow any more interesting than any other?

What if the pixel seventeen columns from the left and twenty-seven rows from the top were four shades darker? Would that image be more or less interesting to you? Would you even notice? Even if you were told to check, and you had a way to measure it and did, would you care?

See, an average individual screen of randomized snow is, frankly, rather boring, not really any more or less interesting than any other similar picture.

Yet, arranged pixels can create glorious images that we care about.

Okay, let's take a look at another example: a sine wave with frequency modulation. The sine wave typically moves upward and downward between values of positive one and negative one. What you typically find is that the wave moves up, and then it moves down, and then it moves up again. What makes it interesting is how quickly the sine wave is moving up or down.

If the thing, whatever the sine wave is measuring, had truly random data instead of useful ("interesting") data, then you wouldn't see an organized wave. You would see dots at random places. Maybe they wouldn't even follow the rule of staying between negative one and positive one.

See, if I told you:

0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5

you see an interesting pattern. If I told you:

0.5 0.2 973 -4087265 0.42 -10 -15 8279827329.328932

Such numbers have no discernable pattern, and so aren't interesting.

Interesting spreadsheets store information in rows and columns in ways that are interesting. In contrast, one specific instance of really random data is typically not much more interesting than any other piece of truly random data. (And if we don't need to identify a specific piece of random data, then lossless compression is probably not as desirable if lossy compression can compress more strongly.)

So, when we work with data, we typically work with "interesting" data, which has some sort of discernable patterns. Data compression software tends to look for such patterns.

For instance, English text tends to use ASCII codes from 65 to 90 (A-Z) and 97 to 122 (a-z) and others (spaces, periods, sometimes numbers and commas). Much more rarely do you use some of the other characters on a keyboard (like curly braces { }, which may get heavily used in computer programming but not in typical English writing), and it is quite uncommon to use some of the other bytes, like the character ¼. But while English doesn't typically use such characters, data compression can typically use all of the possible values a byte can be set to, so there is a larger pool of available characters to use than the group of characters that need to be represented. Combine that with concepts like vowels showing up more often, and it turns out that text is typically highly compressible.

Jasmijn's answer describes what FAQS.org's compression FAQ (section 8) calls the "pigeon hole" principle. Allow me to visualize this a bit:

All possible values with three bits:

000
001
010
011
100
101
110
111

All possible values with two bits:

00
01
10
11

You can't compress all eight of the 3-bit possibilities into just two bits of information. If those "two bit" variations represent "compressed data", there are only four possible compressed files that can exist, so they can stand in for only four of the original, uncompressed 3-bit possibilities. The remaining 3-bit possibilities have no smaller form available that isn't already taken.

Granted, you could try some techniques like also allowing 1-bit values, but you'll still come up short, and then you probably need to store additional information, like the length of the compressed file. Keep in mind that every single bit you use up, for anything, could otherwise have represented twice as many potential files.
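
The same counting works for any length: there are 2^n bit strings of length exactly n, but only 2^n - 1 strings of any shorter length, as this small Python tally shows:

# Pigeonhole counting: 2^n strings of length exactly n, but only
# 1 + 2 + 4 + ... + 2^(n-1) = 2^n - 1 strings of any shorter length,
# so no lossless compressor can shorten every length-n input.
for n in range(1, 9):
    exactly_n = 2 ** n
    strictly_shorter = sum(2 ** k for k in range(n))  # lengths 0 .. n-1
    print(n, exactly_n, strictly_shorter)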

TOOGAM
  • 1
    ¼ is not a ascii value – mousetail Feb 08 '23 at 07:29
  • 1
    Fair enough. I admit that my research indicates that the term ASCII is most properly applied to just the first 128 characters (zero through 127), and others fit into the category known as "high ASCII" which is a confusing term because such "high ASCII" characters aren't specified by the ASCII standard. (My recent mis-use of the term was based on a long-standing incorrect understanding that I only corrected more recently, and so was a slip-up on my part.) I edited the answer to just call ¼ a "character". – TOOGAM Feb 08 '23 at 09:11
  • When you say "high ASCII" do you mean latin-1? – mousetail Feb 08 '23 at 09:12
  • The term "high ASCII" refers to any character above 127. This may be a term that originated from Microsoft-published MS-DOS manuals (though I'm not certain about that). Latin-1 is the name for a Unicode block, and is a term often used to describe the Windows-1252 code page. On my browser, ASCII 174 looks like Code Page 437's ASCII 174, which looks like a "1/4" symbol. CP-1252 would render ASCII 174 as the "Registered Trademark" sign ®. Sources: https://en.wikipedia.org/wiki/ISO/IEC_8859-1 https://en.wikipedia.org/wiki/Code_page_437 https://www.w3schools.com/charsets/ref_html_ansi.asp – TOOGAM Feb 08 '23 at 09:24
  • 2
    Sorry it was a very minor nitpick, your answer is good otherwise – mousetail Feb 08 '23 at 09:29

Compression works using data structures. You say that, for example:

8888123

Is compressed as:

48123

But that's not how RLE (or any compression algorithm) works. Instead, the output from your example would look more like this:

[4, 8] [1, 1] [1, 2] [1, 3]

We now have four output symbols, but we need to make some assumptions about our input data to create a system where some symbols are favored at the expense of other symbols.

Let's say that we define the output symbols as:

struct symbol {
  unsigned int length: 4;  /* run length minus one: 0-15 encodes runs of 1-16 */
  unsigned int value: 4;   /* the repeated value, assumed to fit in 0-15 */
};

Where we assume that each input value is a value from 0-9 (technically, 0-15), and the maximum length per symbol is 16 (length + 1). Armed with these assumptions, we can then write the output as:

[0011 1000] [0000 0001] [0000 0010] [0000 0011]

Here, by declaring that all values are input as numbers, we can say that we've turned 7 bytes into 4. By constraining the inputs to certain ranges, we improve the compression for most use cases.
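
A quick Python sketch of that packing (pack_runs is just an illustrative helper name, following the struct layout above with the length stored as length minus one):

def pack_runs(digits: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(digits):
        j = i
        while j < len(digits) and digits[j] == digits[i] and j - i < 16:
            j += 1                 # extend the current run, up to 16 symbols
        run_length = j - i         # 1..16
        value = int(digits[i])     # assumed to fit in 0..15
        out.append(((run_length - 1) << 4) | value)
        i = j
    return bytes(out)
packed = pack_runs("8888123")
print(len(packed), [format(b, "08b") for b in packed])
# 4 ['00111000', '00000001', '00000010', '00000011']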

Alternatively, consider the concept of limiting ourselves to ASCII codes from 0-127, and treating the value as an ASCII string. We now have a setup where we can use the MSB to indicate if this is a literal value, or a repeating value.

To encode 8888123, we output the data as:

[1000 0011] // Run indicator, length 4
[0011 1000] // value '8'
[0011 0001] // No run indicator, value '1'
[0011 0010] // No run indicator, value '2'
[0011 0011] // No run indicator, value '3'

We can also encode values between 128-255 using a longer code:

[1000 0000] // Run indicator, length 1
[1010 1010] // value 170

This is suboptimal if we end up with a lot of these higher characters, but our RLE scheme is optimized for repeating characters between codes 0 and 127. As long as that assumption holds true, we win.
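
Here is a rough Python sketch of that run-indicator scheme (a sketch of the idea described above, not a reference implementation): bytes below 128 pass through as literals, and a byte with the MSB set carries the run length, followed by the repeated value.

def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 128:
            j += 1
        run = j - i
        if run > 1 or data[i] >= 128:
            out.append(0x80 | (run - 1))  # run indicator, low 7 bits = length - 1
            out.append(data[i])           # the value being repeated
        else:
            out.append(data[i])           # a single literal byte in the 0-127 range
        i = j
    return bytes(out)
print([format(b, "08b") for b in rle_encode(b"8888123")])
# ['10000011', '00111000', '00110001', '00110010', '00110011']

Decoding just checks the top bit of each byte: if it is clear, copy the byte through as a literal; if it is set, the low seven bits plus one give the run length and the next byte is the value to repeat.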

For any lossless algorithm, we have to decide which attributes we are looking for. PNG, for example, favors images that have particular gradients and patterns. Random noise will not compress well, but most images we want to store are not literally random noise. They have patterns, and those patterns can be represented in smaller states.

To see this in action, try compressing a bitmap of some uniform data (e.g. a simple web graphic), a PNG, and an executable file into a ZIP file. The bitmap will likely gain the greatest compression ratio, the PNG a very small compression ratio (especially if it is a PNG of the bitmap), and the executable will most likely be "stored" rather than compressed at all. Most lossless compression algorithms use some variation of finding patterns and expressing those in a compact form.

GIF remembers previously unseen strings of values to build a dictionary as it encodes the data; the decoder then rebuilds the same dictionary by reading the stream the encoder produced. PNG works primarily by combining LZ77 and Huffman coding (essentially the same DEFLATE compression that ZIP uses), and can also apply a filter (e.g. a delta for each row) to improve performance even further. This algorithm builds a dictionary with a sliding window, finds patterns, and expresses each pattern as an offset into the sliding window instead of repeating all the bytes.

All of these assumptions built into the various formats assume that we want to encode useful images, such as real-life images or QR codes, and not randomly selected pixels. The algorithm is chosen so that shorter codes represent more useful/common patterns, and longer codes for useless/less common patterns. As a result, most algorithms can compress the data they are designed for efficiently, at the cost of not being able to efficiently encode useless information.

Note: Yes, I oversimplified some of the algorithms, as this answer was already quite long. Interested readers can just search for specific keywords or specifications.

phyrfox