
Today I came across a weird case for which I have no explanation, so here I am.

I have two files with identical content, but one is encoded in UTF-8 and the other one is in IBM EBCDIC. Both of them have approximately the same size (336MB and 335MB).

But if I compress the files (ZIP or RAR), one is reduced to 26MB (UTF-8) and the other to 16MB (EBCDIC).

How can this be possible? Does EBCDIC always behave better under compression? Why?

rodripf
  • Practically, EBCDIC only uses 7 bits out of 8 in a byte. That alone could explain the compression difference. – Robert Harvey Nov 21 '19 at 18:38
  • @Robert I stole that information to improve my answer. I hope you are OK with that. – πάντα ῥεῖ Nov 21 '19 at 18:41
  • @RobertHarvey For people using languages other than English, EBCDIC uses the full 8-bit range, as far as I know, depending on the [code pages](https://en.wikipedia.org/wiki/EBCDIC_500) ... – Christophe Nov 21 '19 at 19:34
  • @Christophe: Which would only apply to EBCDIC files that are not in English, probably an unlikely scenario. – Robert Harvey Nov 21 '19 at 19:36
  • @RobertHarvey Well, according to OP's data, with at least 1MB of multibyte characters in the file, there are certainly some non-English characters. OP is from Uruguay, where ñ, Ñ, ú and other non-ASCII characters are common. By the way, could you explain why you think a non-English EBCDIC file is improbable? – Christophe Nov 21 '19 at 19:48
  • @Christophe: A *few* non-English characters won't materially affect the compression characteristics. – Robert Harvey Nov 21 '19 at 19:50
  • @RobertHarvey And this is exactly what makes this question very interesting. How could 0.3% of non-English characters explain a 60% difference? The English characters are encoded on 7 bits in both cases. – Christophe Nov 21 '19 at 20:02
  • @Christophe: It doesn't. The other 99.7 percent of 7-bit English characters explains the difference. – Robert Harvey Nov 21 '19 at 20:05
  • @RobertHarvey 99.7% of the characters are English **in both files**. The question is why the EBCDIC file compresses 60% better than the UTF-8 one, when only 0.3% of the UTF-8 characters are multibyte (and effectively use the 8th bit). – Christophe Nov 21 '19 at 20:08
  • @Christophe: Fundamentally, compression merely reduces redundancies in the data. Practically, this particular question doesn't contain enough information to be answerable. Zip uses Shrink, Reduce (levels 1-4), Implode, Deflate, Deflate64, bzip2, LZMA (EFS), WavPack, and PPMd algorithms to compress data; at a minimum, we would need to know which algorithm is in use, and whether Zip chose a different algorithm for each compression exercise. – Robert Harvey Nov 21 '19 at 20:13
  • Then you would have to evaluate the algorithm in use against the data being compressed to see what is happening under the hood. Not exactly a trivial exercise. – Robert Harvey Nov 21 '19 at 20:14
  • @RobertHarvey Indeed! Here I can agree with you. ZIP probably chooses a different algorithm for each file, given the very different statistical distributions of the two files. Different algorithms would explain the huge difference. And I agree, more data is needed to say for sure. – Christophe Nov 21 '19 at 21:12
  • Sadly the data I am testing right now is confidential, so I cannot share it. I'll try to generate some random data with the same characteristics and share it! – rodripf Nov 22 '19 at 19:45

2 Answers


First thoughts

EBCDIC is an 8-bit encoding. The basic EBCDIC character set has plenty of unused space, so not all the potential of the 8 bits is used. But it has code pages for handling non-English languages, such as EBCDIC 284 for Latin America.

UTF-8, on the other hand, is a multibyte encoding. The English character set is encoded in a single byte using 7 useful bits. Other Unicode characters may need up to 4 bytes.

The first thought that comes to mind is that UTF-8 uses multibyte sequences, and that this could explain the difference in compression.

Deeper analysis

The conversion between EBCDIC code pages and Unicode maps to the first 256 code points. This means that re-encoding an EBCDIC file that uses a local code page would produce some multibyte characters in UTF-8, but each multibyte sequence would be only 2 bytes long.
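
A quick illustration (a sketch using Python's built-in codecs; cp500, i.e. EBCDIC 500, stands in here for whatever EBCDIC code page the file actually uses):

```python
# Characters common in Spanish text take 2 bytes in UTF-8 but only
# 1 byte in a Latin-1-compatible EBCDIC code page such as cp500.
for ch in ("a", "ñ", "Ñ", "ú"):
    print(ch,
          "UTF-8:", len(ch.encode("utf-8")), "byte(s),",
          "EBCDIC cp500:", len(ch.encode("cp500")), "byte(s)")
```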

Your data shows that the UTF-8 file is 336MB and the EBCDIC file is 335MB. This means that at most 1M characters use a multibyte encoding in UTF-8 (exactly 1M if each takes only 2 bytes, fewer if more bytes are needed), while roughly 334MB are single-byte English characters. So the huge difference in compression (60%) cannot be explained merely by a few multibyte characters.

Furthermore, compression algorithms statistically detect and eliminate redundancies; ideally they compress the data down to the minimum number of bits needed to represent its information. Since the useful information in both files is the same, independently of the encoding, the difference in compressed size should be minimal.
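
As a rough sanity check of that argument, one could compare the order-0 byte entropy of the two files. This is only a sketch (the file names are placeholders), and it ignores the repeated-substring redundancy that Deflate-style compressors also exploit:

```python
import math
from collections import Counter

def byte_entropy(path):
    """Order-0 Shannon entropy of a file, in bits per byte."""
    counts = Counter()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            counts.update(chunk)
            total += len(chunk)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical file names; a large gap here would point to genuinely
# different statistical profiles between the two encodings.
print(byte_entropy("data_utf8.txt"), byte_entropy("data_ebcdic.txt"))
```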

Conclusions

By deduction, the different encodings cannot explain such a huge difference in compression if an optimal algorithm is used.

According to Wikipedia, ZIP files may use the following compression schemes:

Store (no compression), Shrink, Reduce (levels 1-4), Implode, Deflate, Deflate64, bzip2, LZMA (EFS), WavPack, and PPMd

For performance reasons, ZIP does not try all of these compression schemes to determine the absolute best one. It uses heuristics to choose the most suitable algorithm. The very different statistical distributions of EBCDIC and UTF-8 could explain why the heuristic chooses different algorithms.

So, with the given information, it seems that ZIP uses a suboptimal compression scheme, and not necessarily the same one for both files.

But this assumption needs experimental verification. So please check the compression algorithm used in the ZIP files (as explained here on SO). If it's different, you have your answer. If it's the same, please update your question and tell us which one it is, so that we can re-investigate with more relevant information.
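
For the ZIP case, Python's standard zipfile module can report the compression method stored for each entry. A minimal sketch, assuming two hypothetical archive names:

```python
import zipfile

METHODS = {
    zipfile.ZIP_STORED: "stored",
    zipfile.ZIP_DEFLATED: "deflate",
    zipfile.ZIP_BZIP2: "bzip2",
    zipfile.ZIP_LZMA: "lzma",
}

for archive in ("utf8.zip", "ebcdic.zip"):  # placeholder names
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            method = METHODS.get(info.compress_type, f"other ({info.compress_type})")
            ratio = info.compress_size / info.file_size if info.file_size else 0.0
            print(f"{archive}: {info.filename}: {method}, "
                  f"{info.file_size} -> {info.compress_size} bytes ({ratio:.1%})")
```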

Christophe

This is an extraordinary claim: one file of 336MB and one of 335MB, containing the same data, with one compressing to 26MB and the other to 16MB.

That's a compression factor of 22 in one case. So either your compression software is very much broken and throws away 75% of the input, or you are compressing some very unusual text.

I wouldn't bother making any guesses about what is going on here until I'm told exactly what the contents of these files are.

gnasher729
  • Almost any compression factor is achievable if the text consists of repeating patterns of text. – Steve Nov 21 '19 at 20:52
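
As a quick illustration of Steve's point, here is a sketch with Python's zlib (the same Deflate algorithm ZIP commonly uses); the repeated line is arbitrary sample text:

```python
import zlib

# A few dozen megabytes of a repeated line; Deflate shrinks this to a
# tiny fraction of the input, so very high compression ratios are not
# suspicious by themselves.
data = b"2019-11-21 18:38:00 status=OK code=200\n" * 1_000_000
packed = zlib.compress(data, level=9)
print(f"{len(data)} -> {len(packed)} bytes, factor {len(data) / len(packed):.0f}x")
```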