I am writing a Python program that parses zip files (currently only those using zlib/DEFLATE compression) and verifies the correctness of their headers and data. One of the things I'm trying to achieve is calculating the uncompressed size of a compressed (DEFLATE-d) file inside a zip archive without actually uncompressing the file and, obviously, without relying on the uncompressed size field found in the file record's headers. The goal is to ensure that none of the zip record's fields have been tampered with (in this case, the uncompressed size field).

I've gone through the ZIP specification (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) over and over, but I'm drawing a blank and don't see any way to do this without completely parsing the Huffman trees and calculating the corresponding stream size, which is exactly what I want to avoid. I would appreciate any idea or direction on how to do this.

To clarify, I'm not looking for a library/module to do this for me, but rather a direction on how it can be done.

Many thanks.

S B
  • Read it more carefully. Just about every structure that describes a file has its uncompressed size. – Blrfl Apr 06 '15 at 22:04
  • @Blrfl Of course, but I'm intentionally not relying on that field - I want to calculate it myself and compare the result to the given uncompressed size (this can be an indicator of an invalid zip archive). – S B Apr 06 '15 at 22:11
  • This might help: http://stackoverflow.com/questions/10908877/extracting-a-zipfile-to-memory I doubt you can do this without uncompressing it to some place. – Gort the Robot Apr 06 '15 at 22:22
  • @SB: You are saying that the compressor is **untrusted**. The only reasonable way to ensure data integrity (of which non-truncation is the one you're looking for) is to compute the cryptographic checksums both before and after compression. This can only be done by the person doing the compression. Otherwise, trial-decompression is the only way you can verify non-truncation. – rwong Apr 06 '15 at 22:35
  • In other words, the person doing the compression must **claim** the values of checksums before and after compression, and you must **verify** those facts. Furthermore, if you do not **trust** the correctness of the decompressor code (i.e. it might harbor some defects that lead to arbitrary code execution), you simply can't do anything other than refusing any compressed data you don't trust. – rwong Apr 06 '15 at 22:37
  • @rwong Indeed. The digital signature check is done at a later stage; what I'm trying to avoid (or detect) here is a compressor exploiting a vulnerability in the decompression code, thus all the checks before the decompression itself. – S B Apr 06 '15 at 22:38
  • If your question is indeed security-related, please (1) first, read a lot of articles to get an overall sense of it, (2) ask at security.stackexchange – rwong Apr 06 '15 at 22:38
  • @rwong I have read a lot of articles :) This is why I tried posting here instead of at security.se - I have a sense of the security-related issues here, but am having difficulty with the programming issue(s). – S B Apr 06 '15 at 22:41
  • I do not think it is possible. Consider that adding a single byte to the uncompressed stream could either alter the compressed size or not alter it (if it was covered by an RLE section, for instance). The only way to find the uncompressed size is to uncompress -- even if you discard every byte as it is uncompressed, as `gzip --test` does. – Ross Presser Apr 10 '15 at 03:32
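
For reference, the "uncompress but discard the output" approach from the last comment might look roughly like the sketch below. It assumes Python's standard `zlib` module and that the record's compressed bytes have already been located and are supplied in chunks; `count_deflate_output`, `limit`, and `step` are hypothetical names chosen for illustration. Note that this still runs the decompressor on the data, so it does not address the concern about a vulnerable decompressor; it only bounds memory use and refuses streams that expand beyond a chosen limit.

```python
import zlib


def count_deflate_output(chunks, limit=1 << 30, step=64 * 1024):
    """Count the bytes a raw DEFLATE stream expands to, discarding the output.

    `chunks` is any iterable of bytes objects holding the compressed data.
    Returns (uncompressed_size, crc32). Raises ValueError if the output
    would exceed `limit` or the stream ends prematurely.
    """
    # Negative wbits selects raw DEFLATE (no zlib or gzip header), which is
    # how DEFLATE data is stored inside zip records.
    decomp = zlib.decompressobj(-zlib.MAX_WBITS)
    total = 0
    crc = 0
    for chunk in chunks:
        data = chunk
        while data and not decomp.eof:
            # max_length caps how much output a single call may produce,
            # so memory stays bounded even for highly compressible input.
            out = decomp.decompress(data, step)
            total += len(out)
            if total > limit:
                raise ValueError("uncompressed size exceeds limit")
            crc = zlib.crc32(out, crc)
            data = decomp.unconsumed_tail
    # Collect any output still pending inside the decompressor.
    out = decomp.flush()
    total += len(out)
    if total > limit:
        raise ValueError("uncompressed size exceeds limit")
    crc = zlib.crc32(out, crc)
    if not decomp.eof:
        raise ValueError("DEFLATE stream is truncated")
    return total, crc
```

The returned size and CRC-32 can then be compared against the values claimed in the record's headers; a mismatch, or a `ValueError`, would mark the archive as suspect.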
