149 lines
6.9 KiB
Plaintext
149 lines
6.9 KiB
Plaintext
1. Algorithm
|
|
|
|
The deflation algorithm used by zip and gzip is a variation of LZ77
|
|
(Lempel-Ziv 1977, see reference below). It finds duplicated strings in
|
|
the input data. The second occurrence of a string is replaced by a
|
|
pointer to the previous string, in the form of a pair (distance,
|
|
length). Distances are limited to 32K bytes, and lengths are limited
|
|
to 258 bytes. When a string does not occur anywhere in the previous
|
|
32K bytes, it is emitted as a sequence of literal bytes. (In this
|
|
description, 'string' must be taken as an arbitrary sequence of bytes,
|
|
and is not restricted to printable characters.)
|
|
|
|
Literals or match lengths are compressed with one Huffman tree, and
|
|
match distances are compressed with another tree. The trees are stored
|
|
in a compact form at the start of each block. The blocks can have any
|
|
size (except that the compressed data for one block must fit in
|
|
available memory). A block is terminated when zip determines that it
|
|
would be useful to start another block with fresh trees. (This is
|
|
somewhat similar to compress.)
|
|
|
|
Duplicated strings are found using a hash table. All input strings of
|
|
length 3 are inserted in the hash table. A hash index is computed for
|
|
the next 3 bytes. If the hash chain for this index is not empty, all
|
|
strings in the chain are compared with the current input string, and
|
|
the longest match is selected.
|
|
|
|
The hash chains are searched starting with the most recent strings, to
|
|
favor small distances and thus take advantage of the Huffman encoding.
|
|
The hash chains are singly linked. There are no deletions from the
|
|
hash chains, the algorithm simply discards matches that are too old.
|
|
|
|
To avoid a worst-case situation, very long hash chains are arbitrarily
|
|
truncated at a certain length, determined by a runtime option (zip -1
|
|
to -9). So zip does not always find the longest possible match but
|
|
generally finds a match which is long enough.
|
|
|
|
zip also defers the selection of matches with a lazy evaluation
|
|
mechanism. After a match of length N has been found, zip searches for a
|
|
longer match at the next input byte. If a longer match is found, the
|
|
previous match is truncated to a length of one (thus producing a single
|
|
literal byte) and the longer match is emitted afterwards. Otherwise,
|
|
the original match is kept, and the next match search is attempted only
|
|
N steps later.
|
|
|
|
The lazy match evaluation is also subject to a runtime parameter. If
|
|
the current match is long enough, zip reduces the search for a longer
|
|
match, thus speeding up the whole process. If compression ratio is more
|
|
important than speed, zip attempts a complete second search even if
|
|
the first match is already long enough.
|
|
|
|
|
|
2. gzip file format
|
|
|
|
The pkzip format imposes a lot of overhead in various headers, which
|
|
are useful for an archiver but not necessary when only one file is
|
|
compressed. gzip uses a much simpler structure. Numbers are in little
|
|
endian format, and bit 0 is the least significant bit.
|
|
A gzip file is a sequence of compressed members. Each member has the
|
|
following structure:
|
|
|
|
2 bytes magic header 0x1f, 0x8b (\037 \213)
|
|
1 byte compression method (0..7 reserved, 8 = deflate)
|
|
1 byte flags
|
|
bit 0 set: file probably ascii text
|
|
bit 1 set: continuation of multi-part gzip file
|
|
bit 2 set: extra field present
|
|
bit 3 set: original file name present
|
|
bit 4 set: file comment present
|
|
bit 5 set: file is encrypted
|
|
bit 6,7: reserved
|
|
4 bytes file modification time in Unix format
|
|
1 byte extra flags (depend on compression method)
|
|
1 byte operating system on which compression took place
|
|
|
|
2 bytes optional part number (second part=1)
|
|
2 bytes optional extra field length
|
|
? bytes optional extra field
|
|
? bytes optional original file name, zero terminated
|
|
? bytes optional file comment, zero terminated
|
|
12 bytes optional encryption header
|
|
? bytes compressed data
|
|
4 bytes crc32
|
|
4 bytes uncompressed input size modulo 2^32
|
|
|
|
The format was designed to allow single pass compression without any
|
|
backwards seek, and without a priori knowledge of the uncompressed
|
|
input size or the available size on the output media. If input does
|
|
not come from a regular disk file, the file modification time is set
|
|
to the time at which compression started.
|
|
|
|
The time stamp is useful mainly when one gzip file is transferred over
|
|
a network. In this case it would not help to keep ownership
|
|
attributes. In the local case, the ownership attributes are preserved
|
|
by gzip when compressing/decompressing the file. A time stamp of zero
|
|
is ignored.
|
|
|
|
Bit 0 in the flags is only an optional indication, which can be set by
|
|
a small lookahead in the input data. In case of doubt, the flag is
|
|
cleared indicating binary data. For systems which have different
|
|
file formats for ascii text and binary data, the decompressor can
|
|
use the flag to choose the appropriate format.
|
|
|
|
It must be possible to detect the end of the compressed data with any
|
|
compression format, regardless of the actual size of the compressed
|
|
data. If the compressed data cannot fit in one file (in particular for
|
|
diskettes), each part starts with a header as described above, but
|
|
only the last part has the crc32 and uncompressed size. A decompressor
|
|
may prompt for additional data for multipart compressed files. It is
|
|
desirable but not mandatory that multiple parts be extractable
|
|
independently so that partial data can be recovered if one of the
|
|
parts is damaged. This is possible only if no compression state is
|
|
kept from one part to the other. The compression-type dependent flags
|
|
can indicate this.
|
|
|
|
If the file being compressed is on a file system with case insensitive
|
|
names, the original name field must be forced to lower case. There is
|
|
no original file name if the data was compressed from standard input.
|
|
|
|
On operating systems which support multiple extensions, and on
|
|
original files without extension, the extension ".z" is added to the
|
|
original file name (foo.c => foo.c.z). Otherwise the string "z" (or
|
|
"-z" for VMS) is added to the original extension (foo.c => foo.cz or
|
|
foo.c-z). The original name or extension is truncated if necessary,
|
|
but in this case the original name is always saved in the compressed
|
|
file (foo.doc => foo.doz).
|
|
|
|
Compression is always performed, even if the compressed file is
|
|
slightly larger than the original. The worst case expansion is
|
|
a few bytes for the gzip file header, plus 5 bytes every 32K block,
|
|
or an expansion ratio of 0.015% for large files.
|
|
|
|
The encryption is that of zip 1.9. For the encryption check, the
|
|
last byte of the decoded encryption header must be zero. The time
|
|
stamp of an encrypted file might be set to zero to avoid giving a clue
|
|
about the construction of the random header.
|
|
|
|
Jean-loup Gailly
|
|
jloup@chorus.fr
|
|
|
|
References:
|
|
|
|
[LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
|
|
Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3,
|
|
pp. 337-343.
|
|
|
|
APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by
|
|
ftp in ux1.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59]
|
|
Use "unzip pkz193a.exe APPNOTE.TXT" to extract.
|