This commit was generated by cvs2svn to track changes on a CVS vendor
[unix-history] / gnu / usr.bin / gzip / algorithm.doc
CommitLineData
3013fe88
NW
11. Algorithm
2
3The deflation algorithm used by zip and gzip is a variation of LZ77
4(Lempel-Ziv 1977, see reference below). It finds duplicated strings in
5the input data. The second occurrence of a string is replaced by a
6pointer to the previous string, in the form of a pair (distance,
7length). Distances are limited to 32K bytes, and lengths are limited
8to 258 bytes. When a string does not occur anywhere in the previous
932K bytes, it is emitted as a sequence of literal bytes. (In this
10description, 'string' must be taken as an arbitrary sequence of bytes,
11and is not restricted to printable characters.)
12
13Literals or match lengths are compressed with one Huffman tree, and
14match distances are compressed with another tree. The trees are stored
15in a compact form at the start of each block. The blocks can have any
16size (except that the compressed data for one block must fit in
17available memory). A block is terminated when zip determines that it
18would be useful to start another block with fresh trees. (This is
19somewhat similar to compress.)
20
21Duplicated strings are found using a hash table. All input strings of
22length 3 are inserted in the hash table. A hash index is computed for
23the next 3 bytes. If the hash chain for this index is not empty, all
24strings in the chain are compared with the current input string, and
25the longest match is selected.
26
27The hash chains are searched starting with the most recent strings, to
28favor small distances and thus take advantage of the Huffman encoding.
29The hash chains are singly linked. There are no deletions from the
30hash chains, the algorithm simply discards matches that are too old.
31
32To avoid a worst-case situation, very long hash chains are arbitrarily
33truncated at a certain length, determined by a runtime option (zip -1
34to -9). So zip does not always find the longest possible match but
35generally finds a match which is long enough.
36
37zip also defers the selection of matches with a lazy evaluation
38mechanism. After a match of length N has been found, zip searches for a
39longer match at the next input byte. If a longer match is found, the
40previous match is truncated to a length of one (thus producing a single
41literal byte) and the longer match is emitted afterwards. Otherwise,
42the original match is kept, and the next match search is attempted only
43N steps later.
44
45The lazy match evaluation is also subject to a runtime parameter. If
46the current match is long enough, zip reduces the search for a longer
47match, thus speeding up the whole process. If compression ratio is more
48important than speed, zip attempts a complete second search even if
49the first match is already long enough.
50
caed0dfe
NW
51The lazy match evaluation is no performed for the fastest compression
52modes (speed options -1 to -3). For these fast modes, new strings
53are inserted in the hash table only when no match was found, or
54when the match is not too long. This degrades the compression ratio
55but saves time since there are both fewer insertions and fewer searches.
56
3013fe88
NW
57
582. gzip file format
59
60The pkzip format imposes a lot of overhead in various headers, which
61are useful for an archiver but not necessary when only one file is
62compressed. gzip uses a much simpler structure. Numbers are in little
63endian format, and bit 0 is the least significant bit.
64A gzip file is a sequence of compressed members. Each member has the
65following structure:
66
672 bytes magic header 0x1f, 0x8b (\037 \213)
681 byte compression method (0..7 reserved, 8 = deflate)
691 byte flags
70 bit 0 set: file probably ascii text
71 bit 1 set: continuation of multi-part gzip file
72 bit 2 set: extra field present
73 bit 3 set: original file name present
74 bit 4 set: file comment present
75 bit 5 set: file is encrypted
76 bit 6,7: reserved
774 bytes file modification time in Unix format
781 byte extra flags (depend on compression method)
791 byte operating system on which compression took place
80
812 bytes optional part number (second part=1)
822 bytes optional extra field length
83? bytes optional extra field
84? bytes optional original file name, zero terminated
85? bytes optional file comment, zero terminated
8612 bytes optional encryption header
87? bytes compressed data
884 bytes crc32
894 bytes uncompressed input size modulo 2^32
90
91The format was designed to allow single pass compression without any
92backwards seek, and without a priori knowledge of the uncompressed
93input size or the available size on the output media. If input does
94not come from a regular disk file, the file modification time is set
95to the time at which compression started.
96
97The time stamp is useful mainly when one gzip file is transferred over
98a network. In this case it would not help to keep ownership
99attributes. In the local case, the ownership attributes are preserved
100by gzip when compressing/decompressing the file. A time stamp of zero
101is ignored.
102
103Bit 0 in the flags is only an optional indication, which can be set by
104a small lookahead in the input data. In case of doubt, the flag is
105cleared indicating binary data. For systems which have different
106file formats for ascii text and binary data, the decompressor can
107use the flag to choose the appropriate format.
108
caed0dfe
NW
109The extra field, if present, must consist of one or more subfields,
110each with the following format:
111
112 subfield id : 2 bytes
113 subfield size : 2 bytes (little-endian format)
114 subfield data
115
116The subfield id can consist of two letters with some mnemonic value.
117Please send any such id to jloup@chorus.fr. Ids with a zero second
118byte are reserved for future use. The following ids are defined:
119
120 Ap (0x41, 0x70) : Apollo file type information
121
122The subfield size is the size of the subfield data and does not
123include the id and the size itself. The field 'extra field length' is
124the total size of the extra field, including subfield ids and sizes.
125
3013fe88
NW
126It must be possible to detect the end of the compressed data with any
127compression format, regardless of the actual size of the compressed
128data. If the compressed data cannot fit in one file (in particular for
129diskettes), each part starts with a header as described above, but
130only the last part has the crc32 and uncompressed size. A decompressor
131may prompt for additional data for multipart compressed files. It is
132desirable but not mandatory that multiple parts be extractable
133independently so that partial data can be recovered if one of the
134parts is damaged. This is possible only if no compression state is
135kept from one part to the other. The compression-type dependent flags
136can indicate this.
137
138If the file being compressed is on a file system with case insensitive
139names, the original name field must be forced to lower case. There is
140no original file name if the data was compressed from standard input.
141
142Compression is always performed, even if the compressed file is
143slightly larger than the original. The worst case expansion is
144a few bytes for the gzip file header, plus 5 bytes every 32K block,
145or an expansion ratio of 0.015% for large files. Note that the actual
146number of used disk blocks almost never increases.
147
148The encryption is that of zip 1.9. For the encryption check, the
149last byte of the decoded encryption header must be zero. The time
150stamp of an encrypted file might be set to zero to avoid giving a clue
151about the construction of the random header.
152
153Jean-loup Gailly
154jloup@chorus.fr
155
156References:
157
158[LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
159Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3,
160pp. 337-343.
161
162APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by
163ftp in ftp.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59]
164Use "unzip pkz193a.exe APPNOTE.TXT" to extract.