Commit | Line | Data |
---|---|---|
3013fe88 NW |
1 | 1. Algorithm |
2 | ||
3 | The deflation algorithm used by zip and gzip is a variation of LZ77 | |
4 | (Lempel-Ziv 1977, see reference below). It finds duplicated strings in | |
5 | the input data. The second occurrence of a string is replaced by a | |
6 | pointer to the previous string, in the form of a pair (distance, | |
7 | length). Distances are limited to 32K bytes, and lengths are limited | |
8 | to 258 bytes. When a string does not occur anywhere in the previous | |
9 | 32K bytes, it is emitted as a sequence of literal bytes. (In this | |
10 | description, 'string' must be taken as an arbitrary sequence of bytes, | |
11 | and is not restricted to printable characters.) | |
12 | ||
13 | Literals or match lengths are compressed with one Huffman tree, and | |
14 | match distances are compressed with another tree. The trees are stored | |
15 | in a compact form at the start of each block. The blocks can have any | |
16 | size (except that the compressed data for one block must fit in | |
17 | available memory). A block is terminated when zip determines that it | |
18 | would be useful to start another block with fresh trees. (This is | |
19 | somewhat similar to compress.) | |
20 | ||
21 | Duplicated strings are found using a hash table. All input strings of | |
22 | length 3 are inserted in the hash table. A hash index is computed for | |
23 | the next 3 bytes. If the hash chain for this index is not empty, all | |
24 | strings in the chain are compared with the current input string, and | |
25 | the longest match is selected. | |
26 | ||
27 | The hash chains are searched starting with the most recent strings, to | |
28 | favor small distances and thus take advantage of the Huffman encoding. | |
29 | The hash chains are singly linked. There are no deletions from the | |
30 | hash chains, the algorithm simply discards matches that are too old. | |
31 | ||
32 | To avoid a worst-case situation, very long hash chains are arbitrarily | |
33 | truncated at a certain length, determined by a runtime option (zip -1 | |
34 | to -9). So zip does not always find the longest possible match but | |
35 | generally finds a match which is long enough. | |
36 | ||
37 | zip also defers the selection of matches with a lazy evaluation | |
38 | mechanism. After a match of length N has been found, zip searches for a | |
39 | longer match at the next input byte. If a longer match is found, the | |
40 | previous match is truncated to a length of one (thus producing a single | |
41 | literal byte) and the longer match is emitted afterwards. Otherwise, | |
42 | the original match is kept, and the next match search is attempted only | |
43 | N steps later. | |
44 | ||
45 | The lazy match evaluation is also subject to a runtime parameter. If | |
46 | the current match is long enough, zip reduces the search for a longer | |
47 | match, thus speeding up the whole process. If compression ratio is more | |
48 | important than speed, zip attempts a complete second search even if | |
49 | the first match is already long enough. | |
50 | ||
caed0dfe NW |
51 | The lazy match evaluation is no performed for the fastest compression |
52 | modes (speed options -1 to -3). For these fast modes, new strings | |
53 | are inserted in the hash table only when no match was found, or | |
54 | when the match is not too long. This degrades the compression ratio | |
55 | but saves time since there are both fewer insertions and fewer searches. | |
56 | ||
3013fe88 NW |
57 | |
58 | 2. gzip file format | |
59 | ||
60 | The pkzip format imposes a lot of overhead in various headers, which | |
61 | are useful for an archiver but not necessary when only one file is | |
62 | compressed. gzip uses a much simpler structure. Numbers are in little | |
63 | endian format, and bit 0 is the least significant bit. | |
64 | A gzip file is a sequence of compressed members. Each member has the | |
65 | following structure: | |
66 | ||
67 | 2 bytes magic header 0x1f, 0x8b (\037 \213) | |
68 | 1 byte compression method (0..7 reserved, 8 = deflate) | |
69 | 1 byte flags | |
70 | bit 0 set: file probably ascii text | |
71 | bit 1 set: continuation of multi-part gzip file | |
72 | bit 2 set: extra field present | |
73 | bit 3 set: original file name present | |
74 | bit 4 set: file comment present | |
75 | bit 5 set: file is encrypted | |
76 | bit 6,7: reserved | |
77 | 4 bytes file modification time in Unix format | |
78 | 1 byte extra flags (depend on compression method) | |
79 | 1 byte operating system on which compression took place | |
80 | ||
81 | 2 bytes optional part number (second part=1) | |
82 | 2 bytes optional extra field length | |
83 | ? bytes optional extra field | |
84 | ? bytes optional original file name, zero terminated | |
85 | ? bytes optional file comment, zero terminated | |
86 | 12 bytes optional encryption header | |
87 | ? bytes compressed data | |
88 | 4 bytes crc32 | |
89 | 4 bytes uncompressed input size modulo 2^32 | |
90 | ||
91 | The format was designed to allow single pass compression without any | |
92 | backwards seek, and without a priori knowledge of the uncompressed | |
93 | input size or the available size on the output media. If input does | |
94 | not come from a regular disk file, the file modification time is set | |
95 | to the time at which compression started. | |
96 | ||
97 | The time stamp is useful mainly when one gzip file is transferred over | |
98 | a network. In this case it would not help to keep ownership | |
99 | attributes. In the local case, the ownership attributes are preserved | |
100 | by gzip when compressing/decompressing the file. A time stamp of zero | |
101 | is ignored. | |
102 | ||
103 | Bit 0 in the flags is only an optional indication, which can be set by | |
104 | a small lookahead in the input data. In case of doubt, the flag is | |
105 | cleared indicating binary data. For systems which have different | |
106 | file formats for ascii text and binary data, the decompressor can | |
107 | use the flag to choose the appropriate format. | |
108 | ||
caed0dfe NW |
109 | The extra field, if present, must consist of one or more subfields, |
110 | each with the following format: | |
111 | ||
112 | subfield id : 2 bytes | |
113 | subfield size : 2 bytes (little-endian format) | |
114 | subfield data | |
115 | ||
116 | The subfield id can consist of two letters with some mnemonic value. | |
117 | Please send any such id to jloup@chorus.fr. Ids with a zero second | |
118 | byte are reserved for future use. The following ids are defined: | |
119 | ||
120 | Ap (0x41, 0x70) : Apollo file type information | |
121 | ||
122 | The subfield size is the size of the subfield data and does not | |
123 | include the id and the size itself. The field 'extra field length' is | |
124 | the total size of the extra field, including subfield ids and sizes. | |
125 | ||
3013fe88 NW |
126 | It must be possible to detect the end of the compressed data with any |
127 | compression format, regardless of the actual size of the compressed | |
128 | data. If the compressed data cannot fit in one file (in particular for | |
129 | diskettes), each part starts with a header as described above, but | |
130 | only the last part has the crc32 and uncompressed size. A decompressor | |
131 | may prompt for additional data for multipart compressed files. It is | |
132 | desirable but not mandatory that multiple parts be extractable | |
133 | independently so that partial data can be recovered if one of the | |
134 | parts is damaged. This is possible only if no compression state is | |
135 | kept from one part to the other. The compression-type dependent flags | |
136 | can indicate this. | |
137 | ||
138 | If the file being compressed is on a file system with case insensitive | |
139 | names, the original name field must be forced to lower case. There is | |
140 | no original file name if the data was compressed from standard input. | |
141 | ||
142 | Compression is always performed, even if the compressed file is | |
143 | slightly larger than the original. The worst case expansion is | |
144 | a few bytes for the gzip file header, plus 5 bytes every 32K block, | |
145 | or an expansion ratio of 0.015% for large files. Note that the actual | |
146 | number of used disk blocks almost never increases. | |
147 | ||
148 | The encryption is that of zip 1.9. For the encryption check, the | |
149 | last byte of the decoded encryption header must be zero. The time | |
150 | stamp of an encrypted file might be set to zero to avoid giving a clue | |
151 | about the construction of the random header. | |
152 | ||
153 | Jean-loup Gailly | |
154 | jloup@chorus.fr | |
155 | ||
156 | References: | |
157 | ||
158 | [LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data | |
159 | Compression", IEEE Transactions on Information Theory", Vol. 23, No. 3, | |
160 | pp. 337-343. | |
161 | ||
162 | APPNOTE.TXT documentation file in PKZIP 1.93a. It is available by | |
163 | ftp in ftp.cso.uiuc.edu:/pc/exec-pc/pkz193a.exe [128.174.5.59] | |
164 | Use "unzip pkz193a.exe APPNOTE.TXT" to extract. |