Network Working Group                                         P. Deutsch
Request for Comments: 1952                           Aladdin Enterprises
Category: Informational                                         May 1996


               GZIP file format specification version 4.3

Status of This Memo

   This memo provides information for the Internet community. This memo
   does not specify an Internet standard of any kind. Distribution of
   this memo is unlimited.

IESG Note:

   The IESG takes no position on the validity of any Intellectual
   Property Rights statements contained in this document.

Notices

   Copyright (c) 1996 L. Peter Deutsch

   Permission is granted to copy and distribute this document for any
   purpose and without charge, including translations into other
   languages and incorporation into compilations, provided that the
   copyright notice and this notice are preserved, and that any
   substantive changes or deletions from the original are clearly
   marked.

   A pointer to the latest version of this and related documentation in
   HTML format can be found at the URL
   <ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html>.

Abstract

   This specification defines a lossless compressed data format that is
   compatible with the widely used GZIP utility. The format includes a
   cyclic redundancy check value for detecting data corruption. The
   format presently uses the DEFLATE method of compression but can be
   easily extended to use other compression methods. The format can be
   implemented readily in a manner not covered by patents.

Deutsch                      Informational                      [Page 1]

RFC 1952             GZIP File Format Specification             May 1996

Table of Contents

   1. Introduction
      1.1. Purpose
      1.2. Intended audience
      1.3. Scope
      1.4. Compliance
      1.5. Definitions of terms and conventions used
      1.6. Changes from previous versions
   2. Detailed specification
      2.1. Overall conventions
      2.2. File format
      2.3. Member format
         2.3.1. Member header and trailer
            2.3.1.1. Extra field
            2.3.1.2. Compliance
   3. References
   4. Security Considerations
   5. Acknowledgements
   6. Author's Address
   7. Appendix: Jean-Loup Gailly's gzip utility
   8. Appendix: Sample CRC Code

1. Introduction

   1.1. Purpose

      The purpose of this specification is to define a lossless
      compressed data format that:

          * Is independent of CPU type, operating system, file system,
            and character set, and hence can be used for interchange;
          * Can compress or decompress a data stream (as opposed to a
            randomly accessible file) to produce another data stream,
            using only an a priori bounded amount of intermediate
            storage, and hence can be used in data communications or
            similar structures such as Unix filters;
          * Compresses data with efficiency comparable to the best
            currently available general-purpose compression methods,
            and in particular considerably better than the "compress"
            program;
          * Can be implemented readily in a manner not covered by
            patents, and hence can be practiced freely;
          * Is compatible with the file format produced by the current
            widely used gzip utility, in that conforming decompressors
            will be able to read data produced by the existing gzip
            compressor.

      The data format defined by this specification does not attempt to:

          * Provide random access to compressed data;
          * Compress specialized data (e.g., raster graphics) as well as
            the best currently available specialized algorithms.

   1.2. Intended audience

      This specification is intended for use by implementors of software
      to compress data into gzip format and/or decompress data from gzip
      format.

      The text of the specification assumes a basic background in
      programming at the level of bits and other primitive data
      representations.

   1.3. Scope

      The specification specifies a compression method and a file format
      (the latter assuming only that a file can store a sequence of
      arbitrary bytes).
      It does not specify any particular interface to
      a file system or anything about character sets or encodings
      (except for file names and comments, which are optional).

   1.4. Compliance

      Unless otherwise indicated below, a compliant decompressor must be
      able to accept and decompress any file that conforms to all the
      specifications presented here; a compliant compressor must produce
      files that conform to all the specifications presented here. The
      material in the appendices is not part of the specification per se
      and is not relevant to compliance.

   1.5. Definitions of terms and conventions used

      byte: 8 bits stored or transmitted as a unit (same as an octet).
      (For this specification, a byte is exactly 8 bits, even on
      machines which store a character on a number of bits different
      from 8.) See below for the numbering of bits within a byte.

   1.6. Changes from previous versions

      There have been no technical changes to the gzip format since
      version 4.1 of this specification. In version 4.2, some
      terminology was changed, and the sample CRC code was rewritten for
      clarity and to eliminate the requirement for the caller to do pre-
      and post-conditioning. Version 4.3 is a conversion of the
      specification to RFC style.

2. Detailed specification

   2.1. Overall conventions

      In the diagrams below, a box like this:

         +---+
         |   | <-- the vertical bars might be missing
         +---+

      represents one byte; a box like this:

         +==============+
         |              |
         +==============+

      represents a variable number of bytes.

      Bytes stored within a computer do not have a "bit order", since
      they are always treated as a unit.
      However, a byte considered as an integer between 0 and 255 does
      have a most- and least-significant bit, and since we write numbers
      with the most-significant digit on the left, we also write bytes
      with the most-significant bit on the left. In the diagrams below,
      we number the bits of a byte so that bit 0 is the
      least-significant bit, i.e., the bits are numbered:

         +--------+
         |76543210|
         +--------+

      This document does not address the issue of the order in which
      bits of a byte are transmitted on a bit-sequential medium, since
      the data format described here is byte- rather than bit-oriented.

      Within a computer, a number may occupy multiple bytes. All
      multi-byte numbers in the format described here are stored with
      the least-significant byte first (at the lower memory address).
      For example, the decimal number 520 is stored as:

             0        1
         +--------+--------+
         |00001000|00000010|
         +--------+--------+
          ^        ^
          |        |
          |        + more significant byte = 2 x 256
          + less significant byte = 8

   2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets). The format of each member is specified in the following
      section. The members simply appear one after another in the file,
      with no additional information before, between, or after them.

   2.3. Member format

      Each member has the following structure:

         +---+---+---+---+---+---+---+---+---+---+
         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
         +---+---+---+---+---+---+---+---+---+---+

      (if FLG.FEXTRA set)

         +---+---+=================================+
         | XLEN  |...XLEN bytes of "extra field"...| (more-->)
         +---+---+=================================+

      (if FLG.FNAME set)

         +=========================================+
         |...original file name, zero-terminated...| (more-->)
         +=========================================+

      (if FLG.FCOMMENT set)

         +===================================+
         |...file comment, zero-terminated...| (more-->)
         +===================================+

      (if FLG.FHCRC set)

         +---+---+
         | CRC16 |
         +---+---+

         +=======================+
         |...compressed blocks...| (more-->)
         +=======================+

           0   1   2   3   4   5   6   7
         +---+---+---+---+---+---+---+---+
         |     CRC32     |     ISIZE     |
         +---+---+---+---+---+---+---+---+

      2.3.1. Member header and trailer

         ID1 (IDentification 1)
         ID2 (IDentification 2)
            These have the fixed values ID1 = 31 (0x1f, \037), ID2 = 139
            (0x8b, \213), to identify the file as being in gzip format.

         CM (Compression Method)
            This identifies the compression method used in the file. CM
            = 0-7 are reserved. CM = 8 denotes the "deflate"
            compression method, which is the one customarily used by
            gzip and which is documented elsewhere.

         FLG (FLaGs)
            This flag byte is divided into individual bits as follows:

               bit 0   FTEXT
               bit 1   FHCRC
               bit 2   FEXTRA
               bit 3   FNAME
               bit 4   FCOMMENT
               bit 5   reserved
               bit 6   reserved
               bit 7   reserved

            If FTEXT is set, the file is probably ASCII text.
            This is an optional indication, which the compressor may set
            by checking a small amount of the input data to see whether
            any non-ASCII characters are present. In case of doubt, FTEXT
            is cleared, indicating binary data. For systems which have
            different file formats for ASCII text and binary data, the
            decompressor can use FTEXT to choose the appropriate format.
            We deliberately do not specify the algorithm used to set
            this bit, since a compressor always has the option of
            leaving it cleared and a decompressor always has the option
            of ignoring it and letting some other program handle issues
            of data conversion.

            If FHCRC is set, a CRC16 for the gzip header is present,
            immediately before the compressed data. The CRC16 consists
            of the two least significant bytes of the CRC32 for all
            bytes of the gzip header up to and not including the CRC16.
            [The FHCRC bit was never set by versions of gzip up to
            1.2.4, even though it was documented with a different
            meaning in gzip 1.2.4.]

            If FEXTRA is set, optional extra fields are present, as
            described in a following section.

            If FNAME is set, an original file name is present,
            terminated by a zero byte. The name must consist of ISO
            8859-1 (LATIN-1) characters; on operating systems using
            EBCDIC or any other character set for file names, the name
            must be translated to the ISO LATIN-1 character set. This
            is the original name of the file being compressed, with any
            directory components removed, and, if the file being
            compressed is on a file system with case insensitive names,
            forced to lower case. There is no original file name if the
            data was compressed from a source other than a named file;
            for example, if the source was stdin on a Unix system, there
            is no file name.

            If FCOMMENT is set, a zero-terminated file comment is
            present. This comment is not interpreted; it is only
            intended for human consumption. The comment must consist of
            ISO 8859-1 (LATIN-1) characters. Line breaks should be
            denoted by a single line feed character (10 decimal).

            Reserved FLG bits must be zero.

         MTIME (Modification TIME)
            This gives the most recent modification time of the original
            file being compressed. The time is in Unix format, i.e.,
            seconds since 00:00:00 GMT, Jan. 1, 1970. (Note that this
            may cause problems for MS-DOS and other systems that use
            local rather than Universal time.) If the compressed data
            did not come from a file, MTIME is set to the time at which
            compression started. MTIME = 0 means no time stamp is
            available.

         XFL (eXtra FLags)
            These flags are available for use by specific compression
            methods. The "deflate" method (CM = 8) sets these flags as
            follows:

               XFL = 2 - compressor used maximum compression,
                         slowest algorithm
               XFL = 4 - compressor used fastest algorithm

         OS (Operating System)
            This identifies the type of file system on which compression
            took place. This may be useful in determining end-of-line
            convention for text files. The currently defined values are
            as follows:

                 0 - FAT filesystem (MS-DOS, OS/2, NT/Win32)
                 1 - Amiga
                 2 - VMS (or OpenVMS)
                 3 - Unix
                 4 - VM/CMS
                 5 - Atari TOS
                 6 - HPFS filesystem (OS/2, NT)
                 7 - Macintosh
                 8 - Z-System
                 9 - CP/M
                10 - TOPS-20
                11 - NTFS filesystem (NT)
                12 - QDOS
                13 - Acorn RISCOS
               255 - unknown

         XLEN (eXtra LENgth)
            If FLG.FEXTRA is set, this gives the length of the optional
            extra field. See below for details.

         CRC32 (CRC-32)
            This contains a Cyclic Redundancy Check value of the
            uncompressed data computed according to CRC-32 algorithm
            used in the ISO 3309 standard and in section 8.1.1.6.2 of
            ITU-T recommendation V.42. (See http://www.iso.ch for
            ordering ISO documents. See gopher://info.itu.ch for an
            online version of ITU-T V.42.)

         ISIZE (Input SIZE)
            This contains the size of the original (uncompressed) input
            data modulo 2^32.

      2.3.1.1. Extra field

         If the FLG.FEXTRA bit is set, an "extra field" is present in
         the header, with total length XLEN bytes. It consists of a
         series of subfields, each of the form:

            +---+---+---+---+==================================+
            |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
            +---+---+---+---+==================================+

         SI1 and SI2 provide a subfield ID, typically two ASCII letters
         with some mnemonic value. Jean-Loup Gailly
         <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
         IDs; please send him any subfield ID you wish to use. Subfield
         IDs with SI2 = 0 are reserved for future use. The following
         IDs are currently defined:

            SI1         SI2         Data
            ----------  ----------  ----
            0x41 ('A')  0x70 ('P')  Apollo file type information

         LEN gives the length of the subfield data, excluding the 4
         initial bytes.

      2.3.1.2. Compliance

         A compliant compressor must produce files with correct ID1,
         ID2, CM, CRC32, and ISIZE, but may set all the other fields in
         the fixed-length part of the header to default values (255 for
         OS, 0 for all others). The compressor must set all reserved
         bits to zero.

         A compliant decompressor must check ID1, ID2, and CM, and
         provide an error indication if any of these have incorrect
         values.
         It must examine FEXTRA/XLEN, FNAME, FCOMMENT and FHCRC
         at least so it can skip over the optional fields if they are
         present. It need not examine any other part of the header or
         trailer; in particular, a decompressor may ignore FTEXT and OS
         and always produce binary output, and still be compliant. A
         compliant decompressor must give an error indication if any
         reserved bit is non-zero, since such a bit could indicate the
         presence of a new field that would cause subsequent data to be
         interpreted incorrectly.

3. References

   [1] "Information Processing - 8-bit single-byte coded graphic
       character sets - Part 1: Latin alphabet No.1" (ISO 8859-1:1987).
       The ISO 8859-1 (Latin-1) character set is a superset of 7-bit
       ASCII. Files defining this character set are available as
       iso_8859-1.* in ftp://ftp.uu.net/graphics/png/documents/

   [2] ISO 3309

   [3] ITU-T recommendation V.42

   [4] Deutsch, L.P., "DEFLATE Compressed Data Format Specification",
       available in ftp://ftp.uu.net/pub/archiving/zip/doc/

   [5] Gailly, J.-L., GZIP documentation, available as gzip-*.tar in
       ftp://prep.ai.mit.edu/pub/gnu/

   [6] Sarwate, D.V., "Computation of Cyclic Redundancy Checks via Table
       Look-Up", Communications of the ACM, 31(8), pp.1008-1013.

   [7] Schwaderer, W.D., "CRC Calculation", April 85 PC Tech Journal,
       pp.118-133.

   [8] ftp://ftp.adelaide.edu.au/pub/rocksoft/papers/crc_v3.txt,
       describing the CRC concept.

4. Security Considerations

   Any data compression method involves the reduction of redundancy in
   the data. Consequently, any corruption of the data is likely to have
   severe effects and be difficult to correct. Uncompressed text, on
   the other hand, will probably still be readable despite the presence
   of some corrupted bytes.
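
   As an illustration only (it is not part of the specification), such
   corruption can be detected by comparing decompressed output against
   the CRC32 and ISIZE fields of the member trailer described in
   section 2.3.1. The sketch below assumes a C99 compiler and assumes
   that the whole uncompressed output is held in memory. The names
   read_le32, crc32_bitwise, and gzip_trailer_matches are hypothetical,
   and the bit-at-a-time CRC loop is a simpler substitute for the
   table-driven sample code in the appendix.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a 32-bit value stored least-significant
   byte first, as section 2.1 requires for all multi-byte numbers. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Bit-at-a-time CRC-32 using the same polynomial constant as the
   sample code; slower than a table, but short enough to read. */
static uint32_t crc32_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t c = 0xffffffffu;           /* pre-conditioning */
    size_t i;
    int k;

    for (i = 0; i < len; i++) {
        c ^= buf[i];
        for (k = 0; k < 8; k++)
            c = (c & 1) ? (0xedb88320u ^ (c >> 1)) : (c >> 1);
    }
    return c ^ 0xffffffffu;             /* post-conditioning */
}

/* Hypothetical check: return 1 if the 8-byte trailer (CRC32 followed
   by ISIZE) agrees with the uncompressed data, 0 otherwise. */
int gzip_trailer_matches(const unsigned char trailer[8],
                         const unsigned char *data, size_t len)
{
    uint32_t want_crc = read_le32(trailer);
    uint32_t want_isize = read_le32(trailer + 4);

    return crc32_bitwise(data, len) == want_crc &&
           (uint32_t)(len & 0xffffffffu) == want_isize; /* mod 2^32 */
}
```

   A streaming decompressor would instead update the CRC and a running
   byte count as output is produced, and compare both against the
   trailer once the compressed blocks end.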

   It is recommended that systems using this data format provide some
   means of validating the integrity of the compressed data, such as by
   setting and checking the CRC-32 check value.

5. Acknowledgements

   Trademarks cited in this document are the property of their
   respective owners.

   Jean-Loup Gailly designed the gzip format and wrote, with Mark Adler,
   the related software described in this specification. Glenn
   Randers-Pehrson converted this document to RFC and HTML format.

6. Author's Address

   L. Peter Deutsch
   Aladdin Enterprises
   203 Santa Margarita Ave.
   Menlo Park, CA 94025

   Phone: (415) 322-0103 (AM only)
   FAX:   (415) 322-1734
   EMail: <ghost@aladdin.com>

   Questions about the technical content of this specification can be
   sent by email to:

   Jean-Loup Gailly <gzip@prep.ai.mit.edu> and
   Mark Adler <madler@alumni.caltech.edu>

   Editorial comments on this specification can be sent by email to:

   L. Peter Deutsch <ghost@aladdin.com> and
   Glenn Randers-Pehrson <randeg@alumni.rpi.edu>

7. Appendix: Jean-Loup Gailly's gzip utility

   The most widely used implementation of gzip compression, and the
   original documentation on which this specification is based, were
   created by Jean-Loup Gailly <gzip@prep.ai.mit.edu>. Since this
   implementation is a de facto standard, we mention some more of its
   features here. Again, the material in this section is not part of
   the specification per se, and implementations need not follow it to
   be compliant.

   When compressing or decompressing a file, gzip preserves the
   protection, ownership, and modification time attributes on the local
   file system, since there is no provision for representing protection
   attributes in the gzip file format itself. Since the file format
   includes a modification time, the gzip decompressor provides a
   command line switch that assigns the modification time from the file,
   rather than the local modification time of the compressed input, to
   the decompressed output.

8. Appendix: Sample CRC Code

   The following sample code represents a practical implementation of
   the CRC (Cyclic Redundancy Check). (See also ISO 3309 and ITU-T V.42
   for a formal specification.)

   The sample code is in the ANSI C programming language. Non-C users
   may find it easier to read with these hints:

      &      Bitwise AND operator.
      ^      Bitwise exclusive-OR operator.
      >>     Bitwise right shift operator. When applied to an
             unsigned quantity, as here, right shift inserts zero
             bit(s) at the left.
      !      Logical NOT operator.
      ++     "n++" increments the variable n.
      0xNNN  0x introduces a hexadecimal (base 16) constant.
             Suffix L indicates a long value (at least 32 bits).

      /* Table of CRCs of all 8-bit messages. */
      unsigned long crc_table[256];

      /* Flag: has the table been computed? Initially false. */
      int crc_table_computed = 0;

      /* Make the table for a fast CRC. */
      void make_crc_table(void)
      {
        unsigned long c;
        int n, k;

        for (n = 0; n < 256; n++) {
          c = (unsigned long) n;
          for (k = 0; k < 8; k++) {
            if (c & 1) {
              c = 0xedb88320L ^ (c >> 1);
            } else {
              c = c >> 1;
            }
          }
          crc_table[n] = c;
        }
        crc_table_computed = 1;
      }

      /*
         Update a running crc with the bytes buf[0..len-1] and return
       the updated crc. The crc should be initialized to zero. Pre- and
       post-conditioning (one's complement) is performed within this
       function so it shouldn't be done by the caller. Usage example:

         unsigned long crc = 0L;

         while (read_buffer(buffer, length) != EOF) {
           crc = update_crc(crc, buffer, length);
         }
         if (crc != original_crc) error();
      */
      unsigned long update_crc(unsigned long crc,
                      unsigned char *buf, int len)
      {
        unsigned long c = crc ^ 0xffffffffL;
        int n;

        if (!crc_table_computed)
          make_crc_table();
        for (n = 0; n < len; n++) {
          c = crc_table[(c ^ buf[n]) & 0xff] ^ (c >> 8);
        }
        return c ^ 0xffffffffL;
      }

      /* Return the CRC of the bytes buf[0..len-1]. */
      unsigned long crc(unsigned char *buf, int len)
      {
        return update_crc(0L, buf, len);
      }
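
   As a further illustration, which like the rest of the appendices is
   not part of the specification, the decompressor requirements of
   section 2.3.1.2 can be sketched as a routine that validates the
   fixed-length header and skips the optional fields in the order shown
   in section 2.3. The function name gzip_header_length, the GZ_* mask
   names, and the convention of returning -1 on error are assumptions
   made for this example. The sketch treats only CM = 8 as valid and
   does not verify the CRC16 when FHCRC is set.

```c
#include <stddef.h>

/* FLG bit masks, following the bit numbering in section 2.3.1. */
#define GZ_FHCRC     0x02
#define GZ_FEXTRA    0x04
#define GZ_FNAME     0x08
#define GZ_FCOMMENT  0x10
#define GZ_RESERVED  0xe0

/* Hypothetical helper: return the offset of the first byte of
   compressed data in buf[0..len-1], or -1 if the header is invalid
   or truncated. */
long gzip_header_length(const unsigned char *buf, size_t len)
{
    size_t pos = 10;                 /* fixed-length part of header */
    unsigned flg;

    if (len < 10) return -1;
    if (buf[0] != 0x1f || buf[1] != 0x8b) return -1;  /* ID1, ID2 */
    if (buf[2] != 8) return -1;                       /* CM */
    flg = buf[3];
    if (flg & GZ_RESERVED) return -1;     /* reserved bits must be 0 */

    if (flg & GZ_FEXTRA) {
        size_t xlen;
        if (len - pos < 2) return -1;
        /* XLEN is stored least-significant byte first. */
        xlen = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (len - pos < xlen) return -1;
        pos += xlen;
    }
    if (flg & GZ_FNAME) {
        while (pos < len && buf[pos] != 0) pos++;
        if (pos == len) return -1;        /* no terminating zero */
        pos++;
    }
    if (flg & GZ_FCOMMENT) {
        while (pos < len && buf[pos] != 0) pos++;
        if (pos == len) return -1;        /* no terminating zero */
        pos++;
    }
    if (flg & GZ_FHCRC) {
        if (len - pos < 2) return -1;
        pos += 2;                         /* CRC16 skipped, not checked */
    }
    return (long)pos;
}
```

   FTEXT, MTIME, XFL, and OS are ignored here, which section 2.3.1.2
   permits. A caller would pass the bytes that follow the returned
   offset to a "deflate" decoder.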