      1 LZ4 Frame Format Description
      2 ============================
      3 
      4 ### Notices
      5 
      6 Copyright (c) 2013-2015 Yann Collet
      7 
      8 Permission is granted to copy and distribute this document
      9 for any  purpose and without charge,
     10 including translations into other  languages
     11 and incorporation into compilations,
     12 provided that the copyright notice and this notice are preserved,
     13 and that any substantive changes or deletions from the original
     14 are clearly marked.
     15 Distribution of this document is unlimited.
     16 
     17 ### Version
     18 
     19 1.5.1 (31/03/2015)
     20 
     21 
     22 Introduction
     23 ------------
     24 
     25 The purpose of this document is to define a lossless compressed data format,
     26 that is independent of CPU type, operating system,
     27 file system and character set, suitable for
     28 file compression, pipe and streaming compression
     29 using the [LZ4 algorithm](http://www.lz4.org).
     30 
     31 The data can be produced or consumed,
     32 even for an arbitrarily long sequentially presented input data stream,
     33 using only an a priori bounded amount of intermediate storage,
     34 and hence can be used in data communications.
     35 The format uses the LZ4 compression method,
     36 and optional [xxHash-32 checksum method](https://github.com/Cyan4973/xxHash),
     37 for detection of data corruption.
     38 
     39 The data format defined by this specification
     40 does not attempt to allow random access to compressed data.
     41 
     42 This specification is intended for use by implementers of software
     43 to compress data into LZ4 format and/or decompress data from LZ4 format.
     44 The text of the specification assumes a basic background in programming
     45 at the level of bits and other primitive data representations.
     46 
     47 Unless otherwise indicated below,
     48 a compliant compressor must produce data sets
     49 that conform to the specifications presented here.
     50 It does not need to support all options, though.
     51 
     52 A compliant decompressor must be able to decompress
     53 at least one working set of parameters
     54 that conforms to the specifications presented here.
     55 It may also ignore checksums.
     56 Whenever it does not support a specific parameter within the compressed stream,
     57 it must produce a non-ambiguous error code
     58 and associated error message explaining which parameter is unsupported.
     59 
     60 
     61 General Structure of LZ4 Frame format
     62 -------------------------------------
     63 
     64 | MagicNb | F. Descriptor | Block | (...) | EndMark | C. Checksum |
     65 |:-------:|:-------------:| ----- | ----- | ------- | ----------- |
     66 | 4 bytes |  3-11 bytes   |       |       | 4 bytes | 0-4 bytes   |
     67 
     68 __Magic Number__
     69 
     70 4 Bytes, Little endian format.
     71 Value : 0x184D2204
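
As a minimal sketch in Python, the magic number can be checked by reading the first 4 bytes as a little-endian unsigned integer. The helper name `has_lz4_magic` is hypothetical and not part of this specification.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # value given in this specification


def has_lz4_magic(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 frame magic number."""
    if len(buf) < 4:
        return False
    # "<I" reads a little-endian unsigned 32-bit integer.
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_FRAME_MAGIC
```

Because the field is little-endian, the bytes appear in a file or stream as `04 22 4D 18`.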
     72 
     73 __Frame Descriptor__
     74 
     75 3 to 11 Bytes, to be detailed in the next part.
     76 Most important part of the spec.
     77 
     78 __Data Blocks__
     79 
     80 To be detailed later on.
     81 This is where compressed data is stored.
     82 
     83 __EndMark__
     84 
     85 The flow of blocks ends when the last data block has a size of 0.
     86 The size is expressed as a 32-bit value.
     87 
     88 __Content Checksum__
     89 
     90 The Content Checksum verifies that the full content has been decoded correctly.
     91 The content checksum is the result
     92 of [xxh32() hash function](https://github.com/Cyan4973/xxHash)
     93 digesting the original (decoded) data as input, and a seed of zero.
     94 Content checksum is only present when its associated flag
     95 is set in the frame descriptor.
     96 Content Checksum validates the result,
     97 that all blocks were fully transmitted in the correct order and without error,
     98 and also that the encoding/decoding process itself generated no distortion.
     99 Its usage is recommended.
    100 
    101 __Frame Concatenation__
    102 
    103 In some circumstances, it may be preferable to append multiple frames,
    104 for example in order to add new data to an existing compressed file
    105 without re-framing it.
    106 
    107 In such a case, each frame has its own set of descriptor flags.
    108 Each frame is considered independent.
    109 The only relation between frames is their sequential order.
    110 
    111 The ability to decode multiple concatenated frames
    112 within a single stream or file
    113 is left outside of this specification.
    114 As an example, the reference lz4 command line utility behavior is
    115 to decode all concatenated frames in their sequential order.
    116 
    117 
    118 Frame Descriptor
    119 ----------------
    120 
    121 | FLG     | BD      | (Content Size) | HC      |
    122 | ------- | ------- |:--------------:| ------- |
    123 | 1 byte  | 1 byte  |  0 - 8 bytes   | 1 byte  |
    124 
    125 The descriptor uses a minimum of 3 bytes,
    126 and up to 11 bytes depending on optional parameters.
    127 
    128 __FLG byte__
    129 
    130 |  BitNb  |   7-6   |    5    |     4     |   3     |     2     |    1-0   |
    131 | ------- | ------- | ------- | --------- | ------- | --------- | -------- |
    132 |FieldName| Version | B.Indep | B.Checksum| C.Size  | C.Checksum|*Reserved*|
    133 
    134 
    135 __BD byte__
    136 
    137 |  BitNb  |     7    |     6-5-4    |  3-2-1-0 |
    138 | ------- | -------- | ------------ | -------- |
    139 |FieldName|*Reserved*| Block MaxSize|*Reserved*|
    140 
    141 In the tables, bit 7 is highest bit, while bit 0 is lowest.
    142 
    143 __Version Number__
    144 
    145 2-bits field, must be set to 01.
    146 Any other value cannot be decoded by this version of the specification.
    147 Other version numbers will use different flag layouts.
    148 
    149 __Block Independence flag__
    150 
    151 If this flag is set to 1, blocks are independent.
    152 If this flag is set to 0, each block depends on previous ones
    153 (up to LZ4 window size, which is 64 KB).
    154 In such a case, it is necessary to decode all blocks in sequence.
    155 
    156 Block dependency improves compression ratio, especially for small blocks.
    157 On the other hand, it makes direct jumps or multi-threaded decoding impossible.
    158 
    159 __Block checksum flag__
    160 
    161 If this flag is set, each data block will be followed by a 4-bytes checksum,
    162 calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
    163 The intention is to detect data corruption (storage or transmission errors)
    164 immediately, before decoding.
    165 Block checksum usage is optional.
    166 
    167 __Content Size flag__
    168 
    169 If this flag is set, the uncompressed size of data included within the frame
    170 will be present as an 8 bytes unsigned little endian value, after the flags.
    171 Content Size usage is optional.
    172 
    173 __Content checksum flag__
    174 
    175 If this flag is set, a content checksum will be appended after the EndMark.
    176 
    177 Recommended value : 1 (content checksum is present)
    178 
    179 __Block Maximum Size__
    180 
    181 This information is intended to help the decoder allocate memory.
    182 Size here refers to the original (uncompressed) data size.
    183 Block Maximum Size is one value among the following table :
    184 
    185 |  0  |  1  |  2  |  3  |   4   |   5    |  6   |  7   |
    186 | --- | --- | --- | --- | ----- | ------ | ---- | ---- |
    187 | N/A | N/A | N/A | N/A | 64 KB | 256 KB | 1 MB | 4 MB |
    188 
    189 The decoder may refuse to allocate block sizes above a (system-specific) size.
    190 Unused values may be used in a future revision of the spec.
    191 A decoder conformant to the current version of the spec
    192 is only able to decode block sizes defined in this spec.
    193 
    194 __Reserved bits__
    195 
    196 Value of reserved bits **must** be 0 (zero).
    197 Reserved bits might be used in a future version of the specification,
    198 typically enabling new optional features.
    199 If this happens, a decoder respecting the current version of the specification
    200 shall not be able to decode such a frame.
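
The rules for the FLG and BD bytes described so far (version field, flags, Block Maximum Size table, reserved bits) can be sketched as follows. The names `FrameFlags` and `parse_flg_bd` are hypothetical, and rejecting unknown values with `ValueError` is one possible way to report an unsupported parameter, not something this specification mandates.

```python
from typing import NamedTuple

# Block Maximum Size table: codes 0-3 are not defined by this version.
BLOCK_MAX_SIZES = {4: 64 * 1024, 5: 256 * 1024, 6: 1024 * 1024, 7: 4 * 1024 * 1024}


class FrameFlags(NamedTuple):
    block_independence: bool
    block_checksum: bool
    content_size: bool
    content_checksum: bool
    block_max_size: int  # in bytes, uncompressed


def parse_flg_bd(flg: int, bd: int) -> FrameFlags:
    """Decode the FLG and BD bytes, rejecting anything this version cannot decode."""
    version = (flg >> 6) & 0b11          # FLG bits 7-6
    if version != 0b01:
        raise ValueError(f"unsupported version field: {version}")
    if flg & 0b11:                       # FLG bits 1-0 are reserved
        raise ValueError("reserved bits set in FLG byte")
    if bd & 0b1000_1111:                 # BD bit 7 and bits 3-0 are reserved
        raise ValueError("reserved bits set in BD byte")
    code = (bd >> 4) & 0b111             # BD bits 6-5-4
    if code not in BLOCK_MAX_SIZES:
        raise ValueError(f"unsupported Block Maximum Size code: {code}")
    return FrameFlags(
        block_independence=bool(flg & 0x20),  # bit 5
        block_checksum=bool(flg & 0x10),      # bit 4
        content_size=bool(flg & 0x08),        # bit 3
        content_checksum=bool(flg & 0x04),    # bit 2
        block_max_size=BLOCK_MAX_SIZES[code],
    )
```

For example, `parse_flg_bd(0x64, 0x40)` describes a frame with independent blocks, a content checksum, no block checksums, no content size, and 64 KB blocks.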
    201 
    202 __Content Size__
    203 
    204 This is the original (uncompressed) size.
    205 This information is optional, and only present if the associated flag is set.
    206 Content size is provided as an unsigned 8-byte value, for a maximum of 16 exabytes.
    207 Format is Little endian.
    208 This value is informational, typically for display or memory allocation.
    209 It can be skipped by a decoder, or used to validate content correctness.
    210 
    211 __Header Checksum__
    212 
    213 One-byte checksum of combined descriptor fields, including optional ones.
    214 The value is the second byte of xxh32() : ` (xxh32()>>8) & 0xFF `
    215 using zero as a seed,
    216 and the full Frame Descriptor as an input
    217 (including optional fields when they are present).
    218 A wrong checksum indicates an error in the descriptor.
    219 Header checksum is informational and can be skipped.
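
The header checksum computation can be sketched as below. The `xxh32` function is a compact pure-Python rendition of the XXH32 algorithm written for this illustration, not the reference implementation; a real decoder would more likely call an existing xxHash library. The sketch also assumes that "the full Frame Descriptor" means the bytes preceding HC, that is FLG, BD and the Content Size field when present. `header_checksum` is a hypothetical helper name.

```python
import struct

P1, P2, P3, P4, P5 = 0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F, 0x165667B1
MASK = 0xFFFFFFFF  # keep every intermediate value within 32 bits


def _rotl(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK


def xxh32(data: bytes, seed: int = 0) -> int:
    """Pure-Python XXH32 (illustrative sketch)."""
    n, i = len(data), 0
    if n >= 16:
        acc = [(seed + P1 + P2) & MASK, (seed + P2) & MASK,
               seed & MASK, (seed - P1) & MASK]
        while i + 16 <= n:  # consume 16-byte stripes, one 32-bit word per lane
            for lane, word in enumerate(struct.unpack_from("<4I", data, i)):
                acc[lane] = (_rotl((acc[lane] + word * P2) & MASK, 13) * P1) & MASK
            i += 16
        h = (_rotl(acc[0], 1) + _rotl(acc[1], 7)
             + _rotl(acc[2], 12) + _rotl(acc[3], 18)) & MASK
    else:
        h = (seed + P5) & MASK
    h = (h + n) & MASK
    while i + 4 <= n:       # remaining 4-byte words
        (word,) = struct.unpack_from("<I", data, i)
        h = (_rotl((h + word * P3) & MASK, 17) * P4) & MASK
        i += 4
    while i < n:            # remaining single bytes
        h = (_rotl((h + data[i] * P5) & MASK, 11) * P1) & MASK
        i += 1
    h ^= h >> 15            # final avalanche
    h = (h * P2) & MASK
    h ^= h >> 13
    h = (h * P3) & MASK
    h ^= h >> 16
    return h


def header_checksum(descriptor_fields: bytes) -> int:
    """HC byte: second byte of xxh32() over FLG, BD and optional Content Size."""
    return (xxh32(descriptor_fields, 0) >> 8) & 0xFF
```

With this sketch, a descriptor with FLG = `0x64` and BD = `0x40` and no Content Size gives `header_checksum(bytes([0x64, 0x40])) == 0xA7`.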
    220 
    221 
    222 Data Blocks
    223 -----------
    224 
    225 | Block Size |  data  | (Block Checksum) |
    226 |:----------:| ------ |:----------------:|
    227 |  4 bytes   |        |   0 - 4 bytes    |
    228 
    229 
    230 __Block Size__
    231 
    232 This field uses 4-bytes, format is little-endian.
    233 
    234 The highest bit is 1 if data in the block is uncompressed.
    235 
    236 The highest bit is 0 if data in the block is compressed by LZ4.
    237 
    238 All other bits give the size, in bytes, of the following data block
    239 (the size does not include the block checksum if present).
    240 
    241 Block Size shall never be larger than Block Maximum Size.
    242 A larger size could otherwise arise with incompressible source data.
    243 In such a case, the data block shall be passed in uncompressed format.
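
A short sketch of how the Block Size field splits into its two parts. The function name `parse_block_size` is hypothetical.

```python
import struct


def parse_block_size(field: bytes):
    """Split the 4-byte Block Size field into (size_in_bytes, is_uncompressed)."""
    (raw,) = struct.unpack("<I", field)   # little-endian unsigned 32-bit
    is_uncompressed = bool(raw >> 31)     # highest bit: 1 = stored uncompressed
    size = raw & 0x7FFFFFFF               # remaining 31 bits: size of the data
    return size, is_uncompressed
```

A decoder would typically compare the returned size against the Block Maximum Size announced in the BD byte and report an error if it is larger.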
    244 
    245 __Data__
    246 
    247 This is where the actual data to decode is stored.
    248 It might be compressed or not, depending on previous field indications.
    249 Uncompressed size of Data can be any size, up to block maximum size.
    250 Note that data block is not necessarily full :
    251 an arbitrary flush may happen anytime. Any block can be partially filled.
    252 
    253 __Block checksum__
    254 
    255 Only present if the associated flag is set.
    256 This is a 4-bytes checksum value, in little endian format,
    257 calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
    258 and a seed of zero.
    259 The intention is to detect data corruption (storage or transmission errors)
    260 before decoding.
    261 
    262 Block checksum is cumulative with Content checksum.
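
Putting the block layout together, the sequence of data blocks can be walked as in the sketch below. It only splits the stream into blocks: it neither decompresses the data nor verifies checksums. `iter_blocks` is a hypothetical name, and treating only an all-zero 32-bit field as the EndMark is an assumption of this sketch.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_blocks(buf: bytes, pos: int, has_block_checksum: bool
                ) -> Iterator[Tuple[bytes, bool, Optional[int]]]:
    """Yield (data, is_uncompressed, stored_checksum) for each block until EndMark.

    pos is the offset of the first Block Size field, just after the descriptor.
    """
    while True:
        (raw,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        if raw == 0:                      # EndMark: a block size of 0
            return
        size = raw & 0x7FFFFFFF           # size excludes the block checksum
        data = buf[pos:pos + size]
        if len(data) != size:
            raise ValueError("truncated data block")
        pos += size
        checksum = None
        if has_block_checksum:            # 4-byte little-endian xxHash-32 value
            (checksum,) = struct.unpack_from("<I", buf, pos)
            pos += 4
        yield data, bool(raw >> 31), checksum
```

When a stored checksum is returned, a decoder that chooses to verify it would compare it with xxHash-32 (seed zero) of `data` before decoding the block.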
    263 
    264 
    265 Skippable Frames
    266 ----------------
    267 
    268 | Magic Number | Frame Size | User Data |
    269 |:------------:|:----------:| --------- |
    270 |   4 bytes    |  4 bytes   |           |
    271 
    272 Skippable frames allow the integration of user-defined data
    273 into a flow of concatenated frames.
    274 Their design is straightforward,
    275 with the sole objective to allow the decoder to quickly skip
    276 over user-defined data and continue decoding.
    277 
    278 For the purpose of facilitating identification,
    279 it is discouraged to start a flow of concatenated frames with a skippable frame.
    280 If there is a need to start such a flow with some user data
    281 encapsulated into a skippable frame,
    282 it is recommended to start with a zero-byte LZ4 frame
    283 followed by a skippable frame.
    284 This will make it easier for file type identifiers.
    285 
    286 
    287 __Magic Number__
    288 
    289 4 Bytes, Little endian format.
    290 Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F.
    291 All 16 values are valid to identify a skippable frame.
    292 
    293 __Frame Size__
    294 
    295 This is the size, in bytes, of the following User Data
    296 (without including the magic number nor the size field itself).
    297 4 Bytes, Little endian format, unsigned 32-bits.
    298 This means User Data cannot be bigger than (2^32-1) Bytes.
    299 
    300 __User Data__
    301 
    302 User Data can be anything. Data will just be skipped by the decoder.
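
Skipping a skippable frame only requires the magic number and the Frame Size field, as in this sketch. `skip_skippable_frame` is a hypothetical name.

```python
import struct

SKIPPABLE_MAGIC_BASE = 0x184D2A50  # 0x184D2A50 .. 0x184D2A5F are all valid


def skip_skippable_frame(buf: bytes, pos: int = 0) -> int:
    """Return the offset of the first byte after the skippable frame at pos."""
    magic, size = struct.unpack_from("<II", buf, pos)  # two little-endian u32
    if (magic & 0xFFFFFFF0) != SKIPPABLE_MAGIC_BASE:
        raise ValueError("not a skippable frame")
    end = pos + 8 + size     # Frame Size excludes the magic and the size field
    if end > len(buf):
        raise ValueError("truncated skippable frame")
    return end
```

The returned offset is where a decoder would look for the magic number of the next frame.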
    303 
    304 
    305 Legacy frame
    306 ------------
    307 
    308 The Legacy frame format was defined in the initial versions of LZ4Demo.
    309 Newer compressors should not use this format anymore, as it is too restrictive.
    310 
    311 Main characteristics of the legacy format :
    312 
    313 - Fixed block size : 8 MB.
    314 - All blocks must be completely filled, except the last one.
    315 - All blocks are always compressed, even when compression is detrimental.
    316 - The last block is detected either because
    317   it is followed by the EOF (End of File) mark,
    318   or because it is followed by a known Frame Magic Number.
    319 - No checksum
    320 - Convention is Little endian
    321 
    322 | MagicNb | B.CSize | CData | B.CSize | CData |  (...)  | EndMark |
    323 | ------- | ------- | ----- | ------- | ----- | ------- | ------- |
    324 | 4 bytes | 4 bytes | CSize | 4 bytes | CSize | x times |   EOF   |
    325 
    326 
    327 __Magic Number__
    328 
    329 4 Bytes, Little endian format.
    330 Value : 0x184C2102
    331 
    332 __Block Compressed Size__
    333 
    334 This is the size, in bytes, of the following compressed data block.
    335 4 Bytes, Little endian format.
    336 
    337 __Data__
    338 
    339 This is where the actual compressed data is stored.
    340 Data is always compressed, even when compression is detrimental.
    341 
    342 __EndMark__
    343 
    344 End of legacy frame is implicit only.
    345 It must be followed by a standard EOF (End Of File) signal,
    346 whether it is a file or a stream.
    347 
    348 Alternatively, if the frame is followed by a valid Frame Magic Number,
    349 it is considered completed.
    350 It makes legacy frames compatible with frame concatenation.
    351 
    352 Any other value will be interpreted as a block size,
    353 and trigger an error if it does not fit within acceptable range.
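
The implicit end-of-frame rules can be sketched as follows. The function only splits a legacy frame into its compressed blocks and does not decompress them. `read_legacy_frame` and `is_frame_magic` are hypothetical names. The "acceptable range" is not quantified in this document, so the sketch assumes the worst-case bound used by the reference library's `LZ4_compressBound()` (n + n/255 + 16) applied to an 8 MB block.

```python
import struct

LEGACY_MAGIC = 0x184C2102
LZ4_FRAME_MAGIC = 0x184D2204
LEGACY_BLOCK_SIZE = 8 * 1024 * 1024
# Assumption: worst-case compressed size of one 8 MB block.
MAX_COMPRESSED = LEGACY_BLOCK_SIZE + LEGACY_BLOCK_SIZE // 255 + 16


def is_frame_magic(value: int) -> bool:
    """True for any frame magic number defined in this document."""
    return (value in (LZ4_FRAME_MAGIC, LEGACY_MAGIC)
            or (value & 0xFFFFFFF0) == 0x184D2A50)


def read_legacy_frame(buf: bytes, pos: int = 0):
    """Return (compressed_blocks, end_offset) for the legacy frame at pos."""
    (magic,) = struct.unpack_from("<I", buf, pos)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy frame")
    pos += 4
    blocks = []
    while pos < len(buf):                 # EOF ends the frame
        (value,) = struct.unpack_from("<I", buf, pos)
        if is_frame_magic(value):         # a following frame also ends it
            break
        if value > MAX_COMPRESSED:        # otherwise the value is a block size
            raise ValueError("block size outside acceptable range")
        pos += 4
        data = buf[pos:pos + value]
        if len(data) != value:
            raise ValueError("truncated compressed block")
        blocks.append(data)
        pos += value
    return blocks, pos
```

The returned offset points either at the end of the input or at the magic number of the next frame, which is what makes legacy frames usable with frame concatenation.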
    354 
    355 
    356 Version changes
    357 ---------------
    358 
    359 1.5.1 : changed format to MarkDown compatible
    360 
    361 1.5 : removed Dictionary ID from specification
    362 
    363 1.4.1 : changed wording from stream to frame
    364 
    365 1.4 : added skippable streams, re-added stream checksum
    366 
    367 1.3 : modified header checksum
    368 
    369 1.2 : reduced choice of block size, to postpone decision on dynamic size of BlockSize Field.
    370 
    371 1.1 : optional fields are now part of the descriptor
    372 
    373 1.0 : changed block size specification, adding a compressed/uncompressed flag
    374 
    375 0.9 : reduced scale of block maximum size table
    376 
    377 0.8 : removed : high compression flag
    378 
    379 0.7 : removed : stream checksum
    380 
    381 0.6 : settled : stream size uses 8 bytes, endian convention is little endian
    382 
    383 0.5: added copyright notice
    384 
    385 0.4 : changed format to Google Doc compatible OpenDocument
    386