SQUASHFS 4.0 FILESYSTEM
=======================

Squashfs is a compressed read-only filesystem for Linux.
It uses zlib compression to compress files, inodes and directories.
Inodes in the system are very small and all blocks are packed to minimise
data overhead. Block sizes greater than 4K are supported up to a maximum
of 1Mbytes (default block size 128K).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net
Web site: www.squashfs.org

1. FILESYSTEM FEATURES
----------------------

Squashfs filesystem features versus Cramfs:

                                Squashfs            Cramfs

Max filesystem size:            2^64                256 MiB
Max file size:                  ~ 2 TiB             16 MiB
Max files:                      unlimited           unlimited
Max directories:                unlimited           unlimited
Max entries per directory:      unlimited           unlimited
Max block size:                 1 MiB               4 KiB
Metadata compression:           yes                 no
Directory indexes:              yes                 no
Sparse file support:            yes                 no
Tail-end packing (fragments):   yes                 no
Exportable (NFS etc.):          yes                 no
Hard link support:              yes                 no
"." and ".." in readdir:        yes                 no
Real inode numbers:             yes                 no
32-bit uids/gids:               yes                 no
File creation time:             yes                 no
Xattr and ACL support:          no                  no

Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries. Each
compressed inode is on average 8 bytes in length (the exact length varies
with file type, i.e. regular file, directory, symbolic link, and block/char
device inodes have different sizes).

2. USING SQUASHFS
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems.
This and other squashfs utilities can be obtained from
http://www.squashfs.org. Usage instructions are also available from this
site.


3. SQUASHFS FILESYSTEM DESIGN
-----------------------------

A squashfs filesystem consists of seven parts, packed together on a byte
alignment:

         ---------------
        |  superblock   |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |     table     |
        |---------------|
        |    export     |
        |     table     |
        |---------------|
        |    uid/gid    |
        | lookup table  |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates. Once all file data has been
written, the completed inode, directory, fragment, export and uid/gid lookup
tables are written.

3.1 Inodes
----------

Metadata (inodes and directories) is compressed in 8Kbyte blocks. Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed. A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode may straddle two compressed blocks. Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block where
the inode is placed (<block, offset>).

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.
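
As a rough illustration of the two encodings described above, the sketch
below decodes a two-byte metadata block header and splits a 48-bit inode
number into its <block, offset> pair. The little-endian byte order, the
32-bit/16-bit split between block and offset, and the function names are
assumptions made for this example; they are not stated in this document.

```python
import struct

UNCOMPRESSED_FLAG = 0x8000  # top bit of the two-byte length
LENGTH_MASK = 0x7FFF        # remaining 15 bits hold the stored length


def parse_metadata_header(raw):
    """Return (stored_length, is_compressed) for a metadata block header.

    Assumption: the two-byte length is stored little-endian.
    """
    (word,) = struct.unpack("<H", raw[:2])
    return word & LENGTH_MASK, (word & UNCOMPRESSED_FLAG) == 0


def split_inode_ref(ref):
    """Split a 48-bit inode number into (block, offset).

    Assumption: the upper 32 bits locate the metadata block and the
    lower 16 bits give the byte offset inside the decompressed block.
    """
    return (ref >> 16) & 0xFFFFFFFF, ref & 0xFFFF
```

A 16-bit offset is enough to address any byte of an 8Kbyte uncompressed
metadata block, which is why a split of this shape fits in 48 bits.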

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.2 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names. The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and can therefore share the start block.
Directories are therefore organised as a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block. A new directory header
is written whenever the inode start block changes. The directory
header/directory entry list is repeated as many times as necessary.

Directories are sorted, and can contain a directory index to speed up
file lookup. Directory indexes store one entry per metablock, each entry
storing the index/filename mapping to the first directory header
in each metadata block. Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up. At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure that only one metadata block needs
to be decompressed to do a lookup, irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

3.3 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block). The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 Mbytes or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.

3.4 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table. This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these. For speed of access (and
because it is small), this second index table is read at mount time and
cached in memory.

3.5 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table. This table is
stored compressed into metadata blocks. A second index table is used to
locate these. For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.
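
The fragment and uid/gid tables above share the same two-level scheme: the
in-memory index table locates a metadata block, and the wanted entry is then
read from inside that block. The sketch below shows the arithmetic for the id
table only, assuming 4-byte (32-bit) ids packed back to back into 8Kbyte
metadata blocks and stored little-endian. The function names, the
index_table list and the read_block callable are hypothetical stand-ins
invented for this example, not names from the kernel source.

```python
METADATA_BLOCK_SIZE = 8192  # uncompressed metadata block size (section 3.1)
ID_SIZE = 4                 # one 32-bit uid/gid per entry
IDS_PER_BLOCK = METADATA_BLOCK_SIZE // ID_SIZE


def locate_id(id_index):
    """Return (index_table_slot, byte_offset) for an id index.

    index_table_slot selects an entry of the second (in-memory) index
    table; byte_offset is the position of the id inside the metadata
    block that entry points to, after decompression.
    """
    if id_index < 0:
        raise ValueError("id index must be non-negative")
    slot, entry = divmod(id_index, IDS_PER_BLOCK)
    return slot, entry * ID_SIZE


def lookup_id(id_index, index_table, read_block):
    """Resolve an id index to a 32-bit uid/gid.

    index_table: list of on-disk metadata block locations (hypothetical).
    read_block:  callable returning the decompressed block at a location
                 (hypothetical).
    Assumption: ids are stored little-endian.
    """
    slot, offset = locate_id(id_index)
    block = read_block(index_table[slot])
    return int.from_bytes(block[offset:offset + ID_SIZE], "little")
```

Under these assumptions a single metadata block holds 2048 ids, so only the
small index table has to stay in memory while the table itself stays
compressed on disk.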

3.6 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
can optionally (disabled with the -no-exports mksquashfs option) contain
an inode number to inode disk location lookup table. This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks. A second index table is
used to locate these. For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.


4. TODOS AND OUTSTANDING ISSUES
-------------------------------

4.1 Todo list
-------------

Implement Xattr and ACL support. The Squashfs 4.0 filesystem layout has hooks
for these but the code has not been written. Once the code has been written
the existing layout should not require modification.

4.2 Squashfs internal cache
---------------------------

Blocks in Squashfs are compressed. To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata and
one for fragments.

These caches are not used for file datablocks; those are decompressed and
cached in the page-cache in the normal way. The caches temporarily hold
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access. Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it. Because of locality of reference, these may be
read in the near future. Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.
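
The effect of such a cache can be sketched in a few lines. The example below
is a toy model and not the kernel implementation: it assumes a small fixed
number of entries with least-recently-used replacement, and zlib-compressed
blocks keyed by their start address. The class name, its parameters and the
read_compressed callable are made up for illustration.

```python
import zlib
from collections import OrderedDict


class BlockCache:
    """Toy cache of decompressed metadata/fragment blocks."""

    def __init__(self, read_compressed, entries=8):
        # read_compressed: hypothetical callable mapping a block start
        # address to that block's compressed bytes.
        self.read_compressed = read_compressed
        self.entries = entries
        self.blocks = OrderedDict()
        self.decompressions = 0  # counts actual decompress operations

    def get(self, start):
        """Return the decompressed block that starts at 'start'."""
        if start in self.blocks:
            # Hit: mark as most recently used, no read or decompress.
            self.blocks.move_to_end(start)
            return self.blocks[start]
        data = zlib.decompress(self.read_compressed(start))
        self.decompressions += 1
        self.blocks[start] = data
        if len(self.blocks) > self.entries:
            # Assumed policy: drop the least recently used block.
            self.blocks.popitem(last=False)
        return data
```

Two inode lookups that land in the same metadata block call get() with the
same start address, so the second lookup is served from the cache without
another read and decompress.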

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache. Because the page cache operates on page-sized
units this may introduce additional complexity in terms of locking and
associated race conditions.