| 1 | .\" Copyright (c) 1982 Regents of the University of California. |
| 2 | .\" All rights reserved. The Berkeley software License Agreement |
| 3 | .\" specifies the terms and conditions for redistribution. |
| 4 | .\" |
| 5 | .\" @(#)2.t 4.3 (Berkeley) %G% |
| 6 | .\" |
| 7 | .ds RH Overview of the file system |
| 8 | .NH |
| 9 | Overview of the file system |
| 10 | .PP |
| 11 | The file system is discussed in detail in [Mckusick84]; |
| 12 | this section gives a brief overview. |
| 13 | .NH 2 |
| 14 | Superblock |
| 15 | .PP |
| 16 | A file system is described by its |
| 17 | .I "super-block" . |
| 18 | The super-block is built when the file system is created (\c |
| 19 | .I newfs (8)) |
| 20 | and never changes. |
| 21 | The super-block |
| 22 | contains the basic parameters of the file system, |
| 23 | such as the number of data blocks it contains |
| 24 | and a count of the maximum number of files. |
| 25 | Because the super-block contains critical data, |
| 26 | .I newfs |
| 27 | replicates it to protect against catastrophic loss. |
| 28 | The |
| 29 | .I "default super block" |
| 30 | always resides at a fixed offset from the beginning |
| 31 | of the file system's disk partition. |
| 32 | The |
| 33 | .I "redundant super blocks" |
| 34 | are not referenced unless a head crash |
| 35 | or other hard disk error causes the default super-block |
| 36 | to be unusable. |
| 37 | The redundant blocks are sprinkled throughout the disk partition. |
| 38 | .PP |
| 39 | Within the file system are files. |
| 40 | Certain files are distinguished as directories and contain collections |
| 41 | of pointers to files that may themselves be directories. |
| 42 | Every file has a descriptor associated with it called an |
| 43 | .I "inode". |
| 44 | The inode contains information describing ownership of the file, |
| 45 | time stamps indicating modification and access times for the file, |
| 46 | and an array of indices pointing to the data blocks for the file. |
| 47 | In this section, |
| 48 | we assume that the first 12 blocks |
| 49 | of the file are directly referenced by values stored |
| 50 | in the inode structure itself\(dg. |
| 51 | .FS |
| 52 | \(dgThe actual number may vary from system to system, but is usually in |
| 53 | the range 5-13. |
| 54 | .FE |
| 55 | The inode structure may also contain references to indirect blocks |
| 56 | containing further data block indices. |
| 57 | In a file system with a 4096 byte block size, a singly indirect |
| 58 | block contains 1024 further block addresses, |
| 59 | a doubly indirect block contains 1024 addresses of further single indirect |
| 60 | blocks, |
| 61 | and a triply indirect block contains 1024 addresses of further doubly indirect |
| 62 | blocks (the triple indirect block is never needed in practice). |
| 63 | .PP |
| 64 | In order to create files with up to |
| 65 | 2\(ua32 bytes, |
| 66 | using only two levels of indirection, |
| 67 | the minimum size of a file system block is 4096 bytes. |
| 68 | The size of file system blocks can be any power of two |
| 69 | greater than or equal to 4096. |
| 70 | The block size of the file system is maintained in the super-block, |
| 71 | so it is possible for file systems of different block sizes |
| 72 | to be accessible simultaneously on the same system. |
| 73 | The block size must be decided when |
| 74 | .I newfs |
| 75 | creates the file system; |
| 76 | the block size cannot be subsequently |
| 77 | changed without rebuilding the file system. |
| 78 | .NH 2 |
| 79 | Summary information |
| 80 | .PP |
| 81 | Associated with the super block is non replicated |
| 82 | .I "summary information" . |
| 83 | The summary information changes |
| 84 | as the file system is modified. |
| 85 | The summary information contains |
| 86 | the number of blocks, fragments, inodes and directories in the file system. |
| 87 | .NH 2 |
| 88 | Cylinder groups |
| 89 | .PP |
| 90 | The file system partitions the disk into one or more areas called |
| 91 | .I "cylinder groups". |
| 92 | A cylinder group is comprised of one or more consecutive |
| 93 | cylinders on a disk. |
| 94 | Each cylinder group includes inode slots for files, a |
| 95 | .I "block map" |
| 96 | describing available blocks in the cylinder group, |
| 97 | and summary information describing the usage of data blocks |
| 98 | within the cylinder group. |
| 99 | A fixed number of inodes is allocated for each cylinder group |
| 100 | when the file system is created. |
| 101 | The current policy is to allocate one inode for each 2048 |
| 102 | bytes of disk space; |
| 103 | this is expected to be far more inodes than will ever be needed. |
| 104 | .PP |
| 105 | All the cylinder group bookkeeping information could be |
| 106 | placed at the beginning of each cylinder group. |
| 107 | However if this approach were used, |
| 108 | all the redundant information would be on the top platter. |
| 109 | A single hardware failure that destroyed the top platter |
| 110 | could cause the loss of all copies of the redundant super-blocks. |
| 111 | Thus the cylinder group bookkeeping information |
| 112 | begins at a floating offset from the beginning of the cylinder group. |
| 113 | The offset for |
| 114 | the |
| 115 | .I "i+1" st |
| 116 | cylinder group is about one track further |
| 117 | from the beginning of the cylinder group |
| 118 | than it was for the |
| 119 | .I "i" th |
| 120 | cylinder group. |
| 121 | In this way, |
| 122 | the redundant |
| 123 | information spirals down into the pack; |
| 124 | any single track, cylinder, |
| 125 | or platter can be lost without losing all copies of the super-blocks. |
| 126 | Except for the first cylinder group, |
| 127 | the space between the beginning of the cylinder group |
| 128 | and the beginning of the cylinder group information stores data. |
| 129 | .NH 2 |
| 130 | Fragments |
| 131 | .PP |
| 132 | To avoid waste in storing small files, |
| 133 | the file system space allocator divides a single |
| 134 | file system block into one or more |
| 135 | .I "fragments". |
| 136 | The fragmentation of the file system is specified |
| 137 | when the file system is created; |
| 138 | each file system block can be optionally broken into |
| 139 | 2, 4, or 8 addressable fragments. |
| 140 | The lower bound on the size of these fragments is constrained |
| 141 | by the disk sector size; |
| 142 | typically 512 bytes is the lower bound on fragment size. |
| 143 | The block map associated with each cylinder group |
| 144 | records the space availability at the fragment level. |
| 145 | Aligned fragments are examined |
| 146 | to determine block availability. |
| 147 | .PP |
| 148 | On a file system with a block size of 4096 bytes |
| 149 | and a fragment size of 1024 bytes, |
| 150 | a file is represented by zero or more 4096 byte blocks of data, |
| 151 | and possibly a single fragmented block. |
| 152 | If a file system block must be fragmented to obtain |
| 153 | space for a small amount of data, |
| 154 | the remainder of the block is made available for allocation |
| 155 | to other files. |
| 156 | For example, |
| 157 | consider an 11000 byte file stored on |
| 158 | a 4096/1024 byte file system. |
| 159 | This file uses two full size blocks and a 3072 byte fragment. |
| 160 | If no fragments with at least 3072 bytes |
| 161 | are available when the file is created, |
| 162 | a full size block is split yielding the necessary 3072 byte |
| 163 | fragment and an unused 1024 byte fragment. |
| 164 | This remaining fragment can be allocated to another file, as needed. |
| 165 | .NH 2 |
| 166 | Updates to the file system |
| 167 | .PP |
| 168 | Every working day hundreds of files |
| 169 | are created, modified, and removed. |
| 170 | Every time a file is modified, |
| 171 | the operating system performs a |
| 172 | series of file system updates. |
| 173 | These updates, when written on disk, yield a consistent file system. |
| 174 | The file system stages |
| 175 | all modifications of critical information; |
| 176 | modification can |
| 177 | either be completed or cleanly backed out after a crash. |
| 178 | Knowing the information that is first written to the file system, |
| 179 | deterministic procedures can be developed to |
| 180 | repair a corrupted file system. |
| 181 | To understand this process, |
| 182 | the order that the update |
| 183 | requests were being honored must first be understood. |
| 184 | .PP |
| 185 | When a user program does an operation to change the file system, |
| 186 | such as a |
| 187 | .I write , |
| 188 | the data to be written is copied into an internal |
| 189 | .I "in-core" |
| 190 | buffer in the kernel. |
| 191 | Normally, the disk update is handled asynchronously; |
| 192 | the user process is allowed to proceed even though |
| 193 | the data has not yet been written to the disk. |
| 194 | The data, |
| 195 | along with the inode information reflecting the change, |
| 196 | is eventually written out to disk. |
| 197 | The real disk write may not happen until long after the |
| 198 | .I write |
| 199 | system call has returned. |
| 200 | Thus at any given time, the file system, |
| 201 | as it resides on the disk, |
| 202 | lags the state of the file system represented by the in-core information. |
| 203 | .PP |
| 204 | The disk information is updated to reflect the in-core information |
| 205 | when the buffer is required for another use, |
| 206 | when a |
| 207 | .I sync (2) |
| 208 | is done (at 30 second intervals) by |
| 209 | .I "/etc/update" "(8)," |
| 210 | or by manual operator intervention with the |
| 211 | .I sync (8) |
| 212 | command. |
| 213 | If the system is halted without writing out the in-core information, |
| 214 | the file system on the disk will be in an inconsistent state. |
| 215 | .PP |
| 216 | If all updates are done asynchronously, several serious |
| 217 | inconsistencies can arise. |
| 218 | One inconsistency is that a block may be claimed by two inodes. |
| 219 | Such an inconsistency can occur when the system is halted before |
| 220 | the pointer to the block in the old inode has been cleared |
| 221 | in the copy of the old inode on the disk, |
| 222 | and after the pointer to the block in the new inode has been written out |
| 223 | to the copy of the new inode on the disk. |
| 224 | Here, |
| 225 | there is no deterministic method for deciding |
| 226 | which inode should really claim the block. |
| 227 | A similar problem can arise with a multiply claimed inode. |
| 228 | .PP |
| 229 | The problem with asynchronous inode updates |
| 230 | can be avoided by doing all inode deallocations synchronously. |
| 231 | Consequently, |
| 232 | inodes and indirect blocks are written to the disk synchronously |
| 233 | (\fIi.e.\fP the process blocks until the information is |
| 234 | really written to disk) |
| 235 | when they are being deallocated. |
| 236 | Similarly inodes are kept consistent by synchronously |
| 237 | deleting, adding, or changing directory entries. |
| 238 | .ds RH Fixing corrupted file systems |