Commit | Line | Data |
---|---|---|
a38b2411 KM |
1 | .\" Copyright (c) 1982 Regents of the University of California. |
2 | .\" All rights reserved. The Berkeley software License Agreement | |
3 | .\" specifies the terms and conditions for redistribution. | |
4 | .\" | |
145bc69d | 5 | .\" @(#)2.t 4.3 (Berkeley) %G% |
a38b2411 KM |
6 | .\" |
7 | .ds RH Overview of the file system | |
8 | .NH | |
9 | Overview of the file system | |
10 | .PP | |
6a1194d8 | 11 | The file system is discussed in detail in [Mckusick84]; |
a38b2411 KM |
12 | this section gives a brief overview. |
13 | .NH 2 | |
14 | Superblock | |
15 | .PP | |
16 | A file system is described by its | |
17 | .I "super-block" . | |
18 | The super-block is built when the file system is created (\c | |
19 | .I newfs (8)) | |
20 | and never changes. | |
21 | The super-block | |
22 | contains the basic parameters of the file system, | |
23 | such as the number of data blocks it contains | |
24 | and a count of the maximum number of files. | |
25 | Because the super-block contains critical data, | |
26 | .I newfs | |
27 | replicates it to protect against catastrophic loss. | |
28 | The | |
29 | .I "default super block" | |
30 | always resides at a fixed offset from the beginning | |
31 | of the file system's disk partition. | |
32 | The | |
33 | .I "redundant super blocks" | |
34 | are not referenced unless a head crash | |
35 | or other hard disk error causes the default super-block | |
36 | to be unusable. | |
37 | The redundant blocks are sprinkled throughout the disk partition. | |
38 | .PP | |
39 | Within the file system are files. | |
40 | Certain files are distinguished as directories and contain collections | |
41 | of pointers to files that may themselves be directories. | |
42 | Every file has a descriptor associated with it called an | |
43 | .I "inode". | |
44 | The inode contains information describing ownership of the file, | |
45 | time stamps indicating modification and access times for the file, | |
46 | and an array of indices pointing to the data blocks for the file. | |
47 | In this section, | |
48 | we assume that the first 12 blocks | |
49 | of the file are directly referenced by values stored | |
50 | in the inode structure itself\(dg. | |
51 | .FS | |
52 | \(dgThe actual number may vary from system to system, but is usually in | |
53 | the range 5-13. | |
54 | .FE | |
55 | The inode structure may also contain references to indirect blocks | |
56 | containing further data block indices. | |
57 | In a file system with a 4096 byte block size, a singly indirect | |
58 | block contains 1024 further block addresses, | |
59 | a doubly indirect block contains 1024 addresses of further single indirect | |
60 | blocks, | |
61 | and a triply indirect block contains 1024 addresses of further doubly indirect | |
145bc69d | 62 | blocks (the triple indirect block is never needed in practice). |
a38b2411 KM |
63 | .PP |
64 | In order to create files with up to | |
65 | 2\(ua32 bytes, | |
66 | using only two levels of indirection, | |
67 | the minimum size of a file system block is 4096 bytes. | |
68 | The size of file system blocks can be any power of two | |
69 | greater than or equal to 4096. | |
70 | The block size of the file system is maintained in the super-block, | |
71 | so it is possible for file systems of different block sizes | |
72 | to be accessible simultaneously on the same system. | |
73 | The block size must be decided when | |
74 | .I newfs | |
75 | creates the file system; | |
76 | the block size cannot be subsequently | |
77 | changed without rebuilding the file system. | |
78 | .NH 2 | |
79 | Summary information | |
80 | .PP | |
81 | Associated with the super block is non replicated | |
82 | .I "summary information" . | |
83 | The summary information changes | |
84 | as the file system is modified. | |
85 | The summary information contains | |
86 | the number of blocks, fragments, inodes and directories in the file system. | |
87 | .NH 2 | |
88 | Cylinder groups | |
89 | .PP | |
90 | The file system partitions the disk into one or more areas called | |
91 | .I "cylinder groups". | |
92 | A cylinder group is comprised of one or more consecutive | |
93 | cylinders on a disk. | |
94 | Each cylinder group includes inode slots for files, a | |
95 | .I "block map" | |
96 | describing available blocks in the cylinder group, | |
97 | and summary information describing the usage of data blocks | |
98 | within the cylinder group. | |
99 | A fixed number of inodes is allocated for each cylinder group | |
100 | when the file system is created. | |
101 | The current policy is to allocate one inode for each 2048 | |
102 | bytes of disk space; | |
103 | this is expected to be far more inodes than will ever be needed. | |
104 | .PP | |
105 | All the cylinder group bookkeeping information could be | |
106 | placed at the beginning of each cylinder group. | |
107 | However if this approach were used, | |
108 | all the redundant information would be on the top platter. | |
109 | A single hardware failure that destroyed the top platter | |
110 | could cause the loss of all copies of the redundant super-blocks. | |
111 | Thus the cylinder group bookkeeping information | |
112 | begins at a floating offset from the beginning of the cylinder group. | |
113 | The offset for | |
114 | the | |
115 | .I "i+1" st | |
116 | cylinder group is about one track further | |
117 | from the beginning of the cylinder group | |
118 | than it was for the | |
119 | .I "i" th | |
120 | cylinder group. | |
121 | In this way, | |
122 | the redundant | |
123 | information spirals down into the pack; | |
124 | any single track, cylinder, | |
125 | or platter can be lost without losing all copies of the super-blocks. | |
126 | Except for the first cylinder group, | |
127 | the space between the beginning of the cylinder group | |
128 | and the beginning of the cylinder group information stores data. | |
129 | .NH 2 | |
130 | Fragments | |
131 | .PP | |
132 | To avoid waste in storing small files, | |
133 | the file system space allocator divides a single | |
134 | file system block into one or more | |
135 | .I "fragments". | |
136 | The fragmentation of the file system is specified | |
137 | when the file system is created; | |
138 | each file system block can be optionally broken into | |
139 | 2, 4, or 8 addressable fragments. | |
140 | The lower bound on the size of these fragments is constrained | |
141 | by the disk sector size; | |
142 | typically 512 bytes is the lower bound on fragment size. | |
143 | The block map associated with each cylinder group | |
144 | records the space availability at the fragment level. | |
145 | Aligned fragments are examined | |
146 | to determine block availability. | |
147 | .PP | |
148 | On a file system with a block size of 4096 bytes | |
149 | and a fragment size of 1024 bytes, | |
150 | a file is represented by zero or more 4096 byte blocks of data, | |
151 | and possibly a single fragmented block. | |
152 | If a file system block must be fragmented to obtain | |
153 | space for a small amount of data, | |
154 | the remainder of the block is made available for allocation | |
155 | to other files. | |
156 | For example, | |
157 | consider an 11000 byte file stored on | |
158 | a 4096/1024 byte file system. | |
159 | This file uses two full size blocks and a 3072 byte fragment. | |
160 | If no fragments with at least 3072 bytes | |
161 | are available when the file is created, | |
162 | a full size block is split yielding the necessary 3072 byte | |
163 | fragment and an unused 1024 byte fragment. | |
164 | This remaining fragment can be allocated to another file, as needed. | |
165 | .NH 2 | |
166 | Updates to the file system | |
167 | .PP | |
168 | Every working day hundreds of files | |
169 | are created, modified, and removed. | |
170 | Every time a file is modified, | |
171 | the operating system performs a | |
172 | series of file system updates. | |
173 | These updates, when written on disk, yield a consistent file system. | |
174 | The file system stages | |
175 | all modifications of critical information; | |
176 | modification can | |
177 | either be completed or cleanly backed out after a crash. | |
178 | Knowing the information that is first written to the file system, | |
179 | deterministic procedures can be developed to | |
180 | repair a corrupted file system. | |
181 | To understand this process, | |
182 | the order that the update | |
183 | requests were being honored must first be understood. | |
184 | .PP | |
185 | When a user program does an operation to change the file system, | |
186 | such as a | |
187 | .I write , | |
188 | the data to be written is copied into an internal | |
189 | .I "in-core" | |
190 | buffer in the kernel. | |
191 | Normally, the disk update is handled asynchronously; | |
192 | the user process is allowed to proceed even though | |
193 | the data has not yet been written to the disk. | |
194 | The data, | |
195 | along with the inode information reflecting the change, | |
196 | is eventually written out to disk. | |
197 | The real disk write may not happen until long after the | |
198 | .I write | |
199 | system call has returned. | |
200 | Thus at any given time, the file system, | |
201 | as it resides on the disk, | |
202 | lags the state of the file system represented by the in-core information. | |
203 | .PP | |
204 | The disk information is updated to reflect the in-core information | |
205 | when the buffer is required for another use, | |
206 | when a | |
207 | .I sync (2) | |
208 | is done (at 30 second intervals) by | |
209 | .I "/etc/update" "(8)," | |
210 | or by manual operator intervention with the | |
211 | .I sync (8) | |
212 | command. | |
213 | If the system is halted without writing out the in-core information, | |
214 | the file system on the disk will be in an inconsistent state. | |
215 | .PP | |
216 | If all updates are done asynchronously, several serious | |
217 | inconsistencies can arise. | |
218 | One inconsistency is that a block may be claimed by two inodes. | |
219 | Such an inconsistency can occur when the system is halted before | |
220 | the pointer to the block in the old inode has been cleared | |
221 | in the copy of the old inode on the disk, | |
222 | and after the pointer to the block in the new inode has been written out | |
223 | to the copy of the new inode on the disk. | |
224 | Here, | |
225 | there is no deterministic method for deciding | |
226 | which inode should really claim the block. | |
227 | A similar problem can arise with a multiply claimed inode. | |
228 | .PP | |
229 | The problem with asynchronous inode updates | |
230 | can be avoided by doing all inode deallocations synchronously. | |
231 | Consequently, | |
232 | inodes and indirect blocks are written to the disk synchronously | |
233 | (\fIi.e.\fP the process blocks until the information is | |
234 | really written to disk) | |
235 | when they are being deallocated. | |
236 | Similarly inodes are kept consistent by synchronously | |
237 | deleting, adding, or changing directory entries. | |
238 | .ds RH Fixing corrupted file systems |