4.4BSD snapshot (revision 8.1); add 1993 to copyright
[unix-history] / usr / src / sbin / fsck / SMM.doc / 2.t
CommitLineData
5c938adf
KB
1.\" Copyright (c) 1982, 1993
2.\" The Regents of the University of California. All rights reserved.
a38b2411 3.\"
3edcb7c8
KB
4.\" %sccs.include.redist.roff%
5.\"
5c938adf 6.\" @(#)2.t 8.1 (Berkeley) %G%
a38b2411
KM
7.\"
8.ds RH Overview of the file system
9.NH
10Overview of the file system
11.PP
6a1194d8 12The file system is discussed in detail in [Mckusick84];
a38b2411
KM
13this section gives a brief overview.
14.NH 2
15Superblock
16.PP
17A file system is described by its
18.I "super-block" .
19The super-block is built when the file system is created (\c
20.I newfs (8))
21and never changes.
22The super-block
23contains the basic parameters of the file system,
24such as the number of data blocks it contains
25and a count of the maximum number of files.
26Because the super-block contains critical data,
27.I newfs
28replicates it to protect against catastrophic loss.
29The
30.I "default super block"
31always resides at a fixed offset from the beginning
32of the file system's disk partition.
33The
34.I "redundant super blocks"
35are not referenced unless a head crash
36or other hard disk error causes the default super-block
37to be unusable.
38The redundant blocks are sprinkled throughout the disk partition.
39.PP
40Within the file system are files.
41Certain files are distinguished as directories and contain collections
42of pointers to files that may themselves be directories.
43Every file has a descriptor associated with it called an
44.I "inode".
45The inode contains information describing ownership of the file,
46time stamps indicating modification and access times for the file,
47and an array of indices pointing to the data blocks for the file.
48In this section,
49we assume that the first 12 blocks
50of the file are directly referenced by values stored
51in the inode structure itself\(dg.
52.FS
53\(dgThe actual number may vary from system to system, but is usually in
54the range 5-13.
55.FE
56The inode structure may also contain references to indirect blocks
57containing further data block indices.
58In a file system with a 4096 byte block size, a singly indirect
59block contains 1024 further block addresses,
60a doubly indirect block contains 1024 addresses of further single indirect
61blocks,
62and a triply indirect block contains 1024 addresses of further doubly indirect
145bc69d 63blocks (the triple indirect block is never needed in practice).
a38b2411
KM
64.PP
65In order to create files with up to
662\(ua32 bytes,
67using only two levels of indirection,
68the minimum size of a file system block is 4096 bytes.
69The size of file system blocks can be any power of two
70greater than or equal to 4096.
71The block size of the file system is maintained in the super-block,
72so it is possible for file systems of different block sizes
73to be accessible simultaneously on the same system.
74The block size must be decided when
75.I newfs
76creates the file system;
77the block size cannot be subsequently
78changed without rebuilding the file system.
79.NH 2
80Summary information
81.PP
82Associated with the super block is non replicated
83.I "summary information" .
84The summary information changes
85as the file system is modified.
86The summary information contains
87the number of blocks, fragments, inodes and directories in the file system.
88.NH 2
89Cylinder groups
90.PP
91The file system partitions the disk into one or more areas called
92.I "cylinder groups".
93A cylinder group is comprised of one or more consecutive
94cylinders on a disk.
95Each cylinder group includes inode slots for files, a
96.I "block map"
97describing available blocks in the cylinder group,
98and summary information describing the usage of data blocks
99within the cylinder group.
100A fixed number of inodes is allocated for each cylinder group
101when the file system is created.
102The current policy is to allocate one inode for each 2048
103bytes of disk space;
104this is expected to be far more inodes than will ever be needed.
105.PP
106All the cylinder group bookkeeping information could be
107placed at the beginning of each cylinder group.
108However if this approach were used,
109all the redundant information would be on the top platter.
110A single hardware failure that destroyed the top platter
111could cause the loss of all copies of the redundant super-blocks.
112Thus the cylinder group bookkeeping information
113begins at a floating offset from the beginning of the cylinder group.
114The offset for
115the
116.I "i+1" st
117cylinder group is about one track further
118from the beginning of the cylinder group
119than it was for the
120.I "i" th
121cylinder group.
122In this way,
123the redundant
124information spirals down into the pack;
125any single track, cylinder,
126or platter can be lost without losing all copies of the super-blocks.
127Except for the first cylinder group,
128the space between the beginning of the cylinder group
129and the beginning of the cylinder group information stores data.
130.NH 2
131Fragments
132.PP
133To avoid waste in storing small files,
134the file system space allocator divides a single
135file system block into one or more
136.I "fragments".
137The fragmentation of the file system is specified
138when the file system is created;
139each file system block can be optionally broken into
1402, 4, or 8 addressable fragments.
141The lower bound on the size of these fragments is constrained
142by the disk sector size;
143typically 512 bytes is the lower bound on fragment size.
144The block map associated with each cylinder group
145records the space availability at the fragment level.
146Aligned fragments are examined
147to determine block availability.
148.PP
149On a file system with a block size of 4096 bytes
150and a fragment size of 1024 bytes,
151a file is represented by zero or more 4096 byte blocks of data,
152and possibly a single fragmented block.
153If a file system block must be fragmented to obtain
154space for a small amount of data,
155the remainder of the block is made available for allocation
156to other files.
157For example,
158consider an 11000 byte file stored on
159a 4096/1024 byte file system.
160This file uses two full size blocks and a 3072 byte fragment.
161If no fragments with at least 3072 bytes
162are available when the file is created,
163a full size block is split yielding the necessary 3072 byte
164fragment and an unused 1024 byte fragment.
165This remaining fragment can be allocated to another file, as needed.
166.NH 2
167Updates to the file system
168.PP
169Every working day hundreds of files
170are created, modified, and removed.
171Every time a file is modified,
172the operating system performs a
173series of file system updates.
174These updates, when written on disk, yield a consistent file system.
175The file system stages
176all modifications of critical information;
177modification can
178either be completed or cleanly backed out after a crash.
179Knowing the information that is first written to the file system,
180deterministic procedures can be developed to
181repair a corrupted file system.
182To understand this process,
183the order that the update
184requests were being honored must first be understood.
185.PP
186When a user program does an operation to change the file system,
187such as a
188.I write ,
189the data to be written is copied into an internal
190.I "in-core"
191buffer in the kernel.
192Normally, the disk update is handled asynchronously;
193the user process is allowed to proceed even though
194the data has not yet been written to the disk.
195The data,
196along with the inode information reflecting the change,
197is eventually written out to disk.
198The real disk write may not happen until long after the
199.I write
200system call has returned.
201Thus at any given time, the file system,
202as it resides on the disk,
203lags the state of the file system represented by the in-core information.
204.PP
205The disk information is updated to reflect the in-core information
206when the buffer is required for another use,
207when a
208.I sync (2)
209is done (at 30 second intervals) by
210.I "/etc/update" "(8),"
211or by manual operator intervention with the
212.I sync (8)
213command.
214If the system is halted without writing out the in-core information,
215the file system on the disk will be in an inconsistent state.
216.PP
217If all updates are done asynchronously, several serious
218inconsistencies can arise.
219One inconsistency is that a block may be claimed by two inodes.
220Such an inconsistency can occur when the system is halted before
221the pointer to the block in the old inode has been cleared
222in the copy of the old inode on the disk,
223and after the pointer to the block in the new inode has been written out
224to the copy of the new inode on the disk.
225Here,
226there is no deterministic method for deciding
227which inode should really claim the block.
228A similar problem can arise with a multiply claimed inode.
229.PP
230The problem with asynchronous inode updates
231can be avoided by doing all inode deallocations synchronously.
232Consequently,
233inodes and indirect blocks are written to the disk synchronously
234(\fIi.e.\fP the process blocks until the information is
235really written to disk)
236when they are being deallocated.
237Similarly inodes are kept consistent by synchronously
238deleting, adding, or changing directory entries.
239.ds RH Fixing corrupted file systems