date and time created 90/05/05 00:01:16 by mckusick
[unix-history] / usr / src / sbin / fsck / SMM.doc / 2.t
CommitLineData
a38b2411
KM
1.\" Copyright (c) 1982 Regents of the University of California.
2.\" All rights reserved. The Berkeley software License Agreement
3.\" specifies the terms and conditions for redistribution.
4.\"
145bc69d 5.\" @(#)2.t 4.3 (Berkeley) %G%
a38b2411
KM
6.\"
7.ds RH Overview of the file system
8.NH
9Overview of the file system
10.PP
6a1194d8 11The file system is discussed in detail in [Mckusick84];
a38b2411
KM
12this section gives a brief overview.
13.NH 2
14Superblock
15.PP
16A file system is described by its
17.I "super-block" .
18The super-block is built when the file system is created (\c
19.I newfs (8))
20and never changes.
21The super-block
22contains the basic parameters of the file system,
23such as the number of data blocks it contains
24and a count of the maximum number of files.
25Because the super-block contains critical data,
26.I newfs
27replicates it to protect against catastrophic loss.
28The
29.I "default super block"
30always resides at a fixed offset from the beginning
31of the file system's disk partition.
32The
33.I "redundant super blocks"
34are not referenced unless a head crash
35or other hard disk error causes the default super-block
36to be unusable.
37The redundant blocks are sprinkled throughout the disk partition.
38.PP
39Within the file system are files.
40Certain files are distinguished as directories and contain collections
41of pointers to files that may themselves be directories.
42Every file has a descriptor associated with it called an
43.I "inode".
44The inode contains information describing ownership of the file,
45time stamps indicating modification and access times for the file,
46and an array of indices pointing to the data blocks for the file.
47In this section,
48we assume that the first 12 blocks
49of the file are directly referenced by values stored
50in the inode structure itself\(dg.
51.FS
52\(dgThe actual number may vary from system to system, but is usually in
53the range 5-13.
54.FE
55The inode structure may also contain references to indirect blocks
56containing further data block indices.
57In a file system with a 4096 byte block size, a singly indirect
58block contains 1024 further block addresses,
59a doubly indirect block contains 1024 addresses of further single indirect
60blocks,
61and a triply indirect block contains 1024 addresses of further doubly indirect
145bc69d 62blocks (the triple indirect block is never needed in practice).
a38b2411
KM
63.PP
64In order to create files with up to
652\(ua32 bytes,
66using only two levels of indirection,
67the minimum size of a file system block is 4096 bytes.
68The size of file system blocks can be any power of two
69greater than or equal to 4096.
70The block size of the file system is maintained in the super-block,
71so it is possible for file systems of different block sizes
72to be accessible simultaneously on the same system.
73The block size must be decided when
74.I newfs
75creates the file system;
76the block size cannot be subsequently
77changed without rebuilding the file system.
78.NH 2
79Summary information
80.PP
81Associated with the super block is non replicated
82.I "summary information" .
83The summary information changes
84as the file system is modified.
85The summary information contains
86the number of blocks, fragments, inodes and directories in the file system.
87.NH 2
88Cylinder groups
89.PP
90The file system partitions the disk into one or more areas called
91.I "cylinder groups".
92A cylinder group is comprised of one or more consecutive
93cylinders on a disk.
94Each cylinder group includes inode slots for files, a
95.I "block map"
96describing available blocks in the cylinder group,
97and summary information describing the usage of data blocks
98within the cylinder group.
99A fixed number of inodes is allocated for each cylinder group
100when the file system is created.
101The current policy is to allocate one inode for each 2048
102bytes of disk space;
103this is expected to be far more inodes than will ever be needed.
104.PP
105All the cylinder group bookkeeping information could be
106placed at the beginning of each cylinder group.
107However if this approach were used,
108all the redundant information would be on the top platter.
109A single hardware failure that destroyed the top platter
110could cause the loss of all copies of the redundant super-blocks.
111Thus the cylinder group bookkeeping information
112begins at a floating offset from the beginning of the cylinder group.
113The offset for
114the
115.I "i+1" st
116cylinder group is about one track further
117from the beginning of the cylinder group
118than it was for the
119.I "i" th
120cylinder group.
121In this way,
122the redundant
123information spirals down into the pack;
124any single track, cylinder,
125or platter can be lost without losing all copies of the super-blocks.
126Except for the first cylinder group,
127the space between the beginning of the cylinder group
128and the beginning of the cylinder group information stores data.
129.NH 2
130Fragments
131.PP
132To avoid waste in storing small files,
133the file system space allocator divides a single
134file system block into one or more
135.I "fragments".
136The fragmentation of the file system is specified
137when the file system is created;
138each file system block can be optionally broken into
1392, 4, or 8 addressable fragments.
140The lower bound on the size of these fragments is constrained
141by the disk sector size;
142typically 512 bytes is the lower bound on fragment size.
143The block map associated with each cylinder group
144records the space availability at the fragment level.
145Aligned fragments are examined
146to determine block availability.
147.PP
148On a file system with a block size of 4096 bytes
149and a fragment size of 1024 bytes,
150a file is represented by zero or more 4096 byte blocks of data,
151and possibly a single fragmented block.
152If a file system block must be fragmented to obtain
153space for a small amount of data,
154the remainder of the block is made available for allocation
155to other files.
156For example,
157consider an 11000 byte file stored on
158a 4096/1024 byte file system.
159This file uses two full size blocks and a 3072 byte fragment.
160If no fragments with at least 3072 bytes
161are available when the file is created,
162a full size block is split yielding the necessary 3072 byte
163fragment and an unused 1024 byte fragment.
164This remaining fragment can be allocated to another file, as needed.
165.NH 2
166Updates to the file system
167.PP
168Every working day hundreds of files
169are created, modified, and removed.
170Every time a file is modified,
171the operating system performs a
172series of file system updates.
173These updates, when written on disk, yield a consistent file system.
174The file system stages
175all modifications of critical information;
176modification can
177either be completed or cleanly backed out after a crash.
178Knowing the information that is first written to the file system,
179deterministic procedures can be developed to
180repair a corrupted file system.
181To understand this process,
182the order that the update
183requests were being honored must first be understood.
184.PP
185When a user program does an operation to change the file system,
186such as a
187.I write ,
188the data to be written is copied into an internal
189.I "in-core"
190buffer in the kernel.
191Normally, the disk update is handled asynchronously;
192the user process is allowed to proceed even though
193the data has not yet been written to the disk.
194The data,
195along with the inode information reflecting the change,
196is eventually written out to disk.
197The real disk write may not happen until long after the
198.I write
199system call has returned.
200Thus at any given time, the file system,
201as it resides on the disk,
202lags the state of the file system represented by the in-core information.
203.PP
204The disk information is updated to reflect the in-core information
205when the buffer is required for another use,
206when a
207.I sync (2)
208is done (at 30 second intervals) by
209.I "/etc/update" "(8),"
210or by manual operator intervention with the
211.I sync (8)
212command.
213If the system is halted without writing out the in-core information,
214the file system on the disk will be in an inconsistent state.
215.PP
216If all updates are done asynchronously, several serious
217inconsistencies can arise.
218One inconsistency is that a block may be claimed by two inodes.
219Such an inconsistency can occur when the system is halted before
220the pointer to the block in the old inode has been cleared
221in the copy of the old inode on the disk,
222and after the pointer to the block in the new inode has been written out
223to the copy of the new inode on the disk.
224Here,
225there is no deterministic method for deciding
226which inode should really claim the block.
227A similar problem can arise with a multiply claimed inode.
228.PP
229The problem with asynchronous inode updates
230can be avoided by doing all inode deallocations synchronously.
231Consequently,
232inodes and indirect blocks are written to the disk synchronously
233(\fIi.e.\fP the process blocks until the information is
234really written to disk)
235when they are being deallocated.
236Similarly inodes are kept consistent by synchronously
237deleting, adding, or changing directory entries.
238.ds RH Fixing corrupted file systems