Commit | Line | Data |
---|---|---|
3b58afdc | 1 | # @(#)README 7.5 (Berkeley) %G% |
a25f8fbf KB |
2 | |
3 | The file system is reasonably stable, but incomplete. There is no cleaner | |
4 | on the 4.4BSD-Alpha tape. Therefore, LFS is currently a "write-once" file | |
5 | system. The cleaner system calls are all implemented and appear to work, | |
6 | although there are places where performance can be improved dramatically | |
7 | (see comments in lfs_syscalls.c). | |
8 | ||
9 | Missing Functionality: | |
10 | We currently do no block accounting when blocks are written. Since | |
11 | allocation is not performed until blocks in the buffer cache are | |
12 | written to disk, it is possible to return success on a write, only | |
13 | to discover later that there is insufficient space get the block | |
14 | on disk. | |
15 | ||
16 | We intend to support multiple block sizes rather than fragments. | |
17 | This is not implemented. | |
18 | ||
19 | Since blocks are laid out contiguously, we can miss rotations reading | |
20 | sequentially. We need to read in contiguous blocks to avoid that. | |
21 | See McVoy's Winter 1991 Usenix paper for details on how to do that. | |
22 | ||
a25f8fbf KB |
23 | ---------- |
24 | Design Details (a more complete design has been submitted to the January 1993 | |
25 | Usenix Conference): | |
cea82ceb KB |
26 | |
27 | The disk is laid out in segments. The first segment starts 8K into the | |
28 | disk (the first 8K is used for boot information). Each segment is composed | |
29 | of the following: | |
30 | ||
31 | An optional super block | |
32 | One or more groups of: | |
33 | segment summary | |
34 | 0 or more data blocks | |
35 | 0 or more inode blocks | |
36 | ||
37 | The segment summary and inode/data blocks start after the super block (if | |
38 | present), and grow toward the end of the segment. | |
39 | ||
40 | _______________________________________________ | |
41 | | | | | | | |
42 | | summary | data/inode | summary | data/inode | | |
43 | | block | blocks | block | blocks | ... | |
44 | |_________|____________|_________|____________| | |
45 | ||
46 | The data/inode blocks following a summary block are described by the | |
47 | summary block. In order to permit the segment to be written in any order | |
48 | and in a forward direction only, a checksum is calculated across the | |
49 | blocks described by the summary. Additionally, the summary is checksummed | |
50 | and timestamped. Both of these are intended for recovery; the former is | |
51 | to make it easy to determine that it *is* a summary block and the latter | |
52 | is to make it easy to determine when recovery is finished for partially | |
a25f8fbf | 53 | written segments. These checksums are also used by the cleaner. |
cea82ceb KB |
54 | |
55 | Summary block (detail) | |
56 | ________________ | |
57 | | sum cksum | | |
58 | | data cksum | | |
59 | | next segment | | |
60 | | timestamp | | |
61 | | FINFO count | | |
62 | | inode count | | |
a25f8fbf | 63 | | flags | |
cea82ceb KB |
64 | |______________| |
65 | | FINFO-1 | 0 or more file info structures, identifying the | |
66 | | . | blocks in the segment. | |
67 | | . | | |
68 | | . | | |
69 | | FINFO-N | | |
70 | | inode-N | | |
71 | | . | | |
72 | | . | | |
73 | | . | 0 or more inode daddr_t's, identifying the inode | |
74 | | inode-1 | blocks in the segment. | |
75 | |______________| | |
76 | ||
77 | Inode blocks are blocks of on-disk inodes in the same format as those in | |
a25f8fbf KB |
78 | the FFS. However, spare[0] contains the inode number of the inode so we |
79 | can find a particular inode on a page. They are packed page_size / | |
80 | sizeof(inode) to a block. Data blocks are exactly as in the FFS. Both | |
81 | inodes and data blocks move around the file system at will. | |
cea82ceb KB |
82 | |
83 | The file system is described by a super-block which is replicated and | |
84 | occurs as the first block of the first and other segments. (The maximum | |
85 | number of super-blocks is MAXNUMSB). Each super-block maintains a list | |
86 | of the disk addresses of all the super-blocks. The super-block maintains | |
87 | a small amount of checkpoint information, essentially just enough to find | |
a25f8fbf | 88 | the inode for the IFILE (fs->lfs_idaddr). |
cea82ceb KB |
89 | |
90 | The IFILE is visible in the file system, as inode number IFILE_INUM. It | |
91 | contains information shared between the kernel and various user processes. | |
92 | ||
93 | Ifile (detail) | |
94 | ________________ | |
95 | | cleaner info | Cleaner information per file system. (Page | |
96 | | | granularity.) | |
97 | |______________| | |
98 | | segment | Space available and last modified times per | |
99 | | usage table | segment. (Page granularity.) | |
100 | |______________| | |
101 | | IFILE-1 | Per inode status information: current version #, | |
102 | | . | if currently allocated, last access time and | |
103 | | . | current disk address of containing inode block. | |
104 | | . | If current disk address is LFS_UNUSED_DADDR, the | |
105 | | IFILE-N | inode is not in use, and it's on the free list. | |
106 | |______________| | |
107 | ||
108 | ||
109 | First Segment at Creation Time: | |
110 | _____________________________________________________________ | |
111 | | | | | | | | | | |
112 | | 8K pad | Super | summary | inode | ifile | root | l + f | | |
113 | | | block | | block | | dir | dir | | |
114 | |________|_______|_________|_______|_______|_______|_______| | |
115 | ^ | |
116 | Segment starts here. | |
33c35d95 KB |
117 | |
118 | Some differences from the Sprite LFS implementation. | |
119 | ||
120 | 1. The LFS implementation placed the ifile metadata and the super block | |
121 | at fixed locations. This implementation replicates the super block | |
122 | and puts each at a fixed location. The checkpoint data is divided into | |
123 | two parts -- just enough information to find the IFILE is stored in | |
cea82ceb KB |
124 | two of the super blocks, although it is not toggled between them as in |
125 | the Sprite implementation. (This was deliberate, to avoid a single | |
126 | point of failure.) The remaining checkpoint information is treated as | |
127 | a regular file, which means that the cleaner info, the segment usage | |
128 | table and the ifile meta-data are stored in normal log segments. | |
129 | (Tastes great, less filling...) | |
130 | ||
131 | 2. The segment layout is radically different in Sprite; this implementation | |
132 | uses something a lot like network framing, where data/inode blocks are | |
133 | written asynchronously, and a checksum is used to validate any set of | |
134 | summary and data/inode blocks. Sprite writes summary blocks synchronously | |
135 | after the data/inode blocks have been written and the existence of the | |
136 | summary block validates the data/inode blocks. This permits us to write | |
137 | everything contiguously, even partial segments and their summaries, whereas | |
138 | Sprite is forced to seek (from the end of the data inode to the summary | |
139 | which lives at the end of the segment). Additionally, writing the summary | |
140 | synchronously should cost about 1/2 a rotation per summary. | |
141 | ||
142 | 3. Sprite LFS distinguishes between different types of blocks in the segment. | |
143 | Other than inode blocks and data blocks, we don't. | |
144 | ||
145 | 4. Sprite LFS traverses the IFILE looking for free blocks. We maintain a | |
146 | free list threaded through the IFILE entries. | |
147 | ||
148 | 5. The cleaner runs in user space, as opposed to kernel space. It shares | |
149 | information with the kernel by reading/writing the IFILE and through | |
150 | cleaner specific system calls. | |
33c35d95 | 151 |