Commit | Line | Data |
---|---|---|
f55d5308 C |
1 | .de P1 |
2 | .DS | |
3 | .. | |
4 | .de P2 | |
5 | .DE | |
6 | .. | |
7 | .de UL | |
8 | .lg 0 | |
9 | .if n .ul | |
10 | \%\&\\$3\f3\\$1\fR\&\\$2 | |
11 | .lg | |
12 | .. | |
13 | .de UC | |
14 | \&\\$3\s-1\\$1\\s0\&\\$2 | |
15 | .. | |
16 | .de IT | |
17 | .lg 0 | |
18 | .if n .ul | |
19 | \%\&\\$3\f2\\$1\fR\&\\$2 | |
20 | .lg | |
21 | .. | |
22 | .de SP | |
23 | .sp \\$1 | |
24 | .. | |
25 | .hw device | |
26 | .TL | |
27 | UNIX Implementation | |
28 | .AU "MH 2C-523" 2394 | |
29 | K. Thompson | |
30 | .AI | |
31 | .MH | |
32 | .AB | |
33 | This paper describes in high-level terms the | |
34 | implementation of the resident | |
35 | .UX | |
36 | kernel. | |
37 | This discussion is broken into three parts. | |
38 | The first part describes | |
39 | how the | |
40 | .UX | |
41 | system views processes, users, and programs. | |
42 | The second part describes the I/O system. | |
43 | The last part describes the | |
44 | .UX | |
45 | file system. | |
46 | .AE | |
47 | .NH | |
48 | INTRODUCTION | |
49 | .PP | |
50 | The | |
51 | .UX | |
52 | kernel consists of about 10,000 | |
53 | lines of C code and about 1,000 lines of assembly code. | |
54 | The assembly code can be further broken down into | |
55 | 200 lines included for | |
56 | the sake of efficiency | |
57 | (they could have been written in C) | |
58 | and 800 lines to perform hardware | |
59 | functions not possible in C. | |
60 | .PP | |
61 | This code represents 5 to 10 percent of what has | |
62 | been lumped into the broad expression | |
63 | ``the | |
64 | .UX | |
65 | operating system.'' | |
66 | The kernel is the only | |
67 | .UX | |
68 | code that | |
69 | cannot be substituted by a user to his | |
70 | own liking. | |
71 | For this reason, | |
72 | the kernel should make as few real | |
73 | decisions as possible. | |
74 | This does not mean to allow the user | |
75 | a million options to do the same thing. | |
76 | Rather, it means to allow only one way to | |
77 | do one thing, | |
78 | but have that way be the least-common divisor | |
79 | of all the options that might have been provided. | |
80 | .PP | |
81 | What is or is not implemented in the kernel | |
82 | represents both a great responsibility and a great power. | |
83 | It is a soap-box platform on | |
84 | ``the way things should be done.'' | |
85 | Even so, if | |
86 | ``the way'' is too radical, | |
87 | no one will follow it. | |
88 | Every important decision was weighed | |
89 | carefully. | |
90 | Throughout, | |
91 | simplicity has been substituted for efficiency. | |
92 | Complex algorithms are used only if | |
93 | their complexity can be localized. | |
94 | .NH | |
95 | PROCESS CONTROL | |
96 | .PP | |
97 | In the | |
98 | .UX | |
99 | system, | |
100 | a user executes programs in an | |
101 | environment called a user process. | |
102 | When a system function is required, | |
103 | the user process calls the system | |
104 | as a subroutine. | |
105 | At some point in this call, | |
106 | there is a distinct switch of environments. | |
107 | After this, | |
108 | the process is said to be a system process. | |
109 | In the normal definition of processes, | |
110 | the user and system processes are different | |
111 | phases of the same process | |
112 | (they never execute simultaneously). | |
113 | For protection, | |
114 | each system process has its own stack. | |
115 | .PP | |
116 | The user process may execute | |
117 | from a read-only text segment, | |
118 | which is shared by all processes | |
119 | executing the same code. | |
120 | There is no | |
121 | .IT functional | |
122 | benefit | |
123 | from shared-text segments. | |
124 | An | |
125 | .IT efficiency | |
126 | benefit comes from the fact | |
127 | that there is no need to swap read-only | |
128 | segments out because the original | |
129 | copy on secondary memory is still current. | |
130 | This is a great benefit to interactive | |
131 | programs that tend to be swapped while | |
132 | waiting for terminal input. | |
133 | Furthermore, | |
134 | if two processes are | |
135 | executing | |
136 | simultaneously | |
137 | from the same copy of a read-only segment, | |
138 | only one copy needs to reside in | |
139 | primary memory. | |
140 | This is a secondary effect, | |
141 | because | |
142 | simultaneous execution of a program | |
143 | is not common. | |
144 | It is ironic that this effect, | |
145 | which reduces the use of primary memory, | |
146 | only comes into play when there is | |
147 | an overabundance of primary memory, | |
148 | that is, | |
149 | when there is enough memory | |
150 | to keep waiting processes loaded. | |
151 | .PP | |
152 | All current read-only text segments in the | |
153 | system are maintained from the | |
154 | .IT "text table" . | |
155 | A text table entry holds the location of the | |
156 | text segment on secondary memory. | |
157 | If the segment is loaded, | |
158 | that table also holds the primary memory location | |
159 | and the count of the number of processes | |
160 | sharing this entry. | |
161 | When this count is reduced to zero, | |
162 | the entry is freed along with any | |
163 | primary and secondary memory holding the segment. | |
164 | When a process first executes a shared-text segment, | |
165 | a text table entry is allocated and the | |
166 | segment is loaded onto secondary memory. | |
167 | If a second process executes a text segment | |
168 | that is already allocated, | |
169 | the entry reference count is simply incremented. | |
170 | .PP | |
171 | A user process has some strictly private | |
172 | read-write data | |
173 | contained in its | |
174 | data segment. | |
175 | As far as possible, | |
176 | the system does not use the user's | |
177 | data segment to hold system data. | |
178 | In particular, | |
179 | there are no I/O buffers in the | |
180 | user address space. | |
181 | .PP | |
182 | The user data segment has two growing boundaries. | |
183 | One, increased automatically by the system | |
184 | as a result of memory faults, | |
185 | is used for a stack. | |
186 | The second boundary is only grown (or shrunk) by | |
187 | explicit requests. | |
188 | The contents of newly allocated primary memory | |
189 | is initialized to zero. | |
190 | .PP | |
191 | Also associated and swapped with | |
192 | a process is a small fixed-size | |
193 | system data segment. | |
194 | This segment contains all | |
195 | the data about the process | |
196 | that the system needs only when the | |
197 | process is active. | |
198 | Examples of the kind of data contained | |
199 | in the system data segment are: | |
200 | saved central processor registers, | |
201 | open file descriptors, | |
202 | accounting information, | |
203 | scratch data area, | |
204 | and the stack for the system phase | |
205 | of the process. | |
206 | The system data segment is not | |
207 | addressable from the user process | |
208 | and is therefore protected. | |
209 | .PP | |
210 | Last, | |
211 | there is a process table with | |
212 | one entry per process. | |
213 | This entry contains all the data | |
214 | needed by the system when the process | |
215 | is | |
216 | .IT not | |
217 | active. | |
218 | Examples are | |
219 | the process's name, | |
220 | the location of the other segments, | |
221 | and scheduling information. | |
222 | The process table entry is allocated | |
223 | when the process is created, and freed | |
224 | when the process terminates. | |
225 | This process entry is always directly | |
226 | addressable by the kernel. | |
227 | .PP | |
228 | Figure 1 shows the relationships | |
229 | between the various process control | |
230 | data. | |
231 | In a sense, | |
232 | the process table is the | |
233 | definition of all processes, | |
234 | because | |
235 | all the data associated with a process | |
236 | may be accessed | |
237 | starting from the process table entry. | |
238 | .KS | |
239 | .sp 2.44i | |
240 | .sp 2v | |
241 | .ce | |
242 | Fig. 1\(emProcess control data structure. | |
243 | .KE | |
244 | .NH 2 | |
245 | Process creation and program execution | |
246 | .PP | |
247 | Processes are created by the system primitive | |
248 | .UL fork . | |
249 | The newly created process (child) is a copy of the original process (parent). | |
250 | There is no detectable sharing of primary memory between the two processes. | |
251 | (Of course, | |
252 | if the parent process was executing from a read-only | |
253 | text segment, | |
254 | the child will share the text segment.) | |
255 | Copies of all writable data segments | |
256 | are made for the child process. | |
257 | Files that were open before the | |
258 | .UL fork | |
259 | are | |
260 | truly shared after the | |
261 | .UL fork . | |
262 | The processes are informed as to their part in the | |
263 | relationship to | |
264 | allow them to select their own | |
265 | (usually non-identical) | |
266 | destiny. | |
267 | The parent may | |
268 | .UL wait | |
269 | for the termination of | |
270 | any of its children. | |
271 | .PP | |
272 | A process may | |
273 | .UL exec | |
274 | a file. | |
275 | This consists of exchanging the current text and data | |
276 | segments of the process for new text and data | |
277 | segments specified in the file. | |
278 | The old segments are lost. | |
279 | Doing an | |
280 | .UL exec | |
281 | does | |
282 | .IT not | |
283 | change processes; | |
284 | the process that did the | |
285 | .UL exec | |
286 | persists, | |
287 | but | |
288 | after the | |
289 | .UL exec | |
290 | it is executing a different program. | |
291 | Files that were open | |
292 | before the | |
293 | .UL exec | |
294 | remain open after the | |
295 | .UL exec . | |
296 | .PP | |
297 | If a program, | |
298 | say the first pass of a compiler, | |
299 | wishes to overlay itself with another program, | |
300 | say the second pass, | |
301 | then it simply | |
302 | .UL exec s | |
303 | the second program. | |
304 | This is analogous | |
305 | to a ``goto.'' | |
306 | If a program wishes to regain control | |
307 | after | |
308 | .UL exec ing | |
309 | a second program, | |
310 | it should | |
311 | .UL fork | |
312 | a child process, | |
313 | have the child | |
314 | .UL exec | |
315 | the second program, and | |
316 | have the parent | |
317 | .UL wait | |
318 | for the child. | |
319 | This is analogous to a ``call.'' | |
320 | Breaking up the call into a binding followed by | |
321 | a transfer is similar to the subroutine linkage in | |
322 | SL-5. | |
323 | .[ | |
324 | griswold hanson sl5 overview | |
325 | .] | |
326 | .NH 2 | |
327 | Swapping | |
328 | .PP | |
329 | The major data associated with a process | |
330 | (the user data segment, | |
331 | the system data segment, and | |
332 | the text segment) | |
333 | are swapped to and from secondary | |
334 | memory, as needed. | |
335 | The user data segment and the system data segment | |
336 | are kept in contiguous primary memory to reduce | |
337 | swapping latency. | |
338 | (When low-latency devices, such as bubbles, | |
339 | .UC CCD s, | |
340 | or scatter/gather devices, | |
341 | are used, | |
342 | this decision will have to be reconsidered.) | |
343 | Allocation of both primary | |
344 | and secondary memory is performed | |
345 | by the same simple first-fit algorithm. | |
346 | When a process grows, | |
347 | a new piece of primary memory is allocated. | |
348 | The contents of the old memory is copied to the new memory. | |
349 | The old memory is freed | |
350 | and the tables are updated. | |
351 | If there is not enough primary memory, | |
352 | secondary memory is allocated instead. | |
353 | The process is swapped out onto the | |
354 | secondary memory, | |
355 | ready to be swapped in with | |
356 | its new size. | |
357 | .PP | |
358 | One separate process in the kernel, | |
359 | the swapping process, | |
360 | simply swaps the other | |
361 | processes in and out of primary memory. | |
362 | It examines the | |
363 | process table looking for a process | |
364 | that is swapped out and is | |
365 | ready to run. | |
366 | It allocates primary memory for that | |
367 | process and | |
368 | reads its segments into | |
369 | primary memory, where that process competes for the | |
370 | central processor with other loaded processes. | |
371 | If no primary memory is available, | |
372 | the swapping process makes memory available | |
373 | by examining the process table for processes | |
374 | that can be swapped out. | |
375 | It selects a process to swap out, | |
376 | writes it to secondary memory, | |
377 | frees the primary memory, | |
378 | and then goes back to look for a process | |
379 | to swap in. | |
380 | .PP | |
381 | Thus there are two specific algorithms | |
382 | to the swapping process. | |
383 | Which of the possibly many processes that | |
384 | are swapped out is to be swapped in? | |
385 | This is decided by secondary storage residence | |
386 | time. | |
387 | The one with the longest time out is swapped in first. | |
388 | There is a slight penalty for larger processes. | |
389 | Which of the possibly many processes that | |
390 | are loaded is to be swapped out? | |
391 | Processes that are waiting for slow events | |
392 | (i.e., not currently running or waiting for | |
393 | disk I/O) | |
394 | are picked first, | |
395 | by age in primary memory, | |
396 | again with size penalties. | |
397 | The other processes are examined | |
398 | by the same age algorithm, | |
399 | but are not taken out unless they are | |
400 | at least of some age. | |
401 | This adds | |
402 | hysteresis to the swapping and | |
403 | prevents total thrashing. | |
404 | .PP | |
405 | These swapping algorithms are the | |
406 | most suspect in the system. | |
407 | With limited primary memory, | |
408 | these algorithms cause total swapping. | |
409 | This is not bad in itself, because | |
410 | the swapping does not impact the | |
411 | execution of the resident processes. | |
412 | However, if the swapping device must | |
413 | also be used for file storage, | |
414 | the swapping traffic severely | |
415 | impacts the file system traffic. | |
416 | It is exactly these small systems | |
417 | that tend to double usage of limited disk | |
418 | resources. | |
419 | .NH 2 | |
420 | Synchronization and scheduling | |
421 | .PP | |
422 | Process synchronization is accomplished by having processes | |
423 | wait for events. | |
424 | Events are represented by arbitrary integers. | |
425 | By convention, | |
426 | events are chosen to be addresses of | |
427 | tables associated with those events. | |
428 | For example, a process that is waiting for | |
429 | any of its children to terminate will wait | |
430 | for an event that is the address of | |
431 | its own process table entry. | |
432 | When a process terminates, | |
433 | it signals the event represented by | |
434 | its parent's process table entry. | |
435 | Signaling an event on which no process | |
436 | is waiting has no effect. | |
437 | Similarly, | |
438 | signaling an event on which many processes | |
439 | are waiting will wake all of them up. | |
440 | This differs considerably from | |
441 | Dijkstra's P and V | |
442 | synchronization operations, | |
443 | .[ | |
444 | dijkstra sequential processes 1968 | |
445 | .] | |
446 | in that | |
447 | no memory is associated with events. | |
448 | Thus there need be no allocation of events | |
449 | prior to their use. | |
450 | Events exist simply by being used. | |
451 | .PP | |
452 | On the negative side, | |
453 | because there is no memory associated with events, | |
454 | no notion of ``how much'' | |
455 | can be signaled via the event mechanism. | |
456 | For example, | |
457 | processes that want memory might | |
458 | wait on an event associated with | |
459 | memory allocation. | |
460 | When any amount of memory becomes available, | |
461 | the event would be signaled. | |
462 | All the competing processes would then wake | |
463 | up to fight over the new memory. | |
464 | (In reality, | |
465 | the swapping process is the only process | |
466 | that waits for primary memory to become available.) | |
467 | .PP | |
468 | If an event occurs | |
469 | between the time a process decides | |
470 | to wait for that event and the | |
471 | time that process enters the wait state, | |
472 | then | |
473 | the process will wait on an event that has | |
474 | already happened (and may never happen again). | |
475 | This race condition happens because there is no memory associated with | |
476 | the event to indicate that the event has occurred; | |
477 | the only action of an event is to change a set of processes | |
478 | from wait state to run state. | |
479 | This problem is relieved largely | |
480 | by the fact that process switching can | |
481 | only occur in the kernel by explicit calls | |
482 | to the event-wait mechanism. | |
483 | If the event in question is signaled by another | |
484 | process, | |
485 | then there is no problem. | |
486 | But if the event is signaled by a hardware | |
487 | interrupt, | |
488 | then special care must be taken. | |
489 | These synchronization races pose the biggest | |
490 | problem when | |
491 | .UX | |
492 | is adapted to multiple-processor configurations. | |
493 | .[ | |
494 | hawley meyer multiprocessing unix | |
495 | .] | |
496 | .PP | |
497 | The event-wait code in the kernel | |
498 | is like a co-routine linkage. | |
499 | At any time, | |
500 | all but one of the processes has called event-wait. | |
501 | The remaining process is the one currently executing. | |
502 | When it calls event-wait, | |
503 | a process whose event has been signaled | |
504 | is selected and that process | |
505 | returns from its call to event-wait. | |
506 | .PP | |
507 | Which of the runable processes is to run next? | |
508 | Associated with each process is a priority. | |
509 | The priority of a system process is assigned by the code | |
510 | issuing the wait on an event. | |
511 | This is roughly equivalent to the response | |
512 | that one would expect on such an event. | |
513 | Disk events have high priority, | |
514 | teletype events are low, | |
515 | and time-of-day events are very low. | |
516 | (From observation, | |
517 | the difference in system process priorities | |
518 | has little or no performance impact.) | |
519 | All user-process priorities are lower than the | |
520 | lowest system priority. | |
521 | User-process priorities are assigned | |
522 | by an algorithm based on the | |
523 | recent ratio of the amount of compute time to real time consumed | |
524 | by the process. | |
525 | A process that has used a lot of | |
526 | compute time in the last real-time | |
527 | unit is assigned a low user priority. | |
528 | Because interactive processes are characterized | |
529 | by low ratios of compute to real time, | |
530 | interactive response is maintained without any | |
531 | special arrangements. | |
532 | .PP | |
533 | The scheduling algorithm simply picks | |
534 | the process with the highest priority, | |
535 | thus | |
536 | picking all system processes first and | |
537 | user processes second. | |
538 | The compute-to-real-time ratio is updated | |
539 | every second. | |
540 | Thus, | |
541 | all other things being equal, | |
542 | looping user processes will be | |
543 | scheduled round-robin with a | |
544 | 1-second quantum. | |
545 | A high-priority process waking up will | |
546 | preempt a running, low-priority process. | |
547 | The scheduling algorithm has a very desirable | |
548 | negative feedback character. | |
549 | If a process uses its high priority | |
550 | to hog the computer, | |
551 | its priority will drop. | |
552 | At the same time, if a low-priority | |
553 | process is ignored for a long time, | |
554 | its priority will rise. | |
555 | .NH | |
556 | I/O SYSTEM | |
557 | .PP | |
558 | The I/O system | |
559 | is broken into two completely separate systems: | |
560 | the block I/O system and the character I/O system. | |
561 | In retrospect, | |
562 | the names should have been ``structured I/O'' | |
563 | and ``unstructured I/O,'' respectively; | |
564 | while the term ``block I/O'' has some meaning, | |
565 | ``character I/O'' is a complete misnomer. | |
566 | .PP | |
567 | Devices are characterized by a major device number, | |
568 | a minor device number, and | |
569 | a class (block or character). | |
570 | For each class, | |
571 | there is an array of entry points into the device drivers. | |
572 | The major device number is used to index the array | |
573 | when calling the code for a particular device driver. | |
574 | The minor device number is passed to the | |
575 | device driver as an argument. | |
576 | The minor number has no significance other | |
577 | than that attributed to it by the driver. | |
578 | Usually, | |
579 | the driver uses the minor number to access | |
580 | one of several identical physical devices. | |
581 | .PP | |
582 | The use of the array of entry points | |
583 | (configuration table) | |
584 | as the only connection between the | |
585 | system code and the device drivers is | |
586 | very important. | |
587 | Early versions of the system had a much | |
588 | less formal connection with the drivers, | |
589 | so that it was extremely hard to handcraft | |
590 | differently configured systems. | |
591 | Now it is possible to create new | |
592 | device drivers in an average of a few hours. | |
593 | The configuration table in most cases | |
594 | is created automatically by a program | |
595 | that reads the system's parts list. | |
596 | .NH 2 | |
597 | Block I/O system | |
598 | .PP | |
599 | The model block I/O device consists | |
600 | of randomly addressed, secondary | |
601 | memory blocks of 512 bytes each. | |
602 | The blocks are uniformly addressed | |
603 | 0, 1, .\|.\|. up to the size of the device. | |
604 | The block device driver has the job of | |
605 | emulating this model on a | |
606 | physical device. | |
607 | .PP | |
608 | The block I/O devices are accessed | |
609 | through a layer of buffering software. | |
610 | The system maintains a list of buffers | |
611 | (typically between 10 and 70) | |
612 | each assigned a device name and | |
613 | a device address. | |
614 | This buffer pool constitutes a data cache | |
615 | for the block devices. | |
616 | On a read request, | |
617 | the cache is searched for the desired block. | |
618 | If the block is found, | |
619 | the data are made available to the | |
620 | requester without any physical I/O. | |
621 | If the block is not in the cache, | |
622 | the least recently used block in the cache is renamed, | |
623 | the correct device driver is called to | |
624 | fill up the renamed buffer, and then the | |
625 | data are made available. | |
626 | Write requests are handled in an analogous manner. | |
627 | The correct buffer is found | |
628 | and relabeled if necessary. | |
629 | The write is performed simply by marking | |
630 | the buffer as ``dirty.'' | |
631 | The physical I/O is then deferred until | |
632 | the buffer is renamed. | |
633 | .PP | |
634 | The benefits in reduction of physical I/O | |
635 | of this scheme are substantial, | |
636 | especially considering the file system implementation. | |
637 | There are, | |
638 | however, | |
639 | some drawbacks. | |
640 | The asynchronous nature of the | |
641 | algorithm makes error reporting | |
642 | and meaningful user error handling | |
643 | almost impossible. | |
644 | The cavalier approach to I/O error | |
645 | handling in the | |
646 | .UX | |
647 | system is partly due to the asynchronous | |
648 | nature of the block I/O system. | |
649 | A second problem is in the delayed writes. | |
650 | If the system stops unexpectedly, | |
651 | it is almost certain that there is a | |
652 | lot of logically complete, | |
653 | but physically incomplete, | |
654 | I/O in the buffers. | |
655 | There is a system primitive to | |
656 | flush all outstanding I/O activity | |
657 | from the buffers. | |
658 | Periodic use of this primitive helps, | |
659 | but does not solve, the problem. | |
660 | Finally, | |
661 | the associativity in the buffers | |
662 | can alter the physical I/O sequence | |
663 | from that of the logical I/O sequence. | |
664 | This means that there are times | |
665 | when data structures on disk are inconsistent, | |
666 | even though the software is careful | |
667 | to perform I/O in the correct order. | |
668 | On non-random devices, | |
669 | notably magnetic tape, | |
670 | the inversions of writes can be disastrous. | |
671 | The problem with magnetic tapes is ``cured'' by | |
672 | allowing only one outstanding write request | |
673 | per drive. | |
674 | .NH 2 | |
675 | Character I/O system | |
676 | .PP | |
677 | The character I/O system consists of all | |
678 | devices that do not fall into the block I/O model. | |
679 | This includes the ``classical'' character devices | |
680 | such as communications lines, paper tape, and | |
681 | line printers. | |
682 | It also includes magnetic tape and disks when | |
683 | they are not used in a stereotyped way, | |
684 | for example, 80-byte physical records on tape | |
685 | and track-at-a-time disk copies. | |
686 | In short, | |
687 | the character I/O interface | |
688 | means ``everything other than block.'' | |
689 | I/O requests from the user are sent to the | |
690 | device driver essentially unaltered. | |
691 | The implementation of these requests is, of course, | |
692 | up to the device driver. | |
693 | There are guidelines and conventions | |
694 | to help the implementation of | |
695 | certain types of device drivers. | |
696 | .NH 3 | |
697 | Disk drivers | |
698 | .PP | |
699 | Disk drivers are implemented | |
700 | with a queue of transaction records. | |
701 | Each record holds a read/write flag, | |
702 | a primary memory address, | |
703 | a secondary memory address, and | |
704 | a transfer byte count. | |
705 | Swapping is accomplished by passing | |
706 | such a record to the swapping device driver. | |
707 | The block I/O interface is implemented by | |
708 | passing such records with requests to | |
709 | fill and empty system buffers. | |
710 | The character I/O interface to the disk | |
711 | drivers create a transaction record that | |
712 | points directly into the user area. | |
713 | The routine that creates this record also insures | |
714 | that the user is not swapped during this | |
715 | I/O transaction. | |
716 | Thus by implementing the general disk driver, | |
717 | it is possible to use the disk | |
718 | as a block device, | |
719 | a character device, and a swap device. | |
720 | The only really disk-specific code in normal | |
721 | disk drivers is the pre-sort of transactions to | |
722 | minimize latency for a particular device, and | |
723 | the actual issuing of the I/O request. | |
724 | .NH 3 | |
725 | Character lists | |
726 | .PP | |
727 | Real character-oriented devices may | |
728 | be implemented using the common | |
729 | code to handle character lists. | |
730 | A character list is a queue of characters. | |
731 | One routine puts a character on a queue. | |
732 | Another gets a character from a queue. | |
733 | It is also possible to ask how many | |
734 | characters are currently on a queue. | |
735 | Storage for all queues in the system comes | |
736 | from a single common pool. | |
737 | Putting a character on a queue will allocate | |
738 | space from the common pool and link the | |
739 | character onto the data structure defining the queue. | |
740 | Getting a character from a queue returns | |
741 | the corresponding space to the pool. | |
742 | .PP | |
743 | A typical character-output device | |
744 | (paper tape punch, for example) | |
745 | is implemented by passing characters | |
746 | from the user onto a character queue until | |
747 | some maximum number of characters is on the queue. | |
748 | The I/O is prodded to start as | |
749 | soon as there is anything on the queue | |
750 | and, once started, | |
751 | it is sustained by hardware completion interrupts. | |
752 | Each time there is a completion interrupt, | |
753 | the driver gets the next character from the queue | |
754 | and sends it to the hardware. | |
755 | The number of characters on the queue is checked and, | |
756 | as the count falls through some intermediate level, | |
757 | an event (the queue address) is signaled. | |
758 | The process that is passing characters from | |
759 | the user to the queue can be waiting on the event, and | |
760 | refill the queue to its maximum | |
761 | when the event occurs. | |
762 | .PP | |
763 | A typical character input device | |
764 | (for example, a paper tape reader) | |
765 | is handled in a very similar manner. | |
766 | .PP | |
767 | Another class of character devices is the terminals. | |
768 | A terminal is represented by three | |
769 | character queues. | |
770 | There are two input queues (raw and canonical) | |
771 | and an output queue. | |
772 | Characters going to the output of a terminal | |
773 | are handled by common code exactly as described | |
774 | above. | |
775 | The main difference is that there is also code | |
776 | to interpret the output stream as | |
777 | .UC ASCII | |
778 | characters and to perform some translations, | |
779 | e.g., escapes for deficient terminals. | |
780 | Another common aspect of terminals is code | |
781 | to insert real-time delay after certain control characters. | |
782 | .PP | |
783 | Input on terminals is a little different. | |
784 | Characters are collected from the terminal and | |
785 | placed on a raw input queue. | |
786 | Some device-dependent code conversion and | |
787 | escape interpretation is handled here. | |
788 | When a line is complete in the raw queue, | |
789 | an event is signaled. | |
790 | The code catching this signal then copies a | |
791 | line from the raw queue to a canonical queue | |
792 | performing the character erase and line kill editing. | |
793 | User read requests on terminals can be | |
794 | directed at either the raw or canonical queues. | |
795 | .NH 3 | |
796 | Other character devices | |
797 | .PP | |
798 | Finally, | |
799 | there are devices that fit no general category. | |
800 | These devices are set up as character I/O drivers. | |
801 | An example is a driver that reads and writes | |
802 | unmapped primary memory as an I/O device. | |
803 | Some devices are too | |
804 | fast to be treated a character at time, | |
805 | but do not fit the disk I/O mold. | |
806 | Examples are fast communications lines and | |
807 | fast line printers. | |
808 | These devices either have their own buffers | |
809 | or ``borrow'' block I/O buffers for a while and | |
810 | then give them back. | |
811 | .NH | |
812 | THE FILE SYSTEM | |
813 | .PP | |
814 | In the | |
815 | .UX | |
816 | system, | |
817 | a file is a (one-dimensional) array of bytes. | |
818 | No other structure of files is implied by the | |
819 | system. | |
820 | Files are attached anywhere | |
821 | (and possibly multiply) | |
822 | onto a hierarchy of directories. | |
823 | Directories are simply files that | |
824 | users cannot write. | |
825 | For a further discussion | |
826 | of the external view of files and directories, | |
827 | see Ref.\0 | |
828 | .[ | |
829 | ritchie thompson unix bstj 1978 | |
830 | %Q This issue | |
831 | .]. | |
832 | .PP | |
833 | The | |
834 | .UX | |
835 | file system is a disk data structure | |
836 | accessed completely through | |
837 | the block I/O system. | |
838 | As stated before, | |
839 | the canonical view of a ``disk'' is | |
840 | a randomly addressable array of | |
841 | 512-byte blocks. | |
842 | A file system breaks the disk into | |
843 | four self-identifying regions. | |
844 | The first block (address 0) | |
845 | is unused by the file system. | |
846 | It is left aside for booting procedures. | |
847 | The second block (address 1) | |
848 | contains the so-called ``super-block.'' | |
849 | This block, | |
850 | among other things, | |
851 | contains the size of the disk and | |
852 | the boundaries of the other regions. | |
853 | Next comes the i-list, | |
854 | a list of file definitions. | |
855 | Each file definition is | |
856 | a 64-byte structure, called an i-node. | |
857 | The offset of a particular i-node | |
858 | within the i-list is called its i-number. | |
859 | The combination of device name | |
860 | (major and minor numbers) and i-number | |
861 | serves to uniquely name a particular file. | |
862 | After the i-list, | |
863 | and to the end of the disk, | |
864 | come free storage blocks that | |
865 | are available for the contents of files. | |
866 | .PP | |
867 | The free space on a disk is maintained | |
868 | by a linked list of available disk blocks. | |
869 | Every block in this chain contains a disk address | |
870 | of the next block in the chain. | |
871 | The remaining space contains the address of up to | |
872 | 50 disk blocks that are also free. | |
873 | Thus with one I/O operation, | |
874 | the system obtains 50 free blocks and a | |
875 | pointer where to find more. | |
876 | The disk allocation algorithms are | |
877 | very straightforward. | |
878 | Since all allocation is in fixed-size | |
879 | blocks and there is strict accounting of | |
880 | space, | |
881 | there is no need to compact or garbage collect. | |
882 | However, | |
883 | as disk space becomes dispersed, | |
884 | latency gradually increases. | |
885 | Some installations choose to occasionally compact | |
886 | disk space to reduce latency. | |
887 | .PP | |
888 | An i-node contains 13 disk addresses. | |
889 | The first 10 of these addresses point directly at | |
890 | the first 10 blocks of a file. | |
891 | If a file is larger than 10 blocks (5,120 bytes), | |
892 | then the eleventh address points at a block | |
893 | that contains the addresses of the next 128 blocks of the file. | |
894 | If the file is still larger than this | |
895 | (70,656 bytes), | |
896 | then the twelfth block points at up to 128 blocks, | |
897 | each pointing to 128 blocks of the file. | |
898 | Files yet larger | |
899 | (8,459,264 bytes) | |
900 | use the thirteenth address for a ``triple indirect'' address. | |
901 | The algorithm ends here with the maximum file size | |
902 | of 1,082,201,087 bytes. | |
903 | .PP | |
904 | A logical directory hierarchy is added | |
905 | to this flat physical structure simply | |
906 | by adding a new type of file, the directory. | |
907 | A directory is accessed exactly as an ordinary file. | |
908 | It contains 16-byte entries consisting of | |
909 | a 14-byte name and an i-number. | |
910 | The root of the hierarchy is at a known i-number | |
911 | (\fIviz.,\fR 2). | |
912 | The file system structure allows an arbitrary, directed graph | |
913 | of directories with regular files linked in | |
914 | at arbitrary places in this graph. | |
915 | In fact, | |
916 | very early | |
917 | .UX | |
918 | systems used such a structure. | |
919 | Administration of such a structure became so | |
920 | chaotic that later systems were restricted | |
921 | to a directory tree. | |
922 | Even now, | |
923 | with regular files linked multiply | |
924 | into arbitrary places in the tree, | |
925 | accounting for space has become a problem. | |
926 | It may become necessary to restrict the entire | |
927 | structure to a tree, | |
928 | and allow a new form of linking that | |
929 | is subservient to the tree structure. | |
930 | .PP | |
931 | The file system allows | |
932 | easy creation, | |
933 | easy removal, | |
934 | easy random accessing, | |
935 | and very easy space allocation. | |
936 | With most physical addresses confined | |
937 | to a small contiguous section of disk, | |
938 | it is also easy to dump, restore, and | |
939 | check the consistency of the file system. | |
940 | Large files suffer from indirect addressing, | |
941 | but the cache prevents most of the implied physical I/O | |
942 | without adding much execution. | |
943 | The space overhead properties of this scheme are quite good. | |
944 | For example, | |
945 | on one particular file system, | |
946 | there are 25,000 files containing 130M bytes of data-file content. | |
947 | The overhead (i-node, indirect blocks, and last block breakage) | |
948 | is about 11.5M bytes. | |
949 | The directory structure to support these files | |
950 | has about 1,500 directories containing 0.6M bytes of directory content | |
951 | and about 0.5M bytes of overhead in accessing the directories. | |
952 | Added up any way, | |
953 | this comes out to less than a 10 percent overhead for actual | |
954 | stored data. | |
955 | Most systems have this much overhead in | |
956 | padded trailing blanks alone. | |
957 | .NH 2 | |
958 | File system implementation | |
959 | .PP | |
960 | Because the i-node defines a file, | |
961 | the implementation of the file system centers | |
962 | around access to the i-node. | |
963 | The system maintains a table of all active | |
964 | i-nodes. | |
965 | As a new file is accessed, | |
966 | the system locates the corresponding i-node, | |
967 | allocates an i-node table entry, and reads | |
968 | the i-node into primary memory. | |
969 | As in the buffer cache, | |
970 | the table entry is considered to be the current | |
971 | version of the i-node. | |
972 | Modifications to the i-node are made to | |
973 | the table entry. | |
974 | When the last access to the i-node goes | |
975 | away, | |
976 | the table entry is copied back to the | |
977 | secondary store i-list and the table entry is freed. | |
978 | .PP | |
979 | All I/O operations on files are carried out | |
980 | with the aid of the corresponding i-node table entry. | |
981 | The accessing of a file is a straightforward | |
982 | implementation of the algorithms mentioned previously. | |
983 | The user is not aware of i-nodes and i-numbers. | |
984 | References to the file system are made in terms of | |
985 | path names of the directory tree. | |
986 | Converting a path name into an i-node table entry | |
987 | is also straightforward. | |
988 | Starting at some known i-node | |
989 | (the root or the current directory of some process), | |
990 | the next component of the path name is | |
991 | searched by reading the directory. | |
992 | This gives an i-number and an implied device | |
993 | (that of the directory). | |
994 | Thus the next i-node table entry can be accessed. | |
995 | If that was the last component of the path name, | |
996 | then this i-node is the result. | |
997 | If not, | |
998 | this i-node is the directory needed to look up | |
999 | the next component of the path name, and the | |
1000 | algorithm is repeated. | |
1001 | .PP | |
1002 | The user process accesses the file system with | |
1003 | certain primitives. | |
1004 | The most common of these are | |
1005 | .UL open , | |
1006 | .UL create , | |
1007 | .UL read , | |
1008 | .UL write , | |
1009 | .UL seek , | |
1010 | and | |
1011 | .UL close . | |
1012 | The data structures maintained are shown in Fig. 2. | |
1013 | .KS | |
1014 | .sp 22P | |
1015 | .ce | |
1016 | Fig. 2\(emFile system data structure. | |
1017 | .KE | |
1018 | In the system data segment associated with a user, | |
1019 | there is room for some (usually between 10 and 50) open files. | |
1020 | This open file table consists of pointers that can be used to access | |
1021 | corresponding i-node table entries. | |
1022 | Associated with each of these open files is | |
1023 | a current I/O pointer. | |
1024 | This is a byte offset of | |
1025 | the next read/write operation on the file. | |
1026 | The system treats each read/write request | |
1027 | as random with an implied seek to the | |
1028 | I/O pointer. | |
1029 | The user usually thinks of the file as | |
1030 | sequential with the I/O pointer | |
1031 | automatically counting the number of bytes | |
1032 | that have been read/written from the file. | |
1033 | The user may, | |
1034 | of course, | |
1035 | perform random I/O by setting the I/O pointer | |
1036 | before reads/writes. | |
1037 | .PP | |
1038 | With file sharing, | |
1039 | it is necessary to allow related | |
1040 | processes to share a common I/O pointer | |
1041 | and yet have separate I/O pointers | |
1042 | for independent processes | |
1043 | that access the same file. | |
1044 | With these two conditions, | |
1045 | the I/O pointer cannot reside | |
1046 | in the i-node table nor can | |
1047 | it reside in the list of | |
1048 | open files for the process. | |
1049 | A new table | |
1050 | (the open file table) | |
1051 | was invented for the sole purpose | |
1052 | of holding the I/O pointer. | |
1053 | Processes that share the same open | |
1054 | file | |
1055 | (the result of | |
1056 | .UL fork s) | |
1057 | share a common open file table entry. | |
1058 | A separate open of the same file will | |
1059 | only share the i-node table entry, | |
1060 | but will have distinct open file table entries. | |
1061 | .PP | |
1062 | The main file system primitives are implemented as follows. | |
1063 | .UL \&open | |
1064 | converts a file system path name into an i-node | |
1065 | table entry. | |
1066 | A pointer to the i-node table entry is placed in a | |
1067 | newly created open file table entry. | |
1068 | A pointer to the file table entry is placed in the | |
1069 | system data segment for the process. | |
1070 | .UL \&create | |
1071 | first creates a new i-node entry, | |
1072 | writes the i-number into a directory, and | |
1073 | then builds the same structure as for an | |
1074 | .UL open . | |
1075 | .UL \&read | |
1076 | and | |
1077 | .UL write | |
1078 | just access the i-node entry as described above. | |
1079 | .UL \&seek | |
1080 | simply manipulates the I/O pointer. | |
1081 | No physical seeking is done. | |
1082 | .UL \&close | |
1083 | just frees the structures built by | |
1084 | .UL open | |
1085 | and | |
1086 | .UL create . | |
1087 | Reference counts are kept on the open file table entries and | |
1088 | the i-node table entries to free these structures after | |
1089 | the last reference goes away. | |
1090 | .UL \&unlink | |
1091 | simply decrements the count of the | |
1092 | number of directories pointing at the given i-node. | |
1093 | When the last reference to an i-node table entry | |
1094 | goes away, | |
1095 | if the i-node has no directories pointing to it, | |
1096 | then the file is removed and the i-node is freed. | |
1097 | This delayed removal of files prevents | |
1098 | problems arising from removing active files. | |
1099 | A file may be removed while still open. | |
1100 | The resulting unnamed file vanishes | |
1101 | when the file is closed. | |
1102 | This is a method of obtaining temporary files. | |
1103 | .PP | |
1104 | There is a type of unnamed | |
1105 | .UC FIFO | |
1106 | file called a | |
1107 | .UL pipe. | |
1108 | Implementation of | |
1109 | .UL pipe s | |
1110 | consists of implied | |
1111 | .UL seek s | |
1112 | before each | |
1113 | .UL read | |
1114 | or | |
1115 | .UL write | |
1116 | in order to implement | |
1117 | first-in-first-out. | |
1118 | There are also checks and synchronization | |
1119 | to prevent the | |
1120 | writer from grossly outproducing the | |
1121 | reader and to prevent the reader from | |
1122 | overtaking the writer. | |
1123 | .NH 2 | |
1124 | Mounted file systems | |
1125 | .PP | |
1126 | The file system of a | |
1127 | .UX | |
1128 | system | |
1129 | starts with some designated block device | |
1130 | formatted as described above to contain | |
1131 | a hierarchy. | |
1132 | The root of this structure is the root of | |
1133 | the | |
1134 | .UX | |
1135 | file system. | |
1136 | A second formatted block device may be | |
1137 | mounted | |
1138 | at any leaf of | |
1139 | the current hierarchy. | |
1140 | This logically extends the current hierarchy. | |
1141 | The implementation of | |
1142 | mounting | |
1143 | is trivial. | |
1144 | A mount table is maintained containing | |
1145 | pairs of designated leaf i-nodes and | |
1146 | block devices. | |
1147 | When converting a path name into an i-node, | |
1148 | a check is made to see if the new i-node is a | |
1149 | designated leaf. | |
1150 | If it is, | |
1151 | the i-node of the root | |
1152 | of the block device replaces it. | |
1153 | .PP | |
1154 | Allocation of space for a file is taken | |
1155 | from the free pool on the device on which the | |
1156 | file lives. | |
1157 | Thus a file system consisting of many | |
1158 | mounted devices does not have a common pool of | |
1159 | free secondary storage space. | |
1160 | This separation of space on different | |
1161 | devices is necessary to allow easy | |
1162 | unmounting | |
1163 | of a device. | |
1164 | .NH 2 | |
1165 | Other system functions | |
1166 | .PP | |
1167 | There are some other things that the system | |
1168 | does for the user\-a | |
1169 | little accounting, | |
1170 | a little tracing/debugging, | |
1171 | and a little access protection. | |
1172 | Most of these things are not very | |
1173 | well developed | |
1174 | because our use of the system in computing science research | |
1175 | does not need them. | |
1176 | There are some features that are missed in some | |
1177 | applications, for example, better inter-process communication. | |
1178 | .PP | |
1179 | The | |
1180 | .UX | |
1181 | kernel is an I/O multiplexer more than | |
1182 | a complete operating system. | |
1183 | This is as it should be. | |
1184 | Because of this outlook, | |
1185 | many features are | |
1186 | found in most | |
1187 | other operating systems that are missing from the | |
1188 | .UX | |
1189 | kernel. | |
1190 | For example, | |
1191 | the | |
1192 | .UX | |
1193 | kernel does not support | |
1194 | file access methods, | |
1195 | file disposition, | |
1196 | file formats, | |
1197 | file maximum size, | |
1198 | spooling, | |
1199 | command language, | |
1200 | logical records, | |
1201 | physical records, | |
1202 | assignment of logical file names, | |
1203 | logical file names, | |
1204 | more than one character set, | |
1205 | an operator's console, | |
1206 | an operator, | |
1207 | log-in, | |
1208 | or log-out. | |
1209 | Many of these things are symptoms rather than features. | |
1210 | Many of these things are implemented | |
1211 | in user software | |
1212 | using the kernel as a tool. | |
1213 | A good example of this is the command language. | |
1214 | .[ | |
1215 | bourne shell 1978 bstj | |
1216 | %Q This issue | |
1217 | .] | |
1218 | Each user may have his own command language. | |
1219 | Maintenance of such code is as easy as | |
1220 | maintaining user code. | |
1221 | The idea of implementing ``system'' code with general | |
1222 | user primitives | |
1223 | comes directly from | |
1224 | .UC MULTICS . | |
1225 | .[ | |
1226 | organick multics 1972 | |
1227 | .] | |
1228 | .LP | |
1229 | .[ | |
1230 | $LIST$ | |
1231 | .] |