Commit | Line | Data |
---|---|---|
f55d5308 C |
1 | .TL |
2 | The UNIX I/O System | |
3 | .AU | |
4 | Dennis M. Ritchie | |
5 | .AI | |
6 | .MH | |
7 | .PP | |
8 | This paper gives an overview of the workings of the UNIX\(dg | |
9 | .FS | |
10 | \(dgUNIX is a Trademark of Bell Laboratories. | |
11 | .FE | |
12 | I/O system. | |
13 | It was written with an eye toward providing | |
14 | guidance to writers of device driver routines, | |
15 | and is oriented more toward describing the environment | |
16 | and nature of device drivers than the implementation | |
17 | of that part of the file system which deals with | |
18 | ordinary files. | |
19 | .PP | |
20 | It is assumed that the reader has a good knowledge | |
21 | of the overall structure of the file system as discussed | |
22 | in the paper ``The UNIX Time-sharing System.'' | |
23 | A more detailed discussion | |
24 | appears in | |
25 | ``UNIX Implementation;'' | |
26 | the current document restates parts of that one, | |
27 | but is still more detailed. | |
28 | It is most useful in | |
29 | conjunction with a copy of the system code, | |
30 | since it is basically an exegesis of that code. | |
31 | .SH | |
32 | Device Classes | |
33 | .PP | |
34 | There are two classes of device: | |
35 | .I block | |
36 | and | |
37 | .I character. | |
38 | The block interface is suitable for devices | |
39 | like disks, tapes, and DECtape | |
40 | which work, or can work, with addressable 512-byte blocks. | |
41 | Ordinary magnetic tape just barely fits in this category, | |
42 | since by use of forward | |
43 | and | |
44 | backward spacing any block can be read, even though | |
45 | blocks can be written only at the end of the tape. | |
46 | Block devices can at least potentially contain a mounted | |
47 | file system. | |
48 | The interface to block devices is very highly structured; | |
49 | the drivers for these devices share a great many routines | |
50 | as well as a pool of buffers. | |
51 | .PP | |
52 | Character-type devices have a much | |
53 | more straightforward interface, although | |
54 | more work must be done by the driver itself. | |
55 | .PP | |
56 | Devices of both types are named by a | |
57 | .I major | |
58 | and a | |
59 | .I minor | |
60 | device number. | |
61 | These numbers are generally stored as an integer | |
62 | with the minor device number | |
63 | in the low-order 8 bits and the major device number | |
64 | in the next-higher 8 bits; | |
65 | macros | |
66 | .I major | |
67 | and | |
68 | .I minor | |
69 | are available to access these numbers. | |
70 | The major device number selects which driver will deal with | |
71 | the device; the minor device number is not used | |
72 | by the rest of the system but is passed to the | |
73 | driver at appropriate times. | |
74 | Typically the minor number | |
75 | selects a subdevice attached to | |
76 | a given controller, or one of | |
77 | several similar hardware interfaces. | |
78 | .PP | |
79 | The major device numbers for block and character devices | |
80 | are used as indices in separate tables; | |
81 | they both start at 0 and therefore overlap. | |
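The packing of major and minor numbers described above can be sketched in C. This is only an illustration of the stated layout (minor number in the low-order 8 bits, major number in the next-higher 8 bits), not the system's own header. The names `dev_num`, `dev_major`, `dev_minor`, and `dev_make` are hypothetical, chosen so as not to clash with the real `major` and `minor` macros.

```c
#include <assert.h>

/* Hypothetical type for a packed device number (at least 16 bits). */
typedef unsigned short dev_num;

/* Major number: the 8 bits just above the low-order byte. */
unsigned dev_major(dev_num d)
{
    return (d >> 8) & 0377;
}

/* Minor number: the low-order 8 bits. */
unsigned dev_minor(dev_num d)
{
    return d & 0377;
}

/* Combine a major and a minor number into one device number. */
dev_num dev_make(unsigned maj, unsigned min)
{
    assert(maj <= 0377 && min <= 0377);
    return (dev_num)((maj << 8) | min);
}
```

Under these assumptions, `dev_make(3, 5)` yields a number whose major part selects driver 3 and whose minor part, 5, is simply handed to that driver.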
82 | .SH | |
83 | Overview of I/O | |
84 | .PP | |
85 | The purpose of | |
86 | the | |
87 | .I open | |
88 | and | |
89 | .I creat | |
90 | system calls is to set up entries in three separate | |
91 | system tables. | |
92 | The first of these is the | |
93 | .I u_ofile | |
94 | table, | |
95 | which is stored in the system's per-process | |
96 | data area | |
97 | .I u. | |
98 | This table is indexed by | |
99 | the file descriptor returned by the | |
100 | .I open | |
101 | or | |
102 | .I creat, | |
103 | and is accessed during | |
104 | a | |
105 | .I read, | |
106 | .I write, | |
107 | or other operation on the open file. | |
108 | An entry contains only | |
109 | a pointer to the corresponding | |
110 | entry of the | |
111 | .I file | |
112 | table, | |
113 | which is a per-system data base. | |
114 | There is one entry in the | |
115 | .I file | |
116 | table for each | |
117 | instance of | |
118 | .I open | |
119 | or | |
120 | .I creat. | |
121 | This table is per-system because the same instance | |
122 | of an open file must be shared among the several processes | |
123 | which can result from | |
124 | .I forks | |
125 | after the file is opened. | |
126 | A | |
127 | .I file | |
128 | table entry contains | |
129 | flags which indicate whether the file | |
130 | was open for reading or writing or is a pipe, and | |
131 | a count which is used to decide when all processes | |
132 | using the entry have terminated or closed the file | |
133 | (so the entry can be abandoned). | |
134 | There is also a 32-bit file offset | |
135 | which is used to indicate where in the file the next read | |
136 | or write will take place. | |
137 | Finally, there is a pointer to the | |
138 | entry for the file in the | |
139 | .I inode | |
140 | table, | |
141 | which contains a copy of the file's i-node. | |
142 | .PP | |
143 | Certain open files can be designated ``multiplexed'' | |
144 | files, and several other flags apply to such | |
145 | channels. | |
146 | In such a case, instead of an offset, | |
147 | there is a pointer to an associated multiplex channel table. | |
148 | Multiplex channels will not be discussed here. | |
149 | .PP | |
150 | An entry in the | |
151 | .I file | |
152 | table corresponds precisely to an instance of | |
153 | .I open | |
154 | or | |
155 | .I creat; | |
156 | if the same file is opened several times, | |
157 | it will have several | |
158 | entries in this table. | |
159 | However, | |
160 | there is at most one entry | |
161 | in the | |
162 | .I inode | |
163 | table for a given file. | |
164 | Also, a file may enter the | |
165 | .I inode | |
166 | table not only because it is open, | |
167 | but also because it is the current directory | |
168 | of some process or because it | |
169 | is a special file containing a currently-mounted | |
170 | file system. | |
171 | .PP | |
172 | An entry in the | |
173 | .I inode | |
174 | table differs somewhat from the | |
175 | corresponding i-node as stored on the disk; | |
176 | the modified and accessed times are not stored, | |
177 | and the entry is augmented | |
178 | by a flag word containing information about the entry, | |
179 | a count used to determine when it may be | |
180 | allowed to disappear, | |
181 | and the device and i-number | |
182 | whence the entry came. | |
183 | Also, the several block numbers that give addressing | |
184 | information for the file are expanded from | |
185 | the 3-byte, compressed format used on the disk to full | |
186 | .I long | |
187 | quantities. | |
188 | .PP | |
189 | During the processing of an | |
190 | .I open | |
191 | or | |
192 | .I creat | |
193 | call for a special file, | |
194 | the system always calls the device's | |
195 | .I open | |
196 | routine to allow for any special processing | |
197 | required (rewinding a tape, turning on | |
198 | the data-terminal-ready lead of a modem, etc.). | |
199 | However, | |
200 | the | |
201 | .I close | |
202 | routine is called only when the last | |
203 | process closes a file, | |
204 | that is, when the i-node table entry | |
205 | is being deallocated. | |
206 | Thus it is not feasible | |
207 | for a device to maintain, or depend on, | |
208 | a count of its users, although it is quite | |
209 | possible to | |
210 | implement an exclusive-use device which cannot | |
211 | be reopened until it has been closed. | |
212 | .PP | |
213 | When a | |
214 | .I read | |
215 | or | |
216 | .I write | |
217 | takes place, | |
218 | the user's arguments | |
219 | and the | |
220 | .I file | |
221 | table entry are used to set up the | |
222 | variables | |
223 | .I u.u_base, | |
224 | .I u.u_count, | |
225 | and | |
226 | .I u.u_offset | |
227 | which respectively contain the (user) address | |
228 | of the I/O target area, the byte-count for the transfer, | |
229 | and the current location in the file. | |
230 | If the file referred to is | |
231 | a character-type special file, the appropriate read | |
232 | or write routine is called; it is responsible | |
233 | for transferring data and updating the | |
234 | count and current location appropriately | |
235 | as discussed below. | |
236 | Otherwise, the current location is used to calculate | |
237 | a logical block number in the file. | |
238 | If the file is an ordinary file the logical block | |
239 | number must be mapped (possibly using indirect blocks) | |
240 | to a physical block number; a block-type | |
241 | special file need not be mapped. | |
242 | This mapping is performed by the | |
243 | .I bmap | |
244 | routine. | |
245 | In any event, the resulting physical block number | |
246 | is used, as discussed below, to | |
247 | read or write the appropriate device. | |
248 | .SH | |
249 | Character Device Drivers | |
250 | .PP | |
251 | The | |
252 | .I cdevsw | |
253 | table specifies the interface routines present for | |
254 | character devices. | |
255 | Each device provides five routines: | |
256 | open, close, read, write, and special-function | |
257 | (to implement the | |
258 | .I ioctl | |
259 | system call). | |
260 | Any of these may be missing. | |
261 | If a call on the routine | |
262 | should be ignored, | |
263 | (e.g. | |
264 | .I open | |
265 | on non-exclusive devices that require no setup) | |
266 | the | |
267 | .I cdevsw | |
268 | entry can be given as | |
269 | .I nulldev; | |
270 | if it should be considered an error, | |
271 | (e.g. | |
272 | .I write | |
273 | on read-only devices) | |
274 | .I nodev | |
275 | is used. | |
276 | For terminals, | |
277 | the | |
278 | .I cdevsw | |
279 | structure also contains a pointer to the | |
280 | .I tty | |
281 | structure associated with the terminal. | |
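A switch table of this kind might be sketched as follows. This is a simplified illustration, not the system's declaration: every name beginning with `sk_` is hypothetical, each routine takes only the device number, the error cell `sk_error` stands in for the per-process error indication, and the error code value is an assumption.

```c
#include <assert.h>

#define SK_ENODEV 19          /* assumed error code for "no such device" */

int sk_error;                 /* stands in for the per-process error cell */
int sk_written;               /* counts calls on the sample write routine */

/* Call is to be ignored: do nothing, report success. */
int sk_nulldev(int dev) { (void)dev; return 0; }

/* Call is an error: record it and fail. */
int sk_nodev(int dev) { (void)dev; sk_error = SK_ENODEV; return -1; }

/* Sample write routine for a write-only device. */
int sk_lpwrite(int dev) { (void)dev; sk_written++; return 0; }

/* One entry per major device number: five interface routines. */
struct sk_cdevsw {
    int (*d_open)(int dev);
    int (*d_close)(int dev);
    int (*d_read)(int dev);
    int (*d_write)(int dev);
    int (*d_ioctl)(int dev);
};

struct sk_cdevsw sk_table[] = {
    /* major 0: open and close ignored, read and ioctl are errors */
    { sk_nulldev, sk_nulldev, sk_nodev, sk_lpwrite, sk_nodev },
};

/* Dispatch a write through the table, indexed by major number. */
int sk_cwrite(int dev)
{
    unsigned maj = ((unsigned)dev >> 8) & 0377;
    assert(maj < sizeof sk_table / sizeof sk_table[0]);
    return (*sk_table[maj].d_write)(dev);
}

/* Dispatch a read the same way. */
int sk_cread(int dev)
{
    unsigned maj = ((unsigned)dev >> 8) & 0377;
    assert(maj < sizeof sk_table / sizeof sk_table[0]);
    return (*sk_table[maj].d_read)(dev);
}
```

The point of the two placeholder routines is that the dispatching code never needs to test for a missing entry; it always calls through the table.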
282 | .PP | |
283 | The | |
284 | .I open | |
285 | routine is called each time the file | |
286 | is opened with the full device number as argument. | |
287 | The second argument is a flag which is | |
288 | non-zero only if the device is to be written upon. | |
289 | .PP | |
290 | The | |
291 | .I close | |
292 | routine is called only when the file | |
293 | is closed for the last time, | |
294 | that is, when the very last process in | |
295 | which the file is open closes it. | |
296 | This means it is not possible for the driver to | |
297 | maintain its own count of its users. | |
298 | The first argument is the device number; | |
299 | the second is a flag which is non-zero | |
300 | if the file was open for writing in the process which | |
301 | performs the final | |
302 | .I close. | |
303 | .PP | |
304 | When | |
305 | .I write | |
306 | is called, it is supplied the device | |
307 | as argument. | |
308 | The per-user variable | |
309 | .I u.u_count | |
310 | has been set to | |
311 | the number of characters indicated by the user; | |
312 | for character devices, this number may be 0 | |
313 | initially. | |
314 | .I u.u_base | |
315 | is the address supplied by the user from which to start | |
316 | taking characters. | |
317 | The system may call the | |
318 | routine internally, so the | |
319 | flag | |
320 | .I u.u_segflg | |
321 | is supplied that indicates, | |
322 | if | |
323 | .I on, | |
324 | that | |
325 | .I u.u_base | |
326 | refers to the system address space instead of | |
327 | the user's. | |
328 | .PP | |
329 | The | |
330 | .I write | |
331 | routine | |
332 | should copy up to | |
333 | .I u.u_count | |
334 | characters from the user's buffer to the device, | |
335 | decrementing | |
336 | .I u.u_count | |
337 | for each character passed. | |
338 | For most drivers, which work one character at a time, | |
339 | the routine | |
340 | .I "cpass( )" | |
341 | is used to pick up characters | |
342 | from the user's buffer. | |
343 | Successive calls on it return | |
344 | the characters to be written until | |
345 | .I u.u_count | |
346 | goes to 0 or an error occurs, | |
347 | when it returns \(mi1. | |
348 | .I Cpass | |
349 | takes care of interrogating | |
350 | .I u.u_segflg | |
351 | and updating | |
352 | .I u.u_count. | |
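The character-at-a-time write loop described above might look roughly like the sketch below. It is a user-level simulation, not driver code: `sk_u` is a hypothetical stand-in for the relevant part of the user area, `sk_cpass` imitates only the counting behavior of `cpass` (it ignores `u.u_segflg` and error handling), and `sk_dev_out` plays the part of the device.

```c
#include <assert.h>

/* Hypothetical stand-in for the fields of the user area used here. */
struct sk_user {
    const char *u_base;   /* next character to take from the caller */
    int u_count;          /* characters remaining in the request */
} sk_u;

/* Imitates cpass: next character, or -1 when the count reaches 0. */
int sk_cpass(void)
{
    assert(sk_u.u_count >= 0);
    if (sk_u.u_count == 0)
        return -1;
    sk_u.u_count--;
    return (unsigned char)*sk_u.u_base++;
}

char sk_dev_out[64];      /* simulated device: characters "written" */
int sk_dev_len;

/* Sketch of a driver write routine that works one character at a time. */
void sk_devwrite(void)
{
    int c;

    /* Check for room first, so no character is consumed and then lost. */
    while (sk_dev_len < (int)sizeof sk_dev_out && (c = sk_cpass()) >= 0)
        sk_dev_out[sk_dev_len++] = (char)c;
}
```

Because the count is decremented inside the helper, the write routine itself needs no bookkeeping beyond the loop.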
353 | .PP | |
354 | Write routines which want to transfer | |
355 | a probably large number of characters into an internal | |
356 | buffer may also use the routine | |
357 | .I "iomove(buffer, offset, count, flag)" | |
358 | which is faster when many characters must be moved. | |
359 | .I Iomove | |
360 | transfers up to | |
361 | .I count | |
362 | characters into the | |
363 | .I buffer | |
364 | starting | |
365 | .I offset | |
366 | bytes from the start of the buffer; | |
367 | .I flag | |
368 | should be | |
369 | .I B_WRITE | |
370 | (which is 0) in the write case. | |
371 | Caution: | |
372 | the caller is responsible for making sure | |
373 | the count is not too large and is non-zero. | |
374 | As an efficiency note, | |
375 | .I iomove | |
376 | is much slower if any of | |
377 | .I "buffer+offset, count" | |
378 | or | |
379 | .I u.u_base | |
380 | is odd. | |
381 | .PP | |
382 | The device's | |
383 | .I read | |
384 | routine is called under conditions similar to | |
385 | .I write, | |
386 | except that | |
387 | .I u.u_count | |
388 | is guaranteed to be non-zero. | |
389 | To return characters to the user, the routine | |
390 | .I "passc(c)" | |
391 | is available; it takes care of housekeeping | |
392 | like | |
393 | .I cpass | |
394 | and returns \(mi1 as the last character | |
395 | specified by | |
396 | .I u.u_count | |
397 | is returned to the user; | |
398 | before that time, 0 is returned. | |
399 | .I Iomove | |
400 | is also usable as with | |
401 | .I write; | |
402 | the flag should be | |
403 | .I B_READ | |
404 | but the same cautions apply. | |
405 | .PP | |
406 | The ``special-functions'' routine | |
407 | is invoked by the | |
408 | .I stty | |
409 | and | |
410 | .I gtty | |
411 | system calls as follows: | |
412 | .I "(*p) (dev, v)" | |
413 | where | |
414 | .I p | |
415 | is a pointer to the device's routine, | |
416 | .I dev | |
417 | is the device number, | |
418 | and | |
419 | .I v | |
420 | is a vector. | |
421 | In the | |
422 | .I gtty | |
423 | case, | |
424 | the device is supposed to place up to 3 words of status information | |
425 | into the vector; this will be returned to the caller. | |
426 | In the | |
427 | .I stty | |
428 | case, | |
429 | .I v | |
430 | is 0; | |
431 | the device should take up to 3 words of | |
432 | control information from | |
433 | the array | |
434 | .I "u.u_arg[0...2]." | |
435 | .PP | |
436 | Finally, each device should have appropriate interrupt-time | |
437 | routines. | |
438 | When an interrupt occurs, it is turned into a C-compatible call | |
439 | on the device's interrupt routine. | |
440 | The interrupt-catching mechanism makes | |
441 | the low-order four bits of the ``new PS'' word in the | |
442 | trap vector for the interrupt available | |
443 | to the interrupt handler. | |
444 | This is conventionally used by drivers | |
445 | which deal with multiple similar devices | |
446 | to encode the minor device number. | |
447 | After the interrupt has been processed, | |
448 | a return from the interrupt handler will | |
449 | return from the interrupt itself. | |
450 | .PP | |
451 | A number of subroutines are available which are useful | |
452 | to character device drivers. | |
453 | Most of these handlers, for example, need a place | |
454 | to buffer characters in the internal interface | |
455 | between their ``top half'' (read/write) | |
456 | and ``bottom half'' (interrupt) routines. | |
457 | For relatively low data-rate devices, the best mechanism | |
458 | is the character queue maintained by the | |
459 | routines | |
460 | .I getc | |
461 | and | |
462 | .I putc. | |
463 | A queue header has the structure | |
464 | .DS | |
465 | struct { | |
466 | int c_cc; /* character count */ | |
467 | char *c_cf; /* first character */ | |
468 | char *c_cl; /* last character */ | |
469 | } queue; | |
470 | .DE | |
471 | A character is placed on the end of a queue by | |
472 | .I "putc(c, &queue)" | |
473 | where | |
474 | .I c | |
475 | is the character and | |
476 | .I queue | |
477 | is the queue header. | |
478 | The routine returns \(mi1 if there is no space | |
479 | to put the character, 0 otherwise. | |
480 | The first character on the queue may be retrieved | |
481 | by | |
482 | .I "getc(&queue)" | |
483 | which returns either the (non-negative) character | |
484 | or \(mi1 if the queue is empty. | |
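The calling convention of the queue routines can be illustrated with a small stand-alone sketch. It reproduces only the interface described above (0 or \-1 from the put routine, a non-negative character or \-1 from the get routine). The storage scheme is an assumption made for brevity: a private fixed-size ring per queue, whereas the real routines draw on space shared among all devices. The names `sk_queue`, `sk_putc`, and `sk_getc` are hypothetical.

```c
#include <assert.h>

#define SK_QSIZE 8            /* assumed capacity of this sketch queue */

struct sk_queue {
    int c_cc;                 /* character count */
    int c_head;               /* index of the first character */
    char c_buf[SK_QSIZE];     /* private storage (an assumption) */
};

/* Place c on the end of the queue; -1 if there is no space, else 0. */
int sk_putc(int c, struct sk_queue *q)
{
    assert(q != 0);
    if (q->c_cc >= SK_QSIZE)
        return -1;
    q->c_buf[(q->c_head + q->c_cc) % SK_QSIZE] = (char)c;
    q->c_cc++;
    return 0;
}

/* Remove and return the first character, or -1 if the queue is empty. */
int sk_getc(struct sk_queue *q)
{
    int c;

    assert(q != 0);
    if (q->c_cc == 0)
        return -1;
    c = (unsigned char)q->c_buf[q->c_head];
    q->c_head = (q->c_head + 1) % SK_QSIZE;
    q->c_cc--;
    return c;
}
```

A top-half write routine would typically call the put routine until it fails, and the interrupt routine would drain the queue with the get routine.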
485 | .PP | |
486 | Notice that the space for characters in queues is | |
487 | shared among all devices in the system | |
488 | and in the standard system there are only some 600 | |
489 | character slots available. | |
490 | Thus device handlers, | |
491 | especially write routines, must take | |
492 | care to avoid gobbling up excessive numbers of characters. | |
493 | .PP | |
494 | The other major help available | |
495 | to device handlers is the sleep-wakeup mechanism. | |
496 | The call | |
497 | .I "sleep(event, priority)" | |
498 | causes the process to wait (allowing other processes to run) | |
499 | until the | |
500 | .I event | |
501 | occurs; | |
502 | at that time, the process is marked ready-to-run | |
503 | and the call will return when there is no | |
504 | process with higher | |
505 | .I priority. | |
506 | .PP | |
507 | The call | |
508 | .I "wakeup(event)" | |
509 | indicates that the | |
510 | .I event | |
511 | has happened, that is, causes processes sleeping | |
512 | on the event to be awakened. | |
513 | The | |
514 | .I event | |
515 | is an arbitrary quantity agreed upon | |
516 | by the sleeper and the waker-up. | |
517 | By convention, it is the address of some data area used | |
518 | by the driver, which guarantees that events | |
519 | are unique. | |
520 | .PP | |
521 | Processes sleeping on an event should not assume | |
522 | that the event has really happened; | |
523 | they should check that the conditions which | |
524 | caused them to sleep no longer hold. | |
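The rule that a sleeper must re-check its condition amounts to calling the sleep routine inside a loop rather than after a single test. The sketch below simulates this at user level. `sk_sleep` is a hypothetical stand-in that merely returns, and it is rigged so that the first return is a false alarm and only the second corresponds to the condition clearing.

```c
#include <assert.h>

int sk_busy = 1;      /* the condition that makes the caller wait */
int sk_sleeps;        /* how many times the sketch "slept" */

/*
 * Stand-in for sleep(event, priority).  It returns after a wakeup,
 * which may or may not mean that the condition has changed: here
 * only the second return is genuine.
 */
void sk_sleep(void *event)
{
    assert(event != 0);
    if (++sk_sleeps == 2)
        sk_busy = 0;
}

/* Wait until the device is idle, re-testing after every wakeup. */
void sk_wait_idle(void)
{
    while (sk_busy)
        sk_sleep(&sk_busy);   /* event: address of the data waited on */
}
```

Had the caller used `if` instead of `while`, it would have proceeded after the first, spurious return with the device still busy.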
525 | .PP | |
526 | Priorities can range from 0 to 127; | |
527 | a higher numerical value indicates a less-favored | |
528 | scheduling situation. | |
529 | A distinction is made between processes sleeping | |
530 | at priority less than the parameter | |
531 | .I PZERO | |
532 | and those at numerically larger priorities. | |
533 | The former cannot | |
534 | be interrupted by signals, although they | |
535 | may conceivably be swapped out. | |
536 | Thus it is a bad idea to sleep with | |
537 | priority less than PZERO on an event which might never occur. | |
538 | On the other hand, calls to | |
539 | .I sleep | |
540 | with larger priority | |
541 | may never return if the process is terminated by | |
542 | some signal in the meantime. | |
543 | Incidentally, it is a gross error to call | |
544 | .I sleep | |
545 | in a routine called at interrupt time, since the process | |
546 | which is running is almost certainly not the | |
547 | process which should go to sleep. | |
548 | Likewise, none of the variables in the user area | |
549 | ``\fIu\fB.\fR'' | |
550 | should be touched, let alone changed, by an interrupt routine. | |
551 | .PP | |
552 | If a device driver | |
553 | wishes to wait for some event for which it is inconvenient | |
554 | or impossible to supply a | |
555 | .I wakeup, | |
556 | (for example, a device going on-line, which does not | |
557 | generally cause an interrupt), | |
558 | the call | |
559 | .I "sleep(&lbolt, priority)" | |
560 | may be given. | |
561 | .I Lbolt | |
562 | is an external cell whose address is awakened once every 4 seconds | |
563 | by the clock interrupt routine. | |
564 | .PP | |
565 | The routines | |
566 | .I "spl4( ), spl5( ), spl6( ), spl7( )" | |
567 | are available to | |
568 | set the processor priority level as indicated to avoid | |
569 | inconvenient interrupts from the device. | |
570 | .PP | |
571 | If a device needs to know about real-time intervals, | |
572 | then | |
573 | .I "timeout(func, arg, interval)" | |
574 | will be useful. | |
575 | This routine arranges that after | |
576 | .I interval | |
577 | sixtieths of a second, the | |
578 | .I func | |
579 | will be called with | |
580 | .I arg | |
581 | as argument, in the style | |
582 | .I "(*func)(arg)." | |
583 | Timeouts are used, for example, | |
584 | to provide real-time delays after function characters | |
585 | like new-line and tab in typewriter output, | |
586 | and to terminate an attempt to | |
587 | read the 201 Dataphone | |
588 | .I dp | |
589 | if there is no response within a specified number | |
590 | of seconds. | |
591 | Notice that the number of sixtieths of a second is limited to 32767, | |
592 | since it must appear to be positive, | |
593 | and that only a bounded number of timeouts | |
594 | can be going on at once. | |
595 | Also, the specified | |
596 | .I func | |
597 | is called at clock-interrupt time, so it should | |
598 | conform to the requirements of interrupt routines | |
599 | in general. | |
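As a worked example of the interval limit: 32767 sixtieths of a second is about 546 seconds, or a little over nine minutes, so longer delays cannot be requested in a single call. A caller that thinks in seconds might convert and check the range as in the sketch below. The helper `sk_ticks` and its \-1 convention for "too long" are hypothetical.

```c
#include <assert.h>

#define SK_HZ       60        /* clock ticks per second, per the text */
#define SK_MAXTICKS 32767     /* largest interval that appears positive */

/*
 * Convert a delay in seconds to sixtieths of a second.
 * Returns -1 if the result would exceed the limit.
 */
int sk_ticks(int seconds)
{
    long t;

    assert(seconds >= 0);
    t = (long)seconds * SK_HZ;
    return t > SK_MAXTICKS ? -1 : (int)t;
}
```

With these definitions a 2-second delay is 120 ticks, 546 seconds is 32760 ticks, and 547 seconds is rejected.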
600 | .SH | |
601 | The Block-device Interface | |
602 | .PP | |
603 | Handling of block devices is mediated by a collection | |
604 | of routines that manage a set of buffers containing | |
605 | the images of blocks of data on the various devices. | |
606 | The most important purpose of these routines is to assure | |
607 | that several processes that access the same block of the same | |
608 | device in multiprogrammed fashion maintain a consistent | |
609 | view of the data in the block. | |
610 | A secondary but still important purpose is to increase | |
611 | the efficiency of the system by | |
612 | keeping in-core copies of blocks that are being | |
613 | accessed frequently. | |
614 | The main data base for this mechanism is the | |
615 | table of buffers | |
616 | .I buf. | |
617 | Each buffer header contains a pair of pointers | |
618 | .I "(b_forw, b_back)" | |
619 | which maintain a doubly-linked list | |
620 | of the buffers associated with a particular | |
621 | block device, and a | |
622 | pair of pointers | |
623 | .I "(av_forw, av_back)" | |
624 | which generally maintain a doubly-linked list of blocks | |
625 | which are ``free,'' that is, | |
626 | eligible to be reallocated for another transaction. | |
627 | Buffers that have I/O in progress | |
628 | or are busy for other purposes do not appear in this list. | |
629 | The buffer header | |
630 | also contains the device and block number to which the | |
631 | buffer refers, and a pointer to the actual storage associated with | |
632 | the buffer. | |
633 | There is a word count | |
634 | which is the negative of the number of words | |
635 | to be transferred to or from the buffer; | |
636 | there is also an error byte and a residual word | |
637 | count used to communicate information | |
638 | from an I/O routine to its caller. | |
639 | Finally, there is a flag word | |
640 | with bits indicating the status of the buffer. | |
641 | These flags will be discussed below. | |
642 | .PP | |
643 | Seven routines constitute | |
644 | the most important part of the interface with the | |
645 | rest of the system. | |
646 | Given a device and block number, | |
647 | both | |
648 | .I bread | |
649 | and | |
650 | .I getblk | |
651 | return a pointer to a buffer header for the block; | |
652 | the difference is that | |
653 | .I bread | |
654 | is guaranteed to return a buffer actually containing the | |
655 | current data for the block, | |
656 | while | |
657 | .I getblk | |
658 | returns a buffer which contains the data in the | |
659 | block only if it is already in core (whether it is | |
660 | or not is indicated by the | |
661 | .I B_DONE | |
662 | bit; see below). | |
663 | In either case the buffer, and the corresponding | |
664 | device block, is made ``busy,'' | |
665 | so that other processes referring to it | |
666 | are obliged to wait until it becomes free. | |
667 | .I Getblk | |
668 | is used, for example, | |
669 | when a block is about to be totally rewritten, | |
670 | so that its previous contents are | |
671 | not useful; | |
672 | still, no other process can be allowed to refer to the block | |
673 | until the new data is placed into it. | |
674 | .PP | |
675 | The | |
676 | .I breada | |
677 | routine is used to implement read-ahead. | |
678 | It is logically similar to | |
679 | .I bread, | |
680 | but takes as an additional argument the number of | |
681 | a block (on the same device) to be read asynchronously | |
682 | after the specifically requested block is available. | |
683 | .PP | |
684 | Given a pointer to a buffer, | |
685 | the | |
686 | .I brelse | |
687 | routine | |
688 | makes the buffer again available to other processes. | |
689 | It is called, for example, after | |
690 | data has been extracted following a | |
691 | .I bread. | |
692 | There are three subtly-different write routines, | |
693 | all of which take a buffer pointer as argument, | |
694 | and all of which logically release the buffer for | |
695 | use by others and place it on the free list. | |
696 | .I Bwrite | |
697 | puts the | |
698 | buffer on the appropriate device queue, | |
699 | waits for the write to be done, | |
700 | and sets the user's error flag if required. | |
701 | .I Bawrite | |
702 | places the buffer on the device's queue, but does not wait | |
703 | for completion, so that errors cannot be reflected directly to | |
704 | the user. | |
705 | .I Bdwrite | |
706 | does not start any I/O operation at all, | |
707 | but merely marks | |
708 | the buffer so that if it happens | |
709 | to be grabbed from the free list to contain | |
710 | data from some other block, the data in it will | |
711 | first be written | |
712 | out. | |
713 | .PP | |
714 | .I Bwrite | |
715 | is used when one wants to be sure that | |
716 | I/O takes place correctly, and that | |
717 | errors are reflected to the proper user; | |
718 | it is used, for example, when updating i-nodes. | |
719 | .I Bawrite | |
720 | is useful when more overlap is desired | |
721 | (because no wait is required for I/O to finish) | |
722 | but when it is reasonably certain that the | |
723 | write is really required. | |
724 | .I Bdwrite | |
725 | is used when there is doubt that the write is | |
726 | needed at the moment. | |
727 | For example, | |
728 | .I bdwrite | |
729 | is called when the last byte of a | |
730 | .I write | |
731 | system call falls short of the end of a | |
732 | block, on the assumption that | |
733 | another | |
734 | .I write | |
735 | will be given soon which will re-use the same block. | |
736 | On the other hand, | |
737 | as the end of a block is passed, | |
738 | .I bawrite | |
739 | is called, since probably the block will | |
740 | not be accessed again soon and one might as | |
741 | well start the writing process as soon as possible. | |
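The choice between the delayed and asynchronous writes in the example above comes down to whether the write ended exactly at a block boundary. A minimal sketch of that decision follows, assuming the 512-byte block size mentioned earlier. The enumeration and the function `sk_choose` are hypothetical and only name which routine would be called.

```c
#include <assert.h>

#define SK_BSIZE 512          /* block size in bytes */

enum sk_wkind { SK_BAWRITE, SK_BDWRITE };

/*
 * Given the file offset just past the last byte written into a block,
 * decide how the block should be written out.
 */
enum sk_wkind sk_choose(long offset_after)
{
    assert(offset_after > 0);
    if (offset_after % SK_BSIZE == 0)
        return SK_BAWRITE;    /* end of block passed: start I/O now */
    return SK_BDWRITE;        /* block only partly filled: delay */
}
```

A sequence of small writes therefore costs one device transfer per block filled, rather than one per system call.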
742 | .PP | |
743 | In any event, notice that the routines | |
744 | .I "getblk" | |
745 | and | |
746 | .I bread | |
747 | dedicate the given block exclusively to the | |
748 | use of the caller, and make others wait, | |
749 | while one of | |
750 | .I "brelse, bwrite, bawrite," | |
751 | or | |
752 | .I bdwrite | |
753 | must eventually be called to free the block for use by others. | |
754 | .PP | |
755 | As mentioned, each buffer header contains a flag | |
756 | word which indicates the status of the buffer. | |
757 | Since they provide | |
758 | one important channel for information between the drivers and the | |
759 | block I/O system, it is important to understand these flags. | |
760 | The following names are manifest constants which | |
761 | select the associated flag bits. | |
762 | .IP B_READ 10 | |
763 | This bit is set when the buffer is handed to the device strategy routine | |
764 | (see below) to indicate a read operation. | |
765 | The symbol | |
766 | .I B_WRITE | |
767 | is defined as 0 and does not define a flag; it is provided | |
768 | as a mnemonic convenience to callers of routines like | |
769 | .I swap | |
770 | which have a separate argument | |
771 | which indicates read or write. | |
772 | .IP B_DONE 10 | |
773 | This bit is set | |
774 | to 0 when a block is handed to the device strategy | |
775 | routine and is turned on when the operation completes, | |
776 | whether normally or as the result of an error. | |
777 | It is also used as part of the return argument of | |
778 | .I getblk | |
779 | to indicate, if 1, that the returned | |
780 | buffer actually contains the data in the requested block. | |
781 | .IP B_ERROR 10 | |
782 | This bit may be set to 1 when | |
783 | .I B_DONE | |
784 | is set to indicate that an I/O or other error occurred. | |
785 | If it is set the | |
786 | .I b_error | |
787 | byte of the buffer header may contain an error code | |
788 | if it is non-zero. | |
789 | If | |
790 | .I b_error | |
791 | is 0 the nature of the error is not specified. | |
792 | Actually no driver at present sets | |
793 | .I b_error; | |
794 | the latter is provided for a future improvement | |
795 | whereby a more detailed error-reporting | |
796 | scheme may be implemented. | |
797 | .IP B_BUSY 10 | |
798 | This bit indicates that the buffer header is not on | |
799 | the free list, i.e. is | |
800 | dedicated to someone's exclusive use. | |
801 | The buffer still remains attached to the list of | |
802 | blocks associated with its device, however. | |
803 | When | |
804 | .I getblk | |
805 | (or | |
806 | .I bread, | |
807 | which calls it) searches the buffer list | |
808 | for a given device and finds the requested | |
809 | block with this bit on, it sleeps until the bit | |
810 | clears. | |
811 | .IP B_PHYS 10 | |
812 | This bit is set for raw I/O transactions that | |
813 | need to allocate the Unibus map on an 11/70. | |
814 | .IP B_MAP 10 | |
815 | This bit is set on buffers that have the Unibus map allocated, | |
816 | so that the | |
817 | .I iodone | |
818 | routine knows to deallocate the map. | |
819 | .IP B_WANTED 10 | |
820 | This flag is used in conjunction with the | |
821 | .I B_BUSY | |
822 | bit. | |
823 | Before sleeping as described | |
824 | just above, | |
825 | .I getblk | |
826 | sets this flag. | |
827 | Conversely, when the block is freed and the busy bit | |
828 | goes down (in | |
829 | .I brelse) | |
830 | a | |
831 | .I wakeup | |
832 | is given for the block header whenever | |
833 | .I B_WANTED | |
834 | is on. | |
835 | This stratagem avoids the overhead | |
836 | of having to call | |
837 | .I wakeup | |
838 | every time a buffer is freed on the chance that someone | |
839 | might want it. | |
840 | .IP B_AGE 10 | |
841 | This bit may be set on buffers just before releasing them; if it | |
842 | is on, | |
843 | the buffer is placed at the head of the free list, rather than at the | |
844 | tail. | |
845 | It is a performance heuristic | |
846 | used when the caller judges that the same block will not soon be used again. | |
847 | .IP B_ASYNC 10 | |
848 | This bit is set by | |
849 | .I bawrite | |
850 | to indicate to the appropriate device driver | |
851 | that the buffer should be released when the | |
852 | write has been finished, usually at interrupt time. | |
853 | The difference between | |
854 | .I bwrite | |
855 | and | |
856 | .I bawrite | |
857 | is that the former starts I/O, waits until it is done, and | |
858 | frees the buffer. | |
859 | The latter merely sets this bit and starts I/O. | |
860 | The bit indicates that | |
861 | .I brelse | |
862 | should be called for the buffer on completion. | |
863 | .IP B_DELWRI 10 | |
864 | This bit is set by | |
865 | .I bdwrite | |
866 | before releasing the buffer. | |
867 | When | |
868 | .I getblk, | |
869 | while searching for a free block, | |
870 | discovers the bit is 1 in a buffer it would otherwise grab, | |
871 | it causes the block to be written out before reusing it. | |
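The interplay of the busy and wanted flags described in the list above can be sketched as follows. The flag values, the structure, and the routines `sk_want` and `sk_release` are all hypothetical, and a counter stands in for the call on the wakeup routine. The sketch shows only the flag protocol, not the free-list handling.

```c
#include <assert.h>

#define SK_B_BUSY   010       /* assumed bit values, for illustration */
#define SK_B_WANTED 0200

struct sk_hdr {
    int b_flags;
};

int sk_wakeups;               /* counts simulated wakeup calls */

/* A caller that finds the buffer busy notes its interest before sleeping. */
void sk_want(struct sk_hdr *bp)
{
    assert(bp != 0);
    if (bp->b_flags & SK_B_BUSY)
        bp->b_flags |= SK_B_WANTED;
}

/* Releasing the buffer: wake sleepers only if someone asked. */
void sk_release(struct sk_hdr *bp)
{
    assert(bp != 0);
    if (bp->b_flags & SK_B_WANTED)
        sk_wakeups++;                         /* stands in for wakeup(bp) */
    bp->b_flags &= ~(SK_B_BUSY | SK_B_WANTED);
}
```

When no one has set the wanted flag, the release costs only a flag test.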
872 | .SH | |
873 | Block Device Drivers | |
874 | .PP | |
875 | The | |
876 | .I bdevsw | |
877 | table contains the names of the interface routines | |
878 | and that of a table for each block device. | |
879 | .PP | |
880 | Just as for character devices, block device drivers may supply | |
881 | an | |
882 | .I open | |
883 | and a | |
884 | .I close | |
885 | routine | |
886 | called respectively on each open and on the final close | |
887 | of the device. | |
888 | Instead of separate read and write routines, | |
889 | each block device driver has a | |
890 | .I strategy | |
891 | routine which is called with a pointer to a buffer | |
892 | header as argument. | |
893 | As discussed, the buffer header contains | |
894 | a read/write flag, the core address, | |
895 | the block number, a (negative) word count, | |
896 | and the major and minor device number. | |
897 | The role of the strategy routine | |
898 | is to carry out the operation as requested by the | |
899 | information in the buffer header. | |
900 | When the transaction is complete the | |
901 | .I B_DONE | |
902 | (and possibly the | |
903 | .I B_ERROR) | |
904 | bits should be set. | |
905 | Then if the | |
906 | .I B_ASYNC | |
907 | bit is set, | |
908 | .I brelse | |
909 | should be called; | |
910 | otherwise, | |
911 | .I wakeup. | |
912 | In cases where the device | |
913 | is capable, under error-free operation, | |
914 | of transferring fewer words than requested, | |
915 | the device's word-count register should be placed | |
916 | in the residual count slot of | |
917 | the buffer header; | |
918 | otherwise, the residual count should be set to 0. | |
919 | This particular mechanism is really for the benefit | |
920 | of the magtape driver; | |
921 | when reading this device | |
922 | records shorter than requested are quite normal, | |
923 | and the user should be told the actual length of the record. | |
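.PP
The completion steps just described can be sketched as follows.
This is a hedged illustration only: the function name finish_transfer, the field name b_resid, the flag values, and the counting stubs for brelse and wakeup are assumptions introduced here, not taken from the system code.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values are illustrative; only the names come from the text. */
enum { B_DONE = 02, B_ERROR = 04, B_ASYNC = 0400 };

struct buf {
    int b_flags;
    int b_resid;        /* residual count slot; field name assumed */
};

/* Stubs that merely count calls, so the sketch runs outside a kernel. */
static int released, awakened;
static void brelse(struct buf *bp) { (void)bp; released++; }
static void wakeup(void *chan) { (void)chan; awakened++; }

/*
 * Hypothetical end-of-transfer step for a strategy routine.
 * words_left is what the device's word-count register reported,
 * or 0 for devices that always transfer the full request.
 */
void finish_transfer(struct buf *bp, int failed, int words_left)
{
    assert(bp != NULL);
    if (failed)
        bp->b_flags |= B_ERROR;
    bp->b_resid = words_left;
    bp->b_flags |= B_DONE;
    if (bp->b_flags & B_ASYNC)
        brelse(bp);     /* nobody is waiting: just free the buffer */
    else
        wakeup(bp);     /* a process is sleeping on this header */
}
```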
924 | .PP | |
925 | Although the most usual argument | |
926 | to the strategy routines | |
927 | is a genuine buffer header allocated as discussed above, | |
928 | all that is actually required | |
929 | is that the argument be a pointer to a place containing the | |
930 | appropriate information. | |
931 | For example, the | |
932 | .I swap | |
933 | routine, which manages movement | |
934 | of core images to and from the swapping device, | |
935 | uses the strategy routine | |
936 | for this device. | |
937 | Care has to be taken that | |
938 | no extraneous bits get turned on in the | |
939 | flag word. | |
940 | .PP | |
941 | The device's table specified by | |
942 | .I bdevsw | |
943 | has a | |
944 | byte to contain an active flag and an error count, | |
945 | a pair of links which constitute the | |
946 | head of the chain of buffers for the device | |
947 | .I "(b_forw, b_back)," | |
948 | and a first and last pointer for a device queue. | |
949 | Of these things, all are used solely by the device driver | |
950 | itself | |
951 | except for the buffer-chain pointers. | |
952 | Typically the flag encodes the state of the | |
953 | device, and is used at a minimum to | |
954 | indicate that the device is currently engaged in | |
955 | transferring information and no new command should be issued. | |
956 | The error count is useful for counting retries | |
957 | when errors occur. | |
958 | The device queue is used to remember stacked requests; | |
959 | in the simplest case it may be maintained as a first-in | |
960 | first-out list. | |
961 | Since buffers which have been handed over to | |
962 | the strategy routines are never | |
963 | on the list of free buffers, | |
964 | the pointers in the buffer which maintain the free list | |
965 | .I "(av_forw, av_back)" | |
966 | are also used to contain the pointers | |
967 | which maintain the device queues. | |
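.PP
A first-in first-out device queue threaded through the av_forw pointers might look like the sketch below.
The field names d_active, d_errcnt, d_actf and d_actl, and the functions enqueue and dequeue, are assumptions chosen for the example; the text names only the b_forw, b_back, av_forw and av_back links.

```c
#include <assert.h>
#include <stddef.h>

struct buf {
    struct buf *av_forw, *av_back;  /* free-list links, reused for the queue */
};

/* Per-device table; the d_* field names are assumed for illustration. */
struct devtab {
    char d_active;                  /* device is busy with a transfer */
    char d_errcnt;                  /* retries so far on the current request */
    struct buf *b_forw, *b_back;    /* head of the device's buffer chain */
    struct buf *d_actf, *d_actl;    /* first and last queued request */
};

/* Append a request to the tail of the device queue. */
void enqueue(struct devtab *dp, struct buf *bp)
{
    assert(bp != NULL);
    bp->av_forw = NULL;
    if (dp->d_actf == NULL)
        dp->d_actf = bp;
    else
        dp->d_actl->av_forw = bp;
    dp->d_actl = bp;
}

/* Remove and return the oldest request, or NULL if the queue is empty. */
struct buf *dequeue(struct devtab *dp)
{
    struct buf *bp = dp->d_actf;

    if (bp != NULL) {
        dp->d_actf = bp->av_forw;
        if (dp->d_actf == NULL)
            dp->d_actl = NULL;
    }
    return bp;
}
```

Reusing av_forw is safe here only because, as noted above, a buffer handed to a strategy routine is never on the free list at the same time.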
968 | .PP | |
969 | A couple of routines | |
970 | are provided which are useful to block device drivers. | |
971 | .I "iodone(bp)" | |
972 | arranges that the buffer to which | |
973 | .I bp | |
974 | points be released or awakened, | |
975 | as appropriate, | |
976 | when the | |
977 | strategy module has finished with the buffer, | |
978 | either normally or after an error. | |
979 | (In the latter case the | |
980 | .I B_ERROR | |
981 | bit has presumably been set.) | |
982 | .PP | |
983 | The routine | |
984 | .I "geterror(bp)" | |
985 | can be used to examine the error bit in a buffer header | |
986 | and arrange that any error indication found therein is | |
987 | reflected to the user. | |
988 | It may be called only in the non-interrupt | |
989 | part of a driver when I/O has completed | |
990 | .I (B_DONE | |
991 | has been set). | |
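.PP
The error check can be pictured with the following minimal sketch.
The variable u_error is a stand-in for wherever the system records a process's pending error, the use of EIO as the reported code is an assumption, and the flag values and the name geterror_sketch are likewise illustrative.

```c
#include <assert.h>
#include <errno.h>

/* Flag values are illustrative; only the names come from the text. */
enum { B_DONE = 02, B_ERROR = 04 };

struct buf {
    int b_flags;
};

static int u_error;     /* stand-in for the per-process error cell */

/* Reflect a buffer's error bit to the user; valid only once I/O is done. */
void geterror_sketch(struct buf *bp)
{
    assert(bp->b_flags & B_DONE);   /* must not be called before completion */
    if (bp->b_flags & B_ERROR)
        u_error = EIO;              /* assumed error code */
}
```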
992 | .SH | |
993 | Raw Block-device I/O | |
994 | .PP | |
995 | A scheme has been set up whereby block device drivers may | |
996 | provide the ability to transfer information | |
997 | directly between the user's core image and the device | |
998 | without the use of buffers and in blocks as large as | |
999 | the caller requests. | |
1000 | The method involves setting up a character-type special file | |
1001 | corresponding to the raw device | |
1002 | and providing | |
1003 | .I read | |
1004 | and | |
1005 | .I write | |
1006 | routines which set up what is usually a private, | |
1007 | non-shared buffer header with the appropriate information | |
1008 | and call the device's strategy routine. | |
1009 | If desired, separate | |
1010 | .I open | |
1011 | and | |
1012 | .I close | |
1013 | routines may be provided but this is usually unnecessary. | |
1014 | A special-function routine might come in handy, especially for | |
1015 | magtape. | |
1016 | .PP | |
1017 | A great deal of work has to be done to generate the | |
1018 | ``appropriate information'' | |
1019 | to put in the argument buffer for | |
1020 | the strategy module; | |
1021 | the worst part is to map relocated user addresses to physical addresses. | |
1022 | Most of this work is done by | |
1023 | .I "physio(strat, bp, dev, rw)" | |
1024 | whose arguments are the name of the | |
1025 | strategy routine | |
1026 | .I strat, | |
1027 | the buffer pointer | |
1028 | .I bp, | |
1029 | the device number | |
1030 | .I dev, | |
1031 | and a read-write flag | |
1032 | .I rw | |
1033 | whose value is either | |
1034 | .I B_READ | |
1035 | or | |
1036 | .I B_WRITE. | |
1037 | .I Physio | |
1038 | makes sure that the user's base address and count are | |
1039 | even (because most devices work in words) | |
1040 | and that the core area affected is contiguous | |
1041 | in physical space; | |
1042 | it delays until the buffer is not busy, and makes it | |
1043 | busy while the operation is in progress; | |
1044 | and it sets up user error return information. |