1.TL
2The UNIX I/O System
3.AU
4Dennis M. Ritchie
5.AI
6.MH
7.PP
8This paper gives an overview of the workings of the UNIX\(dg
9.FS
10\(dgUNIX is a Trademark of Bell Laboratories.
11.FE
12I/O system.
13It was written with an eye toward providing
14guidance to writers of device driver routines,
15and is oriented more toward describing the environment
16and nature of device drivers than the implementation
17of that part of the file system which deals with
18ordinary files.
19.PP
20It is assumed that the reader has a good knowledge
21of the overall structure of the file system as discussed
22in the paper ``The UNIX Time-sharing System.''
23A more detailed discussion
24appears in
25``UNIX Implementation;''
26the current document restates parts of that one,
27but is still more detailed.
28It is most useful in
29conjunction with a copy of the system code,
30since it is basically an exegesis of that code.
31.SH
32Device Classes
33.PP
34There are two classes of device:
35.I block
36and
37.I character.
38The block interface is suitable for devices
39like disks, tapes, and DECtape
which work, or can work, with addressable 512-byte blocks.
41Ordinary magnetic tape just barely fits in this category,
42since by use of forward
43and
44backward spacing any block can be read, even though
45blocks can be written only at the end of the tape.
46Block devices can at least potentially contain a mounted
47file system.
48The interface to block devices is very highly structured;
49the drivers for these devices share a great many routines
50as well as a pool of buffers.
51.PP
52Character-type devices have a much
53more straightforward interface, although
54more work must be done by the driver itself.
55.PP
56Devices of both types are named by a
57.I major
58and a
59.I minor
60device number.
61These numbers are generally stored as an integer
62with the minor device number
63in the low-order 8 bits and the major device number
64in the next-higher 8 bits;
65macros
66.I major
67and
68.I minor
69are available to access these numbers.
70The major device number selects which driver will deal with
71the device; the minor device number is not used
72by the rest of the system but is passed to the
73driver at appropriate times.
74Typically the minor number
75selects a subdevice attached to
76a given controller, or one of
77several similar hardware interfaces.
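.PP
The packing just described can be sketched as follows.
This is an illustration only: the names DEV_MAJOR, DEV_MINOR, DEV_MAKE and devnum_t
are hypothetical stand-ins chosen to avoid clashing with the real
.I major
and
.I minor
macros, and the 16-bit type is an assumption drawn from the two 8-bit fields.

```c
#include <stdio.h>

/* Hypothetical names; the text calls the real macros major and minor. */
typedef unsigned short devnum_t;	/* two 8-bit fields fit in 16 bits */

#define DEV_MAJOR(d)	((int)(((d) >> 8) & 0377))	/* next-higher 8 bits */
#define DEV_MINOR(d)	((int)((d) & 0377))		/* low-order 8 bits */
#define DEV_MAKE(maj, min) \
	((devnum_t)((((maj) & 0377) << 8) | ((min) & 0377)))

/* Print the two halves of a device number as a driver would see them. */
void
show_dev(devnum_t d)
{
	printf("major %d, minor %d\n", DEV_MAJOR(d), DEV_MINOR(d));
}
```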
78.PP
79The major device numbers for block and character devices
80are used as indices in separate tables;
81they both start at 0 and therefore overlap.
82.SH
83Overview of I/O
84.PP
85The purpose of
86the
87.I open
88and
89.I creat
90system calls is to set up entries in three separate
91system tables.
92The first of these is the
93.I u_ofile
94table,
95which is stored in the system's per-process
96data area
97.I u.
98This table is indexed by
99the file descriptor returned by the
100.I open
101or
102.I creat,
103and is accessed during
104a
105.I read,
106.I write,
107or other operation on the open file.
108An entry contains only
109a pointer to the corresponding
110entry of the
111.I file
112table,
113which is a per-system data base.
114There is one entry in the
115.I file
116table for each
117instance of
118.I open
119or
120.I creat.
121This table is per-system because the same instance
122of an open file must be shared among the several processes
123which can result from
124.I forks
125after the file is opened.
126A
127.I file
128table entry contains
129flags which indicate whether the file
130was open for reading or writing or is a pipe, and
131a count which is used to decide when all processes
132using the entry have terminated or closed the file
133(so the entry can be abandoned).
134There is also a 32-bit file offset
135which is used to indicate where in the file the next read
136or write will take place.
137Finally, there is a pointer to the
138entry for the file in the
139.I inode
140table,
141which contains a copy of the file's i-node.
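.PP
The relation between the per-process descriptor table and the shared
file table can be sketched roughly as below.
All of the sim_ names, the field layout, and the value of NOFILE are assumptions
made for illustration; they are not the system's declarations.
The point shown is only that descriptors copied by a fork refer to one
file table entry, and so share one offset and one reference count.

```c
#include <stddef.h>

#define NOFILE 20	/* descriptors per process; value is illustrative */

struct sim_inode { int i_count; };	/* stand-in for an inode table entry */

struct sim_file {			/* one per instance of open or creat */
	int	f_flag;			/* read, write, pipe flags */
	int	f_count;		/* descriptor slots that point here */
	long	f_offset;		/* where the next read or write goes */
	struct sim_inode *f_inode;
};

struct sim_user {			/* per-process data area */
	struct sim_file *u_ofile[NOFILE];
};

/* Copy the parent's descriptors to the child, as a fork would. */
void
sim_fork(struct sim_user *parent, struct sim_user *child)
{
	int fd;

	for (fd = 0; fd < NOFILE; fd++) {
		child->u_ofile[fd] = parent->u_ofile[fd];
		if (child->u_ofile[fd] != NULL)
			child->u_ofile[fd]->f_count++;
	}
}

/* Drop one reference; return 1 when the file table entry is abandoned. */
int
sim_close(struct sim_user *u, int fd)
{
	struct sim_file *fp = u->u_ofile[fd];

	if (fp == NULL)
		return 0;
	u->u_ofile[fd] = NULL;
	return --fp->f_count == 0;
}
```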
142.PP
143Certain open files can be designated ``multiplexed''
144files, and several other flags apply to such
145channels.
146In such a case, instead of an offset,
147there is a pointer to an associated multiplex channel table.
148Multiplex channels will not be discussed here.
149.PP
150An entry in the
151.I file
152table corresponds precisely to an instance of
153.I open
154or
155.I creat;
156if the same file is opened several times,
157it will have several
158entries in this table.
159However,
160there is at most one entry
161in the
162.I inode
163table for a given file.
164Also, a file may enter the
165.I inode
166table not only because it is open,
167but also because it is the current directory
168of some process or because it
169is a special file containing a currently-mounted
170file system.
171.PP
172An entry in the
173.I inode
174table differs somewhat from the
175corresponding i-node as stored on the disk;
176the modified and accessed times are not stored,
177and the entry is augmented
178by a flag word containing information about the entry,
179a count used to determine when it may be
180allowed to disappear,
181and the device and i-number
182whence the entry came.
183Also, the several block numbers that give addressing
184information for the file are expanded from
185the 3-byte, compressed format used on the disk to full
186.I long
187quantities.
188.PP
189During the processing of an
190.I open
191or
192.I creat
193call for a special file,
194the system always calls the device's
195.I open
196routine to allow for any special processing
197required (rewinding a tape, turning on
198the data-terminal-ready lead of a modem, etc.).
199However,
200the
201.I close
202routine is called only when the last
203process closes a file,
204that is, when the i-node table entry
205is being deallocated.
206Thus it is not feasible
207for a device to maintain, or depend on,
208a count of its users, although it is quite
209possible to
210implement an exclusive-use device which cannot
211be reopened until it has been closed.
212.PP
213When a
214.I read
215or
216.I write
217takes place,
218the user's arguments
219and the
220.I file
221table entry are used to set up the
222variables
223.I u.u_base,
224.I u.u_count,
225and
226.I u.u_offset
227which respectively contain the (user) address
228of the I/O target area, the byte-count for the transfer,
229and the current location in the file.
230If the file referred to is
231a character-type special file, the appropriate read
232or write routine is called; it is responsible
233for transferring data and updating the
234count and current location appropriately
235as discussed below.
236Otherwise, the current location is used to calculate
237a logical block number in the file.
238If the file is an ordinary file the logical block
239number must be mapped (possibly using indirect blocks)
240to a physical block number; a block-type
241special file need not be mapped.
242This mapping is performed by the
243.I bmap
244routine.
245In any event, the resulting physical block number
246is used, as discussed below, to
247read or write the appropriate device.
248.SH
249Character Device Drivers
250.PP
251The
252.I cdevsw
253table specifies the interface routines present for
254character devices.
255Each device provides five routines:
256open, close, read, write, and special-function
257(to implement the
258.I ioctl
259system call).
260Any of these may be missing.
261If a call on the routine
262should be ignored,
263(e.g.
264.I open
265on non-exclusive devices that require no setup)
266the
267.I cdevsw
268entry can be given as
269.I nulldev;
270if it should be considered an error,
271(e.g.
272.I write
273on read-only devices)
274.I nodev
275is used.
276For terminals,
277the
278.I cdevsw
279structure also contains a pointer to the
280.I tty
281structure associated with the terminal.
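.PP
A switch table of this kind might be sketched as follows.
The structure layout, the uniform two-argument signature,
the error number, and every sim_ name are assumptions made for the example;
the real table and the real
.I nulldev
and
.I nodev
differ in detail.

```c
#include <stddef.h>

#define SIM_ENODEV 19		/* illustrative error number */

static int sim_error;		/* stands in for the per-process error cell */

/* Entry for a call that should be silently ignored. */
int
sim_nulldev(int dev, int flag)
{
	(void)dev; (void)flag;
	return 0;
}

/* Entry for a call that should be treated as an error. */
int
sim_nodev(int dev, int flag)
{
	(void)dev; (void)flag;
	sim_error = SIM_ENODEV;
	return -1;
}

struct sim_cdevsw {
	int (*d_open)(int dev, int flag);
	int (*d_close)(int dev, int flag);
	int (*d_read)(int dev, int flag);
	int (*d_write)(int dev, int flag);
	int (*d_sgtty)(int dev, int flag);	/* special-function routine */
};

static int rd_reads;		/* counts reads on the example device */

int
sim_rdread(int dev, int flag)
{
	(void)dev; (void)flag;
	rd_reads++;
	return 0;
}

/* Major device 0: read-only, and open requires no setup. */
struct sim_cdevsw sim_cdevsw[] = {
	{ sim_nulldev, sim_nulldev, sim_rdread, sim_nodev, sim_nodev },
};

/* Dispatch a write through the table using the major device number. */
int
sim_cwrite(int dev)
{
	return (*sim_cdevsw[(dev >> 8) & 0377].d_write)(dev, 0);
}
```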
282.PP
283The
284.I open
285routine is called each time the file
286is opened with the full device number as argument.
287The second argument is a flag which is
288non-zero only if the device is to be written upon.
289.PP
290The
291.I close
292routine is called only when the file
293is closed for the last time,
that is, when the very last process in
295which the file is open closes it.
296This means it is not possible for the driver to
297maintain its own count of its users.
298The first argument is the device number;
299the second is a flag which is non-zero
300if the file was open for writing in the process which
301performs the final
302.I close.
303.PP
304When
305.I write
306is called, it is supplied the device
307as argument.
308The per-user variable
309.I u.u_count
310has been set to
311the number of characters indicated by the user;
312for character devices, this number may be 0
313initially.
314.I u.u_base
315is the address supplied by the user from which to start
316taking characters.
317The system may call the
318routine internally, so the
319flag
320.I u.u_segflg
321is supplied that indicates,
322if
323.I on,
324that
325.I u.u_base
326refers to the system address space instead of
327the user's.
328.PP
329The
330.I write
331routine
332should copy up to
333.I u.u_count
334characters from the user's buffer to the device,
335decrementing
336.I u.u_count
337for each character passed.
338For most drivers, which work one character at a time,
339the routine
340.I "cpass( )"
341is used to pick up characters
342from the user's buffer.
343Successive calls on it return
344the characters to be written until
345.I u.u_count
346goes to 0 or an error occurs,
347when it returns \(mi1.
348.I Cpass
349takes care of interrogating
350.I u.u_segflg
351and updating
352.I u.u_count.
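.PP
The character-at-a-time loop described above can be imitated outside the kernel.
In this sketch sim_cpass is a hypothetical stand-in for
.I cpass
that ignores
.I u.u_segflg
and error conditions,
and the ``device'' is merely an array; neither is the system's code.

```c
#include <stddef.h>

/* Stand-ins for u.u_base and u.u_count; u.u_segflg is ignored here. */
static const char *sim_u_base;
static int sim_u_count;

/* Return the next character to be written, or -1 when none remain. */
int
sim_cpass(void)
{
	if (sim_u_count == 0)
		return -1;
	sim_u_count--;
	return *sim_u_base++ & 0377;
}

static char sim_device[64];	/* pretend output device */
static int sim_devlen;

/* A character-at-a-time write routine built on sim_cpass. */
void
sim_write(int dev)
{
	int c;

	(void)dev;
	/* Check for room first so that no fetched character is dropped. */
	while (sim_devlen < (int)sizeof sim_device && (c = sim_cpass()) >= 0)
		sim_device[sim_devlen++] = (char)c;
}
```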
353.PP
354Write routines which want to transfer
355a probably large number of characters into an internal
356buffer may also use the routine
357.I "iomove(buffer, offset, count, flag)"
358which is faster when many characters must be moved.
359.I Iomove
360transfers up to
361.I count
362characters into the
363.I buffer
364starting
365.I offset
366bytes from the start of the buffer;
367.I flag
368should be
369.I B_WRITE
370(which is 0) in the write case.
371Caution:
372the caller is responsible for making sure
373the count is not too large and is non-zero.
374As an efficiency note,
375.I iomove
376is much slower if any of
377.I "buffer+offset, count"
378or
379.I u.u_base
380is odd.
381.PP
382The device's
383.I read
384routine is called under conditions similar to
385.I write,
386except that
387.I u.u_count
388is guaranteed to be non-zero.
389To return characters to the user, the routine
390.I "passc(c)"
391is available; it takes care of housekeeping
392like
393.I cpass
394and returns \(mi1 as the last character
395specified by
396.I u.u_count
397is returned to the user;
398before that time, 0 is returned.
399.I Iomove
400is also usable as with
401.I write;
402the flag should be
403.I B_READ
404but the same cautions apply.
405.PP
406The ``special-functions'' routine
407is invoked by the
408.I stty
409and
410.I gtty
411system calls as follows:
412.I "(*p) (dev, v)"
413where
414.I p
415is a pointer to the device's routine,
416.I dev
417is the device number,
418and
419.I v
420is a vector.
421In the
422.I gtty
423case,
424the device is supposed to place up to 3 words of status information
425into the vector; this will be returned to the caller.
426In the
427.I stty
428case,
429.I v
430is 0;
431the device should take up to 3 words of
432control information from
433the array
434.I "u.u_arg[0...2]."
435.PP
436Finally, each device should have appropriate interrupt-time
437routines.
438When an interrupt occurs, it is turned into a C-compatible call
on the device's interrupt routine.
440The interrupt-catching mechanism makes
441the low-order four bits of the ``new PS'' word in the
442trap vector for the interrupt available
443to the interrupt handler.
444This is conventionally used by drivers
445which deal with multiple similar devices
446to encode the minor device number.
447After the interrupt has been processed,
448a return from the interrupt handler will
449return from the interrupt itself.
450.PP
451A number of subroutines are available which are useful
452to character device drivers.
453Most of these handlers, for example, need a place
454to buffer characters in the internal interface
455between their ``top half'' (read/write)
456and ``bottom half'' (interrupt) routines.
457For relatively low data-rate devices, the best mechanism
458is the character queue maintained by the
459routines
460.I getc
461and
462.I putc.
463A queue header has the structure
.DS
struct {
	int	c_cc;	/* character count */
	char	*c_cf;	/* first character */
	char	*c_cl;	/* last character */
} queue;
.DE
471A character is placed on the end of a queue by
472.I "putc(c, &queue)"
473where
474.I c
475is the character and
476.I queue
477is the queue header.
478The routine returns \(mi1 if there is no space
479to put the character, 0 otherwise.
480The first character on the queue may be retrieved
481by
482.I "getc(&queue)"
483which returns either the (non-negative) character
484or \(mi1 if the queue is empty.
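.PP
The calling convention of the two routines can be shown with a much simplified queue.
The fixed-size ring below, its size, and the sim_ names are assumptions made for the example;
the real queues draw their storage from a pool shared by all devices
and use the header shown above.

```c
#define SIM_QSIZE 8		/* illustrative; real storage is a shared pool */

struct sim_queue {
	int	c_cc;		/* character count */
	int	c_head;		/* index of the first character */
	char	c_buf[SIM_QSIZE];
};

/* Append c; return 0 on success, -1 if there is no space. */
int
sim_putc(int c, struct sim_queue *q)
{
	if (q->c_cc >= SIM_QSIZE)
		return -1;
	q->c_buf[(q->c_head + q->c_cc) % SIM_QSIZE] = (char)c;
	q->c_cc++;
	return 0;
}

/* Remove and return the first character, or -1 if the queue is empty. */
int
sim_getc(struct sim_queue *q)
{
	int c;

	if (q->c_cc == 0)
		return -1;
	c = q->c_buf[q->c_head] & 0377;
	q->c_head = (q->c_head + 1) % SIM_QSIZE;
	q->c_cc--;
	return c;
}
```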
485.PP
486Notice that the space for characters in queues is
487shared among all devices in the system
488and in the standard system there are only some 600
489character slots available.
490Thus device handlers,
491especially write routines, must take
492care to avoid gobbling up excessive numbers of characters.
493.PP
494The other major help available
495to device handlers is the sleep-wakeup mechanism.
496The call
497.I "sleep(event, priority)"
498causes the process to wait (allowing other processes to run)
499until the
500.I event
501occurs;
502at that time, the process is marked ready-to-run
503and the call will return when there is no
504process with higher
505.I priority.
506.PP
507The call
508.I "wakeup(event)"
509indicates that the
510.I event
511has happened, that is, causes processes sleeping
512on the event to be awakened.
513The
514.I event
515is an arbitrary quantity agreed upon
516by the sleeper and the waker-up.
517By convention, it is the address of some data area used
518by the driver, which guarantees that events
519are unique.
520.PP
521Processes sleeping on an event should not assume
522that the event has really happened;
523they should check that the conditions which
524caused them to sleep no longer hold.
525.PP
526Priorities can range from 0 to 127;
527a higher numerical value indicates a less-favored
528scheduling situation.
529A distinction is made between processes sleeping
530at priority less than the parameter
531.I PZERO
532and those at numerically larger priorities.
533The former cannot
534be interrupted by signals, although it
is conceivable that they may be swapped out.
536Thus it is a bad idea to sleep with
537priority less than PZERO on an event which might never occur.
538On the other hand, calls to
539.I sleep
540with larger priority
541may never return if the process is terminated by
542some signal in the meantime.
543Incidentally, it is a gross error to call
544.I sleep
545in a routine called at interrupt time, since the process
546which is running is almost certainly not the
547process which should go to sleep.
548Likewise, none of the variables in the user area
549``\fIu\fB.\fR''
550should be touched, let alone changed, by an interrupt routine.
551.PP
552If a device driver
553wishes to wait for some event for which it is inconvenient
554or impossible to supply a
555.I wakeup,
556(for example, a device going on-line, which does not
557generally cause an interrupt),
558the call
.I "sleep(&lbolt, priority)"
560may be given.
561.I Lbolt
562is an external cell whose address is awakened once every 4 seconds
563by the clock interrupt routine.
564.PP
565The routines
566.I "spl4( ), spl5( ), spl6( ), spl7( )"
567are available to
568set the processor priority level as indicated to avoid
569inconvenient interrupts from the device.
570.PP
571If a device needs to know about real-time intervals,
572then
.I "timeout(func, arg, interval)"
574will be useful.
575This routine arranges that after
576.I interval
577sixtieths of a second, the
578.I func
579will be called with
580.I arg
581as argument, in the style
.I "(*func)(arg)."
583Timeouts are used, for example,
584to provide real-time delays after function characters
585like new-line and tab in typewriter output,
586and to terminate an attempt to
587read the 201 Dataphone
588.I dp
589if there is no response within a specified number
590of seconds.
591Notice that the number of sixtieths of a second is limited to 32767,
592since it must appear to be positive,
593and that only a bounded number of timeouts
594can be going on at once.
595Also, the specified
596.I func
597is called at clock-interrupt time, so it should
598conform to the requirements of interrupt routines
599in general.
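.PP
The behavior of a bounded timeout table can be sketched as below.
The table size, the return value on overflow, and the sim_ names are all assumptions
made for the example, and the linear scan is a simplification
rather than the method the clock routine actually uses.

```c
#define SIM_NCALL 4		/* bound on pending timeouts; illustrative */

struct sim_callo {
	int	c_time;		/* ticks remaining; 0 marks a free slot */
	int	c_arg;
	void	(*c_func)(int);
};

static struct sim_callo sim_callout[SIM_NCALL];

/* Arrange for (*func)(arg) after interval ticks; -1 if the table is full. */
int
sim_timeout(void (*func)(int), int arg, int interval)
{
	int i;

	if (interval <= 0)
		return -1;
	for (i = 0; i < SIM_NCALL; i++) {
		if (sim_callout[i].c_time == 0) {
			sim_callout[i].c_time = interval;
			sim_callout[i].c_arg = arg;
			sim_callout[i].c_func = func;
			return 0;
		}
	}
	return -1;
}

/* One clock tick: count down pending entries and call those that expire. */
void
sim_clock(void)
{
	int i;

	for (i = 0; i < SIM_NCALL; i++) {
		if (sim_callout[i].c_time > 0 && --sim_callout[i].c_time == 0)
			(*sim_callout[i].c_func)(sim_callout[i].c_arg);
	}
}

static int sim_fired;		/* set by the example callback */

void
sim_note(int arg)
{
	sim_fired = arg;
}
```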
600.SH
601The Block-device Interface
602.PP
603Handling of block devices is mediated by a collection
604of routines that manage a set of buffers containing
605the images of blocks of data on the various devices.
606The most important purpose of these routines is to assure
607that several processes that access the same block of the same
608device in multiprogrammed fashion maintain a consistent
609view of the data in the block.
610A secondary but still important purpose is to increase
611the efficiency of the system by
612keeping in-core copies of blocks that are being
613accessed frequently.
614The main data base for this mechanism is the
615table of buffers
616.I buf.
617Each buffer header contains a pair of pointers
618.I "(b_forw, b_back)"
619which maintain a doubly-linked list
620of the buffers associated with a particular
621block device, and a
622pair of pointers
623.I "(av_forw, av_back)"
624which generally maintain a doubly-linked list of blocks
625which are ``free,'' that is,
626eligible to be reallocated for another transaction.
627Buffers that have I/O in progress
628or are busy for other purposes do not appear in this list.
629The buffer header
630also contains the device and block number to which the
631buffer refers, and a pointer to the actual storage associated with
632the buffer.
633There is a word count
634which is the negative of the number of words
635to be transferred to or from the buffer;
636there is also an error byte and a residual word
637count used to communicate information
638from an I/O routine to its caller.
639Finally, there is a flag word
640with bits indicating the status of the buffer.
641These flags will be discussed below.
642.PP
643Seven routines constitute
644the most important part of the interface with the
645rest of the system.
646Given a device and block number,
647both
648.I bread
649and
650.I getblk
651return a pointer to a buffer header for the block;
652the difference is that
653.I bread
654is guaranteed to return a buffer actually containing the
655current data for the block,
656while
657.I getblk
658returns a buffer which contains the data in the
659block only if it is already in core (whether it is
660or not is indicated by the
661.I B_DONE
662bit; see below).
663In either case the buffer, and the corresponding
664device block, is made ``busy,''
665so that other processes referring to it
666are obliged to wait until it becomes free.
667.I Getblk
668is used, for example,
669when a block is about to be totally rewritten,
670so that its previous contents are
671not useful;
672still, no other process can be allowed to refer to the block
673until the new data is placed into it.
674.PP
675The
676.I breada
677routine is used to implement read-ahead.
It is logically similar to
679.I bread,
680but takes as an additional argument the number of
681a block (on the same device) to be read asynchronously
682after the specifically requested block is available.
683.PP
684Given a pointer to a buffer,
685the
686.I brelse
687routine
688makes the buffer again available to other processes.
689It is called, for example, after
690data has been extracted following a
691.I bread.
692There are three subtly-different write routines,
693all of which take a buffer pointer as argument,
694and all of which logically release the buffer for
695use by others and place it on the free list.
696.I Bwrite
697puts the
698buffer on the appropriate device queue,
699waits for the write to be done,
700and sets the user's error flag if required.
701.I Bawrite
702places the buffer on the device's queue, but does not wait
703for completion, so that errors cannot be reflected directly to
704the user.
705.I Bdwrite
706does not start any I/O operation at all,
707but merely marks
708the buffer so that if it happens
709to be grabbed from the free list to contain
710data from some other block, the data in it will
711first be written
712out.
713.PP
714.I Bwrite
715is used when one wants to be sure that
716I/O takes place correctly, and that
717errors are reflected to the proper user;
718it is used, for example, when updating i-nodes.
719.I Bawrite
720is useful when more overlap is desired
721(because no wait is required for I/O to finish)
722but when it is reasonably certain that the
723write is really required.
724.I Bdwrite
725is used when there is doubt that the write is
726needed at the moment.
727For example,
728.I bdwrite
729is called when the last byte of a
730.I write
731system call falls short of the end of a
732block, on the assumption that
733another
734.I write
735will be given soon which will re-use the same block.
736On the other hand,
737as the end of a block is passed,
738.I bawrite
739is called, since probably the block will
740not be accessed again soon and one might as
741well start the writing process as soon as possible.
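.PP
The choice between the delayed and the asynchronous write in this example
reduces to a test on where the data ends.
The function and enumeration names below are hypothetical,
and the sketch assumes the 512-byte block size mentioned earlier
and a transfer that does not cross a block boundary.

```c
#define SIM_BSIZE 512		/* block size given in the text */

enum sim_wpolicy { SIM_BDWRITE, SIM_BAWRITE };

/*
 * Decide how to release the buffer after copying n bytes that start
 * at byte offset off in the file; the n bytes lie within one block.
 */
enum sim_wpolicy
sim_choose(long off, int n)
{
	long end = off + n;		/* first byte after the data written */

	if (end % SIM_BSIZE == 0)
		return SIM_BAWRITE;	/* end of block passed: start I/O now */
	return SIM_BDWRITE;		/* short of the end: delay the write */
}
```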
742.PP
743In any event, notice that the routines
744.I "getblk"
745and
746.I bread
747dedicate the given block exclusively to the
748use of the caller, and make others wait,
749while one of
750.I "brelse, bwrite, bawrite,"
751or
752.I bdwrite
753must eventually be called to free the block for use by others.
754.PP
755As mentioned, each buffer header contains a flag
756word which indicates the status of the buffer.
757Since they provide
758one important channel for information between the drivers and the
759block I/O system, it is important to understand these flags.
760The following names are manifest constants which
761select the associated flag bits.
762.IP B_READ 10
763This bit is set when the buffer is handed to the device strategy routine
764(see below) to indicate a read operation.
765The symbol
766.I B_WRITE
767is defined as 0 and does not define a flag; it is provided
768as a mnemonic convenience to callers of routines like
769.I swap
770which have a separate argument
771which indicates read or write.
772.IP B_DONE 10
773This bit is set
to 0 when a block is handed to the device strategy
routine and is turned on when the operation completes,
whether normally or as the result of an error.
777It is also used as part of the return argument of
778.I getblk
to indicate, if 1, that the returned
780buffer actually contains the data in the requested block.
781.IP B_ERROR 10
782This bit may be set to 1 when
783.I B_DONE
784is set to indicate that an I/O or other error occurred.
785If it is set the
786.I b_error
787byte of the buffer header may contain an error code
788if it is non-zero.
789If
790.I b_error
791is 0 the nature of the error is not specified.
792Actually no driver at present sets
793.I b_error;
794the latter is provided for a future improvement
795whereby a more detailed error-reporting
796scheme may be implemented.
797.IP B_BUSY 10
798This bit indicates that the buffer header is not on
799the free list, i.e. is
800dedicated to someone's exclusive use.
801The buffer still remains attached to the list of
802blocks associated with its device, however.
803When
804.I getblk
805(or
806.I bread,
807which calls it) searches the buffer list
808for a given device and finds the requested
809block with this bit on, it sleeps until the bit
810clears.
811.IP B_PHYS 10
812This bit is set for raw I/O transactions that
813need to allocate the Unibus map on an 11/70.
814.IP B_MAP 10
815This bit is set on buffers that have the Unibus map allocated,
816so that the
817.I iodone
818routine knows to deallocate the map.
819.IP B_WANTED 10
820This flag is used in conjunction with the
821.I B_BUSY
822bit.
823Before sleeping as described
824just above,
825.I getblk
826sets this flag.
827Conversely, when the block is freed and the busy bit
828goes down (in
829.I brelse)
830a
831.I wakeup
832is given for the block header whenever
833.I B_WANTED
834is on.
This stratagem avoids the overhead
836of having to call
837.I wakeup
838every time a buffer is freed on the chance that someone
839might want it.
.IP B_AGE 10
841This bit may be set on buffers just before releasing them; if it
842is on,
843the buffer is placed at the head of the free list, rather than at the
844tail.
845It is a performance heuristic
846used when the caller judges that the same block will not soon be used again.
847.IP B_ASYNC 10
848This bit is set by
849.I bawrite
850to indicate to the appropriate device driver
851that the buffer should be released when the
852write has been finished, usually at interrupt time.
853The difference between
854.I bwrite
855and
856.I bawrite
857is that the former starts I/O, waits until it is done, and
858frees the buffer.
859The latter merely sets this bit and starts I/O.
860The bit indicates that
.I brelse
862should be called for the buffer on completion.
863.IP B_DELWRI 10
864This bit is set by
865.I bdwrite
866before releasing the buffer.
867When
868.I getblk,
869while searching for a free block,
870discovers the bit is 1 in a buffer it would otherwise grab,
871it causes the block to be written out before reusing it.
872.SH
873Block Device Drivers
874.PP
875The
876.I bdevsw
877table contains the names of the interface routines
878and that of a table for each block device.
879.PP
880Just as for character devices, block device drivers may supply
881an
882.I open
883and a
884.I close
885routine
886called respectively on each open and on the final close
887of the device.
888Instead of separate read and write routines,
889each block device driver has a
890.I strategy
891routine which is called with a pointer to a buffer
892header as argument.
893As discussed, the buffer header contains
894a read/write flag, the core address,
895the block number, a (negative) word count,
896and the major and minor device number.
897The role of the strategy routine
898is to carry out the operation as requested by the
899information in the buffer header.
900When the transaction is complete the
901.I B_DONE
902(and possibly the
903.I B_ERROR)
904bits should be set.
905Then if the
906.I B_ASYNC
907bit is set,
908.I brelse
909should be called;
910otherwise,
.I wakeup
should be called.
912In cases where the device
913is capable, under error-free operation,
914of transferring fewer words than requested,
915the device's word-count register should be placed
916in the residual count slot of
917the buffer header;
918otherwise, the residual count should be set to 0.
919This particular mechanism is really for the benefit
920of the magtape driver;
921when reading this device
922records shorter than requested are quite normal,
923and the user should be told the actual length of the record.
924.PP
925Although the most usual argument
926to the strategy routines
927is a genuine buffer header allocated as discussed above,
928all that is actually required
929is that the argument be a pointer to a place containing the
930appropriate information.
931For example the
932.I swap
933routine, which manages movement
934of core images to and from the swapping device,
935uses the strategy routine
936for this device.
937Care has to be taken that
938no extraneous bits get turned on in the
939flag word.
940.PP
941The device's table specified by
942.I bdevsw
943has a
944byte to contain an active flag and an error count,
945a pair of links which constitute the
946head of the chain of buffers for the device
947.I "(b_forw, b_back),"
948and a first and last pointer for a device queue.
949Of these things, all are used solely by the device driver
950itself
951except for the buffer-chain pointers.
952Typically the flag encodes the state of the
953device, and is used at a minimum to
954indicate that the device is currently engaged in
955transferring information and no new command should be issued.
956The error count is useful for counting retries
957when errors occur.
958The device queue is used to remember stacked requests;
959in the simplest case it may be maintained as a first-in
960first-out list.
961Since buffers which have been handed over to
962the strategy routines are never
963on the list of free buffers,
964the pointers in the buffer which maintain the free list
965.I "(av_forw, av_back)"
966are also used to contain the pointers
967which maintain the device queues.
968.PP
969A couple of routines
970are provided which are useful to block device drivers.
971.I "iodone(bp)"
972arranges that the buffer to which
973.I bp
974points be released or awakened,
975as appropriate,
976when the
977strategy module has finished with the buffer,
978either normally or after an error.
979(In the latter case the
980.I B_ERROR
981bit has presumably been set.)
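.PP
The completion step can be sketched as follows.
The flag values are illustrative (the text states only that
.I B_WRITE
is 0),
the sim_ names are hypothetical,
and the extra
.I failed
argument is an assumption of the example:
the real routine takes only the buffer pointer,
and the driver sets the error bit itself beforehand.

```c
/* Flag values are illustrative; only B_WRITE == 0 is stated in the text. */
#define SIM_B_READ	01
#define SIM_B_DONE	02
#define SIM_B_ERROR	04
#define SIM_B_BUSY	010
#define SIM_B_ASYNC	0400

struct sim_buf { int b_flags; };

static int sim_released;	/* counts calls to sim_brelse */
static int sim_wakeups;		/* counts calls to sim_wakeup */

/* Give the buffer back for use by others. */
void
sim_brelse(struct sim_buf *bp)
{
	bp->b_flags &= ~(SIM_B_BUSY | SIM_B_ASYNC);
	sim_released++;
}

/* Stand-in for wakeup on the buffer header. */
void
sim_wakeup(struct sim_buf *bp)
{
	(void)bp;
	sim_wakeups++;
}

/* Called when the strategy routine has finished with the buffer. */
void
sim_iodone(struct sim_buf *bp, int failed)
{
	bp->b_flags |= SIM_B_DONE;
	if (failed)
		bp->b_flags |= SIM_B_ERROR;
	if (bp->b_flags & SIM_B_ASYNC)
		sim_brelse(bp);		/* nobody is waiting for it */
	else
		sim_wakeup(bp);		/* a process sleeps on the header */
}
```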
982.PP
983The routine
984.I "geterror(bp)"
985can be used to examine the error bit in a buffer header
986and arrange that any error indication found therein is
987reflected to the user.
988It may be called only in the non-interrupt
989part of a driver when I/O has completed
990.I (B_DONE
991has been set).
992.SH
993Raw Block-device I/O
994.PP
995A scheme has been set up whereby block device drivers may
996provide the ability to transfer information
997directly between the user's core image and the device
998without the use of buffers and in blocks as large as
999the caller requests.
1000The method involves setting up a character-type special file
1001corresponding to the raw device
1002and providing
1003.I read
1004and
1005.I write
1006routines which set up what is usually a private,
1007non-shared buffer header with the appropriate information
1008and call the device's strategy routine.
1009If desired, separate
1010.I open
1011and
1012.I close
1013routines may be provided but this is usually unnecessary.
1014A special-function routine might come in handy, especially for
1015magtape.
1016.PP
1017A great deal of work has to be done to generate the
1018``appropriate information''
1019to put in the argument buffer for
1020the strategy module;
1021the worst part is to map relocated user addresses to physical addresses.
1022Most of this work is done by
.I "physio(strat, bp, dev, rw)"
1024whose arguments are the name of the
1025strategy routine
1026.I strat,
1027the buffer pointer
1028.I bp,
1029the device number
1030.I dev,
1031and a read-write flag
1032.I rw
1033whose value is either
1034.I B_READ
1035or
1036.I B_WRITE.
1037.I Physio
1038makes sure that the user's base address and count are
1039even (because most devices work in words)
1040and that the core area affected is contiguous
1041in physical space;
1042it delays until the buffer is not busy, and makes it
1043busy while the operation is in progress;
1044and it sets up user error return information.