.de P1
.DS
..
.de P2
.DE
..
.de UL
.lg 0
.if n .ul
\%\&\\$3\f3\\$1\fR\&\\$2
.lg
..
.de UC
\&\\$3\s-1\\$1\\s0\&\\$2
..
.de IT
.lg 0
.if n .ul
\%\&\\$3\f2\\$1\fR\&\\$2
.lg
..
.de SP
.sp \\$1
..
.hw device
.TL
UNIX Implementation
.AU "MH 2C-523" 2394
K. Thompson
.AI
.MH
.AB
This paper describes in high-level terms the
implementation of the resident
.UX
kernel.
This discussion is broken into three parts.
The first part describes
how the
.UX
system views processes, users, and programs.
The second part describes the I/O system.
The last part describes the
.UX
file system.
.AE
.NH
INTRODUCTION
.PP
The
.UX
kernel consists of about 10,000
lines of C code and about 1,000 lines of assembly code.
The assembly code can be further broken down into
200 lines included for
the sake of efficiency
(they could have been written in C)
and 800 lines to perform hardware
functions not possible in C.
.PP
This code represents 5 to 10 percent of what has
been lumped into the broad expression
``the
.UX
operating system.''
The kernel is the only
.UX
code that
cannot be substituted by a user to his
own liking.
For this reason,
the kernel should make as few real
decisions as possible.
This does not mean to allow the user
a million options to do the same thing.
Rather, it means to allow only one way to
do one thing,
but have that way be the least-common divisor
of all the options that might have been provided.
.PP
What is or is not implemented in the kernel
represents both a great responsibility and a great power.
It is a soap-box platform on
``the way things should be done.''
Even so, if
``the way'' is too radical,
no one will follow it.
Every important decision was weighed
carefully.
Throughout,
simplicity has been substituted for efficiency.
Complex algorithms are used only if
their complexity can be localized.
.NH
PROCESS CONTROL
.PP
In the
.UX
system,
a user executes programs in an
environment called a user process.
When a system function is required,
the user process calls the system
as a subroutine.
At some point in this call,
there is a distinct switch of environments.
After this,
the process is said to be a system process.
In the normal definition of processes,
the user and system processes are different
phases of the same process
(they never execute simultaneously).
For protection,
each system process has its own stack.
.PP
The user process may execute
from a read-only text segment,
which is shared by all processes
executing the same code.
There is no
.IT functional
benefit
from shared-text segments.
An
.IT efficiency
benefit comes from the fact
that there is no need to swap read-only
segments out because the original
copy on secondary memory is still current.
This is a great benefit to interactive
programs that tend to be swapped while
waiting for terminal input.
Furthermore,
if two processes are
executing
simultaneously
from the same copy of a read-only segment,
only one copy needs to reside in
primary memory.
This is a secondary effect,
because
simultaneous execution of a program
is not common.
It is ironic that this effect,
which reduces the use of primary memory,
only comes into play when there is
an overabundance of primary memory,
that is,
when there is enough memory
to keep waiting processes loaded.
.PP
All current read-only text segments in the
system are maintained from the
.IT "text table" .
A text table entry holds the location of the
text segment on secondary memory.
If the segment is loaded,
that table also holds the primary memory location
and the count of the number of processes
sharing this entry.
When this count is reduced to zero,
the entry is freed along with any
primary and secondary memory holding the segment.
When a process first executes a shared-text segment,
a text table entry is allocated and the
segment is loaded onto secondary memory.
If a second process executes a text segment
that is already allocated,
the entry reference count is simply incremented.
.PP
A user process has some strictly private
read-write data
contained in its
data segment.
As far as possible,
the system does not use the user's
data segment to hold system data.
In particular,
there are no I/O buffers in the
user address space.
.PP
The user data segment has two growing boundaries.
One, increased automatically by the system
as a result of memory faults,
is used for a stack.
The second boundary is only grown (or shrunk) by
explicit requests.
The contents of newly allocated primary memory
are initialized to zero.
.PP
Also associated and swapped with
a process is a small fixed-size
system data segment.
This segment contains all
the data about the process
that the system needs only when the
process is active.
Examples of the kind of data contained
in the system data segment are:
saved central processor registers,
open file descriptors,
accounting information,
scratch data area,
and the stack for the system phase
of the process.
The system data segment is not
addressable from the user process
and is therefore protected.
.PP
Last,
there is a process table with
one entry per process.
This entry contains all the data
needed by the system when the process
is
.IT not
active.
Examples are
the process's name,
the location of the other segments,
and scheduling information.
The process table entry is allocated
when the process is created, and freed
when the process terminates.
This process entry is always directly
addressable by the kernel.
.PP
Figure 1 shows the relationships
between the various process control
data.
In a sense,
the process table is the
definition of all processes,
because
all the data associated with a process
may be accessed
starting from the process table entry.
.KS
.sp 2.44i
.sp 2v
.ce
Fig. 1\(emProcess control data structure.
.KE
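.PP
The relationships above can be sketched as C structures. This is a simplified, hypothetical layout; the field names are illustrative, not the kernel's actual declarations:

```c
#include <assert.h>

/* Hypothetical, simplified analogues of the tables described above. */
struct text {                /* one entry per active shared-text segment */
    int  t_daddr;            /* location of the segment on secondary memory */
    int  t_caddr;            /* primary memory location, if loaded */
    int  t_count;            /* number of processes sharing this entry */
};

struct proc {                /* one entry per process; always kernel-addressable */
    int          p_pid;      /* process name (number) */
    int          p_stat;     /* scheduling state */
    int          p_addr;     /* location of the swappable segments */
    struct text *p_textp;    /* shared-text entry, if any */
};

/* A second process executing an already-loaded text segment
 * simply bumps the reference count. */
void text_attach(struct proc *p, struct text *t) {
    t->t_count++;
    p->p_textp = t;
}

/* When the count drops to zero, the entry (and its memory) is freed. */
int text_detach(struct proc *p) {
    struct text *t = p->p_textp;
    p->p_textp = 0;
    return --t->t_count == 0;   /* caller frees memory when this is true */
}
```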
.NH 2
Process creation and program execution
.PP
Processes are created by the system primitive
.UL fork .
The newly created process (child) is a copy of the original process (parent).
There is no detectable sharing of primary memory between the two processes.
(Of course,
if the parent process was executing from a read-only
text segment,
the child will share the text segment.)
Copies of all writable data segments
are made for the child process.
Files that were open before the
.UL fork
are
truly shared after the
.UL fork .
The processes are informed as to their part in the
relationship to
allow them to select their own
(usually non-identical)
destiny.
The parent may
.UL wait
for the termination of
any of its children.
.PP
A process may
.UL exec
a file.
This consists of exchanging the current text and data
segments of the process for new text and data
segments specified in the file.
The old segments are lost.
Doing an
.UL exec
does
.IT not
change processes;
the process that did the
.UL exec
persists,
but
after the
.UL exec
it is executing a different program.
Files that were open
before the
.UL exec
remain open after the
.UL exec .
.PP
If a program,
say the first pass of a compiler,
wishes to overlay itself with another program,
say the second pass,
then it simply
.UL exec s
the second program.
This is analogous
to a ``goto.''
If a program wishes to regain control
after
.UL exec ing
a second program,
it should
.UL fork
a child process,
have the child
.UL exec
the second program, and
have the parent
.UL wait
for the child.
This is analogous to a ``call.''
Breaking up the call into a binding followed by
a transfer is similar to the subroutine linkage in
SL-5.
.[
griswold hanson sl5 overview
.]
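.PP
The ``call'' pattern above can be written out in modern C library spellings. The program run here (\f3/bin/echo\fR) is purely a stand-in for ``the second program'':

```c
#include <sys/wait.h>
#include <unistd.h>
#include <assert.h>

/* fork a child, have the child exec the second program,
 * and have the parent wait for it ("call"). */
int run(const char *path, const char *arg) {
    pid_t pid = fork();
    if (pid == 0) {                       /* child: overlay itself ("goto") */
        execl(path, path, arg, (char *)0);
        _exit(127);                       /* reached only if exec fails */
    }
    int status;
    waitpid(pid, &status, 0);             /* parent: the "call" returns here */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```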
.NH 2
Swapping
.PP
The major data associated with a process
(the user data segment,
the system data segment, and
the text segment)
are swapped to and from secondary
memory, as needed.
The user data segment and the system data segment
are kept in contiguous primary memory to reduce
swapping latency.
(When low-latency devices, such as bubbles,
.UC CCD s,
or scatter/gather devices,
are used,
this decision will have to be reconsidered.)
Allocation of both primary
and secondary memory is performed
by the same simple first-fit algorithm.
When a process grows,
a new piece of primary memory is allocated.
The contents of the old memory are copied to the new memory.
The old memory is freed
and the tables are updated.
If there is not enough primary memory,
secondary memory is allocated instead.
The process is swapped out onto the
secondary memory,
ready to be swapped in with
its new size.
.PP
One separate process in the kernel,
the swapping process,
simply swaps the other
processes in and out of primary memory.
It examines the
process table looking for a process
that is swapped out and is
ready to run.
It allocates primary memory for that
process and
reads its segments into
primary memory, where that process competes for the
central processor with other loaded processes.
If no primary memory is available,
the swapping process makes memory available
by examining the process table for processes
that can be swapped out.
It selects a process to swap out,
writes it to secondary memory,
frees the primary memory,
and then goes back to look for a process
to swap in.
.PP
Thus there are two specific algorithms
to the swapping process.
Which of the possibly many processes that
are swapped out is to be swapped in?
This is decided by secondary storage residence
time.
The one with the longest time out is swapped in first.
There is a slight penalty for larger processes.
Which of the possibly many processes that
are loaded is to be swapped out?
Processes that are waiting for slow events
(i.e., not currently running or waiting for
disk I/O)
are picked first,
by age in primary memory,
again with size penalties.
The other processes are examined
by the same age algorithm,
but are not taken out unless they are
at least of some age.
This adds
hysteresis to the swapping and
prevents total thrashing.
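.PP
The two selection rules above can be sketched as scoring functions. The scoring arithmetic here (the size penalty divisor, the minimum resident age) is an illustrative assumption, not the kernel's actual formula:

```c
#include <assert.h>

struct p { int swapped_out; int waiting_slow; int age; int size; };

/* Swap-in choice: longest time on secondary storage wins,
 * with a slight penalty for larger processes. */
int swapin_score(const struct p *q) {
    return q->swapped_out ? q->age - q->size / 16 : -1;
}

/* Swap-out choice: processes waiting for slow events are eligible
 * immediately; others only once they reach some minimum age --
 * the hysteresis that prevents total thrashing. */
#define MIN_RESIDENT_AGE 2
int swapout_score(const struct p *q) {
    if (q->swapped_out) return -1;
    if (!q->waiting_slow && q->age < MIN_RESIDENT_AGE) return -1;
    return q->age + (q->waiting_slow ? 1000 : 0) - q->size / 16;
}
```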
.PP
These swapping algorithms are the
most suspect in the system.
With limited primary memory,
these algorithms cause total swapping.
This is not bad in itself, because
the swapping does not impact the
execution of the resident processes.
However, if the swapping device must
also be used for file storage,
the swapping traffic severely
impacts the file system traffic.
It is exactly these small systems
that tend to double usage of limited disk
resources.
.NH 2
Synchronization and scheduling
.PP
Process synchronization is accomplished by having processes
wait for events.
Events are represented by arbitrary integers.
By convention,
events are chosen to be addresses of
tables associated with those events.
For example, a process that is waiting for
any of its children to terminate will wait
for an event that is the address of
its own process table entry.
When a process terminates,
it signals the event represented by
its parent's process table entry.
Signaling an event on which no process
is waiting has no effect.
Similarly,
signaling an event on which many processes
are waiting will wake all of them up.
This differs considerably from
Dijkstra's P and V
synchronization operations,
.[
dijkstra sequential processes 1968
.]
in that
no memory is associated with events.
Thus there need be no allocation of events
prior to their use.
Events exist simply by being used.
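.PP
A toy model of this mechanism makes the memoryless property visible: an event is just an integer, and signaling wakes every waiter but leaves no trace for events nobody is waiting on. All names here are illustrative:

```c
#include <assert.h>

enum state { RUNNING, WAITING };

struct proc { enum state stat; long event; };

#define NPROC 4
struct proc ptab[NPROC];

void event_wait(struct proc *p, long ev) {   /* sleep on an event */
    p->event = ev;
    p->stat  = WAITING;
}

int event_signal(long ev) {                  /* wake ALL waiters */
    int woken = 0;
    for (int i = 0; i < NPROC; i++)
        if (ptab[i].stat == WAITING && ptab[i].event == ev) {
            ptab[i].stat = RUNNING;
            woken++;
        }
    return woken;      /* 0 if no one was waiting: the signal has no effect */
}
```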
.PP
On the negative side,
because there is no memory associated with events,
no notion of ``how much''
can be signaled via the event mechanism.
For example,
processes that want memory might
wait on an event associated with
memory allocation.
When any amount of memory becomes available,
the event would be signaled.
All the competing processes would then wake
up to fight over the new memory.
(In reality,
the swapping process is the only process
that waits for primary memory to become available.)
.PP
If an event occurs
between the time a process decides
to wait for that event and the
time that process enters the wait state,
then
the process will wait on an event that has
already happened (and may never happen again).
This race condition happens because there is no memory associated with
the event to indicate that the event has occurred;
the only action of an event is to change a set of processes
from wait state to run state.
This problem is relieved largely
by the fact that process switching can
only occur in the kernel by explicit calls
to the event-wait mechanism.
If the event in question is signaled by another
process,
then there is no problem.
But if the event is signaled by a hardware
interrupt,
then special care must be taken.
These synchronization races pose the biggest
problem when
.UX
is adapted to multiple-processor configurations.
.[
hawley meyer multiprocessing unix
.]
.PP
The event-wait code in the kernel
is like a co-routine linkage.
At any time,
all but one of the processes has called event-wait.
The remaining process is the one currently executing.
When it calls event-wait,
a process whose event has been signaled
is selected and that process
returns from its call to event-wait.
.PP
Which of the runable processes is to run next?
Associated with each process is a priority.
The priority of a system process is assigned by the code
issuing the wait on an event.
This is roughly equivalent to the response
that one would expect on such an event.
Disk events have high priority,
teletype events are low,
and time-of-day events are very low.
(From observation,
the difference in system process priorities
has little or no performance impact.)
All user-process priorities are lower than the
lowest system priority.
User-process priorities are assigned
by an algorithm based on the
recent ratio of the amount of compute time to real time consumed
by the process.
A process that has used a lot of
compute time in the last real-time
unit is assigned a low user priority.
Because interactive processes are characterized
by low ratios of compute to real time,
interactive response is maintained without any
special arrangements.
.PP
The scheduling algorithm simply picks
the process with the highest priority,
thus
picking all system processes first and
user processes second.
The compute-to-real-time ratio is updated
every second.
Thus,
all other things being equal,
looping user processes will be
scheduled round-robin with a
1-second quantum.
A high-priority process waking up will
preempt a running, low-priority process.
The scheduling algorithm has a very desirable
negative feedback character.
If a process uses its high priority
to hog the computer,
its priority will drop.
At the same time, if a low-priority
process is ignored for a long time,
its priority will rise.
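.PP
The negative-feedback rule can be sketched as a once-per-second recomputation. Here, numerically larger values mean worse priority; the decay constant and scale are illustrative assumptions, not the kernel's actual values:

```c
#include <assert.h>

#define USER_BASE 100        /* worse than every system priority */

struct proc { int cpu; int pri; };   /* cpu: recent compute-time ticks */

/* Called once per real-time second: the usage history decays, and
 * the priority is recomputed from it.  A CPU hog's priority worsens;
 * an ignored process drifts back toward the best user priority. */
void tick_second(struct proc *p, int ticks_used) {
    p->cpu = p->cpu / 2 + ticks_used;   /* decaying compute/real ratio */
    p->pri = USER_BASE + p->cpu / 2;    /* heavy users get worse priority */
}
```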
.NH
I/O SYSTEM
.PP
The I/O system
is broken into two completely separate systems:
the block I/O system and the character I/O system.
In retrospect,
the names should have been ``structured I/O''
and ``unstructured I/O,'' respectively;
while the term ``block I/O'' has some meaning,
``character I/O'' is a complete misnomer.
.PP
Devices are characterized by a major device number,
a minor device number, and
a class (block or character).
For each class,
there is an array of entry points into the device drivers.
The major device number is used to index the array
when calling the code for a particular device driver.
The minor device number is passed to the
device driver as an argument.
The minor number has no significance other
than that attributed to it by the driver.
Usually,
the driver uses the minor number to access
one of several identical physical devices.
.PP
The use of the array of entry points
(configuration table)
as the only connection between the
system code and the device drivers is
very important.
Early versions of the system had a much
less formal connection with the drivers,
so that it was extremely hard to handcraft
differently configured systems.
Now it is possible to create new
device drivers in an average of a few hours.
The configuration table in most cases
is created automatically by a program
that reads the system's parts list.
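.PP
The configuration table can be sketched as an array of function pointers, one slot per major number, with the minor number passed through to the driver. The table shape and the do-nothing driver below are illustrative; the historical kernel's tables carry more entry points than shown here:

```c
#include <assert.h>

struct bdevsw {
    int (*d_open)(int minor);
    int (*d_strategy)(int minor, int blkno);
};

/* A stand-in driver for major number 0. */
static int null_open(int minor)              { return minor; }
static int null_strategy(int minor, int blk) { return minor * 1000 + blk; }

static struct bdevsw bdevsw[] = {
    { null_open, null_strategy },    /* major 0 */
};

/* The system reaches every driver only through this indexing. */
int bread_start(int major, int minor, int blkno) {
    return bdevsw[major].d_strategy(minor, blkno);
}
```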
.NH 2
Block I/O system
.PP
The model block I/O device consists
of randomly addressed, secondary
memory blocks of 512 bytes each.
The blocks are uniformly addressed
0, 1, .\|.\|. up to the size of the device.
The block device driver has the job of
emulating this model on a
physical device.
.PP
The block I/O devices are accessed
through a layer of buffering software.
The system maintains a list of buffers
(typically between 10 and 70),
each assigned a device name and
a device address.
This buffer pool constitutes a data cache
for the block devices.
On a read request,
the cache is searched for the desired block.
If the block is found,
the data are made available to the
requester without any physical I/O.
If the block is not in the cache,
the least recently used block in the cache is renamed,
the correct device driver is called to
fill up the renamed buffer, and then the
data are made available.
Write requests are handled in an analogous manner.
The correct buffer is found
and relabeled if necessary.
The write is performed simply by marking
the buffer as ``dirty.''
The physical I/O is then deferred until
the buffer is renamed.
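.PP
The read path above can be sketched as follows. The LRU bookkeeping via a counter, the pool size, and all the names are illustrative simplifications; the driver call on a miss is only noted by a comment:

```c
#include <assert.h>

#define NBUF 4

struct buf {
    int dev, blkno;       /* the buffer's current "name" */
    int dirty;            /* delayed write pending */
    long lastuse;         /* stand-in for LRU position */
    char data[512];
};

static struct buf cache[NBUF];
static long clock_;        /* advances on every access */
static int  physio_count;  /* physical reads performed */

void cache_init(void) {
    for (int i = 0; i < NBUF; i++) {
        cache[i].dev = -1; cache[i].blkno = -1; cache[i].lastuse = 0;
    }
    clock_ = 0;
    physio_count = 0;
}

struct buf *bread(int dev, int blkno) {
    struct buf *lru = &cache[0];
    for (int i = 0; i < NBUF; i++) {
        if (cache[i].dev == dev && cache[i].blkno == blkno) {
            cache[i].lastuse = ++clock_;
            return &cache[i];            /* cache hit: no physical I/O */
        }
        if (cache[i].lastuse < lru->lastuse)
            lru = &cache[i];
    }
    /* miss: rename the least recently used buffer and fill it */
    lru->dev = dev;
    lru->blkno = blkno;
    lru->dirty = 0;
    lru->lastuse = ++clock_;
    physio_count++;                      /* (driver call would go here) */
    return lru;
}

void bwrite(struct buf *b) { b->dirty = 1; }   /* delayed: just mark dirty */
```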
.PP
The benefits in reduction of physical I/O
of this scheme are substantial,
especially considering the file system implementation.
There are,
however,
some drawbacks.
The asynchronous nature of the
algorithm makes error reporting
and meaningful user error handling
almost impossible.
The cavalier approach to I/O error
handling in the
.UX
system is partly due to the asynchronous
nature of the block I/O system.
A second problem is in the delayed writes.
If the system stops unexpectedly,
it is almost certain that there is a
lot of logically complete,
but physically incomplete,
I/O in the buffers.
There is a system primitive to
flush all outstanding I/O activity
from the buffers.
Periodic use of this primitive helps,
but does not solve, the problem.
Finally,
the associativity in the buffers
can alter the physical I/O sequence
from that of the logical I/O sequence.
This means that there are times
when data structures on disk are inconsistent,
even though the software is careful
to perform I/O in the correct order.
On non-random devices,
notably magnetic tape,
the inversions of writes can be disastrous.
The problem with magnetic tapes is ``cured'' by
allowing only one outstanding write request
per drive.
.NH 2
Character I/O system
.PP
The character I/O system consists of all
devices that do not fall into the block I/O model.
This includes the ``classical'' character devices
such as communications lines, paper tape, and
line printers.
It also includes magnetic tape and disks when
they are not used in a stereotyped way,
for example, 80-byte physical records on tape
and track-at-a-time disk copies.
In short,
the character I/O interface
means ``everything other than block.''
I/O requests from the user are sent to the
device driver essentially unaltered.
The implementation of these requests is, of course,
up to the device driver.
There are guidelines and conventions
to help the implementation of
certain types of device drivers.
.NH 3
Disk drivers
.PP
Disk drivers are implemented
with a queue of transaction records.
Each record holds a read/write flag,
a primary memory address,
a secondary memory address, and
a transfer byte count.
Swapping is accomplished by passing
such a record to the swapping device driver.
The block I/O interface is implemented by
passing such records with requests to
fill and empty system buffers.
The character I/O interface to the disk
drivers creates a transaction record that
points directly into the user area.
The routine that creates this record also insures
that the user is not swapped during this
I/O transaction.
Thus by implementing the general disk driver,
it is possible to use the disk
as a block device,
a character device, and a swap device.
The only really disk-specific code in normal
disk drivers is the pre-sort of transactions to
minimize latency for a particular device, and
the actual issuing of the I/O request.
.NH 3
Character lists
.PP
Real character-oriented devices may
be implemented using the common
code to handle character lists.
A character list is a queue of characters.
One routine puts a character on a queue.
Another gets a character from a queue.
It is also possible to ask how many
characters are currently on a queue.
Storage for all queues in the system comes
from a single common pool.
Putting a character on a queue will allocate
space from the common pool and link the
character onto the data structure defining the queue.
Getting a character from a queue returns
the corresponding space to the pool.
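.PP
A toy character list shows the common-pool discipline. A real implementation links small blocks of characters to amortize the overhead; this sketch allocates one pool cell per character to keep the idea visible, and all names are illustrative:

```c
#include <assert.h>

#define POOLSIZE 64

struct cell { char c; int next; };        /* next: pool index, -1 = none */
static struct cell pool[POOLSIZE];
static int freelist = -1;

struct clist { int head, tail, count; };

void pool_init(void) {
    for (int i = 0; i < POOLSIZE; i++) pool[i].next = i + 1;
    pool[POOLSIZE - 1].next = -1;
    freelist = 0;
}

int putc_q(struct clist *q, char c) {     /* allocate from the pool, link on */
    int i = freelist;
    if (i < 0) return -1;                 /* pool exhausted */
    freelist = pool[i].next;
    pool[i].c = c;
    pool[i].next = -1;
    if (q->count == 0) q->head = i; else pool[q->tail].next = i;
    q->tail = i;
    q->count++;
    return 0;
}

int getc_q(struct clist *q) {             /* unlink, return cell to the pool */
    if (q->count == 0) return -1;
    int i = q->head;
    int c = (unsigned char)pool[i].c;
    q->head = pool[i].next;
    pool[i].next = freelist;
    freelist = i;
    q->count--;
    return c;
}
```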
.PP
A typical character-output device
(paper tape punch, for example)
is implemented by passing characters
from the user onto a character queue until
some maximum number of characters is on the queue.
The I/O is prodded to start as
soon as there is anything on the queue
and, once started,
it is sustained by hardware completion interrupts.
Each time there is a completion interrupt,
the driver gets the next character from the queue
and sends it to the hardware.
The number of characters on the queue is checked and,
as the count falls through some intermediate level,
an event (the queue address) is signaled.
The process that is passing characters from
the user to the queue can be waiting on the event, and
refill the queue to its maximum
when the event occurs.
.PP
A typical character input device
(for example, a paper tape reader)
is handled in a very similar manner.
.PP
Another class of character devices is the terminals.
A terminal is represented by three
character queues.
There are two input queues (raw and canonical)
and an output queue.
Characters going to the output of a terminal
are handled by common code exactly as described
above.
The main difference is that there is also code
to interpret the output stream as
.UC ASCII
characters and to perform some translations,
e.g., escapes for deficient terminals.
Another common aspect of terminals is code
to insert real-time delay after certain control characters.
.PP
Input on terminals is a little different.
Characters are collected from the terminal and
placed on a raw input queue.
Some device-dependent code conversion and
escape interpretation is handled here.
When a line is complete in the raw queue,
an event is signaled.
The code catching this signal then copies a
line from the raw queue to a canonical queue,
performing the character erase and line kill editing.
User read requests on terminals can be
directed at either the raw or canonical queues.
.NH 3
Other character devices
.PP
Finally,
there are devices that fit no general category.
These devices are set up as character I/O drivers.
An example is a driver that reads and writes
unmapped primary memory as an I/O device.
Some devices are too
fast to be treated a character at a time,
but do not fit the disk I/O mold.
Examples are fast communications lines and
fast line printers.
These devices either have their own buffers
or ``borrow'' block I/O buffers for a while and
then give them back.
.NH
THE FILE SYSTEM
.PP
In the
.UX
system,
a file is a (one-dimensional) array of bytes.
No other structure of files is implied by the
system.
Files are attached anywhere
(and possibly multiply)
onto a hierarchy of directories.
Directories are simply files that
users cannot write.
For a further discussion
of the external view of files and directories,
see Ref.\0
.[
ritchie thompson unix bstj 1978
%Q This issue
.].
.PP
The
.UX
file system is a disk data structure
accessed completely through
the block I/O system.
As stated before,
the canonical view of a ``disk'' is
a randomly addressable array of
512-byte blocks.
A file system breaks the disk into
four self-identifying regions.
The first block (address 0)
is unused by the file system.
It is left aside for booting procedures.
The second block (address 1)
contains the so-called ``super-block.''
This block,
among other things,
contains the size of the disk and
the boundaries of the other regions.
Next comes the i-list,
a list of file definitions.
Each file definition is
a 64-byte structure, called an i-node.
The offset of a particular i-node
within the i-list is called its i-number.
The combination of device name
(major and minor numbers) and i-number
serves to uniquely name a particular file.
After the i-list,
and to the end of the disk,
come free storage blocks that
are available for the contents of files.
.PP
The free space on a disk is maintained
by a linked list of available disk blocks.
Every block in this chain contains a disk address
of the next block in the chain.
The remaining space contains the addresses of up to
50 disk blocks that are also free.
Thus with one I/O operation,
the system obtains 50 free blocks and a
pointer to where to find more.
The disk allocation algorithms are
very straightforward.
Since all allocation is in fixed-size
blocks and there is strict accounting of
space,
there is no need to compact or garbage collect.
However,
as disk space becomes dispersed,
latency gradually increases.
Some installations choose to occasionally compact
disk space to reduce latency.
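.PP
The chained free list can be sketched as follows. The names and the refill step are illustrative simplifications; the historical code's handling of the chain block itself is subtler than shown here:

```c
#include <assert.h>

#define NICFREE 50

struct fblk {                 /* layout of a free-chain block on disk */
    int df_nfree;             /* how many addresses in df_free are valid */
    int df_free[NICFREE];     /* df_free[0] names the next chain block */
};

/* In-core copy kept with the super-block; alloc() pops addresses from
 * it, and when it runs dry one "I/O" refills 50 addresses plus the
 * pointer to more. */
struct fblk incore;

int alloc(struct fblk *disk_chain /* stand-in for reading the next chain block */) {
    int b = incore.df_free[--incore.df_nfree];
    if (incore.df_nfree == 0)
        incore = *disk_chain;  /* one read: 50 more free blocks + pointer */
    return b;
}
```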
.PP
An i-node contains 13 disk addresses.
The first 10 of these addresses point directly at
the first 10 blocks of a file.
If a file is larger than 10 blocks (5,120 bytes),
then the eleventh address points at a block
that contains the addresses of the next 128 blocks of the file.
If the file is still larger than this
(70,656 bytes),
then the twelfth block points at up to 128 blocks,
each pointing to 128 blocks of the file.
Files yet larger
(8,459,264 bytes)
use the thirteenth address for a ``triple indirect'' address.
The algorithm ends here with the maximum file size
of 1,082,201,087 bytes.
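.PP
The size thresholds follow directly from 512-byte blocks, 10 direct addresses, and 128 addresses per indirect block; the total block capacity works out to 1,082,201,088 bytes, of which the last addressable byte offset is the 1,082,201,087 quoted above:

```c
#include <assert.h>

#define BSIZE  512L
#define NDIR   10L
#define NINDIR 128L   /* 512 / 4 addresses per indirect block */

long direct_limit(void) { return NDIR * BSIZE; }
long single_limit(void) { return (NDIR + NINDIR) * BSIZE; }
long double_limit(void) {
    return (NDIR + NINDIR + NINDIR * NINDIR) * BSIZE;
}
long triple_limit(void) {
    return (NDIR + NINDIR + NINDIR * NINDIR + NINDIR * NINDIR * NINDIR) * BSIZE;
}
```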
.PP
A logical directory hierarchy is added
to this flat physical structure simply
by adding a new type of file, the directory.
A directory is accessed exactly as an ordinary file.
It contains 16-byte entries consisting of
a 14-byte name and an i-number.
The root of the hierarchy is at a known i-number
(\fIviz.,\fR 2).
The file system structure allows an arbitrary, directed graph
of directories with regular files linked in
at arbitrary places in this graph.
In fact,
very early
.UX
systems used such a structure.
Administration of such a structure became so
chaotic that later systems were restricted
to a directory tree.
Even now,
with regular files linked multiply
into arbitrary places in the tree,
accounting for space has become a problem.
It may become necessary to restrict the entire
structure to a tree,
and allow a new form of linking that
is subservient to the tree structure.
.PP
The file system allows
easy creation,
easy removal,
easy random accessing,
and very easy space allocation.
With most physical addresses confined
to a small contiguous section of disk,
it is also easy to dump, restore, and
check the consistency of the file system.
Large files suffer from indirect addressing,
but the cache prevents most of the implied physical I/O
without adding much execution.
The space overhead properties of this scheme are quite good.
For example,
on one particular file system,
there are 25,000 files containing 130M bytes of data-file content.
The overhead (i-node, indirect blocks, and last block breakage)
is about 11.5M bytes.
The directory structure to support these files
has about 1,500 directories containing 0.6M bytes of directory content
and about 0.5M bytes of overhead in accessing the directories.
Added up any way,
this comes out to less than a 10 percent overhead for actual
stored data.
Most systems have this much overhead in
padded trailing blanks alone.
.NH 2
File system implementation
.PP
Because the i-node defines a file,
the implementation of the file system centers
around access to the i-node.
The system maintains a table of all active
i-nodes.
As a new file is accessed,
the system locates the corresponding i-node,
allocates an i-node table entry, and reads
the i-node into primary memory.
As in the buffer cache,
the table entry is considered to be the current
version of the i-node.
Modifications to the i-node are made to
the table entry.
When the last access to the i-node goes
away,
the table entry is copied back to the
secondary store i-list and the table entry is freed.
.PP
All I/O operations on files are carried out
with the aid of the corresponding i-node table entry.
The accessing of a file is a straightforward
implementation of the algorithms mentioned previously.
The user is not aware of i-nodes and i-numbers.
References to the file system are made in terms of
path names of the directory tree.
Converting a path name into an i-node table entry
is also straightforward.
Starting at some known i-node
(the root or the current directory of some process),
the next component of the path name is
searched for by reading the directory.
This gives an i-number and an implied device
(that of the directory).
Thus the next i-node table entry can be accessed.
If that was the last component of the path name,
then this i-node is the result.
If not,
this i-node is the directory needed to look up
the next component of the path name, and the
algorithm is repeated.
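.PP
The path-name walk can be sketched as a loop over 16-byte directory entries (14-byte name plus i-number). The in-memory ``disk'' below is a fabricated stand-in for reading directory files, and all the names and contents are illustrative:

```c
#include <assert.h>
#include <string.h>

#define ROOTINO 2

struct dirent { char d_name[14]; unsigned short d_ino; };

/* A tiny fake file system: directory entries tagged with the
 * i-number of the directory that contains them. */
static struct { int dir_ino; struct dirent e; } disk[] = {
    { ROOTINO, { "usr", 5 } },
    { ROOTINO, { "bin", 6 } },
    { 5,       { "doc", 9 } },
};

static int dirlookup(int dir_ino, const char *name) {
    for (unsigned i = 0; i < sizeof disk / sizeof disk[0]; i++)
        if (disk[i].dir_ino == dir_ino &&
            strncmp(disk[i].e.d_name, name, 14) == 0)
            return disk[i].e.d_ino;
    return -1;
}

/* Resolve "usr/doc"-style paths one component at a time,
 * starting from the known root i-number. */
int namei(const char *path) {
    int ino = ROOTINO;
    char comp[15];
    while (*path && ino > 0) {
        int n = 0;
        while (*path && *path != '/' && n < 14) comp[n++] = *path++;
        comp[n] = '\0';
        if (*path == '/') path++;
        ino = dirlookup(ino, comp);
    }
    return ino;
}
```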
.PP
The user process accesses the file system with
certain primitives.
The most common of these are
.UL open ,
.UL create ,
.UL read ,
.UL write ,
.UL seek ,
and
.UL close .
The data structures maintained are shown in Fig. 2.
.KS
.sp 22P
.ce
Fig. 2\(emFile system data structure.
.KE
In the system data segment associated with a user,
there is room for some (usually between 10 and 50) open files.
This open file table consists of pointers that can be used to access
corresponding i-node table entries.
Associated with each of these open files is
a current I/O pointer.
This is a byte offset of
the next read/write operation on the file.
The system treats each read/write request
as random with an implied seek to the
I/O pointer.
The user usually thinks of the file as
sequential with the I/O pointer
automatically counting the number of bytes
that have been read/written from the file.
The user may,
of course,
perform random I/O by setting the I/O pointer
before reads/writes.
.PP
With file sharing,
it is necessary to allow related
processes to share a common I/O pointer
and yet have separate I/O pointers
for independent processes
that access the same file.
With these two conditions,
the I/O pointer cannot reside
in the i-node table nor can
it reside in the list of
open files for the process.
A new table
(the open file table)
was invented for the sole purpose
of holding the I/O pointer.
Processes that share the same open
file
(the result of
.UL fork s)
share a common open file table entry.
A separate open of the same file will
only share the i-node table entry,
but will have distinct open file table entries.
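.PP
The three-level structure just described (per-process open files,
the open file table holding the I/O pointer, and the i-node table)
can be sketched in C as follows. All names are illustrative
assumptions, and a fork is reduced to a pointer copy.

```c
/* Hypothetical sketch of the three tables; field and function
   names are illustrative, not the real kernel declarations. */
#include <assert.h>
#include <stddef.h>

struct inode_ent {               /* system-wide i-node table entry */
    int inum;
    int refcnt;
};

struct file_ent {                /* system-wide open file table entry */
    struct inode_ent *ip;
    long offset;                 /* the shared I/O pointer */
    int refcnt;
};

#define NOFILE 15
struct proc {                    /* per-process system data segment */
    struct file_ent *ofile[NOFILE];
};

/* A child created by fork copies the pointers, so parent and child
   reference the same file_ent and advance one common offset. */
void fork_files(struct proc *parent, struct proc *child)
{
    for (int fd = 0; fd < NOFILE; fd++) {
        child->ofile[fd] = parent->ofile[fd];
        if (child->ofile[fd])
            child->ofile[fd]->refcnt++;
    }
}

/* A read or write advances the I/O pointer in the file_ent, not in
   the i-node, so independent opens keep separate offsets. */
long advance(struct file_ent *fp, long nbytes)
{
    fp->offset += nbytes;
    return fp->offset;
}
```

Placing the offset one level below the i-node table and one level
above the per-process table is exactly what lets both sharing
disciplines coexist.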
.PP
The main file system primitives are implemented as follows.
.UL \&open
converts a file system path name into an i-node
table entry.
A pointer to the i-node table entry is placed in a
newly created open file table entry.
A pointer to the file table entry is placed in the
system data segment for the process.
.UL \&create
first creates a new i-node entry,
writes the i-number into a directory, and
then builds the same structure as for an
.UL open .
.UL \&read
and
.UL write
just access the i-node entry as described above.
.UL \&seek
simply manipulates the I/O pointer.
No physical seeking is done.
.UL \&close
just frees the structures built by
.UL open
and
.UL create .
Reference counts are kept on the open file table entries and
the i-node table entries to free these structures after
the last reference goes away.
.UL \&unlink
simply decrements the count of the
number of directories pointing at the given i-node.
When the last reference to an i-node table entry
goes away,
if the i-node has no directories pointing to it,
then the file is removed and the i-node is freed.
This delayed removal of files prevents
problems arising from removing active files.
A file may be removed while still open.
The resulting unnamed file vanishes
when the file is closed.
This is a method of obtaining temporary files.
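.PP
The delayed-removal rule above, reclaiming a file only when both its
directory links and its open references are gone, can be sketched as
follows. The structure and names are illustrative assumptions, not
the kernel's own.

```c
/* Hypothetical sketch of delayed removal: an i-node is reclaimed
   only when its link count and in-core reference count are both
   zero. All names are illustrative. */
#include <assert.h>

struct ino_ent {
    int nlink;    /* directories pointing at the i-node */
    int refcnt;   /* open references in the i-node table */
    int freed;    /* set when the file's blocks are released */
};

static void maybe_free(struct ino_ent *ip)
{
    if (ip->refcnt == 0 && ip->nlink == 0)
        ip->freed = 1;           /* remove file, free i-node */
}

void iunlink(struct ino_ent *ip) /* a directory entry is removed */
{
    ip->nlink--;
    maybe_free(ip);
}

void iclose(struct ino_ent *ip)  /* last close of an open file */
{
    ip->refcnt--;
    maybe_free(ip);
}
```

The unlink-while-open idiom for temporary files falls directly out
of this rule: the unlinked file keeps its blocks until the holder's
last close drops the reference count to zero.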
.PP
There is a type of unnamed
.UC FIFO
file called a
.UL pipe .
Implementation of
.UL pipe s
consists of implied
.UL seek s
before each
.UL read
or
.UL write
in order to implement
first-in-first-out.
There are also checks and synchronization
to prevent the
writer from grossly outproducing the
reader and to prevent the reader from
overtaking the writer.
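.PP
The pipe discipline just described can be reduced to a bounded
buffer with two implied seek pointers. The sketch below is an
illustration under assumed names; a real kernel would block the
caller with synchronization primitives rather than return short.

```c
/* Hypothetical sketch of a pipe: the writer's implied seek is to
   the write pointer, the reader's to the read pointer; the writer
   may not run more than PIPSIZ ahead, and the reader may not pass
   the writer. Names and sizes are illustrative. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PIPSIZ 4096

struct pipebuf {
    char buf[PIPSIZ];
    long rp, wp;              /* read and write I/O pointers */
};

/* Write as much as fits; returns bytes accepted (a real kernel
   would sleep here instead of returning short). */
int pipe_write(struct pipebuf *p, const char *src, int n)
{
    int done = 0;
    while (done < n && p->wp - p->rp < PIPSIZ)  /* don't outproduce reader */
        p->buf[p->wp++ % PIPSIZ] = src[done++];
    return done;
}

/* Read up to n bytes, never past the write pointer. */
int pipe_read(struct pipebuf *p, char *dst, int n)
{
    int done = 0;
    while (done < n && p->rp < p->wp)           /* don't overtake writer */
        dst[done++] = p->buf[p->rp++ % PIPSIZ];
    return done;
}
```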
.NH 2
Mounted file systems
.PP
The file system of a
.UX
system
starts with some designated block device
formatted as described above to contain
a hierarchy.
The root of this structure is the root of
the
.UX
file system.
A second formatted block device may be
mounted
at any leaf of
the current hierarchy.
This logically extends the current hierarchy.
The implementation of
mounting
is trivial.
A mount table is maintained containing
pairs of designated leaf i-nodes and
block devices.
When converting a path name into an i-node,
a check is made to see if the new i-node is a
designated leaf.
If it is,
the i-node of the root
of the block device replaces it.
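.PP
The mount-table check made during path name conversion can be
sketched as a single substitution step. The names below are
illustrative assumptions, not the kernel's declarations.

```c
/* Hypothetical sketch of crossing a mount point during lookup:
   if a component resolves to a designated leaf i-node, substitute
   the root i-node of the mounted device. */
#include <assert.h>

struct mntent {
    int leaf_inum, leaf_dev;  /* the mount point */
    int root_inum, root_dev;  /* root of the mounted device */
};

/* Replace (inum, dev) if it names a mount point; return 1 if the
   lookup crossed onto the mounted device, 0 otherwise. */
int mount_cross(struct mntent *mtab, int n, int *inum, int *dev)
{
    for (int i = 0; i < n; i++)
        if (mtab[i].leaf_inum == *inum && mtab[i].leaf_dev == *dev) {
            *inum = mtab[i].root_inum;
            *dev  = mtab[i].root_dev;
            return 1;
        }
    return 0;
}
```

Because the substitution happens inside the ordinary lookup loop,
nothing else in the file system needs to know that a path crossed
onto another device.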
.PP
Allocation of space for a file is taken
from the free pool on the device on which the
file lives.
Thus a file system consisting of many
mounted devices does not have a common pool of
free secondary storage space.
This separation of space on different
devices is necessary to allow easy
unmounting
of a device.
.NH 2
Other system functions
.PP
There are some other things that the system
does for the user\(ema
little accounting,
a little tracing/debugging,
and a little access protection.
Most of these things are not very
well developed
because our use of the system in computing science research
does not need them.
There are some features that are missed in some
applications, for example, better inter-process communication.
.PP
The
.UX
kernel is an I/O multiplexer more than
a complete operating system.
This is as it should be.
Because of this outlook,
many features are
found in most
other operating systems that are missing from the
.UX
kernel.
For example,
the
.UX
kernel does not support
file access methods,
file disposition,
file formats,
file maximum size,
spooling,
command language,
logical records,
physical records,
assignment of logical file names,
logical file names,
more than one character set,
an operator's console,
an operator,
log-in,
or log-out.
Many of these things are symptoms rather than features.
Many of these things are implemented
in user software
using the kernel as a tool.
A good example of this is the command language.
.[
bourne shell 1978 bstj
%Q This issue
.]
Each user may have his own command language.
Maintenance of such code is as easy as
maintaining user code.
The idea of implementing ``system'' code with general
user primitives
comes directly from
.UC MULTICS .
.[
organick multics 1972
.]
.LP
.[
$LIST$
.]