Bell 32V development

author Tom London <tbl@research.uucp>

Wed, 29 Nov 1978 04:37:37 +0000 (23:37 -0500)

committer Tom London <tbl@research.uucp>

Wed, 29 Nov 1978 04:37:37 +0000 (23:37 -0500)
author Tom London <tbl@research.uucp>
Wed, 29 Nov 1978 04:37:37 +0000 (23:37 -0500)
committer Tom London <tbl@research.uucp>
Wed, 29 Nov 1978 04:37:37 +0000 (23:37 -0500)
diff --git a/usr/doc/implement b/usr/doc/implement

new file mode 100644 (file)

index 0000000..e7f128e
--- /dev/null
+++ b/usr/doc/implement
@@ -0,0 +1,1231 @@
+.de P1
+.DS
+..
+.de P2
+.DE
+..
+.de UL
+.lg 0
+.if n .ul
+\%\&\\$3\f3\\$1\fR\&\\$2
+.lg
+..
+.de UC
+\&\\$3\s-1\\$1\\s0\&\\$2
+..
+.de IT
+.lg 0
+.if n .ul
+\%\&\\$3\f2\\$1\fR\&\\$2
+.lg
+..
+.de SP
+.sp \\$1
+..
+.hw device
+.TL
+UNIX Implementation
+.AU "MH 2C-523" 2394
+K. Thompson
+.AI
+.MH
+.AB
+This paper describes in high-level terms the
+implementation of the resident
+.UX
+kernel.
+This discussion is broken into three parts.
+The first part describes
+how the
+.UX
+system views processes, users, and programs.
+The second part describes the I/O system.
+The last part describes the
+.UX
+file system.
+.AE
+.NH
+INTRODUCTION
+.PP
+The
+.UX
+kernel consists of about 10,000
+lines of C code and about 1,000 lines of assembly code.
+The assembly code can be further broken down into
+200 lines included for
+the sake of efficiency
+(they could have been written in C)
+and 800 lines to perform hardware
+functions not possible in C.
+.PP
+This code represents 5 to 10 percent of what has
+been lumped into the broad expression
+``the
+.UX
+operating system.''
+The kernel is the only
+.UX
+code that
+cannot be substituted by a user to his
+own liking.
+For this reason,
+the kernel should make as few real
+decisions as possible.
+This does not mean to allow the user
+a million options to do the same thing.
+Rather, it means to allow only one way to
+do one thing,
+but have that way be the least-common divisor
+of all the options that might have been provided.
+.PP
+What is or is not implemented in the kernel
+represents both a great responsibility and a great power.
+It is a soap-box platform on
+``the way things should be done.''
+Even so, if
+``the way'' is too radical,
+no one will follow it.
+Every important decision was weighed
+carefully.
+Throughout,
+simplicity has been substituted for efficiency.
+Complex algorithms are used only if
+their complexity can be localized.
+.NH
+PROCESS CONTROL
+.PP
+In the
+.UX
+system,
+a user executes programs in an
+environment called a user process.
+When a system function is required,
+the user process calls the system
+as a subroutine.
+At some point in this call,
+there is a distinct switch of environments.
+After this,
+the process is said to be a system process.
+In the normal definition of processes,
+the user and system processes are different
+phases of the same process
+(they never execute simultaneously).
+For protection,
+each system process has its own stack.
+.PP
+The user process may execute
+from a read-only text segment,
+which is shared by all processes
+executing the same code.
+There is no
+.IT functional
+benefit
+from shared-text segments.
+An
+.IT efficiency
+benefit comes from the fact
+that there is no need to swap read-only
+segments out because the original
+copy on secondary memory is still current.
+This is a great benefit to interactive
+programs that tend to be swapped while
+waiting for terminal input.
+Furthermore,
+if two processes are
+executing
+simultaneously
+from the same copy of a read-only segment,
+only one copy needs to reside in
+primary memory.
+This is a secondary effect,
+because
+simultaneous execution of a program
+is not common.
+It is ironic that this effect,
+which reduces the use of primary memory,
+only comes into play when there is
+an overabundance of primary memory,
+that is,
+when there is enough memory
+to keep waiting processes loaded.
+.PP
+All current read-only text segments in the
+system are maintained from the
+.IT "text table" .
+A text table entry holds the location of the
+text segment on secondary memory.
+If the segment is loaded,
+that table also holds the primary memory location
+and the count of the number of processes
+sharing this entry.
+When this count is reduced to zero,
+the entry is freed along with any
+primary and secondary memory holding the segment.
+When a process first executes a shared-text segment,
+a text table entry is allocated and the
+segment is loaded onto secondary memory.
+If a second process executes a text segment
+that is already allocated,
+the entry reference count is simply incremented.
+.PP
+A user process has some strictly private
+read-write data
+contained in its
+data segment.
+As far as possible,
+the system does not use the user's
+data segment to hold system data.
+In particular,
+there are no I/O buffers in the
+user address space.
+.PP
+The user data segment has two growing boundaries.
+One, increased automatically by the system
+as a result of memory faults,
+is used for a stack.
+The second boundary is only grown (or shrunk) by
+explicit requests.
+The contents of newly allocated primary memory
+is initialized to zero.
+.PP
+Also associated and swapped with
+a process is a small fixed-size
+system data segment.
+This segment contains all
+the data about the process
+that the system needs only when the
+process is active.
+Examples of the kind of data contained
+in the system data segment are:
+saved central processor registers,
+open file descriptors,
+accounting information,
+scratch data area,
+and the stack for the system phase
+of the process.
+The system data segment is not
+addressable from the user process
+and is therefore protected.
+.PP
+Last,
+there is a process table with
+one entry per process.
+This entry contains all the data
+needed by the system when the process
+is
+.IT not
+active.
+Examples are
+the process's name,
+the location of the other segments,
+and scheduling information.
+The process table entry is allocated
+when the process is created, and freed
+when the process terminates.
+This process entry is always directly
+addressable by the kernel.
+.PP
+Figure 1 shows the relationships
+between the various process control
+data.
+In a sense,
+the process table is the
+definition of all processes,
+because
+all the data associated with a process
+may be accessed
+starting from the process table entry.
+.KS
+.sp 2.44i
+.sp 2v
+.ce
+Fig. 1\(emProcess control data structure.
+.KE
+.NH 2
+Process creation and program execution
+.PP
+Processes are created by the system primitive
+.UL fork .
+The newly created process (child) is a copy of the original process (parent).
+There is no detectable sharing of primary memory between the two processes.
+(Of course,
+if the parent process was executing from a read-only
+text segment,
+the child will share the text segment.)
+Copies of all writable data segments
+are made for the child process.
+Files that were open before the
+.UL fork
+are
+truly shared after the
+.UL fork .
+The processes are informed as to their part in the
+relationship to
+allow them to select their own
+(usually non-identical)
+destiny.
+The parent may
+.UL wait
+for the termination of
+any of its children.
+.PP
+A process may
+.UL exec
+a file.
+This consists of exchanging the current text and data
+segments of the process for new text and data
+segments specified in the file.
+The old segments are lost.
+Doing an
+.UL exec
+does
+.IT not
+change processes;
+the process that did the
+.UL exec
+persists,
+but
+after the
+.UL exec
+it is executing a different program.
+Files that were open
+before the
+.UL exec
+remain open after the
+.UL exec .
+.PP
+If a program,
+say the first pass of a compiler,
+wishes to overlay itself with another program,
+say the second pass,
+then it simply
+.UL exec s
+the second program.
+This is analogous
+to a ``goto.''
+If a program wishes to regain control
+after
+.UL exec ing
+a second program,
+it should
+.UL fork
+a child process,
+have the child
+.UL exec
+the second program, and
+have the parent
+.UL wait
+for the child.
+This is analogous to a ``call.''
+Breaking up the call into a binding followed by
+a transfer is similar to the subroutine linkage in
+SL-5.
+.[
+griswold hanson sl5 overview
+.]
+.NH 2
+Swapping
+.PP
+The major data associated with a process
+(the user data segment,
+the system data segment, and
+the text segment)
+are swapped to and from secondary
+memory, as needed.
+The user data segment and the system data segment
+are kept in contiguous primary memory to reduce
+swapping latency.
+(When low-latency devices, such as bubbles,
+.UC CCD s,
+or scatter/gather devices,
+are used,
+this decision will have to be reconsidered.)
+Allocation of both primary
+and secondary memory is performed
+by the same simple first-fit algorithm.
+When a process grows,
+a new piece of primary memory is allocated.
+The contents of the old memory is copied to the new memory.
+The old memory is freed
+and the tables are updated.
+If there is not enough primary memory,
+secondary memory is allocated instead.
+The process is swapped out onto the
+secondary memory,
+ready to be swapped in with
+its new size.
+.PP
+One separate process in the kernel,
+the swapping process,
+simply swaps the other
+processes in and out of primary memory.
+It examines the
+process table looking for a process
+that is swapped out and is
+ready to run.
+It allocates primary memory for that
+process and
+reads its segments into
+primary memory, where that process competes for the
+central processor with other loaded processes.
+If no primary memory is available,
+the swapping process makes memory available
+by examining the process table for processes
+that can be swapped out.
+It selects a process to swap out,
+writes it to secondary memory,
+frees the primary memory,
+and then goes back to look for a process
+to swap in.
+.PP
+Thus there are two specific algorithms
+to the swapping process.
+Which of the possibly many processes that
+are swapped out is to be swapped in?
+This is decided by secondary storage residence
+time.
+The one with the longest time out is swapped in first.
+There is a slight penalty for larger processes.
+Which of the possibly many processes that
+are loaded is to be swapped out?
+Processes that are waiting for slow events
+(i.e., not currently running or waiting for
+disk I/O)
+are picked first,
+by age in primary memory,
+again with size penalties.
+The other processes are examined
+by the same age algorithm,
+but are not taken out unless they are
+at least of some age.
+This adds
+hysteresis to the swapping and
+prevents total thrashing.
+.PP
+These swapping algorithms are the
+most suspect in the system.
+With limited primary memory,
+these algorithms cause total swapping.
+This is not bad in itself, because
+the swapping does not impact the
+execution of the resident processes.
+However, if the swapping device must
+also be used for file storage,
+the swapping traffic severely
+impacts the file system traffic.
+It is exactly these small systems
+that tend to double usage of limited disk
+resources.
+.NH 2
+Synchronization and scheduling
+.PP
+Process synchronization is accomplished by having processes
+wait for events.
+Events are represented by arbitrary integers.
+By convention,
+events are chosen to be addresses of
+tables associated with those events.
+For example, a process that is waiting for
+any of its children to terminate will wait
+for an event that is the address of
+its own process table entry.
+When a process terminates,
+it signals the event represented by
+its parent's process table entry.
+Signaling an event on which no process
+is waiting has no effect.
+Similarly,
+signaling an event on which many processes
+are waiting will wake all of them up.
+This differs considerably from
+Dijkstra's P and V
+synchronization operations,
+.[
+dijkstra sequential processes 1968
+.]
+in that
+no memory is associated with events.
+Thus there need be no allocation of events
+prior to their use.
+Events exist simply by being used.
+.PP
+On the negative side,
+because there is no memory associated with events,
+no notion of ``how much''
+can be signaled via the event mechanism.
+For example,
+processes that want memory might
+wait on an event associated with
+memory allocation.
+When any amount of memory becomes available,
+the event would be signaled.
+All the competing processes would then wake
+up to fight over the new memory.
+(In reality,
+the swapping process is the only process
+that waits for primary memory to become available.)
+.PP
+If an event occurs
+between the time a process decides
+to wait for that event and the
+time that process enters the wait state,
+then
+the process will wait on an event that has
+already happened (and may never happen again).
+This race condition happens because there is no memory associated with
+the event to indicate that the event has occurred;
+the only action of an event is to change a set of processes
+from wait state to run state.
+This problem is relieved largely
+by the fact that process switching can
+only occur in the kernel by explicit calls
+to the event-wait mechanism.
+If the event in question is signaled by another
+process,
+then there is no problem.
+But if the event is signaled by a hardware
+interrupt,
+then special care must be taken.
+These synchronization races pose the biggest
+problem when
+.UX
+is adapted to multiple-processor configurations.
+.[
+hawley meyer multiprocessing unix
+.]
+.PP
+The event-wait code in the kernel
+is like a co-routine linkage.
+At any time,
+all but one of the processes has called event-wait.
+The remaining process is the one currently executing.
+When it calls event-wait,
+a process whose event has been signaled
+is selected and that process
+returns from its call to event-wait.
+.PP
+Which of the runable processes is to run next?
+Associated with each process is a priority.
+The priority of a system process is assigned by the code
+issuing the wait on an event.
+This is roughly equivalent to the response
+that one would expect on such an event.
+Disk events have high priority,
+teletype events are low,
+and time-of-day events are very low.
+(From observation,
+the difference in system process priorities
+has little or no performance impact.)
+All user-process priorities are lower than the
+lowest system priority.
+User-process priorities are assigned
+by an algorithm based on the
+recent ratio of the amount of compute time to real time consumed
+by the process.
+A process that has used a lot of
+compute time in the last real-time
+unit is assigned a low user priority.
+Because interactive processes are characterized
+by low ratios of compute to real time,
+interactive response is maintained without any
+special arrangements.
+.PP
+The scheduling algorithm simply picks
+the process with the highest priority,
+thus
+picking all system processes first and
+user processes second.
+The compute-to-real-time ratio is updated
+every second.
+Thus,
+all other things being equal,
+looping user processes will be
+scheduled round-robin with a
+1-second quantum.
+A high-priority process waking up will
+preempt a running, low-priority process.
+The scheduling algorithm has a very desirable
+negative feedback character.
+If a process uses its high priority
+to hog the computer,
+its priority will drop.
+At the same time, if a low-priority
+process is ignored for a long time,
+its priority will rise.
+.NH
+I/O SYSTEM
+.PP
+The I/O system
+is broken into two completely separate systems:
+the block I/O system and the character I/O system.
+In retrospect,
+the names should have been ``structured I/O''
+and ``unstructured I/O,'' respectively;
+while the term ``block I/O'' has some meaning,
+``character I/O'' is a complete misnomer.
+.PP
+Devices are characterized by a major device number,
+a minor device number, and
+a class (block or character).
+For each class,
+there is an array of entry points into the device drivers.
+The major device number is used to index the array
+when calling the code for a particular device driver.
+The minor device number is passed to the
+device driver as an argument.
+The minor number has no significance other
+than that attributed to it by the driver.
+Usually,
+the driver uses the minor number to access
+one of several identical physical devices.
+.PP
+The use of the array of entry points
+(configuration table)
+as the only connection between the
+system code and the device drivers is
+very important.
+Early versions of the system had a much
+less formal connection with the drivers,
+so that it was extremely hard to handcraft
+differently configured systems.
+Now it is possible to create new
+device drivers in an average of a few hours.
+The configuration table in most cases
+is created automatically by a program
+that reads the system's parts list.
+.NH 2
+Block I/O system
+.PP
+The model block I/O device consists
+of randomly addressed, secondary
+memory blocks of 512 bytes each.
+The blocks are uniformly addressed
+0, 1, .\|.\|. up to the size of the device.
+The block device driver has the job of
+emulating this model on a
+physical device.
+.PP
+The block I/O devices are accessed
+through a layer of buffering software.
+The system maintains a list of buffers
+(typically between 10 and 70)
+each assigned a device name and
+a device address.
+This buffer pool constitutes a data cache
+for the block devices.
+On a read request,
+the cache is searched for the desired block.
+If the block is found,
+the data are made available to the
+requester without any physical I/O.
+If the block is not in the cache,
+the least recently used block in the cache is renamed,
+the correct device driver is called to
+fill up the renamed buffer, and then the
+data are made available.
+Write requests are handled in an analogous manner.
+The correct buffer is found
+and relabeled if necessary.
+The write is performed simply by marking
+the buffer as ``dirty.''
+The physical I/O is then deferred until
+the buffer is renamed.
+.PP
+The benefits in reduction of physical I/O
+of this scheme are substantial,
+especially considering the file system implementation.
+There are,
+however,
+some drawbacks.
+The asynchronous nature of the
+algorithm makes error reporting
+and meaningful user error handling
+almost impossible.
+The cavalier approach to I/O error
+handling in the
+.UX
+system is partly due to the asynchronous
+nature of the block I/O system.
+A second problem is in the delayed writes.
+If the system stops unexpectedly,
+it is almost certain that there is a
+lot of logically complete,
+but physically incomplete,
+I/O in the buffers.
+There is a system primitive to
+flush all outstanding I/O activity
+from the buffers.
+Periodic use of this primitive helps,
+but does not solve, the problem.
+Finally,
+the associativity in the buffers
+can alter the physical I/O sequence
+from that of the logical I/O sequence.
+This means that there are times
+when data structures on disk are inconsistent,
+even though the software is careful
+to perform I/O in the correct order.
+On non-random devices,
+notably magnetic tape,
+the inversions of writes can be disastrous.
+The problem with magnetic tapes is ``cured'' by
+allowing only one outstanding write request
+per drive.
+.NH 2
+Character I/O system
+.PP
+The character I/O system consists of all
+devices that do not fall into the block I/O model.
+This includes the ``classical'' character devices
+such as communications lines, paper tape, and
+line printers.
+It also includes magnetic tape and disks when
+they are not used in a stereotyped way,
+for example, 80-byte physical records on tape
+and track-at-a-time disk copies.
+In short,
+the character I/O interface
+means ``everything other than block.''
+I/O requests from the user are sent to the
+device driver essentially unaltered.
+The implementation of these requests is, of course,
+up to the device driver.
+There are guidelines and conventions
+to help the implementation of
+certain types of device drivers.
+.NH 3
+Disk drivers
+.PP
+Disk drivers are implemented
+with a queue of transaction records.
+Each record holds a read/write flag,
+a primary memory address,
+a secondary memory address, and
+a transfer byte count.
+Swapping is accomplished by passing
+such a record to the swapping device driver.
+The block I/O interface is implemented by
+passing such records with requests to
+fill and empty system buffers.
+The character I/O interface to the disk
+drivers create a transaction record that
+points directly into the user area.
+The routine that creates this record also insures
+that the user is not swapped during this
+I/O transaction.
+Thus by implementing the general disk driver,
+it is possible to use the disk
+as a block device,
+a character device, and a swap device.
+The only really disk-specific code in normal
+disk drivers is the pre-sort of transactions to
+minimize latency for a particular device, and
+the actual issuing of the I/O request.
+.NH 3
+Character lists
+.PP
+Real character-oriented devices may
+be implemented using the common
+code to handle character lists.
+A character list is a queue of characters.
+One routine puts a character on a queue.
+Another gets a character from a queue.
+It is also possible to ask how many
+characters are currently on a queue.
+Storage for all queues in the system comes
+from a single common pool.
+Putting a character on a queue will allocate
+space from the common pool and link the
+character onto the data structure defining the queue.
+Getting a character from a queue returns
+the corresponding space to the pool.
+.PP
+A typical character-output device
+(paper tape punch, for example)
+is implemented by passing characters
+from the user onto a character queue until
+some maximum number of characters is on the queue.
+The I/O is prodded to start as
+soon as there is anything on the queue
+and, once started,
+it is sustained by hardware completion interrupts.
+Each time there is a completion interrupt,
+the driver gets the next character from the queue
+and sends it to the hardware.
+The number of characters on the queue is checked and,
+as the count falls through some intermediate level,
+an event (the queue address) is signaled.
+The process that is passing characters from
+the user to the queue can be waiting on the event, and
+refill the queue to its maximum
+when the event occurs.
+.PP
+A typical character input device
+(for example, a paper tape reader)
+is handled in a very similar manner.
+.PP
+Another class of character devices is the terminals.
+A terminal is represented by three
+character queues.
+There are two input queues (raw and canonical)
+and an output queue.
+Characters going to the output of a terminal
+are handled by common code exactly as described
+above.
+The main difference is that there is also code
+to interpret the output stream as
+.UC  ASCII
+characters and to perform some translations,
+e.g., escapes for deficient terminals.
+Another common aspect of terminals is code
+to insert real-time delay after certain control characters.
+.PP
+Input on terminals is a little different.
+Characters are collected from the terminal and
+placed on a raw input queue.
+Some device-dependent code conversion and
+escape interpretation is handled here.
+When a line is complete in the raw queue,
+an event is signaled.
+The code catching this signal then copies a
+line from the raw queue to a canonical queue
+performing the character erase and line kill editing.
+User read requests on terminals can be
+directed at either the raw or canonical queues.
+.NH 3
+Other character devices
+.PP
+Finally,
+there are devices that fit no general category.
+These devices are set up as character I/O drivers.
+An example is a driver that reads and writes
+unmapped primary memory as an I/O device.
+Some devices are too
+fast to be treated a character at time,
+but do not fit the disk I/O mold.
+Examples are fast communications lines and
+fast line printers.
+These devices either have their own buffers
+or ``borrow'' block I/O buffers for a while and
+then give them back.
+.NH
+THE FILE SYSTEM
+.PP
+In the
+.UX
+system,
+a file is a (one-dimensional) array of bytes.
+No other structure of files is implied by the
+system.
+Files are attached anywhere
+(and possibly multiply)
+onto a hierarchy of directories.
+Directories are simply files that
+users cannot write.
+For a further discussion
+of the external view of files and directories,
+see Ref.\0
+.[
+ritchie thompson unix bstj 1978
+%Q This issue
+.].
+.PP
+The
+.UX
+file system is a disk data structure
+accessed completely through
+the block I/O system.
+As stated before,
+the canonical view of a ``disk'' is
+a randomly addressable array of
+512-byte blocks.
+A file system breaks the disk into
+four self-identifying regions.
+The first block (address 0)
+is unused by the file system.
+It is left aside for booting procedures.
+The second block (address 1)
+contains the so-called ``super-block.''
+This block,
+among other things,
+contains the size of the disk and
+the boundaries of the other regions.
+Next comes the i-list,
+a list of file definitions.
+Each file definition is
+a 64-byte structure, called an i-node.
+The offset of a particular i-node
+within the i-list is called its i-number.
+The combination of device name
+(major and minor numbers) and i-number
+serves to uniquely name a particular file.
+After the i-list,
+and to the end of the disk,
+come free storage blocks that
+are available for the contents of files.
+.PP
+The free space on a disk is maintained
+by a linked list of available disk blocks.
+Every block in this chain contains a disk address
+of the next block in the chain.
+The remaining space contains the address of up to
+50 disk blocks that are also free.
+Thus with one I/O operation,
+the system obtains 50 free blocks and a
+pointer where to find more.
+The disk allocation algorithms are
+very straightforward.
+Since all allocation is in fixed-size
+blocks and there is strict accounting of
+space,
+there is no need to compact or garbage collect.
+However,
+as disk space becomes dispersed,
+latency gradually increases.
+Some installations choose to occasionally compact
+disk space to reduce latency.
+.PP
+An i-node contains 13 disk addresses.
+The first 10 of these addresses point directly at
+the first 10 blocks of a file.
+If a file is larger than 10 blocks (5,120 bytes),
+then the eleventh address points at a block
+that contains the addresses of the next 128 blocks of the file.
+If the file is still larger than this
+(70,656 bytes),
+then the twelfth block points at up to 128 blocks,
+each pointing to 128 blocks of the file.
+Files yet larger
+(8,459,264 bytes)
+use the thirteenth address for a ``triple indirect'' address.
+The algorithm ends here with the maximum file size
+of 1,082,201,087 bytes.
+.PP
+A logical directory hierarchy is added
+to this flat physical structure simply
+by adding a new type of file, the directory.
+A directory is accessed exactly as an ordinary file.
+It contains 16-byte entries consisting of
+a 14-byte name and an i-number.
+The root of the hierarchy is at a known i-number
+(\fIviz.,\fR 2).
+The file system structure allows an arbitrary, directed graph
+of directories with regular files linked in
+at arbitrary places in this graph.
+In fact,
+very early
+.UX
+systems used such a structure.
+Administration of such a structure became so
+chaotic that later systems were restricted
+to a directory tree.
+Even now,
+with regular files linked multiply
+into arbitrary places in the tree,
+accounting for space has become a problem.
+It may become necessary to restrict the entire
+structure to a tree,
+and allow a new form of linking that
+is subservient to the tree structure.
+.PP
+The file system allows
+easy creation,
+easy removal,
+easy random accessing,
+and very easy space allocation.
+With most physical addresses confined
+to a small contiguous section of disk,
+it is also easy to dump, restore, and
+check the consistency of the file system.
+Large files suffer from indirect addressing,
+but the cache prevents most of the implied physical I/O
+without adding much execution.
+The space overhead properties of this scheme are quite good.
+For example,
+on one particular file system,
+there are 25,000 files containing 130M bytes of data-file content.
+The overhead (i-node, indirect blocks, and last block breakage)
+is about 11.5M bytes.
+The directory structure to support these files
+has about 1,500 directories containing 0.6M bytes of directory content
+and about 0.5M bytes of overhead in accessing the directories.
+Added up any way,
+this comes out to less than a 10 percent overhead for actual
+stored data.
+Most systems have this much overhead in
+padded trailing blanks alone.
+.NH 2
+File system implementation
+.PP
+Because the i-node defines a file,
+the implementation of the file system centers
+around access to the i-node.
+The system maintains a table of all active
+i-nodes.
+As a new file is accessed,
+the system locates the corresponding i-node,
+allocates an i-node table entry, and reads
+the i-node into primary memory.
+As in the buffer cache,
+the table entry is considered to be the current
+version of the i-node.
+Modifications to the i-node are made to
+the table entry.
+When the last access to the i-node goes
+away,
+the table entry is copied back to the
+secondary store i-list and the table entry is freed.
+.PP
+All I/O operations on files are carried out
+with the aid of the corresponding i-node table entry.
+The accessing of a file is a straightforward
+implementation of the algorithms mentioned previously.
+The user is not aware of i-nodes and i-numbers.
+References to the file system are made in terms of
+path names of the directory tree.
+Converting a path name into an i-node table entry
+is also straightforward.
+Starting at some known i-node
+(the root or the current directory of some process),
+the next component of the path name is
+searched by reading the directory.
+This gives an i-number and an implied device
+(that of the directory).
+Thus the next i-node table entry can be accessed.
+If that was the last component of the path name,
+then this i-node is the result.
+If not,
+this i-node is the directory needed to look up
+the next component of the path name, and the
+algorithm is repeated.
+.PP
+The user process accesses the file system with
+certain primitives.
+The most common of these are
+.UL open ,
+.UL create ,
+.UL read ,
+.UL write ,
+.UL seek ,
+and
+.UL close .
+The data structures maintained are shown in Fig. 2.
+.KS
+.sp 22P
+.ce
+Fig. 2\(emFile system data structure.
+.KE
+In the system data segment associated with a user,
+there is room for some (usually between 10 and 50) open files.
+This open file table consists of pointers that can be used to access
+corresponding i-node table entries.
+Associated with each of these open files is
+a current I/O pointer.
+This is a byte offset of
+the next read/write operation on the file.
+The system treats each read/write request
+as random with an implied seek to the
+I/O pointer.
+The user usually thinks of the file as
+sequential with the I/O pointer
+automatically counting the number of bytes
+that have been read/written from the file.
+The user may,
+of course,
+perform random I/O by setting the I/O pointer
+before reads/writes.
+.PP
+With file sharing,
+it is necessary to allow related
+processes to share a common I/O pointer
+and yet have separate I/O pointers
+for independent processes
+that access the same file.
+With these two conditions,
+the I/O pointer cannot reside
+in the i-node table nor can
+it reside in the list of
+open files for the process.
+A new table
+(the open file table)
+was invented for the sole purpose
+of holding the I/O pointer.
+Processes that share the same open
+file
+(the result of
+.UL fork s)
+share a common open file table entry.
+A separate open of the same file will
+only share the i-node table entry,
+but will have distinct open file table entries.
+.PP
+The main file system primitives are implemented as follows.
+.UL \&open
+converts a file system path name into an i-node
+table entry.
+A pointer to the i-node table entry is placed in a
+newly created open file table entry.
+A pointer to the file table entry is placed in the
+system data segment for the process.
+.UL \&create
+first creates a new i-node entry,
+writes the i-number into a directory, and
+then builds the same structure as for an
+.UL open .
+.UL \&read
+and
+.UL write
+just access the i-node entry as described above.
+.UL \&seek
+simply manipulates the I/O pointer.
+No physical seeking is done.
+.UL \&close
+just frees the structures built by
+.UL open
+and
+.UL create .
+Reference counts are kept on the open file table entries and
+the i-node table entries to free these structures after
+the last reference goes away.
+.UL \&unlink
+simply decrements the count of the
+number of directories pointing at the given i-node.
+When the last reference to an i-node table entry
+goes away,
+if the i-node has no directories pointing to it,
+then the file is removed and the i-node is freed.
+This delayed removal of files prevents
+problems arising from removing active files.
+A file may be removed while still open.
+The resulting unnamed file vanishes
+when the file is closed.
+This is a method of obtaining temporary files.
+.PP
+There is a type of unnamed
+.UC  FIFO
+file called a
+.UL pipe.
+Implementation of
+.UL pipe s
+consists of implied
+.UL seek s
+before each
+.UL read
+or
+.UL write
+in order to implement
+first-in-first-out.
+There are also checks and synchronization
+to prevent the
+writer from grossly outproducing the
+reader and to prevent the reader from
+overtaking the writer.
+.NH 2
+Mounted file systems
+.PP
+The file system of a
+.UX
+system
+starts with some designated block device
+formatted as described above to contain
+a hierarchy.
+The root of this structure is the root of
+the
+.UX
+file system.
+A second formatted block device may be
+mounted
+at any leaf of
+the current hierarchy.
+This logically extends the current hierarchy.
+The implementation of
+mounting
+is trivial.
+A mount table is maintained containing
+pairs of designated leaf i-nodes and
+block devices.
+When converting a path name into an i-node,
+a check is made to see if the new i-node is a
+designated leaf.
+If it is,
+the i-node of the root
+of the block device replaces it.
+.PP
+Allocation of space for a file is taken
+from the free pool on the device on which the
+file lives.
+Thus a file system consisting of many
+mounted devices does not have a common pool of
+free secondary storage space.
+This separation of space on different
+devices is necessary to allow easy
+unmounting
+of a device.
+.NH 2
+Other system functions
+.PP
+There are some other things that the system
+does for the user\-a
+little accounting,
+a little tracing/debugging,
+and a little access protection.
+Most of these things are not very
+well developed
+because our use of the system in computing science research
+does not need them.
+There are some features that are missed in some
+applications, for example, better inter-process communication.
+.PP
+The
+.UX
+kernel is an I/O multiplexer more than
+a complete operating system.
+This is as it should be.
+Because of this outlook,
+many features are
+found in most
+other operating systems that are missing from the
+.UX
+kernel.
+For example,
+the
+.UX
+kernel does not support
+file access methods,
+file disposition,
+file formats,
+file maximum size,
+spooling,
+command language,
+logical records,
+physical records,
+assignment of logical file names,
+logical file names,
+more than one character set,
+an operator's console,
+an operator,
+log-in,
+or log-out.
+Many of these things are symptoms rather than features.
+Many of these things are implemented
+in user software
+using the kernel as a tool.
+A good example of this is the command language.
+.[
+bourne shell 1978 bstj
+%Q This issue
+.]
+Each user may have his own command language.
+Maintenance of such code is as easy as
+maintaining user code.
+The idea of implementing ``system'' code with general
+user primitives
+comes directly from
+.UC  MULTICS .
+.[
+organick multics 1972
+.]
+.LP
+.[
+$LIST$
+.]
author	Tom London <tbl@research.uucp>
	Wed, 29 Nov 1978 04:37:37 +0000 (23:37 -0500)
committer	Tom London <tbl@research.uucp>
	Wed, 29 Nov 1978 04:37:37 +0000 (23:37 -0500)