git.subgeniuskitty.com - unix-history/.git/blame_incremental

... / ...

Commit	Line	Data
	1	.de P1
	2	.DS
	3	..
	4	.de P2
	5	.DE
	6	..
	7	.de UL
	8	.lg 0
	9	.if n .ul
	10	\%\&\\$3\f3\\$1\fR\&\\$2
	11	.lg
	12	..
	13	.de UC
	14	\&\\$3\s-1\\$1\\s0\&\\$2
	15	..
	16	.de IT
	17	.lg 0
	18	.if n .ul
	19	\%\&\\$3\f2\\$1\fR\&\\$2
	20	.lg
	21	..
	22	.de SP
	23	.sp \\$1
	24	..
	25	.hw device
	26	.TL
	27	UNIX Implementation
	28	.AU "MH 2C-523" 2394
	29	K. Thompson
	30	.AI
	31	.MH
	32	.AB
	33	This paper describes in high-level terms the
	34	implementation of the resident
	35	.UX
	36	kernel.
	37	This discussion is broken into three parts.
	38	The first part describes
	39	how the
	40	.UX
	41	system views processes, users, and programs.
	42	The second part describes the I/O system.
	43	The last part describes the
	44	.UX
	45	file system.
	46	.AE
	47	.NH
	48	INTRODUCTION
	49	.PP
	50	The
	51	.UX
	52	kernel consists of about 10,000
	53	lines of C code and about 1,000 lines of assembly code.
	54	The assembly code can be further broken down into
	55	200 lines included for
	56	the sake of efficiency
	57	(they could have been written in C)
	58	and 800 lines to perform hardware
	59	functions not possible in C.
	60	.PP
	61	This code represents 5 to 10 percent of what has
	62	been lumped into the broad expression
	63	``the
	64	.UX
	65	operating system.''
	66	The kernel is the only
	67	.UX
	68	code that
	69	cannot be substituted by a user to his
	70	own liking.
	71	For this reason,
	72	the kernel should make as few real
	73	decisions as possible.
	74	This does not mean to allow the user
	75	a million options to do the same thing.
	76	Rather, it means to allow only one way to
	77	do one thing,
	78	but have that way be the least-common divisor
	79	of all the options that might have been provided.
	80	.PP
	81	What is or is not implemented in the kernel
	82	represents both a great responsibility and a great power.
	83	It is a soap-box platform on
	84	``the way things should be done.''
	85	Even so, if
	86	``the way'' is too radical,
	87	no one will follow it.
	88	Every important decision was weighed
	89	carefully.
	90	Throughout,
	91	simplicity has been substituted for efficiency.
	92	Complex algorithms are used only if
	93	their complexity can be localized.
	94	.NH
	95	PROCESS CONTROL
	96	.PP
	97	In the
	98	.UX
	99	system,
	100	a user executes programs in an
	101	environment called a user process.
	102	When a system function is required,
	103	the user process calls the system
	104	as a subroutine.
	105	At some point in this call,
	106	there is a distinct switch of environments.
	107	After this,
	108	the process is said to be a system process.
	109	In the normal definition of processes,
	110	the user and system processes are different
	111	phases of the same process
	112	(they never execute simultaneously).
	113	For protection,
	114	each system process has its own stack.
	115	.PP
	116	The user process may execute
	117	from a read-only text segment,
	118	which is shared by all processes
	119	executing the same code.
	120	There is no
	121	.IT functional
	122	benefit
	123	from shared-text segments.
	124	An
	125	.IT efficiency
	126	benefit comes from the fact
	127	that there is no need to swap read-only
	128	segments out because the original
	129	copy on secondary memory is still current.
	130	This is a great benefit to interactive
	131	programs that tend to be swapped while
	132	waiting for terminal input.
	133	Furthermore,
	134	if two processes are
	135	executing
	136	simultaneously
	137	from the same copy of a read-only segment,
	138	only one copy needs to reside in
	139	primary memory.
	140	This is a secondary effect,
	141	because
	142	simultaneous execution of a program
	143	is not common.
	144	It is ironic that this effect,
	145	which reduces the use of primary memory,
	146	only comes into play when there is
	147	an overabundance of primary memory,
	148	that is,
	149	when there is enough memory
	150	to keep waiting processes loaded.
	151	.PP
	152	All current read-only text segments in the
	153	system are maintained from the
	154	.IT "text table" .
	155	A text table entry holds the location of the
	156	text segment on secondary memory.
	157	If the segment is loaded,
	158	that table also holds the primary memory location
	159	and the count of the number of processes
	160	sharing this entry.
	161	When this count is reduced to zero,
	162	the entry is freed along with any
	163	primary and secondary memory holding the segment.
	164	When a process first executes a shared-text segment,
	165	a text table entry is allocated and the
	166	segment is loaded onto secondary memory.
	167	If a second process executes a text segment
	168	that is already allocated,
	169	the entry reference count is simply incremented.
	170	.PP
	171	A user process has some strictly private
	172	read-write data
	173	contained in its
	174	data segment.
	175	As far as possible,
	176	the system does not use the user's
	177	data segment to hold system data.
	178	In particular,
	179	there are no I/O buffers in the
	180	user address space.
	181	.PP
	182	The user data segment has two growing boundaries.
	183	One, increased automatically by the system
	184	as a result of memory faults,
	185	is used for a stack.
	186	The second boundary is only grown (or shrunk) by
	187	explicit requests.
	188	The contents of newly allocated primary memory
	189	is initialized to zero.
	190	.PP
	191	Also associated and swapped with
	192	a process is a small fixed-size
	193	system data segment.
	194	This segment contains all
	195	the data about the process
	196	that the system needs only when the
	197	process is active.
	198	Examples of the kind of data contained
	199	in the system data segment are:
	200	saved central processor registers,
	201	open file descriptors,
	202	accounting information,
	203	scratch data area,
	204	and the stack for the system phase
	205	of the process.
	206	The system data segment is not
	207	addressable from the user process
	208	and is therefore protected.
	209	.PP
	210	Last,
	211	there is a process table with
	212	one entry per process.
	213	This entry contains all the data
	214	needed by the system when the process
	215	is
	216	.IT not
	217	active.
	218	Examples are
	219	the process's name,
	220	the location of the other segments,
	221	and scheduling information.
	222	The process table entry is allocated
	223	when the process is created, and freed
	224	when the process terminates.
	225	This process entry is always directly
	226	addressable by the kernel.
	227	.PP
	228	Figure 1 shows the relationships
	229	between the various process control
	230	data.
	231	In a sense,
	232	the process table is the
	233	definition of all processes,
	234	because
	235	all the data associated with a process
	236	may be accessed
	237	starting from the process table entry.
	238	.KS
	239	.sp 2.44i
	240	.sp 2v
	241	.ce
	242	Fig. 1\(emProcess control data structure.
	243	.KE
	244	.NH 2
	245	Process creation and program execution
	246	.PP
	247	Processes are created by the system primitive
	248	.UL fork .
	249	The newly created process (child) is a copy of the original process (parent).
	250	There is no detectable sharing of primary memory between the two processes.
	251	(Of course,
	252	if the parent process was executing from a read-only
	253	text segment,
	254	the child will share the text segment.)
	255	Copies of all writable data segments
	256	are made for the child process.
	257	Files that were open before the
	258	.UL fork
	259	are
	260	truly shared after the
	261	.UL fork .
	262	The processes are informed as to their part in the
	263	relationship to
	264	allow them to select their own
	265	(usually non-identical)
	266	destiny.
	267	The parent may
	268	.UL wait
	269	for the termination of
	270	any of its children.
	271	.PP
	272	A process may
	273	.UL exec
	274	a file.
	275	This consists of exchanging the current text and data
	276	segments of the process for new text and data
	277	segments specified in the file.
	278	The old segments are lost.
	279	Doing an
	280	.UL exec
	281	does
	282	.IT not
	283	change processes;
	284	the process that did the
	285	.UL exec
	286	persists,
	287	but
	288	after the
	289	.UL exec
	290	it is executing a different program.
	291	Files that were open
	292	before the
	293	.UL exec
	294	remain open after the
	295	.UL exec .
	296	.PP
	297	If a program,
	298	say the first pass of a compiler,
	299	wishes to overlay itself with another program,
	300	say the second pass,
	301	then it simply
	302	.UL exec s
	303	the second program.
	304	This is analogous
	305	to a ``goto.''
	306	If a program wishes to regain control
	307	after
	308	.UL exec ing
	309	a second program,
	310	it should
	311	.UL fork
	312	a child process,
	313	have the child
	314	.UL exec
	315	the second program, and
	316	have the parent
	317	.UL wait
	318	for the child.
	319	This is analogous to a ``call.''
	320	Breaking up the call into a binding followed by
	321	a transfer is similar to the subroutine linkage in
	322	SL-5.
	323	.[
	324	griswold hanson sl5 overview
	325	.]
	326	.NH 2
	327	Swapping
	328	.PP
	329	The major data associated with a process
	330	(the user data segment,
	331	the system data segment, and
	332	the text segment)
	333	are swapped to and from secondary
	334	memory, as needed.
	335	The user data segment and the system data segment
	336	are kept in contiguous primary memory to reduce
	337	swapping latency.
	338	(When low-latency devices, such as bubbles,
	339	.UC CCD s,
	340	or scatter/gather devices,
	341	are used,
	342	this decision will have to be reconsidered.)
	343	Allocation of both primary
	344	and secondary memory is performed
	345	by the same simple first-fit algorithm.
	346	When a process grows,
	347	a new piece of primary memory is allocated.
	348	The contents of the old memory is copied to the new memory.
	349	The old memory is freed
	350	and the tables are updated.
	351	If there is not enough primary memory,
	352	secondary memory is allocated instead.
	353	The process is swapped out onto the
	354	secondary memory,
	355	ready to be swapped in with
	356	its new size.
	357	.PP
	358	One separate process in the kernel,
	359	the swapping process,
	360	simply swaps the other
	361	processes in and out of primary memory.
	362	It examines the
	363	process table looking for a process
	364	that is swapped out and is
	365	ready to run.
	366	It allocates primary memory for that
	367	process and
	368	reads its segments into
	369	primary memory, where that process competes for the
	370	central processor with other loaded processes.
	371	If no primary memory is available,
	372	the swapping process makes memory available
	373	by examining the process table for processes
	374	that can be swapped out.
	375	It selects a process to swap out,
	376	writes it to secondary memory,
	377	frees the primary memory,
	378	and then goes back to look for a process
	379	to swap in.
	380	.PP
	381	Thus there are two specific algorithms
	382	to the swapping process.
	383	Which of the possibly many processes that
	384	are swapped out is to be swapped in?
	385	This is decided by secondary storage residence
	386	time.
	387	The one with the longest time out is swapped in first.
	388	There is a slight penalty for larger processes.
	389	Which of the possibly many processes that
	390	are loaded is to be swapped out?
	391	Processes that are waiting for slow events
	392	(i.e., not currently running or waiting for
	393	disk I/O)
	394	are picked first,
	395	by age in primary memory,
	396	again with size penalties.
	397	The other processes are examined
	398	by the same age algorithm,
	399	but are not taken out unless they are
	400	at least of some age.
	401	This adds
	402	hysteresis to the swapping and
	403	prevents total thrashing.
	404	.PP
	405	These swapping algorithms are the
	406	most suspect in the system.
	407	With limited primary memory,
	408	these algorithms cause total swapping.
	409	This is not bad in itself, because
	410	the swapping does not impact the
	411	execution of the resident processes.
	412	However, if the swapping device must
	413	also be used for file storage,
	414	the swapping traffic severely
	415	impacts the file system traffic.
	416	It is exactly these small systems
	417	that tend to double usage of limited disk
	418	resources.
	419	.NH 2
	420	Synchronization and scheduling
	421	.PP
	422	Process synchronization is accomplished by having processes
	423	wait for events.
	424	Events are represented by arbitrary integers.
	425	By convention,
	426	events are chosen to be addresses of
	427	tables associated with those events.
	428	For example, a process that is waiting for
	429	any of its children to terminate will wait
	430	for an event that is the address of
	431	its own process table entry.
	432	When a process terminates,
	433	it signals the event represented by
	434	its parent's process table entry.
	435	Signaling an event on which no process
	436	is waiting has no effect.
	437	Similarly,
	438	signaling an event on which many processes
	439	are waiting will wake all of them up.
	440	This differs considerably from
	441	Dijkstra's P and V
	442	synchronization operations,
	443	.[
	444	dijkstra sequential processes 1968
	445	.]
	446	in that
	447	no memory is associated with events.
	448	Thus there need be no allocation of events
	449	prior to their use.
	450	Events exist simply by being used.
	451	.PP
	452	On the negative side,
	453	because there is no memory associated with events,
	454	no notion of ``how much''
	455	can be signaled via the event mechanism.
	456	For example,
	457	processes that want memory might
	458	wait on an event associated with
	459	memory allocation.
	460	When any amount of memory becomes available,
	461	the event would be signaled.
	462	All the competing processes would then wake
	463	up to fight over the new memory.
	464	(In reality,
	465	the swapping process is the only process
	466	that waits for primary memory to become available.)
	467	.PP
	468	If an event occurs
	469	between the time a process decides
	470	to wait for that event and the
	471	time that process enters the wait state,
	472	then
	473	the process will wait on an event that has
	474	already happened (and may never happen again).
	475	This race condition happens because there is no memory associated with
	476	the event to indicate that the event has occurred;
	477	the only action of an event is to change a set of processes
	478	from wait state to run state.
	479	This problem is relieved largely
	480	by the fact that process switching can
	481	only occur in the kernel by explicit calls
	482	to the event-wait mechanism.
	483	If the event in question is signaled by another
	484	process,
	485	then there is no problem.
	486	But if the event is signaled by a hardware
	487	interrupt,
	488	then special care must be taken.
	489	These synchronization races pose the biggest
	490	problem when
	491	.UX
	492	is adapted to multiple-processor configurations.
	493	.[
	494	hawley meyer multiprocessing unix
	495	.]
	496	.PP
	497	The event-wait code in the kernel
	498	is like a co-routine linkage.
	499	At any time,
	500	all but one of the processes has called event-wait.
	501	The remaining process is the one currently executing.
	502	When it calls event-wait,
	503	a process whose event has been signaled
	504	is selected and that process
	505	returns from its call to event-wait.
	506	.PP
	507	Which of the runable processes is to run next?
	508	Associated with each process is a priority.
	509	The priority of a system process is assigned by the code
	510	issuing the wait on an event.
	511	This is roughly equivalent to the response
	512	that one would expect on such an event.
	513	Disk events have high priority,
	514	teletype events are low,
	515	and time-of-day events are very low.
	516	(From observation,
	517	the difference in system process priorities
	518	has little or no performance impact.)
	519	All user-process priorities are lower than the
	520	lowest system priority.
	521	User-process priorities are assigned
	522	by an algorithm based on the
	523	recent ratio of the amount of compute time to real time consumed
	524	by the process.
	525	A process that has used a lot of
	526	compute time in the last real-time
	527	unit is assigned a low user priority.
	528	Because interactive processes are characterized
	529	by low ratios of compute to real time,
	530	interactive response is maintained without any
	531	special arrangements.
	532	.PP
	533	The scheduling algorithm simply picks
	534	the process with the highest priority,
	535	thus
	536	picking all system processes first and
	537	user processes second.
	538	The compute-to-real-time ratio is updated
	539	every second.
	540	Thus,
	541	all other things being equal,
	542	looping user processes will be
	543	scheduled round-robin with a
	544	1-second quantum.
	545	A high-priority process waking up will
	546	preempt a running, low-priority process.
	547	The scheduling algorithm has a very desirable
	548	negative feedback character.
	549	If a process uses its high priority
	550	to hog the computer,
	551	its priority will drop.
	552	At the same time, if a low-priority
	553	process is ignored for a long time,
	554	its priority will rise.
	555	.NH
	556	I/O SYSTEM
	557	.PP
	558	The I/O system
	559	is broken into two completely separate systems:
	560	the block I/O system and the character I/O system.
	561	In retrospect,
	562	the names should have been ``structured I/O''
	563	and ``unstructured I/O,'' respectively;
	564	while the term ``block I/O'' has some meaning,
	565	``character I/O'' is a complete misnomer.
	566	.PP
	567	Devices are characterized by a major device number,
	568	a minor device number, and
	569	a class (block or character).
	570	For each class,
	571	there is an array of entry points into the device drivers.
	572	The major device number is used to index the array
	573	when calling the code for a particular device driver.
	574	The minor device number is passed to the
	575	device driver as an argument.
	576	The minor number has no significance other
	577	than that attributed to it by the driver.
	578	Usually,
	579	the driver uses the minor number to access
	580	one of several identical physical devices.
	581	.PP
	582	The use of the array of entry points
	583	(configuration table)
	584	as the only connection between the
	585	system code and the device drivers is
	586	very important.
	587	Early versions of the system had a much
	588	less formal connection with the drivers,
	589	so that it was extremely hard to handcraft
	590	differently configured systems.
	591	Now it is possible to create new
	592	device drivers in an average of a few hours.
	593	The configuration table in most cases
	594	is created automatically by a program
	595	that reads the system's parts list.
	596	.NH 2
	597	Block I/O system
	598	.PP
	599	The model block I/O device consists
	600	of randomly addressed, secondary
	601	memory blocks of 512 bytes each.
	602	The blocks are uniformly addressed
	603	0, 1, .\\|.\\|. up to the size of the device.
	604	The block device driver has the job of
	605	emulating this model on a
	606	physical device.
	607	.PP
	608	The block I/O devices are accessed
	609	through a layer of buffering software.
	610	The system maintains a list of buffers
	611	(typically between 10 and 70)
	612	each assigned a device name and
	613	a device address.
	614	This buffer pool constitutes a data cache
	615	for the block devices.
	616	On a read request,
	617	the cache is searched for the desired block.
	618	If the block is found,
	619	the data are made available to the
	620	requester without any physical I/O.
	621	If the block is not in the cache,
	622	the least recently used block in the cache is renamed,
	623	the correct device driver is called to
	624	fill up the renamed buffer, and then the
	625	data are made available.
	626	Write requests are handled in an analogous manner.
	627	The correct buffer is found
	628	and relabeled if necessary.
	629	The write is performed simply by marking
	630	the buffer as ``dirty.''
	631	The physical I/O is then deferred until
	632	the buffer is renamed.
	633	.PP
	634	The benefits in reduction of physical I/O
	635	of this scheme are substantial,
	636	especially considering the file system implementation.
	637	There are,
	638	however,
	639	some drawbacks.
	640	The asynchronous nature of the
	641	algorithm makes error reporting
	642	and meaningful user error handling
	643	almost impossible.
	644	The cavalier approach to I/O error
	645	handling in the
	646	.UX
	647	system is partly due to the asynchronous
	648	nature of the block I/O system.
	649	A second problem is in the delayed writes.
	650	If the system stops unexpectedly,
	651	it is almost certain that there is a
	652	lot of logically complete,
	653	but physically incomplete,
	654	I/O in the buffers.
	655	There is a system primitive to
	656	flush all outstanding I/O activity
	657	from the buffers.
	658	Periodic use of this primitive helps,
	659	but does not solve, the problem.
	660	Finally,
	661	the associativity in the buffers
	662	can alter the physical I/O sequence
	663	from that of the logical I/O sequence.
	664	This means that there are times
	665	when data structures on disk are inconsistent,
	666	even though the software is careful
	667	to perform I/O in the correct order.
	668	On non-random devices,
	669	notably magnetic tape,
	670	the inversions of writes can be disastrous.
	671	The problem with magnetic tapes is ``cured'' by
	672	allowing only one outstanding write request
	673	per drive.
	674	.NH 2
	675	Character I/O system
	676	.PP
	677	The character I/O system consists of all
	678	devices that do not fall into the block I/O model.
	679	This includes the ``classical'' character devices
	680	such as communications lines, paper tape, and
	681	line printers.
	682	It also includes magnetic tape and disks when
	683	they are not used in a stereotyped way,
	684	for example, 80-byte physical records on tape
	685	and track-at-a-time disk copies.
	686	In short,
	687	the character I/O interface
	688	means ``everything other than block.''
	689	I/O requests from the user are sent to the
	690	device driver essentially unaltered.
	691	The implementation of these requests is, of course,
	692	up to the device driver.
	693	There are guidelines and conventions
	694	to help the implementation of
	695	certain types of device drivers.
	696	.NH 3
	697	Disk drivers
	698	.PP
	699	Disk drivers are implemented
	700	with a queue of transaction records.
	701	Each record holds a read/write flag,
	702	a primary memory address,
	703	a secondary memory address, and
	704	a transfer byte count.
	705	Swapping is accomplished by passing
	706	such a record to the swapping device driver.
	707	The block I/O interface is implemented by
	708	passing such records with requests to
	709	fill and empty system buffers.
	710	The character I/O interface to the disk
	711	drivers create a transaction record that
	712	points directly into the user area.
	713	The routine that creates this record also insures
	714	that the user is not swapped during this
	715	I/O transaction.
	716	Thus by implementing the general disk driver,
	717	it is possible to use the disk
	718	as a block device,
	719	a character device, and a swap device.
	720	The only really disk-specific code in normal
	721	disk drivers is the pre-sort of transactions to
	722	minimize latency for a particular device, and
	723	the actual issuing of the I/O request.
	724	.NH 3
	725	Character lists
	726	.PP
	727	Real character-oriented devices may
	728	be implemented using the common
	729	code to handle character lists.
	730	A character list is a queue of characters.
	731	One routine puts a character on a queue.
	732	Another gets a character from a queue.
	733	It is also possible to ask how many
	734	characters are currently on a queue.
	735	Storage for all queues in the system comes
	736	from a single common pool.
	737	Putting a character on a queue will allocate
	738	space from the common pool and link the
	739	character onto the data structure defining the queue.
	740	Getting a character from a queue returns
	741	the corresponding space to the pool.
	742	.PP
	743	A typical character-output device
	744	(paper tape punch, for example)
	745	is implemented by passing characters
	746	from the user onto a character queue until
	747	some maximum number of characters is on the queue.
	748	The I/O is prodded to start as
	749	soon as there is anything on the queue
	750	and, once started,
	751	it is sustained by hardware completion interrupts.
	752	Each time there is a completion interrupt,
	753	the driver gets the next character from the queue
	754	and sends it to the hardware.
	755	The number of characters on the queue is checked and,
	756	as the count falls through some intermediate level,
	757	an event (the queue address) is signaled.
	758	The process that is passing characters from
	759	the user to the queue can be waiting on the event, and
	760	refill the queue to its maximum
	761	when the event occurs.
	762	.PP
	763	A typical character input device
	764	(for example, a paper tape reader)
	765	is handled in a very similar manner.
	766	.PP
	767	Another class of character devices is the terminals.
	768	A terminal is represented by three
	769	character queues.
	770	There are two input queues (raw and canonical)
	771	and an output queue.
	772	Characters going to the output of a terminal
	773	are handled by common code exactly as described
	774	above.
	775	The main difference is that there is also code
	776	to interpret the output stream as
	777	.UC ASCII
	778	characters and to perform some translations,
	779	e.g., escapes for deficient terminals.
	780	Another common aspect of terminals is code
	781	to insert real-time delay after certain control characters.
	782	.PP
	783	Input on terminals is a little different.
	784	Characters are collected from the terminal and
	785	placed on a raw input queue.
	786	Some device-dependent code conversion and
	787	escape interpretation is handled here.
	788	When a line is complete in the raw queue,
	789	an event is signaled.
	790	The code catching this signal then copies a
	791	line from the raw queue to a canonical queue
	792	performing the character erase and line kill editing.
	793	User read requests on terminals can be
	794	directed at either the raw or canonical queues.
	795	.NH 3
	796	Other character devices
	797	.PP
	798	Finally,
	799	there are devices that fit no general category.
	800	These devices are set up as character I/O drivers.
	801	An example is a driver that reads and writes
	802	unmapped primary memory as an I/O device.
	803	Some devices are too
	804	fast to be treated a character at time,
	805	but do not fit the disk I/O mold.
	806	Examples are fast communications lines and
	807	fast line printers.
	808	These devices either have their own buffers
	809	or ``borrow'' block I/O buffers for a while and
	810	then give them back.
	811	.NH
	812	THE FILE SYSTEM
	813	.PP
	814	In the
	815	.UX
	816	system,
	817	a file is a (one-dimensional) array of bytes.
	818	No other structure of files is implied by the
	819	system.
	820	Files are attached anywhere
	821	(and possibly multiply)
	822	onto a hierarchy of directories.
	823	Directories are simply files that
	824	users cannot write.
	825	For a further discussion
	826	of the external view of files and directories,
	827	see Ref.\0
	828	.[
	829	ritchie thompson unix bstj 1978
	830	%Q This issue
	831	.].
	832	.PP
	833	The
	834	.UX
	835	file system is a disk data structure
	836	accessed completely through
	837	the block I/O system.
	838	As stated before,
	839	the canonical view of a ``disk'' is
	840	a randomly addressable array of
	841	512-byte blocks.
	842	A file system breaks the disk into
	843	four self-identifying regions.
	844	The first block (address 0)
	845	is unused by the file system.
	846	It is left aside for booting procedures.
	847	The second block (address 1)
	848	contains the so-called ``super-block.''
	849	This block,
	850	among other things,
	851	contains the size of the disk and
	852	the boundaries of the other regions.
	853	Next comes the i-list,
	854	a list of file definitions.
	855	Each file definition is
	856	a 64-byte structure, called an i-node.
	857	The offset of a particular i-node
	858	within the i-list is called its i-number.
	859	The combination of device name
	860	(major and minor numbers) and i-number
	861	serves to uniquely name a particular file.
	862	After the i-list,
	863	and to the end of the disk,
	864	come free storage blocks that
	865	are available for the contents of files.
	866	.PP
	867	The free space on a disk is maintained
	868	by a linked list of available disk blocks.
	869	Every block in this chain contains a disk address
	870	of the next block in the chain.
	871	The remaining space contains the address of up to
	872	50 disk blocks that are also free.
	873	Thus with one I/O operation,
	874	the system obtains 50 free blocks and a
	875	pointer where to find more.
	876	The disk allocation algorithms are
	877	very straightforward.
	878	Since all allocation is in fixed-size
	879	blocks and there is strict accounting of
	880	space,
	881	there is no need to compact or garbage collect.
	882	However,
	883	as disk space becomes dispersed,
	884	latency gradually increases.
	885	Some installations choose to occasionally compact
	886	disk space to reduce latency.
	887	.PP
	888	An i-node contains 13 disk addresses.
	889	The first 10 of these addresses point directly at
	890	the first 10 blocks of a file.
	891	If a file is larger than 10 blocks (5,120 bytes),
	892	then the eleventh address points at a block
	893	that contains the addresses of the next 128 blocks of the file.
	894	If the file is still larger than this
	895	(70,656 bytes),
	896	then the twelfth block points at up to 128 blocks,
	897	each pointing to 128 blocks of the file.
	898	Files yet larger
	899	(8,459,264 bytes)
	900	use the thirteenth address for a ``triple indirect'' address.
	901	The algorithm ends here with the maximum file size
	902	of 1,082,201,087 bytes.
	903	.PP
	904	A logical directory hierarchy is added
	905	to this flat physical structure simply
	906	by adding a new type of file, the directory.
	907	A directory is accessed exactly as an ordinary file.
	908	It contains 16-byte entries consisting of
	909	a 14-byte name and an i-number.
	910	The root of the hierarchy is at a known i-number
	911	(\fIviz.,\fR 2).
	912	The file system structure allows an arbitrary, directed graph
	913	of directories with regular files linked in
	914	at arbitrary places in this graph.
	915	In fact,
	916	very early
	917	.UX
	918	systems used such a structure.
	919	Administration of such a structure became so
	920	chaotic that later systems were restricted
	921	to a directory tree.
	922	Even now,
	923	with regular files linked multiply
	924	into arbitrary places in the tree,
	925	accounting for space has become a problem.
	926	It may become necessary to restrict the entire
	927	structure to a tree,
	928	and allow a new form of linking that
	929	is subservient to the tree structure.
	930	.PP
	931	The file system allows
	932	easy creation,
	933	easy removal,
	934	easy random accessing,
	935	and very easy space allocation.
	936	With most physical addresses confined
	937	to a small contiguous section of disk,
	938	it is also easy to dump, restore, and
	939	check the consistency of the file system.
	940	Large files suffer from indirect addressing,
	941	but the cache prevents most of the implied physical I/O
	942	without adding much execution.
	943	The space overhead properties of this scheme are quite good.
	944	For example,
	945	on one particular file system,
	946	there are 25,000 files containing 130M bytes of data-file content.
	947	The overhead (i-node, indirect blocks, and last block breakage)
	948	is about 11.5M bytes.
	949	The directory structure to support these files
	950	has about 1,500 directories containing 0.6M bytes of directory content
	951	and about 0.5M bytes of overhead in accessing the directories.
	952	Added up any way,
	953	this comes out to less than a 10 percent overhead for actual
	954	stored data.
	955	Most systems have this much overhead in
	956	padded trailing blanks alone.
	957	.NH 2
	958	File system implementation
	959	.PP
	960	Because the i-node defines a file,
	961	the implementation of the file system centers
	962	around access to the i-node.
	963	The system maintains a table of all active
	964	i-nodes.
	965	As a new file is accessed,
	966	the system locates the corresponding i-node,
	967	allocates an i-node table entry, and reads
	968	the i-node into primary memory.
	969	As in the buffer cache,
	970	the table entry is considered to be the current
	971	version of the i-node.
	972	Modifications to the i-node are made to
	973	the table entry.
	974	When the last access to the i-node goes
	975	away,
	976	the table entry is copied back to the
	977	secondary store i-list and the table entry is freed.
	978	.PP
	979	All I/O operations on files are carried out
	980	with the aid of the corresponding i-node table entry.
	981	The accessing of a file is a straightforward
	982	implementation of the algorithms mentioned previously.
	983	The user is not aware of i-nodes and i-numbers.
	984	References to the file system are made in terms of
	985	path names of the directory tree.
	986	Converting a path name into an i-node table entry
	987	is also straightforward.
	988	Starting at some known i-node
	989	(the root or the current directory of some process),
	990	the next component of the path name is
	991	searched by reading the directory.
	992	This gives an i-number and an implied device
	993	(that of the directory).
	994	Thus the next i-node table entry can be accessed.
	995	If that was the last component of the path name,
	996	then this i-node is the result.
	997	If not,
	998	this i-node is the directory needed to look up
	999	the next component of the path name, and the
	1000	algorithm is repeated.
	1001	.PP
	1002	The user process accesses the file system with
	1003	certain primitives.
	1004	The most common of these are
	1005	.UL open ,
	1006	.UL create ,
	1007	.UL read ,
	1008	.UL write ,
	1009	.UL seek ,
	1010	and
	1011	.UL close .
	1012	The data structures maintained are shown in Fig. 2.
	1013	.KS
	1014	.sp 22P
	1015	.ce
	1016	Fig. 2\(emFile system data structure.
	1017	.KE
	1018	In the system data segment associated with a user,
	1019	there is room for some (usually between 10 and 50) open files.
	1020	This open file table consists of pointers that can be used to access
	1021	corresponding i-node table entries.
	1022	Associated with each of these open files is
	1023	a current I/O pointer.
	1024	This is a byte offset of
	1025	the next read/write operation on the file.
	1026	The system treats each read/write request
	1027	as random with an implied seek to the
	1028	I/O pointer.
	1029	The user usually thinks of the file as
	1030	sequential with the I/O pointer
	1031	automatically counting the number of bytes
	1032	that have been read/written from the file.
	1033	The user may,
	1034	of course,
	1035	perform random I/O by setting the I/O pointer
	1036	before reads/writes.
	1037	.PP
	1038	With file sharing,
	1039	it is necessary to allow related
	1040	processes to share a common I/O pointer
	1041	and yet have separate I/O pointers
	1042	for independent processes
	1043	that access the same file.
	1044	With these two conditions,
	1045	the I/O pointer cannot reside
	1046	in the i-node table nor can
	1047	it reside in the list of
	1048	open files for the process.
	1049	A new table
	1050	(the open file table)
	1051	was invented for the sole purpose
	1052	of holding the I/O pointer.
	1053	Processes that share the same open
	1054	file
	1055	(the result of
	1056	.UL fork s)
	1057	share a common open file table entry.
	1058	A separate open of the same file will
	1059	only share the i-node table entry,
	1060	but will have distinct open file table entries.
	1061	.PP
	1062	The main file system primitives are implemented as follows.
	1063	.UL \&open
	1064	converts a file system path name into an i-node
	1065	table entry.
	1066	A pointer to the i-node table entry is placed in a
	1067	newly created open file table entry.
	1068	A pointer to the file table entry is placed in the
	1069	system data segment for the process.
	1070	.UL \&create
	1071	first creates a new i-node entry,
	1072	writes the i-number into a directory, and
	1073	then builds the same structure as for an
	1074	.UL open .
	1075	.UL \&read
	1076	and
	1077	.UL write
	1078	just access the i-node entry as described above.
	1079	.UL \&seek
	1080	simply manipulates the I/O pointer.
	1081	No physical seeking is done.
	1082	.UL \&close
	1083	just frees the structures built by
	1084	.UL open
	1085	and
	1086	.UL create .
	1087	Reference counts are kept on the open file table entries and
	1088	the i-node table entries to free these structures after
	1089	the last reference goes away.
	1090	.UL \&unlink
	1091	simply decrements the count of the
	1092	number of directories pointing at the given i-node.
	1093	When the last reference to an i-node table entry
	1094	goes away,
	1095	if the i-node has no directories pointing to it,
	1096	then the file is removed and the i-node is freed.
	1097	This delayed removal of files prevents
	1098	problems arising from removing active files.
	1099	A file may be removed while still open.
	1100	The resulting unnamed file vanishes
	1101	when the file is closed.
	1102	This is a method of obtaining temporary files.
	1103	.PP
	1104	There is a type of unnamed
	1105	.UC FIFO
	1106	file called a
	1107	.UL pipe.
	1108	Implementation of
	1109	.UL pipe s
	1110	consists of implied
	1111	.UL seek s
	1112	before each
	1113	.UL read
	1114	or
	1115	.UL write
	1116	in order to implement
	1117	first-in-first-out.
	1118	There are also checks and synchronization
	1119	to prevent the
	1120	writer from grossly outproducing the
	1121	reader and to prevent the reader from
	1122	overtaking the writer.
	1123	.NH 2
	1124	Mounted file systems
	1125	.PP
	1126	The file system of a
	1127	.UX
	1128	system
	1129	starts with some designated block device
	1130	formatted as described above to contain
	1131	a hierarchy.
	1132	The root of this structure is the root of
	1133	the
	1134	.UX
	1135	file system.
	1136	A second formatted block device may be
	1137	mounted
	1138	at any leaf of
	1139	the current hierarchy.
	1140	This logically extends the current hierarchy.
	1141	The implementation of
	1142	mounting
	1143	is trivial.
	1144	A mount table is maintained containing
	1145	pairs of designated leaf i-nodes and
	1146	block devices.
	1147	When converting a path name into an i-node,
	1148	a check is made to see if the new i-node is a
	1149	designated leaf.
	1150	If it is,
	1151	the i-node of the root
	1152	of the block device replaces it.
	1153	.PP
	1154	Allocation of space for a file is taken
	1155	from the free pool on the device on which the
	1156	file lives.
	1157	Thus a file system consisting of many
	1158	mounted devices does not have a common pool of
	1159	free secondary storage space.
	1160	This separation of space on different
	1161	devices is necessary to allow easy
	1162	unmounting
	1163	of a device.
	1164	.NH 2
	1165	Other system functions
	1166	.PP
	1167	There are some other things that the system
	1168	does for the user\-a
	1169	little accounting,
	1170	a little tracing/debugging,
	1171	and a little access protection.
	1172	Most of these things are not very
	1173	well developed
	1174	because our use of the system in computing science research
	1175	does not need them.
	1176	There are some features that are missed in some
	1177	applications, for example, better inter-process communication.
	1178	.PP
	1179	The
	1180	.UX
	1181	kernel is an I/O multiplexer more than
	1182	a complete operating system.
	1183	This is as it should be.
	1184	Because of this outlook,
	1185	many features are
	1186	found in most
	1187	other operating systems that are missing from the
	1188	.UX
	1189	kernel.
	1190	For example,
	1191	the
	1192	.UX
	1193	kernel does not support
	1194	file access methods,
	1195	file disposition,
	1196	file formats,
	1197	file maximum size,
	1198	spooling,
	1199	command language,
	1200	logical records,
	1201	physical records,
	1202	assignment of logical file names,
	1203	logical file names,
	1204	more than one character set,
	1205	an operator's console,
	1206	an operator,
	1207	log-in,
	1208	or log-out.
	1209	Many of these things are symptoms rather than features.
	1210	Many of these things are implemented
	1211	in user software
	1212	using the kernel as a tool.
	1213	A good example of this is the command language.
	1214	.[
	1215	bourne shell 1978 bstj
	1216	%Q This issue
	1217	.]
	1218	Each user may have his own command language.
	1219	Maintenance of such code is as easy as
	1220	maintaining user code.
	1221	The idea of implementing ``system'' code with general
	1222	user primitives
	1223	comes directly from
	1224	.UC MULTICS .
	1225	.[
	1226	organick multics 1972
	1227	.]
	1228	.LP
	1229	.[
	1230	$LIST$
	1231	.]