[unix-history] / .ref-BSD-3 / usr / doc / cacm / p2

.SH
3.6 I/O calls
.PP
The system calls to do I/O are designed to eliminate
the differences between the various devices and styles of
access.
There is no distinction between ``random''
and ``sequential'' I/O, nor is any logical record size imposed
by the system.
The size of an ordinary file is determined
by the number of bytes written on it;
no predetermination of the size of a file is necessary
or possible.
.PP
To illustrate the essentials of I/O,
some of the basic calls are
summarized below
in an anonymous language that will
indicate the required parameters without getting into the
underlying
complexities.
Each call to the system may potentially result in an error
return, which for simplicity is not represented
in the calling sequence.
.PP
To read or write a file assumed to exist already, it must
be opened by the following call:
.P1
filep = open\|(\|name, flag\|)
.P2
where
.UL name
indicates the name of the file.
An arbitrary path name may be given.
The
.UL flag
argument indicates whether the file is to be read, written,
or ``updated,'' that is, read and written simultaneously.
.PP
The returned value
.UL filep
is called a
.IT "file descriptor" .
It is a small integer used to identify the file
in subsequent calls to read, write,
or otherwise manipulate the file.
.PP
To create a new file or completely rewrite an old one,
there is a
.UL create
system call that
creates the given file if it does not exist,
or truncates it to zero length
if it does exist;
.UL create
also opens the new file for writing
and, like
.UL open ,
returns a file descriptor.
.PP
The file system maintains no locks visible to the user, nor is there any
restriction on the number of users who may have a file
open for reading or writing.
Although it is possible for the contents of a file
to become scrambled when two users write on it simultaneously,
in practice difficulties do not arise.
We take the view that locks are neither
necessary nor sufficient, in our environment,
to prevent interference between users of the same file.
They are unnecessary because we are not
faced with large, single-file data bases
maintained by independent processes.
They are insufficient because
locks in the ordinary sense, whereby
one user is prevented from writing on a file that another
user is reading,
cannot prevent confusion
when, for example, both users are editing
a file with an editor that makes
a copy of the file being edited.
.PP
There are, however,
sufficient internal interlocks to maintain
the logical consistency of the file system
when two users engage simultaneously in
activities such as writing on
the same file,
creating files in the same directory,
or deleting each other's open files.
.PP
Except as indicated below, reading and writing
are sequential.
This means that if a particular
byte in the file was the last byte written (or read),
the next I/O call implicitly refers to the
immediately following byte.
For each open file there is a pointer, maintained
inside the system,
that indicates the next byte to be read
or written.
If
.IT n
bytes are read or written, the pointer advances
by
.IT n
bytes.
.PP
Once a file is open, the following calls
may be used:
.P1
n = read\|(\|filep, buffer, count\|)
n = write\|(\|filep, buffer, count\|)
.P2
Up to
.UL count
bytes are transmitted between the file specified
by
.UL filep
and the byte array
specified by
.UL buffer .
The returned value
.UL n
is the number of bytes actually transmitted.
In the
.UL write
case,
.UL n
is the same as
.UL count
except under exceptional conditions, such as I/O errors or
end of physical medium on special files;
in a
.UL read ,
however,
.UL n
may without error be less than
.UL count .
If the read pointer is so near the end of the
file that reading
.UL count
characters
would cause reading beyond the end, only sufficient
bytes are transmitted to reach the end of the
file;
also, typewriter-like terminals
never return more than one line of input.
When a
.UL read
call returns with
.UL n
equal
to zero, the end of the file has been reached.
For disk files this occurs when the read pointer
becomes equal to the current
size of the file.
It is possible to generate an end-of-file
from a terminal by use of an escape
sequence that depends on the device used.
.PP
Bytes written affect only those parts of a file implied by
the position of the write pointer and the
count; no other part of the file
is changed.
If the last byte lies beyond the end of the file, the
file is made to grow as needed.
.PP
To do random (direct-access) I/O
it is only necessary to move the read or write pointer
to the appropriate location in the file.
.P1
location = lseek\|(\|filep, offset, base\|)
.P2
The pointer
associated with
.UL filep
is moved to a position
.UL offset
bytes from the beginning of the file, from the current position
of the pointer, or from the end of the file,
depending on
.UL base.
.UL \&offset
may be negative.
For some devices (e.g., paper
tape and
terminals) seek calls are
ignored.
The actual offset from the beginning of the file
to which the pointer was moved is returned
in
.UL location .
.PP
There are several additional system entries
having to do with I/O and with the file
system that will not be discussed.
For example:
close a file,
get the status of a file,
change the protection mode or the owner
of a file,
create a directory,
make a link to an existing file,
delete a file.
.SH
IV. IMPLEMENTATION OF THE FILE SYSTEM
.PP
As mentioned in Section 3.2 above, a directory entry contains
only a name for the associated file and a pointer to the
file itself.
This pointer is an integer called the
.IT i-number
(for index number)
of the file.
When the file is accessed,
its i-number is used as an index into
a system table (the
.IT i-list \|)
stored in a known
part of the device on which
the directory resides.
The entry found thereby (the file's
.IT i-node \|)
contains
the description of the file:
.IP i
the user and group-\*sID\*n of its owner
.IP ii
its protection bits
.IP iii
the physical disk or tape addresses for the file contents
.IP iv
its size
.IP v
time of creation, last use, and last modification
.IP vi
the number of links to the file, that is, the number of times it appears in a directory
.IP vii
a code indicating whether the file is a directory, an ordinary file, or a special file.
.LP
The purpose of an
.UL open
or
.UL create
system call is to turn the path name given by the user
into an i-number
by searching the explicitly or implicitly named directories.
Once a file is open,
its device, i-number, and read/write pointer are stored in a system table
indexed by the file descriptor returned by the
.UL open
or
.UL create .
Thus, during a subsequent
call to read or write the
file,
the descriptor
may be easily related to the information necessary to access the file.
.PP
When a new file is created,
an i-node is allocated for it and a directory entry is made
that contains the name of the file and the i-node
number.
Making a link to an existing file involves
creating a directory entry with the new name,
copying the i-number from the original file entry,
and incrementing the link-count field of the i-node.
Removing (deleting) a file is done by
decrementing the
link-count of the i-node specified by its directory entry
and erasing the directory entry.
If the link-count drops to 0,
any disk blocks in the file
are freed and the i-node is de-allocated.
.PP
The space on all disks that
contain a file system is divided into a number of
512-byte
blocks logically addressed from 0 up to a limit that
depends on the device.
There is space in the i-node of each file for 13 device addresses.
For nonspecial files,
the first 10 device addresses point at the first
10 blocks of the file.
If the file is larger than 10 blocks,
the 11 device address points to an
indirect block containing up to 128 addresses
of additional blocks in the file.
Still larger files use the twelfth device address
of the i-node to point to
a double-indirect block naming
128 indirect blocks,
each
pointing to 128 blocks of the file.
If required,
the thirteenth device address is
a triple-indirect block.
Thus files may conceptually grow to
[\|(10+128+128\u\s62\s0\d+128\u\s63\s0\d)\*m512\|] bytes.
Once opened,
bytes numbered below 5120 can be read with a single
disk access;
bytes in the range 5120 to 70,656
require two accesses;
bytes in the range 70,656
to 8,459,264
require three accesses;
bytes from there to the
largest file
(1,082,201,088)
require four accesses.
In practice,
a device cache mechanism
(see below)
proves effective in eliminating
most of the indirect fetches.
.PP
The foregoing discussion applies to ordinary files.
When an I/O request is made to a file whose i-node indicates that it
is special,
the last 12 device address words are immaterial,
and the first specifies
an internal
.IT "device name" ,
which is interpreted as a pair of numbers
representing,
respectively, a device type
and subdevice number.
The device type indicates which
system routine will deal with I/O on that device;
the subdevice number selects, for example, a disk drive
attached to a particular controller or one of several
similar terminal interfaces.
.PP
In this environment, the implementation of the
.UL mount
system call (Section 3.4) is quite straightforward.
.UL \&mount
maintains a system table whose
argument is the i-number and device name of the
ordinary file specified
during the
.UL mount ,
and whose corresponding value is the
device name of the indicated special file.
This table is searched for each i-number/device pair
that turns up while a path name is being scanned
during an
.UL open
or
.UL create ;
if a match is found,
the i-number is replaced by the i-number of the root
directory
and the device name is replaced by the table value.
.PP
To the user, both reading and writing of files appear to
be synchronous and unbuffered.
That is, immediately after
return from a
.UL read
call the data are available; conversely,
after a
.UL write
the user's workspace may be reused.
In fact, the system maintains a rather complicated
buffering mechanism that reduces greatly the number
of I/O operations required to access a file.
Suppose a
.UL write
call is made specifying transmission
of a single byte.
The system
will search its buffers to see
whether the affected disk block currently resides in main memory;
if not, it will be read in from the device.
Then the affected byte is replaced in the buffer and an
entry is made in a list of blocks to be written.
The return from the
.UL write
call may then take place,
although the actual I/O may not be completed until a later time.
Conversely, if a single byte is read, the system determines
whether the secondary storage block in which the byte is located is already
in one of the system's buffers; if so, the byte can be returned immediately.
If not, the block is read into a buffer and the byte picked out.
.PP
The system recognizes when
a program has
made accesses to
sequential blocks of a file,
and asynchronously
pre-reads the next block.
This significantly reduces
the running time of most programs
while adding little to
system overhead.
.PP
A program that reads or writes files in units of 512 bytes
has an advantage over a program that reads or writes
a single byte at a time, but the gain is not immense;
it comes mainly from the avoidance of system overhead.
If a program is used rarely or does
no great volume of I/O, it may quite reasonably
read and write in units as small as it wishes.
.PP
The notion of the i-list is an unusual feature
of
.UX .
In practice, this method of organizing the file system
has proved quite reliable and easy to deal with.
To the system itself, one of its strengths is
the fact that each file has a short, unambiguous name
related in a simple way to the protection, addressing,
and other information needed to access the file.
It also permits a quite simple and rapid
algorithm for checking the consistency of a file system,
for example, verification
that the portions of each device containing useful information
and those free to be allocated are disjoint and together
exhaust the space on the device.
This algorithm is independent
of the directory hierarchy, because it need only scan
the linearly organized i-list.
At the same time the notion of the i-list induces certain
peculiarities not found in other file system organizations.
For example, there is the question of who is to be charged
for the space a file occupies,
because all directory entries for a file have equal status.
Charging the owner of a file is unfair in general,
for one user may create a file, another may link to
it, and the first user may delete the file.
The first user is still the owner of the
file, but it should be charged
to the second user.
The simplest reasonably fair algorithm
seems to be to spread the charges
equally among users who have links to a file.
Many installations
avoid the
issue by not charging any fees at all.
Commit	Line	Data
8340f87c BJ	1	.SH
	2	3.6 I/O calls
	3	.PP
	4	The system calls to do I/O are designed to eliminate
	5	the differences between the various devices and styles of
	6	access.
	7	There is no distinction between ``random''
	8	and ``sequential'' I/O, nor is any logical record size imposed
	9	by the system.
	10	The size of an ordinary file is determined
	11	by the number of bytes written on it;
	12	no predetermination of the size of a file is necessary
	13	or possible.
	14	.PP
	15	To illustrate the essentials of I/O,
	16	some of the basic calls are
	17	summarized below
	18	in an anonymous language that will
	19	indicate the required parameters without getting into the
	20	underlying
	21	complexities.
	22	Each call to the system may potentially result in an error
	23	return, which for simplicity is not represented
	24	in the calling sequence.
	25	.PP
	26	To read or write a file assumed to exist already, it must
	27	be opened by the following call:
	28	.P1
	29	filep = open\\|(\\|name, flag\\|)
	30	.P2
	31	where
	32	.UL name
	33	indicates the name of the file.
	34	An arbitrary path name may be given.
	35	The
	36	.UL flag
	37	argument indicates whether the file is to be read, written,
	38	or ``updated,'' that is, read and written simultaneously.
	39	.PP
	40	The returned value
	41	.UL filep
	42	is called a
	43	.IT "file descriptor" .
	44	It is a small integer used to identify the file
	45	in subsequent calls to read, write,
	46	or otherwise manipulate the file.
	47	.PP
	48	To create a new file or completely rewrite an old one,
	49	there is a
	50	.UL create
	51	system call that
	52	creates the given file if it does not exist,
	53	or truncates it to zero length
	54	if it does exist;
	55	.UL create
	56	also opens the new file for writing
	57	and, like
	58	.UL open ,
	59	returns a file descriptor.
	60	.PP
	61	The file system maintains no locks visible to the user, nor is there any
	62	restriction on the number of users who may have a file
	63	open for reading or writing.
	64	Although it is possible for the contents of a file
65	to become scrambled when two users write on it simultaneously,
66	in practice difficulties do not arise.
67	We take the view that locks are neither
68	necessary nor sufficient, in our environment,
69	to prevent interference between users of the same file.
70	They are unnecessary because we are not
71	faced with large, single-file data bases
72	maintained by independent processes.
73	They are insufficient because
74	locks in the ordinary sense, whereby
75	one user is prevented from writing on a file that another
76	user is reading,
77	cannot prevent confusion
78	when, for example, both users are editing
79	a file with an editor that makes
80	a copy of the file being edited.
81	.PP
82	There are, however,
83	sufficient internal interlocks to maintain
84	the logical consistency of the file system
85	when two users engage simultaneously in
86	activities such as writing on
87	the same file,
88	creating files in the same directory,
89	or deleting each other's open files.
90	.PP
91	Except as indicated below, reading and writing
92	are sequential.
93	This means that if a particular
94	byte in the file was the last byte written (or read),
95	the next I/O call implicitly refers to the
96	immediately following byte.
97	For each open file there is a pointer, maintained
98	inside the system,
99	that indicates the next byte to be read
100	or written.
101	If
102	.IT n
103	bytes are read or written, the pointer advances
104	by
105	.IT n
106	bytes.
107	.PP
108	Once a file is open, the following calls
109	may be used:
110	.P1
111	n = read\\|(\\|filep, buffer, count\\|)
112	n = write\\|(\\|filep, buffer, count\\|)
113	.P2
114	Up to
115	.UL count
116	bytes are transmitted between the file specified
117	by
118	.UL filep
119	and the byte array
120	specified by
121	.UL buffer .
122	The returned value
123	.UL n
124	is the number of bytes actually transmitted.
125	In the
126	.UL write
127	case,
128	.UL n
129	is the same as
130	.UL count
131	except under exceptional conditions, such as I/O errors or
132	end of physical medium on special files;
133	in a
134	.UL read ,
135	however,
136	.UL n
137	may without error be less than
138	.UL count .
139	If the read pointer is so near the end of the
140	file that reading
141	.UL count
142	characters
143	would cause reading beyond the end, only sufficient
144	bytes are transmitted to reach the end of the
145	file;
146	also, typewriter-like terminals
147	never return more than one line of input.
148	When a
149	.UL read
150	call returns with
151	.UL n
152	equal
153	to zero, the end of the file has been reached.
154	For disk files this occurs when the read pointer
155	becomes equal to the current
156	size of the file.
157	It is possible to generate an end-of-file
158	from a terminal by use of an escape
159	sequence that depends on the device used.
160	.PP
161	Bytes written affect only those parts of a file implied by
162	the position of the write pointer and the
163	count; no other part of the file
164	is changed.
165	If the last byte lies beyond the end of the file, the
166	file is made to grow as needed.
167	.PP
168	To do random (direct-access) I/O
169	it is only necessary to move the read or write pointer
170	to the appropriate location in the file.
171	.P1
172	location = lseek\\|(\\|filep, offset, base\\|)
173	.P2
174	The pointer
175	associated with
176	.UL filep
177	is moved to a position
178	.UL offset
179	bytes from the beginning of the file, from the current position
180	of the pointer, or from the end of the file,
181	depending on
182	.UL base.
183	.UL \&offset
184	may be negative.
185	For some devices (e.g., paper
186	tape and
187	terminals) seek calls are
188	ignored.
189	The actual offset from the beginning of the file
190	to which the pointer was moved is returned
191	in
192	.UL location .
193	.PP
194	There are several additional system entries
195	having to do with I/O and with the file
196	system that will not be discussed.
197	For example:
198	close a file,
199	get the status of a file,
200	change the protection mode or the owner
201	of a file,
202	create a directory,
203	make a link to an existing file,
204	delete a file.
205	.SH
206	IV. IMPLEMENTATION OF THE FILE SYSTEM
207	.PP
208	As mentioned in Section 3.2 above, a directory entry contains
209	only a name for the associated file and a pointer to the
210	file itself.
211	This pointer is an integer called the
212	.IT i-number
213	(for index number)
214	of the file.
215	When the file is accessed,
216	its i-number is used as an index into
217	a system table (the
218	.IT i-list \\|)
219	stored in a known
220	part of the device on which
221	the directory resides.
222	The entry found thereby (the file's
223	.IT i-node \\|)
224	contains
225	the description of the file:
226	.IP i
227	the user and group-\sID\n of its owner
228	.IP ii
229	its protection bits
230	.IP iii
231	the physical disk or tape addresses for the file contents
232	.IP iv
233	its size
234	.IP v
235	time of creation, last use, and last modification
236	.IP vi
237	the number of links to the file, that is, the number of times it appears in a directory
238	.IP vii
239	a code indicating whether the file is a directory, an ordinary file, or a special file.
240	.LP
241	The purpose of an
242	.UL open
243	or
244	.UL create
245	system call is to turn the path name given by the user
246	into an i-number
247	by searching the explicitly or implicitly named directories.
248	Once a file is open,
249	its device, i-number, and read/write pointer are stored in a system table
250	indexed by the file descriptor returned by the
251	.UL open
252	or
253	.UL create .
254	Thus, during a subsequent
255	call to read or write the
256	file,
257	the descriptor
258	may be easily related to the information necessary to access the file.
259	.PP
260	When a new file is created,
261	an i-node is allocated for it and a directory entry is made
262	that contains the name of the file and the i-node
263	number.
264	Making a link to an existing file involves
265	creating a directory entry with the new name,
266	copying the i-number from the original file entry,
267	and incrementing the link-count field of the i-node.
268	Removing (deleting) a file is done by
269	decrementing the
270	link-count of the i-node specified by its directory entry
271	and erasing the directory entry.
272	If the link-count drops to 0,
273	any disk blocks in the file
274	are freed and the i-node is de-allocated.
275	.PP
276	The space on all disks that
277	contain a file system is divided into a number of
278	512-byte
279	blocks logically addressed from 0 up to a limit that
280	depends on the device.
281	There is space in the i-node of each file for 13 device addresses.
282	For nonspecial files,
283	the first 10 device addresses point at the first
284	10 blocks of the file.
285	If the file is larger than 10 blocks,
286	the 11 device address points to an
287	indirect block containing up to 128 addresses
288	of additional blocks in the file.
289	Still larger files use the twelfth device address
290	of the i-node to point to
291	a double-indirect block naming
292	128 indirect blocks,
293	each
294	pointing to 128 blocks of the file.
295	If required,
296	the thirteenth device address is
297	a triple-indirect block.
298	Thus files may conceptually grow to
299	[\\|(10+128+128\u\s62\s0\d+128\u\s63\s0\d)\*m512\\|] bytes.
300	Once opened,
301	bytes numbered below 5120 can be read with a single
302	disk access;
303	bytes in the range 5120 to 70,656
304	require two accesses;
305	bytes in the range 70,656
306	to 8,459,264
307	require three accesses;
308	bytes from there to the
309	largest file
310	(1,082,201,088)
311	require four accesses.
312	In practice,
313	a device cache mechanism
314	(see below)
315	proves effective in eliminating
316	most of the indirect fetches.
317	.PP
318	The foregoing discussion applies to ordinary files.
319	When an I/O request is made to a file whose i-node indicates that it
320	is special,
321	the last 12 device address words are immaterial,
322	and the first specifies
323	an internal
324	.IT "device name" ,
325	which is interpreted as a pair of numbers
326	representing,
327	respectively, a device type
328	and subdevice number.
329	The device type indicates which
330	system routine will deal with I/O on that device;
331	the subdevice number selects, for example, a disk drive
332	attached to a particular controller or one of several
333	similar terminal interfaces.
334	.PP
335	In this environment, the implementation of the
336	.UL mount
337	system call (Section 3.4) is quite straightforward.
338	.UL \&mount
339	maintains a system table whose
340	argument is the i-number and device name of the
341	ordinary file specified
342	during the
343	.UL mount ,
344	and whose corresponding value is the
345	device name of the indicated special file.
346	This table is searched for each i-number/device pair
347	that turns up while a path name is being scanned
348	during an
349	.UL open
350	or
351	.UL create ;
352	if a match is found,
353	the i-number is replaced by the i-number of the root
354	directory
355	and the device name is replaced by the table value.
356	.PP
357	To the user, both reading and writing of files appear to
358	be synchronous and unbuffered.
359	That is, immediately after
360	return from a
361	.UL read
362	call the data are available; conversely,
363	after a
364	.UL write
365	the user's workspace may be reused.
366	In fact, the system maintains a rather complicated
367	buffering mechanism that reduces greatly the number
368	of I/O operations required to access a file.
369	Suppose a
370	.UL write
371	call is made specifying transmission
372	of a single byte.
373	The system
374	will search its buffers to see
375	whether the affected disk block currently resides in main memory;
376	if not, it will be read in from the device.
377	Then the affected byte is replaced in the buffer and an
378	entry is made in a list of blocks to be written.
379	The return from the
380	.UL write
381	call may then take place,
382	although the actual I/O may not be completed until a later time.
383	Conversely, if a single byte is read, the system determines
384	whether the secondary storage block in which the byte is located is already
385	in one of the system's buffers; if so, the byte can be returned immediately.
386	If not, the block is read into a buffer and the byte picked out.
387	.PP
388	The system recognizes when
389	a program has
390	made accesses to
391	sequential blocks of a file,
392	and asynchronously
393	pre-reads the next block.
394	This significantly reduces
395	the running time of most programs
396	while adding little to
397	system overhead.
398	.PP
399	A program that reads or writes files in units of 512 bytes
400	has an advantage over a program that reads or writes
401	a single byte at a time, but the gain is not immense;
402	it comes mainly from the avoidance of system overhead.
403	If a program is used rarely or does
404	no great volume of I/O, it may quite reasonably
405	read and write in units as small as it wishes.
406	.PP
407	The notion of the i-list is an unusual feature
408	of
409	.UX .
410	In practice, this method of organizing the file system
411	has proved quite reliable and easy to deal with.
412	To the system itself, one of its strengths is
413	the fact that each file has a short, unambiguous name
414	related in a simple way to the protection, addressing,
415	and other information needed to access the file.
416	It also permits a quite simple and rapid
417	algorithm for checking the consistency of a file system,
418	for example, verification
419	that the portions of each device containing useful information
420	and those free to be allocated are disjoint and together
421	exhaust the space on the device.
422	This algorithm is independent
423	of the directory hierarchy, because it need only scan
424	the linearly organized i-list.
425	At the same time the notion of the i-list induces certain
426	peculiarities not found in other file system organizations.
427	For example, there is the question of who is to be charged
428	for the space a file occupies,
429	because all directory entries for a file have equal status.
430	Charging the owner of a file is unfair in general,
431	for one user may create a file, another may link to
432	it, and the first user may delete the file.
433	The first user is still the owner of the
434	file, but it should be charged
435	to the second user.
436	The simplest reasonably fair algorithm
437	seems to be to spread the charges
438	equally among users who have links to a file.
439	Many installations
440	avoid the
441	issue by not charging any fees at all.