.TH CRASH 8 
.SH NAME
crash \- what to do when the system crashes
.SH DESCRIPTION
This section gives at least a few clues about how to proceed if the
system crashes.
It can't pretend to be complete.
.PP
.I Bringing it back up.
If the reason for the crash is not evident
(see below for guidance on `evident')
you may want to try to dump the system if you feel up to
debugging.
At the moment a dump can be taken only on magtape.
With a tape mounted and ready,
stop the machine, load address 44, and start.
This should write a copy of all of core
on the tape with an EOF mark.
Caution:
Any error is taken to mean the end of core has been reached.
This means that you must be sure the ring is in,
the tape is ready, and the tape is clean and new.
If the dump fails, you can try again,
but some of the registers will be lost.
See below for what to do with the tape.
.PP
In restarting after a crash,
always bring up the system single-user.
This is accomplished by following the directions in
.IR boot (8)
as modified for your particular installation;
a single-user system is indicated by having a particular value
in the switches (173030 unless you've changed
.IR init )
as the system starts executing.
When it is running,
perform a
.I dcheck
and
.IR  icheck (1)
on all file systems which could have been in use at the time
of the crash.
If any serious file system problems are found, they should be repaired.
When you are satisfied with the health of your disks,
check and set the date if necessary,
then come up multi-user.
This is most easily accomplished by changing the
single-user value in the switches to something else,
then logging out
by typing an EOT.
.PP
To even boot \s8UNIX\s10 at all,
three files (and the directories leading to them)
must be intact.
First,
the initialization program
.I /etc/init
must be present and executable.
If it is not,
the CPU will loop in user mode at location 6.
For
.I init
to work correctly,
.I /dev/tty8
and
.I /bin/sh
must be present.
If either does not exist,
the symptom is best described
as thrashing.
.I Init
will go into a
.I fork/exec
loop trying to create a
Shell with proper standard input and output.
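.PP
The three prerequisites above can be checked mechanically.
What follows is only a sketch in Python (obviously not something
that ran on the machines this page describes); the function name and the
.I root
parameter are invented for illustration, so that a damaged root
mounted elsewhere can be inspected.
.nf
```python
import os

# The three paths named in the text above.
REQUIRED = ["/etc/init", "/dev/tty8", "/bin/sh"]


def missing_boot_files(root="/"):
    """Return the required paths that would keep the system from booting.

    root is a hypothetical parameter: the directory under which the
    file system being inspected is mounted.
    """
    problems = []
    for path in REQUIRED:
        full = os.path.join(root, path.lstrip("/"))
        if not os.path.exists(full):
            problems.append(path)
        elif path == "/etc/init" and not os.access(full, os.X_OK):
            # init must be executable as well as present
            problems.append(path)
    return problems
```
.fi
.PP
An empty list means the three files are in place;
it says nothing about whether their contents are sane.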
.PP
If you cannot get the system to boot,
a runnable system must be obtained from
a backup medium.
The root file system may then be doctored as
a mounted file system as described below.
If there are any problems with the root
file system,
it is probably prudent to go to a
backup system to avoid working on a
mounted file system.
.PP
.I Repairing disks.
The first rule to keep in mind is that an addled disk
should be treated gently;
it shouldn't be mounted unless necessary,
and if it is very valuable yet
in quite bad shape, perhaps it should be dumped before
trying surgery on it.
This is an area where experience and informed courage count for much.
.PP
The problems reported by
.I icheck
typically fall into two kinds.
The first kind is
problems with the free list:
duplicates in the free list, or free blocks also in files.
These can be cured easily with an
.IR "icheck \-s" .
The second kind involves the files themselves:
if the same block appears in more than one file
or if a file contains bad blocks,
the files should be deleted, and the free list reconstructed.
The best way to delete such a file is to use
.IR  clri (1),
then remove its directory entries.
If any of the affected files is really precious,
you can try to copy it to another device
first.
.PP
.I Dcheck
may report files which
have more directory entries than links.
Such situations are potentially dangerous;
.I clri
discusses a special case of the problem.
All the directory entries for the file should be removed.
If on the other hand there are more links than directory entries,
there is no danger of spreading infection;
some disk space is merely lost for use.
It is sufficient to copy the file (if it has any entries and is useful)
then use
.I clri
on its inode and remove any directory
entries that do exist.
.PP
Finally,
there may be inodes reported by
.I dcheck
that have 0 links and 0 entries.
These occur on the root device when the system is stopped
with pipes open, and on other file systems when the system
stops with files that have been deleted while still open.
A
.I clri
will free the inode, and an
.I icheck \-s
will
recover any missing blocks.
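.PP
The three
.I dcheck
cases just described reduce to a comparison of two counts.
The sketch below, in Python, is only a restatement of that advice;
the function name and the wording of the returned strings are invented here
and are not part of
.I dcheck
itself.
.nf
```python
def dcheck_remedy(links, entries):
    """Map an inode's link count and directory-entry count to the advice above."""
    if links == 0 and entries == 0:
        # e.g. pipes open, or files deleted while still open, at the crash
        return "clri the inode, then icheck -s"
    if entries > links:
        # the potentially dangerous case
        return "remove all directory entries for the file"
    if links > entries:
        # harmless, but the space is lost until the inode is cleared
        return "copy the file if useful, clri, remove remaining entries"
    return "consistent"
```
.fi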
.PP
.I Why did it crash?
UNIX types a message
on the console typewriter when it voluntarily crashes.
Here is the current list of such messages,
with enough information to provide
a hope at least of the remedy.
The message has the form `panic: ...',
possibly accompanied by other information.
Left unstated in all cases
is the possibility that hardware or software
error produced the message in some unexpected way.
.HP 5
blkdev
.br
The
.I getblk
routine was called with a nonexistent major device as argument.
Definitely hardware or software error.
.HP 5
devtab
.br
Null device table entry for the major device used as argument to
.IR getblk .
Definitely hardware or software error.
.HP 5
iinit
.br
An I/O error reading the super-block for the root file system
during initialization.
.HP 5
out of inodes
.br
A mounted file system has no more inodes when creating a file.
Sorry, the name of the device isn't available in the message;
.I icheck
should tell you which file system is full.
.HP 5
no fs
.br
A device has disappeared from the mounted-device table.
Definitely hardware or software error.
.HP 5
no imt
.br
Like `no fs', but produced elsewhere.
.HP 5
no inodes
.br
The in-core inode table is full.
Try increasing NINODE in param.h.
Shouldn't be a panic, just a user error.
.HP 5
no clock
.br
During initialization,
neither the line nor programmable clock was found to exist.
.HP 5
swap error
.br
An unrecoverable I/O error during a swap.
Really shouldn't be a panic,
but it is hard to fix.
.HP 5
unlink \- iget
.br
The directory containing a file being deleted can't be found.
Hardware or software.
.HP 5
out of swap space
.br
A program needs to be swapped out, and there is no more swap space.
It has to be increased.
This really shouldn't be a panic, but there is no easy fix.
.HP 5
out of text
.br
A pure procedure program is being executed,
and the table for such things is full.
This shouldn't be a panic.
.HP 5
trap
.br
An unexpected trap has occurred within the system.
This is accompanied by three numbers:
a `ka6', which is the contents of the segmentation
register for the area in which the system's stack is kept;
`aps', which is the location where the hardware stored
the program status word during the trap;
and a `trap type' which encodes
which trap occurred.
The trap types are:
.TP 10
0
bus error
.br
.ns
.TP 10
1
illegal instruction
.br
.ns
.TP 10
2
BPT/trace
.br
.ns
.TP 10
3
IOT
.br
.ns
.TP 10
4
power fail
.br
.ns
.TP 10
5
EMT
.br
.ns
.TP 10
6
recursive system call (TRAP instruction)
.br
.ns
.TP 10
7
11/70 cache parity, or programmed interrupt
.br
.ns
.TP 10
10
floating point trap
.br
.ns
.TP 10
11
segmentation violation
.PP
In some of these cases it is
possible for octal 20 to be added into the trap type;
this indicates that the processor was in user mode when the trap occurred.
If you wish to examine the stack after such a trap,
either dump the system, or use the console switches to examine core;
the required address mapping is described below.
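.PP
Decoding the trap type is a matter of separating the octal 20
user-mode flag from the trap number in the table above.
A minimal sketch in Python follows;
the names
.I TRAP_NAMES
and
.I decode_trap
are invented for this illustration.
.nf
```python
# Trap numbers are octal, as in the table above.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(code):
    """Split a printed trap type into (description, was_user_mode)."""
    user = bool(code & USER_MODE)
    kind = code & ~USER_MODE
    return TRAP_NAMES.get(kind, "unknown"), user
```
.fi
.PP
With these definitions, decode_trap(0o31) returns
the pair ('segmentation violation', True).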
.PP
.I Interpreting dumps.
All file system problems
should be taken care of before attempting to look at dumps.
The dump should be read into the file
.IR /usr/sys/core ;
.IR  cp (1)
will do.
At this point, you should execute
.I ps \-alxk
and
.I who
to print the process table and the users who were on
at the time of the crash.
You should dump
.RI ( od (1))
the first 30 bytes of
.IR /usr/sys/core .
Starting at location 4,
the registers R0, R1, R2, R3, R4, R5, SP
and KDSA6 (KISA6 for 11/40s) are stored.
If the dump had to be restarted,
R0 will not be correct.
Next, take the value of KA6 (location 022(8) in the dump)
multiplied by 0100(8) and dump 02000(8) bytes starting from there.
This is the per-process data associated with the process running
at the time of the crash.
Relabel
the addresses 140000 to 141776.
R5 is C's frame or display pointer.
Stored at (R5) is the old R5 pointing to the previous
stack frame.
At (R5)+2
is the saved PC of the calling procedure.
Trace
this calling chain until
you obtain an R5 value of 141756, which
is where the user's R5 is stored.
If the chain is broken,
you have to look for a plausible
R5, PC pair and continue from there.
Each PC should be looked up in the system's name list
using
.IR  adb (1)
and its `:' command,
to get a reverse calling order.
In most cases this procedure will give
an idea of what is wrong.
A more complete discussion
of system debugging is impossible here.
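.PP
The arithmetic of the procedure above can be sketched in Python.
This is an illustration under stated assumptions, not a replacement for
.IR od (1)
and
.IR adb (1):
it assumes the dump has been read into memory as bytes,
that words are 16 bits stored low byte first as on the PDP-11,
and that the per-process area is 02000(8) bytes long,
which is what the address range 140000 to 141776 spans.
All function and constant names are invented here.
.nf
```python
import struct

REG_NAMES = ["R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6"]
U_VIRT = 0o140000   # address the per-process area is relabelled to
U_SIZE = 0o2000     # assumed length in bytes (140000 through 141776)
USER_R5 = 0o141756  # where the user's R5 is stored; end of the chain


def read_registers(core):
    """Unpack the eight words saved starting at location 4 of the dump."""
    values = struct.unpack_from("<8H", core, 4)
    return dict(zip(REG_NAMES, values))


def u_area(core, ka6):
    """Slice out the per-process data; KA6 times 0100 is its byte offset."""
    start = ka6 * 0o100
    return core[start:start + U_SIZE]


def walk_frames(area, r5):
    """Follow the saved (R5, PC) pairs; return the saved PCs, newest first."""
    pcs = []
    seen = set()
    while r5 != USER_R5:
        off = r5 - U_VIRT
        if off < 0 or off + 4 > len(area) or r5 in seen:
            break  # broken chain: a plausible pair must be found by hand
        seen.add(r5)
        old_r5, pc = struct.unpack_from("<2H", area, off)
        pcs.append(pc)
        r5 = old_r5
    return pcs
```
.fi
.PP
Each value returned by walk_frames would still have to be looked up
in the name list by hand to recover the calling order.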
.SH SEE ALSO
clri(1), icheck(1), dcheck(1), boot(8)