.TH CRASH 8 VAX-11
.UC
.tr |
.SH NAME
crash \- what to do when the system crashes
.SH DESCRIPTION
This section gives at least a few clues about how to proceed if the
system crashes.
It can't pretend to be complete.
.PP
.I "What to do first.||"
(Someday the LSI-11 will do this automatically.)
If the reason for the crash is not evident
(see below for guidance on `evident')
you may want to try to dump the system if you feel up to
debugging.
At the moment a dump can be taken only on magnetic tape.
Before you do anything, be sure that a clean tape with its write ring in
is mounted on the tape drive if you plan to make a dump.
.PP
Write the date and time on the console log.
Use the console commands to examine the registers, program status long word,
and the top several locations on the stack.
A suggested command sequence, which is executed by the ``@DUMP''
console command script, is:
.DS
.nf
	E PSL<return>
	E R0/NE:F<return>
	E SP<return>
	E/V @ /NE:40<return>
.fi
.DE
If hardware problems dictate a special set of commands be executed when
the system crashes, a sequence of commands can be saved using the console
command ``LINK'' to be reexecuted with ``PERFORM'' (which can be
abbreviated ``P'').
If a dump is to be taken on magnetic tape (this is a good idea
in almost any case where the cause of the crash is not immediately obvious)
then the following commands should be executed:
.DS
.nf
	D PSL 0<return>
	D PC 80000200<return>
	C<return>
.fi
.DE
These commands are actually part of the standard ``@DUMP'' script.
This should write a copy of all of memory
on the tape, followed by two EOF marks.
Caution:
Any error is taken to mean the end of memory has been reached.
This means that you must be sure the ring is in,
the tape is ready, and the tape is clean and new.
.PP
If there are not 40 locations active on the kernel stack when the
procedure is begun, then the console may begin to print error diagnostics.
You can stop this by hitting ``^C'' (control-C), and then give the
last three commands above.
.PP
If the dump fails, you can try again,
but some of the registers will be lost.
See below for what to do with the tape.
.PP
.I "How to bring it back up.||"
To restart after a crash, follow the directions in
.IR bproc (8);
if the virtual memory subsystem is suspected as the cause of the crash,
then a version of the system other than ``vmunix'' should be booted;
this version of the system will leave the paging areas temporarily intact
for use by the post-mortem analysis program
.I analyze.
On Ernie Co-vax at UCB, the backup system is ``unix''.
When this system is running, check its root file system (currently
.I /dev/rrm0a
but this is likely to change), and then read the core tape into the
file
.I /vmcore,
with the command:
.IP
cp /dev/mt0 /vmcore
.LP
With the system still in single-user mode, run the analysis program
.I analyze,
i.e.:
.IP
analyze \-s /dev/drum /vmcore /vmunix
.LP
and save the output.
Be sure to boot up
``vmunix''
before coming up multi-user.
.PP
Do a
.I sync,
and proceed to check and fix all file systems,
performing a
.I dcheck
and
.IR  icheck (1)
on all file systems which could have been in use at the time
of the crash.
If any serious file system problems are found, they should be repaired.
When you are satisfied with the health of your disks,
log out by typing an EOT (<control-D>).
The command sequence in /etc/rc will be executed and the system will
be in multi-user mode.
.PP
For \s8UNIX\s10 to boot at all,
three files (and the directories leading to them)
must be intact.
First,
the initialization program
.I /etc/init.vm
must be present and executable.
If it is not,
then the cpu will loop at location 0x13.
For
.I init.vm
to work correctly,
.I /dev/console
and
.I /bin/sh
must be present.
If either does not exist,
the symptom is best described
as thrashing.
.I Init
will go into a
.I fork/exec
loop trying to create a
Shell with proper standard input and output.
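.PP
When the damaged root has been mounted from a backup system,
the three files named above can be checked mechanically.
The following is only a sketch of such a check, not part of the system;
the function name and the idea of passing in a mount point are assumptions
made for illustration.
.DS
.nf
```python
import os

# Files the text above says must be intact for the system to boot.
# Paths are relative to a (hypothetical) mount point for the damaged root.
REQUIRED = ["etc/init.vm", "dev/console", "bin/sh"]

def missing_boot_files(root):
    """Return a list of problems found under root; empty if none."""
    problems = []
    for rel in REQUIRED:
        path = os.path.join(root, rel)
        if not os.path.lexists(path):
            problems.append(rel + ": missing")
        elif rel == "etc/init.vm" and not os.access(path, os.X_OK):
            # init.vm must be executable as well as present.
            problems.append(rel + ": not executable")
    return problems
```
.fi
.DE
An empty list means only that the three files exist;
it says nothing about whether their contents are sound.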
.PP
If you cannot get the system to boot,
a runnable system must be obtained from
a backup medium.
The root file system may then be doctored as
a mounted file system as described below.
If there are any problems with the root
file system,
it is probably prudent to go to a
backup system to avoid working on a
mounted file system.
.PP
.I "Repairing disks.||"
The first rule to keep in mind is that an addled disk
should be treated gently;
it shouldn't be mounted unless necessary,
and if it is very valuable yet
in quite bad shape, perhaps it should be dumped before
trying surgery on it.
This is an area where experience and informed courage count for much.
.PP
The problems reported by
.I icheck
typically fall into two kinds.
There can be
problems with the free list:
duplicates in the free list, or free blocks also in files.
These can be cured easily with an
.I "icheck \-s."
If the same block appears in more than one file
or if a file contains bad blocks,
the files should be deleted, and the free list reconstructed.
The best way to delete such a file is to use
.IR  clri (1),
then remove its directory entries with
.IR rm (1).
(Do not use
.IR mv (1).)
If any of the affected files is really precious,
you can try to copy it to another device
first.
.PP
.I Dcheck
may report files which
have more directory entries than links.
Such situations are potentially dangerous;
.I clri
discusses a special case of the problem.
All the directory entries for the file should be removed.
If on the other hand there are more links than directory entries,
there is no danger of spreading infection, but merely some disk space
that is lost for use.
It is sufficient to copy the file (if it has any entries and is useful)
then use
.I clri
on its inode and remove any directory
entries that do exist.
.PP
Finally,
there may be inodes reported by
.I dcheck
that have 0 links and 0 entries.
These occur on the root device when the system is stopped
with pipes open, and on other file systems when the system
stops with files that have been deleted while still open.
A
.I clri
will free the inode, and an
.I "icheck -s"
will
recover any missing blocks.
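.PP
The three cases just described for a file reported by dcheck
can be summarized as a small decision function.
This is a sketch for illustration only;
the function name and the wording of the returned strings are invented here,
and the advice is simply that given in the text.
.DS
.nf
```python
def dcheck_advice(links, entries):
    """Map an inode's link count and directory entry count to the
    repair suggested in the text above."""
    if links == 0 and entries == 0:
        # Stopped with pipes open, or files deleted while still open.
        return "clri the inode, then icheck -s to recover blocks"
    if entries > links:
        # Potentially dangerous case.
        return "dangerous: remove all directory entries for the file"
    if links > entries:
        # Harmless, but the space is lost until repaired.
        return "lost space: copy if useful, clri, remove remaining entries"
    return "consistent: nothing to do"
```
.fi
.DE
For example, an inode with 1 link and 2 directory entries falls in the
dangerous case, while one with 2 links and 1 entry merely wastes space.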
.PP
.I "Why did it crash?||"
UNIX types a message
on the console typewriter when it voluntarily crashes.
Here are some of the possible panic messages,
with enough information to provide
a hope at least of the remedy.
The message has the form `panic: ...',
`Trap from kernel mode', or `ILL I/E VEC'
(possibly accompanied by other information).
In rare cases the system will ``panic'' but the console message
will not appear; if this happens, you can trace the message easily
through the variable
.I panicstr
in the system.
Left unstated in all cases
is the possibility that hardware or software
error produced the message in some unexpected way.
.HP 5
blkdev
.br
The
.I getblk
routine was called with a nonexistent major device as argument.
Definitely hardware or software error.
.HP 5
devtab
.br
Null device table entry for the major device used as argument to
.I getblk.
Definitely hardware or software error.
.HP 5
iinit
.br
An I/O error reading the super-block for the root file system
during initialization.
.HP 5
out of inodes
.br
A mounted file system has no more i-nodes when creating a file.
Sorry, the device isn't available;
the
.I icheck
should tell you.
.HP 5
no fs
.br
A device has disappeared from the mounted-device table.
Definitely hardware or software error.
.HP 5
no imt
.br
Like `no fs', but produced elsewhere.
.HP 5
no inodes
.br
The in-memory inode table is full.
Try increasing NINODE in param.h.
Shouldn't be a panic, just a user error.
.HP 5
IO error in swap
.br
An unrecoverable I/O error during a swap.
Really shouldn't be a panic.
.HP 5
out of swap
.br
A program needs to be swapped out, and there is no more swap space.
It has to be increased.
This really shouldn't be a panic.
.HP 5
out of text
.br
A pure procedure program is being executed,
and the table for such things is full.
This shouldn't be a panic.
.HP 5
trap from kernel mode
.br
An unexpected trap has occurred within the system.
The trap type can be determined by examining the top word of the
stack (the trap type) with the console commands.
The trap types are:
.TP 10
0
reserved addressing mode
.br
.ns
.TP 10
1
privileged instruction
.br
.ns
.TP 10
2
BPT
.br
.ns
.TP 10
3
XFC
.br
.ns
.TP 10
4
reserved operand
.br
.ns
.TP 10
5
CHMK (system call)
.br
.ns
.TP 10
6
arithmetic trap
.br
.ns
.TP 10
7
reschedule trap (software level 3)
.br
.ns
.TP 10
8
segmentation fault
.br
.ns
.TP 10
9
protection fault
.br
.ns
.TP 10
10
trace pending (TP bit)
.HP 5
ILL I/E VEC, HALTED AT xx
.br
An illegal interrupt or exception has occurred.  The possible addresses are:
.ns
.TP 10
4
machine check (hardware error)
.br
.ns
.TP 10
8
kernel stack not valid
.br
.ns
.TP 10
C
power failure
.PP
In some of these cases it is
possible for octal 20 to be added into the trap type;
this indicates that the processor was in user mode when the trap occurred.
If you wish to examine the stack after such a trap,
either dump the system, or use the console to examine memory;
the required address mapping is described below.
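.PP
Decoding the trap type word read from the stack can be sketched as follows.
The table is the list of trap types given above;
the names used in the code are invented for illustration,
and the user-mode indication is treated as a single bit,
which is equivalent to adding octal 20 because every trap type is less than 16.
.DS
.nf
```python
# Trap types as listed in the text above.
TRAP_TYPES = {
    0: "reserved addressing mode",
    1: "privileged instruction",
    2: "BPT",
    3: "XFC",
    4: "reserved operand",
    5: "CHMK (system call)",
    6: "arithmetic trap",
    7: "reschedule trap (software level 3)",
    8: "segmentation fault",
    9: "protection fault",
    10: "trace pending (TP bit)",
}

USER_MODE = 0o20  # octal 20, present when the trap came from user mode

def decode_trap(code):
    """Split a trap type word into (description, from_user_mode)."""
    from_user = bool(code & USER_MODE)
    base = code & ~USER_MODE
    return TRAP_TYPES.get(base, "unknown trap type"), from_user
```
.fi
.DE
A value of octal 30, for instance, is a segmentation fault taken in user mode.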
.PP
There are also a large number of panics if internal consistency
checks in the paging subsystem fail.  These can be caused by hardware
(e.g. if disk or tape problems cause a data structure to be mutilated)
but are most often caused by software problems.
Refer to a system listing to locate these and other panics not discussed above.
.PP
.I "Interpreting dumps.||"
All file system problems
should be taken care of before attempting to analyze dumps.
As mentioned above, the dump tape should be read into the file
.IR /vmcore ;
.IR  cp (1)
will do.
At this point, you should execute
.I "ps \-alxk"
and
.I who
to print the process table and the users who were on
at the time of the crash.
Use
.IR adb (1)
to examine
.IR /vmcore .
The location
.I dumpstack\-80000000
is the bottom of a stack onto which were pushed the stack pointer
.BR sp ,
.B PCBB
(containing the physical address of a
.IR u_area ),
.BR MAPEN ,
.BR IPL ,
and registers
.BR r13 \- r0
(in that order).
.BR r13 (fp)
is the system frame pointer and the stack is used in standard
.B calls
format.  Use
.IR  adb (1)
to get a reverse calling order.
In most cases this procedure will give
an idea of what is wrong.
A more complete discussion
of system debugging is impossible here.
See, however,
.IR analyze (1m)
for some more hints.
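.PP
The subtraction of 80000000 (hex) used above to locate dumpstack
applies to any kernel address looked up in the dump.
The sketch below shows the arithmetic and how a longword would be read;
it assumes the dump file begins at physical address 0 and that
system space is mapped one-for-one starting at 80000000,
which is what the text implies but does not state.
The function names are invented for illustration,
and the address of dumpstack itself must be taken from the system's symbol table.
.DS
.nf
```python
import struct

KERNEL_BASE = 0x80000000  # start of system space (assumed mapping)

def core_offset(kernel_address):
    """Offset in the dump file of a kernel virtual address."""
    if kernel_address < KERNEL_BASE:
        raise ValueError("not a system-space address")
    return kernel_address - KERNEL_BASE

def read_longword(core, kernel_address):
    """Read one 32-bit longword from the dump bytes.
    The VAX stores longwords least significant byte first."""
    off = core_offset(kernel_address)
    return struct.unpack_from("<L", core, off)[0]
```
.fi
.DE
Here core stands for the contents of the dump file read in as bytes.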
.SH "SEE ALSO"
analyze(1m), clri(1), icheck(1), dcheck(1), bproc(8)
.br
.I "VAX 11/780 System Maintenance Guide"
for more information about machine checks.
.SH BUGS