.TH CRASH 8 VAX-11
.UC
.tr |
.SH NAME
crash \- what to do when the system crashes
.SH DESCRIPTION
This section gives at least a few clues about how to proceed if the
system crashes.
It can't pretend to be complete.
.PP
.I "What to do first.||"
(Someday the LSI-11 will do this automatically.)
If the reason for the crash is not evident
(see below for guidance on `evident')
you may want to try to dump the system if you feel up to
debugging.
At the moment a dump can be taken only on magnetic tape.
If you plan to make a dump, be sure before you do anything else
that a clean tape with its write ring in is mounted on the tape drive.
.PP
Write the date and time on the console log.
Use the console commands to examine the registers, processor status longword,
and the top several locations on the stack.
A suggested command sequence, which is executed by the ``@DUMP''
console command script, is:
.DS
.nf
 E PSL<return>
 E R0/NE:F<return>
 E SP<return>
 E/V @ /NE:40<return>
.fi
.DE
If hardware problems dictate that a special set of commands be executed when
the system crashes, a sequence of commands can be saved using the console
command ``LINK'' to be reexecuted with ``PERFORM'' (which can be
abbreviated ``P'').
If a dump is to be taken on magnetic tape (this is a good idea
in almost any case where the cause of the crash is not immediately obvious)
then the following commands will (should) be executed:
.DS
.nf
 D PSL 0<return>
 D PC 80000200<return>
 C<return>
.fi
.DE
These commands are actually part of the standard ``@DUMP'' script.
This should write a copy of all of memory
on the tape, followed by two EOF marks.
Caution:
Any error is taken to mean the end of memory has been reached.
This means that you must be sure the ring is in,
the tape is ready, and the tape is clean and new.
.PP
If there are not 40 locations active on the kernel stack when the
procedure is begun, then the console may begin to print error diagnostics.
You can stop this by hitting ``^C'' (control-C), and then giving the
last three commands above.
.PP
If the dump fails, you can try again,
but some of the registers will be lost.
See below for what to do with the tape.
.PP
.I "How to bring it back up.||"
To restart after a crash, follow the directions in
.IR bproc (8);
if the virtual memory subsystem is suspected as the cause of the crash,
then a version of the system other than ``vmunix'' should be booted;
this version of the system will leave the paging areas temporarily intact
for use by the post-mortem analysis program
.I analyze.
On Ernie Co-vax at UCB, the backup system is ``unix''.
When this system is running, check its root file system (currently
.I /dev/rrm0a
but this is likely to change), and then read the core tape into the
file
.I /vmcore,
with the command:
.IP
cp /dev/mt0 /vmcore
.LP
With the system still in single-user mode, run the analysis program
.I analyze,
i.e.:
.IP
analyze \-s /dev/drum /vmcore /vmunix
.LP
and save the output.
Be sure to boot up
``vmunix''
before coming up multi-user.
.PP
Do a
.I sync,
and proceed to check and fix all file systems,
performing a
.I dcheck
and
.IR icheck (1)
on all file systems which could have been in use at the time
of the crash.
If any serious file system problems are found, they should be repaired.
When you are satisfied with the health of your disks,
log out by typing an EOT (<control-D>).
The command sequence in /etc/rc will be executed and the system will
be in multi-user mode.
.PP
To boot \s8UNIX\s10 at all,
three files (and the directories leading to them)
must be intact.
First,
the initialization program
.I /etc/init.vm
must be present and executable.
If it is not,
then the CPU will loop at location 0x13.
For
.I init.vm
to work correctly,
.I /dev/console
and
.I /bin/sh
must be present.
If either does not exist,
the symptom is best described
as thrashing.
.I Init
will go into a
.I fork/exec
loop trying to create a
shell with proper standard input and output.
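.PP
The three requirements above can be checked mechanically, for instance
against a root file system mounted from a backup system.
The following is only a sketch in Python, not part of the system:
the function name and the
.I root
parameter are assumptions made for illustration;
only the three path names come from the text.
.nf

```python
import os
import stat

# The three files named in the text, relative to the root being examined.
REQUIRED = ("etc/init.vm", "dev/console", "bin/sh")


def missing_boot_files(root="/"):
    """Return a list of problems that would keep the system from booting.

    root is a hypothetical parameter: the directory where the root file
    system under examination is mounted.
    """
    problems = []
    for rel in REQUIRED:
        path = os.path.join(root, rel)
        if not os.path.exists(path):
            problems.append(rel + ": missing")
        elif rel == "etc/init.vm" and not os.stat(path).st_mode & stat.S_IXUSR:
            # Only the init program is required to be executable.
            problems.append(rel + ": not executable")
    return problems
```

.fi
An empty list means the three files are in place;
it says nothing about whether their contents are sound.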
.PP
If you cannot get the system to boot,
a runnable system must be obtained from
a backup medium.
The root file system may then be doctored as
a mounted file system as described below.
If there are any problems with the root
file system,
it is probably prudent to go to a
backup system to avoid working on a
mounted file system.
.PP
.I "Repairing disks.||"
The first rule to keep in mind is that an addled disk
should be treated gently;
it shouldn't be mounted unless necessary,
and if it is very valuable yet
in quite bad shape, perhaps it should be dumped before
trying surgery on it.
This is an area where experience and informed courage count for much.
.PP
The problems reported by
.I icheck
are typically of two kinds.
There can be
problems with the free list:
duplicates in the free list, or free blocks also in files.
These can be cured easily with an
.I "icheck \-s."
If the same block appears in more than one file
or if a file contains bad blocks,
the files should be deleted, and the free list reconstructed.
The best way to delete such a file is to use
.IR clri (1),
then remove its directory entries with
.IR rm (1).
(Do not use
.IR mv (1).)
If any of the affected files is really precious,
you can try to copy it to another device
first.
.PP
.I Dcheck
may report files which
have more directory entries than links.
Such situations are potentially dangerous;
.I clri
discusses a special case of the problem.
All the directory entries for the file should be removed.
If on the other hand there are more links than directory entries,
there is no danger of spreading infection, but merely some disk space
that is lost for use.
It is sufficient to copy the file (if it has any entries and is useful)
then use
.I clri
on its inode and remove any directory
entries that do exist.
.PP
Finally,
there may be inodes reported by
.I dcheck
that have 0 links and 0 entries.
These occur on the root device when the system is stopped
with pipes open, and on other file systems when the system
stops with files that have been deleted while still open.
A
.I clri
will free the inode, and an
.I "icheck \-s"
will
recover any missing blocks.
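.PP
The three
.I dcheck
cases above amount to a small decision table.
As a rough sketch in Python (the function name and the wording of the
returned advice are illustrative, not output of any real program):
.nf

```python
def dcheck_advice(links, entries):
    """Map an inode's link count and directory-entry count to the advice above."""
    if links == 0 and entries == 0:
        # Left over from open pipes or files deleted while still open.
        return "clri the inode, then icheck -s to recover any missing blocks"
    if entries > links:
        # The potentially dangerous case.
        return "remove all directory entries for the file"
    if links > entries:
        # Harmless, but the disk space is lost until repaired.
        return "copy the file if useful, clri the inode, remove remaining entries"
    return "consistent, nothing to do"
```

.fi
For example, an inode with 1 link and 2 directory entries falls in the
dangerous case, while one with 2 links and 1 entry merely wastes space.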
.PP
.I "Why did it crash?||"
UNIX types a message
on the console typewriter when it voluntarily crashes.
Here are some of the possible panic messages,
with enough information to provide
a hope at least of the remedy.
The message has the form `panic: ...',
`Trap from kernel mode', or `ILL I/E VEC'
(possibly accompanied by other information).
In rare cases the system will ``panic'' but the console message
will not appear; if this happens, you can trace the message easily
through the variable
.I panicstr
in the system.
Left unstated in all cases
is the possibility that hardware or software
error produced the message in some unexpected way.
.HP 5
blkdev
.br
The
.I getblk
routine was called with a nonexistent major device as argument.
Definitely hardware or software error.
.HP 5
devtab
.br
Null device table entry for the major device used as argument to
.I getblk.
Definitely hardware or software error.
.HP 5
iinit
.br
An I/O error reading the super-block for the root file system
during initialization.
.HP 5
out of inodes
.br
A mounted file system has no more i-nodes when creating a file.
Sorry, the identity of the device isn't available;
the
.I icheck
should tell you.
.HP 5
no fs
.br
A device has disappeared from the mounted-device table.
Definitely hardware or software error.
.HP 5
no imt
.br
Like `no fs', but produced elsewhere.
.HP 5
no inodes
.br
The in-memory inode table is full.
Try increasing NINODE in param.h.
Shouldn't be a panic, just a user error.
.HP 5
IO error in swap
.br
An unrecoverable I/O error during a swap.
Really shouldn't be a panic.
.HP 5
out of swap
.br
A program needs to be swapped out, and there is no more swap space.
It has to be increased.
This really shouldn't be a panic.
.HP 5
out of text
.br
A pure procedure program is being executed,
and the table for such things is full.
This shouldn't be a panic.
.HP 5
trap from kernel mode
.br
An unexpected trap has occurred within the system.
The trap type can be determined by examining the top word of the
stack with the console commands.
The trap types are:
.TP 10
0
reserved addressing mode
.br
.ns
.TP 10
1
privileged instruction
.br
.ns
.TP 10
2
BPT
.br
.ns
.TP 10
3
XFC
.br
.ns
.TP 10
4
reserved operand
.br
.ns
.TP 10
5
CHMK (system call)
.br
.ns
.TP 10
6
arithmetic trap
.br
.ns
.TP 10
7
reschedule trap (software level 3)
.br
.ns
.TP 10
8
segmentation fault
.br
.ns
.TP 10
9
protection fault
.br
.ns
.TP 10
10
trace pending (TP bit)
.HP 5
ILL I/E VEC, HALTED AT xx
.br
An illegal interrupt or exception has occurred.
The possible addresses are:
.ns
.TP 10
4
machine check (hardware error)
.br
.ns
.TP 10
8
kernel stack not valid
.br
.ns
.TP 10
C
power failure
.PP
In some of these cases it is
possible for octal 20 to be added into the trap type;
this indicates that the processor was in user mode when the trap occurred.
If you wish to examine the stack after such a trap,
either dump the system, or use the console to examine memory;
the required address mapping is described below.
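.PP
Decoding a trap type word read from the console can be sketched as follows.
This is an illustrative Python fragment, not code from the system;
the table contents are the ones listed above, while the names
.I decode_trap
and
.I describe_halt
are made up for the example.
Since every trap type is smaller than octal 20,
the added value is treated here as a single flag bit.
.nf

```python
# Trap types as listed in the text.
TRAP_NAMES = {
    0: "reserved addressing mode",
    1: "privileged instruction",
    2: "BPT",
    3: "XFC",
    4: "reserved operand",
    5: "CHMK (system call)",
    6: "arithmetic trap",
    7: "reschedule trap (software level 3)",
    8: "segmentation fault",
    9: "protection fault",
    10: "trace pending (TP bit)",
}

# Octal 20 added into the trap type marks a trap taken from user mode.
USER_MODE = 0o20

# Halt addresses reported with ILL I/E VEC, in hexadecimal.
HALT_ADDRESSES = {
    0x4: "machine check (hardware error)",
    0x8: "kernel stack not valid",
    0xC: "power failure",
}


def decode_trap(word):
    """Return (trap name, True if the processor was in user mode)."""
    from_user = bool(word & USER_MODE)
    code = word & ~USER_MODE
    return TRAP_NAMES.get(code, "unknown trap type"), from_user


def describe_halt(address):
    """Return the meaning of an ILL I/E VEC halt address."""
    return HALT_ADDRESSES.get(address, "not one of the listed addresses")
```

.fi
A trap type of octal 30, for instance, decodes as a segmentation fault
taken in user mode.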
.PP
There are also a large number of panics if internal consistency
checks in the paging subsystem fail. These can be caused by hardware
(e.g. if disk or tape problems cause a data structure to be mutilated)
but are most often caused by software problems.
Refer to a system listing to locate these and other panics not discussed above.
.PP
.I "Interpreting dumps.||"
All file system problems
should be taken care of before attempting to analyze dumps.
As mentioned above, the dump tape should be read into the file
.IR /vmcore ;
.IR cp (1)
will do.
At this point, you should execute
.I "ps \-alxk"
and
.I who
to print the process table and the users who were on
at the time of the crash.
Use
.IR adb (1)
to examine
.IR /vmcore .
The location
.I dumpstack\-80000000
is the bottom of a stack onto which were pushed the stack pointer
.BR sp ,
.B PCBB
(containing the physical address of a
.IR u_area ),
.BR MAPEN ,
.BR IPL ,
and registers
.BR r13 \- r0
(in that order).
.BR r13 (fp)
is the system frame pointer and the stack is used in standard
.B calls
format. Use
.IR adb (1)
to get a reverse calling order.
In most cases this procedure will give
an idea of what is wrong.
A more complete discussion
of system debugging is impossible here.
See, however,
.IR analyze (1)
for some more hints.
409A more complete discussion
410of system debugging is impossible here.
411See, however,
412.IR analyze (1)
413for some more hints.
414.SH "SEE ALSO"
415analyze(1m), clri(1), icheck(1), dcheck(1), bproc(8)
416.br
417.I "VAX 11/780 System Maintenance Guide"
418for more information about machine checks.
419.SH BUGS