+.TH CRASH 8 VAX-11
+.UC 4
+.SH NAME
+crash \- what happens when the system crashes
+.SH DESCRIPTION
+This section explains what happens when the system crashes and how
+you can get a crash dump for analysis of non-transient problems.
+.PP
+When the system crashes voluntarily it prints a message of the form
+.IP
+panic: why i gave up the ghost
+.LP
+on the console, and then invokes an automatic reboot procedure as
+described in
+.IR reboot (8).
+If the auto-reboot switch is off on the console, then the processor
+will simply halt at this point.
+Otherwise the registers and the top few locations of the stack will
+be printed on the console, and then the system will check the disks
+and (unless some unexpected inconsistency is encountered), resume
+multi-user operations.
+.PP
+The system has a large number of internal consistency checks; if one
+of these fails, then it will panic with a very short message indicating
+which one failed. In the absence of a dump, little can be done about
+one of these. If the problem recurs, you should arrange to get a dump
+for further analysis by running with auto-reboot disabled during normal
+working hours and then following the procedure described below.
+.PP
+The most common cause of system failures is hardware failure, which
+can reflect itself in different ways. Here are the messages which
+you are likely to encounter, with some hints as to causes.
+Left unstated in all cases is the possibility that hardware or software
+error produced the message in some unexpected way.
+.TP
+IO err in push
+.ns
+.TP
+hard IO err in swap
+The system encountered an error trying to write to the paging device
+or an error in reading critical information from a disk drive.
+You should fix your disk if it is broken or unreliable.
+.TP
+Timeout table overflow
+.ns
+.TP
+ran out of bdp's
+.ns
+.TP
+ran out of uba map
+These really shouldn't be panics, but until we fix up the data structures
+involved, running out of entries causes a crash. If the timeout table
+overflows, you should make it bigger. If you run out of bdp's or uba map
+you probably have a buggy device driver in your system, allocating and
+not releasing UNIBUS resources.
+.TP
+KSP not valid
+.ns
+.TP
+SBI fault
+.ns
+.TP
+Machine check
+.ns
+.TP
+CHM? in kernel
+These indicate either a serious bug in the system or, more often,
+a glitch or failing hardware. For the machine check, the top part of
+the resulting stack frame gives more information. You can refer to a
+VAX 11/780 System Maintenance Guide for information on machine checks.
+If machine checks or SBI faults recur, check out the hardware or call
+field service. If the other faults recur, there is likely a bug somewhere
+in the system, although these can be caused by a flakey processor.
+Run processor microdiagnostics.
+.TP
+trap type %d, code=%d
+A unexpected trap has occurred within the system; the trap types are:
+.RS
+.TP 10
+0
+reserved addressing mode
+.br
+.ns
+.TP 10
+1
+privileged instruction
+.br
+.ns
+.TP 10
+2
+BPT
+.br
+.ns
+.TP 10
+3
+XFC
+.br
+.ns
+.TP 10
+4
+reserved operand
+.br
+.ns
+.TP 10
+5
+CHMK (system call)
+.br
+.ns
+.TP 10
+6
+arithmetic trap
+.br
+.ns
+.TP 10
+7
+reschedule trap (software level 3)
+.br
+.ns
+.TP 10
+8
+segmentation fault
+.br
+.ns
+.TP 10
+9
+protection fault
+.br
+.ns
+.TP 10
+10
+trace pending (TP bit)
+.RE
+.IP
+The favorite trap type in system crashes is trap type 9, indicating
+a wild reference. The code is the referenced address. If you look
+down the stack, just after the trap type and the code are the pc and
+the ps of the processor when it trapped, showing you where in the
+system the problem occurred. These problems tend to be easy to track
+down if they are kernel bugs since the processor stops cold, but random
+flakiness seems to cause this sometimes, e.g. we have trapped with
+code 80000800 three times in six months as an instruction fetch went across
+this page boundary in the kernel but have been unable to find any reason
+for this to have happened.
+.TP
+init died
+The system initialization process has exited. This is bad news, as no new
+users will then be able to log in. Rebooting is the only fix, so the
+system just does it right away.
+.PP
+That completes the list of panic types you are likely to see.
+Now for the crash dump procedure:
+.PP
+At the moment a dump can be taken only on magnetic tape.
+Before you do anything, be sure that a clean tape is mounted with a ring-in
+on the tape drive if you plan to make a dump.
+.PP
+Write the date and time on the console log.
+Use the console commands to examine the registers, program status long word,
+and the top several locations on the stack.
+A suggested command sequence, which is executed by the \*(lq@DUMP\*(rq
+console command script, is:
+.DS
+.nf
+ E PSL<return>
+ E R0/NE:F<return>
+ E SP<return>
+ E/V @ /NE:40<return>
+.fi
+.DE
+If hardware problems dictate a special set of commands be executed when
+the system crashes, a sequence of commands can be saved using the console
+command \*(lqLINK\*(rq to be reexecuted with \*(lqPERFORM\*(rq (which can be
+abbreviated \*(lqP\*(rq).
+If a dump is to be taken on magnetic tape (this is a good idea
+in most any case where the cause of the crash is not immediately obvious)
+then the following commands will (should) be executed:
+.DS
+.nf
+ D PSL 0<return>
+ D PC 80000200<return>
+ C<return>
+.fi
+.DE
+These commands are actually part of the standard \*(lq@DUMP\*(rq script.
+This should write a copy of all of memory
+on the tape, followed by two EOF marks.
+Caution:
+Any error is taken to mean the end of memory has been reached.
+This means that you must be sure the ring is in,
+the tape is ready, and the tape is clean and new.
+.PP
+If there are not 40(hex) locations active on the kernel stack when the
+procedure is begun, then the console may begin to print error diagnostics.
+You can stop this by hitting \*(lq^C\*(rq (control-C), and then give the
+last three commands above.
+.PP
+If the dump fails, you can try again,
+but some of the registers will be lost.
+See below for what to do with the tape.
+.PP
+To restart after a crash, follow the directions in
+.IR reboot (8);
+if the virtual memory subsystem is suspected as the cause of the crash,
+then a version of the system other than \*(lqvmunix\*(rq should be booted
+which will leave the paging areas temporarily intact
+for use by the post-mortem analysis program
+.I analyze.
+After checking your root file system consistency with
+.IR fsck (8),
+you can read the core dump tape into the file /vmcore with
+.IP
+dd if=/dev/rmt0 of=/vmcore bs=20b
+.LP
+It does not work to use just
+.IR cp (1),
+as the tape is blocked.
+With the system still in single-user mode, run the analysis program
+.I analyze,
+e.g.:
+.IP
+analyze \-s /dev/drum /vmcore /vmunix
+.LP
+and save the output.
+Then boot up
+\*(lqvmunix\*(rq
+and let it do the automatic reboot, i.e. to boot multi-user from
+an RM03/RM05/RP06 on the MASSBUS
+.IP
+>>> BOOT RPM
+.PP
+After rebooting, to analyze a dump you should execute
+.I "ps \-alxk"
+to print the process table at the time of the crash.
+Use
+.IR adb (1)
+to examine
+.IR /vmcore .
+The location
+.I dumpstack\-80000000
+is the bottom of a stack onto which were pushed the stack pointer
+.BR sp ,
+.B PCBB
+(containing the physical address of a
+.IR u_area ),
+.BR MAPEN ,
+.BR IPL ,
+and registers
+.BR r13 \- r0
+(in that order).
+.BR r13 (fp)
+is the system frame pointer and the stack is used in standard
+.B calls
+format. Use
+.IR adb (1)
+to get a reverse calling order.
+In most cases this procedure will give
+an idea of what is wrong.
+A more complete discussion
+of system debugging is impossible here.
+See, however,
+.IR analyze (8)
+for some more hints.
+.SH "SEE ALSO"
+analyze(8), reboot(8)
+.br
+.I "VAX 11/780 System Maintenance Guide"
+for more information about machine checks.
+.SH BUGS