.TH CRASH 8 VAX-11
.UC
.tr |
.SH NAME
crash \- what to do when the system crashes
.SH DESCRIPTION
This section gives at least a few clues about how to proceed if the
system crashes.
It can't pretend to be complete.
.PP
.I "What to do first.||"
(Someday the LSI-11 will do this automatically.)
If the reason for the crash is not evident
(see below for guidance on `evident')
you may want to try to dump the system if you feel up to debugging.
At the moment a dump can be taken only on magnetic tape.
Before you do anything, be sure that a clean tape is mounted
with a write ring in on the tape drive if you plan to make a dump.
.PP
Write the date and time on the console log.
Use the console commands to examine the registers,
program status long word,
and the top several locations on the stack.
A suggested command sequence, which is executed by the ``@DUMP''
console command script, is:
.DS
.nf
E PSL
E R0/NE:F
E SP
E/V @ /NE:40
.fi
.DE
If hardware problems dictate a special set of commands be executed
when the system crashes, a sequence of commands can be saved using the
console command ``LINK'' to be reexecuted with ``PERFORM''
(which can be abbreviated ``P'').
If a dump is to be taken on magnetic tape
(this is a good idea in almost any case where the cause of the crash
is not immediately obvious)
then the following commands will (should) be executed:
.DS
.nf
D PSL 0
D PC 80000200
C
.fi
.DE
These commands are actually part of the standard ``@DUMP'' script.
This should write a copy of all of memory on the tape,
followed by two EOF marks.
Caution:
Any error is taken to mean the end of memory has been reached.
This means that you must be sure the ring is in,
the tape is ready, and the tape is clean and new.
.PP
If there are not 40 locations active on the kernel stack when the
procedure is begun, then the console may begin to print error diagnostics.
You can stop this by hitting ``^C'' (control-C),
and then give the last three commands above.
.PP
If the dump fails, you can try again,
but some of the registers will be lost.
See below for what to do with the tape.
.PP
.I "How to bring it back up.||"
To restart after a crash, follow the directions in
.IR bproc (8);
if the virtual memory subsystem is suspected as the cause of the crash,
then a version of the system other than ``vmunix'' should be booted;
this version of the system will leave the paging areas temporarily
intact for use by the post-mortem analysis program
.I analyze.
On Ernie Co-vax at UCB, the backup system is ``unix''.
When this system is running, check its root file system (currently
.I /dev/rrm0a
but this is likely to change),
and then read the core tape into the file
.I /vmcore,
with the command:
.IP
cp /dev/mt0 /vmcore
.LP
With the system still in single-user mode, run the analysis program
.I analyze,
i.e.:
.IP
analyze \-s /dev/drum /vmcore /vmunix
.LP
and save the output.
Be sure to boot up ``vmunix'' before coming up multi-user.
.PP
Do a
.I sync,
and proceed to check and fix all file systems, performing a
.I dcheck
and
.IR icheck (1)
on all file systems which could have been in use at the time
of the crash.
If any serious file system problems are found, they should be repaired.
When you are satisfied with the health of your disks,
log out by typing an EOT (control-D).
The command sequence in /etc/rc will be executed
and the system will be in multi-user mode.
.PP
To even boot \s8UNIX\s10 at all,
three files (and the directories leading to them) must be intact.
First, the initialization program
.I /etc/init.vm
must be present and executable.
If it is not, then the cpu will loop at location 0x13.
For
.I init.vm
to work correctly,
.I /dev/console
and
.I /bin/sh
must be present.
If either does not exist, the symptom is best described as thrashing.
.I Init
will go into a
.I fork/exec
loop trying to create a Shell with proper standard input and output.
.PP
If you cannot get the system to boot,
a runnable system must be obtained from a backup medium.
The root file system may then be doctored as a mounted file system
as described below.
If there are any problems with the root file system,
it is probably prudent to go to a backup system
to avoid working on a mounted file system.
.PP
.I "Repairing disks.||"
The first rule to keep in mind is that an addled disk
should be treated gently;
it shouldn't be mounted unless necessary,
and if it is very valuable yet in quite bad shape,
perhaps it should be dumped before trying surgery on it.
This is an area where experience and informed courage count for much.
.PP
The problems reported by
.I icheck
typically fall into two kinds.
There can be problems with the free list:
duplicates in the free list, or free blocks also in files.
These can be cured easily with an
.I "icheck \-s."
If the same block appears in more than one file
or if a file contains bad blocks,
the files should be deleted, and the free list reconstructed.
The best way to delete such a file is to use
.IR clri (1),
then remove its directory entries with
.IR rm (1).
(Do not use
.IR mv (1).)
If any of the affected files is really precious,
you can try to copy it to another device first.
.PP
.I Dcheck
may report files which have more directory entries than links.
Such situations are potentially dangerous;
.I clri
discusses a special case of the problem.
All the directory entries for the file should be removed.
If on the other hand there are more links than directory entries,
there is no danger of spreading infection,
but merely some disk space that is lost for use.
It is sufficient to copy the file
(if it has any entries and is useful)
then use
.I clri
on its inode and remove any directory entries that do exist.
.PP
Finally, there may be inodes reported by
.I dcheck
that have 0 links and 0 entries.
These occur on the root device when the system is stopped
with pipes open, and on other file systems when the system stops
with files that have been deleted while still open.
A
.I clri
will free the inode, and an
.I "icheck \-s"
will recover any missing blocks.
.PP
.I "Why did it crash?||"
UNIX types a message on the console typewriter
when it voluntarily crashes.
Here are some of the possible panic messages,
with enough information to provide a hope at least of the remedy.
The message has the form `panic: ...', `Trap from kernel mode', or
`ILL I/E VEC' (possibly accompanied by other information).
In rare cases the system will ``panic'' but the console message
will not appear;
if this happens, you can trace the message easily through the variable
.I panicstr
in the system.
Left unstated in all cases is the possibility that hardware or software
error produced the message in some unexpected way.
.HP 5
blkdev
.br
The
.I getblk
routine was called with a nonexistent major device as argument.
Definitely hardware or software error.
.HP 5
devtab
.br
Null device table entry for the major device used as argument to
.I getblk.
Definitely hardware or software error.
.HP 5
iinit
.br
An I/O error reading the super-block for the root file system
during initialization.
.HP 5
out of inodes
.br
A mounted file system has no more i-nodes when creating a file.
Sorry, the device isn't available; the
.I icheck
should tell you.
.HP 5
no fs
.br
A device has disappeared from the mounted-device table.
Definitely hardware or software error.
.HP 5
no imt
.br
Like `no fs', but produced elsewhere.
.HP 5
no inodes
.br
The in-memory inode table is full.
Try increasing NINODE in param.h.
Shouldn't be a panic, just a user error.
.HP 5
IO error in swap
.br
An unrecoverable I/O error during a swap.
Really shouldn't be a panic.
.HP 5
out of swap
.br
A program needs to be swapped out, and there is no more swap space.
It has to be increased.
This really shouldn't be a panic.
.HP 5
out of text
.br
A pure procedure program is being executed,
and the table for such things is full.
This shouldn't be a panic.
.HP 5
trap from kernel mode
.br
An unexpected trap has occurred within the system.
The trap type can be determined by examining the top word of the stack
(the trap type) with the console commands.
The trap types are:
.TP 10
0
reserved addressing mode
.br
.ns
.TP 10
1
privileged instruction
.br
.ns
.TP 10
2
BPT
.br
.ns
.TP 10
3
XFC
.br
.ns
.TP 10
4
reserved operand
.br
.ns
.TP 10
5
CHMK (system call)
.br
.ns
.TP 10
6
arithmetic trap
.br
.ns
.TP 10
7
reschedule trap (software level 3)
.br
.ns
.TP 10
8
segmentation fault
.br
.ns
.TP 10
9
protection fault
.br
.ns
.TP 10
10
trace pending (TP bit)
.HP 5
ILL I/E VEC, HALTED AT xx
.br
An illegal interrupt or exception has occurred.
The possible addresses are
.ns
.TP 10
4
machine check (hardware error).
.br
.ns
.TP 10
8
kernel stack not valid
.br
.ns
.TP 10
C
power failure
.PP
In some of these cases it is possible for octal 20 to be added into
the trap type;
this indicates that the processor was in user mode when the trap occurred.
If you wish to examine the stack after such a trap,
either dump the system, or use the console to examine memory;
the required address mapping is described below.
.PP
There are also a large number of panics if internal consistency checks
in the paging subsystem fail.
These can be caused by hardware
(e.g. if disk or tape problems cause a data structure to be mutilated)
but are most often caused by software problems.
Refer to a system listing to locate these and other panics
not discussed above.
.PP
.I "Interpreting dumps.||"
All file system problems should be taken care of
before attempting to analyze dumps.
As mentioned above, the dump tape should be read into the file
.IR /vmcore ;
.IR cp (1)
will do.
At this point, you should execute
.I "ps \-alxk"
and
.I who
to print the process table and the users who were on
at the time of the crash.
Use
.IR adb (1)
to examine
.IR /vmcore .
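.PP
Kernel symbols are linked at system-space virtual addresses, whereas
.I /vmcore
is a copy of physical memory, which is why 80000000 (hex) is
subtracted from a symbol such as
.I dumpstack
before it is looked up in the dump.
The following is only a minimal sketch of that arithmetic in C,
assuming the kernel is loaded at physical address 0;
the names
.I SYSTEM_BASE,
.I is_system_address
and
.I core_offset
are hypothetical and are not part of the system.
.nf
```c
#include <stdint.h>

/* Start of system space; the figure subtracted from dumpstack. */
#define SYSTEM_BASE 0x80000000UL

/* Nonzero if addr lies in system space (at or above SYSTEM_BASE). */
int is_system_address(uint32_t addr)
{
    return addr >= SYSTEM_BASE;
}

/*
 * Offset within the memory image of a system-space address.
 * Assumption: the kernel is loaded at physical address 0.
 * Callers should check is_system_address() first.
 */
uint32_t core_offset(uint32_t addr)
{
    return addr - (uint32_t)SYSTEM_BASE;
}
```
.fi
Under these assumptions, the address 80000200 deposited in the PC
by the dump commands corresponds to offset 200 (hex) in the image.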
The location
.I dumpstack\-80000000
is the bottom of a stack onto which were pushed the stack pointer
.BR sp ,
.B PCBB
(containing the physical address of a
.IR u_area ),
.BR MAPEN ,
.BR IPL ,
and registers
.BR r13 \- r0
(in that order).
.BR r13 (fp)
is the system frame pointer and the stack is used in standard
.B calls
format.
Use
.IR adb (1)
to get a reverse calling order.
In most cases this procedure will give an idea of what is wrong.
A more complete discussion of system debugging is impossible here.
See, however,
.IR analyze (1)
for some more hints.
.SH "SEE ALSO"
analyze(1m), clri(1), icheck(1), dcheck(1), bproc(8)
.br
.I "VAX 11/780 System Maintenance Guide"
for more information about machine checks.
.SH BUGS