+.th CRASH VIII 2/12/75
+.tr |
+.sh NAME
+crash \*- what to do when the system crashes
+.sh DESCRIPTION
+This section gives at least a few clues about how to proceed if the
+system crashes.
+It can't pretend to be complete.
+.s3
+.it "How to bring it back up.||"
+If the reason for the crash is not evident
+(see below for guidance on `evident')
+you may want to try to dump the system if you feel up to
+debugging.
+At the moment a dump can be taken only on magtape.
+With a tape mounted and ready,
+stop the machine, load address 44, and start.
+This should write a copy of all of core
+on the tape with an EOF mark.
+Caution:
+Any error is taken to mean the end of core has been reached.
+This means that you must be sure the ring is in,
+the tape is ready, and the tape is clean and new.
+If the dump fails, you can try again,
+but some of the registers will be lost.
+See below for what to do with the tape.
+.s3
+In restarting after a crash,
+always bring up the system single-user.
+This is accomplished by following the directions in
+.it "boot procedures"
+(VIII)
+as modified for your particular installation;
+a single-user system is indicated by having a particular value
+in the switches (173030 unless you've changed
+.it init)
+as the system starts executing.
+When it is running,
+perform a
+.it dcheck
+and
+.it icheck
+(VIII)
+on all file systems which could have been in use at the time
+of the crash.
+If any serious file system problems are found, they should be repaired.
+When you are satisfied with the health of your disks,
+check and set the date if necessary,
+then come up multi-user.
+This is most easily accomplished by changing the
+single-user value in the switches to something else,
+then logging out
+by typing an EOT.
+.s3
+To boot \s8UNIX\s10 at all,
+three files (and the directories leading to them)
+must be intact.
+First,
+the initialization program
+.it /etc/init
+must be present and executable.
+If it is not,
+the CPU will loop in user mode at location 6.
+For
+.it init
+to work correctly,
+.it /dev/tty8
+and
+.it /bin/sh
+must be present.
+If either does not exist,
+the symptom is best described
+as thrashing.
+.it Init
+will go into a
+.it fork/exec
+loop trying to create a
+Shell with proper standard input and output.
+.s3
+If you cannot get the system to boot,
+a runnable system must be obtained from
+a backup medium.
+The root file system may then be doctored as
+a mounted file system as described below.
+If there are any problems with the root
+file system,
+it is probably prudent to go to a
+backup system to avoid working on a
+mounted file system.
+.s3
+.it "Repairing disks.||"
+The first rule to keep in mind is that an addled disk
+should be treated gently;
+it shouldn't be mounted unless necessary,
+and if it is very valuable yet
+in quite bad shape, perhaps it should be dumped before
+trying surgery on it.
+This is an area where experience and informed courage count for much.
+.s3
+The problems reported by
+.it icheck
+typically fall into two kinds.
+There can be
+problems with the free list:
+duplicates in the free list, or free blocks also in files.
+These can be cured easily with an
+.it "icheck \*-s."
+If the same block appears in more than one file
+or if a file contains bad blocks,
+the files should be deleted, and the free list reconstructed.
+The best way to delete such a file is to use
+.it clri
+(VIII),
+then remove its directory entries.
+If any of the affected files is really precious,
+you can try to copy it to another device
+first.
+.s3
+.it Dcheck
+may report files which
+have more directory entries than links.
+Such situations are potentially dangerous;
+.it clri
+discusses a special case of the problem.
+All the directory entries for the file should be removed.
+If on the other hand there are more links than directory entries,
+there is no danger of spreading infection;
+some disk space is merely lost to use.
+It is sufficient to copy the file (if it has any entries and is useful)
+then use
+.it clri
+on its inode and remove any directory
+entries that do exist.
+.s3
+Finally,
+there may be inodes reported by
+.it dcheck
+that have 0 links and 0 entries.
+These occur on the root device when the system is stopped
+with pipes open, and on other file systems when the system
+stops with files that have been deleted while still open.
+A
+.it clri
+will free the inode, and an
+.it "icheck \*-s"
+will
+recover any missing blocks.
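+.s3
+The
+.it dcheck
+cases above can be summarized as a small decision table.
+The following Python sketch is illustrative only;
+the function name and the wording of the returned advice are assumptions
+and are not part of any system program.

```python
def dcheck_action(links, entries):
    """Suggest a repair for one inode, given the link count stored in the
    inode and the number of directory entries dcheck found for it.
    Hypothetical helper; the advice strings paraphrase the text above."""
    if links == 0 and entries == 0:
        # Pipe or deleted-but-open file left over from the crash.
        return "clri the inode, then icheck -s"
    if entries > links:
        # Potentially dangerous case.
        return "remove all directory entries for the file"
    if links > entries:
        # Harmless, but the disk space is lost.
        return "copy the file if useful, clri the inode, remove remaining entries"
    return "consistent; nothing to do"
```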
+.s3
+.it "Why did it crash?||"
+UNIX types a message
+on the console typewriter when it voluntarily crashes.
+Here is the current list of such messages,
+with enough information to give
+at least some hope of a remedy.
+The message has the form `panic: ...',
+possibly accompanied by other information.
+Left unstated in all cases
+is the possibility that hardware or software
+error produced the message in some unexpected way.
+.s3
+.lp +5 5
+blkdev
+.br
+The
+.it getblk
+routine was called with a nonexistent major device as argument.
+Definitely hardware or software error.
+.s3
+.lp +5 5
+devtab
+.br
+Null device table entry for the major device used as argument to
+.it getblk.
+Definitely hardware or software error.
+.s3
+.lp +5 5
+iinit
+.br
+An I/O error reading the super-block for the root file system
+during initialization.
+.s3
+.lp +5 5
+out of inodes
+.br
+A mounted file system has no more i-nodes when creating a file.
+Sorry, the device isn't available;
+the
+.it icheck
+should tell you.
+.s3
+.lp +5 5
+no fs
+.br
+A device has disappeared from the mounted-device table.
+Definitely hardware or software error.
+.s3
+.lp +5 5
+no imt
+.br
+Like `no fs', but produced elsewhere.
+.s3
+.lp +5 5
+no inodes
+.br
+The in-core inode table is full.
+Try increasing NINODE in param.h.
+Shouldn't be a panic, just a user error.
+.s3
+.lp +5 5
+no clock
+.br
+During initialization,
+neither the line nor programmable clock was found to exist.
+.s3
+.lp +5 5
+swap error
+.br
+An unrecoverable I/O error during a swap.
+Really shouldn't be a panic,
+but it is hard to fix.
+.s3
+.lp +5 5
+unlink \- iget
+.br
+The directory containing a file being deleted can't be found.
+Hardware or software error.
+.s3
+.lp +5 5
+out of swap space
+.br
+A program needs to be swapped out, and there is no more swap space.
+It has to be increased.
+This really shouldn't be a panic, but there is no easy fix.
+.s3
+.lp +5 5
+out of text
+.br
+A pure procedure program is being executed,
+and the table for such things is full.
+This shouldn't be a panic.
+.s3
+.lp +5 5
+trap
+.br
+An unexpected trap has occurred within the system.
+This is accompanied by three numbers:
+a `ka6', which is the contents of the segmentation
+register for the area in which the system's stack is kept;
+`aps', which is the location where the hardware stored
+the program status word during the trap;
+and a `trap type' which encodes
+which trap occurred.
+The trap types are:
+.s3
+.lp +10 5
+0 bus error
+.lp +10 5
+1 illegal instruction
+.lp +10 5
+2 BPT/trace
+.lp +10 5
+3 IOT
+.lp +10 5
+4 power fail
+.lp +10 5
+5 EMT
+.lp +10 5
+6 recursive system call (TRAP instruction)
+.lp +10 5
+7 11/70 cache parity, or programmed interrupt
+.lp +10 5
+10 floating point trap
+.lp +10 5
+11 segmentation violation
+.i0
+.s3
+In some of these cases it is
+possible for octal 20 to be added into the trap type;
+this indicates that the processor was in user mode when the trap occurred.
+If you wish to examine the stack after such a trap,
+either dump the system, or use the console switches to examine core;
+the required address mapping is described below.
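+.s3
+As an aid to reading the trap type,
+here is a minimal Python sketch that splits off the user-mode bit
+and looks up the remainder in the table above.
+It assumes the number is printed in octal,
+as the table's numbering suggests;
+the names used are hypothetical.

```python
# Trap types from the table above, keyed by their octal values.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}
USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(text):
    """Decode a trap type given as octal digits (assumed console format).
    Returns (trap name, True if the processor was in user mode)."""
    value = int(text, 8)
    user = bool(value & USER_MODE)
    name = TRAP_NAMES.get(value & ~USER_MODE, "unknown")
    return name, user
```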
+.s3
+.it "Interpreting dumps.||"
+All file system problems
+should be taken care of before attempting to look at dumps.
+The dump should be read into the file
+.it /usr/sys/core;
+.it cp
+(I) will do.
+At this point, you should execute
+.it "ps \*-alxk"
+and
+.it who
+to print the process table and the users who were on
+at the time of the crash.
+You should dump (
+.it od
+(I))
+the first 30 bytes of
+.it /usr/sys/core.
+Starting at location 4,
+the registers R0, R1, R2, R3, R4, R5, SP
+and KDSA6 (KISA6 for 11/40s) are stored.
+If the dump had to be restarted,
+R0 will not be correct.
+Next, take the value of KA6 (location 22(8) in the dump)
+multiplied by 100(8) and dump 2000(8) bytes starting from there.
+This is the per-process data associated with the process running
+at the time of the crash.
+Relabel
+the addresses 140000 to 141776.
+R5 is C's frame or display pointer.
+Stored at (R5) is the old R5 pointing to the previous
+stack frame.
+At (R5)+2
+is the saved PC of the calling procedure.
+Trace
+this calling chain until
+you obtain an R5 value of 141756, which
+is where the user's R5 is stored.
+If the chain is broken,
+you have to look for a plausible
+R5, PC pair and continue from there.
+Each PC should be looked up in the system's name list
+using
+.it db
+(I) and its `:' command,
+to get a reverse calling order.
+In most cases this procedure will give
+an idea of what is wrong.
+A more complete discussion
+of system debugging is impossible here.
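+.s3
+The hand procedure above can be sketched in Python as follows.
+This is an illustration under stated assumptions, not a system tool:
+it assumes the dump has been read into a byte string,
+that words are stored low byte first,
+and that the per-process area spans the relabelled
+addresses 140000 to 141776.
+All function names are hypothetical.

```python
import struct

REG_NAMES = ("R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6")
U_BASE = 0o140000    # address the per-process area is relabelled to
U_SIZE = 0o2000      # bytes covered by 140000..141776
USER_R5 = 0o141756   # where the user's R5 is stored; end of the chain


def read_registers(core):
    """Return the eight saved words that start at location 4 of the dump.
    Assumes 16-bit words stored low byte first."""
    words = struct.unpack_from("<8H", core, 4)
    return dict(zip(REG_NAMES, words))


def u_area(core, ka6):
    """Return the per-process data, which starts at KA6 * 100(8)."""
    start = ka6 * 0o100
    return core[start:start + U_SIZE]


def walk_frames(u, r5):
    """Follow the saved-R5 chain and return the saved PCs,
    most recent caller first."""
    pcs = []
    seen = set()
    while r5 != USER_R5:
        off = r5 - U_BASE
        if r5 in seen or off < 0 or off + 4 > len(u):
            break  # chain broken: look for a plausible R5, PC pair by hand
        seen.add(r5)
        old_r5, pc = struct.unpack_from("<2H", u, off)
        pcs.append(pc)  # (R5)+2 holds the caller's saved PC
        r5 = old_r5     # (R5) holds the previous frame's R5
    return pcs
```

+Each PC returned would still have to be looked up in the system's name list.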
+.sh "SEE ALSO"
+clri, icheck, dcheck, boot procedures (VIII)
+.sh BUGS