.TH CRASH 8
.SH NAME
crash \- what to do when the system crashes
.SH DESCRIPTION
This section gives at least a few clues about how to proceed if the
system crashes.
It can't pretend to be complete.
.PP
.I Bringing it back up.
If the reason for the crash is not evident
(see below for guidance on `evident')
you may want to try to dump the system if you feel up to
debugging.
At the moment a dump can be taken only on magtape.
With a tape mounted and ready,
stop the machine, load address 44, and start.
This should write a copy of all of core
on the tape with an EOF mark.
Caution:
Any error is taken to mean the end of core has been reached.
This means that you must be sure the ring is in,
the tape is ready, and the tape is clean and new.
If the dump fails, you can try again,
but some of the registers will be lost.
See below for what to do with the tape.
.PP
In restarting after a crash,
always bring up the system single-user.
This is accomplished by following the directions in
.IR boot (8)
as modified for your particular installation;
a single-user system is indicated by having a particular value
in the switches (173030 unless you've changed
.IR init )
as the system starts executing.
When it is running,
perform a
.I dcheck
and
.IR icheck (1)
on all file systems which could have been in use at the time
of the crash.
If any serious file system problems are found, they should be repaired.
When you are satisfied with the health of your disks,
check and set the date if necessary,
then come up multi-user.
This is most easily accomplished by changing the
single-user value in the switches to something else,
then logging out
by typing an EOT.
.PP
To boot \s8UNIX\s10 at all,
three files (and the directories leading to them)
must be intact.
First,
the initialization program
.I /etc/init
must be present and executable.
If it is not,
the CPU will loop in user mode at location 6.
For
.I init
to work correctly,
.I /dev/tty8
and
.I /bin/sh
must be present.
If either does not exist,
the symptom is best described
as thrashing.
.I Init
will go into a
.I fork/exec
loop trying to create a
Shell with proper standard input and output.
.PP
If you cannot get the system to boot,
a runnable system must be obtained from
a backup medium.
The root file system may then be doctored as
a mounted file system as described below.
If there are any problems with the root
file system,
it is probably prudent to go to a
backup system to avoid working on a
mounted file system.
.PP
.I Repairing disks.
The first rule to keep in mind is that an addled disk
should be treated gently;
it shouldn't be mounted unless necessary,
and if it is very valuable yet
in quite bad shape, perhaps it should be dumped before
trying surgery on it.
This is an area where experience and informed courage count for much.
.PP
The problems reported by
.I icheck
typically fall into two kinds.
There can be
problems with the free list:
duplicates in the free list, or free blocks also in files.
These can be cured easily with an
.I icheck \-s.
If the same block appears in more than one file
or if a file contains bad blocks,
the files should be deleted, and the free list reconstructed.
The best way to delete such a file is to use
.IR clri (1),
then remove its directory entries.
If any of the affected files is really precious,
you can try to copy it to another device
first.
.PP
.I Dcheck
may report files which
have more directory entries than links.
Such situations are potentially dangerous;
.I clri
discusses a special case of the problem.
All the directory entries for the file should be removed.
If on the other hand there are more links than directory entries,
there is no danger of spreading infection, but merely some disk space
that is lost for use.
It is sufficient to copy the file (if it has any entries and is useful)
then use
.I clri
on its inode and remove any directory
entries that do exist.
.PP
Finally,
there may be inodes reported by
.I dcheck
that have 0 links and 0 entries.
These occur on the root device when the system is stopped
with pipes open, and on other file systems when the system
stops with files that have been deleted while still open.
A
.I clri
will free the inode, and an
.I icheck \-s
will
recover any missing blocks.
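.PP
The three cases that
.I dcheck
can report may be summarized in a short sketch.
It is illustrative only: the function
.I classify
and its return strings are hypothetical names, not part of
.IR dcheck ,
and the link and entry counts are assumed to have been copied by hand
from the
.I dcheck
report.
.nf
```python
def classify(links, entries):
    """Suggest a repair for one inode reported by dcheck.

    links   -- link count stored in the inode
    entries -- number of directory entries found for the inode
    Both values are assumed to be copied by hand from the report.
    """
    if links == 0 and entries == 0:
        # Left behind by an open pipe or a file deleted while still open.
        return "clri the inode, then icheck -s"
    if entries > links:
        # Potentially dangerous: more names than the link count allows.
        return "remove all directory entries"
    if links > entries:
        # Harmless, but the disk space stays lost until the inode is cleared.
        return "copy if useful, clri the inode, remove remaining entries"
    return "consistent"
```
.fi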
.PP
.I Why did it crash?
UNIX types a message
on the console typewriter when it voluntarily crashes.
Here is the current list of such messages,
with enough information to provide
a hope at least of the remedy.
The message has the form `panic: ...',
possibly accompanied by other information.
Left unstated in all cases
is the possibility that hardware or software
error produced the message in some unexpected way.
.HP 5
blkdev
.br
The
.I getblk
routine was called with a nonexistent major device as argument.
Definitely hardware or software error.
.HP 5
devtab
.br
Null device table entry for the major device used as argument to
.IR getblk .
Definitely hardware or software error.
.HP 5
iinit
.br
An I/O error reading the super-block for the root file system
during initialization.
.HP 5
out of inodes
.br
A mounted file system has no more inodes when creating a file.
Sorry, the device is not identified in the message;
.I icheck
should tell you.
.HP 5
no fs
.br
A device has disappeared from the mounted-device table.
Definitely hardware or software error.
.HP 5
no imt
.br
Like `no fs', but produced elsewhere.
.HP 5
no inodes
.br
The in-core inode table is full.
Try increasing NINODE in param.h.
Shouldn't be a panic, just a user error.
.HP 5
no clock
.br
During initialization,
neither the line nor programmable clock was found to exist.
.HP 5
swap error
.br
An unrecoverable I/O error during a swap.
Really shouldn't be a panic,
but it is hard to fix.
.HP 5
unlink \- iget
.br
The directory containing a file being deleted can't be found.
Hardware or software.
.HP 5
out of swap space
.br
A program needs to be swapped out, and there is no more swap space.
It has to be increased.
This really shouldn't be a panic, but there is no easy fix.
.HP 5
out of text
.br
A pure procedure program is being executed,
and the table for such things is full.
This shouldn't be a panic.
.HP 5
trap
.br
An unexpected trap has occurred within the system.
This is accompanied by three numbers:
a `ka6', which is the contents of the segmentation
register for the area in which the system's stack is kept;
`aps', which is the location where the hardware stored
the program status word during the trap;
and a `trap type' which encodes
which trap occurred.
The trap types are:
.TP 10
0
bus error
.br
.ns
.TP 10
1
illegal instruction
.br
.ns
.TP 10
2
BPT/trace
.br
.ns
.TP 10
3
IOT
.br
.ns
.TP 10
4
power fail
.br
.ns
.TP 10
5
EMT
.br
.ns
.TP 10
6
recursive system call (TRAP instruction)
.br
.ns
.TP 10
7
11/70 cache parity, or programmed interrupt
.br
.ns
.TP 10
10
floating point trap
.br
.ns
.TP 10
11
segmentation violation
.PP
In some of these cases it is
possible for octal 20 to be added into the trap type;
this indicates that the processor was in user mode when the trap occurred.
If you wish to examine the stack after such a trap,
either dump the system, or use the console switches to examine core;
the required address mapping is described below.
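.PP
The table and the octal 20 user-mode flag can be combined in a small
decoder, sketched here.
It assumes that the trap type is printed in octal, as the remark about
octal 20 suggests;
the names
.I decode_trap
and
.I TRAP_TYPES
are hypothetical and exist only for this illustration.
.nf
```python
# Trap types from the table above, keyed by their octal codes.
TRAP_TYPES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(printed):
    """Decode a trap type as typed on the console.

    printed -- the trap type as a string of octal digits (assumption)
    Returns (description, was_in_user_mode).
    """
    value = int(printed, 8)
    user = bool(value & USER_MODE)
    code = value & ~USER_MODE  # strip the user-mode flag
    return TRAP_TYPES.get(code, "unknown trap type"), user
```
.fi
For example,
.I decode_trap("30")
yields the floating point trap with the user-mode flag set.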
.PP
.I Interpreting dumps.
All file system problems
should be taken care of before attempting to look at dumps.
The dump should be read into the file
.I /usr/sys/core;
.IR cp (1)
will do.
At this point, you should execute
.I ps \-alxk
and
.I who
to print the process table and the users who were on
at the time of the crash.
You should dump
.RI ( od (1))
the first 30 bytes of
.I /usr/sys/core.
Starting at location 4,
the registers R0, R1, R2, R3, R4, R5, SP
and KDSA6 (KISA6 for 11/40s) are stored.
If the dump had to be restarted,
R0 will not be correct.
Next, take the value of KA6 (location 022(8) in the dump)
multiplied by 0100(8) and dump 01000(8) bytes starting from there.
This is the per-process data associated with the process running
at the time of the crash.
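.PP
The arithmetic above can be sketched as follows.
The sketch assumes that the dump is a byte-for-byte copy of core, with
each 16-bit word stored low byte first as on the PDP-11;
the function names
.I read_registers
and
.I uarea_offset
are hypothetical.
.nf
```python
import struct

# Registers saved as consecutive 16-bit words starting at location 4.
REG_NAMES = ("R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6")


def read_registers(dump):
    """Return the saved registers from the start of a core dump.

    dump -- the dump contents as bytes (at least 024 octal bytes long)
    KA6 lands at byte offset 022 octal, matching the text.
    """
    values = struct.unpack_from("<8H", dump, 4)
    return dict(zip(REG_NAMES, values))


def uarea_offset(ka6):
    """Byte offset in the dump of the per-process data area."""
    return ka6 * 0o100
```
.fi
The bytes to examine then begin at
.I uarea_offset(regs["KA6"])
in the dump.
.PP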
Relabel
the addresses 140000 to 141776.
R5 is C's frame or display pointer.
Stored at (R5) is the old R5 pointing to the previous
stack frame.
At (R5)+2
is the saved PC of the calling procedure.
Trace
this calling chain until
you obtain an R5 value of 141756, which
is where the user's R5 is stored.
If the chain is broken,
you have to look for a plausible
R5, PC pair and continue from there.
Each PC should be looked up in the system's name list
using
.IR adb (1)
and its `:' command,
to get a reverse calling order.
In most cases this procedure will give
an idea of what is wrong.
A more complete discussion
of system debugging is impossible here.
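.PP
The walk along the R5 chain can be sketched as below.
This is a minimal illustration, not a debugger:
the names
.I word
and
.I walk_frames
are hypothetical,
words are assumed to be stored low byte first,
and the buffer is assumed to cover the whole relabelled range
140000 to 141776, which is 02000(8) bytes.
The saved PCs it returns still have to be looked up with
.IR adb (1).
.nf
```python
U_BASE = 0o140000        # address given to the first byte of the buffer
USER_R5_SLOT = 0o141756  # where the user's R5 is stored; end of chain


def word(uarea, addr):
    """Fetch the 16-bit word at a relabelled address."""
    off = addr - U_BASE
    if addr & 1 or off < 0 or off + 2 > len(uarea):
        raise ValueError("address outside the per-process area")
    return uarea[off] | (uarea[off + 1] << 8)


def walk_frames(uarea, r5, limit=64):
    """Follow the chain of saved R5 values.

    Returns (saved_pcs, complete): saved_pcs lists the saved PC found
    at (R5)+2 in each frame, innermost first; complete is False when
    the chain breaks or exceeds limit frames before reaching 141756.
    """
    pcs = []
    for _ in range(limit):
        if r5 == USER_R5_SLOT:
            return pcs, True
        try:
            saved_pc = word(uarea, r5 + 2)
            old_r5 = word(uarea, r5)
        except ValueError:
            return pcs, False  # broken chain: pick a plausible pair by hand
        pcs.append(saved_pc)
        r5 = old_r5
    return pcs, False
```
.fi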
.SH SEE ALSO
clri(1), icheck(1), dcheck(1), boot(8)