Commit | Line | Data |
---|---|---|
aa26a18b KT |
1 | .TH CRASH 8 |
2 | .SH NAME | |
3 | crash \- what to do when the system crashes | |
4 | .SH DESCRIPTION | |
5 | This section gives at least a few clues about how to proceed if the | |
6 | system crashes. | |
7 | It can't pretend to be complete. | |
8 | .PP | |
9 | .I Bringing it back up. | |
10 | If the reason for the crash is not evident | |
11 | (see below for guidance on `evident') | |
12 | you may want to try to dump the system if you feel up to | |
13 | debugging. | |
14 | At the moment a dump can be taken only on magtape. | |
15 | With a tape mounted and ready, | |
16 | stop the machine, load address 44, and start. | |
17 | This should write a copy of all of core | |
18 | on the tape with an EOF mark. | |
19 | Caution: | |
20 | Any error is taken to mean the end of core has been reached. | |
21 | This means that you must be sure the ring is in, | |
22 | the tape is ready, and the tape is clean and new. | |
23 | If the dump fails, you can try again, | |
24 | but some of the registers will be lost. | |
25 | See below for what to do with the tape. | |
26 | .PP | |
27 | In restarting after a crash, | |
28 | always bring up the system single-user. | |
29 | This is accomplished by following the directions in | |
30 | .IR boot (8) | |
31 | as modified for your particular installation; | |
32 | a single-user system is indicated by having a particular value | |
33 | in the switches (173030 unless you've changed | |
34 | .I init) | |
35 | as the system starts executing. | |
36 | When it is running, | |
37 | perform a | |
38 | .I dcheck | |
39 | and | |
40 | .IR icheck (1) | |
41 | on all file systems which could have been in use at the time | |
42 | of the crash. | |
43 | If any serious file system problems are found, they should be repaired. | |
44 | When you are satisfied with the health of your disks, | |
45 | check and set the date if necessary, | |
46 | then come up multi-user. | |
47 | This is most easily accomplished by changing the | |
48 | single-user value in the switches to something else, | |
49 | then logging out | |
50 | by typing an EOT. | |
51 | .PP | |
52 | To even boot \s8UNIX\s10 at all, | |
53 | three files (and the directories leading to them) | |
54 | must be intact. | |
55 | First, | |
56 | the initialization program | |
57 | .I /etc/init | |
58 | must be present and executable. | |
59 | If it is not, | |
60 | the CPU will loop in user mode at location 6. | |
61 | For | |
62 | .I init | |
63 | to work correctly, | |
64 | .I /dev/tty8 | |
65 | and | |
66 | .I /bin/sh | |
67 | must be present. | |
68 | If either does not exist, | |
69 | the symptom is best described | |
70 | as thrashing. | |
71 | .I Init | |
72 | will go into a | |
73 | .I fork/exec | |
74 | loop trying to create a | |
75 | Shell with proper standard input and output. | |
76 | .PP | |
77 | If you cannot get the system to boot, | |
78 | a runnable system must be obtained from | |
79 | a backup medium. | |
80 | The root file system may then be doctored as | |
81 | a mounted file system as described below. | |
82 | If there are any problems with the root | |
83 | file system, | |
84 | it is probably prudent to go to a | |
85 | backup system to avoid working on a | |
86 | mounted file system. | |
87 | .PP | |
88 | .I Repairing disks. | |
89 | The first rule to keep in mind is that an addled disk | |
90 | should be treated gently; | |
91 | it shouldn't be mounted unless necessary, | |
92 | and if it is very valuable yet | |
93 | in quite bad shape, perhaps it should be dumped before | |
94 | trying surgery on it. | |
95 | This is an area where experience and informed courage count for much. | |
96 | .PP | |
97 | The problems reported by | |
98 | .I icheck | |
99 | typically fall into two kinds. | |
100 | There can be | |
101 | problems with the free list: | |
102 | duplicates in the free list, or free blocks also in files. | |
103 | These can be cured easily with an | |
104 | .I icheck \-s. | |
105 | If the same block appears in more than one file | |
106 | or if a file contains bad blocks, | |
107 | the files should be deleted, and the free list reconstructed. | |
108 | The best way to delete such a file is to use | |
109 | .IR clri (1), | |
110 | then remove its directory entries. | |
111 | If any of the affected files is really precious, | |
112 | you can try to copy it to another device | |
113 | first. | |
114 | .PP | |
115 | .I Dcheck | |
116 | may report files which | |
117 | have more directory entries than links. | |
118 | Such situations are potentially dangerous; | |
119 | .I clri | |
120 | discusses a special case of the problem. | |
121 | All the directory entries for the file should be removed. | |
122 | If on the other hand there are more links than directory entries, | |
123 | there is no danger of spreading infection, but merely some disk space | |
124 | that is lost for use. | |
125 | It is sufficient to copy the file (if it has any entries and is useful) | |
126 | then use | |
127 | .I clri | |
128 | on its inode and remove any directory | |
129 | entries that do exist. | |
130 | .PP | |
131 | Finally, | |
132 | there may be inodes reported by | |
133 | .I dcheck | |
134 | that have 0 links and 0 entries. | |
135 | These occur on the root device when the system is stopped | |
136 | with pipes open, and on other file systems when the system | |
137 | stops with files that have been deleted while still open. | |
138 | A | |
139 | .I clri | |
140 | will free the inode, and an | |
141 | .I icheck \-s | |
142 | will | |
143 | recover any missing blocks. | |
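The three dcheck cases described above can be summarized as a small decision procedure. The Python sketch below is only an illustration of that reasoning: the function name `dcheck_action`, its two integer arguments, and the returned strings are hypothetical and are not part of dcheck, which prints its own report. It assumes the inode in question is one that dcheck has actually flagged.

```python
def dcheck_action(entries, links):
    """Suggest a repair for one inode flagged by dcheck.

    entries: number of directory entries found for the inode
    links:   link count recorded in the inode itself
    The function and its return strings are illustrative only.
    """
    if entries < 0 or links < 0:
        raise ValueError("counts must be non-negative")
    if entries == 0 and links == 0:
        # Left over from open pipes, or files deleted while still open.
        return "clri the inode, then icheck -s"
    if entries > links:
        # Potentially dangerous: every directory entry should go.
        return "remove all directory entries for the file"
    if links > entries:
        # No danger of spreading damage; only disk space is lost.
        return "copy the file if useful, clri the inode, remove remaining entries"
    return "counts agree; nothing to repair"
```

The order of the tests mirrors the text: the 0/0 case is checked first because it would otherwise look like a consistent inode.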
144 | .PP | |
145 | .I Why did it crash? | |
146 | UNIX types a message | |
147 | on the console typewriter when it voluntarily crashes. | |
148 | Here is the current list of such messages, | |
149 | with enough information to provide | |
150 | a hope at least of the remedy. | |
151 | The message has the form `panic: ...', | |
152 | possibly accompanied by other information. | |
153 | Left unstated in all cases | |
154 | is the possibility that hardware or software | |
155 | error produced the message in some unexpected way. | |
156 | .HP 5 | |
157 | blkdev | |
158 | .br | |
159 | The | |
160 | .I getblk | |
161 | routine was called with a nonexistent major device as argument. | |
162 | Definitely hardware or software error. | |
163 | .HP 5 | |
164 | devtab | |
165 | .br | |
166 | Null device table entry for the major device used as argument to | |
167 | .I getblk. | |
168 | Definitely hardware or software error. | |
169 | .HP 5 | |
170 | iinit | |
171 | .br | |
172 | An I/O error reading the super-block for the root file system | |
173 | during initialization. | |
174 | .HP 5 | |
175 | out of inodes | |
176 | .br | |
177 | A mounted file system has no more inodes when creating a file. | |
178 | Sorry, the name of the device isn't available; | |
179 | the | |
180 | .I icheck | |
181 | should tell you. | |
182 | .HP 5 | |
183 | no fs | |
184 | .br | |
185 | A device has disappeared from the mounted-device table. | |
186 | Definitely hardware or software error. | |
187 | .HP 5 | |
188 | no imt | |
189 | .br | |
190 | Like `no fs', but produced elsewhere. | |
191 | .HP 5 | |
192 | no inodes | |
193 | .br | |
194 | The in-core inode table is full. | |
195 | Try increasing NINODE in param.h. | |
196 | Shouldn't be a panic, just a user error. | |
197 | .HP 5 | |
198 | no clock | |
199 | .br | |
200 | During initialization, | |
201 | neither the line nor programmable clock was found to exist. | |
202 | .HP 5 | |
203 | swap error | |
204 | .br | |
205 | An unrecoverable I/O error during a swap. | |
206 | Really shouldn't be a panic, | |
207 | but it is hard to fix. | |
208 | .HP 5 | |
209 | unlink \- iget | |
210 | .br | |
211 | The directory containing a file being deleted can't be found. | |
212 | Hardware or software error. | |
213 | .HP 5 | |
214 | out of swap space | |
215 | .br | |
216 | A program needs to be swapped out, and there is no more swap space. | |
217 | It has to be increased. | |
218 | This really shouldn't be a panic, but there is no easy fix. | |
219 | .HP 5 | |
220 | out of text | |
221 | .br | |
222 | A pure procedure program is being executed, | |
223 | and the table for such things is full. | |
224 | This shouldn't be a panic. | |
225 | .HP 5 | |
226 | trap | |
227 | .br | |
228 | An unexpected trap has occurred within the system. | |
229 | This is accompanied by three numbers: | |
230 | a `ka6', which is the contents of the segmentation | |
231 | register for the area in which the system's stack is kept; | |
232 | `aps', which is the location where the hardware stored | |
233 | the program status word during the trap; | |
234 | and a `trap type' which encodes | |
235 | which trap occurred. | |
236 | The trap types are: | |
237 | .TP 10 | |
238 | 0 | |
239 | bus error | |
240 | .br | |
241 | .ns | |
242 | .TP 10 | |
243 | 1 | |
244 | illegal instruction | |
245 | .br | |
246 | .ns | |
247 | .TP 10 | |
248 | 2 | |
249 | BPT/trace | |
250 | .br | |
251 | .ns | |
252 | .TP 10 | |
253 | 3 | |
254 | IOT | |
255 | .br | |
256 | .ns | |
257 | .TP 10 | |
258 | 4 | |
259 | power fail | |
260 | .br | |
261 | .ns | |
262 | .TP 10 | |
263 | 5 | |
264 | EMT | |
265 | .br | |
266 | .ns | |
267 | .TP 10 | |
268 | 6 | |
269 | recursive system call (TRAP instruction) | |
270 | .br | |
271 | .ns | |
272 | .TP 10 | |
273 | 7 | |
274 | 11/70 cache parity, or programmed interrupt | |
275 | .br | |
276 | .ns | |
277 | .TP 10 | |
278 | 10 | |
279 | floating point trap | |
280 | .br | |
281 | .ns | |
282 | .TP 10 | |
283 | 11 | |
284 | segmentation violation | |
285 | .PP | |
286 | In some of these cases it is | |
287 | possible for octal 20 to be added into the trap type; | |
288 | this indicates that the processor was in user mode when the trap occurred. | |
289 | If you wish to examine the stack after such a trap, | |
290 | either dump the system, or use the console switches to examine core; | |
291 | the required address mapping is described below. | |
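As an aid to reading the table above, the following Python sketch splits a reported trap type into its name and the user-mode flag. It assumes the number is printed in octal (the table runs 0 to 7, then 10 and 11, and the text speaks of octal 20); the names `TRAP_NAMES`, `USER_MODE` and `decode_trap` are hypothetical and exist only for this illustration.

```python
# Trap types as listed in the table above, keyed by their octal value.
TRAP_NAMES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(text):
    """Return (trap name, was_user_mode) for an octal trap type string."""
    value = int(text, 8)               # assumption: console prints octal
    user = bool(value & USER_MODE)
    name = TRAP_NAMES.get(value & ~USER_MODE, "unknown")
    return name, user
```

For example, `decode_trap("31")` yields the segmentation violation entry with the user-mode flag set, since 31 octal is 11 plus 20.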
292 | .PP | |
293 | .I Interpreting dumps. | |
294 | All file system problems | |
295 | should be taken care of before attempting to look at dumps. | |
296 | The dump should be read into the file | |
297 | .I /usr/sys/core; | |
298 | .IR cp (1) | |
299 | will do. | |
300 | At this point, you should execute | |
301 | .I ps \-alxk | |
302 | and | |
303 | .I who | |
304 | to print the process table and the users who were on | |
305 | at the time of the crash. | |
306 | You should dump | |
307 | .RI ( od (1)) | |
308 | the first 30 bytes of | |
309 | .I /usr/sys/core. | |
310 | Starting at location 4, | |
311 | the registers R0, R1, R2, R3, R4, R5, SP | |
312 | and KDSA6 (KISA6 for 11/40s) are stored. | |
313 | If the dump had to be restarted, | |
314 | R0 will not be correct. | |
315 | Next, take the value of KA6 (location 022(8) in the dump) | |
316 | multiplied by 0100(8) and dump 01000(8) words starting from there. | |
317 | This is the per-process data associated with the process running | |
318 | at the time of the crash. | |
319 | Relabel | |
320 | the addresses 140000 to 141776. | |
321 | R5 is C's frame or display pointer. | |
322 | Stored at (R5) is the old R5 pointing to the previous | |
323 | stack frame. | |
324 | At (R5)+2 | |
325 | is the saved PC of the calling procedure. | |
326 | Trace | |
327 | this calling chain until | |
328 | you obtain an R5 value of 141756, which | |
329 | is where the user's R5 is stored. | |
330 | If the chain is broken, | |
331 | you have to look for a plausible | |
332 | R5, PC pair and continue from there. | |
333 | Each PC should be looked up in the system's name list | |
334 | using | |
335 | .IR adb (1) | |
336 | and its `:' command, | |
337 | to get a reverse calling order. | |
338 | In most cases this procedure will give | |
339 | an idea of what is wrong. | |
340 | A more complete discussion | |
341 | of system debugging is impossible here. | |
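The arithmetic in the procedure above can be sketched in Python for use on a machine where the dump file is available. This is a rough illustration under several assumptions that are not stated in the text: the dump is read as 16-bit little-endian words, the relabelled range 140000 to 141776 is taken to cover 02000(8) bytes, and all names (`word`, `saved_registers`, `u_area_offset`, `call_chain`) are hypothetical. Looking up each PC in the name list is still left to adb.

```python
import struct

REG_NAMES = ["R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6"]
U_BASE = 0o140000    # address the per-process data is relabelled to
U_SIZE = 0o2000      # bytes spanned by 140000..141776 (assumption)
USER_R5 = 0o141756   # the chain ends where the user's R5 is stored


def word(core, offset):
    """Read one 16-bit little-endian word from the dump."""
    return struct.unpack_from("<H", core, offset)[0]


def saved_registers(core):
    """Registers are stored as consecutive words starting at location 4."""
    return {name: word(core, 4 + 2 * i) for i, name in enumerate(REG_NAMES)}


def u_area_offset(ka6):
    """KA6 multiplied by 0100(8) gives the byte offset in the dump."""
    return ka6 * 0o100


def call_chain(core, regs, limit=64):
    """Follow the saved R5 links; return saved PCs, innermost call first."""
    base = u_area_offset(regs["KA6"])
    r5 = regs["R5"]
    pcs = []
    while r5 != USER_R5 and len(pcs) < limit:
        if r5 % 2 or not (U_BASE <= r5 <= U_BASE + U_SIZE - 4):
            break                          # chain is broken; inspect by hand
        off = base + (r5 - U_BASE)
        pcs.append(word(core, off + 2))    # (R5)+2: saved PC of the caller
        r5 = word(core, off)               # (R5): previous frame pointer
    return pcs
```

A typical use would be `core = open("/usr/sys/core", "rb").read()`, then `regs = saved_registers(core)` and `call_chain(core, regs)`. If the loop stops early, the chain is broken and a plausible R5, PC pair has to be found by eye, as the text describes.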
342 | .SH SEE ALSO | |
343 | clri(1), icheck(1), dcheck(1), boot(8) |