Commit | Line | Data |
---|---|---|
e0244b21 BJ |
1 | .TH CRASH 8 VAX-11 |
2 | .UC | |
3 | .tr | | |
4 | .SH NAME | |
5 | crash \- what to do when the system crashes | |
6 | .SH DESCRIPTION | |
7 | This section gives at least a few clues about how to proceed if the | |
8 | system crashes. | |
9 | It can't pretend to be complete. | |
10 | .PP | |
11 | .I "What to do first.||" | |
12 | (Someday the LSI-11 will do this automatically.) | |
13 | If the reason for the crash is not evident | |
14 | (see below for guidance on `evident') | |
15 | you may want to try to dump the system if you feel up to | |
16 | debugging. | |
17 | At the moment a dump can be taken only on magnetic tape. | |
18 | Before you do anything, if you plan to make a dump, be sure that a clean tape | |
19 | is mounted on the tape drive with the write ring in. | |
20 | .PP | |
21 | Write the date and time on the console log. | |
22 | Use the console commands to examine the registers, processor status longword (PSL), | |
23 | and the top several locations on the stack. | |
24 | A suggested command sequence, which is executed by the ``@DUMP'' | |
25 | console command script, is: | |
26 | .DS | |
27 | .nf | |
28 | E PSL<return> | |
29 | E R0/NE:F<return> | |
30 | E SP<return> | |
31 | E/V @ /NE:40<return> | |
32 | .fi | |
33 | .DE | |
34 | If hardware problems dictate a special set of commands be executed when | |
35 | the system crashes, a sequence of commands can be saved using the console | |
36 | command ``LINK'' to be reexecuted with ``PERFORM'' (which can be | |
37 | abbreviated ``P''). | |
38 | If a dump is to be taken on magnetic tape (this is a good idea | |
39 | in almost any case where the cause of the crash is not immediately obvious) | |
40 | then the following commands will (should) be executed: | |
41 | .DS | |
42 | .nf | |
43 | D PSL 0<return> | |
44 | D PC 80000200<return> | |
45 | C<return> | |
46 | .fi | |
47 | .DE | |
48 | These commands are actually part of the standard ``@DUMP'' script. | |
49 | This should write a copy of all of memory | |
50 | on the tape, followed by two EOF marks. | |
51 | Caution: | |
52 | Any error is taken to mean the end of memory has been reached. | |
53 | This means that you must be sure the ring is in, | |
54 | the tape is ready, and the tape is clean and new. | |
55 | .PP | |
56 | If there are not 40 locations active on the kernel stack when the | |
57 | procedure is begun, then the console may begin to print error diagnostics. | |
58 | You can stop this by hitting ``^C'' (control-C), and then give the | |
59 | last three commands above. | |
60 | .PP | |
61 | If the dump fails, you can try again, | |
62 | but some of the registers will be lost. | |
63 | See below for what to do with the tape. | |
64 | .PP | |
65 | .I "How to bring it back up.||" | |
66 | To restart after a crash, follow the directions in | |
67 | .IR bproc (8); | |
68 | if the virtual memory subsystem is suspected as the cause of the crash, | |
69 | then a version of the system other than ``vmunix'' should be booted; | |
70 | this version of the system will leave the paging areas temporarily intact | |
71 | for use by the post-mortem analysis program | |
72 | .I analyze. | |
73 | On Ernie Co-vax at UCB, the backup system is ``unix''. | |
74 | When this system is running, check its root file system (currently | |
75 | .I /dev/rrm0a | |
76 | but this is likely to change), and then read the core tape into the | |
77 | file | |
78 | .I /vmcore, | |
79 | with the command: | |
80 | .IP | |
81 | cp /dev/mt0 /vmcore | |
82 | .LP | |
83 | With the system still in single-user mode, run the analysis program | |
84 | .I analyze, | |
85 | i.e.: | |
86 | .IP | |
87 | analyze \-s /dev/drum /vmcore /vmunix | |
88 | .LP | |
89 | and save the output. | |
90 | Be sure to boot up | |
91 | ``vmunix'' | |
92 | before coming up multi-user. | |
93 | .PP | |
94 | Do a | |
95 | .I sync, | |
96 | and proceed to check and fix all file systems, | |
97 | performing a | |
98 | .I dcheck | |
99 | and | |
100 | .IR icheck (1) | |
101 | on all file systems which could have been in use at the time | |
102 | of the crash. | |
103 | If any serious file system problems are found, they should be repaired. | |
104 | When you are satisfied with the health of your disks, | |
105 | log out by typing an EOT (<control-D>). | |
106 | The command sequence in /etc/rc will be executed and the system will | |
107 | be in multi-user mode. | |
108 | .PP | |
109 | To even boot \s8UNIX\s10 at all, | |
110 | three files (and the directories leading to them) | |
111 | must be intact. | |
112 | First, | |
113 | the initialization program | |
114 | .I /etc/init.vm | |
115 | must be present and executable. | |
116 | If it is not, | |
117 | then the cpu will loop at location 0x13. | |
118 | For | |
119 | .I init.vm | |
120 | to work correctly, | |
121 | .I /dev/console | |
122 | and | |
123 | .I /bin/sh | |
124 | must be present. | |
125 | If either does not exist, | |
126 | the symptom is best described | |
127 | as thrashing. | |
128 | .I Init | |
129 | will go into a | |
130 | .I fork/exec | |
131 | loop trying to create a | |
132 | Shell with proper standard input and output. | |
133 | .PP | |
134 | If you cannot get the system to boot, | |
135 | a runnable system must be obtained from | |
136 | a backup medium. | |
137 | The root file system may then be doctored as | |
138 | a mounted file system as described below. | |
139 | If there are any problems with the root | |
140 | file system, | |
141 | it is probably prudent to go to a | |
142 | backup system to avoid working on a | |
143 | mounted file system. | |
144 | .PP | |
145 | .I "Repairing disks.||" | |
146 | The first rule to keep in mind is that an addled disk | |
147 | should be treated gently; | |
148 | it shouldn't be mounted unless necessary, | |
149 | and if it is very valuable yet | |
150 | in quite bad shape, perhaps it should be dumped before | |
151 | trying surgery on it. | |
152 | This is an area where experience and informed courage count for much. | |
153 | .PP | |
154 | The problems reported by | |
155 | .I icheck | |
156 | typically fall into two kinds. | |
157 | There can be | |
158 | problems with the free list: | |
159 | duplicates in the free list, or free blocks also in files. | |
160 | These can be cured easily with an | |
161 | .I "icheck \-s." | |
162 | If the same block appears in more than one file | |
163 | or if a file contains bad blocks, | |
164 | the files should be deleted, and the free list reconstructed. | |
165 | The best way to delete such a file is to use | |
166 | .IR clri (1), | |
167 | then remove its directory entries with | |
168 | .IR rm (1). | |
169 | (Do not use | |
170 | .IR mv (1).) | |
171 | If any of the affected files is really precious, | |
172 | you can try to copy it to another device | |
173 | first. | |
174 | .PP | |
175 | .I Dcheck | |
176 | may report files which | |
177 | have more directory entries than links. | |
178 | Such situations are potentially dangerous; | |
179 | .I clri | |
180 | discusses a special case of the problem. | |
181 | All the directory entries for the file should be removed. | |
182 | If on the other hand there are more links than directory entries, | |
183 | there is no danger of spreading infection, but merely some disk space | |
184 | that is lost for use. | |
185 | It is sufficient to copy the file (if it has any entries and is useful) | |
186 | then use | |
187 | .I clri | |
188 | on its inode and remove any directory | |
189 | entries that do exist. | |
190 | .PP | |
191 | Finally, | |
192 | there may be inodes reported by | |
193 | .I dcheck | |
194 | that have 0 links and 0 entries. | |
195 | These occur on the root device when the system is stopped | |
196 | with pipes open, and on other file systems when the system | |
197 | stops with files that have been deleted while still open. | |
198 | A | |
199 | .I clri | |
200 | will free the inode, and an | |
201 | .I "icheck -s" | |
202 | will | |
203 | recover any missing blocks. | |
204 | .PP | |
205 | .I "Why did it crash?||" | |
206 | UNIX types a message | |
207 | on the console typewriter when it voluntarily crashes. | |
208 | Here are some of the possible panic messages, | |
209 | with enough information to give | |
210 | at least some hope of a remedy. | |
211 | The message has the form `panic: ...', | |
212 | `Trap from kernel mode', or `ILL I/E VEC' | |
213 | (possibly accompanied by other information). | |
214 | In rare cases the system will ``panic'' but the console message | |
215 | will not appear; if this happens, you can trace the message easily | |
216 | through the variable | |
217 | .I panicstr | |
218 | in the system. | |
219 | Left unstated in all cases | |
220 | is the possibility that hardware or software | |
221 | error produced the message in some unexpected way. | |
222 | .HP 5 | |
223 | blkdev | |
224 | .br | |
225 | The | |
226 | .I getblk | |
227 | routine was called with a nonexistent major device as argument. | |
228 | Definitely hardware or software error. | |
229 | .HP 5 | |
230 | devtab | |
231 | .br | |
232 | Null device table entry for the major device used as argument to | |
233 | .I getblk. | |
234 | Definitely hardware or software error. | |
235 | .HP 5 | |
236 | iinit | |
237 | .br | |
238 | An I/O error reading the super-block for the root file system | |
239 | during initialization. | |
240 | .HP 5 | |
241 | out of inodes | |
242 | .br | |
243 | A mounted file system has no more i-nodes when creating a file. | |
244 | Sorry, the device isn't available; | |
245 | the | |
246 | .I icheck | |
247 | should tell you. | |
248 | .HP 5 | |
249 | no fs | |
250 | .br | |
251 | A device has disappeared from the mounted-device table. | |
252 | Definitely hardware or software error. | |
253 | .HP 5 | |
254 | no imt | |
255 | .br | |
256 | Like `no fs', but produced elsewhere. | |
257 | .HP 5 | |
258 | no inodes | |
259 | .br | |
260 | The in-memory inode table is full. | |
261 | Try increasing NINODE in param.h. | |
262 | Shouldn't be a panic, just a user error. | |
263 | .HP 5 | |
264 | IO error in swap | |
265 | .br | |
266 | An unrecoverable I/O error during a swap. | |
267 | Really shouldn't be a panic. | |
268 | .HP 5 | |
269 | out of swap | |
270 | .br | |
271 | A program needs to be swapped out, and there is no more swap space. | |
272 | It has to be increased. | |
273 | This really shouldn't be a panic. | |
274 | .HP 5 | |
275 | out of text | |
276 | .br | |
277 | A pure procedure program is being executed, | |
278 | and the table for such things is full. | |
279 | This shouldn't be a panic. | |
280 | .HP 5 | |
281 | trap from kernel mode | |
282 | .br | |
283 | An unexpected trap has occurred within the system. | |
284 | The trap type is the top word of the stack, | |
285 | which can be examined with the console commands. | |
286 | The trap types are: | |
287 | .TP 10 | |
288 | 0 | |
289 | reserved addressing mode | |
290 | .br | |
291 | .ns | |
292 | .TP 10 | |
293 | 1 | |
294 | privileged instruction | |
295 | .br | |
296 | .ns | |
297 | .TP 10 | |
298 | 2 | |
299 | BPT | |
300 | .br | |
301 | .ns | |
302 | .TP 10 | |
303 | 3 | |
304 | XFC | |
305 | .br | |
306 | .ns | |
307 | .TP 10 | |
308 | 4 | |
309 | reserved operand | |
310 | .br | |
311 | .ns | |
312 | .TP 10 | |
313 | 5 | |
314 | CHMK (system call) | |
315 | .br | |
316 | .ns | |
317 | .TP 10 | |
318 | 6 | |
319 | arithmetic trap | |
320 | .br | |
321 | .ns | |
322 | .TP 10 | |
323 | 7 | |
324 | reschedule trap (software level 3) | |
325 | .br | |
326 | .ns | |
327 | .TP 10 | |
328 | 8 | |
329 | segmentation fault | |
330 | .br | |
331 | .ns | |
332 | .TP 10 | |
333 | 9 | |
334 | protection fault | |
335 | .br | |
336 | .ns | |
337 | .TP 10 | |
338 | 10 | |
339 | trace pending (TP bit) | |
340 | .HP 5 | |
341 | ILL I/E VEC, HALTED AT xx | |
342 | .br | |
343 | An illegal interrupt or exception has occurred. The possible addresses are: | |
344 | .ns | |
345 | .TP 10 | |
346 | 4 | |
347 | machine check (hardware error). | |
348 | .br | |
349 | .ns | |
350 | .TP 10 | |
351 | 8 | |
352 | kernel stack not valid | |
353 | .br | |
354 | .ns | |
355 | .TP 10 | |
356 | C | |
357 | power failure | |
358 | .PP | |
359 | In some of these cases it is | |
360 | possible for octal 20 to be added into the trap type; | |
361 | this indicates that the processor was in user mode when the trap occurred. | |
362 | If you wish to examine the stack after such a trap, | |
363 | either dump the system, or use the console to examine memory; | |
364 | the required address mapping is described below. | |
365 | .PP | |
366 | A large number of other panics can occur when internal consistency | |
367 | checks in the paging subsystem fail. These can be caused by hardware | |
368 | (e.g. if disk or tape problems cause a data structure to be mutilated) | |
369 | but are most often caused by software problems. | |
370 | Refer to a system listing to locate these and other panics not discussed above. | |
371 | .PP | |
372 | .I "Interpreting dumps.||" | |
373 | All file system problems | |
374 | should be taken care of before attempting to analyze dumps. | |
375 | As mentioned above, the dump tape should be read into the file | |
376 | .IR /vmcore ; | |
377 | .IR cp (1) | |
378 | will do. | |
379 | At this point, you should execute | |
380 | .I "ps \-alxk" | |
381 | and | |
382 | .I who | |
383 | to print the process table and the users who were on | |
384 | at the time of the crash. | |
385 | Use | |
386 | .IR adb (1) | |
387 | to examine | |
388 | .IR /vmcore . | |
389 | The location | |
390 | .I dumpstack\-80000000 | |
391 | is the bottom of a stack onto which were pushed the stack pointer | |
392 | .BR sp , | |
393 | .B PCBB | |
394 | (containing the physical address of a | |
395 | .IR u_area ), | |
396 | .BR MAPEN , | |
397 | .BR IPL , | |
398 | and registers | |
399 | .BR r13 \- r0 | |
400 | (in that order). | |
401 | .BR r13 (fp) | |
402 | is the system frame pointer and the stack is used in standard | |
403 | .B calls | |
404 | format. Use | |
405 | .IR adb (1) | |
406 | to get a reverse calling order. | |
407 | In most cases this procedure will give | |
408 | an idea of what is wrong. | |
409 | A more complete discussion | |
410 | of system debugging is impossible here. | |
411 | See, however, | |
412 | .IR analyze (1) | |
413 | for some more hints. | |
414 | .SH "SEE ALSO" | |
415 | analyze(1m), clri(1), icheck(1), dcheck(1), bproc(8) | |
416 | .br | |
417 | .I "VAX 11/780 System Maintenance Guide" | |
418 | for more information about machine checks. | |
419 | .SH BUGS |