| 1 | .TH CRASH 8 |
| 2 | .SH NAME |
| 3 | crash \- what to do when the system crashes |
| 4 | .SH DESCRIPTION |
| 5 | This section gives at least a few clues about how to proceed if the |
| 6 | system crashes. |
| 7 | It can't pretend to be complete. |
| 8 | .PP |
| 9 | .I Bringing it back up. |
| 10 | If the reason for the crash is not evident |
| 11 | (see below for guidance on `evident') |
| 12 | you may want to try to dump the system if you feel up to |
| 13 | debugging. |
| 14 | At the moment a dump can be taken only on magtape. |
| 15 | With a tape mounted and ready, |
| 16 | stop the machine, load address 44, and start. |
| 17 | This should write a copy of all of core |
| 18 | on the tape with an EOF mark. |
| 19 | Caution: |
| 20 | Any error is taken to mean the end of core has been reached. |
| 21 | This means that you must be sure the ring is in, |
| 22 | the tape is ready, and the tape is clean and new. |
| 23 | If the dump fails, you can try again, |
| 24 | but some of the registers will be lost. |
| 25 | See below for what to do with the tape. |
| 26 | .PP |
| 27 | In restarting after a crash, |
| 28 | always bring up the system single-user. |
| 29 | This is accomplished by following the directions in |
| 30 | .IR boot (8) |
| 31 | as modified for your particular installation; |
| 32 | a single-user system is indicated by having a particular value |
| 33 | in the switches (173030 unless you've changed |
| 34 | .IR init ) |
| 35 | as the system starts executing. |
| 36 | When it is running, |
| 37 | perform a |
| 38 | .I dcheck |
| 39 | and |
| 40 | .IR icheck (1) |
| 41 | on all file systems which could have been in use at the time |
| 42 | of the crash. |
| 43 | If any serious file system problems are found, they should be repaired. |
| 44 | When you are satisfied with the health of your disks, |
| 45 | check and set the date if necessary, |
| 46 | then come up multi-user. |
| 47 | This is most easily accomplished by changing the |
| 48 | single-user value in the switches to something else, |
| 49 | then logging out |
| 50 | by typing an EOT. |
| 51 | .PP |
| 52 | To even boot \s8UNIX\s10 at all, |
| 53 | three files (and the directories leading to them) |
| 54 | must be intact. |
| 55 | First, |
| 56 | the initialization program |
| 57 | .I /etc/init |
| 58 | must be present and executable. |
| 59 | If it is not, |
| 60 | the CPU will loop in user mode at location 6. |
| 61 | For |
| 62 | .I init |
| 63 | to work correctly, |
| 64 | .I /dev/tty8 |
| 65 | and |
| 66 | .I /bin/sh |
| 67 | must be present. |
| 68 | If either does not exist, |
| 69 | the symptom is best described |
| 70 | as thrashing. |
| 71 | .I Init |
| 72 | will go into a |
| 73 | .I fork/exec |
| 74 | loop trying to create a |
| 75 | Shell with proper standard input and output. |
| 76 | .PP |
| 77 | If you cannot get the system to boot, |
| 78 | a runnable system must be obtained from |
| 79 | a backup medium. |
| 80 | The root file system may then be doctored as |
| 81 | a mounted file system as described below. |
| 82 | If there are any problems with the root |
| 83 | file system, |
| 84 | it is probably prudent to go to a |
| 85 | backup system to avoid working on a |
| 86 | mounted file system. |
| 87 | .PP |
| 88 | .I Repairing disks. |
| 89 | The first rule to keep in mind is that an addled disk |
| 90 | should be treated gently; |
| 91 | it shouldn't be mounted unless necessary, |
| 92 | and if it is very valuable yet |
| 93 | in quite bad shape, perhaps it should be dumped before |
| 94 | trying surgery on it. |
| 95 | This is an area where experience and informed courage count for much. |
| 96 | .PP |
| 97 | The problems reported by |
| 98 | .I icheck |
| 99 | typically fall into two kinds. |
| 100 | There can be |
| 101 | problems with the free list: |
| 102 | duplicates in the free list, or free blocks also in files. |
| 103 | These can be cured easily with an |
| 104 | .IR "icheck \-s" . |
| 105 | If the same block appears in more than one file |
| 106 | or if a file contains bad blocks, |
| 107 | the files should be deleted, and the free list reconstructed. |
| 108 | The best way to delete such a file is to use |
| 109 | .IR clri (1), |
| 110 | then remove its directory entries. |
| 111 | If any of the affected files is really precious, |
| 112 | you can try to copy it to another device |
| 113 | first. |
| 114 | .PP |
| 115 | .I Dcheck |
| 116 | may report files which |
| 117 | have more directory entries than links. |
| 118 | Such situations are potentially dangerous; |
| 119 | .I clri |
| 120 | discusses a special case of the problem. |
| 121 | All the directory entries for the file should be removed. |
| 122 | If on the other hand there are more links than directory entries, |
| 123 | there is no danger of spreading infection; some disk space |
| 124 | is merely lost for use. |
| 125 | It is sufficient to copy the file (if it has any entries and is useful) |
| 126 | then use |
| 127 | .I clri |
| 128 | on its inode and remove any directory |
| 129 | entries that do exist. |
| 130 | .PP |
| 131 | Finally, |
| 132 | there may be inodes reported by |
| 133 | .I dcheck |
| 134 | that have 0 links and 0 entries. |
| 135 | These occur on the root device when the system is stopped |
| 136 | with pipes open, and on other file systems when the system |
| 137 | stops with files that have been deleted while still open. |
| 138 | A |
| 139 | .I clri |
| 140 | will free the inode, and an |
| 141 | .I icheck \-s |
| 142 | will |
| 143 | recover any missing blocks. |
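.PP
The three
.I dcheck
cases above can be summarized as a small decision table.
The following Python sketch is only an illustration;
the function name and the wording of the returned advice are hypothetical
and are not part of
.I dcheck
or any other system program.

```python
def dcheck_action(entries, links):
    """Suggest a repair for one inode reported by dcheck.

    entries: number of directory entries found for the inode
    links:   link count stored in the inode
    (Hypothetical helper; the advice paraphrases the text above.)
    """
    if entries == 0 and links == 0:
        # Pipes or deleted-but-open files left over from the crash.
        return "clri the inode, then icheck -s to recover missing blocks"
    if entries > links:
        # Potentially dangerous: more names than the inode admits to.
        return "remove all directory entries for the file"
    if links > entries:
        # Harmless, but the space stays lost until the inode is cleared.
        return "copy the file if useful, clri the inode, remove remaining entries"
    return "consistent, nothing to do"


if __name__ == "__main__":
    for pair in [(0, 0), (3, 2), (1, 2), (1, 1)]:
        print(pair, "->", dcheck_action(*pair))
```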
| 144 | .PP |
| 145 | .I Why did it crash? |
| 146 | UNIX types a message |
| 147 | on the console typewriter when it voluntarily crashes. |
| 148 | Here is the current list of such messages, |
| 149 | with enough information to provide |
| 150 | a hope at least of the remedy. |
| 151 | The message has the form `panic: ...', |
| 152 | possibly accompanied by other information. |
| 153 | Left unstated in all cases |
| 154 | is the possibility that hardware or software |
| 155 | error produced the message in some unexpected way. |
| 156 | .HP 5 |
| 157 | blkdev |
| 158 | .br |
| 159 | The |
| 160 | .I getblk |
| 161 | routine was called with a nonexistent major device as argument. |
| 162 | Definitely hardware or software error. |
| 163 | .HP 5 |
| 164 | devtab |
| 165 | .br |
| 166 | Null device table entry for the major device used as argument to |
| 167 | .IR getblk . |
| 168 | Definitely hardware or software error. |
| 169 | .HP 5 |
| 170 | iinit |
| 171 | .br |
| 172 | An I/O error reading the super-block for the root file system |
| 173 | during initialization. |
| 174 | .HP 5 |
| 175 | out of inodes |
| 176 | .br |
| 177 | A mounted file system has no more inodes when creating a file. |
| 178 | Sorry, the name of the device isn't available in the message; |
| 179 | an |
| 180 | .I icheck |
| 181 | should tell you which file system is affected. |
| 182 | .HP 5 |
| 183 | no fs |
| 184 | .br |
| 185 | A device has disappeared from the mounted-device table. |
| 186 | Definitely hardware or software error. |
| 187 | .HP 5 |
| 188 | no imt |
| 189 | .br |
| 190 | Like `no fs', but produced elsewhere. |
| 191 | .HP 5 |
| 192 | no inodes |
| 193 | .br |
| 194 | The in-core inode table is full. |
| 195 | Try increasing NINODE in param.h. |
| 196 | Shouldn't be a panic, just a user error. |
| 197 | .HP 5 |
| 198 | no clock |
| 199 | .br |
| 200 | During initialization, |
| 201 | neither the line nor programmable clock was found to exist. |
| 202 | .HP 5 |
| 203 | swap error |
| 204 | .br |
| 205 | An unrecoverable I/O error during a swap. |
| 206 | Really shouldn't be a panic, |
| 207 | but it is hard to fix. |
| 208 | .HP 5 |
| 209 | unlink \- iget |
| 210 | .br |
| 211 | The directory containing a file being deleted can't be found. |
| 212 | Hardware or software. |
| 213 | .HP 5 |
| 214 | out of swap space |
| 215 | .br |
| 216 | A program needs to be swapped out, and there is no more swap space. |
| 217 | It has to be increased. |
| 218 | This really shouldn't be a panic, but there is no easy fix. |
| 219 | .HP 5 |
| 220 | out of text |
| 221 | .br |
| 222 | A pure procedure program is being executed, |
| 223 | and the table for such things is full. |
| 224 | This shouldn't be a panic. |
| 225 | .HP 5 |
| 226 | trap |
| 227 | .br |
| 228 | An unexpected trap has occurred within the system. |
| 229 | This is accompanied by three numbers: |
| 230 | a `ka6', which is the contents of the segmentation |
| 231 | register for the area in which the system's stack is kept; |
| 232 | `aps', which is the location where the hardware stored |
| 233 | the program status word during the trap; |
| 234 | and a `trap type' which encodes |
| 235 | which trap occurred. |
| 236 | The trap types are: |
| 237 | .TP 10 |
| 238 | 0 |
| 239 | bus error |
| 240 | .br |
| 241 | .ns |
| 242 | .TP 10 |
| 243 | 1 |
| 244 | illegal instruction |
| 245 | .br |
| 246 | .ns |
| 247 | .TP 10 |
| 248 | 2 |
| 249 | BPT/trace |
| 250 | .br |
| 251 | .ns |
| 252 | .TP 10 |
| 253 | 3 |
| 254 | IOT |
| 255 | .br |
| 256 | .ns |
| 257 | .TP 10 |
| 258 | 4 |
| 259 | power fail |
| 260 | .br |
| 261 | .ns |
| 262 | .TP 10 |
| 263 | 5 |
| 264 | EMT |
| 265 | .br |
| 266 | .ns |
| 267 | .TP 10 |
| 268 | 6 |
| 269 | recursive system call (TRAP instruction) |
| 270 | .br |
| 271 | .ns |
| 272 | .TP 10 |
| 273 | 7 |
| 274 | 11/70 cache parity, or programmed interrupt |
| 275 | .br |
| 276 | .ns |
| 277 | .TP 10 |
| 278 | 10 |
| 279 | floating point trap |
| 280 | .br |
| 281 | .ns |
| 282 | .TP 10 |
| 283 | 11 |
| 284 | segmentation violation |
| 285 | .PP |
| 286 | In some of these cases it is |
| 287 | possible for octal 20 to be added into the trap type; |
| 288 | this indicates that the processor was in user mode when the trap occurred. |
| 289 | If you wish to examine the stack after such a trap, |
| 290 | either dump the system, or use the console switches to examine core; |
| 291 | the required address mapping is described below. |
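.PP
As an aid to reading the trap type printed with the panic,
here is a minimal Python sketch that splits the octal value into
the user-mode bit and the trap name.
The table is copied from the list above;
the names TRAP_TYPES, USER_MODE and decode_trap are made up for this example.

```python
# Trap types from the table above, keyed by their octal value.
TRAP_TYPES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(trap_type):
    """Return (trap name, was_user_mode) for a trap type value."""
    user = bool(trap_type & USER_MODE)
    name = TRAP_TYPES.get(trap_type & ~USER_MODE, "unknown")
    return name, user


if __name__ == "__main__":
    # The console prints the value in octal, so parse it as base 8.
    print(decode_trap(int("31", 8)))
```

A value typed on the console as 31 is thus read as a segmentation violation
taken in user mode, while 11 is the same trap taken inside the system.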
| 292 | .PP |
| 293 | .I Interpreting dumps. |
| 294 | All file system problems |
| 295 | should be taken care of before attempting to look at dumps. |
| 296 | The dump should be read into the file |
| 297 | .IR /usr/sys/core ; |
| 298 | .IR cp (1) |
| 299 | will do. |
| 300 | At this point, you should execute |
| 301 | .I ps \-alxk |
| 302 | and |
| 303 | .I who |
| 304 | to print the process table and the users who were on |
| 305 | at the time of the crash. |
| 306 | You should dump |
| 307 | .RI ( od (1)) |
| 308 | the first 30 bytes of |
| 309 | .IR /usr/sys/core . |
| 310 | Starting at location 4, |
| 311 | the registers R0, R1, R2, R3, R4, R5, SP |
| 312 | and KDSA6 (KISA6 for 11/40s) are stored. |
| 313 | If the dump had to be restarted, |
| 314 | R0 will not be correct. |
| 315 | Next, take the value of KA6 (location 022(8) in the dump) |
| 316 | multiplied by 0100(8) and dump 01000(8) words starting from there. |
| 317 | This is the per-process data associated with the process running |
| 318 | at the time of the crash. |
| 319 | Relabel |
| 320 | the addresses 140000 to 141776. |
| 321 | R5 is C's frame or display pointer. |
| 322 | Stored at (R5) is the old R5 pointing to the previous |
| 323 | stack frame. |
| 324 | At (R5)+2 |
| 325 | is the saved PC of the calling procedure. |
| 326 | Trace |
| 327 | this calling chain until |
| 328 | you obtain an R5 value of 141756, which |
| 329 | is where the user's R5 is stored. |
| 330 | If the chain is broken, |
| 331 | you have to look for a plausible |
| 332 | R5, PC pair and continue from there. |
| 333 | Each PC should be looked up in the system's name list |
| 334 | using |
| 335 | .IR adb (1) |
| 336 | and its `:' command, |
| 337 | to get a reverse calling order. |
| 338 | In most cases this procedure will give |
| 339 | an idea of what is wrong. |
| 340 | A more complete discussion |
| 341 | of system debugging is impossible here. |
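.PP
The hand procedure above can also be sketched in Python for a dump
that has been copied to a machine with a modern interpreter.
This is an illustration under stated assumptions, not a supported tool:
it assumes the file holds raw core as 16-bit little-endian words,
that the per-process area spans the relabeled addresses 140000 to 141776,
and all function and constant names are invented for the example.
It only collects the saved PC values;
looking them up in the name list is still done with
.IR adb (1).

```python
import struct

U_BASE = 0o140000        # address the per-process area is relabeled to
U_END = 0o141776         # last word address of that area, per the text
USER_R5_SLOT = 0o141756  # where the user's R5 is stored; end of the chain


def read_registers(core):
    """Registers saved from location 4 on: R0..R5, SP, then KA6 at 022."""
    names = ["r0", "r1", "r2", "r3", "r4", "r5", "sp", "ka6"]
    return dict(zip(names, struct.unpack_from("<8H", core, 4)))


def u_area(core, ka6):
    """Slice out the per-process data, which starts at KA6 * 0100."""
    start = ka6 * 0o100
    return core[start:start + (U_END + 2 - U_BASE)]


def word_at(u, addr):
    """Fetch one word by its relabeled address."""
    return struct.unpack_from("<H", u, addr - U_BASE)[0]


def call_chain(u, r5, limit=64):
    """Follow the R5 chain and return the saved PCs, innermost call first."""
    pcs = []
    while r5 != USER_R5_SLOT and len(pcs) < limit:
        in_range = U_BASE <= r5 <= U_END - 2 and r5 % 2 == 0
        if not in_range or r5 + 4 - U_BASE > len(u):
            break  # broken chain: a plausible R5, PC pair must be found by eye
        pcs.append(word_at(u, r5 + 2))  # saved PC of the calling procedure
        r5 = word_at(u, r5)             # old R5, the previous stack frame
    return pcs


if __name__ == "__main__":
    with open("/usr/sys/core", "rb") as f:
        core = f.read()
    regs = read_registers(core)
    for pc in call_chain(u_area(core, regs["ka6"]), regs["r5"]):
        print("%06o" % pc)
```

The limit argument is only a guard against a chain that loops on itself.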
| 342 | .SH SEE ALSO |
| 343 | clri(1), icheck(1), dcheck(1), boot(8) |