Commit | Line | Data |
---|---|---|
f9cde827 OB |
1 | .RP |
2 | .if n .ds dg + | |
3 | .if t .ds dg \(dg | |
4 | .if n .ds dd = | |
5 | .if t .ds dd \(dd | |
6 | .if n .ds _ _ | |
7 | .if t .ds _ \d\(mi\u | |
8 | .TL | |
9 | Data Structures Added in the | |
10 | .br | |
11 | Berkeley Virtual Memory Extensions to the | |
12 | .br | |
13 | UNIX\(dg Operating System\(dd | |
14 | .AU | |
15 | \u\*:\dOzalp Babao\*~glu | |
16 | .AU | |
17 | William Joy | |
18 | .AI | |
19 | Computer Science Division | |
20 | Department of Electrical Engineering and Computer Science | |
21 | University of California, Berkeley | |
22 | Berkeley, California 94720 | |
23 | .AB | |
24 | .FS | |
25 | \*(dg UNIX is a Trademark of Bell Laboratories | |
26 | .FE | |
27 | .FS | |
28 | \*(dd Work supported by the National Science Foundation under grants | |
29 | MCS 7807291, MCS 7824618, MCS 7407644-A03 and by an IBM Graduate Fellowship | |
30 | to the second author. | |
31 | .FE | |
32 | .FS | |
33 | * VAX is a trademark of Digital Equipment Corporation. | |
34 | .FE | |
35 | .PP | |
36 | This document describes the major new data structures that have | |
37 | been introduced to the Version 7 \s-2UNIX\s0 system to support | |
38 | demand paging on the | |
39 | \s-2VAX\u*\d-11/780\s0. The reader should be basically familiar | |
40 | with the \s-2VAX\s0 architecture, as described in the | |
41 | .I "VAX-11/780 Hardware Handbook." | |
42 | .PP | |
43 | When relevant, along with the data structures, we present | |
44 | related system constants and macro definitions, and | |
45 | some indications of how the data is used by the system algorithms. | |
46 | We also describe the extensions/deletions that have been made to some | |
47 | of the existing data structures. | |
48 | Full description of the paging system algorithms, however, is not | |
49 | given here. | |
50 | .AE | |
51 | .SH | |
52 | Introduction | |
53 | .PP | |
54 | The paging subsystem of the virtual memory extension to the system | |
55 | maintains four new, basic data structures: | |
56 | the | |
57 | .I "(system and process) page tables," | |
58 | the | |
59 | .I "kernel map," | |
60 | the | |
61 | .I "core map " | |
62 | and the | |
63 | .I "disk map." | |
64 | This document consists of a description of each of these data structures | |
65 | in turn. | |
66 | .br | |
67 | .ne .75i | |
68 | .sp .375i | |
69 | .SH | |
70 | PAGE TABLES | |
71 | .PP | |
72 | The format of the process page tables is defined in the system header | |
73 | file | |
74 | .B pte.h.\*(dg | |
75 | .FS | |
76 | \*(dg Copies of the system definitions (header) files related to the | |
77 | paging subsystem appear at the end of this document. | |
78 | .FE | |
79 | The basic form of the Page Table Entry (PTE) is dictated | |
80 | by the \s-2VAX-11/780\s0 architecture. | |
81 | Both the first level page table, | |
82 | known as | |
83 | .I Sysmap, | |
84 | and the second level per-process | |
85 | page tables consist of arrays of this structure. | |
86 | The paging system makes use of several bit fields which | |
87 | have no meaning to the hardware. | |
88 | The individual fields are: | |
89 | .IP \fBpg\*_prot\fR 16n | |
90 | The | |
91 | .I Protection | |
92 | Bits. Define the access mode of the corresponding page. | |
93 | Modes used by the paging system include PG\*_NOACC for invalid entries, | |
94 | PG\*_KR for the text of the kernel, PG\*_KW for the kernel data space, | |
95 | PG\*_URKR for text portions of user processes, PG\*_URKW for text portions | |
96 | of user processes during modification (old form \fIexec\fR, and \fIptrace\fR\|) | |
97 | of text images, and PG\*_UW for normal user data pages. | |
98 | .IP \fBpg\*_m\fR 16n | |
99 | The | |
100 | .I Modify | |
101 | Bit. Set by hardware as a result of a write access to the page. Examined | |
102 | and altered by the paging subsystem to find out if a page is | |
103 | .I dirty | |
104 | and has to be written back to disk before the page frame can be released. | |
105 | .IP \fBpg\*_swapm\fR 16n | |
106 | Indicates whether the page has been initialized, but never written to | |
107 | the swapping area. This bit is necessary because \fBpg\*_m\fR is normally | |
108 | or'ed with \fBpg\*_vreadm\fR to see if the page has ever been modified, | |
109 | and thus \fBpg\*_m\fR is unavailable to force a swap in this case. | |
110 | .IP \fBpg\*_vreadm\fR 16n | |
111 | Indicates whether the page has been modified since it was last initialized. | |
112 | (Initialization occurs during stack growth, growth due to | |
113 | .I break | |
114 | system calls, and as a result of | |
115 | .I exec | |
116 | and | |
117 | .I vread | |
118 | system calls.) | |
119 | For use by the | |
120 | .I vwrite | |
121 | system call, which looks at the inclusive-or of this and the | |
122 | \fBpg\*_m\fR bit. | |
123 | A | |
124 | .I vwrite | |
125 | also clears this bit. | |
126 | .IP \fBpg\*_fod\fR 16n | |
127 | The | |
128 | .I "Fill on Demand" | |
129 | Bit. Set only when the valid bit (described below) is reset, | |
130 | indicating that the page has not yet been initialized. When | |
131 | referenced, the page will either be zero filled, or filled with a block | |
132 | from the file system based on other fields described next. | |
133 | .IP \fBpg\*_fileno\fR 16n | |
134 | Meaningful only when the | |
135 | .I pg\*_fod | |
136 | bit is set. | |
137 | If the | |
138 | .I pg\*_fileno | |
139 | field has value | |
140 | PG\*_FZERO, then a reference to such a page results in the | |
141 | allocation and clearing of a page frame rather than a disk transfer. | |
142 | When stack or data segment growth | |
143 | occurs, the page table entries are initialized to fill on | |
144 | demand pages with PG\*_FZERO | |
145 | in their | |
146 | .I pg\*_fileno | |
147 | fields. | |
148 | The page is otherwise filled in from the file system, either from the inode | |
149 | associated with the currently executing process (if | |
150 | .I pg\*_fileno | |
151 | is PG\*_FTEXT), or from a file. | |
152 | .IP \fBpg\*_blkno\fR 16n | |
153 | Gives the block number, in a file system determined by the value of the | |
154 | .I pg\*_fileno | |
155 | field, from which the page is to be initialized. Note that this is the | |
156 | logical block number in the file system, | |
157 | not in the mapped object. Thus no mapping is required at page fault time | |
158 | to locate the actual disk block; the system simply uses the | |
159 | .I pg\*_fileno | |
160 | field to locate an inode and uses the | |
161 | .I i\*_dev | |
162 | field of that inode in a call to a device strategy routine. | |
163 | The size of this field (20 bits) limits the maximum size of a filesystem | |
164 | to 2\(ua20 blocks (1M blocks). | |
165 | .IP \fBpg\*_v\fR 16n | |
166 | The | |
167 | .I Valid | |
168 | bit. | |
169 | Set only when the | |
170 | .I pg\*_fod | |
171 | bit is reset. | |
172 | Indicates that the mapping established in the PTE is valid and can | |
173 | be used by the hardware for address translation. Access to the | |
174 | PTE when this bit is reset causes an | |
175 | .I "Address Translation Not Valid" | |
176 | fault and triggers the paging mechanism. | |
177 | If both the valid and fill on demand bits are reset, but the | |
178 | .I pg\*_pfnum | |
179 | field is non-zero, then the page is still in memory and can | |
180 | be reclaimed either from the loop or the free list | |
181 | without an I/O operation by simply | |
182 | revalidating the page, after possibly removing it from the free list. | |
183 | If the | |
184 | .I pg\*_fod | |
185 | bit is not set, and the | |
186 | .I pg\*_pfnum | |
187 | field is zero, then the page has to be retrieved from disk. | |
188 | Note that resetting the valid bit for pages which are still resident allows | |
189 | for software detection and recording of references to pages, simulating a | |
190 | .I reference | |
191 | bit, which the \s-2VAX\s0 hardware does not provide. | |
192 | .IP \fBpg\*_pfnum\fR 16n | |
193 | The | |
194 | .I "Page Frame Number." | |
195 | Meaningful only when the | |
196 | .I pg\*_fod | |
197 | bit is reset. | |
198 | If the page frame is valid, then this gives the physical page frame number | |
199 | that the virtual page is currently mapped to. If no physical page frame | |
200 | is currently allocated this field is zero | |
201 | (except in page table entries in | |
202 | .I Sysmap, | |
203 | where unused entries are not always cleared.) | |
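The way the valid bit, the fill on demand bit, and the page frame number combine can be summarized by a small decision function. What follows is only a sketch of the rules stated above, not the kernel's fault handler: the structure `pte_view`, the enumeration, the function name, and the numeric value given to PG_FZERO are all hypothetical, and the real field widths and bit positions (see pte.h) are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical value for illustration; the real PG_FZERO is in pte.h. */
#define PG_FZERO 0u

/* Simplified view of the PTE fields discussed above (no bit layout). */
struct pte_view {
    bool     pg_v;       /* valid bit */
    bool     pg_fod;     /* fill on demand bit */
    unsigned pg_fileno;  /* meaningful only when pg_fod is set */
    unsigned pg_pfnum;   /* meaningful only when pg_fod is reset */
};

enum pte_action {
    PTE_VALID,      /* hardware can translate; no fault occurs */
    PTE_ZERO_FILL,  /* allocate and clear a page frame */
    PTE_FILE_FILL,  /* fill from the file system using pg_blkno */
    PTE_RECLAIM,    /* page still resident: revalidate, no I/O */
    PTE_PAGE_IN     /* page must be retrieved from disk */
};

/* Decide what a reference to the page described by pte requires. */
enum pte_action pte_classify(const struct pte_view *pte)
{
    if (pte->pg_v)
        return PTE_VALID;
    if (pte->pg_fod)
        return pte->pg_fileno == PG_FZERO ? PTE_ZERO_FILL : PTE_FILE_FILL;
    return pte->pg_pfnum != 0 ? PTE_RECLAIM : PTE_PAGE_IN;
}
```

The order of the tests mirrors the text: pg\_fod is examined only when the valid bit is reset, and pg\_pfnum only when both bits are reset.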
204 | .SH | |
205 | System Page Tables | |
206 | .PP | |
207 | The first level page table | |
208 | .I Sysmap | |
209 | consists of a physically contiguous | |
210 | array of PTEs defined by the processor registers SBR | |
211 | (System Base Register), and SLR | |
212 | (System Length Register). | |
213 | SLR is loaded with the constant | |
214 | .I Syssize | |
215 | at system initialization and remains fixed thereafter. | |
216 | .PP | |
217 | The first four pages of the | |
218 | .I Sysmap | |
219 | map the kernel virtual memory from addresses 0x80000000 to the end of | |
220 | kernel data space onto the first pages of physical memory. Four pages | |
221 | is enough to map a kernel supporting a full load of memory | |
222 | on a \s-2VAX-11/780\s0. | |
223 | Immediately after the pages which map the kernel text and data are the | |
224 | entries which map the user structure of the current running process | |
225 | (the | |
226 | .I u.\& | |
227 | area.) | |
228 | The | |
229 | .I u.\& | |
230 | area is currently six pages long, with the first two of these pages mapping the | |
231 | .I user | |
232 | structure itself, and the other four pages mapping the per-process kernel | |
233 | stack.\*(dg | |
234 | .FS | |
235 | \*(dg Currently all six pages are allocated physical memory; it is planned | |
236 | that in the future, the third of these six pages will be made a | |
237 | read-only copy of | |
238 | .I zeropage. | |
239 | Since the stack is observed rarely to enter the third page | |
240 | this will leave a full page for unanticipated worst-case stack growth, | |
241 | and give a clean termination condition should the stack ever accidentally | |
242 | grow beyond three pages. | |
243 | .FE | |
244 | The position of the | |
245 | .I u.\& | |
246 | in | |
247 | .I Sysmap | |
248 | determines that it will live, in this system, | |
249 | at 0x80040000. | |
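The address quoted for the u.\& area follows from simple arithmetic, assuming 512-byte pages and 4-byte page table entries as on the VAX-11/780: four pages of Sysmap hold 512 entries, and each entry maps 512 bytes. The sketch below only checks that arithmetic; the constant and function names are illustrative and are not taken from the system headers.

```c
#include <stdint.h>

#define NBPG       512u                 /* bytes per VAX page */
#define PTE_BYTES  4u                   /* one PTE is a 32-bit longword */
#define NPTEPG     (NBPG / PTE_BYTES)   /* 128 PTEs per page */
#define KERNBASE   0x80000000u          /* start of system space */

/* Number of Sysmap entries contained in npages pages of Sysmap. */
uint32_t sysmap_entries_in_pages(uint32_t npages)
{
    return npages * NPTEPG;
}

/* Kernel virtual address mapped by Sysmap entry number idx. */
uint32_t sysmap_index_to_kva(uint32_t idx)
{
    return KERNBASE + idx * NBPG;
}
```

With these definitions, `sysmap_index_to_kva(sysmap_entries_in_pages(4))` is 0x80000000 + 512 * 512, which is 0x80040000.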
250 | .PP | |
251 | After the map entries reserved for the | |
252 | .I u.\& | |
253 | area are 16 words of system map reserved for utility usage. | |
254 | The | |
255 | .I copyseg | |
256 | routine uses one of these (CMAP2) while making a copy of one data page | |
257 | to map the destination page frame. | |
258 | This is necessary because at the point of copying (during the | |
259 | .I fork | |
260 | system call) the parent process is running, while the | |
261 | destination page is not in the parent's address space, but rather | |
262 | destined for the child's address space. Since the parent may | |
263 | fault on the source page during the copy, the contents of this | |
264 | map are saved in the software extension to the | |
265 | .I pcb | |
266 | during context switch. | |
267 | Other utilities are used by | |
268 | .I clearseg | |
269 | to map pages to be cleared in implementing zero fill on demand pages, | |
270 | and by the | |
271 | .B mem.c | |
272 | driver to map pages in | |
273 | .B /dev/mem | |
274 | when accessing physical memory. | |
275 | .PP | |
276 | The | |
277 | .I Sysmap | |
278 | continues with sets of entries for the UBA control and map registers, | |
279 | the physical device memory of a UNIBUS adaptor, and | |
280 | the control and map registers of up to three MASSBUS adaptors. | |
281 | Each of these consists of 16 page table entries, mapping 8K bytes. | |
282 | .PP | |
283 | Next, there is a set of map entries for manipulating | |
284 | .I u.\& | |
285 | structures other than the one of the current running process. For instance, | |
286 | the page out demon, which runs as process 2, needs access to the diskmap | |
287 | information of a process whose page is being written to the disk. | |
288 | To get access to this, it uses six entries in the | |
289 | .I Sysmap, | |
290 | known as | |
291 | .I Pushmap, | |
292 | to map this | |
293 | .I u.\& | |
294 | into kernel virtual memory at a virtual address given by | |
295 | .I pushutl. | |
296 | There are several other map/utl pairs: | |
297 | .I Swapmap | |
298 | and | |
299 | .I swaputl, | |
300 | .I Xswapmap | |
301 | and | |
302 | .I xswaputl, | |
303 | .I Xswap2map | |
304 | and | |
305 | .I xswap2utl, | |
306 | .I Forkmap | |
307 | and | |
308 | .I forkutl, | |
309 | .I Vfmap | |
310 | and | |
311 | .I vfutl. | |
312 | These are used in swapping and forking new processes. | |
313 | .PP | |
314 | The final portion of the | |
315 | .I Sysmap | |
316 | consists of a map/utl like pair | |
317 | .I Usrptmap/usrpt | |
318 | which is a resource from which the first level entries that map the page tables | |
319 | of all currently core-resident processes are allocated. This is a very important | |
320 | structure and will be described after we describe the basic structure | |
321 | of the page tables of a process. | |
322 | .SH | |
323 | Per-process page tables | |
324 | .PP | |
325 | Each process possesses three logical page tables: one to map each | |
326 | of the text, data and stack segments. Large portions of the system | |
327 | refer to page table entries in each of these segments by an index, with | |
328 | the first page of each segment being numbered 0. | |
329 | .PP | |
330 | For the \s-2VAX-11/780\s0 version of the system, these page | |
331 | tables are implemented by two physically distinct page tables, the | |
332 | .I "P0 Page Table," | |
333 | mapping the text and data segments, and the | |
334 | .I "P1 Page Table," | |
335 | mapping the stack segment. | |
336 | Within the P0 region, the text segment is mapped starting at virtual | |
337 | address 0 with the data segment following on the first page boundary | |
338 | after it.* | |
339 | .FS | |
340 | *Later versions of the system for the VAX-11/780 | |
341 | may align the data starting at a 64K byte boundary, i.e. each of the | |
342 | text, data and stack segments will use an integral number of first | |
343 | level (\fISysmap\fR\|) entries. There would then be a minimum of one page of | |
344 | page tables for each segment, and sharing of text page table pages | |
345 | will be made simple using the | |
346 | ability of the first level entries to point to common page table pages. | |
347 | .FE | |
348 | The stack segment, on the other hand, starts at the bottom of the P1 | |
349 | region and grows toward smaller virtual addresses. The constant USRSTACK | |
350 | corresponds to the address of the byte one beyond the user stack. | |
351 | The process page tables are contiguous in kernel virtual | |
352 | memory (KVM) and are situated with | |
353 | the P0 table followed by the P1 table such that the top of the first | |
354 | and the bottom of the second are aligned at page boundaries. Note that this | |
355 | results in a | |
356 | .I gap | |
357 | between the two page tables whose size does not normally exceed one page. | |
358 | .PP | |
359 | The size of the process' page tables (P0 + gap + P1) in pages is | |
360 | contained in the software extension to the pcb located at the top of | |
361 | the process' | |
362 | .I u.\& | |
363 | area (in | |
364 | .I u\*_pcb.pcb\*_szpt). | |
365 | This number is also duplicated in the | |
366 | .I proc | |
367 | structure field | |
368 | .I p\*_szpt. | |
369 | .PP | |
370 | Given | |
371 | .I x | |
372 | as the virtual size of a process, | |
373 | .B ctopt(x) | |
374 | results in the minimum number of pages of page tables required to map it. | |
375 | A process accesses its page tables through the descriptors | |
376 | P0BR, P0LR, P1BR, and P1LR. | |
377 | The per-process copies of these processor registers are contained | |
378 | in the pcb and are loaded and saved at process context switch time. | |
379 | A copy of the P0 region base register is contained in the | |
380 | .I proc | |
381 | structure field | |
382 | .I p\*_p0br. | |
383 | .PP | |
384 | Given the above description of the process layout in virtual memory, a | |
385 | pointer to a process and a page table entry, | |
386 | the | |
387 | .B isa?pte | |
388 | macros result in | |
389 | .I true | |
390 | if the PTE is within the respective segment of process | |
391 | .I p: | |
392 | .DS | |
393 | .ta 1.75i | |
394 | \fBisaspte(p, pte)\fR stack segment? | |
395 | \fBisatpte(p, pte)\fR text segment? | |
396 | \fBisadpte(p, pte)\fR data segment? | |
397 | .DE | |
398 | Conversion between segment page numbers and pointers to page | |
399 | table entries can be achieved by the following macros, where | |
400 | .I p | |
401 | is the process of interest, and | |
402 | .I i | |
403 | is the virtual page number within the segment (a non-negative integer). | |
404 | These are used in dealing with the | |
405 | .I "core map" | |
406 | where the page numbers are kept in this form for compactness. | |
407 | .DS | |
408 | .ta 1.75i | |
409 | \fBtptopte(p, i)\fR text page number to pte | |
410 | \fBdptopte(p, i)\fR data page number to pte | |
411 | \fBsptopte(p, i)\fR stack page number to pte | |
412 | .DE | |
413 | .PP | |
414 | The \s-2VAX\s0 hardware also supports a virtual page frame number. | |
415 | These begin at 0 for the first page of the P0 region and increase | |
416 | through the text and data regions. For the stack region they | |
417 | begin at the frame before | |
418 | .I "btop(USRSTACK)" | |
419 | and decrease. Note that the first stack page has a large (but positive) | |
420 | virtual page frame number. | |
421 | .PP | |
422 | Page frame numbers in the system are very machine dependent, and are | |
423 | referred to as ``v''s. The function | |
424 | .B "vtopte(p, v)" | |
425 | will take a | |
426 | .I v | |
427 | for a given process | |
428 | .I p | |
429 | and give back a pointer to the corresponding page table entry. | |
430 | The function | |
431 | .B "ptetov(p, pte)" | |
432 | performs the inverse operation. | |
433 | To decide which segment a pte is in, and to thereafter convert | |
434 | from pte's to segment indices and back, the following macros can be | |
435 | used: | |
436 | .LP | |
437 | .ID | |
438 | .nf | |
439 | .ta 1.75i | |
440 | \fBisatsv(p, v)\fR is v in the text segment of process p? | |
441 | \fBisadsv(p, v)\fR is v in the data segment of process p? | |
442 | \fBisassv(p, v)\fR is v in the stack segment of process p? | |
443 | \fBvtotp(p, v)\fR segment page number of page v, which is in text | |
444 | \fBvtodp(p, v)\fR segment page number of page v, which is in data | |
445 | \fBvtosp(p, v)\fR segment page number of page v, which is in stack | |
446 | \fBtptov(p, i)\fR v of i'th text page | |
447 | \fBdptov(p, i)\fR v of i'th data page | |
448 | \fBsptov(p, i)\fR v of i'th stack page | |
449 | \fBptetotp(p, pte)\fR pte to a text segment page number | |
450 | \fBptetodp(p, pte)\fR pte to a data segment page number | |
451 | \fBptetosp(p, pte)\fR pte to a stack segment page number | |
452 | \fBtptopte(p, i)\fR pte pointer for i'th text page | |
453 | \fBdptopte(p, i)\fR pte pointer for i'th data page | |
454 | \fBsptopte(p, i)\fR pte pointer for i'th stack page | |
455 | .fi | |
456 | .DE | |
457 | The functions | |
458 | .I vtopte | |
459 | and | |
460 | .I ptetov | |
461 | have trivial definitions in terms of these macros. | |
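Given the layout described above (text from virtual page 0, data on the next page boundary, stack growing down from the frame before btop(USRSTACK)), the conversions between segment page numbers and ``v''s reduce to additions and subtractions. The following is a sketch under stated assumptions rather than a copy of the system macros: `struct proc_sizes` is a hypothetical stand-in for the size fields of the proc structure, the value chosen for USRSTACK is illustrative only, and the macros are written as functions.

```c
#define PGSHIFT   9                           /* 512-byte VAX pages */
#define btop(x)   ((unsigned long)(x) >> PGSHIFT)
#define USRSTACK  0x80000000UL                /* illustrative value only */

/* Hypothetical stand-in for the segment sizes (in pages) of a process. */
struct proc_sizes {
    unsigned long p_tsize, p_dsize, p_ssize;
};

/* Segment page number to v. */
unsigned long tptov(const struct proc_sizes *p, unsigned long i) { (void)p; return i; }
unsigned long dptov(const struct proc_sizes *p, unsigned long i) { return p->p_tsize + i; }
unsigned long sptov(const struct proc_sizes *p, unsigned long i) { (void)p; return btop(USRSTACK) - 1 - i; }

/* v to segment page number; the caller must know the segment. */
unsigned long vtotp(const struct proc_sizes *p, unsigned long v) { (void)p; return v; }
unsigned long vtodp(const struct proc_sizes *p, unsigned long v) { return v - p->p_tsize; }
unsigned long vtosp(const struct proc_sizes *p, unsigned long v) { (void)p; return btop(USRSTACK) - 1 - v; }

/* Segment membership tests. */
int isatsv(const struct proc_sizes *p, unsigned long v) { return v < p->p_tsize; }
int isadsv(const struct proc_sizes *p, unsigned long v)
{
    return v >= p->p_tsize && v < p->p_tsize + p->p_dsize;
}
int isassv(const struct proc_sizes *p, unsigned long v)
{
    return v >= btop(USRSTACK) - p->p_ssize && v < btop(USRSTACK);
}
```

Note that stack page 0 maps to the largest v of the stack segment, so increasing stack page numbers correspond to decreasing virtual page frame numbers.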
462 | .SH | |
463 | Page table entries as integers | |
464 | .PP | |
465 | In a few places in the kernel, it is convenient to deal with page | |
466 | table entry fields | |
467 | .I "en masse." | |
468 | In this case we cast pointers to | |
469 | page table entries to be pointers to integers and deal with the | |
470 | bits of the page table entry in parallel. | |
471 | Thus | |
472 | .DS | |
473 | \fBstruct pte\fR *pte; | |
474 | ||
475 | *(\fBint\fR *)pte = PG\*_UW; | |
476 | .DE | |
477 | clears a page table entry to have only an access field allowing user writes, | |
478 | by referencing it as an integer. | |
479 | When accessing the page table entry in this way, we use the manifest | |
480 | constant declarations in the | |
481 | .I pte.h | |
482 | file which give us the appropriate bits. | |
483 | .br | |
484 | .ne .75i | |
485 | .sp .375i | |
486 | .SH | |
487 | THE KERNEL MAP | |
488 | .PP | |
489 | Defined in | |
490 | .I map.h. | |
491 | The kernel map is used to manage the portion of kernel virtual memory | |
492 | (KVM) allocated to mapping page tables of those processes that are | |
493 | currently loaded. On the \s-2VAX-11/780\s0 this involves managing page table | |
494 | entries in the first level page table, in the | |
495 | .I "Usrptmap/usrpt" | |
496 | portion of the | |
497 | .I Sysmap. | |
498 | The size of the KVM devoted to mapping resident process page tables is | |
499 | set by USRPTSIZE in number of Sysmap entries. Note that this allows | |
500 | the mapping of a maximum of 64K * USRPTSIZE bytes of resident user | |
501 | virtual address space. The maximum can be achieved only if there is no | |
502 | fragmentation in the allocation. | |
503 | .PP | |
504 | KVM required to map the page tables of a process | |
505 | that is being swapped in is allocated according to a | |
506 | .I "first fit" | |
507 | policy through a call to the standard system resource allocator | |
508 | .I malloc. | |
509 | Once a process is swapped in, its page tables remain stationary | |
510 | in KVM unless the process grows such that it | |
511 | requires additional pages of page tables. At that time, the process' | |
512 | page tables are moved to a new region of KVM that is large enough to | |
513 | contain them. | |
514 | Upon swap out, the process deallocates KVM required to map its page | |
515 | tables through a call to the standard resource release routine | |
516 | .I mfree.\*(dg | |
517 | .FS | |
518 | \*(dg Due to the way in which | |
519 | .I malloc | |
520 | works, the KVM mapped by the first entry in | |
521 | .I Usrptmap | |
522 | (index 0) is not used. | |
523 | .FE | |
524 | .PP | |
525 | There are two macros which can be used for conversion | |
526 | between | |
527 | .I Usrptmap | |
528 | indices and kernel virtual addresses. | |
529 | .DS | |
530 | .ta 2.0i | |
531 | \fBkmxtob(a)\fR converts \fIUsrptmap\fR index to virtual address | |
532 | \fBbtokmx(b)\fR converts virtual address to \fIUsrptmap\fR index | |
533 | .DE | |
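Because each Usrptmap entry maps one 512-byte page of kernel virtual memory, and that page holds 128 page table entries, the two conversions and the 64K figure quoted above can be sketched as follows. This is an illustration only: the base address USRPT\_BASE is hypothetical (the real value is the address of usrpt), and the function names carry a `_sketch` suffix to mark them as not being the system macros.

```c
#include <stdint.h>

#define NBPG        512u          /* bytes per VAX page */
#define NPTEPG      128u          /* PTEs held by one page of page table */
#define USRPT_BASE  0x80100000u   /* hypothetical kernel address of usrpt */

/* Usrptmap index to kernel virtual address of the page it maps. */
uint32_t kmxtob_sketch(uint32_t a)
{
    return USRPT_BASE + a * NBPG;
}

/* Kernel virtual address within usrpt back to its Usrptmap index. */
uint32_t btokmx_sketch(uint32_t b)
{
    return (b - USRPT_BASE) / NBPG;
}

/* Bytes of user virtual space that n Usrptmap entries can map. */
uint32_t user_bytes_mapped(uint32_t n)
{
    return n * NPTEPG * NBPG;
}
```

One entry therefore accounts for 128 * 512 = 64K bytes of user virtual address space, which is where the bound of 64K * USRPTSIZE comes from.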
534 | .br | |
535 | .ne .75i | |
536 | .sp .375i | |
537 | .SH | |
538 | CORE MAP | |
539 | .PP | |
540 | The core map structure is defined in | |
541 | .B cmap.h. | |
542 | Each entry of core map contains eight bytes of information | |
543 | consisting of the following fields: | |
544 | .IP \fBc\*_next\fR 16n | |
545 | Index to the next entry in the free list. | |
546 | The size of this field (14 bits) limits the number of | |
547 | page frames that can exist in the main memory to 16K (8M bytes). | |
548 | .IP \fBc\*_prev\fR 16n | |
549 | Index to the previous entry in the free list. | |
550 | .IP \fBc\*_page\fR 16n | |
551 | Virtual page number within the respective segment (text, data or stack). | |
552 | The size of this field (17 bits) limits the virtual size of | |
553 | a process segment to 128K pages (i.e., 64M bytes). | |
554 | .IP \fBc\*_ndx\fR 16n | |
555 | Index of the proc structure that owns this page frame. In the case of | |
556 | shared text, the index is that of the corresponding text structure. | |
557 | The size of this field (10 bits) limits the number of | |
558 | slots in the | |
559 | .I proc | |
560 | and | |
561 | .I text | |
562 | structures to 1024. | |
563 | .IP \fBc\*_intrans\fR 16n | |
564 | The intransit | |
565 | bit. Important only for shared text segment pages, but set for private | |
566 | data pages for purposes of post-mortem analysis. | |
567 | Indicates that a page-in operation for the corresponding page has already | |
568 | been initiated by another process. Causes the faulting process to | |
569 | enter a wait state until awakened by the process that initiated the transfer. | |
570 | (This is logically part of the \fBc\*_flag\fR field, and is separate because | |
571 | of alignment considerations in the coremap.) | |
572 | .IP \fBc\*_flag\fR 16n | |
573 | 8 bits of flags. | |
574 | .PP | |
575 | The meanings of the flags are: | |
576 | .IP \fBMWANT\fR 16n | |
577 | The page frame has a process sleeping on it. The process to free it | |
578 | should perform a wakeup on the page. | |
579 | .IP \fBMLOCK\fR 16n | |
580 | Lock bit. The page frame is involved in raw I/O or page I/O and | |
581 | consequently unavailable for replacement. | |
582 | .IP \fBMFREE\fR 16n | |
583 | Free list bit. The page frame is currently in the free list. | |
584 | .IP \fBMGONE\fR 16n | |
585 | Indicates that the virtual page corresponding to this page frame has | |
586 | vanished due to either having been deallocated (contraction of the data | |
587 | segment) or swapped out. | |
588 | The page will eventually be freed by the process which is holding it, usually | |
589 | the page-out demon. | |
590 | .IP \fBMSYS\fR 16n | |
591 | System page bit. The page frame has been allocated to a user process' | |
592 | .I u.\& | |
593 | area or page tables and therefore unavailable for replacement. | |
594 | .IP \fBMSTACK\fR 16n | |
595 | Page frame belongs to a stack segment. | |
596 | .IP \fBMDATA\fR 16n | |
597 | Page frame belongs to a data segment. | |
598 | .IP \fBMTEXT\fR 16n | |
599 | Page frame belongs to a shared text segment. | |
600 | .PP | |
601 | The core map is the central data base for the paging subsystem. | |
602 | It consists of an array of these structures, one entry for each page | |
603 | frame in the main memory excluding those allocated to kernel text and data. | |
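The limits quoted for the individual core map fields all follow from the field widths and the 512-byte page size. The sketch below merely reproduces that arithmetic; the function names are illustrative, and the width of c\_prev is assumed to equal that of c\_next since the text does not state it.

```c
#define NBPG 512UL   /* bytes per VAX page */

/* Number of distinct values representable in a field of the given width. */
unsigned long field_limit(unsigned bits)
{
    return 1UL << bits;
}

/* c_next is 14 bits wide: bound on main memory size in bytes. */
unsigned long max_memory_bytes(void)
{
    return field_limit(14) * NBPG;
}

/* c_page is 17 bits wide: bound on the size of one segment in bytes. */
unsigned long max_segment_bytes(void)
{
    return field_limit(17) * NBPG;
}

/* c_ndx is 10 bits wide: bound on proc and text table slots. */
unsigned long max_table_slots(void)
{
    return field_limit(10);
}

/* Total width of one entry; c_prev assumed to be 14 bits like c_next. */
unsigned cmap_entry_bits(void)
{
    return 14 + 14 + 17 + 10 + 1 + 8;
}
```

The widths sum to 64 bits, consistent with the eight bytes per entry stated above.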
604 | .PP | |
605 | The memory free list, managed by | |
606 | .B "memall" | |
607 | and | |
608 | .B "memfree" | |
609 | is created by doubly linking entries in core map. The reverse link is | |
610 | provided to speed up page reclaims from the free list which have to | |
611 | perform an unlink operation. | |
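The benefit of the reverse link can be seen in a small sketch of an index-linked list. This is not the code of memall or memfree: the table size, the use of entry 0 as the list head, and all names below are assumptions made for the illustration.

```c
#define NCMAP   8    /* illustrative table size */
#define CMHEAD  0    /* entry 0 serves as list head (assumption) */

/* Only the two link fields of a core map entry are modelled here. */
struct cmap_sketch {
    unsigned c_next;   /* index of next entry in the free list */
    unsigned c_prev;   /* index of previous entry in the free list */
};

struct cmap_sketch cmap_tbl[NCMAP];

/* Make the free list empty: the head points at itself. */
void freelist_init(void)
{
    cmap_tbl[CMHEAD].c_next = CMHEAD;
    cmap_tbl[CMHEAD].c_prev = CMHEAD;
}

/* Append entry i at the tail of the free list. */
void freelist_append(unsigned i)
{
    unsigned tail = cmap_tbl[CMHEAD].c_prev;

    cmap_tbl[i].c_next = CMHEAD;
    cmap_tbl[i].c_prev = tail;
    cmap_tbl[tail].c_next = i;
    cmap_tbl[CMHEAD].c_prev = i;
}

/* Remove entry i from wherever it is in the list, without a search. */
void freelist_unlink(unsigned i)
{
    unsigned next = cmap_tbl[i].c_next;
    unsigned prev = cmap_tbl[i].c_prev;

    cmap_tbl[prev].c_next = next;
    cmap_tbl[next].c_prev = prev;
}
```

With only a forward link, a reclaim would have to walk the list from the head to find the predecessor of the entry being removed; with c\_prev the unlink takes constant time.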
612 | .PP | |
613 | There is a pair of macros for converting between core map indices and page | |
614 | frame numbers, since no core map entries exist for the system. | |
615 | .DS | |
616 | .ta 1.75i | |
617 | \fBcmtopg(x)\fR converts core map index x to a page frame number | |
618 | \fBpgtocm(x)\fR converts a page frame number to a core map index | |
619 | .DE | |
620 | The macros for manipulating segment page numbers, which we described | |
621 | in the section on page tables above, are very useful when dealing with | |
622 | the core map. | |
623 | .SH | |
624 | DISK MAP | |
625 | .PP | |
626 | Defined in | |
627 | .I dmap.h. | |
628 | The disk map is a per-process data structure that is kept in the process | |
629 | .I u.\& | |
630 | area. The fields are: | |
631 | .IP \fBdm\*_size\fR 16n | |
632 | The amount of disk space allocated that is actually used by the segment. | |
633 | .IP \fBdm\*_alloc\fR 16n | |
634 | The amount of physical disk space that is allocated to the segment. | |
635 | .IP \fBdm\*_dmap\fR 16n | |
636 | An array of disk block numbers marking the beginning of disk areas that | |
637 | constitute the segment disk image. | |
638 | .PP | |
639 | The four instances of the disk map | |
640 | allow the mapping of process virtual addresses to disk addresses | |
641 | for the parent data, parent stack, child data, and child stack segments. | |
642 | The two child maps are used during the | |
643 | .I fork | |
644 | system call serving to make both the parent and the child disk images | |
645 | accessible simultaneously.\*(dg | |
646 | .FS | |
647 | \*(dg These could actually be located on the kernel stack, rather than in the | |
648 | .I u.\& | |
649 | area. | |
650 | .FE | |
651 | .PP | |
652 | Each entry in the disk map array contains a disk block number (relative to | |
653 | the beginning of the swap area) that marks the beginning of a disk area | |
654 | mapping the corresponding segment of virtual space. | |
655 | The initial creation of the segment results in the allocation of DMMIN | |
656 | blocks (512 bytes each), pointed to by the first disk map entry. | |
657 | These disk blocks map precisely to the first DMMIN virtual | |
658 | pages of the corresponding segment. | |
659 | Subsequent growth of the segment beyond this size results in the | |
660 | allocation of 2*DMMIN blocks mapping segment virtual page numbers | |
661 | DMMIN through 3*DMMIN-1. This doubling process continues until | |
662 | the segment reaches a size such that the next disk area allocated has size | |
663 | .I DMMAX | |
664 | blocks. Beyond this point, the segment receives DMMAX additional blocks | |
665 | should it require them. Limiting the exponential growth at this size | |
666 | is in an effort to reduce severe disk fragmentation that would otherwise | |
667 | result for very large segments. | |
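The doubling scheme described above determines, for any segment virtual page number, which disk map entry covers it and how far into that disk area the page lies. A minimal sketch of that lookup follows. The values given to DMMIN and DMMAX are hypothetical, chosen only to make the example concrete, and the structure and function names are not taken from the system; the sketch also assumes DMMAX is DMMIN times a power of two.

```c
#define DMMIN 16    /* hypothetical size of the first disk area, in blocks */
#define DMMAX 128   /* hypothetical cap on the size of a disk area */

/* Where a segment virtual page lives in the disk map. */
struct dm_loc {
    int entry;    /* index into the disk map array */
    int offset;   /* block offset within that entry's disk area */
    int size;     /* size in blocks of that entry's disk area */
};

/* Locate segment virtual page vpage (one page per 512-byte block). */
struct dm_loc dm_locate(int vpage)
{
    struct dm_loc loc = { 0, vpage, DMMIN };

    /* Skip whole areas, doubling the area size until it reaches DMMAX. */
    while (loc.offset >= loc.size) {
        loc.offset -= loc.size;
        if (loc.size < DMMAX)
            loc.size *= 2;
        loc.entry++;
    }
    return loc;
}
```

With these values the areas have sizes 16, 32, 64, 128, 128, and so on, so pages DMMIN through 3*DMMIN-1 (16 through 47) fall in entry 1, as stated in the text. The swap block of a page would then be the block number stored in that entry plus the offset.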
668 | .PP | |
669 | Note that increasing entries in the array map increasing segment virtual | |
670 | page numbers. However, in the case of the stack segment, this actually | |
671 | means mapping | |
672 | .I decreasing | |
673 | process virtual page numbers. Also note that since a shared text segment | |
674 | is static in size, its disk image is allocated in one contiguous | |
675 | block that is described by the text structure fields | |
676 | .I x\*_daddr | |
677 | and | |
678 | .I x\*_size. | |
679 | .PP | |
680 | The maximum size (in pages) to which a segment can grow is determined by | |
681 | MAXTSIZ, MAXDSIZ, or MAXSSIZ for text, data, or stack segment | |
682 | respectively. | |
683 | Since the procedures that deal with the disk map panic on segment | |
684 | length overrun, setting the maximum size of a segment to a value | |
685 | greater than that which can be mapped by its disk map can lead to a | |
686 | system failure. | |
687 | To avoid such a situation, the disk map parameters should be set so that | |
688 | possible segment overgrowth will be detected at an earlier time in | |
689 | the life of a process by | |
690 | .B "chksize." | |
691 | Note that the maximum segment size that can be mapped by disk map can be | |
692 | increased through raising any one or more of the constants DMMIN, | |
693 | DMMAX, and NDMAP. | |
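How the disk map constants bound the mappable segment size can be sketched by summing the sizes of the successive disk areas. The function below is an illustration under the doubling rule described earlier, not code from the system; its name and the sample parameter values used in the note that follows are hypothetical.

```c
/*
 * Largest segment size, in pages, that a disk map with ndmap entries
 * can describe when the first area holds dmmin blocks and area sizes
 * double until they reach dmmax.
 */
long dm_capacity(int dmmin, int dmmax, int ndmap)
{
    long total = 0;
    int size = dmmin;

    for (int i = 0; i < ndmap; i++) {
        total += size;
        if (size < dmmax)
            size *= 2;
    }
    return total;
}
```

For example, with a first area of 16 blocks, a cap of 128 blocks, and 16 entries, the capacity is 16 + 32 + 64 + 13 * 128 = 1776 pages. A maximum segment size set above the figure computed this way is the situation the text warns against.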
694 | .br | |
695 | .ne .75i | |
696 | .sp .375i | |
697 | .SH | |
698 | INSTRUMENTATION RELATED STRUCTURES | |
699 | .PP | |
700 | Currently, the system maintains counters for various paging related | |
701 | events that are accumulated and averaged at discrete points in time. | |
702 | The basic structure as defined in | |
703 | .I vm.h | |
704 | has the following fields: | |
705 | .IP \fBv\*_swpin\fR 16n | |
706 | Process swap ins. | |
707 | .IP \fBv\*_swpout\fR 16n | |
708 | Process swap outs. | |
709 | .IP \fBv\*_pswpin\fR 16n | |
710 | Pages swapped in. | |
711 | .IP \fBv\*_pswpout\fR 16n | |
712 | Pages swapped out. | |
713 | .IP \fBv\*_pgin\fR 16n | |
714 | Page faults requiring disk I/O. | |
715 | .IP \fBv\*_pgout\fR 16n | |
716 | Dirty page writes. | |
717 | .IP \fBv\*_intrans\fR 16n | |
718 | Page faults on shared text segment pages that were found to be intransit. | |
719 | .IP \fBv\*_pgrec\fR 16n | |
720 | Page faults that were serviced by reclaiming the page from memory. | |
721 | .IP \fBv\*_exfod\fR 16n | |
722 | Fill on demand from file system of executable pages (text or data from | |
723 | demand initialized executables.) | |
724 | .IP \fBv\*_zfod\fR 16n | |
725 | Fill on demand type page faults which filled zeros. | |
726 | .IP \fBv\*_vrfod\fR 16n | |
727 | Fill on demand from file systems of pages mapped by | |
728 | .I vread. | |
729 | .IP \fBv\*_nexfod\fR 16n | |
730 | Number of pages set up for fill on demand from executed files. | |
731 | .IP \fBv\*_nzfod\fR 16n | |
732 | Number of pages set up for zero fill on demand. | |
733 | .IP \fBv\*_nvrfod\fR 16n | |
734 | Number of pages set up for fill on demand with | |
735 | .I vread. | |
736 | .IP \fBv\*_pgfrec\fR 16n | |
737 | Pages reclaimed from the free list. | |
738 | .IP \fBv\*_faults\fR 16n | |
739 | Address translation faults, in any of the above categories. | |
740 | .IP \fBv\*_scan\fR 16n | |
741 | Page frames examined by the page demon. | |
742 | .IP \fBv\*_rev\fR 16n | |
743 | Revolutions around the loop by the page demon. | |
744 | .IP \fBv\*_dfree\fR 16n | |
745 | Pages freed by the page demon. | |
746 | .IP \fBv\*_swtch\fR 16n | |
747 | CPU context switches. | |
748 | .PP | |
749 | The three instances of this structure, named | |
750 | .I cnt, | |
751 | .I rate, | |
752 | and | |
753 | .I sum, | |
754 | serve the following purposes: | |
755 | .IP \fBcnt\fR 16n | |
756 | Incremental counters for the above events. | |
757 | .IP \fBrate\fR 16n | |
758 | Moving averages for the above events, updated at various intervals, | |
759 | each an integral number of clock ticks. | |
760 | The relevant macro for this operation is | |
761 | .B "ave(smooth, cnt, time)" | |
762 | which averages the incremental count | |
763 | .I cnt | |
764 | into | |
765 | .I smooth | |
766 | with aging factor | |
767 | .I time. | |
768 | .IP \fBsum\fR 16n | |
769 | Accumulated totals for the above events since reboot. | |
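The relationship between the three instances can be sketched in C. This is a minimal illustration, not the system's code: the structure is cut down to two of the fields listed above, the body of the ave macro is an assumed weighted-average form (the text gives only its name and arguments), and the routine vmmeter_tick is a hypothetical name for whatever code runs at each averaging interval.

```c
/* Illustrative subset of the counters; the real structure has more fields. */
struct vmmeter_sketch {
	long v_pgin;	/* page faults requiring disk I/O */
	long v_faults;	/* address translation faults */
};

/*
 * Assumed form of the averaging macro: the old average keeps a weight
 * of (time - 1)/time and the new count contributes 1/time.
 */
#define ave(smooth, cnt, time) \
	((smooth) = (((time) - 1) * (smooth) + (cnt)) / (time))

static struct vmmeter_sketch cnt, rate, sum;

/*
 * Hypothetical periodic routine: fold the incremental counters into
 * the moving averages and the running totals, then reset them.
 */
static void
vmmeter_tick(int aging)
{
	ave(rate.v_pgin, cnt.v_pgin, aging);
	ave(rate.v_faults, cnt.v_faults, aging);
	sum.v_pgin += cnt.v_pgin;
	sum.v_faults += cnt.v_faults;
	cnt.v_pgin = 0;
	cnt.v_faults = 0;
}
```

Under these assumptions, with an aging factor of 5, an interval count of 10 moves an average of zero to 2, and a following idle interval decays it to 1 (integer division).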
770 | .br | |
771 | .ne .75i | |
772 | .sp .375i | |
773 | .SH | |
774 | EXISTING DATA STRUCTURES | |
775 | .PP | |
776 | Here we describe fields within existing data structures that have | |
777 | been newly introduced or have taken a new meaning. | |
778 | .SH | |
779 | The Process Structure | |
780 | .IP \fBp\*_slptime\fR 16n | |
781 | Clock ticks since last sleep. | |
782 | .IP \fBp\*_szpt\fR 16n | |
783 | Number of pages of page table. This field is a copy of the | |
784 | .I pcb\*_szpt | |
785 | field in the pcb structure. | |
786 | .IP \fBp\*_tsize\fR 16n | |
787 | Text segment size in pages. This is a copy of the | |
788 | .I x\*_size | |
789 | field in the text structure. | |
790 | .IP \fBp\*_dsize\fR 16n | |
791 | Data segment size in pages. | |
792 | .IP \fBp\*_ssize\fR 16n | |
793 | Stack segment size in pages. | |
794 | .IP \fBp\*_rssize\fR 16n | |
795 | The current private segment (data + stack) | |
796 | .I "resident set size" | |
797 | for the process. The resident set is defined as the set of pages | |
798 | owned by the process that are either valid or reclaimable but not in the | |
799 | free list. | |
800 | .IP \fBp\*_swrss\fR 16n | |
801 | The size of the resident set at time of last swap out. | |
802 | .IP \fBp\*_p0br\fR 16n | |
803 | Pointer to the base of the P0 region page table. This is a copy of the | |
804 | .I pcb\*_p0br | |
805 | field in the pcb structure. | |
806 | .IP \fBp\*_xlink\fR 16n | |
807 | Pointer to another proc structure that is currently loaded and linked | |
808 | to the same text segment. The head of the linked list of such processes | |
809 | is contained in the text structure field | |
810 | .I x\*_caddr. | |
811 | Since the shared text portion of the process page tables is duplicated | |
812 | for each resident process attached to the same text segment, modifications | |
813 | to any one are reflected in all of them by sequentially updating the | |
814 | page table of each process that is on this linked list.\*(dg | |
815 | .FS | |
816 | \*(dg Used slightly differently when otherwise unused during | |
817 | .I vfork, | |
818 | see | |
819 | .B SNOVM | |
820 | below. | |
821 | .FE | |
822 | .IP \fBp\*_poip\fR 16n | |
823 | Count of page outs in progress on this process. If non-zero, | |
824 | prevents the process from being swapped in. | |
825 | .IP \fBp\*_faults\fR 16n | |
826 | Incremental number of page faults taken by the process that resulted in | |
827 | disk I/O. | |
828 | .IP \fBp\*_aveflt\fR 16n | |
829 | Moving average of above field. | |
830 | .IP \fBp\*_ndx\fR 16n | |
831 | Index of the process slot on behalf of which memory is to be allocated. | |
832 | During | |
833 | .I vfork, | |
834 | the memory of a process will be given to a child, but the reverse | |
835 | entries in | |
836 | .I cmap | |
837 | must still point to the original process | |
838 | so that the reverse links will point there when the | |
839 | .I vfork | |
840 | completes. This field thus indicates the original owner of the current | |
841 | process' virtual memory. | |
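The use of p_xlink to keep duplicated text page tables consistent can be sketched as a simple list walk. The sketch below is an assumption-laden illustration: the structures are reduced to the fields named in the text, the array p_textpte is a hypothetical stand-in for the text portion of a process page table, and text_pte_update is an invented name.

```c
#include <stddef.h>

#define NTEXTPTE 4	/* hypothetical number of text pages in this sketch */

struct proc_sketch {
	struct proc_sketch *p_xlink;	/* next loaded process sharing the text */
	int p_textpte[NTEXTPTE];	/* stand-in for the text page table entries */
};

struct text_sketch {
	struct proc_sketch *x_caddr;	/* head of the list of loaded processes */
};

/*
 * Apply one change to a text page table entry in every process on the
 * x_caddr list.  Returns the number of page tables updated.
 */
static int
text_pte_update(struct text_sketch *xp, int vpn, int pte)
{
	struct proc_sketch *p;
	int n = 0;

	if (vpn < 0 || vpn >= NTEXTPTE)
		return 0;
	for (p = xp->x_caddr; p != NULL; p = p->p_xlink) {
		p->p_textpte[vpn] = pte;
		n++;
	}
	return n;
}
```

The cost of an update grows with the number of resident processes attached to the text segment, which is the price of duplicating the text portion of the page table in each of them.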
842 | .PP | |
843 | The new bits in the | |
844 | .I p\*_flag | |
845 | field are: | |
846 | .IP \fBSSYS\fR 16n | |
847 | The swapper or the page demon process. | |
848 | .IP \fBSLOCK\fR 16n | |
849 | Process being swapped out. | |
850 | .IP \fBSSWAP\fR 16n | |
851 | Context to be restored from | |
852 | .I u\*_ssave | |
853 | upon resume. | |
854 | .IP \fBSPAGE\fR 16n | |
855 | Process in page wait state. | |
856 | .IP \fBSKEEP\fR 16n | |
857 | Prevents process from being swapped out. Set while the text segment is | |
858 | read from the inode during exec and during process duplication in fork. | |
859 | .IP \fBSDLYU\fR 16n | |
860 | Delayed unlock of pages. Causes the pages of the process that are faulted | |
861 | in to remain locked, thus ineligible for replacement, until explicitly | |
862 | unlocked by the process. | |
863 | .IP \fBSWEXIT\fR 16n | |
864 | Process working on | |
865 | .I exit. | |
866 | .IP \fBSVFORK\fR 16n | |
867 | Indicates that this process is the child in a | |
868 | .I vfork | |
869 | context; i.e., that the virtual memory being used by this process actually | |
870 | belongs to another process. | |
871 | .IP \fBSVFDONE\fR 16n | |
872 | A handshaking flag for | |
873 | .I vfork. | |
874 | .IP \fBSNOVM\fR 16n | |
875 | The parent of a | |
876 | .I vfork. | |
877 | The process has no virtual memory during this time. | |
878 | While this bit is set, the | |
879 | .I p\*_xlink | |
880 | field points to the process to which the memory was given. | |
881 | .SH | |
882 | The Text Structure | |
883 | .PP | |
884 | The new fields in the text structure are: | |
885 | .IP \fBx\*_caddr\fR 16n | |
886 | Points to the head of the linked list of proc structures of processes | |
887 | that are currently loaded and attached to this text segment. | |
888 | .IP \fBx\*_rssize\fR 16n | |
889 | The resident set size for this text segment. | |
890 | .IP \fBx\*_swrss\fR 16n | |
891 | The resident set size for this text segment at the time of last swap out. | |
892 | .IP \fBx\*_poip\fR 16n | |
893 | Count of page outs in progress on this text segment. If non-zero, | |
894 | prevents the process from being swapped in. | |
895 | .SH | |
896 | The User Area Structure | |
897 | .PP | |
898 | The per-process user area contains the | |
899 | .I u.\& | |
900 | structure as well as the kernel stack. It is mapped to a fixed kernel virtual | |
901 | address (starting at 0x80040000) at process context switch. The user area | |
902 | is swapped to and from disk as a separate entity and is pointed to by | |
903 | the proc structure field | |
904 | .I p\*_swaddr | |
905 | when not resident. | |
906 | Six pages (UPAGES) are allocated for the process' user area and kernel stack; | |
907 | since a VAX page is 512 bytes, | |
908 | the base of the kernel stack for a process is thus 0x80040c00. | |
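The kernel stack base quoted above is the user area base plus UPAGES pages. A minimal sketch of the arithmetic follows, assuming the 512 byte VAX page size; the names NBPG and UAREA_BASE and the function kstack_base are used here for illustration only.

```c
#define NBPG		512		/* assumed VAX page size in bytes */
#define UPAGES		6		/* pages of user area plus kernel stack */
#define UAREA_BASE	0x80040000UL	/* fixed kernel virtual address of the user area */

/*
 * Address just past the last user area page, where the kernel stack
 * begins: 6 * 512 = 3072 = 0xc00 bytes above the base.
 */
static unsigned long
kstack_base(void)
{
	return UAREA_BASE + (unsigned long)UPAGES * NBPG;
}
```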
909 | .PP | |
910 | The new fields that have been added to the | |
911 | .I u.\& | |
912 | structure are the following: | |
913 | .IP \fBu\*_pcb.pcb\*_cmap2\fR 16n | |
914 | .br | |
915 | Contains a copy of the Sysmap entry CMAP2 | |
916 | at context switch time. This kernel virtual address space mapping is made | |
917 | part of the process context because of the way the process duplication | |
918 | code that implements fork operates. Briefly, process duplication is accomplished | |
919 | by copying from the parent process' virtual address space to the child's | |
920 | virtual address space, which is mapped into kernel virtual memory through | |
921 | CMAP2. Since this can result in faulting in the parent's address space, | |
922 | thus causing a block and context switch, the mapping of the child memory | |
923 | in the kernel must be saved and restored before the process can resume. | |
924 | .IP \fBu\*_nswap\fR 16n | |
925 | Number of times the process has been swapped. Not yet maintained. | |
926 | .IP \fBu\*_majorflt\fR 16n | |
927 | Number of faults taken by the process that resulted in disk I/O. | |
928 | .IP \fBu\*_cnswap\fR 16n | |
929 | Number of times the children of this process have been swapped. Not | |
930 | yet maintained. | |
931 | .IP \fBu\*_cmajorflt\fR 16n | |
932 | Number of faults taken by the children of this process that resulted in | |
933 | disk I/O. Not yet maintained. | |
934 | .IP \fBu\*_minorflt\fR 16n | |
935 | Number of faults taken by the process that were reclaims. | |
936 | .IP \fBu\*_dmap\fR 16n | |
937 | The disk map for the data segment. | |
938 | .IP \fBu\*_smap\fR 16n | |
939 | The disk map for the stack segment. | |
940 | .IP \fBu\*_cdmap\fR 16n | |
941 | The disk map for the child's data segment to be used during fork. | |
942 | .IP \fBu\*_csmap\fR 16n | |
943 | The disk map for the child's stack segment to be used during fork. | |
944 | .IP \fBu\*_stklim\fR 16n | |
945 | Limit of maximum stack growth. To be varied through system calls. | |
946 | Currently not implemented. | |
947 | .IP \fBu\*_wantcore\fR 16n | |
948 | Flag to cause core dump even if the process is very large. Set by a | |
949 | system call. Currently not implemented. | |
950 | .IP \fBu\*_vrpages\fR 16n | |
951 | An array with an element for each file descriptor. Gives the number | |
952 | of fill on demand page table entries which have this file as their | |
953 | .B pg\*_fileno. | |
954 | If the count is non-zero, then the file cannot be closed, either by | |
955 | .I close | |
956 | or implicitly by | |
957 | .I dup2. | |
958 | .SH | |
959 | The Inode Structure | |
960 | .PP | |
961 | One field was added to the inode structure to support the | |
962 | .I vread | |
963 | system call: | |
964 | .IP \fBi\*_vfdcnt\fR 16n | |
965 | This counts the number of file descriptors (fd's) that have pages mapping | |
966 | this file with | |
967 | .I vread. | |
968 | If the count is non-zero, then the file cannot be truncated. |
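The two vread counts, u_vrpages in the user structure and i_vfdcnt in the inode, can be read as a pair of reference counts. The sketch below is one plausible way they interact, not the system's code: the structures are reduced to the named fields, the descriptor limit NOFILE_SKETCH and all function names are hypothetical, and the rule for when i_vfdcnt changes is inferred from the descriptions above.

```c
#define NOFILE_SKETCH 20	/* hypothetical per-process descriptor limit */

struct inode_sketch {
	int i_vfdcnt;		/* descriptors with vread pages on this file */
};

struct user_sketch {
	int u_vrpages[NOFILE_SKETCH];	/* fill on demand entries per descriptor */
};

/* Record that npages fill on demand entries now refer to descriptor fd. */
static void
vread_map(struct user_sketch *up, struct inode_sketch *ip, int fd, int npages)
{
	if (fd < 0 || fd >= NOFILE_SKETCH || npages <= 0)
		return;
	if (up->u_vrpages[fd] == 0)
		ip->i_vfdcnt++;		/* first mapping through this descriptor */
	up->u_vrpages[fd] += npages;
}

/* Record that one such entry has been filled or discarded. */
static void
vread_unmap(struct user_sketch *up, struct inode_sketch *ip, int fd)
{
	if (fd < 0 || fd >= NOFILE_SKETCH || up->u_vrpages[fd] == 0)
		return;
	if (--up->u_vrpages[fd] == 0)
		ip->i_vfdcnt--;		/* last mapping through this descriptor */
}

/* A descriptor may be closed only when no entries refer to it. */
static int
can_close(const struct user_sketch *up, int fd)
{
	return fd >= 0 && fd < NOFILE_SKETCH && up->u_vrpages[fd] == 0;
}

/* A file may be truncated only when no descriptor maps it with vread. */
static int
can_truncate(const struct inode_sketch *ip)
{
	return ip->i_vfdcnt == 0;
}
```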