Commit | Line | Data |
---|---|---|
f978e1ef C |
1 | .fp 3 G |
2 | ....TM "78-1271-12, 78-1273-6" 39199 39199-11 | |
3 | .ND "September 1, 1978" | |
4 | ....TR 68 | |
5 | .RP | |
6 | . \" macros here | |
7 | .tr _\(em | |
8 | .if t .tr ~\(ap | |
9 | .tr |\(or | |
10 | .tr *\(** | |
11 | .de UC | |
12 | \&\\$3\s-1\\$1\\s0\&\\$2 | |
13 | .. | |
14 | .de IT | |
15 | .if n .ul | |
16 | \&\\$3\f2\\$1\fP\|\\$2 | |
17 | .. | |
18 | .de UL | |
19 | .if n .ul | |
20 | \&\\$3\f3\\$1\fP\&\\$2 | |
21 | .. | |
22 | .de P1 | |
23 | .DS I 3n | |
24 | .nf | |
25 | .if n .ta 5 10 15 20 25 30 35 40 45 50 55 60 | |
26 | .if t .ta .3i .6i .9i 1.2i | |
27 | .if t .tr -\-'\(fm*\(** | |
28 | .if t .tr _\(ul | |
29 | .ft 3 | |
30 | .lg 0 | |
31 | .ss 18 | |
32 | . \"use first argument as indent if present | |
33 | .. | |
34 | .de P2 | |
35 | .ps \\n(PS | |
36 | .vs \\n(VSp | |
37 | .ft R | |
38 | .ss 12 | |
39 | .if n .ls 2 | |
40 | .tr --''``^^!! | |
41 | .if t .tr _\(em | |
42 | .fi | |
43 | .lg | |
44 | .DE | |
45 | .. | |
46 | .hw semi-colon | |
47 | .hy 14 | |
48 | . \"2=not last lines; 4= no -xx; 8=no xx- | |
49 | . \"special chars in programs | |
50 | .de WS | |
51 | .sp \\$1 | |
52 | .. | |
53 | . \" end of macros | |
54 | .TL | |
55 | Awk \(em A Pattern Scanning and Processing Language | |
56 | .br | |
57 | (Second Edition) | |
58 | .AU "MH 2C-522" 4862 | |
59 | Alfred V. Aho | |
60 | .AU "MH 2C-518" 6021 | |
61 | Brian W. Kernighan | |
62 | .AU "MH 2C-514" 7214 | |
63 | Peter J. Weinberger | |
64 | .AI | |
65 | .MH | |
66 | .AB | |
67 | .IT Awk | |
68 | is a programming language whose | |
69 | basic operation | |
70 | is to search a set of files | |
71 | for patterns, and to perform specified actions upon lines or fields of lines which | |
72 | contain instances of those patterns. | |
73 | .IT Awk | |
74 | makes certain data selection and transformation operations easy to express; | |
75 | for example, the | |
76 | .IT awk | |
77 | program | |
78 | .sp | |
79 | .ce | |
80 | .ft 3 | |
81 | length > 72 | |
82 | .ft | |
83 | .sp | |
84 | prints all input lines whose length exceeds 72 characters; | |
85 | the program | |
86 | .ce | |
87 | .sp | |
88 | .ft 3 | |
89 | NF % 2 == 0 | |
90 | .ft R | |
91 | .sp | |
92 | prints all lines with an even number of fields; | |
93 | and the program | |
94 | .ce | |
95 | .sp | |
96 | .ft 3 | |
97 | { $1 = log($1); print } | |
98 | .ft R | |
99 | .sp | |
100 | replaces the first field of each line by its logarithm. | |
101 | .PP | |
102 | .IT Awk | |
103 | patterns may include arbitrary boolean combinations of regular expressions | |
104 | and of relational operators on strings, numbers, fields, variables, and array elements. | |
105 | Actions may include the same pattern-matching constructions as in patterns, | |
106 | as well as | |
107 | arithmetic and string expressions and assignments, | |
108 | .UL if-else , | |
109 | .UL while , | |
110 | .UL for | |
111 | statements, | |
112 | and multiple output streams. | |
113 | .PP | |
114 | This report contains a user's guide, a discussion of the design and implementation of | |
115 | .IT awk , | |
116 | and some timing statistics. | |
117 | ....It supersedes TM-77-1271-5, dated September 8, 1977. | |
118 | .AE | |
119 | .CS 6 1 7 0 1 4 | |
120 | .if n .ls 2 | |
121 | .nr PS 9 | |
122 | .nr VS 11 | |
123 | .NH | |
124 | Introduction | |
125 | .if t .2C | |
126 | .PP | |
127 | .IT Awk | |
128 | is a programming language designed to make | |
129 | many common | |
130 | information retrieval and text manipulation tasks | |
131 | easy to state and to perform. | |
132 | .PP | |
133 | The basic operation of | |
134 | .IT awk | |
135 | is to scan a set of input lines in order, | |
136 | searching for lines which match any of a set of patterns | |
137 | which the user has specified. | |
138 | For each pattern, an action can be specified; | |
139 | this action will be performed on each line that matches the pattern. | |
140 | .PP | |
141 | Readers familiar with the | |
142 | .UX | |
143 | program | |
144 | .IT grep\| | |
145 | .[ | |
146 | unix program manual | |
147 | .] | |
148 | will recognize | |
149 | the approach, although in | |
150 | .IT awk | |
151 | the patterns may be more | |
152 | general than in | |
153 | .IT grep , | |
154 | and the actions allowed are more involved than merely | |
155 | printing the matching line. | |
156 | For example, the | |
157 | .IT awk | |
158 | program | |
159 | .P1 | |
160 | {print $3, $2} | |
161 | .P2 | |
162 | prints the third and second columns of a table | |
163 | in that order. | |
164 | The program | |
165 | .P1 | |
166 | $2 ~ /A\||B\||C/ | |
167 | .P2 | |
168 | prints all input lines with an A, B, or C in the second field. | |
169 | The program | |
170 | .P1 | |
171 | $1 != prev { print; prev = $1 } | |
172 | .P2 | |
173 | prints all lines in which the first field is different | |
174 | from the previous first field. | |
175 | .NH 2 | |
176 | Usage | |
177 | .PP | |
178 | The command | |
179 | .P1 | |
180 | awk program [files] | |
181 | .P2 | |
182 | executes the | |
183 | .IT awk | |
184 | commands in | |
185 | the string | |
186 | .UL program | |
187 | on the set of named files, | |
188 | or on the standard input if there are no files. | |
189 | The statements can also be placed in a file | |
190 | .UL pfile , | |
191 | and executed by the command | |
192 | .P1 | |
193 | awk -f pfile [files] | |
194 | .P2 | |
195 | .NH 2 | |
196 | Program Structure | |
197 | .PP | |
198 | An | |
199 | .IT awk | |
200 | program is a sequence of statements of the form: | |
201 | .P1 | |
202 | .ft I | |
203 | pattern { action } | |
204 | pattern { action } | |
205 | ... | |
206 | .ft 3 | |
207 | .P2 | |
208 | Each line of input | |
209 | is matched against | |
210 | each of the patterns in turn. | |
211 | For each pattern that matches, the associated action | |
212 | is executed. | |
213 | When all the patterns have been tested, the next line | |
214 | is fetched and the matching starts over. | |
215 | .PP | |
216 | Either the pattern or the action may be left out, | |
217 | but not both. | |
218 | If there is no action for a pattern, | |
219 | the matching line is simply | |
220 | copied to the output. | |
221 | (Thus a line which matches several patterns can be printed several times.) | |
222 | If there is no pattern for an action, | |
223 | then the action is performed for every input line. | |
224 | A line which matches no pattern is ignored. | |
225 | .PP | |
226 | Since patterns and actions are both optional, | |
227 | actions must be enclosed in braces | |
228 | to distinguish them from patterns. | |
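The matching rules above can be tried from a shell. This is a minimal sketch, assuming a POSIX-style awk is installed; the sample input and the shell variable `out` are made up for illustration.

```shell
# Two pattern-only statements, one per line of the program text.
# A line matching both patterns is printed twice; a line matching
# neither pattern is ignored.
out=$(printf 'apple pie\napple\ncherry\n' | awk '/apple/
/pie/')
echo "$out"
```

Here "apple pie" matches both patterns, "apple" matches only the first, and "cherry" matches neither.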
229 | .NH 2 | |
230 | Records and Fields | |
231 | .PP | |
232 | .IT Awk | |
233 | input is divided into | |
234 | ``records'' terminated by a record separator. | |
235 | The default record separator is a newline, | |
236 | so by default | |
237 | .IT awk | |
238 | processes its input a line at a time. | |
239 | The number of the current record is available in a variable | |
240 | named | |
241 | .UL NR . | |
242 | .PP | |
243 | Each input record | |
244 | is considered to be divided into ``fields.'' | |
245 | Fields are normally separated by | |
246 | white space \(em blanks or tabs \(em | |
247 | but the input field separator may be changed, as described below. | |
248 | Fields are referred to as | |
249 | .UL "$1, $2," | |
250 | and so forth, | |
251 | where | |
252 | .UL $1 | |
253 | is the first field, | |
254 | and | |
255 | .UL $0 | |
256 | is the whole input record itself. | |
257 | Fields may be assigned to. | |
258 | The number of fields in the current record | |
259 | is available in a variable named | |
260 | .UL NF . | |
261 | .PP | |
262 | The variables | |
263 | .UL FS | |
264 | and | |
265 | .UL RS | |
266 | refer to the input field and record separators; | |
267 | they may be changed at any time to any single character. | |
268 | The optional command-line argument | |
269 | \f3\-F\fIc\fR | |
270 | may also be used to set | |
271 | .UL FS | |
272 | to the character | |
273 | .IT c . | |
274 | .PP | |
275 | If the record separator is empty, | |
276 | an empty input line is taken as the record separator, | |
277 | and blanks, tabs and newlines are treated as field separators. | |
278 | .PP | |
279 | The variable | |
280 | .UL FILENAME | |
281 | contains the name of the current input file. | |
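As a rough illustration of the variables described in this section, the following sketch sets the field separator with `-F` and prints `NR`, `NF`, and the first field of each record. The input lines and the shell variable `out` are hypothetical.

```shell
# -F: sets FS to a colon for the whole run.
# NR is the current record number; NF is the number of fields in it.
out=$(printf 'root:0:admin\nbwk:7\n' | awk -F: '{ print NR, NF, $1 }')
echo "$out"
```

The first record has three colon-separated fields and the second has two.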
282 | .NH 2 | |
283 | Printing | |
284 | .PP | |
285 | An action may have no pattern, | |
286 | in which case the action is executed for | |
287 | all | |
288 | lines. | |
289 | The simplest action is to print some or all of a record; | |
290 | this is accomplished by the | |
291 | .IT awk | |
292 | command | |
293 | .UL print . | |
294 | The | |
295 | .IT awk | |
296 | program | |
297 | .P1 | |
298 | { print } | |
299 | .P2 | |
300 | prints each record, thus copying the input to the output intact. | |
301 | More useful is to print a field or fields from each record. | |
302 | For instance, | |
303 | .P1 | |
304 | print $2, $1 | |
305 | .P2 | |
306 | prints the first two fields in reverse order. | |
307 | Items separated by a comma in the print statement will be separated by the current output field separator | |
308 | when output. | |
309 | Items not separated by commas will be concatenated, | |
310 | so | |
311 | .P1 | |
312 | print $1 $2 | |
313 | .P2 | |
314 | runs the first and second fields together. | |
315 | .PP | |
316 | The predefined variables | |
317 | .UL NF | |
318 | and | |
319 | .UL NR | |
320 | can be used; | |
321 | for example | |
322 | .P1 | |
323 | { print NR, NF, $0 } | |
324 | .P2 | |
325 | prints each record preceded by the record number and the number of fields. | |
326 | .PP | |
327 | Output may be diverted to multiple files; | |
328 | the program | |
329 | .P1 | |
330 | { print $1 >"foo1"; print $2 >"foo2" } | |
331 | .P2 | |
332 | writes the first field, | |
333 | .UL $1 , | |
334 | on the file | |
335 | .UL foo1 , | |
336 | and the second field on file | |
337 | .UL foo2 . | |
338 | The | |
339 | .UL >> | |
340 | notation can also be used: | |
341 | .P1 | |
342 | print $1 >>"foo" | |
343 | .P2 | |
344 | appends the output to the file | |
345 | .UL foo . | |
346 | (In each case, | |
347 | the output files are | |
348 | created if necessary.) | |
349 | The file name can be a variable or a field as well as a constant; | |
350 | for example, | |
351 | .P1 | |
352 | print $1 >$2 | |
353 | .P2 | |
354 | uses the contents of field 2 as a file name. | |
355 | .PP | |
356 | Naturally there is a limit on the number of output files; | |
357 | currently it is 10. | |
358 | .PP | |
359 | Similarly, output can be piped into another process | |
360 | (on | |
361 | .UC UNIX | |
362 | only); for instance, | |
363 | .P1 | |
364 | print | "mail bwk" | |
365 | .P2 | |
366 | mails the output to | |
367 | .UL bwk . | |
368 | .PP | |
369 | The variables | |
370 | .UL OFS | |
371 | and | |
372 | .UL ORS | |
373 | may be used to change the current | |
374 | output field separator and output | |
375 | record separator. | |
376 | The output record separator is | |
377 | appended to the output of the | |
378 | .UL print | |
379 | statement. | |
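A small sketch of the two output separators, assuming a POSIX-style awk; the input and the variable `out` are invented for the example.

```shell
# OFS goes between items separated by commas in print;
# ORS is appended after each print statement instead of a newline.
out=$(printf 'a b\nc d\n' | awk 'BEGIN { OFS = ":"; ORS = ";" } { print $1, $2 }')
echo "$out"
```

Because `ORS` replaces the newline, both records appear on a single output line.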
380 | .PP | |
381 | .IT Awk | |
382 | also provides the | |
383 | .UL printf | |
384 | statement for output formatting: | |
385 | .P1 | |
386 | printf format expr, expr, ... | |
387 | .P2 | |
388 | formats the expressions in the list | |
389 | according to the specification | |
390 | in | |
391 | .UL format | |
392 | and prints them. | |
393 | For example, | |
394 | .P1 | |
395 | printf "%8.2f %10ld\en", $1, $2 | |
396 | .P2 | |
397 | prints | |
398 | .UL $1 | |
399 | as a floating point number 8 digits wide, | |
400 | with two after the decimal point, | |
401 | and | |
402 | .UL $2 | |
403 | as a 10-digit long decimal number, | |
404 | followed by a newline. | |
405 | No output separators are produced automatically; | |
406 | you must add them yourself, | |
407 | as in this example. | |
408 | The version of | |
409 | .UL printf | |
410 | is identical to that used with C. | |
411 | .[ | |
412 | C programm language prentice hall 1978 | |
413 | .] | |
414 | .NH 1 | |
415 | Patterns | |
416 | .PP | |
417 | A pattern in front of an action acts as a selector | |
418 | that determines whether the action is to be executed. | |
419 | A variety of expressions may be used as patterns: | |
420 | regular expressions, | |
421 | arithmetic relational expressions, | |
422 | string-valued expressions, | |
423 | and arbitrary boolean | |
424 | combinations of these. | |
425 | .NH 2 | |
426 | BEGIN and END | |
427 | .PP | |
428 | The special pattern | |
429 | .UL BEGIN | |
430 | matches the beginning of the input, | |
431 | before the first record is read. | |
432 | The pattern | |
433 | .UL END | |
434 | matches the end of the input, | |
435 | after the last record has been processed. | |
436 | .UL BEGIN | |
437 | and | |
438 | .UL END | |
439 | thus provide a way to gain control before and after processing, | |
440 | for initialization and wrapup. | |
441 | .PP | |
442 | As an example, the field separator | |
443 | can be set to a colon by | |
444 | .P1 | |
445 | BEGIN { FS = ":" } | |
446 | .ft I | |
447 | \&... rest of program ... | |
448 | .ft 3 | |
449 | .P2 | |
450 | Or the input lines may be counted by | |
451 | .P1 | |
452 | END { print NR } | |
453 | .P2 | |
454 | If | |
455 | .UL BEGIN | |
456 | is present, it must be the first pattern; | |
457 | .UL END | |
458 | must be the last if used. | |
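The two examples above can be combined into one program. This is only a sketch with made-up input; `out` is a shell variable introduced for the example.

```shell
# BEGIN runs before the first record is read, END after the last one.
out=$(printf 'a:1\nb:2\nc:3\n' | awk '
BEGIN { FS = ":" }      # set the field separator before any input
      { print $2 }      # runs once per record
END   { print NR }      # NR now holds the total number of records
')
echo "$out"
```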
459 | .NH 2 | |
460 | Regular Expressions | |
461 | .PP | |
462 | The simplest regular expression is a literal string of characters | |
463 | enclosed in slashes, | |
464 | like | |
465 | .P1 | |
466 | /smith/ | |
467 | .P2 | |
468 | This | |
469 | is actually a complete | |
470 | .IT awk | |
471 | program which | |
472 | will print all lines which contain any occurrence | |
473 | of the name ``smith''. | |
474 | If a line contains ``smith'' | |
475 | as part of a larger word, | |
476 | it will also be printed, as in | |
477 | .P1 | |
478 | blacksmithing | |
479 | .P2 | |
480 | .PP | |
481 | .IT Awk | |
482 | regular expressions include the regular expression | |
483 | forms found in | |
484 | the | |
485 | .UC UNIX | |
486 | text editor | |
487 | .IT ed\| | |
488 | .[ | |
489 | unix program manual | |
490 | .] | |
491 | and | |
492 | .IT grep | |
493 | (without back-referencing). | |
494 | In addition, | |
495 | .IT awk | |
496 | allows | |
497 | parentheses for grouping, | for alternatives, | |
498 | .UL + | |
499 | for ``one or more'', and | |
500 | .UL ? | |
501 | for ``zero or one'', | |
502 | all as in | |
503 | .IT lex . | |
504 | Character classes | |
505 | may be abbreviated: | |
506 | .UL [a\-zA\-Z0\-9] | |
507 | is the set of all letters and digits. | |
508 | As an example, | |
509 | the | |
510 | .IT awk | |
511 | program | |
512 | .P1 | |
513 | /[Aa]ho\||[Ww]einberger\||[Kk]ernighan/ | |
514 | .P2 | |
515 | will print all lines which contain any of the names | |
516 | ``Aho,'' ``Weinberger'' or ``Kernighan,'' | |
517 | whether capitalized or not. | |
518 | .PP | |
519 | Regular expressions | |
520 | (with the extensions listed above) | |
521 | must be enclosed in slashes, | |
522 | just as in | |
523 | .IT ed | |
524 | and | |
525 | .IT sed . | |
526 | Within a regular expression, | |
527 | blanks and the regular expression | |
528 | metacharacters are significant. | |
529 | To turn off the magic meaning | |
530 | of one of the regular expression characters, | |
531 | precede it with a backslash. | |
532 | An example is the pattern | |
533 | .P1 | |
534 | /\|\e/\^.\^*\e// | |
535 | .P2 | |
536 | which matches any string of characters | |
537 | enclosed in slashes. | |
538 | .PP | |
539 | One can also specify that any field or variable | |
540 | matches | |
541 | a regular expression (or does not match it) with the operators | |
542 | .UL ~ | |
543 | and | |
544 | .UL !~ . | |
545 | The program | |
546 | .P1 | |
547 | $1 ~ /[jJ]ohn/ | |
548 | .P2 | |
549 | prints all lines where the first field matches ``john'' or ``John.'' | |
550 | Notice that this will also match ``Johnson'', ``St. Johnsbury'', and so on. | |
551 | To restrict it to exactly | |
552 | .UL [jJ]ohn , | |
553 | use | |
554 | .P1 | |
555 | $1 ~ /^[jJ]ohn$/ | |
556 | .P2 | |
557 | The caret ^ refers to the beginning | |
558 | of a line or field; | |
559 | the dollar sign | |
560 | .UL $ | |
561 | refers to the end. | |
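The difference between the unanchored and anchored matches can be seen side by side. The following sketch uses hypothetical input; the shell variables `input`, `loose`, and `exact` are names chosen for the example.

```shell
input='John 1
Johnson 2
john 3'

# Unanchored: matches anywhere in the first field, so Johnson is selected too.
loose=$(printf '%s\n' "$input" | awk '$1 ~ /[jJ]ohn/')

# Anchored at both ends: the first field must be exactly john or John.
exact=$(printf '%s\n' "$input" | awk '$1 ~ /^[jJ]ohn$/')

echo "$loose"
echo "$exact"
```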
562 | .NH 2 | |
563 | Relational Expressions | |
564 | .PP | |
565 | An | |
566 | .IT awk | |
567 | pattern can be a relational expression | |
568 | involving the usual relational operators | |
569 | .UL < , | |
570 | .UL <= , | |
571 | .UL == , | |
572 | .UL != , | |
573 | .UL >= , | |
574 | and | |
575 | .UL > . | |
576 | An example is | |
577 | .P1 | |
578 | $2 > $1 + 100 | |
579 | .P2 | |
580 | which selects lines where the second field | |
581 | is more than 100 greater than the first field. | |
582 | Similarly, | |
583 | .P1 | |
584 | NF % 2 == 0 | |
585 | .P2 | |
586 | prints lines with an even number of fields. | |
587 | .PP | |
588 | In relational tests, if neither operand is numeric, | |
589 | a string comparison is made; | |
590 | otherwise it is numeric. | |
591 | Thus, | |
592 | .P1 | |
593 | $1 >= "s" | |
594 | .P2 | |
595 | selects lines that begin with an | |
596 | .UL s , | |
597 | .UL t , | |
598 | .UL u , | |
599 | etc. | |
600 | In the absence of any other information, | |
601 | fields are treated as strings, so | |
602 | the program | |
603 | .P1 | |
604 | $1 > $2 | |
605 | .P2 | |
606 | will perform a string comparison. | |
607 | .NH 2 | |
608 | Combinations of Patterns | |
609 | .PP | |
610 | A pattern can be any boolean combination of patterns, | |
611 | using the operators | |
612 | .UL \||\|| | |
613 | (or), | |
614 | .UL && | |
615 | (and), and | |
616 | .UL ! | |
617 | (not). | |
618 | For example, | |
619 | .P1 | |
620 | $1 >= "s" && $1 < "t" && $1 != "smith" | |
621 | .P2 | |
622 | selects lines where the first field begins with ``s'', but is not ``smith''. | |
623 | .UL && | |
624 | and | |
625 | .UL \||\|| | |
626 | guarantee that their operands | |
627 | will be evaluated | |
628 | from left to right; | |
629 | evaluation stops as soon as the truth or falsehood | |
630 | is determined. | |
631 | .NH 2 | |
632 | Pattern Ranges | |
633 | .PP | |
634 | The ``pattern'' that selects an action may also | |
635 | consist of two patterns separated by a comma, as in | |
636 | .P1 | |
637 | pat1, pat2 { ... } | |
638 | .P2 | |
639 | In this case, the action is performed for each line between | |
640 | an occurrence of | |
641 | .UL pat1 | |
642 | and the next occurrence of | |
643 | .UL pat2 | |
644 | (inclusive). | |
645 | For example, | |
646 | .P1 | |
647 | /start/, /stop/ | |
648 | .P2 | |
649 | prints all lines between | |
650 | .UL start | |
651 | and | |
652 | .UL stop , | |
653 | while | |
654 | .P1 | |
655 | NR == 100, NR == 200 { ... } | |
656 | .P2 | |
657 | does the action for lines 100 through 200 | |
658 | of the input. | |
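Both kinds of range can be tried on a few lines of input. This is a minimal sketch; the input and the shell variables `by_text` and `by_number` are assumptions made for the example.

```shell
# A range is inclusive: the lines matching pat1 and pat2 are both selected.
by_text=$(printf 'a\nstart\nb\nstop\nc\n' | awk '/start/, /stop/')

# The same idea with record numbers: lines 2 through 3.
by_number=$(printf 'a\nstart\nb\nstop\nc\n' | awk 'NR == 2, NR == 3')

echo "$by_text"
echo "$by_number"
```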
659 | .NH 1 | |
660 | Actions | |
661 | .PP | |
662 | An | |
663 | .IT awk | |
664 | action is a sequence of action statements | |
665 | terminated by newlines or semicolons. | |
666 | These action statements can be used to do a variety of | |
667 | bookkeeping and string manipulating tasks. | |
668 | .NH 2 | |
669 | Built-in Functions | |
670 | .PP | |
671 | .IT Awk | |
672 | provides a ``length'' function | |
673 | to compute the length of a string of characters. | |
674 | This program prints each record, | |
675 | preceded by its length: | |
676 | .P1 | |
677 | {print length, $0} | |
678 | .P2 | |
679 | .UL length | |
680 | by itself is a ``pseudo-variable'' which | |
681 | yields the length of the current record; | |
682 | .UL length(argument) | |
683 | is a function which yields the length of its argument, | |
684 | as in | |
685 | the equivalent | |
686 | .P1 | |
687 | {print length($0), $0} | |
688 | .P2 | |
689 | The argument may be any expression. | |
690 | .PP | |
691 | .IT Awk | |
692 | also | |
693 | provides the arithmetic functions | |
694 | .UL sqrt , | |
695 | .UL log , | |
696 | .UL exp , | |
697 | and | |
698 | .UL int , | |
699 | for | |
700 | square root, | |
701 | base | |
702 | .IT e | |
703 | logarithm, | |
704 | exponential, | |
705 | and integer part of their respective arguments. | |
706 | .PP | |
707 | The name of one of these built-in functions, | |
708 | without argument or parentheses, | |
709 | stands for the value of the function on the | |
710 | whole record. | |
711 | The program | |
712 | .P1 | |
713 | length < 10 || length > 20 | |
714 | .P2 | |
715 | prints lines whose length | |
716 | is less than 10 or greater | |
717 | than 20. | |
718 | .PP | |
719 | The function | |
720 | .UL substr(s,\ m,\ n) | |
721 | produces the substring of | |
722 | .UL s | |
723 | that begins at position | |
724 | .UL m | |
725 | (origin 1) | |
726 | and is at most | |
727 | .UL n | |
728 | characters long. | |
729 | If | |
730 | .UL n | |
731 | is omitted, the substring goes to the end of | |
732 | .UL s . | |
733 | The function | |
734 | .UL index(s1,\ s2) | |
735 | returns the position where the string | |
736 | .UL s2 | |
737 | occurs in | |
738 | .UL s1 , | |
739 | or zero if it does not. | |
740 | .PP | |
741 | The function | |
742 | .UL sprintf(f,\ e1,\ e2,\ ...) | |
743 | produces the value of the expressions | |
744 | .UL e1 , | |
745 | .UL e2 , | |
746 | etc., | |
747 | in the | |
748 | .UL printf | |
749 | format specified by | |
750 | .UL f . | |
751 | Thus, for example, | |
752 | .P1 | |
753 | x = sprintf("%8.2f %10ld", $1, $2) | |
754 | .P2 | |
755 | sets | |
756 | .UL x | |
757 | to the string produced by formatting | |
758 | the values of | |
759 | .UL $1 | |
760 | and | |
761 | .UL $2 . | |
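The string functions described above might be exercised as follows. This is a sketch only; the sample string, the format passed to `sprintf`, and the shell variable `out` are chosen for illustration and do not come from the report.

```shell
out=$(awk 'BEGIN {
    s = "pattern scanning"
    print substr(s, 9, 4)           # four characters starting at position 9
    print substr(s, 9)              # n omitted: runs to the end of s
    print index(s, "scan")          # position where "scan" occurs in s
    print index(s, "xyz")           # zero, since "xyz" does not occur
    x = sprintf("%03d-%s", 7, "x")  # format into a string instead of printing
    print x
}')
echo "$out"
```

The whole program sits in a `BEGIN` action, so it runs without reading any input.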
762 | .NH 2 | |
763 | Variables, Expressions, and Assignments | |
764 | .PP | |
765 | .IT Awk | |
766 | variables take on numeric (floating point) | |
767 | or string values according to context. | |
768 | For example, in | |
769 | .P1 | |
770 | x = 1 | |
771 | .P2 | |
772 | .UL x | |
773 | is clearly a number, while in | |
774 | .P1 | |
775 | x = "smith" | |
776 | .P2 | |
777 | it is clearly a string. | |
778 | Strings are converted to numbers and | |
779 | vice versa whenever context demands it. | |
780 | For instance, | |
781 | .P1 | |
782 | x = "3" + "4" | |
783 | .P2 | |
784 | assigns 7 to | |
785 | .UL x . | |
786 | Strings which cannot be interpreted | |
787 | as numbers in a numerical context | |
788 | will generally have numeric value zero, | |
789 | but it is unwise to count on this behavior. | |
790 | .PP | |
791 | By default, variables (other than built-ins) are initialized to the null string, | |
792 | which has numerical value zero; | |
793 | this eliminates the need for most | |
794 | .UL BEGIN | |
795 | sections. | |
796 | For example, the sums of the first two fields can be computed by | |
797 | .P1 | |
798 | { s1 += $1; s2 += $2 } | |
799 | END { print s1, s2 } | |
800 | .P2 | |
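Run on a small table, the program above behaves as sketched here. The three input lines and the shell variable `out` are invented for the example.

```shell
# s1 and s2 are never initialized; they start out as the null string,
# which counts as zero the first time += is applied.
out=$(printf '1 10\n2 20\n3 30\n' | awk '{ s1 += $1; s2 += $2 }
END { print s1, s2 }')
echo "$out"
```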
801 | .PP | |
802 | Arithmetic is done internally in floating point. | |
803 | The arithmetic operators are | |
804 | .UL + , | |
805 | .UL \- , | |
806 | .UL \(** , | |
807 | .UL / , | |
808 | and | |
809 | .UL % | |
810 | (mod). | |
811 | The C increment | |
812 | .UL ++ | |
813 | and | |
814 | decrement | |
815 | .UL \-\- | |
816 | operators are also available, | |
817 | and so are the assignment operators | |
818 | .UL += , | |
819 | .UL \-= , | |
820 | .UL *= , | |
821 | .UL /= , | |
822 | and | |
823 | .UL %= . | |
824 | These operators may all be used in expressions. | |
825 | .NH 2 | |
826 | Field Variables | |
827 | .PP | |
828 | Fields in | |
829 | .IT awk | |
830 | share essentially all of the properties of variables _ | |
831 | they may be used in arithmetic or string operations, | |
832 | and may be assigned to. | |
833 | Thus one can | |
834 | replace the first field with a sequence number like this: | |
835 | .P1 | |
836 | { $1 = NR; print } | |
837 | .P2 | |
838 | or | |
839 | accumulate two fields into a third, like this: | |
840 | .P1 | |
841 | { $1 = $2 + $3; print $0 } | |
842 | .P2 | |
843 | or assign a string to a field: | |
844 | .P1 | |
845 | { if ($3 > 1000) | |
846 | $3 = "too big" | |
847 | print | |
848 | } | |
849 | .P2 | |
850 | which replaces the third field by ``too big'' when it is, | |
851 | and in any case prints the record. | |
852 | .PP | |
853 | Field references may be numerical expressions, | |
854 | as in | |
855 | .P1 | |
856 | { print $i, $(i+1), $(i+n) } | |
857 | .P2 | |
858 | Whether a field is deemed numeric or string depends on context; | |
859 | in ambiguous cases like | |
860 | .P1 | |
861 | if ($1 == $2) ... | |
862 | .P2 | |
863 | fields are treated as strings. | |
864 | .PP | |
865 | Each input line is split into fields automatically as necessary. | |
866 | It is also possible to split any variable or string | |
867 | into fields: | |
868 | .P1 | |
869 | n = split(s, array, sep) | |
870 | .P2 | |
871 | splits | |
872 | the string | |
873 | .UL s | |
874 | into | |
875 | .UL array[1] , | |
876 | \&..., | |
877 | .UL array[n] . | |
878 | The number of elements found is returned. | |
879 | If the | |
880 | .UL sep | |
881 | argument is provided, it is used as the field separator; | |
882 | otherwise | |
883 | .UL FS | |
884 | is used as the separator. | |
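A brief sketch of `split` with an explicit separator; the string being split, the array name `part`, and the shell variable `out` are hypothetical.

```shell
out=$(awk 'BEGIN {
    n = split("usr:local:bin", part, ":")   # part[1] through part[n] are filled in
    print n, part[1], part[n]               # count, first piece, last piece
}')
echo "$out"
```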
885 | .NH 2 | |
886 | String Concatenation | |
887 | .PP | |
888 | Strings may be concatenated. | |
889 | For example | |
890 | .P1 | |
891 | length($1 $2 $3) | |
892 | .P2 | |
893 | returns the combined length of the first three fields. | |
894 | Or in a | |
895 | .UL print | |
896 | statement, | |
897 | .P1 | |
898 | print $1 " is " $2 | |
899 | .P2 | |
900 | prints | |
901 | the two fields separated by `` is ''. | |
902 | Variables and numeric expressions may also appear in concatenations. | |
903 | .NH 2 | |
904 | Arrays | |
905 | .PP | |
906 | Array elements are not declared; | |
907 | they spring into existence by being mentioned. | |
908 | Subscripts may have | |
909 | .ul | |
910 | any | |
911 | non-null | |
912 | value, including non-numeric strings. | |
913 | As an example of a conventional numeric subscript, | |
914 | the statement | |
915 | .P1 | |
916 | x[NR] = $0 | |
917 | .P2 | |
918 | assigns the current input record to | |
919 | the | |
920 | .UL NR -th | |
921 | element of the array | |
922 | .UL x . | |
923 | In fact, it is possible in principle (though perhaps slow) | |
924 | to process the entire input in a random order with the | |
925 | .IT awk | |
926 | program | |
927 | .P1 | |
928 | { x[NR] = $0 } | |
929 | END { \fI... program ...\fP } | |
930 | .P2 | |
931 | The first action merely records each input line in | |
932 | the array | |
933 | .UL x . | |
934 | .PP | |
935 | Array elements may be named by non-numeric values, | |
936 | which gives | |
937 | .IT awk | |
938 | a capability rather like the associative memory of | |
939 | Snobol tables. | |
940 | Suppose the input contains fields with values like | |
941 | .UL apple , | |
942 | .UL orange , | |
943 | etc. | |
944 | Then the program | |
945 | .P1 | |
946 | /apple/ { x["apple"]++ } | |
947 | /orange/ { x["orange"]++ } | |
948 | END { print x["apple"], x["orange"] } | |
949 | .P2 | |
950 | increments counts for the named array elements, | |
951 | and prints them at the end of the input. | |
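The same counting can be done without naming each value in advance, by using the field itself as the subscript. This variation is a sketch, not the program from the report; the array name `count`, the input, and the shell variable `out` are assumptions.

```shell
# count[$1] springs into existence the first time each value is seen.
out=$(printf 'apple\norange\napple\n' | awk '{ count[$1]++ }
END { print count["apple"], count["orange"] }')
echo "$out"
```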
952 | .NH 2 | |
953 | Flow-of-Control Statements | |
954 | .PP | |
955 | .IT Awk | |
956 | provides the basic flow-of-control statements | |
957 | .UL if-else , | |
958 | .UL while , | |
959 | .UL for , | |
960 | and statement grouping with braces, as in C. | |
961 | We showed the | |
962 | .UL if | |
963 | statement in section 3.3 without describing it. | |
964 | The condition in parentheses is evaluated; | |
965 | if it is true, the statement following the | |
966 | .UL if | |
967 | is done. | |
968 | The | |
969 | .UL else | |
970 | part is optional. | |
971 | .PP | |
972 | The | |
973 | .UL while | |
974 | statement is exactly like that of C. | |
975 | For example, to print all input fields one per line, | |
976 | .P1 | |
977 | i = 1 | |
978 | while (i <= NF) { | |
979 | print $i | |
980 | ++i | |
981 | } | |
982 | .P2 | |
983 | .PP | |
984 | The | |
985 | .UL for | |
986 | statement is also exactly that of C: | |
987 | .P1 | |
988 | for (i = 1; i <= NF; i++) | |
989 | print $i | |
990 | .P2 | |
991 | does the same job as the | |
992 | .UL while | |
993 | statement above. | |
994 | .PP | |
995 | There is an alternate form of the | |
996 | .UL for | |
997 | statement which is suited for accessing the | |
998 | elements of an associative array: | |
999 | .P1 | |
1000 | for (i in array) | |
1001 | \fIstatement\f3 | |
1002 | .P2 | |
1003 | does | |
1004 | .ul | |
1005 | statement | |
1006 | with | |
1007 | .UL i | |
1008 | set in turn to each subscript of | |
1009 | .UL array . | |
1010 | The elements are accessed in an apparently random order. | |
1011 | Chaos will ensue if | |
1012 | .UL i | |
1013 | is altered, or if any new elements are | |
1014 | accessed during the loop. | |
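Since the loop visits subscripts in no particular order, a program that needs stable output has to impose an order itself. The sketch below pipes the result through `sort`, which is a choice made for this example rather than anything the report prescribes; `count`, `k`, and `out` are hypothetical names.

```shell
# k takes on each subscript of count in turn, in an unspecified order.
out=$(printf 'apple\norange\napple\n' |
    awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' |
    sort)
echo "$out"
```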
1015 | .PP | |
1016 | The expression in the condition part of an | |
1017 | .UL if , | |
1018 | .UL while | |
1019 | or | |
1020 | .UL for | |
1021 | can include relational operators like | |
1022 | .UL < , | |
1023 | .UL <= , | |
1024 | .UL > , | |
1025 | .UL >= , | |
1026 | .UL == | |
1027 | (``is equal to''), | |
1028 | and | |
1029 | .UL != | |
1030 | (``not equal to''); | |
1031 | regular expression matches with the match operators | |
1032 | .UL ~ | |
1033 | and | |
1034 | .UL !~ ; | |
1035 | the logical operators | |
1036 | .UL \||\|| , | |
1037 | .UL && , | |
1038 | and | |
1039 | .UL ! ; | |
1040 | and of course parentheses for grouping. | |
1041 | .PP | |
1042 | The | |
1043 | .UL break | |
1044 | statement causes an immediate exit | |
1045 | from an enclosing | |
1046 | .UL while | |
1047 | or | |
1048 | .UL for ; | |
1049 | the | |
1050 | .UL continue | |
1051 | statement | |
1052 | causes the next iteration to begin. | |
1053 | .PP | |
1054 | The statement | |
1055 | .UL next | |
1056 | causes | |
1057 | .IT awk | |
1058 | to skip immediately to | |
1059 | the next record and begin scanning the patterns from the top. | |
1060 | The statement | |
1061 | .UL exit | |
1062 | causes the program to behave as if the end of the input | |
1063 | had occurred. | |
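The effect of `next` and `exit` can be sketched with a short program. The input, the `STOP` marker line, and the shell variable `out` are invented for the example.

```shell
out=$(printf '# header\none\ntwo\nSTOP\nthree\n' | awk '
/^#/   { next }     # skip this record; the patterns below never see it
/STOP/ { exit }     # behave as if the input ended here
       { print }
')
echo "$out"
```

The header line is skipped, and nothing from the `STOP` line onward is printed.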
1064 | .PP | |
1065 | Comments may be placed in | |
1066 | .IT awk | |
1067 | programs: | |
1068 | they begin with the character | |
1069 | .UL # | |
1070 | and end with the end of the line, | |
1071 | as in | |
1072 | .P1 | |
1073 | print x, y # this is a comment | |
1074 | .P2 | |
1075 | .NH | |
1076 | Design | |
1077 | .PP | |
1078 | The | |
1079 | .UX | |
1080 | system | |
1081 | already provides several programs that | |
1082 | operate by passing input through a | |
1083 | selection mechanism. | |
1084 | .IT Grep , | |
1085 | the first and simplest, merely prints all lines which | |
1086 | match a single specified pattern. | |
1087 | .IT Egrep | |
1088 | provides more general patterns, i.e., regular expressions | |
1089 | in full generality; | |
1090 | .IT fgrep | |
1091 | searches for a set of keywords with a particularly fast algorithm. | |
1092 | .IT Sed\| | |
1093 | .[ | |
1094 | unix programm manual | |
1095 | .] | |
1096 | provides most of the editing facilities of | |
1097 | the editor | |
1098 | .IT ed , | |
1099 | applied to a stream of input. | |
1100 | None of these programs provides | |
1101 | numeric capabilities, | |
1102 | logical relations, | |
1103 | or variables. | |
1104 | .PP | |
1105 | .IT Lex\| | |
1106 | .[ | |
1107 | lesk lexical analyzer cstr | |
1108 | .] | |
1109 | provides general regular expression recognition capabilities, | |
1110 | and, by serving as a C program generator, | |
1111 | is essentially open-ended in its capabilities. | |
1112 | The use of | |
1113 | .IT lex , | |
1114 | however, requires a knowledge of C programming, | |
1115 | and a | |
1116 | .IT lex | |
1117 | program must be compiled and loaded before use, | |
1118 | which discourages its use for one-shot applications. | |
1119 | .PP | |
1120 | .IT Awk | |
1121 | is an attempt | |
1122 | to fill in another part of the matrix of possibilities. | |
1123 | It | |
1124 | provides general regular expression capabilities | |
1125 | and an implicit input/output loop. | |
1126 | But it also provides convenient numeric processing, | |
1127 | variables, | |
1128 | more general selection, | |
1129 | and control flow in the actions. | |
1130 | It | |
1131 | does not require compilation or a knowledge of C. | |
1132 | Finally, | |
1133 | .IT awk | |
1134 | provides | |
1135 | a convenient way to access fields within lines; | |
1136 | it is unique in this respect. | |
1137 | .PP | |
1138 | .IT Awk | |
1139 | also tries to integrate strings and numbers | |
1140 | completely, | |
1141 | by treating all quantities as both string and numeric, | |
1142 | deciding which representation is appropriate | |
1143 | as late as possible. | |
1144 | In most cases the user can simply ignore the differences. | |
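 | .PP | |
 | As a small illustration of this dual treatment | |
 | (a sketch only; the variable names are arbitrary), | |
 | the program | |
 | .P1 | |
 | BEGIN { x = "3" + 4    # string used as a number | |
 |         y = 3 4        # numbers concatenated as strings | |
 |         print x, y, y + 1 } | |
 | .P2 | |
 | prints | |
 | .UL "7 34 35" : | |
 | the string constant is used as a number in the addition, | |
 | the two numbers are concatenated into the string 34, | |
 | and that string is used as a number again in the final expression. | |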
1145 | .PP | |
1146 | Most of the effort in developing | |
1147 | .I awk | |
1148 | went into deciding what | |
1149 | .I awk | |
1150 | should or should not do | |
1151 | (for instance, it doesn't do string substitution) | |
1152 | and what the syntax should be | |
1153 | (no explicit operator for concatenation) | |
1154 | rather | |
1155 | than into writing or debugging the code. | |
1156 | We have tried | |
1157 | to make the syntax powerful | |
1158 | but easy to use and well adapted | |
1159 | to scanning files. | |
1160 | For example, | |
1161 | the absence of declarations and the reliance on implicit initialization, | |
1162 | while probably a bad idea for a general-purpose programming language, | |
1163 | are desirable in a language | |
1164 | that is meant to be used for tiny programs | |
1165 | that may even be composed on the command line. | |
1166 | .PP | |
1167 | In practice, | |
1168 | .IT awk | |
1169 | usage seems to fall into two broad categories. | |
1170 | One is what might be called ``report generation'' \(em | |
1171 | processing an input to extract counts, | |
1172 | sums, sub-totals, etc. | |
1173 | This also includes the writing of trivial | |
1174 | data validation programs, | |
1175 | such as verifying that a field contains only numeric information | |
1176 | or that certain delimiters are properly balanced. | |
1177 | The combination of textual and numeric processing is invaluable here. | |
1178 | .PP | |
1179 | A second area of use is as a data transformer, | |
1180 | converting data from the form produced by one program | |
1181 | into that expected by another. | |
1182 | The simplest examples merely select fields, perhaps with rearrangements. | |
1183 | .NH | |
1184 | Implementation | |
1185 | .PP | |
1186 | The actual implementation of | |
1187 | .IT awk | |
1188 | uses the language development tools available | |
1189 | on the | |
1190 | .UC UNIX | |
1191 | operating system. | |
1192 | The grammar is specified with | |
1193 | .IT yacc ; | |
1194 | .[ | |
1195 | yacc johnson cstr | |
1196 | .] | |
1197 | the lexical analysis is done by | |
1198 | .IT lex ; | |
1199 | the regular expression recognizers are | |
1200 | deterministic finite automata | |
1201 | constructed directly from the expressions. | |
1202 | An | |
1203 | .IT awk | |
1204 | program is translated into a | |
1205 | parse tree which is then directly executed | |
1206 | by a simple interpreter. | |
1207 | .PP | |
1208 | .IT Awk | |
1209 | was designed for ease of use rather than processing speed; | |
1210 | the delayed evaluation of variable types | |
1211 | and the necessity to break input | |
1212 | into fields make high speed difficult to achieve in any case. | |
1213 | Nonetheless, | |
1214 | the program has not proven to be unworkably slow. | |
1215 | .PP | |
1216 | Table I below shows the execution (user + system) time | |
1217 | on a PDP-11/70 of | |
1218 | the | |
1219 | .UC UNIX | |
1220 | programs | |
1221 | .IT wc , | |
1222 | .IT grep , | |
1223 | .IT egrep , | |
1224 | .IT fgrep , | |
1225 | .IT sed , | |
1226 | .IT lex , | |
1227 | and | |
1228 | .IT awk | |
1229 | on the following simple tasks: | |
1230 | .IP "\ \ 1." | |
1231 | count the number of lines. | |
1232 | .IP "\ \ 2." | |
1233 | print all lines containing ``doug''. | |
1234 | .IP "\ \ 3." | |
1235 | print all lines containing ``doug'', ``ken'' or ``dmr''. | |
1236 | .IP "\ \ 4." | |
1237 | print the third field of each line. | |
1238 | .IP "\ \ 5." | |
1239 | print the third and second fields of each line, in that order. | |
1240 | .IP "\ \ 6." | |
1241 | append all lines containing ``doug'', ``ken'', and ``dmr'' | |
1242 | to files ``jdoug'', ``jken'', and ``jdmr'', respectively. | |
1243 | .IP "\ \ 7." | |
1244 | print each line prefixed by ``line-number\ :\ ''. | |
1245 | .IP "\ \ 8." | |
1246 | sum the fourth column of a table. | |
1247 | .LP | |
1248 | The program | |
1249 | .IT wc | |
1250 | merely counts words, lines and characters in its input; | |
1251 | we have already mentioned the others. | |
1252 | In all cases the input was a file containing | |
1253 | 10,000 lines | |
1254 | as created by the | |
1255 | command | |
1256 | .IT "ls \-l" ; | |
1257 | each line has the form | |
1258 | .P1 | |
1259 | -rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx | |
1260 | .P2 | |
1261 | The total length of this input is | |
1262 | 452,960 characters. | |
1263 | Times for | |
1264 | .IT lex | |
1265 | do not include compile or load. | |
1266 | .PP | |
1267 | As might be expected, | |
1268 | .IT awk | |
1269 | is not as fast as the specialized tools | |
1270 | .IT wc , | |
1271 | .IT sed , | |
1272 | or the programs in the | |
1273 | .IT grep | |
1274 | family, | |
1275 | but | |
1276 | is faster than the more general tool | |
1277 | .IT lex . | |
1278 | In all cases, the tasks were | |
1279 | about as easy to express as | |
1280 | .IT awk | |
1281 | programs | |
1282 | as programs in these other languages; | |
1283 | tasks involving fields were | |
1284 | considerably easier to express as | |
1285 | .IT awk | |
1286 | programs. | |
1287 | Some of the test programs are shown in | |
1288 | .IT awk , | |
1289 | .IT sed | |
1290 | and | |
1291 | .IT lex . | |
1292 | .[ | |
1293 | $LIST$ | |
1294 | .] | |
1295 | .1C | |
1296 | .TS | |
1297 | center tab(@); | |
1298 | c c c c c c c c c | |
1299 | c c c c c c c c c | |
1300 | c|n|n|n|n|n|n|n|n|. | |
1301 | @@@@Task | |
1302 | Program@1@2@3@4@5@6@7@8 | |
1303 | _ | |
1304 | \fIwc\fR@8.6 | |
1305 | \fIgrep\fR@11.7@13.1 | |
1306 | \fIegrep\fR@6.2@11.5@11.6 | |
1307 | \fIfgrep\fR@7.7@13.8@16.1 | |
1308 | \fIsed\fR@10.2@11.6@15.8@29.0@30.5@16.1 | |
1309 | \fIlex\fR@65.1@150.1@144.2@67.7@70.3@104.0@81.7@92.8 | |
1310 | \fIawk\fR@15.0@25.6@29.9@33.3@38.9@46.4@71.4@31.1 | |
1311 | _ | |
1312 | .TE | |
1313 | .sp | |
1314 | .ce | |
1315 | \fBTable I.\fR Execution Times of Programs. (Times are in sec.) | |
1316 | .sp 2 | |
1317 | .2C | |
1318 | .PP | |
1319 | The programs for some of these jobs are shown below. | |
1320 | The | |
1321 | .IT lex | |
1322 | programs are generally too long to show. | |
1323 | .LP | |
1324 | AWK: | |
1325 | .LP | |
1326 | .P1 | |
1327 | 1. END {print NR} | |
1328 | .P2 | |
1329 | .P1 | |
1330 | 2. /doug/ | |
1331 | .P2 | |
1332 | .P1 | |
1333 | 3. /ken|doug|dmr/ | |
1334 | .P2 | |
1335 | .P1 | |
1336 | 4. {print $3} | |
1337 | .P2 | |
1338 | .P1 | |
1339 | 5. {print $3, $2} | |
1340 | .P2 | |
1341 | .P1 | |
1342 | 6. /ken/  {print >"jken"} | |
1343 |    /doug/ {print >"jdoug"} | |
1344 |    /dmr/  {print >"jdmr"} | |
1345 | .P2 | |
1346 | .P1 | |
1347 | 7. {print NR ": " $0} | |
1348 | .P2 | |
1349 | .P1 | |
1350 | 8.     {sum = sum + $4} | |
1351 |    END {print sum} | |
1352 | .P2 | |
1353 | .LP | |
1354 | SED: | |
1355 | .LP | |
1356 | .P1 | |
1357 | 1. $= | |
1358 | .P2 | |
1359 | .P1 | |
1360 | 2. /doug/p | |
1361 | .P2 | |
1362 | .P1 | |
1363 | 3. /doug/p | |
1364 |    /doug/d | |
1365 |    /ken/p | |
1366 |    /ken/d | |
1367 |    /dmr/p | |
1368 |    /dmr/d | |
1369 | .P2 | |
1370 | .P1 | |
1371 | 4. /[^ ]* [ ]*[^ ]* [ ]*\e([^ ]*\e) .*/s//\e1/p | |
1372 | .P2 | |
1373 | .P1 | |
1374 | 5. /[^ ]* [ ]*\e([^ ]*\e) [ ]*\e([^ ]*\e) .*/s//\e2 \e1/p | |
1375 | .P2 | |
1376 | .P1 | |
1377 | 6. /ken/w jken | |
1378 |    /doug/w jdoug | |
1379 |    /dmr/w jdmr | |
1380 | .P2 | |
1381 | .LP | |
1382 | LEX: | |
1383 | .LP | |
1384 | .P1 | |
1385 | 1. %{ | |
1386 |    int i; | |
1387 |    %} | |
1388 |    %% | |
1389 |    \en   i++; | |
1390 |    .     ; | |
1391 |    %% | |
1392 |    yywrap() { | |
1393 |        printf("%d\en", i); | |
 |        return 1;   /* no further input */ | |
1394 |    } | |
1395 | .P2 | |
1396 | .P1 | |
1397 | 2. %% | |
1398 |    ^.*doug.*$   printf("%s\en", yytext); | |
1399 |    .            ; | |
1400 |    \en          ; | |
1401 | .P2 |