| 1 | .fp 3 G |
| 2 | ....TM "78-1271-12, 78-1273-6" 39199 39199-11 |
| 3 | .ND "September 1, 1978" |
| 4 | ....TR 68 |
| 5 | .RP |
| 6 | . \" macros here |
| 7 | .tr _\(em |
| 8 | .if t .tr ~\(ap |
| 9 | .tr |\(or |
| 10 | .tr *\(** |
| 11 | .de UC |
| 12 | \&\\$3\s-1\\$1\\s0\&\\$2 |
| 13 | .. |
| 14 | .de IT |
| 15 | .if n .ul |
| 16 | \&\\$3\f2\\$1\fP\|\\$2 |
| 17 | .. |
| 18 | .de UL |
| 19 | .if n .ul |
| 20 | \&\\$3\f3\\$1\fP\&\\$2 |
| 21 | .. |
| 22 | .de P1 |
| 23 | .DS I 3n |
| 24 | .nf |
| 25 | .if n .ta 5 10 15 20 25 30 35 40 45 50 55 60 |
| 26 | .if t .ta .3i .6i .9i 1.2i |
| 27 | .if t .tr -\-'\(fm*\(** |
| 28 | .if t .tr _\(ul |
| 29 | .ft 3 |
| 30 | .lg 0 |
| 31 | .ss 18 |
| 32 | . \"use first argument as indent if present |
| 33 | .. |
| 34 | .de P2 |
| 35 | .ps \\n(PS |
| 36 | .vs \\n(VSp |
| 37 | .ft R |
| 38 | .ss 12 |
| 39 | .if n .ls 2 |
| 40 | .tr --''``^^!! |
| 41 | .if t .tr _\(em |
| 42 | .fi |
| 43 | .lg |
| 44 | .DE |
| 45 | .. |
| 46 | .hw semi-colon |
| 47 | .hy 14 |
| 48 | . \"2=not last lines; 4= no -xx; 8=no xx- |
| 49 | . \"special chars in programs |
| 50 | .de WS |
| 51 | .sp \\$1 |
| 52 | .. |
| 53 | . \" end of macros |
| 54 | .TL |
| 55 | Awk \(em A Pattern Scanning and Processing Language |
| 56 | .br |
| 57 | (Second Edition) |
| 58 | .AU "MH 2C-522" 4862 |
| 59 | Alfred V. Aho |
| 60 | .AU "MH 2C-518" 6021 |
| 61 | Brian W. Kernighan |
| 62 | .AU "MH 2C-514" 7214 |
| 63 | Peter J. Weinberger |
| 64 | .AI |
| 65 | .MH |
| 66 | .AB |
| 67 | .IT Awk |
| 68 | is a programming language whose |
| 69 | basic operation |
| 70 | is to search a set of files |
| 71 | for patterns, and to perform specified actions upon lines or fields of lines which |
| 72 | contain instances of those patterns. |
| 73 | .IT Awk |
| 74 | makes certain data selection and transformation operations easy to express; |
| 75 | for example, the |
| 76 | .IT awk |
| 77 | program |
| 78 | .sp |
| 79 | .ce |
| 80 | .ft 3 |
| 81 | length > 72 |
| 82 | .ft |
| 83 | .sp |
| 84 | prints all input lines whose length exceeds 72 characters; |
| 85 | the program |
| 86 | .ce |
| 87 | .sp |
| 88 | .ft 3 |
| 89 | NF % 2 == 0 |
| 90 | .ft R |
| 91 | .sp |
| 92 | prints all lines with an even number of fields; |
| 93 | and the program |
| 94 | .ce |
| 95 | .sp |
| 96 | .ft 3 |
| 97 | { $1 = log($1); print } |
| 98 | .ft R |
| 99 | .sp |
| 100 | replaces the first field of each line by its logarithm. |
| 101 | .PP |
| 102 | .IT Awk |
| 103 | patterns may include arbitrary boolean combinations of regular expressions |
| 104 | and of relational operators on strings, numbers, fields, variables, and array elements. |
| 105 | Actions may include the same pattern-matching constructions as in patterns, |
| 106 | as well as |
| 107 | arithmetic and string expressions and assignments, |
| 108 | .UL if-else , |
| 109 | .UL while , |
| 110 | .UL for |
| 111 | statements, |
| 112 | and multiple output streams. |
| 113 | .PP |
| 114 | This report contains a user's guide, a discussion of the design and implementation of |
| 115 | .IT awk , |
| 116 | and some timing statistics. |
| 117 | ....It supersedes TM-77-1271-5, dated September 8, 1977. |
| 118 | .AE |
| 119 | .CS 6 1 7 0 1 4 |
| 120 | .if n .ls 2 |
| 121 | .nr PS 9 |
| 122 | .nr VS 11 |
| 123 | .NH |
| 124 | Introduction |
| 125 | .if t .2C |
| 126 | .PP |
| 127 | .IT Awk |
| 128 | is a programming language designed to make |
| 129 | many common |
| 130 | information retrieval and text manipulation tasks |
| 131 | easy to state and to perform. |
| 132 | .PP |
| 133 | The basic operation of |
| 134 | .IT awk |
| 135 | is to scan a set of input lines in order, |
| 136 | searching for lines which match any of a set of patterns |
| 137 | which the user has specified. |
| 138 | For each pattern, an action can be specified; |
| 139 | this action will be performed on each line that matches the pattern. |
| 140 | .PP |
| 141 | Readers familiar with the |
| 142 | .UX |
| 143 | program |
| 144 | .IT grep\| |
| 145 | .[ |
| 146 | unix program manual |
| 147 | .] |
| 148 | will recognize |
| 149 | the approach, although in |
| 150 | .IT awk |
| 151 | the patterns may be more |
| 152 | general than in |
| 153 | .IT grep , |
| 154 | and the actions allowed are more involved than merely |
| 155 | printing the matching line. |
| 156 | For example, the |
| 157 | .IT awk |
| 158 | program |
| 159 | .P1 |
| 160 | {print $3, $2} |
| 161 | .P2 |
| 162 | prints the third and second columns of a table |
| 163 | in that order. |
| 164 | The program |
| 165 | .P1 |
| 166 | $2 ~ /A\||B\||C/ |
| 167 | .P2 |
| 168 | prints all input lines with an A, B, or C in the second field. |
| 169 | The program |
| 170 | .P1 |
| 171 | $1 != prev { print; prev = $1 } |
| 172 | .P2 |
| 173 | prints all lines in which the first field is different |
| 174 | from the previous first field. |
| 175 | .NH 2 |
| 176 | Usage |
| 177 | .PP |
| 178 | The command |
| 179 | .P1 |
| 180 | awk program [files] |
| 181 | .P2 |
| 182 | executes the |
| 183 | .IT awk |
| 184 | commands in |
| 185 | the string |
| 186 | .UL program |
| 187 | on the set of named files, |
| 188 | or on the standard input if there are no files. |
| 189 | The statements can also be placed in a file |
| 190 | .UL pfile , |
| 191 | and executed by the command |
| 192 | .P1 |
| 193 | awk -f pfile [files] |
| 194 | .P2 |
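The two invocation forms can be tried from a shell. The following is a minimal sketch, assuming a POSIX shell and a current awk; the scratch directory, the sample data, and the shell variable names are invented for illustration, and the single quotes are a shell convention for passing the program as one argument.

```shell
# Scratch directory with hypothetical sample data and a program file.
tmp=$(mktemp -d)
printf 'a b c\nd e f\n' > "$tmp/data"
printf '{ print $3, $2 }\n' > "$tmp/pfile"

# Program given as a single (shell-quoted) string argument.
inline=$(awk '{ print $3, $2 }' "$tmp/data")

# The same program read from a file with -f.
fromfile=$(awk -f "$tmp/pfile" "$tmp/data")

# No file arguments: awk reads the standard input.
piped=$(awk '{ print $3, $2 }' < "$tmp/data")

rm -r "$tmp"
printf '%s\n' "$inline"
```

All three runs should produce the same two output lines.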
| 195 | .NH 2 |
| 196 | Program Structure |
| 197 | .PP |
| 198 | An |
| 199 | .IT awk |
| 200 | program is a sequence of statements of the form: |
| 201 | .P1 |
| 202 | .ft I |
| 203 | pattern { action } |
| 204 | pattern { action } |
| 205 | ... |
| 206 | .ft 3 |
| 207 | .P2 |
| 208 | Each line of input |
| 209 | is matched against |
| 210 | each of the patterns in turn. |
| 211 | For each pattern that matches, the associated action |
| 212 | is executed. |
| 213 | When all the patterns have been tested, the next line |
| 214 | is fetched and the matching starts over. |
| 215 | .PP |
| 216 | Either the pattern or the action may be left out, |
| 217 | but not both. |
| 218 | If there is no action for a pattern, |
| 219 | the matching line is simply |
| 220 | copied to the output. |
| 221 | (Thus a line which matches several patterns can be printed several times.) |
| 222 | If there is no pattern for an action, |
| 223 | then the action is performed for every input line. |
| 224 | A line which matches no pattern is ignored. |
| 225 | .PP |
| 226 | Since patterns and actions are both optional, |
| 227 | actions must be enclosed in braces |
| 228 | to distinguish them from patterns. |
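The rules about omitted patterns and actions can be checked with a small experiment. This is a sketch under the assumption of a POSIX shell and a current awk; the three-line input and the variable names are hypothetical.

```shell
# Hypothetical three-line input.
input='alpha
beta
gamma'

# Two patterns, no actions: matching lines are copied to the output,
# once for every pattern they match.
copies=$(printf '%s\n' "$input" | awk '/a/
/l/')

# An action with no pattern runs for every line.
all=$(printf '%s\n' "$input" | awk '{ print "line:", $0 }')

# A line that matches no pattern is ignored, so nothing is printed.
none=$(printf '%s\n' "$input" | awk '/z/')

printf '%s\n' "$copies"
```

Only the first input line matches both patterns, so it is the only one that appears twice.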
| 229 | .NH 2 |
| 230 | Records and Fields |
| 231 | .PP |
| 232 | .IT Awk |
| 233 | input is divided into |
| 234 | ``records'' terminated by a record separator. |
| 235 | The default record separator is a newline, |
| 236 | so by default |
| 237 | .IT awk |
| 238 | processes its input a line at a time. |
| 239 | The number of the current record is available in a variable |
| 240 | named |
| 241 | .UL NR . |
| 242 | .PP |
| 243 | Each input record |
| 244 | is considered to be divided into ``fields.'' |
| 245 | Fields are normally separated by |
| 246 | white space \(em blanks or tabs \(em |
| 247 | but the input field separator may be changed, as described below. |
| 248 | Fields are referred to as |
| 249 | .UL "$1, $2," |
| 250 | and so forth, |
| 251 | where |
| 252 | .UL $1 |
| 253 | is the first field, |
| 254 | and |
| 255 | .UL $0 |
| 256 | is the whole input record itself. |
| 257 | Fields may be assigned to. |
| 258 | The number of fields in the current record |
| 259 | is available in a variable named |
| 260 | .UL NF . |
| 261 | .PP |
| 262 | The variables |
| 263 | .UL FS |
| 264 | and |
| 265 | .UL RS |
| 266 | refer to the input field and record separators; |
| 267 | they may be changed at any time to any single character. |
| 268 | The optional command-line argument |
| 269 | \f3\-F\fIc\fR |
| 270 | may also be used to set |
| 271 | .UL FS |
| 272 | to the character |
| 273 | .IT c . |
| 274 | .PP |
| 275 | If the record separator is empty, |
| 276 | an empty input line is taken as the record separator, |
| 277 | and blanks, tabs and newlines are treated as field separators. |
| 278 | .PP |
| 279 | The variable |
| 280 | .UL FILENAME |
| 281 | contains the name of the current input file. |
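A short sketch of records and fields follows, assuming a POSIX shell and a current awk. The input data and shell variable names are made up; the rebuilding of the record after a field assignment is how current awks behave, which the text above does not spell out.

```shell
# -F: sets FS to a colon; the passwd-like input is invented.
colon=$(printf 'root:x:0\nbin:x:1\n' | awk -F: '{ print NR, NF, $1 }')

# Fields may be assigned to; $0 then reflects the change
# (an assumption about current awks, not stated in the text).
assigned=$(echo 'a b c' | awk '{ $2 = "X"; print $0 }')

# Empty RS: an empty line ends a record, and newlines separate fields too.
para=$(printf 'a b\nc\n\nd\n' | awk 'BEGIN { RS = "" } { print NR, NF, $1 }')

printf '%s\n' "$colon" "$assigned" "$para"
```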
| 282 | .NH 2 |
| 283 | Printing |
| 284 | .PP |
| 285 | An action may have no pattern, |
| 286 | in which case the action is executed for |
| 287 | all |
| 288 | lines. |
| 289 | The simplest action is to print some or all of a record; |
| 290 | this is accomplished by the |
| 291 | .IT awk |
| 292 | command |
| 293 | .UL print . |
| 294 | The |
| 295 | .IT awk |
| 296 | program |
| 297 | .P1 |
| 298 | { print } |
| 299 | .P2 |
| 300 | prints each record, thus copying the input to the output intact. |
| 301 | More useful is to print a field or fields from each record. |
| 302 | For instance, |
| 303 | .P1 |
| 304 | print $2, $1 |
| 305 | .P2 |
| 306 | prints the first two fields in reverse order. |
| 307 | Items separated by a comma in the print statement will be separated by the current output field separator |
| 308 | when output. |
| 309 | Items not separated by commas will be concatenated, |
| 310 | so |
| 311 | .P1 |
| 312 | print $1 $2 |
| 313 | .P2 |
| 314 | runs the first and second fields together. |
| 315 | .PP |
| 316 | The predefined variables |
| 317 | .UL NF |
| 318 | and |
| 319 | .UL NR |
| 320 | can be used; |
| 321 | for example |
| 322 | .P1 |
| 323 | { print NR, NF, $0 } |
| 324 | .P2 |
| 325 | prints each record preceded by the record number and the number of fields. |
| 326 | .PP |
| 327 | Output may be diverted to multiple files; |
| 328 | the program |
| 329 | .P1 |
| 330 | { print $1 >"foo1"; print $2 >"foo2" } |
| 331 | .P2 |
| 332 | writes the first field, |
| 333 | .UL $1 , |
| 334 | on the file |
| 335 | .UL foo1 , |
| 336 | and the second field on file |
| 337 | .UL foo2 . |
| 338 | The |
| 339 | .UL >> |
| 340 | notation can also be used: |
| 341 | .P1 |
| 342 | print $1 >>"foo" |
| 343 | .P2 |
| 344 | appends the output to the file |
| 345 | .UL foo . |
| 346 | (In each case, |
| 347 | the output files are |
| 348 | created if necessary.) |
| 349 | The file name can be a variable or a field as well as a constant; |
| 350 | for example, |
| 351 | .P1 |
| 352 | print $1 >$2 |
| 353 | .P2 |
| 354 | uses the contents of field 2 as a file name. |
| 355 | .PP |
| 356 | Naturally there is a limit on the number of output files; |
| 357 | currently it is 10. |
| 358 | .PP |
| 359 | Similarly, output can be piped into another process |
| 360 | (on |
| 361 | .UC UNIX |
| 362 | only); for instance, |
| 363 | .P1 |
| 364 | print | "mail bwk" |
| 365 | .P2 |
| 366 | mails the output to |
| 367 | .UL bwk . |
| 368 | .PP |
| 369 | The variables |
| 370 | .UL OFS |
| 371 | and |
| 372 | .UL ORS |
| 373 | may be used to change the current |
| 374 | output field separator and output |
| 375 | record separator. |
| 376 | The output record separator is |
| 377 | appended to the output of the |
| 378 | .UL print |
| 379 | statement. |
| 380 | .PP |
| 381 | .IT Awk |
| 382 | also provides the |
| 383 | .UL printf |
| 384 | statement for output formatting: |
| 385 | .P1 |
| 386 | printf format expr, expr, ... |
| 387 | .P2 |
| 388 | formats the expressions in the list |
| 389 | according to the specification |
| 390 | in |
| 391 | .UL format |
| 392 | and prints them. |
| 393 | For example, |
| 394 | .P1 |
| 395 | printf "%8.2f %10ld\en", $1, $2 |
| 396 | .P2 |
| 397 | prints |
| 398 | .UL $1 |
| 399 | as a floating point number 8 digits wide, |
| 400 | with two after the decimal point, |
| 401 | and |
| 402 | .UL $2 |
| 403 | as a 10-digit long decimal number, |
| 404 | followed by a newline. |
| 405 | No output separators are produced automatically; |
| 406 | you must add them yourself, |
| 407 | as in this example. |
| 408 | The version of |
| 409 | .UL printf |
| 410 | is identical to that used with C. |
| 411 | .[ |
| 412 | C programming language prentice hall 1978 |
| 413 | .] |
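The output facilities of this section can be exercised together. The sketch below assumes a POSIX shell and a current awk; the scratch directory, sample input, and variable names are hypothetical, and the format uses %10d rather than %10ld on the assumption that the length modifier is not needed by current awks.

```shell
tmp=$(mktemp -d)   # scratch directory for the hypothetical output files

# A comma inserts OFS; adjacency concatenates; ORS ends each print.
sep=$(echo 'a b' |
    awk 'BEGIN { OFS = "-"; ORS = ";\n" } { print $2, $1; print $1 $2 }')

# printf adds no separators of its own; the format supplies them.
fmt=$(echo '3.14159 42' | awk '{ printf "%8.2f %10d\n", $1, $2 }')

# Each field goes to its own file, created in the scratch directory.
(cd "$tmp" && printf 'k1 v1\nk2 v2\n' |
    awk '{ print $1 >"foo1"; print $2 >"foo2" }')
foo1=$(cat "$tmp/foo1")
foo2=$(cat "$tmp/foo2")
rm -r "$tmp"

printf '%s\n' "$sep" "$fmt"
```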
| 414 | .NH 1 |
| 415 | Patterns |
| 416 | .PP |
| 417 | A pattern in front of an action acts as a selector |
| 418 | that determines whether the action is to be executed. |
| 419 | A variety of expressions may be used as patterns: |
| 420 | regular expressions, |
| 421 | arithmetic relational expressions, |
| 422 | string-valued expressions, |
| 423 | and arbitrary boolean |
| 424 | combinations of these. |
| 425 | .NH 2 |
| 426 | BEGIN and END |
| 427 | .PP |
| 428 | The special pattern |
| 429 | .UL BEGIN |
| 430 | matches the beginning of the input, |
| 431 | before the first record is read. |
| 432 | The pattern |
| 433 | .UL END |
| 434 | matches the end of the input, |
| 435 | after the last record has been processed. |
| 436 | .UL BEGIN |
| 437 | and |
| 438 | .UL END |
| 439 | thus provide a way to gain control before and after processing, |
| 440 | for initialization and wrapup. |
| 441 | .PP |
| 442 | As an example, the field separator |
| 443 | can be set to a colon by |
| 444 | .P1 |
| 445 | BEGIN { FS = ":" } |
| 446 | .ft I |
| 447 | \&... rest of program ... |
| 448 | .ft 3 |
| 449 | .P2 |
| 450 | Or the input lines may be counted by |
| 451 | .P1 |
| 452 | END { print NR } |
| 453 | .P2 |
| 454 | If |
| 455 | .UL BEGIN |
| 456 | is present, it must be the first pattern; |
| 457 | .UL END |
| 458 | must be the last if used. |
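The two special patterns can be combined in one program. This is a minimal sketch, assuming a POSIX shell and a current awk; the colon-separated input is invented for illustration.

```shell
out=$(printf 'a:b\nc:d\ne:f\n' | awk '
BEGIN { FS = ":" }      # runs before the first record is read
      { print $2 }      # runs for every record
END   { print NR }      # runs after the last record; NR is the line count
')
printf '%s\n' "$out"
```

The second field of each line is printed, followed by the number of lines read.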
| 459 | .NH 2 |
| 460 | Regular Expressions |
| 461 | .PP |
| 462 | The simplest regular expression is a literal string of characters |
| 463 | enclosed in slashes, |
| 464 | like |
| 465 | .P1 |
| 466 | /smith/ |
| 467 | .P2 |
| 468 | This |
| 469 | is actually a complete |
| 470 | .IT awk |
| 471 | program which |
| 472 | will print all lines which contain any occurrence |
| 473 | of the name ``smith''. |
| 474 | If a line contains ``smith'' |
| 475 | as part of a larger word, |
| 476 | it will also be printed, as in |
| 477 | .P1 |
| 478 | blacksmithing |
| 479 | .P2 |
| 480 | .PP |
| 481 | .IT Awk |
| 482 | regular expressions include the regular expression |
| 483 | forms found in |
| 484 | the |
| 485 | .UC UNIX |
| 486 | text editor |
| 487 | .IT ed\| |
| 488 | .[ |
| 489 | unix program manual |
| 490 | .] |
| 491 | and |
| 492 | .IT grep |
| 493 | (without back-referencing). |
| 494 | In addition, |
| 495 | .IT awk |
| 496 | allows |
| 497 | parentheses for grouping, | for alternatives, |
| 498 | .UL + |
| 499 | for ``one or more'', and |
| 500 | .UL ? |
| 501 | for ``zero or one'', |
| 502 | all as in |
| 503 | .IT lex . |
| 504 | Character classes |
| 505 | may be abbreviated: |
| 506 | .UL [a\-zA\-Z0\-9] |
| 507 | is the set of all letters and digits. |
| 508 | As an example, |
| 509 | the |
| 510 | .IT awk |
| 511 | program |
| 512 | .P1 |
| 513 | /[Aa]ho\||[Ww]einberger\||[Kk]ernighan/ |
| 514 | .P2 |
| 515 | will print all lines which contain any of the names |
| 516 | ``Aho,'' ``Weinberger'' or ``Kernighan,'' |
| 517 | whether capitalized or not. |
| 518 | .PP |
| 519 | Regular expressions |
| 520 | (with the extensions listed above) |
| 521 | must be enclosed in slashes, |
| 522 | just as in |
| 523 | .IT ed |
| 524 | and |
| 525 | .IT sed . |
| 526 | Within a regular expression, |
| 527 | blanks and the regular expression |
| 528 | metacharacters are significant. |
| 529 | To turn off the magic meaning |
| 530 | of one of the regular expression characters, |
| 531 | precede it with a backslash. |
| 532 | An example is the pattern |
| 533 | .P1 |
| 534 | /\|\e/\^.\^*\e// |
| 535 | .P2 |
| 536 | which matches any string of characters |
| 537 | enclosed in slashes. |
| 538 | .PP |
| 539 | One can also specify that any field or variable |
| 540 | matches |
| 541 | a regular expression (or does not match it) with the operators |
| 542 | .UL ~ |
| 543 | and |
| 544 | .UL !~ . |
| 545 | The program |
| 546 | .P1 |
| 547 | $1 ~ /[jJ]ohn/ |
| 548 | .P2 |
| 549 | prints all lines where the first field matches ``john'' or ``John.'' |
| 550 | Notice that this will also match ``Johnson'', ``St. Johnsbury'', and so on. |
| 551 | To restrict it to exactly |
| 552 | .UL [jJ]ohn , |
| 553 | use |
| 554 | .P1 |
| 555 | $1 ~ /^[jJ]ohn$/ |
| 556 | .P2 |
| 557 | The caret ^ refers to the beginning |
| 558 | of a line or field; |
| 559 | the dollar sign |
| 560 | .UL $ |
| 561 | refers to the end. |
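The difference between an unanchored and an anchored match is easy to see on a small list. The sketch assumes a POSIX shell and a current awk; the names in the input and the shell variable names are hypothetical.

```shell
names='john
Johnson
mary
John'

# Unanchored: matches anywhere in the first field.
loose=$(printf '%s\n' "$names" | awk '$1 ~ /[jJ]ohn/')

# Anchored at both ends: the field must be exactly john or John.
exact=$(printf '%s\n' "$names" | awk '$1 ~ /^[jJ]ohn$/')

# The negated operator selects the fields that do not match.
others=$(printf '%s\n' "$names" | awk '$1 !~ /[jJ]ohn/')

# Alternation with | inside one regular expression.
alt=$(printf 'Aho\nweinberger\nThompson\n' |
    awk '/[Aa]ho|[Ww]einberger|[Kk]ernighan/')

printf '%s\n' "$exact"
```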
| 562 | .NH 2 |
| 563 | Relational Expressions |
| 564 | .PP |
| 565 | An |
| 566 | .IT awk |
| 567 | pattern can be a relational expression |
| 568 | involving the usual relational operators |
| 569 | .UL < , |
| 570 | .UL <= , |
| 571 | .UL == , |
| 572 | .UL != , |
| 573 | .UL >= , |
| 574 | and |
| 575 | .UL > . |
| 576 | An example is |
| 577 | .P1 |
| 578 | $2 > $1 + 100 |
| 579 | .P2 |
| 580 | which selects lines where the second field |
| 581 | is more than 100 greater than the first field. |
| 582 | Similarly, |
| 583 | .P1 |
| 584 | NF % 2 == 0 |
| 585 | .P2 |
| 586 | prints lines with an even number of fields. |
| 587 | .PP |
| 588 | In relational tests, if neither operand is numeric, |
| 589 | a string comparison is made; |
| 590 | otherwise it is numeric. |
| 591 | Thus, |
| 592 | .P1 |
| 593 | $1 >= "s" |
| 594 | .P2 |
| 595 | selects lines that begin with an |
| 596 | .UL s , |
| 597 | .UL t , |
| 598 | .UL u , |
| 599 | etc. |
| 600 | In the absence of any other information, |
| 601 | fields are treated as strings, so |
| 602 | the program |
| 603 | .P1 |
| 604 | $1 > $2 |
| 605 | .P2 |
| 606 | will perform a string comparison. |
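A few relational patterns can be tried on small inputs. This sketch assumes a POSIX shell and a current awk; the data and variable names are made up, and the string example deliberately uses non-numeric fields so that the choice between string and numeric comparison is not in doubt.

```shell
# The > operator is strict: a difference of exactly 100 is not selected.
numeric=$(printf '0 100\n0 101\n50 200\n' | awk '$2 > $1 + 100')

# A non-numeric field compared with a string constant: string comparison.
strings=$(printf 'apple\nsmith\ntree\n' | awk '$1 >= "s"')

# Lines with an even number of fields.
even=$(printf 'a b\nc\nd e f g\n' | awk 'NF % 2 == 0')

printf '%s\n' "$numeric"
```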
| 607 | .NH 2 |
| 608 | Combinations of Patterns |
| 609 | .PP |
| 610 | A pattern can be any boolean combination of patterns, |
| 611 | using the operators |
| 612 | .UL \||\|| |
| 613 | (or), |
| 614 | .UL && |
| 615 | (and), and |
| 616 | .UL ! |
| 617 | (not). |
| 618 | For example, |
| 619 | .P1 |
| 620 | $1 >= "s" && $1 < "t" && $1 != "smith" |
| 621 | .P2 |
| 622 | selects lines where the first field begins with ``s'', but is not ``smith''. |
| 623 | .UL && |
| 624 | and |
| 625 | .UL \||\|| |
| 626 | guarantee that their operands |
| 627 | will be evaluated |
| 628 | from left to right; |
| 629 | evaluation stops as soon as the truth or falsehood |
| 630 | is determined. |
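Both the combined pattern and the left-to-right evaluation rule can be checked directly. The sketch assumes a POSIX shell and a current awk; the input names and the counter n are hypothetical.

```shell
# First field begins with "s" but is not "smith".
out=$(printf 'smith\nsmythe\ntaylor\nadams\nsam\n' |
    awk '$1 >= "s" && $1 < "t" && $1 != "smith"')

# The right operand of || is not evaluated once the left one is true,
# so n is never incremented.
short=$(awk 'BEGIN { n = 0; if (1 || n++) print n }' < /dev/null)

printf '%s\n' "$out" "$short"
```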
| 631 | .NH 2 |
| 632 | Pattern Ranges |
| 633 | .PP |
| 634 | The ``pattern'' that selects an action may also |
| 635 | consist of two patterns separated by a comma, as in |
| 636 | .P1 |
| 637 | pat1, pat2 { ... } |
| 638 | .P2 |
| 639 | In this case, the action is performed for each line between |
| 640 | an occurrence of |
| 641 | .UL pat1 |
| 642 | and the next occurrence of |
| 643 | .UL pat2 |
| 644 | (inclusive). |
| 645 | For example, |
| 646 | .P1 |
| 647 | /start/, /stop/ |
| 648 | .P2 |
| 649 | prints all lines between |
| 650 | .UL start |
| 651 | and |
| 652 | .UL stop , |
| 653 | while |
| 654 | .P1 |
| 655 | NR == 100, NR == 200 { ... } |
| 656 | .P2 |
| 657 | does the action for lines 100 through 200 |
| 658 | of the input. |
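Pattern ranges are inclusive at both ends, as this small sketch illustrates. It assumes a POSIX shell and a current awk; the five-line input is invented.

```shell
input='a
start
b
stop
c'

# From the line matching /start/ through the next line matching /stop/.
between=$(printf '%s\n' "$input" | awk '/start/, /stop/')

# A range of record numbers, with an explicit action.
byline=$(printf '%s\n' "$input" | awk 'NR == 2, NR == 3 { print NR ": " $0 }')

printf '%s\n' "$between"
```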
| 659 | .NH 1 |
| 660 | Actions |
| 661 | .PP |
| 662 | An |
| 663 | .IT awk |
| 664 | action is a sequence of action statements |
| 665 | terminated by newlines or semicolons. |
| 666 | These action statements can be used to do a variety of |
| 667 | bookkeeping and string manipulating tasks. |
| 668 | .NH 2 |
| 669 | Built-in Functions |
| 670 | .PP |
| 671 | .IT Awk |
| 672 | provides a ``length'' function |
| 673 | to compute the length of a string of characters. |
| 674 | This program prints each record, |
| 675 | preceded by its length: |
| 676 | .P1 |
| 677 | {print length, $0} |
| 678 | .P2 |
| 679 | .UL length |
| 680 | by itself is a ``pseudo-variable'' which |
| 681 | yields the length of the current record; |
| 682 | .UL length(argument) |
| 683 | is a function which yields the length of its argument, |
| 684 | as in |
| 685 | the equivalent |
| 686 | .P1 |
| 687 | {print length($0), $0} |
| 688 | .P2 |
| 689 | The argument may be any expression. |
| 690 | .PP |
| 691 | .IT Awk |
| 692 | also |
| 693 | provides the arithmetic functions |
| 694 | .UL sqrt , |
| 695 | .UL log , |
| 696 | .UL exp , |
| 697 | and |
| 698 | .UL int , |
| 699 | for |
| 700 | square root, |
| 701 | base |
| 702 | .IT e |
| 703 | logarithm, |
| 704 | exponential, |
| 705 | and integer part of their respective arguments. |
| 706 | .PP |
| 707 | The name of one of these built-in functions, |
| 708 | without argument or parentheses, |
| 709 | stands for the value of the function on the |
| 710 | whole record. |
| 711 | The program |
| 712 | .P1 |
| 713 | length < 10 || length > 20 |
| 714 | .P2 |
| 715 | prints lines whose length |
| 716 | is less than 10 or greater |
| 717 | than 20. |
| 718 | .PP |
| 719 | The function |
| 720 | .UL substr(s,\ m,\ n) |
| 721 | produces the substring of |
| 722 | .UL s |
| 723 | that begins at position |
| 724 | .UL m |
| 725 | (origin 1) |
| 726 | and is at most |
| 727 | .UL n |
| 728 | characters long. |
| 729 | If |
| 730 | .UL n |
| 731 | is omitted, the substring goes to the end of |
| 732 | .UL s . |
| 733 | The function |
| 734 | .UL index(s1,\ s2) |
| 735 | returns the position where the string |
| 736 | .UL s2 |
| 737 | occurs in |
| 738 | .UL s1 , |
| 739 | or zero if it does not. |
| 740 | .PP |
| 741 | The function |
| 742 | .UL sprintf(f,\ e1,\ e2,\ ...) |
| 743 | produces the value of the expressions |
| 744 | .UL e1 , |
| 745 | .UL e2 , |
| 746 | etc., |
| 747 | in the |
| 748 | .UL printf |
| 749 | format specified by |
| 750 | .UL f . |
| 751 | Thus, for example, |
| 752 | .P1 |
| 753 | x = sprintf("%8.2f %10ld", $1, $2) |
| 754 | .P2 |
| 755 | sets |
| 756 | .UL x |
| 757 | to the string produced by formatting |
| 758 | the values of |
| 759 | .UL $1 |
| 760 | and |
| 761 | .UL $2 . |
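The built-in functions described above can be exercised on a single record. This is a sketch assuming a POSIX shell and a current awk; the input line, the format string, and the variable x are chosen only for illustration.

```shell
out=$(echo 'hello world' | awk '{
    print length, length($1)                    # whole record, then first field
    print substr($0, 7), substr($0, 1, 5)       # to the end; at most 5 characters
    print index($0, "wor"), index($0, "xyz")    # position, or 0 if absent
    print sqrt(16), int(7.9), exp(0), log(1)
    x = sprintf("%.2f/%03d", 3.14159, 7)        # formatted string kept in x
    print x
}')
printf '%s\n' "$out"
```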
| 762 | .NH 2 |
| 763 | Variables, Expressions, and Assignments |
| 764 | .PP |
| 765 | .IT Awk |
| 766 | variables take on numeric (floating point) |
| 767 | or string values according to context. |
| 768 | For example, in |
| 769 | .P1 |
| 770 | x = 1 |
| 771 | .P2 |
| 772 | .UL x |
| 773 | is clearly a number, while in |
| 774 | .P1 |
| 775 | x = "smith" |
| 776 | .P2 |
| 777 | it is clearly a string. |
| 778 | Strings are converted to numbers and |
| 779 | vice versa whenever context demands it. |
| 780 | For instance, |
| 781 | .P1 |
| 782 | x = "3" + "4" |
| 783 | .P2 |
| 784 | assigns 7 to |
| 785 | .UL x . |
| 786 | Strings which cannot be interpreted |
| 787 | as numbers in a numerical context |
| 788 | will generally have numeric value zero, |
| 789 | but it is unwise to count on this behavior. |
| 790 | .PP |
| 791 | By default, variables (other than built-ins) are initialized to the null string, |
| 792 | which has numerical value zero; |
| 793 | this eliminates the need for most |
| 794 | .UL BEGIN |
| 795 | sections. |
| 796 | For example, the sums of the first two fields can be computed by |
| 797 | .P1 |
| 798 | { s1 += $1; s2 += $2 } |
| 799 | END { print s1, s2 } |
| 800 | .P2 |
| 801 | .PP |
| 802 | Arithmetic is done internally in floating point. |
| 803 | The arithmetic operators are |
| 804 | .UL + , |
| 805 | .UL \- , |
| 806 | .UL \(** , |
| 807 | .UL / , |
| 808 | and |
| 809 | .UL % |
| 810 | (mod). |
| 811 | The C increment |
| 812 | .UL ++ |
| 813 | and |
| 814 | decrement |
| 815 | .UL \-\- |
| 816 | operators are also available, |
| 817 | and so are the assignment operators |
| 818 | .UL += , |
| 819 | .UL \-= , |
| 820 | .UL *= , |
| 821 | .UL /= , |
| 822 | and |
| 823 | .UL %= . |
| 824 | These operators may all be used in expressions. |
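Implicit initialization, string-to-number conversion, and the assignment operators can be seen in one short program. The sketch assumes a POSIX shell and a current awk; the input numbers and the variables n and r are hypothetical.

```shell
out=$(printf '1 10\n2 20\n3 30\n' | awk '
    { s1 += $1; s2 += $2 }             # s1 and s2 start out null, i.e. zero
END { x = "3" + "4"                    # strings become numbers: x is 7
      n = 5; n++; n *= 2; r = n % 5    # n is 12, r is 2
      print s1, s2, x, n, r }
')
printf '%s\n' "$out"
```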
| 825 | .NH 2 |
| 826 | Field Variables |
| 827 | .PP |
| 828 | Fields in |
| 829 | .IT awk |
| 830 | share essentially all of the properties of variables _ |
| 831 | they may be used in arithmetic or string operations, |
| 832 | and may be assigned to. |
| 833 | Thus one can |
| 834 | replace the first field with a sequence number like this: |
| 835 | .P1 |
| 836 | { $1 = NR; print } |
| 837 | .P2 |
| 838 | or |
| 839 | accumulate two fields into a third, like this: |
| 840 | .P1 |
| 841 | { $1 = $2 + $3; print $0 } |
| 842 | .P2 |
| 843 | or assign a string to a field: |
| 844 | .P1 |
| 845 | { if ($3 > 1000) |
| 846 | $3 = "too big" |
| 847 | print |
| 848 | } |
| 849 | .P2 |
| 850 | which replaces the third field by ``too big'' when it is, |
| 851 | and in any case prints the record. |
| 852 | .PP |
| 853 | Field references may be numerical expressions, |
| 854 | as in |
| 855 | .P1 |
| 856 | { print $i, $(i+1), $(i+n) } |
| 857 | .P2 |
| 858 | Whether a field is deemed numeric or string depends on context; |
| 859 | in ambiguous cases like |
| 860 | .P1 |
| 861 | if ($1 == $2) ... |
| 862 | .P2 |
| 863 | fields are treated as strings. |
| 864 | .PP |
| 865 | Each input line is split into fields automatically as necessary. |
| 866 | It is also possible to split any variable or string |
| 867 | into fields: |
| 868 | .P1 |
| 869 | n = split(s, array, sep) |
| 870 | .P2 |
| 871 | splits |
| 872 | the string |
| 873 | .UL s |
| 874 | into |
| 875 | .UL array[1] , |
| 876 | \&..., |
| 877 | .UL array[n] . |
| 878 | The number of elements found is returned. |
| 879 | If the |
| 880 | .UL sep |
| 881 | argument is provided, it is used as the field separator; |
| 882 | otherwise |
| 883 | .UL FS |
| 884 | is used as the separator. |
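The two forms of the split function can be compared as follows. This is a minimal sketch, assuming a POSIX shell and a current awk; the strings and the array names part and word are invented.

```shell
out=$(awk 'BEGIN {
    n = split("usr:local:bin", part, ":")    # explicit separator
    m = split("one two  three", word)        # no separator given: FS is used
    print n, part[1], part[n]
    print m, word[2]
}' < /dev/null)
printf '%s\n' "$out"
```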
| 885 | .NH 2 |
| 886 | String Concatenation |
| 887 | .PP |
| 888 | Strings may be concatenated. |
| 889 | For example |
| 890 | .P1 |
| 891 | length($1 $2 $3) |
| 892 | .P2 |
| 893 | returns the combined length of the first three fields. |
| 894 | Or in a |
| 895 | .UL print |
| 896 | statement, |
| 897 | .P1 |
| 898 | print $1 " is " $2 |
| 899 | .P2 |
| 900 | prints |
| 901 | the two fields separated by `` is ''. |
| 902 | Variables and numeric expressions may also appear in concatenations. |
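Concatenation by juxtaposition can be tried on one input line. The sketch assumes a POSIX shell and a current awk; the input is made up, and the parentheses around the numeric expression are a precaution of ours rather than something the text requires.

```shell
out=$(echo 'ab cde f' | awk '{
    print length($1 $2 $3)      # length of "abcdef"
    print $1 " is " $2
    print "n=" (NF + 1)         # numeric expression inside a concatenation
}')
printf '%s\n' "$out"
```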
| 903 | .NH 2 |
| 904 | Arrays |
| 905 | .PP |
| 906 | Array elements are not declared; |
| 907 | they spring into existence by being mentioned. |
| 908 | Subscripts may have |
| 909 | .ul |
| 910 | any |
| 911 | non-null |
| 912 | value, including non-numeric strings. |
| 913 | As an example of a conventional numeric subscript, |
| 914 | the statement |
| 915 | .P1 |
| 916 | x[NR] = $0 |
| 917 | .P2 |
| 918 | assigns the current input record to |
| 919 | the |
| 920 | .UL NR -th |
| 921 | element of the array |
| 922 | .UL x . |
| 923 | In fact, it is possible in principle (though perhaps slow) |
| 924 | to process the entire input in a random order with the |
| 925 | .IT awk |
| 926 | program |
| 927 | .P1 |
| 928 | { x[NR] = $0 } |
| 929 | END { \fI... program ...\fP } |
| 930 | .P2 |
| 931 | The first action merely records each input line in |
| 932 | the array |
| 933 | .UL x . |
| 934 | .PP |
| 935 | Array elements may be named by non-numeric values, |
| 936 | which gives |
| 937 | .IT awk |
| 938 | a capability rather like the associative memory of |
| 939 | Snobol tables. |
| 940 | Suppose the input contains fields with values like |
| 941 | .UL apple , |
| 942 | .UL orange , |
| 943 | etc. |
| 944 | Then the program |
| 945 | .P1 |
| 946 | /apple/ { x["apple"]++ } |
| 947 | /orange/ { x["orange"]++ } |
| 948 | END { print x["apple"], x["orange"] } |
| 949 | .P2 |
| 950 | increments counts for the named array elements, |
| 951 | and prints them at the end of the input. |
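Both uses of arrays, numeric and string subscripts, fit in one small program. This is a sketch assuming a POSIX shell and a current awk; the input lines and the array name line are hypothetical, and printing the saved lines in reverse is just one example of processing the input out of order.

```shell
out=$(printf 'apple\norange\napple pie\nbanana\n' | awk '
/apple/  { x["apple"]++ }
/orange/ { x["orange"]++ }
         { line[NR] = $0 }                  # remember every record
END      { print x["apple"], x["orange"]
           for (i = NR; i >= 1; i--)        # replay the input backwards
               print line[i] }
')
printf '%s\n' "$out"
```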
| 952 | .NH 2 |
| 953 | Flow-of-Control Statements |
| 954 | .PP |
| 955 | .IT Awk |
| 956 | provides the basic flow-of-control statements |
| 957 | .UL if-else , |
| 958 | .UL while , |
| 959 | .UL for , |
| 960 | and statement grouping with braces, as in C. |
| 961 | We showed the |
| 962 | .UL if |
| 963 | statement in section 3.3 without describing it. |
| 964 | The condition in parentheses is evaluated; |
| 965 | if it is true, the statement following the |
| 966 | .UL if |
| 967 | is done. |
| 968 | The |
| 969 | .UL else |
| 970 | part is optional. |
| 971 | .PP |
| 972 | The |
| 973 | .UL while |
| 974 | statement is exactly like that of C. |
| 975 | For example, to print all input fields one per line, |
| 976 | .P1 |
| 977 | i = 1 |
| 978 | while (i <= NF) { |
| 979 | print $i |
| 980 | ++i |
| 981 | } |
| 982 | .P2 |
| 983 | .PP |
| 984 | The |
| 985 | .UL for |
| 986 | statement is also exactly that of C: |
| 987 | .P1 |
| 988 | for (i = 1; i <= NF; i++) |
| 989 | print $i |
| 990 | .P2 |
| 991 | does the same job as the |
| 992 | .UL while |
| 993 | statement above. |
| 994 | .PP |
| 995 | There is an alternate form of the |
| 996 | .UL for |
| 997 | statement which is suited for accessing the |
| 998 | elements of an associative array: |
| 999 | .P1 |
| 1000 | for (i in array) |
| 1001 | \fIstatement\f3 |
| 1002 | .P2 |
| 1003 | does |
| 1004 | .ul |
| 1005 | statement |
| 1006 | with |
| 1007 | .UL i |
| 1008 | set in turn to each subscript of |
| 1009 | .UL array . |
| 1010 | The elements are accessed in an apparently random order. |
| 1011 | Chaos will ensue if |
| 1012 | .UL i |
| 1013 | is altered, or if any new elements are |
| 1014 | accessed during the loop. |
| 1015 | .PP |
| 1016 | The expression in the condition part of an |
| 1017 | .UL if , |
| 1018 | .UL while |
| 1019 | or |
| 1020 | .UL for |
| 1021 | can include relational operators like |
| 1022 | .UL < , |
| 1023 | .UL <= , |
| 1024 | .UL > , |
| 1025 | .UL >= , |
| 1026 | .UL == |
| 1027 | (``is equal to''), |
| 1028 | and |
| 1029 | .UL != |
| 1030 | (``not equal to''); |
| 1031 | regular expression matches with the match operators |
| 1032 | .UL ~ |
| 1033 | and |
| 1034 | .UL !~ ; |
| 1035 | the logical operators |
| 1036 | .UL \||\|| , |
| 1037 | .UL && , |
| 1038 | and |
| 1039 | .UL ! ; |
| 1040 | and of course parentheses for grouping. |
| 1041 | .PP |
| 1042 | The |
| 1043 | .UL break |
| 1044 | statement causes an immediate exit |
| 1045 | from an enclosing |
| 1046 | .UL while |
| 1047 | or |
| 1048 | .UL for ; |
| 1049 | the |
| 1050 | .UL continue |
| 1051 | statement |
| 1052 | causes the next iteration to begin. |
| 1053 | .PP |
| 1054 | The statement |
| 1055 | .UL next |
| 1056 | causes |
| 1057 | .IT awk |
| 1058 | to skip immediately to |
| 1059 | the next record and begin scanning the patterns from the top. |
| 1060 | The statement |
| 1061 | .UL exit |
| 1062 | causes the program to behave as if the end of the input |
| 1063 | had occurred. |
| 1064 | .PP |
| 1065 | Comments may be placed in |
| 1066 | .IT awk |
| 1067 | programs: |
| 1068 | they begin with the character |
| 1069 | .UL # |
| 1070 | and end with the end of the line, |
| 1071 | as in |
| 1072 | .P1 |
| 1073 | print x, y # this is a comment |
| 1074 | .P2 |
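The control statements of this section can be combined in a few short programs. The sketch assumes a POSIX shell and a current awk; the inputs and the names count, kinds, and total are invented. Because the order of the array loop is unspecified, only order-independent totals are printed.

```shell
# The C-style for loop: one field per output line.
fields=$(echo 'a b c' | awk '{ for (i = 1; i <= NF; i++) print $i }')

# The array form of for: count distinct words and total occurrences.
totals=$(printf 'a b a\nc a\n' | awk '
    { for (i = 1; i <= NF; i++) count[$i]++ }
END { for (w in count) { kinds++; total += count[w] }
      print kinds, total }
')

# next skips to the next record; exit stops as if the input had ended.
ctl=$(printf 'skip me\none\nstop\ntwo\n' | awk '
/^skip/ { next }
/^stop/ { exit }
        { print }   # reached only by lines that survive both tests
')

printf '%s\n' "$fields" "$totals" "$ctl"
```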
| 1075 | .NH |
| 1076 | Design |
| 1077 | .PP |
| 1078 | The |
| 1079 | .UX |
| 1080 | system |
| 1081 | already provides several programs that |
| 1082 | operate by passing input through a |
| 1083 | selection mechanism. |
| 1084 | .IT Grep , |
| 1085 | the first and simplest, merely prints all lines which |
| 1086 | match a single specified pattern. |
| 1087 | .IT Egrep |
| 1088 | provides more general patterns, i.e., regular expressions |
| 1089 | in full generality; |
| 1090 | .IT fgrep |
| 1091 | searches for a set of keywords with a particularly fast algorithm. |
| 1092 | .IT Sed\| |
| 1093 | .[ |
| 1094 | unix program manual |
| 1095 | .] |
| 1096 | provides most of the editing facilities of |
| 1097 | the editor |
| 1098 | .IT ed , |
| 1099 | applied to a stream of input. |
| 1100 | None of these programs provides |
| 1101 | numeric capabilities, |
| 1102 | logical relations, |
| 1103 | or variables. |
| 1104 | .PP |
| 1105 | .IT Lex\| |
| 1106 | .[ |
| 1107 | lesk lexical analyzer cstr |
| 1108 | .] |
| 1109 | provides general regular expression recognition capabilities, |
| 1110 | and, by serving as a C program generator, |
| 1111 | is essentially open-ended in its capabilities. |
| 1112 | The use of |
| 1113 | .IT lex , |
| 1114 | however, requires a knowledge of C programming, |
| 1115 | and a |
| 1116 | .IT lex |
| 1117 | program must be compiled and loaded before use, |
| 1118 | which discourages its use for one-shot applications. |
| 1119 | .PP |
| 1120 | .IT Awk |
| 1121 | is an attempt |
| 1122 | to fill in another part of the matrix of possibilities. |
| 1123 | It |
| 1124 | provides general regular expression capabilities |
| 1125 | and an implicit input/output loop. |
| 1126 | But it also provides convenient numeric processing, |
| 1127 | variables, |
| 1128 | more general selection, |
| 1129 | and control flow in the actions. |
| 1130 | It |
| 1131 | does not require compilation or a knowledge of C. |
| 1132 | Finally, |
| 1133 | .IT awk |
| 1134 | provides |
| 1135 | a convenient way to access fields within lines; |
| 1136 | it is unique in this respect. |
| 1137 | .PP |
| 1138 | .IT Awk |
| 1139 | also tries to integrate strings and numbers |
| 1140 | completely, |
| 1141 | by treating all quantities as both string and numeric, |
| 1142 | deciding which representation is appropriate |
| 1143 | as late as possible. |
| 1144 | In most cases the user can simply ignore the differences. |
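.PP
As a small illustration, assuming a Bourne-style shell and a hypothetical
function name
.UL both_views ,
the same field can be used as a number and as a string in one statement.
Addition forces the numeric view; writing two values side by side
concatenates them as strings.

```shell
# Hypothetical helper: show the first field used numerically and as text.
both_views() {
    awk '{ print $1 + 1, $1 "1" }'
}

echo 41  | both_views    # prints: 42 411
echo abc | both_views    # prints: 1 abc1
```

A field that does not look like a number has the numeric value zero,
which is why the second line begins with 1.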
| 1145 | .PP |
| 1146 | Most of the effort in developing |
| 1147 | .IT awk |
| 1148 | went into deciding what |
| 1149 | .IT awk |
| 1150 | should or should not do |
| 1151 | (for instance, it doesn't do string substitution) |
| 1152 | and what the syntax should be |
| 1153 | (no explicit operator for concatenation) |
| 1154 | rather |
| 1155 | than into writing or debugging the code. |
| 1156 | We have tried |
| 1157 | to make the syntax powerful |
| 1158 | but easy to use and well adapted |
| 1159 | to scanning files. |
| 1160 | For example, |
| 1161 | the absence of declarations and the use of implicit initialization, |
| 1162 | while probably bad ideas for a general-purpose programming language, |
| 1163 | are desirable in a language |
| 1164 | that is meant to be used for tiny programs |
| 1165 | that may even be composed on the command line. |
| 1166 | .PP |
| 1167 | In practice, |
| 1168 | .IT awk |
| 1169 | usage seems to fall into two broad categories. |
| 1170 | One is what might be called ``report generation'' \(em |
| 1171 | processing an input to extract counts, |
| 1172 | sums, sub-totals, etc. |
| 1173 | This also includes the writing of trivial |
| 1174 | data validation programs, |
| 1175 | such as verifying that a field contains only numeric information |
| 1176 | or that certain delimiters are properly balanced. |
| 1177 | The combination of textual and numeric processing is invaluable here. |
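.PP
A data validation program of this kind might be sketched as follows.
The function name
.UL check_size_field
is hypothetical, and the choice of the fourth field assumes input in the
form produced by ls with the l option; neither is prescribed by
.IT awk
itself.

```shell
# Hypothetical helper: report records whose fourth field is not all digits.
check_size_field() {
    awk '
    $4 !~ /^[0-9]+$/ { print "line " NR ": non-numeric field: " $4; bad = bad + 1 }
    END              { print "invalid records:", bad + 0 }
    '
}

check_size_field <<'EOF'
-rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
-rw-rw-rw- 1 ava 12x Oct 15 17:05 yyy
EOF
# prints:
#   line 2: non-numeric field: 12x
#   invalid records: 1
```

The pattern is textual and the count is numeric; adding zero in the END
action makes the count print as 0 when no record is rejected.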
| 1178 | .PP |
| 1179 | A second area of use is as a data transformer, |
| 1180 | converting data from the form produced by one program |
| 1181 | into that expected by another. |
| 1182 | The simplest examples merely select fields, perhaps with rearrangements. |
| 1183 | .NH |
| 1184 | Implementation |
| 1185 | .PP |
| 1186 | The actual implementation of |
| 1187 | .IT awk |
| 1188 | uses the language development tools available |
| 1189 | on the |
| 1190 | .UC UNIX |
| 1191 | operating system. |
| 1192 | The grammar is specified with |
| 1193 | .IT yacc ; |
| 1194 | .[ |
| 1195 | yacc johnson cstr |
| 1196 | .] |
| 1197 | the lexical analysis is done by |
| 1198 | .IT lex ; |
| 1199 | the regular expression recognizers are |
| 1200 | deterministic finite automata |
| 1201 | constructed directly from the expressions. |
| 1202 | An |
| 1203 | .IT awk |
| 1204 | program is translated into a |
| 1205 | parse tree which is then directly executed |
| 1206 | by a simple interpreter. |
| 1207 | .PP |
| 1208 | .IT Awk |
| 1209 | was designed for ease of use rather than processing speed; |
| 1210 | the delayed evaluation of variable types |
| 1211 | and the need to break input |
| 1212 | into fields make high speed difficult to achieve in any case. |
| 1213 | Nonetheless, |
| 1214 | the program has not proven to be unworkably slow. |
| 1215 | .PP |
| 1216 | Table I below shows the execution (user + system) time |
| 1217 | on a PDP-11/70 of |
| 1218 | the |
| 1219 | .UC UNIX |
| 1220 | programs |
| 1221 | .IT wc , |
| 1222 | .IT grep , |
| 1223 | .IT egrep , |
| 1224 | .IT fgrep , |
| 1225 | .IT sed , |
| 1226 | .IT lex , |
| 1227 | and |
| 1228 | .IT awk |
| 1229 | on the following simple tasks: |
| 1230 | .IP "\ \ 1." |
| 1231 | count the number of lines. |
| 1232 | .IP "\ \ 2." |
| 1233 | print all lines containing ``doug''. |
| 1234 | .IP "\ \ 3." |
| 1235 | print all lines containing ``doug'', ``ken'' or ``dmr''. |
| 1236 | .IP "\ \ 4." |
| 1237 | print the third field of each line. |
| 1238 | .IP "\ \ 5." |
| 1239 | print the third and second fields of each line, in that order. |
| 1240 | .IP "\ \ 6." |
| 1241 | append all lines containing ``doug'', ``ken'', and ``dmr'' |
| 1242 | to files ``jdoug'', ``jken'', and ``jdmr'', respectively. |
| 1243 | .IP "\ \ 7." |
| 1244 | print each line prefixed by ``line-number\ :\ ''. |
| 1245 | .IP "\ \ 8." |
| 1246 | sum the fourth column of a table. |
| 1247 | .LP |
| 1248 | The program |
| 1249 | .IT wc |
| 1250 | merely counts words, lines and characters in its input; |
| 1251 | we have already mentioned the others. |
| 1252 | In all cases the input was a file containing |
| 1253 | 10,000 lines |
| 1254 | as created by the |
| 1255 | command |
| 1256 | .IT "ls \-l" ; |
| 1257 | each line has the form |
| 1258 | .P1 |
| 1259 | -rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx |
| 1260 | .P2 |
| 1261 | The total length of this input is |
| 1262 | 452,960 characters. |
| 1263 | Times for |
| 1264 | .IT lex |
| 1265 | do not include compile or load. |
| 1266 | .PP |
| 1267 | As might be expected, |
| 1268 | .IT awk |
| 1269 | is not as fast as the specialized tools |
| 1270 | .IT wc , |
| 1271 | .IT sed , |
| 1272 | or the programs in the |
| 1273 | .IT grep |
| 1274 | family, |
| 1275 | but |
| 1276 | is faster than the more general tool |
| 1277 | .IT lex . |
| 1278 | In all cases, the tasks were |
| 1279 | about as easy to express in |
| 1280 | .IT awk |
| 1281 | as they were |
| 1282 | in these other languages; |
| 1283 | tasks involving fields were |
| 1284 | considerably easier to express as |
| 1285 | .IT awk |
| 1286 | programs. |
| 1287 | Some of the test programs are shown in |
| 1288 | .IT awk , |
| 1289 | .IT sed |
| 1290 | and |
| 1291 | .IT lex . |
| 1292 | .[ |
| 1293 | $LIST$ |
| 1294 | .] |
| 1295 | .1C |
| 1296 | .TS |
| 1297 | center; |
| 1298 | c c c c c c c c c |
| 1299 | c c c c c c c c c |
| 1300 | c|n|n|n|n|n|n|n|n|. |
| 1301 | Task |
| 1302 | Program 1 2 3 4 5 6 7 8 |
| 1303 | _ |
| 1304 | \fIwc\fR 8.6 |
| 1305 | \fIgrep\fR 11.7 13.1 |
| 1306 | \fIegrep\fR 6.2 11.5 11.6 |
| 1307 | \fIfgrep\fR 7.7 13.8 16.1 |
| 1308 | \fIsed\fR 10.2 11.6 15.8 29.0 30.5 16.1 |
| 1309 | \fIlex\fR 65.1 150.1 144.2 67.7 70.3 104.0 81.7 92.8 |
| 1310 | \fIawk\fR 15.0 25.6 29.9 33.3 38.9 46.4 71.4 31.1 |
| 1311 | _ |
| 1312 | .TE |
| 1313 | .sp |
| 1314 | .ce |
| 1315 | \fBTable I.\fR Execution Times of Programs. (Times are in sec.) |
| 1316 | .sp 2 |
| 1317 | .2C |
| 1318 | .PP |
| 1319 | The programs for some of these jobs are shown below. |
| 1320 | The |
| 1321 | .IT lex |
| 1322 | programs are generally too long to show. |
| 1323 | .LP |
| 1324 | AWK: |
| 1325 | .LP |
| 1326 | .P1 |
| 1327 | 1. END {print NR} |
| 1328 | .P2 |
| 1329 | .P1 |
| 1330 | 2. /doug/ |
| 1331 | .P2 |
| 1332 | .P1 |
| 1333 | 3. /ken|doug|dmr/ |
| 1334 | .P2 |
| 1335 | .P1 |
| 1336 | 4. {print $3} |
| 1337 | .P2 |
| 1338 | .P1 |
| 1339 | 5. {print $3, $2} |
| 1340 | .P2 |
| 1341 | .P1 |
| 1342 | 6. /ken/ {print >"jken"} |
| 1343 | /doug/ {print >"jdoug"} |
| 1344 | /dmr/ {print >"jdmr"} |
| 1345 | .P2 |
| 1346 | .P1 |
| 1347 | 7. {print NR ": " $0} |
| 1348 | .P2 |
| 1349 | .P1 |
| 1350 | 8. {sum = sum + $4} |
| 1351 | END {print sum} |
| 1352 | .P2 |
| 1353 | .LP |
| 1354 | SED: |
| 1355 | .LP |
| 1356 | .P1 |
| 1357 | 1. $= |
| 1358 | .P2 |
| 1359 | .P1 |
| 1360 | 2. /doug/p |
| 1361 | .P2 |
| 1362 | .P1 |
| 1363 | 3. /doug/p |
| 1364 | /doug/d |
| 1365 | /ken/p |
| 1366 | /ken/d |
| 1367 | /dmr/p |
| 1368 | /dmr/d |
| 1369 | .P2 |
| 1370 | .P1 |
| 1371 | 4. /[^ ]* [ ]*[^ ]* [ ]*\e([^ ]*\e) .*/s//\e1/p |
| 1372 | .P2 |
| 1373 | .P1 |
| 1374 | 5. /[^ ]* [ ]*\e([^ ]*\e) [ ]*\e([^ ]*\e) .*/s//\e2 \e1/p |
| 1375 | .P2 |
| 1376 | .P1 |
| 1377 | 6. /ken/w jken |
| 1378 | /doug/w jdoug |
| 1379 | /dmr/w jdmr |
| 1380 | .P2 |
| 1381 | .LP |
| 1382 | LEX: |
| 1383 | .LP |
| 1384 | .P1 |
| 1385 | 1. %{ |
| 1386 | int i; |
| 1387 | %} |
| 1388 | %% |
| 1389 | \en i++; |
| 1390 | . ; |
| 1391 | %% |
| 1392 | yywrap() { |
| 1393 | printf("%d\en", i); return 1; |
| 1394 | } |
| 1395 | .P2 |
| 1396 | .P1 |
| 1397 | 2. %% |
| 1398 | ^.*doug.*$ printf("%s\en", yytext); |
| 1399 | . ; |
| 1400 | \en ; |
| 1401 | .P2 |