1.fp 3 G
2....TM "78-1271-12, 78-1273-6" 39199 39199-11
3.ND "September 1, 1978"
4....TR 68
5.RP
6. \" macros here
7.tr _\(em
8.if t .tr ~\(ap
9.tr |\(or
10.tr *\(**
11.de UC
12\&\\$3\s-1\\$1\\s0\&\\$2
13..
14.de IT
15.if n .ul
16\&\\$3\f2\\$1\fP\|\\$2
17..
18.de UL
19.if n .ul
20\&\\$3\f3\\$1\fP\&\\$2
21..
22.de P1
23.DS I 3n
24.nf
25.if n .ta 5 10 15 20 25 30 35 40 45 50 55 60
26.if t .ta .3i .6i .9i 1.2i
27.if t .tr -\-'\(fm*\(**
28.if t .tr _\(ul
29.ft 3
30.lg 0
31.ss 18
32. \"use first argument as indent if present
33..
34.de P2
35.ps \\n(PS
36.vs \\n(VSp
37.ft R
38.ss 12
39.if n .ls 2
40.tr --''``^^!!
41.if t .tr _\(em
42.fi
43.lg
44.DE
45..
46.hw semi-colon
47.hy 14
48. \"2=not last lines; 4= no -xx; 8=no xx-
49. \"special chars in programs
50.de WS
51.sp \\$1
52..
53. \" end of macros
54.TL
55Awk \(em A Pattern Scanning and Processing Language
56.br
57(Second Edition)
58.AU "MH 2C-522" 4862
59Alfred V. Aho
60.AU "MH 2C-518" 6021
61Brian W. Kernighan
62.AU "MH 2C-514" 7214
63Peter J. Weinberger
64.AI
65.MH
66.AB
67.IT Awk
68is a programming language whose
69basic operation
70is to search a set of files
71for patterns, and to perform specified actions upon lines or fields of lines which
72contain instances of those patterns.
73.IT Awk
74makes certain data selection and transformation operations easy to express;
75for example, the
76.IT awk
77program
78.sp
79.ce
80.ft 3
81length > 72
82.ft
83.sp
84prints all input lines whose length exceeds 72 characters;
85the program
86.ce
87.sp
88.ft 3
89NF % 2 == 0
90.ft R
91.sp
92prints all lines with an even number of fields;
93and the program
94.ce
95.sp
96.ft 3
97{ $1 = log($1); print }
98.ft R
99.sp
100replaces the first field of each line by its logarithm.
101.PP
102.IT Awk
103patterns may include arbitrary boolean combinations of regular expressions
104and of relational operators on strings, numbers, fields, variables, and array elements.
105Actions may include the same pattern-matching constructions as in patterns,
106as well as
107arithmetic and string expressions and assignments,
108.UL if-else ,
109.UL while ,
110.UL for
111statements,
112and multiple output streams.
113.PP
114This report contains a user's guide, a discussion of the design and implementation of
115.IT awk ,
116and some timing statistics.
117....It supersedes TM-77-1271-5, dated September 8, 1977.
118.AE
119.CS 6 1 7 0 1 4
120.if n .ls 2
121.nr PS 9
122.nr VS 11
123.NH
124Introduction
125.if t .2C
126.PP
127.IT Awk
128is a programming language designed to make
129many common
130information retrieval and text manipulation tasks
131easy to state and to perform.
132.PP
133The basic operation of
134.IT awk
135is to scan a set of input lines in order,
136searching for lines which match any of a set of patterns
137which the user has specified.
138For each pattern, an action can be specified;
139this action will be performed on each line that matches the pattern.
140.PP
141Readers familiar with the
142.UX
143program
144.IT grep\|
145.[
146unix program manual
147.]
148will recognize
149the approach, although in
150.IT awk
151the patterns may be more
152general than in
153.IT grep ,
154and the actions allowed are more involved than merely
155printing the matching line.
156For example, the
157.IT awk
158program
159.P1
160{print $3, $2}
161.P2
162prints the third and second columns of a table
163in that order.
164The program
165.P1
166$2 ~ /A\||B\||C/
167.P2
168prints all input lines with an A, B, or C in the second field.
169The program
170.P1
171$1 != prev { print; prev = $1 }
172.P2
173prints all lines in which the first field is different
174from the previous first field.
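As a minimal sketch of the last program, assuming a POSIX shell and awk (the helper name first_of_each_run and the sample input are invented for illustration):

```shell
# Hypothetical helper: print a line only when its first field differs
# from the first field of the previous line.
first_of_each_run() {
    awk '$1 != prev { print; prev = $1 }'
}

# Invented sample input with three runs of equal first fields.
printf 'a 1\na 2\nb 3\nb 4\na 5\n' | first_of_each_run
```

Only the first line of each run is printed; the comparison is with the adjacent line, not with the whole input, so the second run of a is reported again.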
175.NH 2
176Usage
177.PP
178The command
179.P1
180awk program [files]
181.P2
182executes the
183.IT awk
184commands in
185the string
186.UL program
187on the set of named files,
188or on the standard input if there are no files.
189The statements can also be placed in a file
190.UL pfile ,
191and executed by the command
192.P1
193awk -f pfile [files]
194.P2
195.NH 2
196Program Structure
197.PP
198An
199.IT awk
200program is a sequence of statements of the form:
201.P1
202.ft I
203 pattern { action }
204 pattern { action }
205 ...
206.ft 3
207.P2
208Each line of input
209is matched against
210each of the patterns in turn.
211For each pattern that matches, the associated action
212is executed.
213When all the patterns have been tested, the next line
214is fetched and the matching starts over.
215.PP
216Either the pattern or the action may be left out,
217but not both.
218If there is no action for a pattern,
219the matching line is simply
220copied to the output.
221(Thus a line which matches several patterns can be printed several times.)
222If there is no pattern for an action,
223then the action is performed for every input line.
224A line which matches no pattern is ignored.
225.PP
226Since patterns and actions are both optional,
227actions must be enclosed in braces
228to distinguish them from patterns.
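The rules about omitted patterns and actions can be tried with a small sketch, assuming a POSIX shell and awk (the name structure_demo and the sample input are invented for illustration):

```shell
# Hypothetical demo of the three statement shapes.
structure_demo() {
    awk '
        /a/                              # pattern only: matching line is copied
        /b/                              # pattern only: matching line is copied
            { n++ }                      # action only: runs for every line
        END { print n, "lines read" }
    '
}

# The line ab matches both patterns; the line c matches neither.
printf 'ab\nc\nb\n' | structure_demo
```

The line ab is printed twice, once per matching pattern, while c produces no output but is still counted by the action that has no pattern.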
229.NH 2
230Records and Fields
231.PP
232.IT Awk
233input is divided into
234``records'' terminated by a record separator.
235The default record separator is a newline,
236so by default
237.IT awk
238processes its input a line at a time.
239The number of the current record is available in a variable
240named
241.UL NR .
242.PP
243Each input record
244is considered to be divided into ``fields.''
245Fields are normally separated by
246white space \(em blanks or tabs \(em
247but the input field separator may be changed, as described below.
248Fields are referred to as
249.UL "$1, $2,"
250and so forth,
251where
252.UL $1
253is the first field,
254and
255.UL $0
256is the whole input record itself.
257Fields may be assigned to.
258The number of fields in the current record
259is available in a variable named
260.UL NF .
261.PP
262The variables
263.UL FS
264and
265.UL RS
266refer to the input field and record separators;
267they may be changed at any time to any single character.
268The optional command-line argument
269\f3\-F\fIc\fR
270may also be used to set
271.UL FS
272to the character
273.IT c .
274.PP
275If the record separator is empty,
276an empty input line is taken as the record separator,
277and blanks, tabs and newlines are treated as field separators.
278.PP
279The variable
280.UL FILENAME
281contains the name of the current input file.
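A short sketch of the separators described in this section, assuming a POSIX shell and awk (the helper names and the sample input are invented for illustration):

```shell
# Hypothetical helper: set FS to a colon from the command line with -F.
login_names() { awk -F: '{ print $1 }'; }

# Hypothetical helper: an empty RS makes blank lines separate records,
# and newlines then act as field separators too.
paragraph_sizes() { awk 'BEGIN { RS = "" } { print NR, NF }'; }

printf 'root:x:0\nbwk:x:7\n' | login_names
printf 'a b\nc d\n\ne\n' | paragraph_sizes
```

With the empty record separator, the first two input lines form one record of four fields and the line after the blank line forms a second record of one field.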
282.NH 2
283Printing
284.PP
285An action may have no pattern,
286in which case the action is executed for
287all
288lines.
289The simplest action is to print some or all of a record;
290this is accomplished by the
291.IT awk
292command
293.UL print .
294The
295.IT awk
296program
297.P1
298{ print }
299.P2
300prints each record, thus copying the input to the output intact.
301More useful is to print a field or fields from each record.
302For instance,
303.P1
304print $2, $1
305.P2
306prints the first two fields in reverse order.
307Items separated by a comma in the print statement will be separated by the current output field separator
308when output.
309Items not separated by commas will be concatenated,
310so
311.P1
312print $1 $2
313.P2
314runs the first and second fields together.
315.PP
316The predefined variables
317.UL NF
318and
319.UL NR
320can be used;
321for example
322.P1
323{ print NR, NF, $0 }
324.P2
325prints each record preceded by the record number and the number of fields.
326.PP
327Output may be diverted to multiple files;
328the program
329.P1
330{ print $1 >"foo1"; print $2 >"foo2" }
331.P2
332writes the first field,
333.UL $1 ,
334on the file
335.UL foo1 ,
336and the second field on file
337.UL foo2 .
338The
339.UL >>
340notation can also be used:
341.P1
342print $1 >>"foo"
343.P2
344appends the output to the file
345.UL foo .
346(In each case,
347the output files are
348created if necessary.)
349The file name can be a variable or a field as well as a constant;
350for example,
351.P1
352print $1 >$2
353.P2
354uses the contents of field 2 as a file name.
355.PP
356Naturally there is a limit on the number of output files;
357currently it is 10.
358.PP
359Similarly, output can be piped into another process
360(on
361.UC UNIX
362only); for instance,
363.P1
364print | "mail bwk"
365.P2
366mails the output to
367.UL bwk .
368.PP
369The variables
370.UL OFS
371and
372.UL ORS
373may be used to change the current
374output field separator and output
375record separator.
376The output record separator is
377appended to the output of the
378.UL print
379statement.
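The effect of commas and of the output separators can be seen in a minimal sketch, assuming a POSIX shell and awk (the helper names and sample input are invented for illustration):

```shell
# Comma between items: the output field separator OFS is placed between them.
join_fields() { awk 'BEGIN { OFS = "-" } { print $2, $1 }'; }

# No comma: the items are concatenated.
glue_fields() { awk '{ print $1 $2 }'; }

# ORS is appended after each print in place of the default newline.
one_line() { awk 'BEGIN { ORS = " " } { print $1 }'; }

printf 'a b\n' | join_fields
printf 'a b\n' | glue_fields
printf 'x 1\ny 2\n' | one_line
```

The third helper writes all first fields on one line, each followed by a blank, because the record separator is no longer a newline.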
380.PP
381.IT Awk
382also provides the
383.UL printf
384statement for output formatting:
385.P1
printf format, expr, expr, ...
387.P2
388formats the expressions in the list
389according to the specification
390in
391.UL format
392and prints them.
393For example,
394.P1
395printf "%8.2f %10ld\en", $1, $2
396.P2
397prints
398.UL $1
399as a floating point number 8 digits wide,
400with two after the decimal point,
401and
402.UL $2
403as a 10-digit long decimal number,
404followed by a newline.
405No output separators are produced automatically;
406you must add them yourself,
407as in this example.
408The version of
409.UL printf
410is identical to that used with C.
411.[
412C programm language prentice hall 1978
413.]
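A minimal sketch of formatted output, assuming a POSIX shell and awk. The helper name format_pair, the brackets that make the field widths visible, and the sample input are invented for illustration; the sketch uses %d rather than %ld on the assumption that not every current awk accepts the l modifier.

```shell
# Hypothetical helper: format the first field as a fixed-point number
# 8 characters wide and the second as a decimal integer 10 wide.
format_pair() {
    awk '{ printf "[%8.2f][%10d]\n", $1, $2 }'
}

printf '3.14159 42\n' | format_pair
```

The newline has to be written in the format string, since printf adds no output separators of its own.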
414.NH 1
415Patterns
416.PP
417A pattern in front of an action acts as a selector
418that determines whether the action is to be executed.
419A variety of expressions may be used as patterns:
420regular expressions,
421arithmetic relational expressions,
422string-valued expressions,
423and arbitrary boolean
424combinations of these.
425.NH 2
426BEGIN and END
427.PP
428The special pattern
429.UL BEGIN
430matches the beginning of the input,
431before the first record is read.
432The pattern
433.UL END
434matches the end of the input,
435after the last record has been processed.
436.UL BEGIN
437and
438.UL END
439thus provide a way to gain control before and after processing,
440for initialization and wrapup.
441.PP
442As an example, the field separator
443can be set to a colon by
444.P1
445BEGIN { FS = ":" }
446.ft I
447\&... rest of program ...
448.ft 3
449.P2
450Or the input lines may be counted by
451.P1
452END { print NR }
453.P2
454If
455.UL BEGIN
456is present, it must be the first pattern;
457.UL END
458must be the last if used.
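The two special patterns can be combined in one program; a minimal sketch, assuming a POSIX shell and awk (the helper name third_field_total and the sample input are invented for illustration):

```shell
# Hypothetical helper: set the separator before any input is read,
# accumulate the third field, and report once at the end.
third_field_total() {
    awk '
        BEGIN { FS = ":" }            # before the first record
              { total += $3 }         # for every record
        END   { print NR, total }     # after the last record
    '
}

printf 'a:b:3\nc:d:4\n' | third_field_total
```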
459.NH 2
460Regular Expressions
461.PP
462The simplest regular expression is a literal string of characters
463enclosed in slashes,
464like
465.P1
466/smith/
467.P2
468This
469is actually a complete
470.IT awk
471program which
472will print all lines which contain any occurrence
473of the name ``smith''.
474If a line contains ``smith''
475as part of a larger word,
476it will also be printed, as in
477.P1
478blacksmithing
479.P2
480.PP
481.IT Awk
482regular expressions include the regular expression
483forms found in
484the
485.UC UNIX
486text editor
487.IT ed\|
488.[
489unix program manual
490.]
491and
492.IT grep
493(without back-referencing).
494In addition,
495.IT awk
496allows
497parentheses for grouping, | for alternatives,
498.UL +
499for ``one or more'', and
500.UL ?
501for ``zero or one'',
502all as in
503.IT lex .
504Character classes
505may be abbreviated:
506.UL [a\-zA\-Z0\-9]
507is the set of all letters and digits.
508As an example,
509the
510.IT awk
511program
512.P1
513/[Aa]ho\||[Ww]einberger\||[Kk]ernighan/
514.P2
515will print all lines which contain any of the names
516``Aho,'' ``Weinberger'' or ``Kernighan,''
517whether capitalized or not.
518.PP
519Regular expressions
520(with the extensions listed above)
521must be enclosed in slashes,
522just as in
523.IT ed
524and
525.IT sed .
526Within a regular expression,
527blanks and the regular expression
528metacharacters are significant.
To turn off the magic meaning
530of one of the regular expression characters,
531precede it with a backslash.
532An example is the pattern
533.P1
534/\|\e/\^.\^*\e//
535.P2
536which matches any string of characters
537enclosed in slashes.
538.PP
539One can also specify that any field or variable
540matches
541a regular expression (or does not match it) with the operators
542.UL ~
543and
544.UL !~ .
545The program
546.P1
547$1 ~ /[jJ]ohn/
548.P2
549prints all lines where the first field matches ``john'' or ``John.''
550Notice that this will also match ``Johnson'', ``St. Johnsbury'', and so on.
551To restrict it to exactly
552.UL [jJ]ohn ,
553use
554.P1
555$1 ~ /^[jJ]ohn$/
556.P2
557The caret ^ refers to the beginning
558of a line or field;
559the dollar sign
560.UL $
561refers to the end.
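The difference between the anchored and unanchored match can be checked with a minimal sketch, assuming a POSIX shell and awk (the helper names and the sample input are invented for illustration):

```shell
# Unanchored: the first field need only contain john or John.
any_john() { awk '$1 ~ /[jJ]ohn/'; }

# Anchored at both ends: the first field must be exactly john or John.
exact_john() { awk '$1 ~ /^[jJ]ohn$/'; }

printf 'John 1\nJohnson 2\njohn 3\nmary 4\n' | any_john
printf 'John 1\nJohnson 2\njohn 3\nmary 4\n' | exact_john
```

The first helper prints three lines, including the one for Johnson; the second prints only the lines whose first field is John or john.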
562.NH 2
563Relational Expressions
564.PP
565An
566.IT awk
567pattern can be a relational expression
568involving the usual relational operators
569.UL < ,
570.UL <= ,
571.UL == ,
572.UL != ,
573.UL >= ,
574and
575.UL > .
576An example is
577.P1
578$2 > $1 + 100
579.P2
which selects lines where the second field
is more than 100 greater than the first field.
582Similarly,
583.P1
584NF % 2 == 0
585.P2
586prints lines with an even number of fields.
587.PP
588In relational tests, if neither operand is numeric,
589a string comparison is made;
590otherwise it is numeric.
591Thus,
592.P1
593$1 >= "s"
594.P2
595selects lines that begin with an
596.UL s ,
597.UL t ,
598.UL u ,
599etc.
600In the absence of any other information,
601fields are treated as strings, so
602the program
603.P1
604$1 > $2
605.P2
606will perform a string comparison.
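Implementations have differed on when a field is taken to be numeric, so a program that depends on one kind of comparison may prefer to force it. A minimal sketch, assuming a POSIX shell and awk (the helper names and the sample input are invented for illustration):

```shell
# Adding 0 forces both operands to be numbers.
numeric_gt() { awk '($1 + 0) > ($2 + 0)'; }

# Concatenating the empty string forces both operands to be strings.
string_gt() { awk '($1 "") > ($2 "")'; }

printf '10 9\n' | numeric_gt   # the line is printed: 10 exceeds 9 as a number
printf '10 9\n' | string_gt    # nothing is printed: "10" sorts before "9"
```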
607.NH 2
608Combinations of Patterns
609.PP
610A pattern can be any boolean combination of patterns,
611using the operators
612.UL \||\||
613(or),
614.UL &&
615(and), and
616.UL !
617(not).
618For example,
619.P1
620$1 >= "s" && $1 < "t" && $1 != "smith"
621.P2
622selects lines where the first field begins with ``s'', but is not ``smith''.
623.UL &&
624and
625.UL \||\||
626guarantee that their operands
627will be evaluated
628from left to right;
629evaluation stops as soon as the truth or falsehood
630is determined.
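The combined pattern above can be tried as follows; a minimal sketch, assuming a POSIX shell and awk (the helper name s_not_smith and the sample input are invented for illustration):

```shell
# Hypothetical helper: first field begins with s but is not smith.
s_not_smith() {
    awk '$1 >= "s" && $1 < "t" && $1 != "smith"'
}

printf 'smith\nsam\ntom\napple\n' | s_not_smith
```

Only sam is selected: apple fails the first test, tom fails the second, and smith fails the third.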
631.NH 2
632Pattern Ranges
633.PP
634The ``pattern'' that selects an action may also
635consist of two patterns separated by a comma, as in
636.P1
637pat1, pat2 { ... }
638.P2
639In this case, the action is performed for each line between
640an occurrence of
641.UL pat1
642and the next occurrence of
643.UL pat2
644(inclusive).
645For example,
646.P1
647/start/, /stop/
648.P2
649prints all lines between
650.UL start
651and
652.UL stop ,
653while
654.P1
655NR == 100, NR == 200 { ... }
656.P2
657does the action for lines 100 through 200
658of the input.
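Both kinds of range can be sketched briefly, assuming a POSIX shell and awk (the helper names and the sample input are invented for illustration):

```shell
# Range delimited by regular expressions; both end lines are included.
between_marks() { awk '/start/, /stop/'; }

# Range delimited by record numbers.
lines_2_to_3() { awk 'NR == 2, NR == 3'; }

printf 'a\nstart\nb\nstop\nc\n' | between_marks
printf 'a\nstart\nb\nstop\nc\n' | lines_2_to_3
```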
659.NH 1
660Actions
661.PP
662An
663.IT awk
664action is a sequence of action statements
665terminated by newlines or semicolons.
666These action statements can be used to do a variety of
667bookkeeping and string manipulating tasks.
668.NH 2
669Built-in Functions
670.PP
671.IT Awk
672provides a ``length'' function
673to compute the length of a string of characters.
674This program prints each record,
675preceded by its length:
676.P1
677{print length, $0}
678.P2
679.UL length
680by itself is a ``pseudo-variable'' which
681yields the length of the current record;
682.UL length(argument)
683is a function which yields the length of its argument,
684as in
685the equivalent
686.P1
687{print length($0), $0}
688.P2
689The argument may be any expression.
690.PP
691.IT Awk
692also
693provides the arithmetic functions
694.UL sqrt ,
695.UL log ,
696.UL exp ,
697and
698.UL int ,
699for
700square root,
701base
702.IT e
703logarithm,
704exponential,
705and integer part of their respective arguments.
706.PP
707The name of one of these built-in functions,
708without argument or parentheses,
709stands for the value of the function on the
710whole record.
711The program
712.P1
713length < 10 || length > 20
714.P2
715prints lines whose length
716is less than 10 or greater
717than 20.
718.PP
719The function
720.UL substr(s,\ m,\ n)
721produces the substring of
722.UL s
723that begins at position
724.UL m
725(origin 1)
726and is at most
727.UL n
728characters long.
729If
730.UL n
731is omitted, the substring goes to the end of
732.UL s .
733The function
734.UL index(s1,\ s2)
735returns the position where the string
736.UL s2
737occurs in
738.UL s1 ,
739or zero if it does not.
740.PP
741The function
742.UL sprintf(f,\ e1,\ e2,\ ...)
743produces the value of the expressions
744.UL e1 ,
745.UL e2 ,
746etc.,
747in the
748.UL printf
749format specified by
750.UL f .
751Thus, for example,
752.P1
753x = sprintf("%8.2f %10ld", $1, $2)
754.P2
755sets
756.UL x
757to the string produced by formatting
758the values of
759.UL $1
760and
761.UL $2 .
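The built-in functions of this section can be exercised without any input by placing the calls in a BEGIN action. This is a minimal sketch, assuming a POSIX shell and awk; the helper name builtin_demo and the sample string are invented for illustration.

```shell
# Hypothetical demo of the string and arithmetic built-ins.
builtin_demo() {
    awk 'BEGIN {
        s = "hello world"
        print length(s)                    # number of characters in s
        print substr(s, 7, 5)              # 5 characters starting at position 7
        print substr(s, 7)                 # n omitted: through the end of s
        print index(s, "wor")              # position of the first occurrence
        print index(s, "xyz")              # 0 when there is no occurrence
        print int(3.9), sqrt(16), exp(0)   # integer part, square root, exponential
        print sprintf("%5.1f|", 3.14159)   # formatted value returned as a string
    }'
}

builtin_demo
```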
762.NH 2
763Variables, Expressions, and Assignments
764.PP
765.IT Awk
766variables take on numeric (floating point)
767or string values according to context.
768For example, in
769.P1
770x = 1
771.P2
772.UL x
773is clearly a number, while in
774.P1
775x = "smith"
776.P2
777it is clearly a string.
778Strings are converted to numbers and
779vice versa whenever context demands it.
780For instance,
781.P1
782x = "3" + "4"
783.P2
784assigns 7 to
785.UL x .
786Strings which cannot be interpreted
787as numbers in a numerical context
788will generally have numeric value zero,
789but it is unwise to count on this behavior.
790.PP
791By default, variables (other than built-ins) are initialized to the null string,
792which has numerical value zero;
793this eliminates the need for most
794.UL BEGIN
795sections.
796For example, the sums of the first two fields can be computed by
797.P1
798 { s1 += $1; s2 += $2 }
799END { print s1, s2 }
800.P2
801.PP
802Arithmetic is done internally in floating point.
803The arithmetic operators are
804.UL + ,
805.UL \- ,
806.UL \(** ,
807.UL / ,
808and
809.UL %
810(mod).
811The C increment
812.UL ++
813and
814decrement
815.UL \-\-
816operators are also available,
817and so are the assignment operators
818.UL += ,
819.UL \-= ,
820.UL *= ,
821.UL /= ,
822and
823.UL %= .
824These operators may all be used in expressions.
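A minimal sketch of implicit initialization and of string-to-number conversion, assuming a POSIX shell and awk (the helper names and the sample input are invented for illustration):

```shell
# s1 and s2 are never set explicitly; they start out null, which counts as 0.
column_sums() {
    awk '
            { s1 += $1; s2 += $2 }
        END { print s1, s2 }
    '
}

# String operands are converted to numbers because + demands numbers.
coercion_demo() {
    awk 'BEGIN { x = "3" + "4"; y = x * 2; print x, y, y % 5 }'
}

printf '1 10\n2 20\n3 30\n' | column_sums
coercion_demo
```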
825.NH 2
826Field Variables
827.PP
828Fields in
829.IT awk
830share essentially all of the properties of variables _
831they may be used in arithmetic or string operations,
832and may be assigned to.
833Thus one can
834replace the first field with a sequence number like this:
835.P1
836{ $1 = NR; print }
837.P2
838or
839accumulate two fields into a third, like this:
840.P1
841{ $1 = $2 + $3; print $0 }
842.P2
843or assign a string to a field:
844.P1
845{ if ($3 > 1000)
846 $3 = "too big"
847 print
848}
849.P2
850which replaces the third field by ``too big'' when it is,
851and in any case prints the record.
852.PP
853Field references may be numerical expressions,
854as in
855.P1
856{ print $i, $(i+1), $(i+n) }
857.P2
858Whether a field is deemed numeric or string depends on context;
859in ambiguous cases like
860.P1
861if ($1 == $2) ...
862.P2
863fields are treated as strings.
864.PP
865Each input line is split into fields automatically as necessary.
866It is also possible to split any variable or string
867into fields:
868.P1
869n = split(s, array, sep)
870.P2
splits
the string
873.UL s
874into
875.UL array[1] ,
876\&...,
877.UL array[n] .
878The number of elements found is returned.
879If the
880.UL sep
881argument is provided, it is used as the field separator;
882otherwise
883.UL FS
884is used as the separator.
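Field assignment and explicit splitting can be sketched as follows, assuming a POSIX shell and awk (the helper names and the sample data are invented for illustration):

```shell
# Replace the first field of every line with its record number.
renumber() { awk '{ $1 = NR; print }'; }

# Split a string on colons and report the count, first and last pieces.
split_demo() {
    awk 'BEGIN {
        n = split("usr:doc:awk", part, ":")
        print n, part[1], part[n]
    }'
}

printf 'x b\ny c\n' | renumber
split_demo
```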
885.NH 2
886String Concatenation
887.PP
888Strings may be concatenated.
889For example
890.P1
891length($1 $2 $3)
892.P2
returns the combined length of the first three fields.
894Or in a
895.UL print
896statement,
897.P1
898print $1 " is " $2
899.P2
900prints
901the two fields separated by `` is ''.
902Variables and numeric expressions may also appear in concatenations.
903.NH 2
904Arrays
905.PP
906Array elements are not declared;
907they spring into existence by being mentioned.
908Subscripts may have
909.ul
910any
911non-null
912value, including non-numeric strings.
913As an example of a conventional numeric subscript,
914the statement
915.P1
916x[NR] = $0
917.P2
918assigns the current input record to
919the
920.UL NR -th
921element of the array
922.UL x .
923In fact, it is possible in principle (though perhaps slow)
924to process the entire input in a random order with the
925.IT awk
926program
927.P1
928 { x[NR] = $0 }
929END { \fI... program ...\fP }
930.P2
931The first action merely records each input line in
932the array
933.UL x .
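One possible way to fill in the END action, offered only as a sketch and assuming a POSIX shell and awk, is to replay the saved lines in reverse order (the helper name reverse_lines and the sample input are invented for illustration):

```shell
# Hypothetical helper: save every line, then print them last to first.
reverse_lines() {
    awk '
            { x[NR] = $0 }                              # remember each line
        END { for (i = NR; i >= 1; i--) print x[i] }    # replay backwards
    '
}

printf 'one\ntwo\nthree\n' | reverse_lines
```

The whole input is held in memory before anything is printed, which is the cost of processing lines out of order.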
934.PP
935Array elements may be named by non-numeric values,
936which gives
937.IT awk
938a capability rather like the associative memory of
939Snobol tables.
940Suppose the input contains fields with values like
941.UL apple ,
942.UL orange ,
943etc.
944Then the program
945.P1
946/apple/ { x["apple"]++ }
947/orange/ { x["orange"]++ }
948END { print x["apple"], x["orange"] }
949.P2
950increments counts for the named array elements,
951and prints them at the end of the input.
952.NH 2
953Flow-of-Control Statements
954.PP
955.IT Awk
956provides the basic flow-of-control statements
957.UL if-else ,
958.UL while ,
959.UL for ,
960and statement grouping with braces, as in C.
961We showed the
962.UL if
963statement in section 3.3 without describing it.
964The condition in parentheses is evaluated;
965if it is true, the statement following the
966.UL if
967is done.
968The
969.UL else
970part is optional.
971.PP
972The
973.UL while
974statement is exactly like that of C.
975For example, to print all input fields one per line,
976.P1
977i = 1
978while (i <= NF) {
979 print $i
980 ++i
981}
982.P2
983.PP
984The
985.UL for
986statement is also exactly that of C:
987.P1
988for (i = 1; i <= NF; i++)
989 print $i
990.P2
991does the same job as the
992.UL while
993statement above.
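The two loops can be wrapped as complete programs and compared; a minimal sketch, assuming a POSIX shell and awk (the helper names and the sample input are invented for illustration):

```shell
# Print the fields of each line one per line, with while and with for.
fields_while() { awk '{ i = 1; while (i <= NF) { print $i; ++i } }'; }
fields_for()   { awk '{ for (i = 1; i <= NF; i++) print $i }'; }

printf 'a b c\n' | fields_while
printf 'a b c\n' | fields_for
```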
994.PP
995There is an alternate form of the
996.UL for
997statement which is suited for accessing the
998elements of an associative array:
999.P1
1000for (i in array)
1001 \fIstatement\f3
1002.P2
1003does
1004.ul
1005statement
1006with
1007.UL i
set in turn to each subscript of
1009.UL array .
1010The elements are accessed in an apparently random order.
1011Chaos will ensue if
1012.UL i
1013is altered, or if any new elements are
1014accessed during the loop.
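A minimal sketch of this form of the loop, assuming a POSIX shell, awk, and sort (the helper name count_keys and the sample input are invented for illustration). Because the order of traversal is not defined, the sketch sorts the output to make it repeatable.

```shell
# Hypothetical helper: count how often each first field occurs.
count_keys() {
    awk '
            { count[$1]++ }
        END { for (k in count) print k, count[k] }
    ' | sort
}

printf 'apple\norange\napple\n' | count_keys
```

Inside the loop k takes on the subscripts of the array, so the counts are fetched as count[k].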
1015.PP
1016The expression in the condition part of an
1017.UL if ,
1018.UL while
1019or
1020.UL for
1021can include relational operators like
1022.UL < ,
1023.UL <= ,
1024.UL > ,
1025.UL >= ,
1026.UL ==
1027(``is equal to''),
1028and
1029.UL !=
1030(``not equal to'');
1031regular expression matches with the match operators
1032.UL ~
1033and
1034.UL !~ ;
1035the logical operators
1036.UL \||\|| ,
1037.UL && ,
1038and
1039.UL ! ;
1040and of course parentheses for grouping.
1041.PP
1042The
1043.UL break
1044statement causes an immediate exit
1045from an enclosing
1046.UL while
1047or
1048.UL for ;
1049the
1050.UL continue
1051statement
1052causes the next iteration to begin.
1053.PP
1054The statement
1055.UL next
1056causes
1057.IT awk
1058to skip immediately to
1059the next record and begin scanning the patterns from the top.
1060The statement
1061.UL exit
1062causes the program to behave as if the end of the input
1063had occurred.
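The two statements can be seen together in a minimal sketch, assuming a POSIX shell and awk (the helper name skip_and_stop and the sample input are invented for illustration):

```shell
# Hypothetical helper: ignore lines starting with #, stop at a line
# starting with end, and report how many lines were kept.
skip_and_stop() {
    awk '
        /^#/   { next }                # later patterns are not tried for this line
        /^end/ { exit }                # stop reading; the END action still runs
               { n++; print }
        END    { print n, "kept" }
    '
}

printf '# c\na\nb\nend\nz\n' | skip_and_stop
```

The line z is never read, yet the count is still printed, since exit behaves as if the input had ended.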
1064.PP
1065Comments may be placed in
1066.IT awk
1067programs:
1068they begin with the character
1069.UL #
1070and end with the end of the line,
1071as in
1072.P1
1073print x, y # this is a comment
1074.P2
1075.NH
1076Design
1077.PP
1078The
1079.UX
1080system
1081already provides several programs that
1082operate by passing input through a
1083selection mechanism.
1084.IT Grep ,
1085the first and simplest, merely prints all lines which
1086match a single specified pattern.
1087.IT Egrep
1088provides more general patterns, i.e., regular expressions
1089in full generality;
1090.IT fgrep
1091searches for a set of keywords with a particularly fast algorithm.
1092.IT Sed\|
1093.[
1094unix programm manual
1095.]
1096provides most of the editing facilities of
1097the editor
1098.IT ed ,
1099applied to a stream of input.
1100None of these programs provides
1101numeric capabilities,
1102logical relations,
1103or variables.
1104.PP
1105.IT Lex\|
1106.[
1107lesk lexical analyzer cstr
1108.]
1109provides general regular expression recognition capabilities,
1110and, by serving as a C program generator,
1111is essentially open-ended in its capabilities.
1112The use of
1113.IT lex ,
1114however, requires a knowledge of C programming,
1115and a
1116.IT lex
1117program must be compiled and loaded before use,
1118which discourages its use for one-shot applications.
1119.PP
1120.IT Awk
1121is an attempt
1122to fill in another part of the matrix of possibilities.
1123It
1124provides general regular expression capabilities
1125and an implicit input/output loop.
1126But it also provides convenient numeric processing,
1127variables,
1128more general selection,
1129and control flow in the actions.
1130It
1131does not require compilation or a knowledge of C.
1132Finally,
1133.IT awk
1134provides
1135a convenient way to access fields within lines;
1136it is unique in this respect.
1137.PP
1138.IT Awk
1139also tries to integrate strings and numbers
1140completely,
1141by treating all quantities as both string and numeric,
1142deciding which representation is appropriate
1143as late as possible.
1144In most cases the user can simply ignore the differences.
1145.PP
Most of the effort in developing
.IT awk
went into deciding what
.IT awk
should or should not do
(for instance, it doesn't do string substitution)
and what the syntax should be
(no explicit operator for concatenation)
rather
than into writing or debugging the code.
1156We have tried
1157to make the syntax powerful
1158but easy to use and well adapted
1159to scanning files.
1160For example,
the absence of declarations, together with implicit initialization,
1162while probably a bad idea for a general-purpose programming language,
1163is desirable in a language
1164that is meant to be used for tiny programs
1165that may even be composed on the command line.
1166.PP
1167In practice,
1168.IT awk
1169usage seems to fall into two broad categories.
1170One is what might be called ``report generation'' \(em
1171processing an input to extract counts,
1172sums, sub-totals, etc.
1173This also includes the writing of trivial
1174data validation programs,
1175such as verifying that a field contains only numeric information
1176or that certain delimiters are properly balanced.
1177The combination of textual and numeric processing is invaluable here.
1178.PP
1179A second area of use is as a data transformer,
1180converting data from the form produced by one program
1181into that expected by another.
1182The simplest examples merely select fields, perhaps with rearrangements.
1183.NH
1184Implementation
1185.PP
1186The actual implementation of
1187.IT awk
1188uses the language development tools available
1189on the
1190.UC UNIX
1191operating system.
1192The grammar is specified with
1193.IT yacc ;
1194.[
1195yacc johnson cstr
1196.]
1197the lexical analysis is done by
1198.IT lex ;
1199the regular expression recognizers are
1200deterministic finite automata
1201constructed directly from the expressions.
1202An
1203.IT awk
1204program is translated into a
1205parse tree which is then directly executed
1206by a simple interpreter.
1207.PP
1208.IT Awk
1209was designed for ease of use rather than processing speed;
1210the delayed evaluation of variable types
1211and the necessity to break input
into fields make high speed difficult to achieve in any case.
1213Nonetheless,
1214the program has not proven to be unworkably slow.
1215.PP
1216Table I below shows the execution (user + system) time
1217on a PDP-11/70 of
1218the
1219.UC UNIX
1220programs
1221.IT wc ,
1222.IT grep ,
1223.IT egrep ,
1224.IT fgrep ,
1225.IT sed ,
1226.IT lex ,
1227and
1228.IT awk
1229on the following simple tasks:
1230.IP "\ \ 1."
1231count the number of lines.
1232.IP "\ \ 2."
1233print all lines containing ``doug''.
1234.IP "\ \ 3."
1235print all lines containing ``doug'', ``ken'' or ``dmr''.
1236.IP "\ \ 4."
1237print the third field of each line.
1238.IP "\ \ 5."
1239print the third and second fields of each line, in that order.
1240.IP "\ \ 6."
1241append all lines containing ``doug'', ``ken'', and ``dmr''
1242to files ``jdoug'', ``jken'', and ``jdmr'', respectively.
1243.IP "\ \ 7."
1244print each line prefixed by ``line-number\ :\ ''.
1245.IP "\ \ 8."
1246sum the fourth column of a table.
1247.LP
1248The program
1249.IT wc
1250merely counts words, lines and characters in its input;
1251we have already mentioned the others.
In all cases the input was a file containing
10,000 lines
as created by the
command
.IT "ls \-l" ;
each line has the form
.P1
-rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
.P2
The total length of this input is
452,960 characters.
Times for
.IT lex
do not include compile or load.
1266.PP
1267As might be expected,
1268.IT awk
1269is not as fast as the specialized tools
1270.IT wc ,
1271.IT sed ,
1272or the programs in the
1273.IT grep
1274family,
1275but
1276is faster than the more general tool
1277.IT lex .
1278In all cases, the tasks were
1279about as easy to express as
1280.IT awk
1281programs
1282as programs in these other languages;
1283tasks involving fields were
1284considerably easier to express as
1285.IT awk
1286programs.
1287Some of the test programs are shown in
1288.IT awk ,
1289.IT sed
1290and
1291.IT lex .
1292.[
1293$LIST$
1294.]
1295.1C
.TS
center;
c c c c c c c c c
c c c c c c c c c
c|n|n|n|n|n|n|n|n|.
	Task
Program	1	2	3	4	5	6	7	8
_
\fIwc\fR	8.6
\fIgrep\fR	11.7	13.1
\fIegrep\fR	6.2	11.5	11.6
\fIfgrep\fR	7.7	13.8	16.1
\fIsed\fR	10.2	11.6	15.8	29.0	30.5	16.1
\fIlex\fR	65.1	150.1	144.2	67.7	70.3	104.0	81.7	92.8
\fIawk\fR	15.0	25.6	29.9	33.3	38.9	46.4	71.4	31.1
_
.TE
.sp
.ce
\fBTable I.\fR Execution Times of Programs. (Times are in sec.)
1316.sp 2
1317.2C
1318.PP
1319The programs for some of these jobs are shown below.
1320The
1321.IT lex
1322programs are generally too long to show.
1323.LP
AWK:
.LP
.P1
1. END {print NR}
.P2
.P1
2. /doug/
.P2
.P1
3. /ken|doug|dmr/
.P2
.P1
4. {print $3}
.P2
.P1
5. {print $3, $2}
.P2
.P1
6. /ken/ {print >"jken"}
   /doug/ {print >"jdoug"}
   /dmr/ {print >"jdmr"}
.P2
.P1
7. {print NR ": " $0}
.P2
.P1
8. {sum = sum + $4}
   END {print sum}
.P2
1353.LP
SED:
.LP
.P1
1. $=
.P2
.P1
2. /doug/p
.P2
.P1
3. /doug/p
   /doug/d
   /ken/p
   /ken/d
   /dmr/p
   /dmr/d
.P2
.P1
4. /[^ ]* [ ]*[^ ]* [ ]*\e([^ ]*\e) .*/s//\e1/p
.P2
.P1
5. /[^ ]* [ ]*\e([^ ]*\e) [ ]*\e([^ ]*\e) .*/s//\e2 \e1/p
.P2
.P1
6. /ken/w jken
   /doug/w jdoug
   /dmr/w jdmr
.P2
1381.LP
LEX:
.LP
.P1
1. %{
   int i;
   %}
   %%
   \en i++;
   . ;
   %%
   yywrap() {
    printf("%d\en", i);
   }
.P2
1396.P1
13972. %%
1398 ^.*doug.*$ printf("%s\en", yytext);
1399 . ;
1400 \en ;
1401.P2