.TH MAWK 1 "Jan 22 1992" "Version 1.1" "USER COMMANDS"
.\" strings
.ds ex \fIexpr\fR
.SH NAME
mawk \- pattern scanning and text processing language

.SH SYNOPSIS
.B mawk
[\-\fBW
.IR option ]
[\-\fBF
.IR value ]
[\-\fBv
.IR var=value ]
[\-\|\-] 'program text' [file ...]
.br
.B mawk
[\-\fBW
.IR option ]
[\-\fBF
.IR value ]
[\-\fBv
.IR var=value ]
[\-\fBf
.IR program-file ]
[\-\|\-] [file ...]

.SH DESCRIPTION
.B mawk
is an interpreter for the AWK Programming Language.
The AWK language
is useful for manipulation of data files,
text retrieval and processing,
and for prototyping and experimenting with algorithms.
.B mawk
is a \fInew awk\fR, meaning it implements the AWK language as
defined in Aho, Kernighan and Weinberger,
.I "The AWK Programming Language,"
Addison-Wesley Publishing, 1988. (Hereafter referred to as
the AWK book.)
.B mawk
conforms to the Posix 1003.2
(draft 11.2)
definition of the AWK language
which contains a few features not described in the AWK
book, and
.B mawk
provides a small number of extensions.

An AWK program is a sequence of \fIpattern {action}\fR pairs and
function definitions.
Short programs are entered on the command line
usually enclosed in ' ' to avoid shell
interpretation.
Longer programs can be read in from a
file with the \-f option.
Data input is read from the list of files on
the command line or from standard input when the list is empty.
The input is broken into records as determined by the
record separator variable, \fBRS\fR. Initially,
.B RS
= "\\n" and records are synonymous with lines.
Each record is compared against each
.I pattern
and if it matches, the program text for
.I "{action}"
is executed.

.SH OPTIONS

.TP \w'\-\fBv'+\w'\fIvar=value'u+2n
\-\fBF \fIvalue
sets the field separator, \fBFS\fR, to
.IR value .

.IP "\-\fBf \fIfile"
Program text is read from \fIfile\fR instead of from the
command line. Multiple \-f options are allowed.

.IP "\-\fBv \fIvar=value"
assigns
.I value
to program variable
.IR var .

.IP "\-\|\-"
indicates the unambiguous end of options.
.PP
The above options will be available with any Posix compatible
implementation of AWK, and implementation specific options are
prefaced with \-W.
.B mawk
provides four:

.TP \w'\-\fBv'+\w'\fIvar=value'u+2n
\-\fBW \fRversion
.B mawk
writes its version and copyright
to stdout and compiled limits to
stderr and exits 0.
.TP
\-\fBW \fRdump
writes an assembler like listing of the internal
representation of the program to stderr.
.TP
\-\fBW \fRsprintf=\fInum
adjusts the size of
.B mawk's
internal sprintf buffer to
.I num
bytes. More than rare use of this option indicates
.B mawk
should be recompiled.
.TP
\-\fBW \fRposix_space
forces
.B mawk
not to consider '\\n' to be space.

.SH "THE AWK LANGUAGE"
.SS "\fB1. Program structure"
An AWK program is a sequence of
.I "pattern {action}"
pairs and user
function definitions.
.PP
A pattern can be:
.nf
.RS
\fBBEGIN
END\fR
expression
expression , expression
.sp
.RE
.fi
One, but not both,
of \fIpattern {action}\fR can be omitted. If
.I {action}
is omitted it is implicitly { print }. If
.I pattern
is omitted, then it is implicitly matched.
.B BEGIN
and
.B END
patterns require an action.
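.PP
The two omission rules can be sketched as follows. This is only an
illustration: it assumes a POSIX shell and that the
.B awk
on the PATH is a new awk (substitute
.B mawk
where it is installed); the shell variable names are made up for
the example.
.nf
```shell
# Three input records, one per line.
input='alpha
beta
gamma'

# Pattern with no action: the implicit action is { print }.
only_pattern=$(echo "$input" | awk 'NR == 2')

# Action with no pattern: the action runs for every record.
only_action=$(echo "$input" | awk '{ n++ } END { print n }')

echo "$only_pattern"
echo "$only_action"
```
.fi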
.PP
Statements are terminated by newlines, semi-colons or both.
Groups of statements such as
actions or loop bodies are blocked via { ... } as in C. The
last statement in a block doesn't need a terminator. Blank lines
have no meaning; an empty statement is terminated with a
semi-colon. Long statements
can be continued with a backslash, \\\|. A statement can be broken
without a backslash after a comma, left brace, &&, ||,
.BR do ,
.BR else ,
the right parenthesis of an
.BR if ,
.B while
or
.B for
statement, and the
right parenthesis of a function definition.
A comment starts with # and extends to, but does not include,
the end of line.
.PP
The following statements control program flow inside blocks.
.RS
.PP
.B if
( \*(ex )
.I statement
.PP
.B if
( \*(ex )
.I statement
.B else
.I statement
.PP
.B while
( \*(ex )
.I statement
.PP
.B do
.I statement
.B while
( \*(ex )
.PP
.B for
(
\fIopt_expr\fR ;
\fIopt_expr\fR ;
\fIopt_expr\fR
)
.I statement
.PP
.B for
( \fIvar \fBin \fIarray\fR )
.I statement
.PP
.B continue
.PP
.B break
.RE
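.PP
A small sketch of these flow statements working together, assuming a
POSIX shell and a new awk installed as
.B awk
(the variable name total is illustrative only):
.nf
```shell
# Sum the integers 1..10, skipping multiples of 3 with continue
# and leaving the loop with break once i exceeds 8.
total=$(awk 'BEGIN {
    for (i = 1; i <= 10; i++) {
        if (i % 3 == 0) continue
        if (i > 8) break
        sum += i
    }
    print sum
}')
echo "$total"
```
.fi
Here 1, 2, 4, 5, 7 and 8 are added, so the sum printed is 27.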
.\"
.SS "\fB2. Data types, conversion and comparison"
There are two basic data types, numeric and string.
Numeric constants can be integer like \-2,
decimal like 1.08, or in scientific notation like
\-1.1e4 or .28E\-3. All numbers are represented internally and all
computations are done in floating point arithmetic.
So for example, the expression
0.2e2 == 20
is true and true is represented as 1.0.
.PP
String constants are enclosed in double quotes.
.sp
.ce
"This is a string with a newline at the end.\\n"
.sp
Strings can be continued across a line by escaping (\\) the newline.
The following escape sequences are recognized.
.nf
.sp
    \\\\        \\
    \\"        "
    \\a        alert, ascii 7
    \\b        backspace, ascii 8
    \\t        tab, ascii 9
    \\n        newline, ascii 10
    \\v        vertical tab, ascii 11
    \\f        formfeed, ascii 12
    \\r        carriage return, ascii 13
    \\ddd      1, 2 or 3 octal digits for ascii ddd
    \\xhh      1 or 2 hex digits for ascii hh
.sp
.fi
If you escape any other character \\c, you get \\c, i.e.,
.B mawk
ignores the escape.
.PP
There are really three basic data types; the third is
.I "number and string"
which has both a numeric value and a string value
at the same time.
User defined variables come into existence when first referenced
and are initialized to
.IR null ,
a number and string value which has numeric value 0 and string value
"".
Non-trivial number and string typed data come from input
and are typically stored in fields. (See section 4).
.PP
The type of an expression is determined by its context and automatic
type conversion occurs if needed. For example, consider the
statements
.nf
.sp
    y = x + 2  ;  z = x  "hello"
.sp
.fi
The value stored in variable y will be typed numeric.
If x is not numeric,
the value taken from x is converted to numeric before it is added to
2 and stored in y. The value stored in variable z will be typed
string, and the value of x will be converted to string if necessary
and concatenated with "hello". (Of course, the value and type
stored in x is not changed by any conversions.)
A string expression is converted to numeric using its longest
numeric prefix as with
.IR atof (3).
A numeric expression is converted to string by replacing
.I expr
with
.BR sprintf(CONVFMT ,
.IR expr ),
unless
.I expr
can be represented on the host machine as an exact integer, in which
case it is converted to \fBsprintf\fR("%d", \*(ex).
.B Sprintf()
is an AWK built-in that duplicates the functionality of
.IR sprintf (3),
and
.B CONVFMT
is a built-in variable used for internal conversion
from number to string and initialized to "%.6g".
Explicit type conversions can be forced:
\*(ex ""
is string and
.IR expr +0
is numeric.
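.PP
The conversions just described can be observed with a short program.
This is a sketch under the assumption that a new awk is installed as
.B awk
and that
.B CONVFMT
still has its initial value; the variable names are illustrative.
.nf
```shell
result=$(awk 'BEGIN {
    x = "3 apples"
    y = x + 2        # numeric context: the numeric prefix "3" is used
    z = x "hello"    # string context: plain concatenation
    w = 0.1 + 0.2    # not an exact integer, so CONVFMT applies
    print y
    print z
    print (w "")     # forced to string through CONVFMT
    print (7 "")     # exact integer, converted as with "%d"
}')
echo "$result"
```
.fi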
.PP
To evaluate,
\*(ex\d1\u \fBrel-op \*(ex\d2\u,
if both operands are numeric or number and string then the comparison
is numeric; if both operands are string the comparison is string.
If exactly one operand is string and after trimming spaces and
tabs from the front and back the remaining string is entirely
numeric in form, then the string is converted to number and the
comparison is numeric; otherwise, the numeric operand is converted
to string and the comparison is string.
The result of a comparison is numeric, 0 or 1.
.PP
In boolean contexts such as,
\fBif\fR ( \*(ex ) \fIstatement\fR,
a string expression evaluates true if and only if it is not the
empty string "";
a numeric expression evaluates true if and only if it is not
numerically zero.
.\"
.SS "\fB3. Regular expressions"
In the AWK language, records, fields and strings are often
tested for matching a
.IR "regular expression" .
Regular expressions are enclosed in slashes, and
.nf
.sp
    \*(ex ~ /\fIr\fR/
.sp
.fi
is an AWK expression that evaluates to 1 if \*(ex "matches"
.IR r ,
which means a substring of \*(ex is in the set of strings
defined by
.IR r .
With no match the expression evaluates to 0; replacing
~ with the "not match" operator, !~ , reverses the meaning.
As pattern-action pairs,
.nf
.sp
    /\fIr\fR/ { \fIaction\fR }   and\
    \fB$0\fR ~ /\fIr\fR/ { \fIaction\fR }
.sp
.fi
are the same,
and for each input record that matches
.IR r ,
.I action
is executed.
In fact, /\fIr\fR/ is an AWK expression that is
equivalent to (\fB$0\fR ~ /\fIr\fR/) anywhere except when on the
right side of a match operator or passed as an argument to
a built-in function that expects a regular expression
argument.
.PP
AWK uses extended regular expressions as with
.IR egrep (1).
The regular expression metacharacters, i.e., those with special
meaning in regular expressions are
.nf
.sp
    \e ^ $ . [ ] | ( ) * + ?
.sp
.fi
Regular expressions are built up from characters as follows:
.RS
.TP \w'[^c\d1\uc\d2\uc\d3\u...]'u+1n
\fIc\fR
matches any non-metacharacter
.IR c .
.IP "\e\fIc\fR"
matches a character defined by the same escape sequences used
in string constants or the literal
character
.I c
if
\\\fIc\fR
is not an escape sequence.
.IP \.
matches any character (including newline).
.TP
^
matches the front of a string.
.TP
$
matches the back of a string.
.TP
[c\d1\uc\d2\uc\d3\u...]
matches any character in the class
c\d1\uc\d2\uc\d3\u... . An interval of characters is denoted
c\d1\u\-c\d2\u inside a class [...].
.TP
[^c\d1\uc\d2\uc\d3\u...]
matches any character not in the class
c\d1\uc\d2\uc\d3\u...
.RE
.sp
Regular expressions are built up from other regular expressions
as follows:
.RS
.TP
\fIr\fR\d1\u\fIr\fR\d2\u
matches
\fIr\fR\d1\u
followed immediately by
\fIr\fR\d2\u
(concatenation).
.TP
\fIr\fR\d1\u | \fIr\fR\d2\u
matches
\fIr\fR\d1\u or
\fIr\fR\d2\u
(alternation).
.TP
\fIr\fR*
matches \fIr\fR repeated zero or more times.
.TP
\fIr\fR+
matches \fIr\fR repeated one or more times.
.TP
\fIr\fR?
matches \fIr\fR zero or once.
.TP
(\fIr\fR)
matches \fIr\fR, providing grouping.
.RE
.sp
The increasing precedence of operators is alternation,
concatenation and
unary (*, + or ?).
.PP
For example,
.nf
.sp
    /^[_a\-zA\-Z][_a\-zA\-Z0\-9]*$/  and
    /^[\-+]?([0\-9]+\\\|.?|\\\|.[0\-9])[0\-9]*([eE][\-+]?[0\-9]+)?$/
.sp
.fi
are matched by AWK identifiers and AWK numeric constants
respectively. Note that . has to be escaped to be
recognized as a decimal point, and that metacharacters are not
special inside character classes.
.PP
Any expression can be used on the right hand side of the ~ or !~
operators or
passed to a built-in that expects
a regular expression.
If needed, it is converted to string, and then interpreted
as a regular expression. For example,
.nf
.sp
    BEGIN { identifier = "[_a\-zA\-Z][_a\-zA\-Z0\-9]*" }

    $0 ~ "^" identifier
.sp
.fi
prints all lines that start with an AWK identifier.
.PP
.B mawk
recognizes the empty regular expression, //\|, which matches the
empty string and hence is matched by any string at the front,
back and between every character. For example,
.nf
.sp
    echo  abc | mawk '{ gsub(//, "X") ; print }'
    XaXbXcX
.sp
.fi
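.PP
As a further sketch, a regular expression held in a string variable
can classify each field of a record. The example assumes a new awk
installed as
.BR awk ;
the variable name ident and the sample input are illustrative only.
.nf
```shell
out=$(echo 'x1 = 2y' | awk '{
    ident = "^[_a-zA-Z][_a-zA-Z0-9]*$"
    for (i = 1; i <= NF; i++)
        print $i, ($i ~ ident ? "identifier" : "not an identifier")
}')
echo "$out"
```
.fi
The string is converted to a regular expression each time the ~
operator is applied, so the anchors ^ and $ behave exactly as they
would between slashes.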
.\"
.SS "\fB4. Records and fields"
Records are read in one at a time, and stored in the
.I field
variable
.BR $0 .
The record is split into
.I fields
which are stored in
.BR $1 ,
.BR $2 ", ...,"
.BR $NF .
The built-in variable
.B NF
is set to the number of fields,
and
.B NR
and
.B FNR
are incremented by 1.
Fields above
.B $NF
are set to "".
.PP
Assignment to
.B $0
causes the fields and
.B NF
to be recomputed.
Assignment to
.B NF
or to a field
causes
.B $0
to be reconstructed by
concatenating the
.B $i's
separated by
.BR OFS .
Assignment to a field with index greater than
.BR NF ,
increases
.B NF
and causes
.B $0
to be reconstructed.
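.PP
The reconstruction of
.B $0
can be seen in a short sketch, assuming a new awk installed as
.B awk
(the shell variable names are illustrative only):
.nf
```shell
# Assigning to a field rebuilds $0 with OFS between the fields.
rebuilt=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $2 = "X"; print }')

# Assigning past NF raises NF; the fields in between are empty.
grown=$(echo 'a b c' | awk 'BEGIN { OFS = "-" } { $5 = "e"; print NF; print }')

echo "$rebuilt"
echo "$grown"
```
.fi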
.PP
Data input stored in fields
is string, unless the entire field has numeric
form and then the type is number and string.
For example,
.sp
.nf
    echo 24 24E |
    mawk '{ print($1>100, $1>"100", $2>100, $2>"100") }'
    0 0 1 1
.fi
.sp
.B $0
and
.B $2
are string and
.B $1
is number and string. The first
and second comparisons are numeric and the last
two are string. In the second "100" is
converted to 100, and in the third 100 is
converted to "100".
.\"
.SS "\fB5. Expressions and operators"
.PP
The expression syntax is
similar to C. Primary expressions are numeric constants,
string constants, variables, fields, arrays and functions.
The identifier
for a variable, array or function can be a sequence of
letters, digits and underscores, that does
not start with a digit.
Variables are not declared; they exist when first referenced and
are initialized to
.IR null .
.PP
New
expressions are composed with the following operators in
order of increasing precedence.
.PP
.RS
.nf
.vs +2p \" open up a little
\fIassignment\fR          =  +=  \-=  *=  /=  %=  ^=
\fIconditional\fR         ?  :
\fIlogical or\fR          ||
\fIlogical and\fR         &&
\fIarray membership\fR    \fBin\fR
\fImatching\fR            ~   !~
\fIrelational\fR          <  >   <=  >=  ==  !=
\fIconcatenation\fR       (no explicit operator)
\fIadd ops\fR             +  \-
\fImul ops\fR             *  /  %
\fIunary\fR               +  \-
\fIlogical not\fR         !
\fIexponentiation\fR      ^
\fIinc and dec\fR         ++ \-\|\- (both post and pre)
\fIfield\fR               $
.vs
.RE
.PP
.fi
Assignment, conditional and exponentiation associate right to
left; the other operators associate left to right. Any
expression can be parenthesized.
.\"
.SS "\fB6. Arrays"
.ds ae \fIarray\fR[\fIexpr\fR]
Awk provides one-dimensional arrays. Array elements are expressed
as \*(ae.
.I Expr
is internally converted to string type, so, for example,
A[1] and A["1"] are the same element and the actual
index is "1".
Arrays indexed by strings are called associative arrays.
Initially an array is empty; elements exist when first accessed.
An expression,
\fIexpr\fB in\fI array\fR
evaluates to 1 if
\*(ae
exists, else to 0.
.PP
There is a form of the
.B for
statement that loops over each index of an array.
.nf
.sp
    \fBfor\fR ( \fIvar\fB in \fIarray \fR) \fIstatement\fR
.sp
.fi
sets
.I var
to each index of
.I array
and executes
.IR statement .
The order that
.I var
traverses the indices of
.I array
is not defined.
.PP
The statement,
.B delete
\*(ae,
causes
\*(ae
not to exist.
.PP
Multidimensional arrays are synthesized with concatenation using
the built-in variable
.BR SUBSEP .
\fIarray\fR[\fIexpr\fR\d1\u,\|\fIexpr\fR\d2\u]
is equivalent to
\fIarray\fR[\fIexpr\fR\d1\u \fBSUBSEP \fIexpr\fR\d2\u].
Testing for a multidimensional element uses a parenthesized index,
such as
.sp
.nf
    if ( (i, j)  in A )  print A[i, j]
.fi
.sp
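.PP
The array rules above can be checked with a few membership tests.
This is a sketch only, assuming a new awk installed as
.BR awk ;
the array and variable names are illustrative.
.nf
```shell
report=$(awk 'BEGIN {
    A[1] = "one"                 # the index is stored as the string "1"
    same = (A["1"] == "one")     # so A[1] and A["1"] are one element
    M[2, 3] = "cell"             # stored under the index 2 SUBSEP 3
    found = ((2, 3) in M)
    missing = ((3, 2) in M)      # the in test does not create an element
    delete A[1]
    gone = !(1 in A)
    print same, found, missing, gone
}')
echo "$report"
```
.fi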
.\"
.SS "\fB7. Built-in variables\fR"
.PP
The following variables are built-in and initialized before program
execution.
.RS
.TP \w'FILENAME'u+2n
.B ARGC
number of command line arguments.
.TP
.B ARGV
array of command line arguments, 0..ARGC\-1.
.TP
.B CONVFMT
format for internal conversion of numbers to string,
initially = "%.6g".
.TP
.B ENVIRON
array indexed by environment variables. An environment string,
\fIvar=value\fR is stored as
\fBENVIRON\fR[\fIvar\fR] =
.IR value .
.TP
.B FILENAME
name of the current input file.
.TP
.B FNR
current record number in
.BR FILENAME .
.TP
.B FS
splits records into fields as a regular expression.
.TP
.B NF
number of fields in the current record.
.TP
.B NR
current record number in the total input stream.
.TP
.B OFMT
format for printing numbers; initially = "%.6g".
.TP
.B OFS
inserted between fields on output, initially = " ".
.TP
.B ORS
terminates each record on output, initially = "\\n".
.TP
.B RLENGTH
length set by the last call to the built-in function,
.BR match() .
.TP
.B RS
input record separator, initially = "\\n".
.TP
.B RSTART
index set by the last call to
.BR match() .
.TP
.B SUBSEP
used to build multiple array subscripts, initially = "\\034".
.RE
.\"
.SS "\fB8. Built-in functions"
String functions
.RS
.TP
gsub(\fIr,s,t\fR)  gsub(\fIr,s\fR)
Global substitution, every match of regular expression
.I r
in variable
.I t
is replaced by string
.IR s .
The number of replacements is returned.
If
.I t
is omitted,
.B $0
is used. An & in the replacement string
.I s
is replaced by the matched substring of
.IR t .
\\& puts a literal & in the replacement string.
.TP
index(\fIs,t\fR)
If
.I t
is a substring of
.IR s ,
then the position where
.I t
starts is returned, else 0 is returned.
The first character of
.I s
is in position 1.
.TP
length(\fIs\fR)  length()
Returns the length of string
.IR s ;
without an argument, returns the length of
.BR $0 .
.TP
match(\fIs,r\fR)
Returns the index of the first longest match of regular expression
.I r
in string
.IR s .
Returns 0 if no match.
As a side effect,
.B RSTART
is set to the return value.
.B RLENGTH
is set to the length of the match or \-1 if no match. If the
empty string is matched,
.B RLENGTH
is set to 0, and 1 is returned if the match is at the front, and
length(\fIs\fR)+1 is returned if the match is at the back.
.TP
split(\fIs,A,r\fR)  split(\fIs,A\fR)
String
.I s
is split into fields by regular expression
.I r
and the fields are loaded into array
.IR A .
The number of fields
is returned. See section 11 below for more detail.
If
.I r
is omitted,
.B FS
is used.
.TP
sprintf(\fIformat,expr-list\fR)
Returns a string constructed from
.I expr-list
according to
.IR format .
See the description of printf() below.
.TP
sub(\fIr,s,t\fR)  sub(\fIr,s\fR)
Single substitution, same as gsub() except at most one substitution.
.TP
substr(\fIs,i,n\fR)  substr(\fIs,i\fR)
Returns the substring of string
.IR s ,
starting at index
.IR i ,
of length
.IR n .
If
.I n
is omitted, the suffix of
.IR s ,
starting at
.I i
is returned.
.TP
tolower(\fIs\fR)
Returns a copy of
.I s
with all upper case characters converted to lower case.
.TP
toupper(\fIs\fR)
Returns a copy of
.I s
with all lower case characters converted to upper case.
.RE
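.PP
A brief sketch exercising several of the string functions, assuming a
new awk installed as
.B awk
(the sample strings and variable names are illustrative only):
.nf
```shell
out=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "world")          # position of a substring
    print substr(s, 1, 5)            # first five characters
    print toupper(substr(s, 7))      # suffix starting at index 7
    n = gsub(/o/, "[&]", s)          # & stands for the matched text
    print n, s
    pos = match("xxabbbc", /ab+/)    # also sets RSTART and RLENGTH
    print pos, RSTART, RLENGTH
}')
echo "$out"
```
.fi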
.PP
Arithmetic functions
.RS
.PP
.nf
atan2(\fIy,x\fR)     Arctan of \fIy\fR/\fIx\fR between \-\(*p and \(*p.
.PP
cos(\fIx\fR)         Cosine function, \fIx\fR in radians.
.PP
exp(\fIx\fR)         Exponential function.
.PP
int(\fIx\fR)         Returns \fIx\fR truncated towards zero.
.PP
log(\fIx\fR)         Natural logarithm.
.PP
rand()         Returns a random number between zero and one.
.PP
sin(\fIx\fR)         Sine function, \fIx\fR in radians.
.PP
sqrt(\fIx\fR)        Returns square root of \fIx\fR.
.fi
.TP
srand(\fIexpr\fR)  srand()
Seeds the random number generator, using the clock if
.I expr
is omitted, and returns the value of the previous seed.
.B mawk
seeds the random number generator from the clock at startup
so there is no real need to call srand(). Srand(\fIexpr\fR)
is useful for repeating pseudo random sequences.
.RE
.\"
.SS "\fB9. Input and output"
There are two output statements,
.B print
and
.BR printf .
.RS
.TP
print
writes
.B "$0  ORS"
to standard output.
.TP
print \*(ex\d1\u, \*(ex\d2\u, ..., \*(ex\dn\u
writes
\*(ex\d1\u \fBOFS \*(ex\d2\u \fBOFS\fR ... \*(ex\dn\u
.B ORS
to standard output. Numeric expressions are converted to
string with
.BR OFMT .
.TP
printf \fIformat, expr-list\fR
duplicates the printf C library function writing to standard output.
The complete ANSI C format specifications are recognized with
conversions %c, %d, %e, %E, %f, %g, %G,
%i, %o, %s, %u, %x, %X and %%,
and conversion qualifiers h and l.
.RE
.PP
The argument list to print or printf can optionally be enclosed in
parentheses.
Print formats numbers using
.B OFMT
or "%d" for exact integers.
"%c" with a numeric argument prints the corresponding 8 bit
character, with a string argument it prints the first character of
the string.
The output of print and printf can be redirected to a file or
command by appending >
.IR file ,
>>
.I file
or
|
.I command
to the end of the print statement.
Redirection opens
.I file
or
.I command
only once; subsequent redirections append to the already open stream.
By convention,
.B mawk
associates the filename "/dev/stderr" with stderr which allows
print and printf to be redirected to stderr.
.PP
The input function
.B getline
has the following variations.
.RS
.TP
getline
reads into
.BR $0 ,
updates the fields,
.BR NF ,
.B NR
and
.BR FNR .
.TP
getline < \fIfile\fR
reads into
.B $0
from \fIfile\fR,
updates the fields and
.BR NF .
.TP
getline \fIvar
reads the next record into
.IR var ,
updates
.B NR
and
.BR FNR .
.TP
getline \fIvar\fR < \fIfile
reads the next record of
.I file
into
.IR var .
.TP
\fI command\fR | getline
pipes a record from
.I command
into
.B $0
and updates the fields and
.BR NF .
.TP
\fI command\fR | getline \fIvar
pipes a record from
.I command
into
.IR var .
.RE
.PP
Getline returns 0 on end-of-file, \-1 on error, otherwise 1.
.PP
Commands on the end of pipes are executed by /bin/sh.
.PP
The function \fBclose\fR(\*(ex) closes the file or pipe
associated with
.IR expr .
Close returns 0 if
.I expr
is an open file,
the exit status if
.I expr
is a piped command, and \-1 otherwise.
Close() is used to reread a file or command, make sure the other
end of an output pipe is finished or conserve file resources.
.PP
The function
\fBsystem\fR(\fIexpr\fR)
uses
/bin/sh
to execute
.I expr
and returns the exit status of the command
.IR expr .
Changes made to the
.B ENVIRON
array are not passed to commands executed with
.B system
or pipes.
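.PP
A sketch of reading from a piped command with getline and then
closing it. It assumes a new awk installed as
.B awk
and a /bin/sh that provides echo; the command string and variable
names are illustrative only.
.nf
```shell
out=$(awk 'BEGIN {
    cmd = "echo one; echo two"
    while ((cmd | getline line) > 0)    # 1 per record, 0 at end of file
        n++
    status = close(cmd)                 # exit status of the piped command
    print n, line, status
}')
echo "$out"
```
.fi
The parentheses around the getline expression keep the comparison
with 0 from being parsed as part of the getline itself.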
.SS "\fB10. User defined functions"
The syntax for a user defined function is
.nf
.sp
    \fBfunction\fR name( \fIargs\fR ) { \fIstatements\fR }
.sp
.fi
The function body can contain a return statement
.nf
.sp
    \fBreturn\fI opt_expr\fR
.sp
.fi
A return statement is not required.
Function calls may be nested or recursive.
Functions are passed expressions by value
and arrays by reference.
Extra arguments serve as local variables
and are initialized to
.IR null .
For example, csplit(\fIs,\|A\fR) puts each character of
.I s
into array
.I A
and returns the length of
.IR s .
.nf
.sp
    function csplit(s, A,    n, i)
    {
      n = length(s)
      for( i = 1 ; i <= n ; i++ ) A[i] = substr(s, i, 1)
      return n
    }
.sp
.fi
Putting extra space between passed arguments and local
variables is conventional.
Functions can be referenced before they are defined, but the
function name and the '(' of the arguments must touch to
avoid confusion with concatenation.
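.PP
The following sketch calls csplit() and shows that scalars are passed
by value while arrays are passed by reference. It assumes a new awk
installed as
.BR awk ;
the helper bump() and the variable names are made up for the example.
.nf
```shell
out=$(awk '
function csplit(s, A,    n, i)    # n and i are locals
{
    n = length(s)
    for (i = 1; i <= n; i++) A[i] = substr(s, i, 1)
    return n
}
function bump(x) { x++; return x }    # x is a copy of the argument
BEGIN {
    i = 10; v = 1
    len = csplit("abc", C)    # C is filled in by the function
    w = bump(v)               # v itself is not changed
    print len, C[1] C[2] C[3], i, v, w
}')
echo "$out"
```
.fi
The global i keeps its value 10 because the i inside csplit() is a
local.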
.\"
.SS "\fB11. Splitting strings, records and files"
Awk programs use the same algorithm to
split strings into arrays with split(), and records into fields
on
.BR FS .
.B mawk
uses essentially the same algorithm to split files into
records on
.BR RS .
.PP
Split(\fIexpr,\|A,\|sep\fR) works as follows:
.RS
.TP
(1)
If
.I sep
is omitted, it is replaced by
.BR FS .
.I Sep
can be an expression or regular expression. If it is an
expression of non-string type, it is converted to string.
.TP
(2)
If
.I sep
= " " (a single space),
then <SPACE> is trimmed from the front and back of
.IR expr ,
and
.I sep
becomes <SPACE>.
.B mawk
defines <SPACE> as the regular expression
/[\ \\t\\n]+/.
Otherwise
.I sep
is treated as a regular expression, except that meta-characters
are ignored for a string of length 1,
e.g.,
split(x, A, "*") and split(x, A, /\\*/) are the same.
.TP
(3)
If \*(ex is not string, it is converted to string.
If \*(ex is then the empty string "", split() returns 0
and
.I A
is unchanged.
Otherwise,
all non-overlapping, non-null and longest matches of
.I sep
in
.IR expr ,
separate
.I expr
into fields which are loaded into
.IR A .
The fields are placed in
A[1], A[2], ..., A[n] and split() returns n, the number
of fields which is the number
of matches plus one.
Data placed in
.I A
that looks numeric is typed number and string.
.RE
.PP
Splitting records into fields works the same except the
pieces are loaded into
.BR $1 ,
\fB$2\fR,...,
.BR $NF .
If
.B $0
is empty,
.B NF
is set to 0 and all
.B $i
to "".
.PP
.B mawk
splits files into records by the same algorithm, but with the
slight difference that
.B RS
is really a terminator instead of a separator.
(\fBORS\fR is really a terminator too).
.RS
.PP
E.g., if
.B FS
= ":+" and
.B $0
= "a::b:" , then
.B NF
= 3 and
.B $1
= "a",
.B $2
= "b" and
.B $3
= "", but
if "a::b:" is the contents of an input file and
.B RS
= ":+", then
there are two records "a" and "b".
.RE
.PP
.B RS
= " " is not special.
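.PP
The split() rules can be sketched as follows, assuming a new awk
installed as
.B awk
(the array names and sample strings are illustrative only):
.nf
```shell
out=$(awk 'BEGIN {
    n = split("a::b:", A, /:+/)          # two matches, so three fields
    print n, "[" A[1] "]", "[" A[2] "]", "[" A[3] "]"
    m = split("  x   y ", B, " ")        # single space: trim, then <SPACE>
    print m, "[" B[1] "]", "[" B[2] "]"
    print split("", C)                   # empty string: no fields
}')
echo "$out"
```
.fi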
.\"
.SS "\fB12. Multi-line records"
Since
.B mawk
interprets
.B RS
as a regular expression, multi-line
records are easy. Setting
.B RS
= "\\n\\n+", makes one or more blank
lines separate records. If
.B FS
= " " (the default), then single
newlines, by the rules for <SPACE> above, become space and
single newlines are field separators.
.RS
.PP
For example, if a file is "a\ b\\nc\\n\\n",
.B RS
= "\\n\\n+" and
.B FS
= "\ ", then there is one record "a\ b\\nc" with three
fields "a", "b" and "c". Changing
.B FS
= "\\n", gives two
fields "a b" and "c"; changing
.B FS
= "", gives one field
identical to the record.
.RE
.PP
If you want lines with spaces or tabs to be considered blank,
set
.B RS
= "\\n([\ \\t]*\\n)+".
For compatibility with other awks, setting
.B RS
= "" has the same
effect as if blank lines are stripped from the
front and back of files and then records are determined as if
.B RS
= "\\n\\n+".
Posix requires that "\\n" always separates records when
.B RS
= "" regardless of the value of
.BR FS .
.B mawk
does not support this convention, because defining
"\\n" as <SPACE> makes it unnecessary.
.\"
.PP
Most of the time when you change
.B RS
for multi-line records, you
will also want to change
.B ORS
to "\\n\\n" so the record spacing is preserved on output.
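.PP
A sketch of blank-line separated records using the portable setting
.B RS
= "" with the default
.BR FS ,
assuming a new awk installed as
.B awk
(the sample data and variable names are illustrative only):
.nf
```shell
# Two records separated by one blank line.
data='a b
c

d e'

out=$(echo "$data" |
    awk 'BEGIN { RS = "" } { print NR ": " NF " fields, last is " $NF }')
echo "$out"
```
.fi
The newline inside the first record acts as a field separator, so
that record has three fields.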
1169.\"
1170.SS "\fB13. Program execution"
1171This section describes the order of program execution.
1172First
1173.B ARGC
1174is set to the total number of command line arguments passed to
1175the execution phase of the program.
1176.B ARGV[0]
1177is set the name of the AWK interpreter and
1178\fBARGV[1]\fR ...
1179.B ARGV[ARGC-1]
1180holds the remaining command line arguments exclusive of
1181options and program source.
1182For example with
1183.nf
1184.sp
1185 mawk \-f prog v=1 A t=hello B
1186.sp
1187.fi
1188.B ARGC
1189= 5 with
1190.B ARGV[0]
1191= "mawk",
1192.B ARGV[1]
1193= "v=1",
1194.B ARGV[2]
1195= "A",
1196.B ARGV[3]
1197= "t=hello" and
1198.B ARGV[4]
1199= "B".
1200
1201Next, each
1202.B BEGIN
1203block is executed in order.
1204If the program consists
1205entirely of
1206.B BEGIN
1207blocks, then execution terminates, else
1208an input stream is opened and execution continues.
1209If
1210.B ARGC
1211equals 1,
1212the input stream is set to stdin,
1213else the command line arguments
1214.BR ARGV[1] " ...
1215.B ARGV[ARGC-1]
1216are examined for a file argument.
1217.PP
1218The command line arguments divide into three sets:
1219file arguments, assignment arguments and empty strings "".
1220An assignment has the form
1221\fIvar\fR=\fIstring\fR.
1222When an
1223.B ARGV[i]
1224is examined as a possible file argument,
1225if it is empty it is skipped;
1226if it is an assignment argument, the assignment to
1227.I var
1228takes place and
1229.B i
1230skips to the next argument;
1231else
1232.B ARGV[i]
1233is opened for input.
1234If it fails to open, execution terminates with exit code 1.
1235If no command line argument is a file argument, then input
1236comes from stdin.
1237Getline in a
1238.B BEGIN
1239action opens input. "\-" as a file argument denotes stdin.
1240.PP
1241Once an input stream is open, each input record is tested
1242against each
1243.IR pattern ,
1244and if it matches, the associated
1245.I action
1246is executed.
1247An expression pattern matches if it is boolean true (see
1248the end of section 2).
1249A
1250.B BEGIN
1251pattern matches before any input has been read, and
1252an
1253.B END
1254pattern matches after all input has been read.
1255A range pattern,
1256\fIexpr\fR1,\|\fIexpr\fR2 ,
1257matches every record between the match of
1258.IR expr 1
1259and the match
1260.IR expr 2
1261inclusively.
1262.PP
1263When end of file occurs on the input stream, the remaining
1264command line arguments are examined for a file argument, and
1265if there is one it is opened, else the
1266.B END
1267.I pattern
1268is considered matched
1269and all
1270.B END
1271.I actions
1272are executed.
1273.PP
1274In the example, the assignment
1275v=1
1276takes place after the
1277.B BEGIN
1278.I actions
1279are executed, and
1280the data placed in
1281v
1282is typed number and string.
1283Input is then read from file A.
1284On end of file A,
1285t
1286is set to the string "hello",
1287and B is opened for input.
1288On end of file B, the
1289.B END
1290.I actions
1291are executed.
.PP
Program flow at the
.I pattern
.I {action}
level can be changed with the
.nf
.sp
    \fBnext\fR     and
    \fBexit  \fIopt_expr\fR
.sp
.fi
statements.
A
.B next
statement
causes the next input record to be read and pattern testing
to restart with the first
.I "pattern {action}"
pair in the program.
An
.B exit
statement
causes immediate execution of the
.B END
actions or program termination if there are none or
if the
.B exit
occurs in an
.B END
action.
The
.I opt_expr
sets the exit value of the program unless overridden by
a later
.B exit
or subsequent error.
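.PP
A short sketch of both statements (the word "quit" and the exit
value 3 are arbitrary choices for illustration):
.nf
.sp
    /^#/  { next }              # skip comment lines entirely
    $1 == "quit"  { exit 3 }    # stop reading, go to END
    { n++ }

    END { print n+0, "records counted" }
.sp
.fi
If a record whose first field is "quit" is seen, the
.B END
action still runs and the program then terminates with exit
value 3, since nothing later overrides it.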

.SH EXAMPLES
.nf
1. emulate cat.

    { print }

2. emulate wc.

    { chars += length($0) + 1  # add one for the \\n
      words += NF
    }

    END{ print NR, words, chars }

3. count the number of unique "real words".

    BEGIN { FS = "[^A-Za-z]+" }

    { for(i = 1 ; i <= NF ; i++)  word[$i] = "" }

    END { delete word[""]
          for ( i in word )  cnt++
          print cnt
    }

.fi
4. sum the second field of
every record based on the first field.
.nf

    $1 ~ /credit\||\|gain/ { sum += $2 }
    $1 ~ /debit\||\|loss/  { sum \-= $2 }

    END { print sum }

5. sort a file, comparing as string

    { line[NR] = $0 "" }   # make sure of comparison type
                           # in case some lines look numeric

    END {  isort(line, NR)
      for(i = 1 ; i <= NR ; i++) print line[i]
    }

    #insertion sort of A[1..n]
    function isort( A, n,   i, j, hold)
    {
      for( i = 2 ; i <= n ; i++)
      {
        hold = A[j = i]
        while ( A[j\-1] > hold )
        { j\-\|\- ; A[j+1] = A[j] }
        A[j] = hold
      }
      # sentinel A[0] = "" will be created if needed
    }

.fi

.SH "COMPATIBILITY ISSUES"
The Posix 1003.2 (draft 11.2) definition of the AWK language
is AWK as described in the AWK book with a few extensions
that appeared in System V Release 4 nawk.  The extensions are:
.sp
.RS
New functions: toupper() and tolower().

New variables: ENVIRON[\|] and CONVFMT.

ANSI C conversion specifications for printf() and sprintf().

New command options: \-v var=value, multiple \-f options and
implementation options as arguments to \-W.
.RE
.sp
Posix AWK is oriented to operate on files a line at
a time.
.B RS
can be changed from "\\n" to another single character,
but it
is hard to find any use for this; there are no
examples in the AWK book.
By convention, \fBRS\fR = "" makes one or more blank lines
separate records, allowing multi-line records.  When
\fBRS\fR = "", "\\n" is always a field separator
regardless of the value in
.BR FS .
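.PP
For instance, the following sketch treats each paragraph
(a run of lines delimited by blank lines) as one record and
reports how many fields it holds:
.nf
.sp
    BEGIN { RS = "" }

    { print NR ": " NF " fields, first is " $1 }
.sp
.fi
With the default
.BR FS ,
input consisting of the lines "a b" and "c", a blank line, and
the line "d" should be reported as two records of 3 and 1 fields.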
.PP
.BR mawk ,
on the other hand,
allows
.B RS
to be a regular expression.
When "\\n" appears in records, it is treated as space, and
.B FS
always determines fields.
.PP
Removing the line at a time paradigm can make some programs
simpler and can
often improve performance.  For example,
redoing example 3 from above,
.nf
.sp
    BEGIN { RS = "[^A-Za-z]+" }

    { word[ $0 ] = "" }

    END { delete  word[ "" ]
      for( i in word )  cnt++
      print cnt
    }
.sp
.fi
counts the number of unique words by making each word a record.
On moderate size files,
.B mawk
executes this version twice as fast as example 3,
because of the simplified inner loop.
.PP
The following program replaces each comment by a single space in
a C program file,
.nf
.sp
    BEGIN {
      RS = "/\|\\*([^*]\||\|\\*+[^/*])*\\*+/"
            # comment is record separator
      ORS = " "
      getline  hold
    }

    { print hold ; hold = $0 }

    END { printf "%s" , hold }
.sp
.fi
Buffering one record is needed to avoid terminating the last
record with a space.
.PP
With
.BR mawk ,
the following are all equivalent,
.nf
.sp
    x ~ /a\\+b/    x ~ "a\\+b"     x ~ "a\\\\+b"
.sp
.fi
The strings get scanned twice, once as a string and once as a
regular expression.  On the string scan,
.B mawk
ignores the escape on non-escape characters, while the AWK
book advocates that
.I \ec
be recognized as
.IR c ,
which necessitates the double escaping of meta-characters in
strings.
Posix explicitly declines to define the behavior, which passively
forces programs that must run under a variety of awks to use
the more portable but less readable double escape.
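.PP
A portable sketch therefore writes a regular expression held in
a string with the doubled escape:
.nf
.sp
    $0 ~ "a\\\\+b"  { print "literal a+b found" }
.sp
.fi
After the string scan the regular expression scanner sees a\\+b
whichever way the awk treats unknown escapes, so the + is
matched literally.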
.PP
Posix AWK does not recognize "/dev/stderr" or \\x hex escape
sequences in strings.  Unlike ANSI C,
.B mawk
limits the number of digits that follow \\x to two.
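.PP
As a sketch of the first point, a diagnostic can be sent to
the standard error stream by redirecting output to that name:
.nf
.sp
    NF == 0  { print "empty record at line " NR > "/dev/stderr" }
    NF > 0   { print }
.sp
.fi
An awk that does not treat the name specially will simply try to
open a file of that name, so portable programs cannot rely on it.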
.PP
Finally, here is how
.B mawk
handles exceptional cases not discussed in the
AWK book or the Posix draft.  It is unsafe to assume
consistency across awks and safe to skip to
the next section.
.PP
.RS
substr(s, i, n) returns the characters of s in the intersection
of the closed interval [1, length(s)] and the half-open interval
[i, i+n).  When this intersection is empty, the empty string is
returned; so substr("ABC", 1, 0) = "" and
substr("ABC", \-4, 6) = "A".

Every string, including the empty string, matches the empty string
at the
front, so s ~ // and s ~ "" are always 1, as are match(s, //) and
match(s, "").  The last two set
.B RLENGTH
to 0.

index(s, t) is always the same as match(s, t1) where t1 is the
same as t with metacharacters escaped.  Hence consistency
with match requires that
index(s, "") always returns 1.
Also the condition, index(s,t) != 0 if and only if t is a substring
of s, requires index("","") = 1.

If getline encounters end of file, getline var leaves var
unchanged.  Similarly, on entry to the
.B END
actions,
.BR $0 ,
the fields and
.B NF
have their value unaltered from the last record.
.RE
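.PP
The string cases above can be probed with a short
.B BEGIN
action.  This is only a sketch, and, as noted, other awks may
print different values for all but the first line:
.nf
.sp
    BEGIN {
      print "[" substr("ABC", 1, 0) "]"      # empty intersection
      print "[" substr("ABC", \-4, 6) "]"     # [\-4, 2) meets [1, 3]
      print index("ABC", ""), match("ABC", ""), RLENGTH
    }
.sp
.fi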

.SH SEE ALSO
.IR egrep (1)
.PP
Aho, Kernighan and Weinberger,
.IR "The AWK Programming Language" ,
Addison-Wesley Publishing, 1988, (the AWK book),
defines the language, opening with a tutorial
and advancing to many interesting programs that delve into
issues of software design and analysis relevant to programming
in any language.
.PP
.IR "The GAWK Manual" ,
The Free Software Foundation, 1991, is a tutorial
and language reference
that does not attempt the depth of the AWK book
and assumes the reader may be a novice programmer.
The section on AWK arrays is excellent.  It also
discusses Posix requirements for AWK.

.SH BUGS
.B mawk
cannot handle ASCII NUL \\0 in the source or data files.  You
can output NUL using printf with %c, and any other 8 bit
character is acceptable input.

.B mawk
implements printf() and sprintf() using the C library functions,
printf and sprintf, so full ANSI compatibility requires an ANSI
C library.  In practice this means the h conversion qualifier may
not be available.  Also
.B mawk
inherits any bugs or limitations of the library functions.

Implementors of the AWK language have shown a consistent lack
of imagination when naming their programs.

.SH AUTHOR
Mike Brennan (brennan@boeing.com).