....TM "78-1271-12, 78-1273-6" 39199 39199-11
Awk \(em A Pattern Scanning and Processing Language
Awk is a programming language whose
basic operation
is to search a set of files
for patterns, and to perform specified actions upon lines or fields of lines which
contain instances of those patterns.
Awk
makes certain data selection and transformation operations easy to express;
for example, a one-line awk program
prints all input lines whose length exceeds 72 characters;
another
prints all lines with an even number of fields;
and a third
replaces the first field of each line by its logarithm.
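The three one-line programs described above can be sketched as follows. This is a reconstruction from the prose descriptions, not necessarily the report's exact text; the sample input and the shell variable names are hypothetical.

```shell
# Hypothetical input: one short line and one 80-character line.
long_lines=$(printf 'short\n%080d\n' 0 | awk 'length > 72')

# Print lines with an even number of fields.
even_fields=$(printf 'a b\nc d e\n' | awk 'NF % 2 == 0')

# Replace the first field by its natural logarithm, then print the line.
logs=$(printf '1 x\n' | awk '{ $1 = log($1); print }')

printf '%s\n' "$long_lines" "$even_fields" "$logs"
```

Each program is either a bare pattern (the matching line is printed) or a bare action (it runs for every line).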
Awk
patterns may include arbitrary boolean combinations of regular expressions
and of relational operators on strings, numbers, fields, variables, and array elements.
Actions may include the same pattern-matching constructions as in patterns,
as well as
arithmetic and string expressions and assignments,
flow-of-control statements,
and multiple output streams.
This report contains a user's guide, a discussion of the design and implementation of
awk,
and some timing statistics.
....It supersedes TM-77-1271-5, dated September 8, 1977.
Awk
is a programming language designed to make
information retrieval and text manipulation tasks
easy to state and to perform.
The basic operation of awk
is to scan a set of input lines in order,
searching for lines which match any of a set of patterns
which the user has specified.
For each pattern, an action can be specified;
this action will be performed on each line that matches the pattern.
Readers familiar with the
program grep will recognize
the approach, although in
awk
the patterns may be more general
and the actions allowed are more involved than merely
printing the matching line.
For example, one awk program
prints the third and second columns of a table
in that order;
another
prints all input lines with an A, B, or C in the second field.
The program
$1 != prev { print; prev = $1 }
prints all lines in which the first field is different
from the previous first field.
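A minimal sketch of the three introductory examples follows. The first two programs are reconstructed from their descriptions; the table contents and shell variable names are assumptions made for illustration.

```shell
# Hypothetical three-column table.
table='x A 1
x B 2
y Q 3'

# Third and second columns, in that order.
swapped=$(printf '%s\n' "$table" | awk '{ print $3, $2 }')

# Lines with an A, B, or C in the second field.
abc=$(printf '%s\n' "$table" | awk '$2 ~ /A|B|C/')

# Lines whose first field differs from that of the previous line.
changes=$(printf '%s\n' "$table" | awk '$1 != prev { print; prev = $1 }')

printf '%s\n' "$swapped" "$abc" "$changes"
```

The last program relies on prev starting out as the null string, so the first line is always printed.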
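The awk command
executes the program given as its first argument
on the set of named files,
or on the standard input if there are no files.
The statements can also be placed in a file
and executed by the command
awk with the -f option naming that file.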
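The two ways of invoking awk can be sketched as below. The file names data and pfile are hypothetical, and mktemp is used only to keep the example self-contained.

```shell
dir=$(mktemp -d)
printf 'a 1\nb 2\n' > "$dir/data"
printf '{ print $2 }\n' > "$dir/pfile"

# Program given as a string, input from a named file.
inline=$(awk '{ print $2 }' "$dir/data")

# Same program, input from the standard input.
from_stdin=$(awk '{ print $2 }' < "$dir/data")

# Program read from a file with -f.
from_file=$(awk -f "$dir/pfile" "$dir/data")

rm -rf "$dir"
printf '%s\n' "$inline" "$from_stdin" "$from_file"
```

Quoting the program in single quotes keeps the shell from interpreting $2 and the braces.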
An awk
program is a sequence of statements of the form:
pattern { action }
Each line of input is matched against
each of the patterns in turn.
For each pattern that matches, the associated action
is executed.
When all the patterns have been tested, the next line
is fetched and the matching starts over.
Either the pattern or the action may be left out,
but not both.
If there is no action for a pattern,
the matching line is simply
copied to the output.
(Thus a line which matches several patterns can be printed several times.)
If there is no pattern for an action,
then the action is performed for every input line.
A line which matches no pattern is ignored.
Since patterns and actions are both optional,
actions must be enclosed in braces
to distinguish them from patterns.
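The rules for omitted patterns and actions can be illustrated with a small sketch; the input lines and variable names here are hypothetical.

```shell
input='ab
c'

# Two patterns with no actions: a line matching both is printed twice,
# and a line matching neither is ignored.
twice=$(printf '%s\n' "$input" | awk '/a/
/b/')

# An action with no pattern runs for every input line.
every=$(printf '%s\n' "$input" | awk '{ print "line:", $0 }')

printf '%s\n' "$twice" "$every"
```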
Awk input is divided into
``records'' terminated by a record separator.
The default record separator is a newline,
so by default awk
processes its input a line at a time.
The number of the current record is available in a variable
named NR.
Each input record
is considered to be divided into ``fields.''
Fields are normally separated by
white space \(em blanks or tabs \(em
but the input field separator may be changed, as described below.
Fields are referred to as
$1, $2, and so forth,
where $1 is the first field, and $0
is the whole input record itself.
Fields may be assigned to.
The number of fields in the current record
is available in a variable named
NF.
The variables FS and RS
refer to the input field and record separators;
they may be changed at any time to any single character.
The optional command-line argument
-Fc may also be used to set FS to the character c.
If the record separator is empty,
an empty input line is taken as the record separator,
and blanks, tabs and newlines are treated as field separators.
The variable FILENAME
contains the name of the current input file.
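A short sketch of the record and field variables, under the assumption of a POSIX-style awk; the sample data is hypothetical.

```shell
# -F: sets the input field separator to a colon.
fields=$(printf 'a:b:c\n' | awk -F: '{ print NR, NF, $2 }')

# An empty RS makes blank lines separate records,
# and newlines then also separate fields.
paras=$(printf 'a b\nc\n\nd\n' | awk 'BEGIN { RS = "" } { print NR, NF, $1 }')

printf '%s\n' "$fields" "$paras"
```

In the second command the first record spans two input lines and so has three fields.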
An action may have no pattern,
in which case the action is executed for
all lines.
The simplest action is to print some or all of a record;
this is accomplished by the
awk command print.
The awk program
{ print }
prints each record, thus copying the input to the output intact.
More useful is to print a field or fields from each record.
For instance,
print $2, $1
prints the first two fields in reverse order.
Items separated by a comma in the print statement will be separated by the current output field separator
when output.
Items not separated by commas will be concatenated,
so
print $1 $2
runs the first and second fields together.
The predefined variables NF and NR can be used;
for example
{ print NR, NF, $0 }
prints each record preceded by the record number and the number of fields.
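The difference between comma-separated and adjacent items in print can be seen in a small sketch; the input line and variable names are hypothetical.

```shell
line='alpha beta'

# Comma: items are joined by the output field separator (a blank by default).
reversed=$(printf '%s\n' "$line" | awk '{ print $2, $1 }')

# No comma: items are concatenated.
joined=$(printf '%s\n' "$line" | awk '{ print $1 $2 }')

# Record number, field count, then the whole record.
numbered=$(printf '%s\n' "$line" | awk '{ print NR, NF, $0 }')

printf '%s\n' "$reversed" "$joined" "$numbered"
```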
Output may be diverted to multiple files;
the program
{ print $1 >"foo1"; print $2 >"foo2" }
writes the first field on the file foo1,
and the second field on file
foo2.
The >>
notation can also be used:
print $1 >>"foo"
appends the output to the file
foo.
The file name can be a variable or a field as well as a constant;
for example,
print $1 >$2
uses the contents of field 2 as a file name.
Naturally there is a limit on the number of output files.
Similarly, output can be piped into another process
by following print with | and a command in place of > and a file name.
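A sketch of output redirection follows. It writes into a temporary directory rather than the current one, and passes the directory name with the -v option of POSIX awk, which this report does not describe; both are assumptions made to keep the example self-contained.

```shell
dir=$(mktemp -d)

printf 'a 1\nb 2\n' | awk -v dir="$dir" '{
    print $1 > (dir "/foo1")     # first field goes to one file
    print $2 > (dir "/foo2")     # second field goes to another
    print $0 >> (dir "/log")     # >> appends instead of overwriting
}'

first=$(cat "$dir/foo1")
second=$(cat "$dir/foo2")

# Output piped into another process, here sort -r.
piped=$(printf 'a\nb\n' | awk '{ print | "sort -r" }')

rm -rf "$dir"
printf '%s\n' "$first" "$second" "$piped"
```

The parentheses around the concatenated file name avoid an ambiguity that some awk implementations resolve differently.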
The variables OFS and ORS
may be used to change the current
output field separator and output
record separator.
The output record separator is
appended to the output of the
print statement.
Awk also provides the printf
statement for output formatting:
printf format expr, expr, ...
formats the expressions in the list
according to the specification
in format and prints them.
For example,
printf "%8.2f %10ld\en", $1, $2
prints $1
as a floating point number 8 digits wide,
with two after the decimal point,
and $2
as a 10-digit long decimal number,
followed by a newline.
No output separators are produced automatically;
you must add them yourself,
as in this example.
The version of printf
is identical to that used with C
(Kernighan and Ritchie, The C Programming Language, Prentice-Hall, 1978).
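The output separators and printf can be sketched as follows. The example uses %d rather than the report's %ld, on the assumption that not every current awk accepts the l modifier; the input values are hypothetical.

```shell
# Fixed-width formatting: 8 columns with 2 decimals, then 10 columns.
formatted=$(printf '3.14159 42\n' | awk '{ printf "%8.2f %10d\n", $1, $2 }')

# OFS goes between comma-separated items, ORS after each print.
separated=$(printf 'a b\n' | awk 'BEGIN { OFS = "-"; ORS = ";\n" } { print $1, $2 }')

printf '[%s]\n' "$formatted" "$separated"
```

Note that printf adds no newline of its own, so the format string supplies one.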
A pattern in front of an action acts as a selector
that determines whether the action is to be executed.
A variety of expressions may be used as patterns:
regular expressions,
arithmetic relational expressions,
string-valued expressions,
and arbitrary boolean combinations of these.
The special pattern BEGIN
matches the beginning of the input,
before the first record is read.
The pattern END
matches the end of the input,
after the last record has been processed.
BEGIN and END
thus provide a way to gain control before and after processing,
for initialization and wrapup.
As an example, the field separator
can be set to a colon by
BEGIN { FS = ":" }
\&... rest of program ...
Or the input lines may be counted by
END { print NR }
If BEGIN
is present, it must be the first pattern;
END
must be the last if used.
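A minimal sketch combining both special patterns; the colon-separated input is hypothetical.

```shell
result=$(printf 'a:b\nc:d\n' | awk '
BEGIN { FS = ":" }      # runs once, before the first record is read
      { print $2 }      # runs for every record
END   { print NR }      # runs once, after the last record
')

printf '%s\n' "$result"
```

At END, NR still holds the number of the last record read, which is the line count.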
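The simplest regular expression is a literal string of characters
enclosed in slashes, like
/smith/
This is actually a complete awk program which
will print all lines which contain any occurrence
of the name ``smith''.
If a line contains ``smith''
as part of a larger word,
it will also be printed, as in
``blacksmith''.
Awk
regular expressions include the regular expression
forms found in the text editor ed and in grep
(without back-referencing).
In addition, awk allows
parentheses for grouping, | for alternatives,
+ for ``one or more'', and ? for ``zero or one''.
Character classes may be abbreviated:
[a-zA-Z0-9]
is the set of all letters and digits.
As an example, the awk program
/[Aa]ho|[Ww]einberger|[Kk]ernighan/
will print all lines which contain any of the names
``Aho,'' ``Weinberger'' or ``Kernighan,''
whether capitalized or not.
Regular expressions
(with the extensions listed above)
must be enclosed in slashes.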
Within a regular expression,
blanks and the regular expression
metacharacters are significant.
To turn off the magic meaning
of one of the regular expression characters,
precede it with a backslash.
An example is the pattern
/\e/.*\e//
which matches any string of characters
enclosed in slashes.
One can also specify that any field or variable
matches
a regular expression (or does not match it) with the operators
~ and !~.
The program
$1 ~ /[jJ]ohn/
prints all lines where the first field matches ``john'' or ``John.''
Notice that this will also match ``Johnson'', ``St. Johnsbury'', and so on.
To restrict it to exactly
[jJ]ohn, use
$1 ~ /^[jJ]ohn$/
The caret ^ refers to the beginning
of a line or field;
the dollar sign $ refers to the end.
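The regular expression patterns discussed above can be tried with a small sketch; the names in the input and the shell variable names are hypothetical.

```shell
names='Aho
weinberger
blacksmith'

# Alternation with character classes for optional capitalization.
authors=$(printf '%s\n' "$names" | awk '/[Aa]ho|[Ww]einberger|[Kk]ernighan/')

# A literal string also matches inside a larger word.
smiths=$(printf '%s\n' "$names" | awk '/smith/')

people='John 1
Johnson 2'

# Unanchored match on the first field: both lines qualify.
loose=$(printf '%s\n' "$people" | awk '$1 ~ /[jJ]ohn/ { print $2 }')

# Anchored with ^ and $: only the exact name qualifies.
exact=$(printf '%s\n' "$people" | awk '$1 ~ /^[jJ]ohn$/ { print $2 }')

printf '%s\n' "$authors" "$smiths" "$loose" "$exact"
```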
An awk
pattern can be a relational expression
involving the usual relational operators
<, <=, ==, !=, >=, and >.
An example is
$2 >= $1 + 100
which selects lines where the second field
is at least 100 greater than the first field.
Similarly,
NF % 2 == 0
prints lines with an even number of fields.
In relational tests, if neither operand is numeric,
a string comparison is made;
otherwise it is numeric.
Thus,
$1 >= "s"
selects lines that begin with an
s, t, u, etc.
In the absence of any other information,
fields are treated as strings, so
the program
$1 > $2
will perform a string comparison.
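A sketch of one numeric and one string comparison follows; the input values are hypothetical. It deliberately avoids comparing two fields with each other, since current awk implementations may compare numeric-looking fields as numbers, unlike the version described here.

```shell
# Numeric test: the right-hand side is an arithmetic expression.
numeric=$(printf '1 101\n1 50\n' | awk '$2 >= $1 + 100')

# String test: the constant "s" forces a string comparison.
stringy=$(printf 'apple\nsmith\ntom\n' | awk '$1 >= "s"')

printf '%s\n' "$numeric" "$stringy"
```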
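A pattern can be any boolean combination of patterns,
using the operators || (or), && (and), and ! (not).
For example,
$1 >= "s" && $1 < "t" && $1 != "smith"
selects lines where the first field begins with ``s'', but is not ``smith''.
The operators && and ||
guarantee that their operands
will be evaluated from left to right;
evaluation stops as soon as the truth or falsehood
is determined.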
The ``pattern'' that selects an action may also
consist of two patterns separated by a comma, as in
pat1, pat2 { ... }
In this case, the action is performed for each line between
an occurrence of pat1
and the next occurrence of
pat2 (inclusive).
For example,
NR == 100, NR == 200 { ... }
does the action for lines 100 through 200
of the input.
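Boolean combinations and range patterns can be sketched as below; the inputs are hypothetical, and the range uses smaller line numbers than the text so the output stays short.

```shell
# First field begins with "s" but is not "smith".
picked=$(printf 'sam\nsmith\ntom\n' |
    awk '$1 >= "s" && $1 < "t" && $1 != "smith"')

# Range pattern: lines 2 through 4 of six generated lines, inclusive.
range=$(awk 'BEGIN { for (i = 1; i <= 6; i++) print i }' |
    awk 'NR == 2, NR == 4')

printf '%s\n' "$picked" "$range"
```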
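An awk
action is a sequence of action statements
terminated by newlines or semicolons.
These action statements can be used to do a variety of
bookkeeping and string manipulating tasks.
Awk
provides a ``length'' function
to compute the length of a string of characters.
This program prints each record,
preceded by its length:
{ print length, $0 }
The name length
by itself is a ``pseudo-variable'' which
yields the length of the current record;
length(argument)
is a function which yields the length of its argument,
as in the equivalent
{ print length($0), $0 }
The argument may be any expression.
Awk also
provides the arithmetic functions
sqrt, log, exp, and int,
for square root, base e logarithm, exponential,
and integer part of their respective arguments.
The name of one of these built-in functions,
without argument or parentheses,
stands for the value of the function on the
whole record.
The program
length < 10 || length > 20
prints lines whose length
is less than 10 or greater
than 20.
The function substr(s, m, n)
produces the substring of
s that begins at position m (origin 1)
and is at most n characters long.
If n
is omitted, the substring goes to the end of
s.
The function index(s1, s2)
returns the position where the string
s2 occurs in s1, or zero if it does not.
The function sprintf(f, e1, e2, ...)
produces the value of the expressions
e1, e2, etc.,
in the printf format specified by f.
Thus, for example,
x = sprintf("%8.2f %10ld", $1, $2)
sets x
to the string produced by formatting
the values of $1 and $2.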
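The built-in functions can be exercised with a short sketch; the input string and the shell variable names are hypothetical, and the format uses %d rather than %ld as a portability assumption.

```shell
# length, substr (positions start at 1), and index.
lens=$(printf 'hello world\n' |
    awk '{ print length($0), substr($1, 2, 3), index($0, "wor") }')

# Arithmetic functions.
math=$(awk 'BEGIN { print sqrt(16), int(3.9), exp(0), log(1) }')

# sprintf returns the formatted string instead of printing it.
padded=$(awk 'BEGIN { x = sprintf("%5.1f|%3d", 3.14159, 7); print x }')

printf '%s\n' "$lens" "$math" "$padded"
```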
Variables, Expressions, and Assignments
Awk
variables take on numeric (floating point)
or string values according to context.
For example, in
x = 1
x
is clearly a number, while in
x = "smith"
it is clearly a string.
Strings are converted to numbers and
vice versa whenever context demands it.
Strings which cannot be interpreted
as numbers in a numerical context
will generally have numeric value zero,
but it is unwise to count on this behavior.
By default, variables (other than built-ins) are initialized to the null string,
which has numerical value zero;
this eliminates the need for most
BEGIN sections.
For example, the sums of the first two fields can be computed by
{ s1 += $1; s2 += $2 }
END { print s1, s2 }
Arithmetic is done internally in floating point.
The arithmetic operators are
+, -, *, /, and % (mod).
The C increment ++ and decrement --
operators are also available,
and so are the assignment operators
+=, -=, *=, /=, and %=.
These operators may all be used in expressions.
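A small sketch of implicit initialization and string-to-number conversion; the input numbers and variable names are hypothetical.

```shell
# s1 and s2 start as the null string, which counts as zero in arithmetic.
sums=$(printf '1 2\n3 4\n' | awk '{ s1 += $1; s2 += $2 } END { print s1, s2 }')

# Strings used in arithmetic are converted to numbers;
# n is incremented from its implicit initial value.
mixed=$(awk 'BEGIN { x = "3" + "4"; n++; print x, n, 7 % 3 }')

printf '%s\n' "$sums" "$mixed"
```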
Fields in awk
share essentially all of the properties of variables \(em
they may be used in arithmetic or string operations,
and may be assigned to.
Thus one can
replace the first field with a sequence number like this:
{ $1 = NR; print }
or
accumulate two fields into a third, like this:
{ $1 = $2 + $3; print $0 }
or assign a string to a field:
{ if ($3 > 1000) $3 = "too big"; print }
which replaces the third field by ``too big'' when it is,
and in any case prints the record.
Field references may be numerical expressions,
as in
{ print $i, $(i+1), $(i+n) }
Whether a field is deemed numeric or string depends on context;
in ambiguous cases,
fields are treated as strings.
Each input line is split into fields automatically as necessary.
It is also possible to split any variable or string
into fields:
n = split(s, array, sep)
splits the string s into array[1], ..., array[n].
The number of elements found is returned.
If the sep
argument is provided, it is used as the field separator;
otherwise FS
is used as the separator.
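Field assignment and split can be sketched as follows; the inputs, the array name p, and the shell variable names are hypothetical.

```shell
# Replace the first field with a sequence number.
seq_no=$(printf 'x y\nx y\n' | awk '{ $1 = NR; print }')

# Store the sum of the second and third fields in the first.
summed=$(printf '0 2 3\n' | awk '{ $1 = $2 + $3; print $0 }')

# Assign a string to a field when its value is too large.
capped=$(printf 'a b 2000\n' | awk '{ if ($3 > 1000) $3 = "too big"; print }')

# split returns the element count and fills the array from index 1.
parts=$(awk 'BEGIN { n = split("a:b:c", p, ":"); print n, p[1], p[3] }')

printf '%s\n' "$seq_no" "$summed" "$capped" "$parts"
```

Assigning to a field causes the record to be rebuilt with the output field separator between fields.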
Strings may be concatenated.
For example
length($1 $2 $3)
returns the length of the first three fields.
Or in a print statement,
print $1 " is " $2
prints
the two fields separated by `` is ''.
Variables and numeric expressions may also appear in concatenations.
Array elements are not declared;
they spring into existence by being mentioned.
Subscripts may have any non-null
value, including non-numeric strings.
As an example of a conventional numeric subscript,
the statement
x[NR] = $0
assigns the current input record to
the NR-th element of the array x.
In fact, it is possible in principle (though perhaps slow)
to process the entire input in a random order with the
awk program
{ x[NR] = $0 }
END { \fI... program ...\fP }
The first action merely records each input line in
the array x.
Array elements may be named by non-numeric values,
a capability rather like an associative memory.
Suppose the input contains fields with values like
``apple'', ``orange'', etc.
Then the program
/apple/ { x["apple"]++ }
/orange/ { x["orange"]++ }
END { print x["apple"], x["orange"] }
increments counts for the named array elements,
and prints them at the end of the input.
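Both uses of arrays can be sketched briefly. The counting program follows the text; the line-reversal program is a hypothetical illustration of processing stored input out of order.

```shell
# String subscripts: count lines mentioning each fruit.
counts=$(printf 'apple\norange\napple\n' | awk '
/apple/  { x["apple"]++ }
/orange/ { x["orange"]++ }
END      { print x["apple"], x["orange"] }
')

# Numeric subscripts: store every line, then print them last to first.
backwards=$(printf 'one\ntwo\nthree\n' | awk '
    { line[NR] = $0 }
END { for (i = NR; i >= 1; i--) print line[i] }
')

printf '%s\n' "$counts" "$backwards"
```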
Flow-of-Control Statements
Awk
provides the basic flow-of-control statements
if-else, while, for,
and statement grouping with braces, as in C.
We showed the if
statement in section 3.3 without describing it.
The condition in parentheses is evaluated;
if it is true, the statement following the
if is done.
The else part is optional.
The while
statement is exactly like that of C.
For example, to print all input fields one per line,
i = 1
while (i <= NF) {
	print $i
	++i
}
The for
statement is also exactly that of C:
for (i = 1; i <= NF; i++)
	print $i
does the same job as the while statement above.
There is an alternate form of the
for
statement which is suited for accessing the
elements of an associative array:
for (i in array)
	statement
does statement with i
set in turn to each element of
array.
The elements are accessed in an apparently random order.
The results are unpredictable if i
is altered, or if any new elements are
accessed during the loop.
The expression in the condition part of an
if, while or for
can include relational operators like
<, <=, >, >=, == and !=;
regular expression matches with the match operators
~ and !~;
the logical operators ||, && and !;
and of course parentheses for grouping.
The break
statement causes an immediate exit
from an enclosing while or for;
the continue statement
causes the next iteration to begin.
The statement next
causes awk to skip immediately to
the next record and begin scanning the patterns from the top.
The statement exit
causes the program to behave as if the end of the input
had occurred.
Comments may be placed in
awk programs:
they begin with the character #
and end with the end of the line,
as in
print x, y # this is a comment
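The control-flow statements can be sketched as follows; the inputs, the array name n, and the shell variable names are hypothetical. The output of the for-in loop is passed through sort because the traversal order is not defined.

```shell
# while and for loops over the fields of a record.
per_line=$(printf 'a b c\n' | awk '{ i = 1; while (i <= NF) { print $i; ++i } }')
same=$(printf 'a b c\n' | awk '{ for (i = 1; i <= NF; i++) print $i }')

# for-in visits every subscript of an array.
keys=$(awk 'BEGIN { n["x"] = 1; n["y"] = 2; for (k in n) print k, n[k] }' | sort)

# next skips to the following record; exit stops as if input had ended.
skipped=$(printf '1\n2\n3\n4\n5\n' | awk '
$1 == 2 { next }    # record 2 is never printed
$1 == 4 { exit }    # records 4 and 5 are never reached
        { print }
')

printf '%s\n' "$per_line" "$same" "$keys" "$skipped"
```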
The UNIX system
already provides several programs that
operate by passing input through a
selection mechanism.
Grep,
the first and simplest, merely prints all lines which
match a single specified pattern.
Egrep
provides more general patterns, i.e., regular expressions
in full generality;
fgrep
searches for a set of keywords with a particularly fast algorithm.
Sed
provides most of the editing facilities of
the editor ed,
applied to a stream of input.
None of these programs provides
convenient numeric processing.
Lex
(M. E. Lesk, ``Lex \(em A Lexical Analyzer Generator,''
Computing Science Technical Report, Bell Laboratories)
provides general regular expression recognition capabilities,
and, by serving as a C program generator,
is essentially open-ended in its capabilities.
The use of lex,
however, requires a knowledge of C programming,
and a lex
program must be compiled and loaded before use,
which discourages its use for one-shot applications.
Awk is an attempt
to fill in another part of the matrix of possibilities.
It
provides general regular expression capabilities
and an implicit input/output loop.
But it also provides convenient numeric processing,
variables,
and control flow in the actions.
It
does not require compilation or a knowledge of C.
Finally, awk provides
a convenient way to access fields within lines;
it is unique in this respect.
Awk
also tries to integrate strings and numbers
by treating all quantities as both string and numeric,
deciding which representation is appropriate
as late as possible.
In most cases the user can simply ignore the differences.
Most of the effort in developing
awk
went into deciding what it should or should not do
(for instance, it doesn't do string substitution)
and what the syntax should be
(no explicit operator for concatenation)
rather than on writing or debugging the code.
We have tried
to make the syntax powerful
but easy to use and well adapted
to scanning files.
For example,
the absence of declarations and the use of implicit initializations,
while probably a bad idea for a general-purpose programming language,
is desirable in a language
that is meant to be used for tiny programs
that may even be composed on the command line.
In practice, awk
usage seems to fall into two broad categories.
One is what might be called ``report generation'' \(em
processing an input to extract counts,
sums, sub-totals, and the like.
This also includes the writing of trivial
data validation programs,
such as verifying that a field contains only numeric information
or that certain delimiters are properly balanced.
The combination of textual and numeric processing is invaluable here.
A second area of use is as a data transformer,
converting data from the form produced by one program
into that expected by another.
The simplest examples merely select fields, perhaps with rearrangements.
The actual implementation of
awk
uses the language development tools available
on the UNIX operating system.
The grammar is specified with
yacc;
the lexical analysis is done by
lex;
the regular expression recognizers are
deterministic finite automata
constructed directly from the expressions.
An awk
program is translated into a
parse tree which is then directly executed
by a simple interpreter.
Awk
was designed for ease of use rather than processing speed;
the delayed evaluation of variable types
and the necessity to break input
into fields make high speed difficult to achieve in any case.
Nonetheless,
the program has not proven to be unworkably slow.
Table I below shows the execution (user + system) time
of several programs
on the following simple tasks:
1. count the number of lines.
2. print all lines containing ``doug''.
3. print all lines containing ``doug'', ``ken'' or ``dmr''.
4. print the third field of each line.
5. print the third and second fields of each line, in that order.
6. append all lines containing ``doug'', ``ken'', and ``dmr''
to files ``jdoug'', ``jken'', and ``jdmr'', respectively.
7. print each line prefixed by ``line-number\ :\ ''.
8. sum the fourth column of a table.
The program wc
merely counts words, lines and characters in its input;
we have already mentioned the others.
In all cases the input was a file containing
lines of the form
-rw-rw-rw- 1 ava 123 Oct 15 17:05 xxx
The total length of this input is
The times for lex
do not include compile or load.
As might be expected, awk
is not as fast as the specialized tools
such as sed and the programs in the grep family,
but
is faster than the more general tool
lex.
In all cases, the tasks were
about as easy to express as
awk programs
as programs in these other languages;
tasks involving fields were
considerably easier to express as
awk programs.
Some of the test programs are shown in
awk, sed and lex.
                               Task
Program          1      2      3      4      5      6      7      8
\fIegrep\fR      6.2   11.5   11.6
\fIfgrep\fR      7.7   13.8   16.1
\fIsed\fR       10.2   11.6   15.8   29.0   30.5   16.1
\fIlex\fR       65.1  150.1  144.2   67.7   70.3  104.0   81.7   92.8
\fIawk\fR       15.0   25.6   29.9   33.3   38.9   46.4   71.4   31.1
\fBTable I.\fR Execution Times of Programs. (Times are in sec.)
The programs for some of these jobs are shown below.
The lex
programs are generally too long to show.
sed:
4. /[^ ]* [ ]*[^ ]* [ ]*\e([^ ]*\e) .*/s//\e1/p
5. /[^ ]* [ ]*\e([^ ]*\e) [ ]*\e([^ ]*\e) .*/s//\e2 \e1/p
lex:
^.*doug.*$ printf("%s\en", yytext);