Research V7 development

author Stephen C. Johnson <scj@research.uucp>

Wed, 10 Jan 1979 20:23:00 +0000 (15:23 -0500)

committer Stephen C. Johnson <scj@research.uucp>

Wed, 10 Jan 1979 20:23:00 +0000 (15:23 -0500)
author Stephen C. Johnson <scj@research.uucp>
Wed, 10 Jan 1979 20:23:00 +0000 (15:23 -0500)
committer Stephen C. Johnson <scj@research.uucp>
Wed, 10 Jan 1979 20:23:00 +0000 (15:23 -0500)
diff --git a/usr/doc/lint b/usr/doc/lint

new file mode 100644 (file)

index 0000000..d1f5819
--- /dev/null
+++ b/usr/doc/lint
@@ -0,0 +1,1063 @@
+.RP
+.ND "July 26, 1978"
+.OK
+Program Portability
+Strong Type Checking
+.TL
+Lint, a C Program Checker
+.AU "MH 2C-559" 3968
+S. C. Johnson
+.AI
+.MH
+.AB
+.PP
+.I Lint
+is a command which examines C source programs,
+detecting
+a number of bugs and obscurities.
+It enforces the type rules of C more strictly than
+the C compilers.
+It may also be used to enforce a number of portability
+restrictions involved in moving
+programs between different machines and/or operating systems.
+Another option detects a number of wasteful, or error prone, constructions
+which nevertheless are, strictly speaking, legal.
+.PP
+.I Lint
+accepts multiple input files and library specifications, and checks them for consistency.
+.PP
+The separation of function between
+.I lint
+and the C compilers has both historical and practical
+rationale.
+The compilers turn C programs into executable files rapidly
+and efficiently.
+This is possible in part because the
+compilers do not do sophisticated
+type checking, especially between
+separately compiled programs.
+.I Lint
+takes a more global, leisurely view of the program,
+looking much more carefully at the compatibilities.
+.PP
+This document discusses the use of
+.I lint ,
+gives an overview of the implementation, and gives some hints on the
+writing of machine independent C code.
+.AE
+.CS 10 2 12 0 0 5
+.SH
+Introduction and Usage
+.PP
+Suppose there are two C
+.[
+Kernighan Ritchie Programming Prentice 1978
+.]
+source files,
+.I file1. c
+and
+.I file2.c  ,
+which are ordinarily compiled and loaded together.
+Then the command
+.DS
+lint  file1.c  file2.c
+.DE
+produces messages describing inconsistencies and inefficiencies
+in the programs.
+The program enforces the typing rules of C
+more strictly than the C compilers
+(for both historical and practical reasons)
+enforce them.
+The command
+.DS
+lint  \-p  file1.c  file2.c
+.DE
+will produce, in addition to the above messages, additional messages
+which relate to the portability of the programs to other operating
+systems and machines.
+Replacing the
+.B \-p
+by
+.B \-h
+will produce messages about various error-prone or wasteful constructions
+which, strictly speaking, are not bugs.
+Saying
+.B \-hp
+gets the whole works.
+.PP
+The next several sections describe the major messages;
+the document closes with sections
+discussing the implementation and giving suggestions
+for writing portable C.
+An appendix gives a summary of the
+.I lint
+options.
+.SH
+A Word About Philosophy
+.PP
+Many of the facts which
+.I lint
+needs may be impossible to
+discover.
+For example, whether a given function in a program ever gets called
+may depend on the input data.
+Deciding whether
+.I exit
+is ever called is equivalent to solving the famous ``halting problem,'' known to be
+recursively undecidable.
+.PP
+Thus, most of the
+.I lint
+algorithms are a compromise.
+If a function is never mentioned, it can never be called.
+If a function is mentioned,
+.I lint
+assumes it can be called; this is not necessarily so, but in practice is quite reasonable.
+.PP
+.I Lint
+tries to give information with a high degree of relevance.
+Messages of the form ``\fIxxx\fR might be a bug''
+are easy to generate, but are acceptable only in proportion
+to the fraction of real bugs they uncover.
+If this fraction of real bugs is too small, the messages lose their credibility
+and serve merely to clutter up the output,
+obscuring the more important messages.
+.PP
+Keeping these issues in mind, we now consider in more detail
+the classes of messages which
+.I lint
+produces.
+.SH
+Unused Variables and Functions
+.PP
+As sets of programs evolve and develop,
+previously used variables and arguments to
+functions may become unused;
+it is not uncommon for external variables, or even entire
+functions, to become unnecessary, and yet
+not be removed from the source.
+These ``errors of commission'' rarely cause working programs to fail, but they are a source
+of inefficiency, and make programs harder to understand
+and change.
+Moreover, information about such unused variables and functions can occasionally
+serve to discover bugs; if a function does a necessary job, and
+is never called, something is wrong!
+.PP
+.I Lint
+complains about variables and functions which are defined but not otherwise
+mentioned.
+An exception is variables which are declared through explicit
+.B extern
+statements but are never referenced; thus the statement
+.DS
+extern  float  sin(\|);
+.DE
+will evoke no comment if
+.I sin
+is never used.
+Note that this agrees with the semantics of the C compiler.
+In some cases, these unused external declarations might be of some interest; they
+can be discovered by adding the
+.B \-x
+flag to the
+.I lint
+invocation.
+.PP
+Certain styles of programming
+require many functions to be written with similar interfaces;
+frequently, some of the arguments may be unused
+in many of the calls.
+The
+.B \-v
+option is available to suppress the printing of
+complaints about unused arguments.
+When
+.B \-v
+is in effect, no messages are produced about unused
+arguments except for those
+arguments which are unused and also declared as
+register arguments; this can be considered
+an active (and preventable) waste of the register
+resources of the machine.
+.PP
+There is one case where information about unused, or
+undefined, variables is more distracting
+than helpful.
+This is when
+.I lint
+is applied to some, but not all, files out of a collection
+which are to be loaded together.
+In this case, many of the functions and variables defined
+may not be used, and, conversely,
+many functions and variables defined elsewhere may be used.
+The
+.B \-u
+flag may be used to suppress the spurious messages which might otherwise appear.
+.SH
+Set/Used Information
+.PP
+.I Lint
+attempts to detect cases where a variable is used before it is set.
+This is very difficult to do well;
+many algorithms take a good deal of time and space,
+and still produce messages about perfectly valid programs.
+.I Lint
+detects local variables (automatic and register storage classes)
+whose first use appears physically earlier in the input file than the first assignment to the variable.
+It assumes that taking the address of a variable constitutes a ``use,'' since the actual use
+may occur at any later time, in a data dependent fashion.
+.PP
+The restriction to the physical appearance of variables in the file makes the
+algorithm very simple and quick to implement,
+since the true flow of control need not be discovered.
+It does mean that
+.I lint
+can complain about some programs which are legal,
+but these programs would probably be considered bad on stylistic grounds (e.g. might
+contain at least two \fBgoto\fR's).
+Because static and external variables are initialized to 0,
+no meaningful information can be discovered about their uses.
+The algorithm deals correctly, however, with initialized automatic variables, and variables
+which are used in the expression which first sets them.
+.PP
+The set/used information also permits recognition of those local variables which are set
+and never used; these form a frequent source of inefficiencies, and may also be symptomatic of bugs.
+.SH
+Flow of Control
+.PP
+.I Lint
+attempts to detect unreachable portions of the programs which it processes.
+It will complain about unlabeled statements immediately following
+\fBgoto\fR, \fBbreak\fR, \fBcontinue\fR, or \fBreturn\fR statements.
+An attempt is made to detect loops which can never be left at the bottom, detecting the
+special cases
+\fBwhile\fR( 1 ) and \fBfor\fR(;;) as infinite loops.
+.I Lint
+also complains about loops which cannot be entered at the top;
+some valid programs may have such loops, but at best they are bad style,
+at worst bugs.
+.PP
+.I Lint
+has an important area of blindness in the flow of control algorithm:
+it has no way of detecting functions which are called and never return.
+Thus, a call to
+.I exit
+may cause unreachable code which
+.I lint
+does not detect; the most serious effects of this are in the
+determination of returned function values (see the next section).
+.PP
+One form of unreachable statement is not usually complained about by
+.I lint;
+a
+.B break
+statement that cannot be reached causes no message.
+Programs generated by
+.I yacc ,
+.[
+Johnson Yacc 1975
+.]
+and especially
+.I lex ,
+.[
+Lesk Lex
+.]
+may have literally hundreds of unreachable
+.B break
+statements.
+The
+.B \-O
+flag in the C compiler will often eliminate the resulting object code inefficiency.
+Thus, these unreached statements are of little importance,
+there is typically nothing the user can do about them, and the
+resulting messages would clutter up the
+.I lint
+output.
+If these messages are desired,
+.I lint
+can be invoked with the
+.B \-b
+option.
+.SH
+Function Values
+.PP
+Sometimes functions return values which are never used;
+sometimes programs incorrectly use function ``values''
+which have never been returned.
+.I Lint
+addresses this problem in a number of ways.
+.PP
+Locally, within a function definition,
+the appearance of both
+.DS
+return(  \fIexpr\fR  );
+.DE
+and
+.DS
+return ;
+.DE
+statements is cause for alarm;
+.I lint
+will give the message
+.DS
+function \fIname\fR contains return(e) and return
+.DE
+The most serious difficulty with this is detecting when a function return is implied
+by flow of control reaching the end of the function.
+This can be seen with a simple example:
+.DS
+.ta .5i 1i 1.5i
+\fRf ( a ) {
+       if ( a ) return ( 3 );
+       g (\|);
+       }
+.DE
+Notice that, if \fIa\fR tests false, \fIf\fR will call \fIg\fR and then return
+with no defined return value; this will trigger a complaint from
+.I lint .
+If \fIg\fR, like \fIexit\fR, never returns,
+the message will still be produced when in fact nothing is wrong.
+.PP
+In practice, some potentially serious bugs have been discovered by this feature;
+it also accounts for a substantial fraction of the ``noise'' messages produced
+by
+.I lint .
+.PP
+On a global scale,
+.I lint
+detects cases where a function returns a value, but this value is sometimes,
+or always, unused.
+When the value is always unused, it may constitute an inefficiency in the function definition.
+When the value is sometimes unused, it may represent bad style (e.g., not testing for
+error conditions).
+.PP
+The dual problem, using a function value when the function does not return one,
+is also detected.
+This is a serious problem.
+Amazingly, this bug has been observed on a couple of occasions
+in ``working'' programs; the desired function value just happened to have been computed
+in the function return register!
+.SH
+Type Checking
+.PP
+.I Lint
+enforces the type checking rules of C more strictly than the compilers do.
+The additional checking is in four major areas:
+across certain binary operators and implied assignments,
+at the structure selection operators,
+between the definition and uses of functions,
+and in the use of enumerations.
+.PP
+There are a number of operators which have an implied balancing between types of the operands.
+The assignment, conditional ( ?\|: ), and relational operators
+have this property; the argument
+of a \fBreturn\fR statement,
+and expressions used in initialization also suffer similar conversions.
+In these operations,
+\fBchar\fR, \fBshort\fR, \fBint\fR, \fBlong\fR, \fBunsigned\fR, \fBfloat\fR, and \fBdouble\fR types may be freely intermixed.
+The types of pointers must agree exactly,
+except that arrays of \fIx\fR's can, of course, be intermixed with pointers to \fIx\fR's.
+.PP
+The type checking rules also require that, in structure references, the
+left operand of the \(em> be a pointer to structure, the left operand of the \fB.\fR
+be a structure, and the right operand of these operators be a member
+of the structure implied by the left operand.
+Similar checking is done for references to unions.
+.PP
+Strict rules apply to function argument and return value
+matching.
+The types \fBfloat\fR and \fBdouble\fR may be freely matched,
+as may the types \fBchar\fR, \fBshort\fR, \fBint\fR, and \fBunsigned\fR.
+Also, pointers can be matched with the associated arrays.
+Aside from this, all actual arguments must agree in type with their declared counterparts.
+.PP
+With enumerations, checks are made that enumeration variables or members are not mixed
+with other types, or other enumerations,
+and that the only operations applied are =, initialization, ==, !=, and function arguments and return values.
+.SH
+Type Casts
+.PP
+The type cast feature in C was introduced largely as an aid
+to producing more portable programs.
+Consider the assignment
+.DS
+p = 1 ;
+.DE
+where
+.I p
+is a character pointer.
+.I Lint
+will quite rightly complain.
+Now, consider the assignment
+.DS
+p = (char \(**)1 ;
+.DE
+in which a cast has been used to
+convert the integer to a character pointer.
+The programmer obviously had a strong motivation
+for doing this, and has clearly signaled his intentions.
+It seems harsh for
+.I lint
+to continue to complain about this.
+On the other hand, if this code is moved to another
+machine, such code should be looked at carefully.
+The
+.B \-c
+flag controls the printing of comments about casts.
+When
+.B \-c
+is in effect, casts are treated as though they were assignments
+subject to complaint; otherwise, all legal casts are passed without comment,
+no matter how strange the type mixing seems to be.
+.SH
+Nonportable Character Use
+.PP
+On the PDP-11, characters are signed quantities, with a range
+from \-128 to 127.
+On most of the other C implementations, characters take on only positive
+values.
+Thus,
+.I lint
+will flag certain comparisons and assignments as being
+illegal or nonportable.
+For example, the fragment
+.DS
+char c;
+       ...
+if( (c = getchar(\|)) < 0 ) ....
+.DE
+works on the PDP-11, but
+will fail on machines where characters always take
+on positive values.
+The real solution is to declare
+.I c
+an integer, since
+.I getchar
+is actually returning
+integer values.
+In any case,
+.I lint
+will say
+``nonportable character comparison''.
+.PP
+A similar issue arises with bitfields; when assignments
+of constant values are made to bitfields, the field may
+be too small to hold the value.
+This is especially true because
+on some machines bitfields are considered as signed
+quantities.
+While it may seem unintuitive to consider
+that a two bit field declared of type
+.B int
+cannot hold the value 3, the problem disappears
+if the bitfield is declared to have type
+.B unsigned .
+.SH
+Assignments of longs to ints
+.PP
+Bugs may arise from the assignment of
+.B long
+to
+an
+.B int ,
+which loses accuracy.
+This may happen in programs
+which have been incompletely converted to use
+.B typedefs .
+When a
+.B typedef
+variable
+is changed from \fBint\fR to \fBlong\fR,
+the program can stop working because
+some intermediate results may be assigned
+to \fBints\fR, losing accuracy.
+Since there are a number of legitimate reasons for
+assigning \fBlongs\fR to \fBints\fR, the detection
+of these assignments is enabled
+by the
+.B \-a
+flag.
+.SH
+Strange Constructions
+.PP
+Several perfectly legal, but somewhat strange, constructions
+are flagged by
+.I lint;
+the messages hopefully encourage better code quality, clearer style, and
+may even point out bugs.
+The
+.B \-h
+flag is used to enable these checks.
+For example, in the statement
+.DS
+\(**p++ ;
+.DE
+the \(** does nothing; this provokes the message ``null effect'' from
+.I lint .
+The program fragment
+.DS
+unsigned x ;
+if( x < 0 ) ...
+.DE
+is clearly somewhat strange; the
+test will never succeed.
+Similarly, the test
+.DS
+if( x > 0 ) ...
+.DE
+is equivalent to
+.DS
+if( x != 0 )
+.DE
+which may not be the intended action.
+.I Lint
+will say ``degenerate unsigned comparison'' in these cases.
+If one says
+.DS
+if( 1 != 0 ) ....
+.DE
+.I lint
+will report
+``constant in conditional context'', since the comparison
+of 1 with 0 gives a constant result.
+.PP
+Another construction
+detected by
+.I lint
+involves
+operator precedence.
+Bugs which arise from misunderstandings about the precedence
+of operators can be accentuated by spacing and formatting,
+making such bugs extremely hard to find.
+For example, the statements
+.DS
+if( x&077 == 0 ) ...
+.DE
+or
+.DS
+x<\h'-.3m'<2 + 40
+.DE
+probably do not do what was intended.
+The best solution is to parenthesize such expressions,
+and
+.I lint
+encourages this by an appropriate message.
+.PP
+Finally, when the
+.B \-h
+flag is in force
+.I lint
+complains about variables which are redeclared in inner blocks
+in a way that conflicts with their use in outer blocks.
+This is legal, but is considered by many (including the author) to
+be bad style, usually unnecessary, and frequently a bug.
+.SH
+Ancient History
+.PP
+There are several forms of older syntax which are being officially
+discouraged.
+These fall into two classes, assignment operators and initialization.
+.PP
+The older forms of assignment operators (e.g., =+, =\-, . . . )
+could cause ambiguous expressions, such as
+.DS
+a  =\-1 ;
+.DE
+which could be taken as either
+.DS
+a =\-  1 ;
+.DE
+or
+.DS
+a  =  \-1 ;
+.DE
+The situation is especially perplexing if this
+kind of ambiguity arises as the result of a macro substitution.
+The newer, and preferred operators (+=, \-=, etc. )
+have no such ambiguities.
+To spur the abandonment of the older forms,
+.I lint
+complains about these old fashioned operators.
+.PP
+A similar issue arises with initialization.
+The older language allowed
+.DS
+int  x  \fR1 ;
+.DE
+to initialize
+.I x
+to 1.
+This also caused syntactic difficulties: for example,
+.DS
+int  x  ( \-1 ) ;
+.DE
+looks somewhat like the beginning of a function declaration:
+.DS
+int  x  ( y ) {  . . .
+.DE
+and the compiler must read a fair ways past
+.I x
+in order to sure what the declaration really is..
+Again, the problem is even more perplexing when the
+initializer involves a macro.
+The current syntax places an equals sign between the
+variable and the initializer:
+.DS
+int  x  =  \-1 ;
+.DE
+This is free of any possible syntactic ambiguity.
+.SH
+Pointer Alignment
+.PP
+Certain pointer assignments may be reasonable on some machines,
+and illegal on others, due entirely to
+alignment restrictions.
+For example, on the PDP-11, it is reasonable
+to assign integer pointers to double pointers, since
+double precision values may begin on any integer boundary.
+On the Honeywell 6000, double precision values must begin
+on even word boundaries;
+thus, not all such assignments make sense.
+.I Lint
+tries to detect cases where pointers are assigned to other
+pointers, and such alignment problems might arise.
+The message ``possible pointer alignment problem''
+results from this situation whenever either the
+.B \-p
+or
+.B \-h
+flags are in effect.
+.SH
+Multiple Uses and Side Effects
+.PP
+In complicated expressions, the best order in which to evaluate
+subexpressions may be highly machine dependent.
+For example, on machines (like the PDP-11) in which the stack
+runs backwards, function arguments will probably be best evaluated
+from right-to-left; on machines with a stack running forward,
+left-to-right seems most attractive.
+Function calls embedded as arguments of other functions
+may or may not be treated similarly to ordinary arguments.
+Similar issues arise with other operators which have side effects,
+such as the assignment operators and the increment and decrement operators.
+.PP
+In order that the efficiency of C on a particular machine not be
+unduly compromised, the C language leaves the order
+of evaluation of complicated expressions up to the
+local compiler, and, in fact, the various C compilers have considerable
+differences in the order in which they will evaluate complicated
+expressions.
+In particular, if any variable is changed by a side effect, and
+also used elsewhere in the same expression, the result is explicitly undefined.
+.PP
+.I Lint
+checks for the important special case where
+a simple scalar variable is affected.
+For example, the statement
+.DS
+\fIa\fR[\fIi\|\fR] = \fIb\fR[\fIi\fR++] ;
+.DE
+will draw the complaint:
+.DS
+warning: \fIi\fR evaluation order undefined
+.DE
+.SH
+Implementation
+.PP
+.I Lint
+consists of two programs and a driver.
+The first program is a version of the
+Portable C Compiler
+.[
+Johnson Ritchie BSTJ Portability Programs System
+.]
+.[
+Johnson portable compiler  1978
+.]
+which is the basis of the
+IBM 370, Honeywell 6000, and Interdata 8/32 C compilers.
+This compiler does lexical and syntax analysis on the input text,
+constructs and maintains symbol tables, and builds trees for expressions.
+Instead of writing an intermediate file which is passed to
+a code generator, as the other compilers
+do,
+.I lint
+produces an intermediate file which consists of lines of ascii text.
+Each line contains an external variable name,
+an encoding of the context in which it was seen (use, definition, declaration, etc.),
+a type specifier, and a source file name and line number.
+The information about variables local to a function or file
+is collected
+by accessing the symbol table, and examining the expression trees.
+.PP
+Comments about local problems are produced as detected.
+The information about external names is collected
+onto an intermediate file.
+After all the source files and library descriptions have
+been collected, the intermediate file is sorted
+to bring all information collected about a given external
+name together.
+The second, rather small, program then reads the lines
+from the intermediate file and compares all of the
+definitions, declarations, and uses for consistency.
+.PP
+The driver controls this
+process, and is also responsible for making the options available
+to both passes of
+.I lint .
+.SH
+Portability
+.PP
+C on the Honeywell and IBM systems is used, in part, to write system code for the host operating system.
+This means that the implementation of C tends to follow local conventions rather than
+adhere strictly to
+.UX
+system conventions.
+Despite these differences, many C programs have been successfully moved to GCOS and the various IBM
+installations with little effort.
+This section describes some of the differences between the implementations, and
+discusses the
+.I lint
+features which encourage portability.
+.PP
+Uninitialized external variables are treated differently in different
+implementations of C.
+Suppose two files both contain a declaration without initialization, such as
+.DS
+int a ;
+.DE
+outside of any function.
+The
+.UX
+loader will resolve these declarations, and cause only a single word of storage
+to be set aside for \fIa\fR.
+Under the GCOS and IBM implementations, this is not feasible (for various stupid reasons!)
+so each such declaration causes a word of storage to be set aside and called \fIa\fR.
+When loading or library editing takes place, this causes fatal conflicts which prevent
+the proper operation of the program.
+If
+.I lint
+is invoked with the \fB\-p\fR flag,
+it will detect such multiple definitions.
+.PP
+A related difficulty comes from the amount of information retained about external names during the
+loading process.
+On the
+.UX
+system, externally known names have seven significant characters, with the upper/lower
+case distinction kept.
+On the IBM systems, there are eight significant characters, but the case distinction
+is lost.
+On GCOS, there are only six characters, of a single case.
+This leads to situations where programs run on the
+.UX
+system, but encounter loader
+problems on the IBM or GCOS systems.
+.I Lint
+.B \-p
+causes all external symbols to be mapped to one case and truncated to six characters,
+providing a worst-case analysis.
+.PP
+A number of differences arise in the area of character handling: characters in the
+.UX
+system are eight bit ascii, while they are eight bit ebcdic on the IBM, and
+nine bit ascii on GCOS.
+Moreover, character strings go from high to low bit positions (``left to right'')
+on GCOS and IBM, and low to high (``right to left'') on the PDP-11.
+This means that code attempting to construct strings
+out of character constants, or attempting to use characters as indices
+into arrays, must be looked at with great suspicion.
+.I Lint
+is of little help here, except to flag multi-character character constants.
+.PP
+Of course, the word sizes are different!
+This causes less trouble than might be expected, at least when
+moving from the
+.UX
+system (16 bit words) to the IBM (32 bits) or GCOS (36 bits).
+The main problems are likely to arise in shifting or masking.
+C now supports a bit-field facility, which can be used to write much of
+this code in a reasonably portable way.
+Frequently, portability of such code can be enhanced by
+slight rearrangements in coding style.
+Many of the incompatibilities seem to have the flavor of writing
+.DS
+x &= 0177700 ;
+.DE
+to clear the low order six bits of \fIx\fR.
+This suffices on the PDP-11, but fails badly on GCOS and IBM.
+If the bit field feature cannot be used, the same effect can be obtained by
+writing
+.DS
+x &= \(ap 077 ;
+.DE
+which will work on all these machines.
+.PP
+The right shift operator is arithmetic shift on the PDP-11, and logical shift on most
+other machines.
+To obtain a logical shift on all machines, the left operand can be
+typed \fBunsigned\fR.
+Characters are considered signed integers on the PDP-11, and unsigned on the other machines.
+This persistence of the sign bit may be reasonably considered a bug in the PDP-11 hardware
+which has infiltrated itself into the C language.
+If there were a good way to discover the programs which would be affected, C could be changed;
+in any case,
+.I lint
+is no help here.
+.PP
+The above discussion may have made the problem of portability seem
+bigger than it in fact is.
+The issues involved here are rarely subtle or mysterious, at least to the
+implementor of the program, although they can involve some work to straighten out.
+The most serious bar to the portability of
+.UX
+system utilities has been the inability to mimic
+essential
+.UX
+system functions on the other systems.
+The inability to seek to a random character position in a text file, or to establish a pipe
+between processes, has involved far more rewriting
+and debugging than any of the differences in C compilers.
+On the other hand,
+.I lint
+has been very helpful
+in moving the
+.UX
+operating system and associated
+utility programs to other machines.
+.SH
+Shutting Lint Up
+.PP
+There are occasions when
+the programmer is smarter than
+.I lint .
+There may be valid reasons for ``illegal'' type casts,
+functions with a variable number of arguments, etc.
+Moreover, as specified above, the flow of control information
+produced by
+.I lint
+often has blind spots, causing occasional spurious
+messages about perfectly reasonable programs.
+Thus, some way of communicating with
+.I lint ,
+typically to shut it up, is desirable.
+.PP
+The form which this mechanism should take is not at all clear.
+New keywords would require current and old compilers to
+recognize these keywords, if only to ignore them.
+This has both philosophical and practical problems.
+New preprocessor syntax suffers from similar problems.
+.PP
+What was finally done was to cause a number of words
+to be recognized by
+.I lint
+when they were embedded in comments.
+This required minimal preprocessor changes;
+the preprocessor just had to agree to pass comments
+through to its output, instead of deleting them
+as had been previously done.
+Thus,
+.I lint
+directives are invisible to the compilers, and
+the effect on systems with the older preprocessors
+is merely that the
+.I lint
+directives don't work.
+.PP
+The first directive is concerned with flow of control information;
+if a particular place in the program cannot be reached,
+but this is not apparent to
+.I lint ,
+this can be asserted by the directive
+.DS
+/* NOTREACHED */
+.DE
+at the appropriate spot in the program.
+Similarly, if it is desired to turn off
+strict type checking for
+the next expression, the directive
+.DS
+/* NOSTRICT */
+.DE
+can be used; the situation reverts to the
+previous default after the next expression.
+The
+.B \-v
+flag can be turned on for one function by the directive
+.DS
+/* ARGSUSED */
+.DE
+Complaints about variable number of arguments in calls to a function
+can be turned off by the directive
+.DS
+/* VARARGS */
+.DE
+preceding the function definition.
+In some cases, it is desirable to check the
+first several arguments, and leave the later arguments unchecked.
+This can be done by following the VARARGS keyword immediately
+with a digit giving the number of arguments which should be checked; thus,
+.DS
+/* VARARGS2 */
+.DE
+will cause the first two arguments to be checked, the others unchecked.
+Finally, the directive
+.DS
+/* LINTLIBRARY */
+.DE
+at the head of a file identifies this file as
+a library declaration file; this topic is worth a
+section by itself.
+.SH
+Library Declaration Files
+.PP
+.I Lint
+accepts certain library directives, such as
+.DS
+\-ly
+.DE
+and tests the source files for compatibility with these libraries.
+This is done by accessing library description files whose
+names are constructed from the library directives.
+These files all begin with the directive
+.DS
+/* LINTLIBRARY */
+.DE
+which is followed by a series of dummy function
+definitions.
+The critical parts of these definitions
+are the declaration of the function return type,
+whether the dummy function returns a value, and
+the number and types of arguments to the function.
+The VARARGS and ARGSUSED directives can
+be used to specify features of the library functions.
+.PP
+.I Lint
+library files are processed almost exactly like ordinary
+source files.
+The only difference is that functions which are defined on a library file,
+but are not used on a source file, draw no complaints.
+.I Lint
+does not simulate a full library search algorithm,
+and complains if the source files contain a redefinition of
+a library routine (this is a feature!).
+.PP
+By default,
+.I lint
+checks the programs it is given against a standard library
+file, which contains descriptions of the programs which
+are normally loaded when
+a C program
+is run.
+When the
+.B -p
+flag is in effect, another file is checked containing
+descriptions of the standard I/O library routines
+which are expected to be portable across various machines.
+The
+.B -n
+flag can be used to suppress all library checking.
+.SH
+Bugs, etc.
+.PP
+.I Lint
+was a difficult program to write, partially
+because it is closely connected with matters of programming style,
+and partially because users usually don't notice bugs which cause
+.I lint
+to miss errors which it should have caught.
+(By contrast, if
+.I lint
+incorrectly complains about something that is correct, the
+programmer reports that immediately!)
+.PP
+A number of areas remain to be further developed.
+The checking of structures and arrays is rather inadequate;
+size
+incompatibilities go unchecked,
+and no attempt is made to match up structure and union
+declarations across files.
+Some stricter checking of the use of the
+.B typedef
+is clearly desirable, but what checking is appropriate, and how
+to carry it out, is still to be determined.
+.PP
+.I Lint
+shares the preprocessor with the C compiler.
+At some point it may be appropriate for a
+special version of the preprocessor to be constructed
+which checks for things such as unused macro definitions,
+macro arguments which have side effects which are
+not expanded at all, or are expanded more than once, etc.
+.PP
+The central problem with
+.I lint
+is the packaging of the information which it collects.
+There are many options which
+serve only to turn off, or slightly modify,
+certain features.
+There are pressures to add even more of these options.
+.PP
+In conclusion, it appears that the general notion of having two
+programs is a good one.
+The compiler concentrates on quickly and accurately turning the
+program text into bits which can be run;
+.I lint
+concentrates on issues
+of portability, style, and efficiency.
+.I Lint
+can afford to be wrong, since incorrectness and over-conservatism
+are merely annoying, not fatal.
+The compiler can be fast since it knows that
+.I lint
+will cover its flanks.
+Finally, the programmer can
+concentrate at one stage
+of the programming process solely on the algorithms,
+data structures, and correctness of the
+program, and then later retrofit,
+with the aid of
+.I lint ,
+the desirable properties of universality and portability.
+.SG MH-1273-SCJ-unix
+.bp
+.[
+$LIST$
+.]
+.bp
+.SH
+Appendix:   Current Lint Options
+.PP
+The command currently has the form
+.DS
+lint\fR [\fB\-\fRoptions ] files... library-descriptors...
+.DE
+The options are
+.IP \fBh\fR
+Perform heuristic checks
+.IP \fBp\fR
+Perform portability checks
+.IP \fBv\fR
+Don't report unused arguments
+.IP \fBu\fR
+Don't report unused or undefined externals
+.IP \fBb\fR
+Report unreachable
+.B break
+statements.
+.IP \fBx\fR
+Report unused external declarations
+.IP \fBa\fR
+Report assignments of
+.B long
+to
+.B int
+or shorter.
+.IP \fBc\fR
+Complain about questionable casts
+.IP \fBn\fR
+No library checking is done
+.IP \fBs\fR
+Same as
+.B h
+(for historical reasons)
diff --git a/usr/doc/yacc/ss.. b/usr/doc/yacc/ss..

new file mode 100644 (file)

index 0000000..1b4c307
--- /dev/null
+++ b/usr/doc/yacc/ss..
@@ -0,0 +1,52 @@
+.RP
+.ND "July 31, 1978"
+.TL
+Yacc:
+Yet Another Compiler-Compiler
+.AU "MH 2C-559" 3968
+Stephen C. Johnson
+.AI
+.MH
+.AB
+.PP
+Computer program input generally has some structure;
+in fact, every computer program that does input can be thought of as defining
+an ``input language'' which it accepts.
+An input language may be as complex as a programming language, or as simple as
+a sequence of numbers.
+Unfortunately, usual input facilities
+are limited, difficult to use,
+and often are lax about checking their inputs for validity.
+.PP
+Yacc provides a general tool for describing
+the input to a computer program.
+The Yacc user specifies the structures
+of his input, together with code to be invoked as
+each such structure is recognized.
+Yacc turns such a specification into a subroutine that
+handles the input process;
+frequently, it is convenient and appropriate to have most
+of the flow of control in the user's application
+handled by this subroutine.
+.PP
+The input subroutine produced by Yacc calls a user-supplied routine to
+return the next basic input item.
+Thus, the user can specify his input in terms of individual input characters, or
+in terms of higher level constructs such as names and numbers.
+The user-supplied routine may also handle idiomatic features such as
+comment and continuation conventions, which typically defy easy grammatical specification.
+.PP
+Yacc is written in portable C.
+The class of specifications accepted is a very general one: LALR(1)
+grammars with disambiguating rules.
+.PP
+In addition to compilers for C, APL, Pascal, RATFOR, etc., Yacc
+has also been used for less conventional languages,
+including a phototypesetter language, several desk calculator languages, a document retrieval system,
+and a Fortran debugging system.
+.AE
+.OK
+Computer Languages
+Compilers
+Formal Language Theory
+.CS 23 11 34 0 0 8
diff --git a/usr/doc/yacc/ss0 b/usr/doc/yacc/ss0

new file mode 100644 (file)

index 0000000..19b4c28
--- /dev/null
+++ b/usr/doc/yacc/ss0
@@ -0,0 +1,199 @@
+.SH
+0: Introduction
+.PP
+Yacc provides a general tool for imposing structure on the input to a computer program.
+The Yacc user prepares a
+specification of the input process; this includes rules
+describing the input structure, code to be invoked when these
+rules are recognized, and a low-level routine to do the
+basic input.
+Yacc then generates a function to control the input process.
+This function, called a
+.I parser ,
+calls the user-supplied low-level input routine
+(the
+.I "lexical analyzer" )
+to pick up the basic items
+(called
+.I tokens )
+from the input stream.
+These tokens are organized according to the input structure rules,
+called
+.I "grammar rules" \|;
+when one of these rules has been recognized,
+then user code supplied for this rule, an
+.I action ,
+is invoked; actions have the ability to return values and
+make use of the values of other actions.
+.PP
+Yacc is written in a portable dialect of C
+.[
+Ritchie Kernighan Language Prentice
+.]
+and the actions, and output subroutine, are in C as well.
+Moreover, many of the syntactic conventions of Yacc follow C.
+.PP
+The heart of the input specification is a collection of grammar rules.
+Each rule describes an allowable structure and gives it a name.
+For example, one grammar rule might be
+.DS
+date  :  month\_name  day  \',\'  year   ;
+.DE
+Here,
+.I date ,
+.I month\_name ,
+.I day ,
+and
+.I year
+represent structures of interest in the input process;
+presumably,
+.I month\_name ,
+.I day ,
+and
+.I year
+are defined elsewhere.
+The comma ``,'' is enclosed in single quotes; this implies that the
+comma is to appear literally in the input.
+The colon and semicolon merely serve as punctuation in the rule, and have
+no significance in controlling the input.
+Thus, with proper definitions, the input
+.DS
+July  4, 1776
+.DE
+might be matched by the above rule.
+.PP
+An important part of the input process is carried out by the
+lexical analyzer.
+This user routine reads the input stream, recognizing the lower level structures,
+and communicates these tokens
+to the parser.
+For historical reasons, a structure recognized by the lexical analyzer is called a
+.I "terminal symbol" ,
+while the structure recognized by the parser is called a
+.I "nonterminal symbol" .
+To avoid confusion, terminal symbols will usually be referred to as
+.I tokens .
+.PP
+There is considerable leeway in deciding whether to recognize structures using the lexical
+analyzer or grammar rules.
+For example, the rules
+.DS
+month\_name  :  \'J\' \'a\' \'n\'   ;
+month\_name  :  \'F\' \'e\' \'b\'   ;
+
+         . . .
+
+month\_name  :  \'D\' \'e\' \'c\'   ;
+.DE
+might be used in the above example.
+The lexical analyzer would only need to recognize individual letters, and
+.I month\_name
+would be a nonterminal symbol.
+Such low-level rules tend to waste time and space, and may
+complicate the specification beyond Yacc's ability to deal with it.
+Usually, the lexical analyzer would
+recognize the month names,
+and return an indication that a
+.I month\_name
+was seen; in this case,
+.I month\_name
+would be a token.
+.PP
+Literal characters such as ``,'' must also be passed through the lexical
+analyzer, and are also considered tokens.
+.PP
+Specification files are very flexible.
+It is realively easy to add to the above example the rule
+.DS
+date  :  month \'/\' day \'/\' year   ;
+.DE
+allowing
+.DS
+7 / 4 / 1776
+.DE
+as a synonym for
+.DS
+July 4, 1776
+.DE
+In most cases, this new rule could be ``slipped in'' to a working system with minimal effort,
+and little danger of disrupting existing input.
+.PP
+The input being read may not conform to the
+specifications.
+These input errors are detected as early as is theoretically possible with a
+left-to-right scan;
+thus, not only is the chance of reading and computing with bad
+input data substantially reduced, but the bad data can usually be quickly found.
+Error handling,
+provided as part of the input specifications,
+permits the reentry of bad data,
+or the continuation of the input process after skipping over the bad data.
+.PP
+In some cases, Yacc fails to produce a parser when given a set of
+specifications.
+For example, the specifications may be self contradictory, or they may
+require a more powerful recognition mechanism than that available to Yacc.
+The former cases represent design errors;
+the latter cases
+can often be corrected
+by making
+the lexical analyzer
+more powerful, or by rewriting some of the grammar rules.
+While Yacc cannot handle all possible specifications, its power
+compares favorably with similar systems;
+moreover, the
+constructions which are difficult for Yacc to handle are
+also frequently difficult for human beings to handle.
+Some users have reported that the discipline of formulating valid
+Yacc specifications for their input revealed errors of
+conception or design early in the program development.
+.PP
+The theory underlying Yacc has been described elsewhere.
+.[
+Aho Johnson Surveys LR Parsing
+.]
+.[
+Aho Johnson Ullman Ambiguous Grammars
+.]
+.[
+Aho Ullman Principles Compiler Design
+.]
+Yacc has been extensively used in numerous practical applications,
+including
+.I lint ,
+.[
+Johnson Lint
+.]
+the Portable C Compiler,
+.[
+Johnson Portable Compiler Theory
+.]
+and a system for typesetting mathematics.
+.[
+Kernighan Cherry typesetting system CACM
+.]
+.PP
+The next several sections describe the
+basic process of preparing a Yacc specification;
+Section 1 describes the preparation of grammar rules,
+Section 2 the preparation of the user supplied actions associated with these rules,
+and Section 3 the preparation of lexical analyzers.
+Section 4 describes the operation of the parser.
+Section 5 discusses various reasons why Yacc may be unable to produce a
+parser from a specification, and what to do about it.
+Section 6 describes a simple mechanism for
+handling operator precedences in arithmetic expressions.
+Section 7 discusses error detection and recovery.
+Section 8 discusses the operating environment and special features
+of the parsers Yacc produces.
+Section 9 gives some suggestions which should improve the
+style and efficiency of the specifications.
+Section 10 discusses some advanced topics, and Section 11 gives
+acknowledgements.
+Appendix A has a brief example, and Appendix B gives a
+summary of the Yacc input syntax.
+Appendix C gives an example using some of the more advanced
+features of Yacc, and, finally,
+Appendix D describes mechanisms and syntax
+no longer actively supported, but
+provided for historical continuity with older versions of Yacc.
diff --git a/usr/doc/yacc/ss1 b/usr/doc/yacc/ss1

new file mode 100644 (file)

index 0000000..596a4e6
--- /dev/null
+++ b/usr/doc/yacc/ss1
@@ -0,0 +1,136 @@
+.tr *\(**
+.tr |\(or
+.SH
+1: Basic Specifications
+.PP
+Names refer to either tokens or nonterminal symbols.
+Yacc requires
+token names to be declared as such.
+In addition, for reasons discussed in Section 3, it is often desirable
+to include the lexical analyzer as part of the specification file;
+it may be useful to include other programs as well.
+Thus, every specification file consists of three sections:
+the
+.I declarations ,
+.I "(grammar) rules" ,
+and
+.I programs .
+The sections are separated by double percent ``%%'' marks.
+(The percent ``%'' is generally used in Yacc specifications as an escape character.)
+.PP
+In other words, a full specification file looks like
+.DS
+declarations
+%%
+rules
+%%
+programs
+.DE
+.PP
+The declaration section may be empty.
+Moreover, if the programs section is omitted, the second %% mark may be omitted also;
+thus, the smallest legal Yacc specification is
+.DS
+%%
+rules
+.DE
+.PP
+Blanks, tabs, and newlines are ignored except
+that they may not appear in names or multi-character reserved symbols.
+Comments may appear wherever a name is legal; they are enclosed
+in /* . . . */, as in C and PL/I.
+.PP
+The rules section is made up of one or more grammar rules.
+A grammar rule has the form:
+.DS
+A  :  BODY  ;
+.DE
+A represents a nonterminal name, and BODY represents a sequence of zero or more names and literals.
+The colon and the semicolon are Yacc punctuation.
+.PP
+Names may be of arbitrary length, and may be made up of letters, dot ``.'', underscore ``\_'', and
+non-initial digits.
+Upper and lower case letters are distinct.
+The names used in the body of a grammar rule may represent tokens or nonterminal symbols.
+.PP
+A literal consists of a character enclosed in single quotes ``\'''.
+As in C, the backslash ``\e'' is an escape character within literals, and all the C escapes
+are recognized.
+Thus
+.DS
+\'\en\'        newline
+\'\er\'        return
+\'\e\'\'       single quote ``\'''
+\'\e\e\'       backslash ``\e''
+\'\et\'        tab
+\'\eb\'        backspace
+\'\ef\'        form feed
+\'\exxx\'      ``xxx'' in octal
+.DE
+For a number of technical reasons, the
+\s-2NUL\s0
+character (\'\e0\' or 0) should never
+be used in grammar rules.
+.PP
+If there are several grammar rules with the same left hand side, the vertical bar ``|''
+can be used to avoid rewriting the left hand side.
+In addition,
+the semicolon at the end of a rule can be dropped before a vertical bar.
+Thus the grammar rules
+.DS
+A      :       B  C  D   ;
+A      :       E  F   ;
+A      :       G   ;
+.DE
+can be given to Yacc as
+.DS
+A      :       B  C  D
+       |       E  F
+       |       G
+       ;
+.DE
+It is not necessary that all grammar rules with the same left side appear together in the grammar rules section,
+although it makes the input much more readable, and easier to change.
+.PP
+If a nonterminal symbol matches the empty string, this can be indicated in the obvious way:
+.DS
+empty :   ;
+.DE
+.PP
+Names representing tokens must be declared; this is most simply done by writing
+.DS
+%token   name1  name2 . . .
+.DE
+in the declarations section.
+(See Sections 3 , 5, and 6 for much more discussion).
+Every name not defined in the declarations section is assumed to represent a nonterminal symbol.
+Every nonterminal symbol must appear on the left side of at least one rule.
+.PP
+Of all the nonterminal symbols, one, called the
+.I "start symbol" ,
+has particular importance.
+The parser is designed to recognize the start symbol; thus,
+this symbol represents the largest,
+most general structure described by the grammar rules.
+By default,
+the start symbol is taken to be the left hand side of the first
+grammar rule in the rules section.
+It is possible, and in fact desirable, to declare the start
+symbol explicitly in the declarations section using the %start keyword:
+.DS
+%start   symbol
+.DE
+.PP
+The end of the input to the parser is signaled by a special token, called the
+.I endmarker .
+If the tokens up to, but not including, the endmarker form a structure
+which matches the start symbol, the parser function returns to its caller
+after the endmarker is seen; it
+.I accepts
+the input.
+If the endmarker is seen in any other context, it is an error.
+.PP
+It is the job of the user-supplied lexical analyzer
+to return the endmarker when appropriate; see section 3, below.
+Usually the endmarker represents some reasonably obvious 
+I/O status, such as ``end-of-file'' or ``end-of-record''.
diff --git a/usr/doc/yacc/ss2 b/usr/doc/yacc/ss2

new file mode 100644 (file)

index 0000000..6500602
--- /dev/null
+++ b/usr/doc/yacc/ss2
@@ -0,0 +1,149 @@
+.SH
+2: Actions
+.PP
+With each grammar rule, the user may associate actions to be performed each time
+the rule is recognized in the input process.
+These actions may return values, and may obtain the values returned by previous
+actions.
+Moreover, the lexical analyzer can return values
+for tokens, if desired.
+.PP
+An action is an arbitrary C statement, and as such can do
+input and output, call subprograms, and alter
+external vectors and variables.
+An action is specified by
+one or more statements, enclosed in curly braces ``{'' and ``}''.
+For example,
+.DS
+A      :       \'(\'  B  \')\'
+                       {       hello( 1, "abc" );  }
+.DE
+and
+.DS
+XXX    :       YYY  ZZZ
+                       {       printf("a message\en");
+                               flag = 25;   }
+.DE
+are grammar rules with actions.
+.PP
+To facilitate easy communication between the actions and the parser, the action statements are altered
+slightly.
+The symbol ``dollar sign'' ``$'' is used as a signal to Yacc in this context.
+.PP
+To return a value, the action normally sets the
+pseudo-variable ``$$'' to some value.
+For example, an action that does nothing but return the value 1 is
+.DS
+       {  $$ = 1;  }
+.DE
+.PP
+To obtain the values returned by previous actions and the lexical analyzer, the
+action may use the pseudo-variables $1, $2, . . .,
+which refer to the values returned by the
+components of the right side of a rule, reading from left to right.
+Thus, if the rule is
+.DS
+A      :       B  C  D   ;
+.DE
+for example, then $2 has the value returned by C, and $3 the value returned by D.
+.PP
+As a more concrete example, consider the rule
+.DS
+expr   :       \'(\'  expr  \')\'   ;
+.DE
+The value returned by this rule is usually the value of the
+.I expr
+in parentheses.
+This can be indicated by
+.DS
+expr   :        \'(\'  expr  \')\'             {  $$ = $2 ;  }
+.DE
+.PP
+By default, the value of a rule is the value of the first element in it ($1).
+Thus, grammar rules of the form
+.DS
+A      :       B    ;
+.DE
+frequently need not have an explicit action.
+.PP
+In the examples above, all the actions came at the end of their rules.
+Sometimes, it is desirable to get control before a rule is fully parsed.
+Yacc permits an action to be written in the middle of a rule as well
+as at the end.
+This rule is assumed to return a value, accessible
+through the usual \$ mechanism by the actions to
+the right of it.
+In turn, it may access the values
+returned by the symbols to its left.
+Thus, in the rule
+.DS
+A      :       B
+                       {  $$ = 1;  }
+               C
+                       {   x = $2;   y = $3;  }
+       ;
+.DE
+the effect is to set
+.I x
+to 1, and
+.I y
+to the value returned by C.
+.PP
+Actions that do not terminate a rule are actually
+handled by Yacc by manufacturing a new nonterminal
+symbol name, and a new rule matching this
+name to the empty string.
+The interior action is the action triggered off by recognizing
+this added rule.
+Yacc actually treats the above example as if
+it had been written:
+.DS
+$ACT   :       /* empty */
+                       {  $$ = 1;  }
+       ;
+
+A      :       B  $ACT  C
+                       {   x = $2;   y = $3;  }
+       ;
+.DE
+.PP
+In many applications, output is not done directly by the actions;
+rather, a data structure, such as a parse tree, is constructed in memory,
+and transformations are applied to it before output is generated.
+Parse trees are particularly easy to
+construct, given routines to build and maintain the tree
+structure desired.
+For example, suppose there is a C function
+.I node ,
+written so that the call
+.DS
+node( L, n1, n2 )
+.DE
+creates a node with label L, and descendants n1 and n2, and returns the index of
+the newly created node.
+Then parse tree can be built by supplying actions such as:
+.DS
+expr   :       expr  \'+\'  expr  
+                       {  $$ = node( \'+\', $1, $3 );  }
+.DE
+in the specification.
+.PP
+The user may define other variables to be used by the actions.
+Declarations and definitions can appear in
+the declarations section,
+enclosed in the marks ``%{'' and ``%}''.
+These declarations and definitions have global scope, 
+so they are known to the action statements and the lexical analyzer.
+For example,
+.DS
+%{   int variable = 0;   %}
+.DE
+could be placed in the declarations section,
+making
+.I variable
+accessible to all of the actions.
+The Yacc parser uses only names beginning in ``yy'';
+the user should avoid such names.
+.PP
+In these examples, all the values are integers: a discussion of
+values of other types will be found in Section 10.
diff --git a/usr/doc/yacc/ss3 b/usr/doc/yacc/ss3

new file mode 100644 (file)

index 0000000..527bdef
--- /dev/null
+++ b/usr/doc/yacc/ss3
@@ -0,0 +1,102 @@
+.SH
+3: Lexical Analysis
+.PP
+The user must supply a lexical analyzer to read the input stream and communicate tokens
+(with values, if desired) to the parser.
+The lexical analyzer is an integer-valued function called
+.I yylex .
+The function returns an integer, the
+.I "token number" ,
+representing the kind of token read.
+If there is a value associated with that token, it should be assigned
+to the external variable
+.I yylval .
+.PP
+The parser and the lexical analyzer must agree on these token numbers in order for
+communication between them to take place.
+The numbers may be chosen by Yacc, or chosen by the user.
+In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer
+to return these numbers symbolically.
+For example, suppose that the token name DIGIT has been defined in the declarations section of the
+Yacc specification file.
+The relevant portion of the lexical analyzer might look like:
+.DS
+yylex(){
+       extern int yylval;
+       int c;
+       . . .
+       c = getchar();
+       . . .
+       switch( c ) {
+               . . .
+       case \'0\':
+       case \'1\':
+         . . .
+       case \'9\':
+               yylval = c\-\'0\';
+               return( DIGIT );
+               . . .
+               }
+       . . .
+.DE
+.PP
+The intent is to return a token number of DIGIT, and a value equal to the numerical value of the
+digit.
+Provided that the lexical analyzer code is placed in the programs section of the specification file,
+the identifier DIGIT will be defined as the token number associated
+with the token DIGIT.
+.PP
+This mechanism leads to clear,
+easily modified lexical analyzers; the only pitfall is the need
+to avoid using any token names in the grammar that are reserved
+or significant in C or the parser; for example, the use of
+token names
+.I if
+or
+.I while
+will almost certainly cause severe
+difficulties when the lexical analyzer is compiled.
+The token name
+.I error
+is reserved for error handling, and should not be used naively
+(see Section 7).
+.PP
+As mentioned above, the token numbers may be chosen by Yacc or by the user.
+In the default situation, the numbers are chosen by Yacc.
+The default token number for a literal
+character is the numerical value of the character in the local character set.
+Other names are assigned token numbers
+starting at 257.
+.PP
+To assign a token number to a token (including literals),
+the first appearance of the token name or literal
+.I
+in the declarations section
+.R
+can be immediately followed by
+a nonnegative integer.
+This integer is taken to be the token number of the name or literal.
+Names and literals not defined by this mechanism retain their default definition.
+It is important that all token numbers be distinct.
+.PP
+For historical reasons, the endmarker must have token
+number 0 or negative.
+This token number cannot be redefined by the user; thus, all
+lexical analyzers should be prepared to return 0 or negative as a token number
+upon reaching the end of their input.
+.PP
+A very useful tool for constructing lexical analyzers is
+the
+.I Lex
+program developed by Mike Lesk.
+.[
+Lesk Lex
+.]
+These lexical analyzers are designed to work in close
+harmony with Yacc parsers.
+The specifications for these lexical analyzers
+use regular expressions instead of grammar rules.
+Lex can be easily used to produce quite complicated lexical analyzers,
+but there remain some languages (such as FORTRAN) which do not
+fit any theoretical framework, and whose lexical analyzers
+must be crafted by hand.
diff --git a/usr/doc/yacc/ss4 b/usr/doc/yacc/ss4

new file mode 100644 (file)

index 0000000..7cac9b2
--- /dev/null
+++ b/usr/doc/yacc/ss4
@@ -0,0 +1,328 @@
+.SH
+4: How the Parser Works
+.PP
+Yacc turns the specification file into a C program, which
+parses the input according to the specification given.
+The algorithm used to go from the
+specification to the parser is complex, and will not be discussed
+here (see
+the references for more information).
+The parser itself, however, is relatively simple,
+and understanding how it works, while
+not strictly necessary, will nevertheless make
+treatment of error recovery and ambiguities much more
+comprehensible.
+.PP
+The parser produced by Yacc consists
+of a finite state machine with a stack.
+The parser is also capable of reading and remembering the next
+input token (called the
+.I lookahead
+token).
+The
+.I "current state"
+is always the one on the top of the stack.
+The states of the finite state machine are given
+small integer labels; initially, the machine is in state 0,
+the stack contains only state 0, and no lookahead token has been read.
+.PP
+The machine has only four actions available to it, called
+.I shift ,
+.I reduce ,
+.I accept ,
+and
+.I error .
+A move of the parser is done as follows:
+.IP 1.
+Based on its current state, the parser decides
+whether it needs a lookahead token to decide
+what action should be done; if it needs one, and does
+not have one, it calls
+.I yylex
+to obtain the next token.
+.IP 2.
+Using the current state, and the lookahead token
+if needed, the parser decides on its next action, and
+carries it out.
+This may result in states being pushed onto the stack, or popped off of
+the stack, and in the lookahead token being processed
+or left alone.
+.PP
+The
+.I shift
+action is the most common action the parser takes.
+Whenever a shift action is taken, there is always
+a lookahead token.
+For example, in state 56 there may be
+an action:
+.DS
+       IF      shift 34
+.DE
+which says, in state 56, if the lookahead token is IF,
+the current state (56) is pushed down on the stack,
+and state 34 becomes the current state (on the
+top of the stack).
+The lookahead token is cleared.
+.PP
+The
+.I reduce
+action keeps the stack from growing without
+bounds.
+Reduce actions are appropriate when the parser has seen
+the right hand side of a grammar rule,
+and is prepared to announce that it has seen
+an instance of the rule, replacing the right hand side
+by the left hand side.
+It may be necessary to consult the lookahead token
+to decide whether to reduce, but
+usually it is not; in fact, the default
+action (represented by a ``.'') is often a reduce action.
+.PP
+Reduce actions are associated with individual grammar rules.
+Grammar rules are also given small integer
+numbers, leading to some confusion.
+The action
+.DS
+       \fB.\fR reduce 18
+.DE
+refers to
+.I "grammar rule"
+18, while the action
+.DS
+       IF      shift 34
+.DE
+refers to
+.I state
+34.
+.PP
+Suppose the rule being reduced is
+.DS
+A      \fB:\fR x  y  z    ;
+.DE
+The reduce action depends on the
+left hand symbol (A in this case), and the number of
+symbols on the right hand side (three in this case).
+To reduce, first pop off the top three states
+from the stack
+(In general, the number of states popped equals the number of symbols on the
+right side of the rule).
+In effect, these states were the ones
+put on the stack while recognizing
+.I x ,
+.I y ,
+and
+.I z ,
+and no longer serve any useful purpose.
+After popping these states, a state is uncovered
+which was the state the parser was in before beginning to
+process the rule.
+Using this uncovered state, and the symbol
+on the left side of the rule, perform what is in
+effect a shift of A.
+A new state is obtained, pushed onto the stack, and parsing continues.
+There are significant differences between the processing of
+the left hand symbol and an ordinary shift of a token,
+however, so this action is called a
+.I goto
+action.
+In particular, the lookahead token is cleared by a shift, and
+is not affected by a goto.
+In any case, the uncovered state contains an entry such as:
+.DS
+       A       goto 20
+.DE
+causing state 20 to be pushed
+onto the stack, and become the current state.
+.PP
+In effect, the reduce action ``turns back the clock'' in the parse,
+popping the states off the stack to go back to the
+state where the right hand side of the rule was first seen.
+The parser then behaves as if it had seen the left side at that time.
+If the right hand side of the rule is empty,
+no states are popped off of the stack: the uncovered state
+is in fact the current state.
+.PP
+The reduce action is also important in the treatment of user-supplied
+actions and values.
+When a rule is reduced, the code supplied with the rule is executed
+before the stack is adjusted.
+In addition to the stack holding the states, another stack,
+running in parallel with it, holds the values returned
+from the lexical analyzer and the actions.
+When a shift takes place, the external variable
+.I yylval
+is copied onto the value stack.
+After the return from the user code, the reduction is carried out.
+When the
+.I goto
+action is done, the external variable
+.I yyval
+is copied onto the value stack.
+The pseudo-variables $1, $2, etc., refer to the value stack.
+.PP
+The other two parser actions are conceptually much simpler.
+The
+.I accept
+action indicates that the entire input has been seen and
+that it matches the specification.
+This action appears only when the lookahead token is 
+the endmarker, and indicates that the parser has successfully
+done its job.
+The
+.I error
+action, on the other hand, represents a place where the parser
+can no longer continue parsing according to the specification.
+The input tokens it has seen, together with the lookahead token,
+cannot be followed by anything that would result
+in a legal input.
+The parser reports an error, and attempts to recover the situation and
+resume parsing: the error recovery (as opposed to the detection of error)
+will be covered in Section 7.
+.PP
+It is time for an example!
+Consider the specification
+.DS
+%token  DING  DONG  DELL
+%%
+rhyme  :       sound  place
+       ;
+sound  :       DING  DONG
+       ;
+place  :       DELL
+       ;
+.DE
+.PP
+When Yacc is invoked with the
+.B \-v
+option, a file called
+.I y.output
+is produced, with a human-readable description of the parser.
+The
+.I y.output
+file corresponding to the above grammar (with some statistics
+stripped off the end) is:
+.DS
+state 0
+       $accept  :  \_rhyme  $end 
+
+       DING  shift 3
+       .  error
+
+       rhyme  goto 1
+       sound  goto 2
+
+state 1
+       $accept  :   rhyme\_$end 
+
+       $end  accept
+       .  error
+
+state 2
+       rhyme  :   sound\_place 
+
+       DELL  shift 5
+       .  error
+
+       place   goto 4
+
+state 3
+       sound   :   DING\_DONG 
+
+       DONG  shift 6
+       .  error
+
+state 4
+       rhyme  :   sound  place\_    (1)
+
+       .   reduce  1
+
+state 5
+       place  :   DELL\_    (3)
+
+       .   reduce  3
+
+state 6
+       sound   :   DING  DONG\_    (2)
+
+       .   reduce  2
+.DE
+Notice that, in addition to the actions for each state, there is a
+description of the parsing rules being processed in each
+state.  The \_ character is used to indicate
+what has been seen, and what is yet to come, in each rule.
+Suppose the input is
+.DS
+DING  DONG  DELL
+.DE
+It is instructive to follow the steps of the parser while
+processing this input.
+.PP
+Initially, the current state is state 0.
+The parser needs to refer to the input in order to decide
+between the actions available in state 0, so
+the first token,
+.I DING ,
+is read, becoming the lookahead token.
+The action in state 0 on
+.I DING
+is
+is ``shift 3'', so state 3 is pushed onto the stack,
+and the lookahead token is cleared.
+State 3 becomes the current state.
+The next token,
+.I DONG ,
+is read, becoming the lookahead token.
+The action in state 3 on the token
+.I DONG
+is ``shift 6'',
+so state 6 is pushed onto the stack, and the lookahead is cleared.
+The stack now contains 0, 3, and 6.
+In state 6, without even consulting the lookahead,
+the parser reduces by rule 2.
+.DS
+       sound  :   DING  DONG
+.DE
+This rule has two symbols on the right hand side, so
+two states, 6 and 3, are popped off of the stack, uncovering state 0.
+Consulting the description of state 0, looking for a goto on 
+.I sound ,
+.DS
+       sound   goto 2
+.DE
+is obtained; thus state 2 is pushed onto the stack,
+becoming the current state.
+.PP
+In state 2, the next token,
+.I DELL ,
+must be read.
+The action is ``shift 5'', so state 5 is pushed onto the stack, which now has
+0, 2, and 5 on it, and the lookahead token is cleared.
+In state 5, the only action is to reduce by rule 3.
+This has one symbol on the right hand side, so one state, 5,
+is popped off, and state 2 is uncovered.
+The goto in state 2 on
+.I place ,
+the left side of rule 3,
+is state 4.
+Now, the stack contains 0, 2, and 4.
+In state 4, the only action is to reduce by rule 1.
+There are two symbols on the right, so the top two states are popped off,
+uncovering state 0 again.
+In state 0, there is a goto on
+.I rhyme
+causing the parser to enter state 1.
+In state 1, the input is read; the endmarker is obtained,
+indicated by ``$end'' in the
+.I y.output
+file.
+The action in state 1 when the endmarker is seen is to accept,
+successfully ending the parse.
+.PP
+The reader is urged to consider how the parser works
+when confronted with such incorrect strings as
+.I "DING DONG DONG" ,
+.I "DING DONG" ,
+.I "DING DONG DELL DELL" ,
+etc.
+A few minutes spend with this and other simple examples will
+probably be repaid when problems arise in more complicated contexts.
diff --git a/usr/doc/yacc/ss5 b/usr/doc/yacc/ss5

new file mode 100644 (file)

index 0000000..d2ec50b
--- /dev/null
+++ b/usr/doc/yacc/ss5
@@ -0,0 +1,300 @@
+.SH
+5: Ambiguity and Conflicts
+.PP
+A set of grammar rules is
+.I ambiguous
+if there is some input string that can be structured in two or more different ways.
+For example, the grammar rule
+.DS
+expr   :       expr  \'\-\'  expr
+.DE
+is a natural way of expressing the fact that one way of forming an arithmetic expression
+is to put two other expressions together with a minus sign between them.
+Unfortunately, this grammar rule does not
+completely specify the way that all complex inputs
+should be structured.
+For example, if the input is
+.DS
+expr  \-  expr  \-  expr
+.DE
+the rule allows this input to be structured as either
+.DS
+(  expr  \-  expr  )  \-  expr
+.DE
+or as
+.DS
+expr  \-  (  expr  \-  expr  )
+.DE
+(The first is called
+.I "left association" ,
+the second
+.I "right association" ).
+.PP
+Yacc detects such ambiguities when it is attempting to build the parser.
+It is instructive to consider the problem that confronts the parser when it is
+given an input such as
+.DS
+expr  \-  expr  \-  expr
+.DE
+When the parser has read the second expr, the input that it has seen:
+.DS
+expr  \-  expr
+.DE
+matches the right side of the grammar rule above.
+The parser could
+.I reduce
+the input by applying this rule;
+after applying the rule;
+the input is reduced to
+.I expr (the
+left side of the rule).
+The parser would then read the final part of the input:
+.DS
+\-  expr
+.DE
+and again reduce.
+The effect of this is to take the left associative interpretation.
+.PP
+Alternatively, when the parser has seen
+.DS
+expr  \-  expr
+.DE
+it could defer the immediate application of the rule, and continue reading
+the input until it had seen
+.DS
+expr  \-  expr  \-  expr
+.DE
+It could then apply the rule to the rightmost three symbols, reducing them to
+.I expr
+and leaving
+.DS
+expr  \-  expr
+.DE
+Now the rule can be reduced once more; the effect is to
+take the right associative interpretation.
+Thus, having read
+.DS
+expr  \-  expr
+.DE
+the parser can do two legal things, a shift or a reduction, and has no way of
+deciding between them.
+This is called a
+.I "shift / reduce conflict" .
+It may also happen that the parser has a choice of two legal reductions;
+this is called a
+.I "reduce / reduce conflict" .
+Note that there are never any ``Shift/shift'' conflicts.
+.PP
+When there are shift/reduce or reduce/reduce conflicts, Yacc still produces a parser.
+It does this by selecting one of the valid steps wherever it has a choice.
+A rule describing which choice to make in a given situation is called
+a
+.I "disambiguating rule" .
+.PP
+Yacc invokes two disambiguating rules by default:
+.IP 1.
+In a shift/reduce conflict, the default is to do the shift.
+.IP 2.
+In a reduce/reduce conflict, the default is to reduce by the
+.I earlier
+grammar rule (in the input sequence).
+.PP
+Rule 1 implies that reductions are deferred whenever there is a choice,
+in favor of shifts.
+Rule 2 gives the user rather crude control over the behavior of the parser
+in this situation, but reduce/reduce conflicts should be avoided whenever possible.
+.PP
+Conflicts may arise because of mistakes in input or logic, or because the grammar rules, while consistent,
+require a more complex parser than Yacc can construct.
+The use of actions within rules can also cause conflicts, if the action must
+be done before the parser can be sure which rule is being recognized.
+In these cases, the application of disambiguating rules is inappropriate,
+and leads to an incorrect parser.
+For this reason, Yacc
+always reports the number of shift/reduce and reduce/reduce conflicts resolved by Rule 1 and Rule 2.
+.PP
+In general, whenever it is possible to apply disambiguating rules to produce a correct parser, it is also
+possible to rewrite the grammar rules so that the same inputs are read but there are no
+conflicts.
+For this reason, most previous parser generators
+have considered conflicts to be fatal errors.
+Our experience has suggested that this rewriting is somewhat unnatural,
+and produces slower parsers; thus, Yacc will produce parsers even in the presence of conflicts.
+.PP
+As an example of the power of disambiguating rules, consider a fragment from a programming
+language involving an ``if-then-else'' construction:
+.DS
+stat   :       IF  \'(\'  cond  \')\'  stat
+       |       IF  \'(\'  cond  \')\'  stat  ELSE  stat
+       ;
+.DE
+In these rules,
+.I IF
+and
+.I ELSE
+are tokens,
+.I cond
+is a nonterminal symbol describing
+conditional (logical) expressions, and
+.I stat
+is a nonterminal symbol describing statements.
+The first rule will be called the
+.ul
+simple-if
+rule, and the
+second the
+.ul
+if-else
+rule.
+.PP
+These two rules form an ambiguous construction, since input of the form
+.DS
+IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2
+.DE
+can be structured according to these rules in two ways:
+.DS
+IF  (  C1  )  {
+       IF  (  C2  )  S1
+       }
+ELSE  S2
+.DE
+or
+.DS
+IF  (  C1  )  {
+       IF  (  C2  )  S1
+       ELSE  S2
+       }
+.DE
+The second interpretation is the one given in most programming languages
+having this construct.
+Each
+.I ELSE
+is associated with the last preceding
+``un-\fIELSE'\fRd''
+.I IF .
+In this example, consider the situation where the parser has seen
+.DS
+IF  (  C1  )  IF  (  C2  )  S1
+.DE
+and is looking at the
+.I ELSE .
+It can immediately
+reduce
+by the simple-if rule to get
+.DS
+IF  (  C1  )  stat
+.DE
+and then read the remaining input,
+.DS
+ELSE  S2
+.DE
+and reduce
+.DS
+IF  (  C1  )  stat  ELSE  S2
+.DE
+by the if-else rule.
+This leads to the first of the above groupings of the input.
+.PP
+On the other hand, the
+.I ELSE
+may be shifted,
+.I S2
+read, and then the right hand portion of
+.DS
+IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2
+.DE
+can be reduced by the if-else rule to get
+.DS
+IF  (  C1  )  stat
+.DE
+which can be reduced by the simple-if rule.
+This leads to the second of the above groupings of the input, which
+is usually desired.
+.PP
+Once again the parser can do two valid things \- there is a shift/reduce conflict.
+The application of disambiguating rule 1 tells the parser to shift in this case,
+which leads to the desired grouping.
+.PP
+This shift/reduce conflict arises only when there is a particular current input symbol,
+.I ELSE ,
+and particular inputs already seen, such as
+.DS
+IF  (  C1  )  IF  (  C2  )  S1
+.DE
+In general, there may be many conflicts, and each one
+will be associated with an input symbol and
+a set of previously read inputs.
+The previously read inputs are characterized by the
+state
+of the parser.
+.PP
+The conflict messages of Yacc are best understood
+by examining the verbose (\fB\-v\fR) option output file.
+For example, the output corresponding to the above
+conflict state might be:
+.DS L
+23: shift/reduce conflict (shift 45, reduce 18) on ELSE
+
+state 23
+
+         stat  :  IF  (  cond  )  stat\_         (18)
+         stat  :  IF  (  cond  )  stat\_ELSE  stat
+
+        ELSE     shift 45
+        .        reduce 18
+
+.DE
+The first line describes the conflict, giving the state and the input symbol.
+The ordinary state description follows, giving
+the grammar rules active in the state, and the parser actions.
+Recall that the underline marks the
+portion of the grammar rules which has been seen.
+Thus in the example, in state 23 the parser has seen input corresponding
+to
+.DS
+IF  (  cond  )  stat
+.DE
+and the two grammar rules shown are active at this time.
+The parser can do two possible things.
+If the input symbol is
+.I ELSE ,
+it is possible to shift into state
+45.
+State 45 will have, as part of its description, the line
+.DS
+stat  :  IF  (  cond  )  stat  ELSE\_stat
+.DE
+since the
+.I ELSE
+will have been shifted in this state.
+Back in state 23, the alternative action, described by ``\fB.\fR'',
+is to be done if the input symbol is not mentioned explicitly in the above actions; thus,
+in this case, if the input symbol is not
+.I ELSE ,
+the parser reduces by grammar rule 18:
+.DS
+stat  :  IF  \'(\'  cond  \')\'  stat
+.DE
+Once again, notice that the numbers following ``shift'' commands refer to other states,
+while the numbers following ``reduce'' commands refer to grammar
+rule numbers.
+In the
+.I y.output
+file, the rule numbers are printed after those rules which can be reduced.
+In most one states, there will be at most reduce action possible in the
+state, and this will be the default command.
+The user who encounters unexpected shift/reduce conflicts will probably want to
+look at the verbose output to decide whether the default actions are appropriate.
+In really tough cases, the user might need to know more about
+the behavior and construction of the parser than can be covered here.
+In this case, one of the theoretical references
+.[
+Aho Johnson Surveys Parsing
+.]
+.[
+Aho Johnson Ullman Deterministic Ambiguous
+.]
+.[
+Aho Ullman Principles Design
+.]
+might be consulted; the services of a local guru might also be appropriate.
diff --git a/usr/doc/yacc/ss6 b/usr/doc/yacc/ss6

new file mode 100644 (file)

index 0000000..c01ac16
--- /dev/null
+++ b/usr/doc/yacc/ss6
@@ -0,0 +1,144 @@
+.SH
+6: Precedence
+.PP
+There is one common situation
+where the rules given above for resolving conflicts are not sufficient;
+this is in the parsing of arithmetic expressions.
+Most of the commonly used constructions for arithmetic expressions can be naturally
+described by the notion of
+.I precedence
+levels for operators, together with information about left
+or right associativity.
+It turns out that ambiguous grammars with appropriate disambiguating rules
+can be used to create parsers that are faster and easier to
+write than parsers constructed from unambiguous grammars.
+The basic notion is to write grammar rules
+of the form
+.DS
+expr  :  expr  OP  expr
+.DE
+and
+.DS
+expr  :  UNARY  expr
+.DE
+for all binary and unary operators desired.
+This creates a very ambiguous grammar, with many parsing conflicts.
+As disambiguating rules, the user specifies the precedence, or binding
+strength, of all the operators, and the associativity
+of the binary operators.
+This information is sufficient to allow Yacc to resolve the parsing conflicts
+in accordance with these rules, and construct a parser that realizes the desired
+precedences and associativities.
+.PP
+The precedences and associativities are attached to tokens in the declarations section.
+This is done by a series of lines beginning with a Yacc keyword: %left, %right,
+or %nonassoc, followed by a list of tokens.
+All of the tokens on the same line are assumed to have the same precedence level
+and associativity; the lines are listed in
+order of increasing precedence or binding strength.
+Thus,
+.DS
+%left  \'+\'  \'\-\'
+%left  \'*\'  \'/\'
+.DE
+describes the precedence and associativity of the four arithmetic operators.
+Plus and minus are left associative, and have lower precedence than
+star and slash, which are also left associative.
+The keyword %right is used to describe right associative operators,
+and the keyword %nonassoc is used to describe operators, like
+the operator .LT. in Fortran, that may not associate with themselves; thus,
+.DS
+A  .LT.  B  .LT.  C
+.DE
+is illegal in Fortran, and such an operator would be described with the keyword
+%nonassoc in Yacc.
+As an example of the behavior of these declarations, the description
+.DS
+%right  \'=\'
+%left  \'+\'  \'\-\'
+%left  \'*\'  \'/\'
+
+%%
+
+expr   :       expr  \'=\'  expr
+       |       expr  \'+\'  expr
+       |       expr  \'\-\'  expr
+       |       expr  \'*\'  expr
+       |       expr  \'/\'  expr
+       |       NAME
+       ;
+.DE
+might be used to structure the input
+.DS
+a  =  b  =  c*d  \-  e  \-  f*g
+.DE
+as follows:
+.DS
+a = ( b = ( ((c*d)\-e) \- (f*g) ) )
+.DE
+When this mechanism is used,
+unary operators must, in general, be given a precedence.
+Sometimes a unary operator and a binary operator
+have the same symbolic representation, but different precedences.
+An example is unary and binary \'\-\'; unary minus may be given the same
+strength as multiplication, or even higher, while binary minus has a lower strength than
+multiplication.
+The keyword, %prec, changes the precedence level associated with a particular grammar rule.
+%prec appears immediately after the body of the grammar rule, before the action or closing semicolon,
+and is followed by a token name or literal.
+It
+causes the precedence of the grammar rule to become that of the following token name or literal.
+For example, to make unary minus have the same precedence as multiplication the rules might resemble:
+.DS
+%left  \'+\'  \'\-\'
+%left  \'*\'  \'/\'
+
+%%
+
+expr   :       expr  \'+\'  expr
+       |       expr  \'\-\'  expr
+       |       expr  \'*\'  expr
+       |       expr  \'/\'  expr
+       |       \'\-\'  expr      %prec  \'*\'
+       |       NAME
+       ;
+.DE
+.PP
+A token declared
+by %left, %right, and %nonassoc need not be, but may be, declared by %token as well.
+.PP
+The precedences and associativities are used by Yacc to
+resolve parsing conflicts; they give rise to disambiguating rules.
+Formally, the rules work as follows:
+.IP 1.
+The precedences and associativities are recorded for those tokens and literals
+that have them.
+.IP 2.
+A precedence and associativity is associated with each grammar rule; it is the precedence
+and associativity of the last token or literal in the body of the rule.
+If the %prec construction is used, it overrides this default.
+Some grammar rules may have no precedence and associativity associated with them.
+.IP 3.
+When there is a reduce/reduce conflict, or there is a shift/reduce conflict
+and either the input symbol or the grammar rule has no precedence and associativity,
+then the two disambiguating rules given at the beginning of the section are used,
+and the conflicts are reported.
+.IP 4.
+If there is a shift/reduce conflict, and both the grammar rule and the input character
+have precedence and associativity associated with them, then the conflict is resolved
+in favor of the action (shift or reduce) associated with the higher precedence.
+If the precedences are the same, then the associativity is used; left
+associative implies reduce, right associative implies shift, and nonassociating
+implies error.
+.PP
+Conflicts resolved by precedence are not counted in the number of shift/reduce and reduce/reduce
+conflicts reported by Yacc.
+This means that mistakes in the specification of precedences may
+disguise errors in the input grammar; it is a good idea to be sparing
+with precedences, and use them in an essentially ``cookbook'' fashion,
+until some experience has been gained.
+The
+.I y.output
+file
+is very useful in deciding whether the parser is actually doing
+what was intended.
diff --git a/usr/doc/yacc/ss7 b/usr/doc/yacc/ss7

new file mode 100644 (file)

index 0000000..90df1b0
--- /dev/null
+++ b/usr/doc/yacc/ss7
@@ -0,0 +1,122 @@
+.SH
+7: Error Handling
+.PP
+Error handling is an extremely difficult area, and many of the problems are semantic ones.
+When an error is found, for example, it may be necessary to reclaim parse tree storage,
+delete or alter symbol table entries, and, typically, set switches to avoid generating any further output.
+.PP
+It is seldom acceptable to stop all processing when an error is found; it is more useful to continue
+scanning the input to find further syntax errors.
+This leads to the problem of getting the parser ``restarted'' after an error.
+A general class of algorithms to do this involves discarding a number of tokens
+from the input string, and attempting to adjust the parser so that input can continue.
+.PP
+To allow the user some control over this process,
+Yacc provides a simple, but reasonably general, feature.
+The token name ``error'' is reserved for error handling.
+This name can be used in grammar rules;
+in effect, it suggests places where errors are expected, and recovery might take place.
+The parser pops its stack until it enters a state where the token ``error'' is legal.
+It then behaves as if the token ``error'' were the current lookahead token,
+and performs the action encountered.
+The lookahead token is then reset to the token that caused the error.
+If no special error rules have been specified, the processing halts when an error is detected.
+.PP
+In order to prevent a cascade of error messages, the parser, after
+detecting an error, remains in error state until three tokens have been successfully
+read and shifted.
+If an error is detected when the parser is already in error state,
+no message is given, and the input token is quietly deleted.
+.PP
+As an example, a rule of the form
+.DS
+stat   :       error
+.DE
+would, in effect, mean that on a syntax error the parser would attempt to skip over the statement
+in which the error was seen.
+More precisely, the parser will
+scan ahead, looking for three tokens that might legally follow
+a statement, and start processing at the first of these; if
+the beginnings of statements are not sufficiently distinctive, it may make a
+false start in the middle of a statement, and end up reporting a
+second error where there is in fact no error.
+.PP
+Actions may be used with these special error rules.
+These actions might attempt to reinitialize tables, reclaim symbol table space, etc.
+.PP
+Error rules such as the above are very general, but difficult to control.
+Somewhat easier are rules such as
+.DS
+stat   :       error  \';\'
+.DE
+Here, when there is an error, the parser attempts to skip over the statement, but
+will do so by skipping to the next \';\'.
+All tokens after the error and before the next \';\' cannot be shifted, and are discarded.
+When the \';\' is seen, this rule will be reduced, and any ``cleanup''
+action associated with it performed.
+.PP
+Another form of error rule arises in interactive applications, where
+it may be desirable to permit a line to be reentered after an error.
+A possible error rule might be
+.DS
+input  :       error  \'\en\'  {  printf( "Reenter last line: " );  }  input
+                       {       $$  =  $4;  }
+.DE
+There is one potential difficulty with this approach;
+the parser must correctly process three input tokens before it
+admits that it has correctly resynchronized after the error.
+If the reentered line contains an error
+in the first two tokens, the parser deletes the offending tokens,
+and gives no message; this is clearly unacceptable.
+For this reason, there is a mechanism that
+can be used to force the parser
+to believe that an error has been fully recovered from.
+The statement
+.DS
+yyerrok ;
+.DE
+in an action
+resets the parser to its normal mode.
+The last example is better written
+.DS
+input  :       error  \'\en\'
+                       {       yyerrok;
+                               printf( "Reenter last line: " );   }
+               input
+                       {       $$  =  $4;  }
+       ;
+.DE
+.PP
+As mentioned above, the token seen immediately
+after the ``error'' symbol is the input token at which the
+error was discovered.
+Sometimes, this is inappropriate; for example, an
+error recovery action might
+take upon itself the job of finding the correct place to resume input.
+In this case,
+the previous lookahead token must be cleared.
+The statement
+.DS
+yyclearin ;
+.DE
+in an action will have this effect.
+For example, suppose the action after error
+were to call some sophisticated resynchronization routine,
+supplied by the user, that attempted to advance the input to the
+beginning of the next valid statement.
+After this routine was called, the next token returned by yylex would presumably
+be the first token in a legal statement;
+the old, illegal token must be discarded, and the error state reset.
+This could be done by a rule like
+.DS
+stat   :       error 
+                       {       resynch();
+                               yyerrok ;
+                               yyclearin ;   }
+       ;
+.DE
+.PP
+These mechanisms are admittedly crude, but do allow for a simple, fairly effective recovery of the parser
+from many errors;
+moreover, the user can get control to deal with
+the error actions required by other portions of the program.
diff --git a/usr/doc/yacc/ss8 b/usr/doc/yacc/ss8

new file mode 100644 (file)

index 0000000..f6b54e8
--- /dev/null
+++ b/usr/doc/yacc/ss8
@@ -0,0 +1,91 @@
+.SH
+8: The Yacc Environment
+.PP
+When the user inputs a specification
+to Yacc, the output is a file of C programs, called
+.I y.tab.c
+on most
+systems
+(due to local file system conventions, the names may differ from
+installation to installation).
+The function produced by Yacc is called
+.I yyparse \|;
+it is an integer valued function.
+When it is called, it in turn repeatedly calls
+.I yylex ,
+the lexical analyzer
+supplied by the user (see Section 3)
+to obtain input tokens.
+Eventually, either an error is detected, in which case
+(if no error recovery is possible)
+.I yyparse
+returns the value 1,
+or the lexical analyzer returns the endmarker token
+and the parser accepts.
+In this case,
+.I yyparse
+returns the value 0.
+.PP
+The user must provide a certain amount of environment for this
+parser in order to obtain a working program.
+For example, as with every C program, a program called
+.I main
+must be defined, that eventually calls
+.I yyparse .
+In addition, a routine called
+.I yyerror
+prints a message
+when a syntax error is detected.
+.PP
+These two routines must be supplied in one form or another by the
+user.
+To ease the initial effort of using Yacc, a library has been
+provided with default versions of
+.I main
+and
+.I yyerror .
+The name of this library is system dependent;
+on many systems the library is accessed by a
+.B \-ly
+argument to the loader.
+To show the triviality of these default programs, the source is
+given below:
+.DS
+main(){
+       return( yyparse() );
+       }
+.DE
+and
+.DS
+# include <stdio.h>
+
+yyerror(s) char *s; {
+       fprintf( stderr, "%s\en", s );
+       }
+.DE
+The argument to
+.I yyerror
+is a string containing an error message, usually
+the string ``syntax error''.
+The average application will want to do better than this.
+Ordinarily, the program should keep track of the input line number, and print it
+along with the message when a syntax error is detected.
+The external integer variable
+.I yychar
+contains the lookahead token number at the time the error was detected;
+this may be of some interest in giving better diagnostics.
+Since the
+.I main
+program is probably supplied by the user (to read arguments, etc.)
+the Yacc library is useful only in small
+projects, or in the earliest stages of larger ones.
+.PP
+The external integer variable
+.I yydebug
+is normally set to 0.
+If it is set to a nonzero value, the parser will output a
+verbose description of its actions, including
+a discussion of which input symbols have been read, and
+what the parser actions are.
+Depending on the operating environment,
+it may be possible to set this variable by using a debugging system.
diff --git a/usr/doc/yacc/ss9 b/usr/doc/yacc/ss9

new file mode 100644 (file)

index 0000000..99eea8d
--- /dev/null
+++ b/usr/doc/yacc/ss9
@@ -0,0 +1,167 @@
+.SH
+9: Hints for Preparing Specifications
+.PP
+This section contains miscellaneous hints on preparing efficient, easy to change,
+and clear specifications.
+The individual subsections are more or less
+independent.
+.SH
+Input Style
+.PP
+It is difficult to
+provide rules with substantial actions
+and still have a readable specification file.
+The following style hints owe much to Brian Kernighan.
+.IP a.
+Use all capital letters for token names, all lower case letters for
+nonterminal names.
+This rule comes under the heading of ``knowing who to blame when
+things go wrong.''
+.IP b.
+Put grammar rules and actions on separate lines.
+This allows either to be changed without
+an automatic need to change the other.
+.IP c.
+Put all rules with the same left hand side together.
+Put the left hand side in only once, and let all
+following rules begin with a vertical bar.
+.IP d.
+Put a semicolon only after the last rule with a given left hand side,
+and put the semicolon on a separate line.
+This allows new rules to be easily added.
+.IP e.
+Indent rule bodies by two tab stops, and action bodies by three
+tab stops.
+.PP
+The example in Appendix A is written following this style, as are
+the examples in the text of this paper (where space permits).
+The user must make up his own mind about these stylistic questions;
+the central problem, however, is to make the rules visible through
+the morass of action code.
+.SH
+Left Recursion
+.PP
+The algorithm used by the Yacc parser encourages so called ``left recursive''
+grammar rules: rules of the form
+.DS
+name   :       name  rest_of_rule  ;
+.DE
+These rules frequently arise when
+writing specifications of sequences and lists:
+.DS
+list   :       item
+       |       list  \',\'  item
+       ;
+.DE
+and
+.DS
+seq    :       item
+       |       seq  item
+       ;
+.DE
+In each of these cases, the first rule
+will be reduced for the first item only, and the second rule
+will be reduced for the second and all succeeding items.
+.PP
+With right recursive rules, such as
+.DS
+seq    :       item
+       |       item  seq
+       ;
+.DE
+the parser would be a bit bigger, and the items would be seen, and reduced,
+from right to left.
+More seriously, an internal stack in the parser
+would be in danger of overflowing if a very long sequence were read.
+Thus, the user should use left recursion wherever reasonable.
+.PP
+It is worth considering whether a sequence with zero
+elements has any meaning, and if so, consider writing
+the sequence specification with an empty rule:
+.DS
+seq    :       /* empty */
+       |       seq  item
+       ;
+.DE
+Once again, the first rule would always be reduced exactly once, before the
+first item was read,
+and then the second rule would be reduced once for each item read.
+Permitting empty sequences
+often leads to increased generality.
+However, conflicts might arise if Yacc is asked to decide
+which empty sequence it has seen, when it hasn't seen enough to
+know!
+.SH
+Lexical Tie-ins
+.PP
+Some lexical decisions depend on context.
+For example, the lexical analyzer might want to
+delete blanks normally, but not within quoted strings.
+Or names might be entered into a symbol table in declarations,
+but not in expressions.
+.PP
+One way of handling this situation is
+to create a global flag that is
+examined by the lexical analyzer, and set by actions.
+For example, suppose a program
+consists of 0 or more declarations, followed by 0 or more statements.
+Consider:
+.DS
+%{
+       int dflag;
+%}
+  ...  other declarations ...
+
+%%
+
+prog   :       decls  stats
+       ;
+
+decls  :       /* empty */
+                       {       dflag = 1;  }
+       |       decls  declaration
+       ;
+
+stats  :       /* empty */
+                       {       dflag = 0;  }
+       |       stats  statement
+       ;
+
+    ...  other rules ...
+.DE
+The flag
+.I dflag
+is now 0 when reading statements, and 1 when reading declarations,
+.ul
+except for the first token in the first statement.
+This token must be seen by the parser before it can tell that
+the declaration section has ended and the statements have
+begun.
+In many cases, this single token exception does not
+affect the lexical scan.
+.PP
+This kind of ``backdoor'' approach can be elaborated
+to a noxious degree.
+Nevertheless, it represents a way of doing some things
+that are difficult, if not impossible, to
+do otherwise.
+.SH
+Reserved Words
+.PP
+Some programming languages
+permit the user to
+use words like ``if'', which are normally reserved,
+as label or variable names, provided that such use does not
+conflict with the legal use of these names in the programming language.
+This is extremely hard to do in the framework of Yacc;
+it is difficult to pass information to the lexical analyzer
+telling it ``this instance of `if' is a keyword, and that instance is a variable''.
+The user can make a stab at it, using the
+mechanism described in the last subsection,
+but it is difficult.
+.PP
+A number of ways of making this easier are under advisement.
+Until then, it is better that the keywords be
+.I reserved \|;
+that is, be forbidden for use as variable names.
+There are powerful stylistic reasons for preferring this, anyway.
diff --git a/usr/doc/yacc/ssA b/usr/doc/yacc/ssA

new file mode 100644 (file)

index 0000000..564e801
--- /dev/null
+++ b/usr/doc/yacc/ssA
@@ -0,0 +1,182 @@
+.SH
+10: Advanced Topics
+.PP
+This section discusses a number of advanced features
+of Yacc.
+.SH
+Simulating Error and Accept in Actions
+.PP
+The parsing actions of error and accept can be simulated
+in an action by use of macros YYACCEPT and YYERROR.
+YYACCEPT causes
+.I yyparse
+to return the value 0;
+YYERROR causes
+the parser to behave as if the current input symbol
+had been a syntax error;
+.I yyerror
+is called, and error recovery takes place.
+These mechanisms can be used to simulate parsers
+with multiple endmarkers or context-sensitive syntax checking.
+.SH
+Accessing Values in Enclosing Rules.
+.PP
+An action may refer to values
+returned by actions to the left of the current rule.
+The mechanism is simply the same as with ordinary actions,
+a dollar sign followed by a digit, but in this case the
+digit may be 0 or negative.
+Consider
+.DS
+sent   :       adj  noun  verb  adj  noun
+                       {  \fIlook at the sentence\fR . . .  }
+       ;
+
+adj    :       THE             {       $$ = THE;  }
+       |       YOUNG   {       $$ = YOUNG;  }
+       . . .
+       ;
+
+noun   :       DOG
+                       {       $$ = DOG;  }
+       |       CRONE
+                       {       if( $0 == YOUNG ){
+                                       printf( "what?\en" );
+                                       }
+                               $$ = CRONE;
+                               }
+       ;
+       . . .
+.DE
+In the action following the word CRONE, a check is made that the
+preceding token shifted was not YOUNG.
+Obviously, this is only possible when a great deal is known about
+what might precede the symbol
+.I noun
+in the input.
+There is also a distinctly unstructured flavor about this.
+Nevertheless, at times this mechanism will save a great
+deal of trouble, especially when a few combinations are to
+be excluded from an otherwise regular structure.
+.SH
+Support for Arbitrary Value Types
+.PP
+By default, the values returned by actions and the lexical analyzer are integers.
+Yacc can also support
+values of other types, including structures.
+In addition, Yacc keeps track of the types, and inserts
+appropriate union member names so that the resulting parser will
+be strictly type checked.
+The Yacc value stack (see Section 4)
+is declared to be a
+.I union
+of the various types of values desired.
+The user declares the union, and associates union member names
+to each token and nonterminal symbol having a value.
+When the value is referenced through a $$ or $n construction,
+Yacc will automatically insert the appropriate union name, so that
+no unwanted conversions will take place.
+In addition, type checking commands such as
+.I Lint\|
+.[
+Johnson Lint Checker 1273
+.]
+will be far more silent.
+.PP
+There are three mechanisms used to provide for this typing.
+First, there is a way of defining the union; this must be
+done by the user since other programs, notably the lexical analyzer,
+must know about the union member names.
+Second, there is a way of associating a union member name with tokens
+and nonterminals.
+Finally, there is a mechanism for describing the type of those
+few values where Yacc can not easily determine the type.
+.PP
+To declare the union, the user includes in the declaration section:
+.DS
+%union  {
+       body of union ...
+       }
+.DE
+This declares the Yacc value stack,
+and the external variables
+.I yylval
+and
+.I yyval ,
+to have type equal to this union.
+If Yacc was invoked with the
+.B \-d
+option, the union declaration
+is copied onto the
+.I y.tab.h
+file.
+Alternatively,
+the union may be declared in a header file, and a typedef
+used to define the variable YYSTYPE to represent
+this union.
+Thus, the header file might also have said:
+.DS
+typedef union {
+       body of union ...
+       } YYSTYPE;
+.DE
+The header file must be included in the declarations
+section, by use of %{ and %}.
+.PP
+Once YYSTYPE is defined,
+the union member names must be associated
+with the various terminal and nonterminal names.
+The construction
+.DS
+< name >
+.DE
+is used to indicate a union member name.
+If this follows
+one of the
+keywords %token,
+%left, %right, and %nonassoc,
+the union member name is associated with the tokens listed.
+Thus, saying
+.DS
+%left  <optype>  \'+\'  \'\-\'
+.DE
+will cause any reference to values returned by these two tokens to be
+tagged with
+the union member name
+.I optype .
+Another keyword, %type, is
+used similarly to associate
+union member names with nonterminals.
+Thus, one might say
+.DS
+%type  <nodetype>  expr  stat
+.DE
+.PP
+There remain a couple of cases where these mechanisms are insufficient.
+If there is an action within a rule, the value returned
+by this action has no
+.I "a priori"
+type.
+Similarly, reference to left context values (such as $0 \- see the
+previous subsection ) leaves Yacc with no easy way of knowing the type.
+In this case, a type can be imposed on the reference by inserting
+a union member name, between < and >, immediately after
+the first $.
+An example of this usage is
+.DS
+rule   :       aaa  {  $<intval>$  =  3;  } bbb
+                       {       fun( $<intval>2, $<other>0 );  }
+       ;
+.DE
+This syntax has little to recommend it, but the situation arises rarely.
+.PP
+A sample specification is given in Appendix C.
+The facilities in this subsection are not triggered until they are used:
+in particular, the use of %type will turn on these mechanisms.
+When they are used, there is a fairly strict level of checking.
+For example, use of $n or $$ to refer to something with no defined type
+is diagnosed.
+If these facilities are not triggered, the Yacc value stack is used to
+hold
+.I int' s,
+as was true historically.
diff --git a/usr/doc/yacc/ssB b/usr/doc/yacc/ssB

new file mode 100644 (file)

index 0000000..2352a84
--- /dev/null
+++ b/usr/doc/yacc/ssB
@@ -0,0 +1,24 @@
+.SH
+11: Acknowledgements
+.PP
+Yacc owes much to a
+most stimulating collection of users, who have goaded
+me beyond my inclination, and frequently beyond my
+ability, in their endless search for ``one more feature''.
+Their irritating unwillingness to learn how to
+do things my way has usually led to my doing things their way;
+most of the time, they have been right.
+B. W. Kernighan, P. J. Plauger, S. I. Feldman, C. Imagna,
+M. E. Lesk,
+and A. Snyder will recognize some of their ideas in the current version
+of Yacc.
+C. B. Haley contributed to the error recovery algorithm.
+D. M. Ritchie, B. W. Kernighan, and M. O. Harris helped translate this document into English.
+Al Aho also deserves special credit for bringing
+the mountain to Mohammed, and other favors.
+.SG "MH-1273-SCJ-unix"
+.bp
+.[
+$LIST$
+.]
+.bp
diff --git a/usr/doc/yacc/ssa b/usr/doc/yacc/ssa

new file mode 100644 (file)

index 0000000..2ad7a8e
--- /dev/null
+++ b/usr/doc/yacc/ssa
@@ -0,0 +1,111 @@
+.SH
+Appendix A:  A Simple Example
+.PP
+This example gives the complete Yacc specification for a small desk calculator;
+the desk calculator has 26 registers, labeled ``a'' through ``z'', and accepts
+arithmetic expressions made up of the operators +, \-, *, /,
+% (mod operator), & (bitwise and), | (bitwise or), and assignment.
+If an expression at the top level is an assignment, the value is not
+printed; otherwise it is.
+As in C, an integer that begins with 0 (zero) is assumed to be octal;
+otherwise, it is assumed to be decimal.
+.PP
+As an example of a Yacc specification, the desk calculator
+does a reasonable job of showing how precedences and ambiguities
+are used, and demonstrating simple error recovery.
+The major oversimplifications are that the
+lexical analysis phase is much simpler than for most applications, and the
+output is produced immediately, line by line.
+Note the way that decimal and octal integers are read in by the grammar rules;
+This job is probably better done by the lexical analyzer.
+.sp
+.nf
+.ta .5i 1i 1.5i 2i 2.5i
+
+%{
+#  include  <stdio.h>
+#  include  <ctype.h>
+
+int  regs[26];
+int  base;
+
+%}
+
+%start  list
+
+%token  DIGIT  LETTER
+
+%left  \'|\'
+%left  \'&\'
+%left  \'+\'  \'\-\'
+%left  \'*\'  \'/\'  \'%\'
+%left  UMINUS      /*  supplies  precedence  for  unary  minus  */
+
+%%      /*  beginning  of  rules  section  */
+
+list   :       /*  empty  */
+       |       list  stat  \'\en\'
+       |       list  error  \'\en\'
+                       {       yyerrok;  }
+       ;
+
+stat   :       expr
+                       {       printf( "%d\en", $1 );  }
+       |       LETTER  \'=\'  expr
+                       {       regs[$1]  =  $3;  }
+       ;
+
+expr   :       \'(\'  expr  \')\'
+                       {       $$  =  $2;  }
+       |       expr  \'+\'  expr
+                       {       $$  =  $1  +  $3;  }
+       |       expr  \'\-\'  expr
+                       {       $$  =  $1  \-  $3;  }
+       |       expr  \'*\'  expr
+                       {       $$  =  $1  *  $3;  }
+       |       expr  \'/\'  expr
+                       {       $$  =  $1  /  $3;  }
+       |       expr  \'%\'  expr
+                       {       $$  =  $1  %  $3;  }
+       |       expr  \'&\'  expr
+                       {       $$  =  $1  &  $3;  }
+       |       expr  \'|\'  expr
+                       {       $$  =  $1  |  $3;  }
+       |       \'\-\'  expr        %prec  UMINUS
+                       {       $$  =  \-  $2;  }
+       |       LETTER
+                       {       $$  =  regs[$1];  }
+       |       number          
+       ;
+
+number :       DIGIT
+                       {       $$ = $1;    base  =  ($1==0)  ?  8  :  10;  }
+       |       number  DIGIT
+                       {       $$  =  base * $1  +  $2;  }
+       ;
+
+%%      /*  start  of  programs  */
+
+yylex() {              /*  lexical  analysis  routine  */
+              /*  returns  LETTER  for  a  lower  case  letter,  yylval = 0  through  25  */
+              /*  return  DIGIT  for  a  digit,  yylval = 0  through  9  */
+              /*  all  other  characters  are  returned  immediately  */
+
+       int  c;
+
+       while(  (c=getchar())  ==  \' \'  )  {  /*  skip  blanks  */  }
+
+       /*  c  is  now  nonblank  */
+
+       if(  islower(  c  )  )  {       
+               yylval  =  c  \-  \'a\';
+               return  (  LETTER  );
+               }
+       if(  isdigit(  c  )  )  {       
+               yylval  =  c  \-  \'0\';
+               return(  DIGIT  );
+               }
+       return(  c  );
+       }
+.fi
+.bp
diff --git a/usr/doc/yacc/ssb b/usr/doc/yacc/ssb

new file mode 100644 (file)

index 0000000..8d7cfc6
--- /dev/null
+++ b/usr/doc/yacc/ssb
@@ -0,0 +1,108 @@
+.SH
+Appendix B: Yacc Input Syntax
+.PP
+This Appendix has a description of the Yacc input syntax, as a Yacc specification.
+Context dependencies, etc., are not considered.
+Ironically, the Yacc input specification language
+is most naturally specified as an LR(2) grammar; the sticky
+part comes when an identifier is seen in a rule, immediately
+following an action.
+If this identifier is followed by a colon, it is the start of the
+next rule; otherwise
+it is a continuation of the current rule, which just happens to have
+an action embedded in it.
+As implemented, the lexical analyzer looks
+ahead after seeing an identifier, and
+decide whether the next token (skipping blanks, newlines, comments, etc.)
+is a colon.
+If so, it returns the token C_IDENTIFIER.
+Otherwise, it returns IDENTIFIER.
+Literals (quoted strings) are also returned as IDENTIFIERS,
+but never as part of C_IDENTIFIERs.
+.sp
+.nf
+.ta .6i 1.2i 1.8i 2.4i 3i 3.6i
+
+            /*  grammar  for  the  input  to  Yacc  */
+
+       /*  basic  entities  */
+%token IDENTIFIER      /*   includes  identifiers   and  literals  */
+%token C_IDENTIFIER    /*    identifier  (but  not  literal)  followed  by  colon    */
+%token NUMBER          /*    [0-9]+    */
+
+       /*  reserved  words:    %type  =>  TYPE,  %left  =>  LEFT,  etc.  */
+
+%token LEFT  RIGHT  NONASSOC  TOKEN  PREC  TYPE  START  UNION
+
+%token MARK    /*  the  %%  mark  */
+%token LCURL   /*  the  %{  mark  */
+%token RCURL   /*  the  %}  mark  */
+
+       /*  ascii  character  literals  stand  for  themselves  */
+
+%start spec
+
+%%
+
+spec   :       defs  MARK  rules  tail
+       ;
+
+tail   :       MARK    {    \fIIn  this  action,  eat  up  the  rest  of  the  file\fR    }
+       |       /*  empty:  the  second  MARK  is  optional  */
+       ;
+
+defs   :       /*  empty  */
+       |       defs  def
+       ;
+
+def    :       START  IDENTIFIER
+       |       UNION  {  \fICopy union  definition  to  output\fR  }
+       |       LCURL  {  \fICopy  C  code  to  output  file\fR   }  RCURL
+       |       ndefs  rword  tag  nlist
+       ;
+
+rword  :       TOKEN
+       |       LEFT
+       |       RIGHT
+       |       NONASSOC
+       |       TYPE
+       ;
+
+tag    :       /*  empty:  union  tag  is  optional  */
+       |       \'<\'  IDENTIFIER  \'>\'
+       ;
+
+nlist  :       nmno
+       |       nlist  nmno
+       |       nlist  \',\'  nmno
+       ;
+
+nmno   :       IDENTIFIER              /*  NOTE:  literal  illegal  with  %type  */
+       |       IDENTIFIER  NUMBER      /*  NOTE:  illegal  with  %type  */
+       ;
+
+       /*  rules  section  */
+
+rules  :       C_IDENTIFIER  rbody  prec
+       |       rules  rule
+       ;
+
+rule   :       C_IDENTIFIER  rbody  prec
+       |       '|'  rbody  prec
+       ;
+
+rbody  :       /*  empty  */
+       |       rbody  IDENTIFIER
+       |       rbody  act
+       ;
+
+act    :       \'{\'  {  \fICopy  action,  translate  $$,  etc.\fR  }  \'}\'
+       ;
+
+prec   :       /*  empty  */
+       |       PREC  IDENTIFIER
+       |       PREC  IDENTIFIER  act
+       |       prec  \';\'
+       ;
+.fi
+.bp
diff --git a/usr/doc/yacc/ssc b/usr/doc/yacc/ssc

new file mode 100644 (file)

index 0000000..1f4d036
--- /dev/null
+++ b/usr/doc/yacc/ssc
@@ -0,0 +1,309 @@
+.SH
+Appendix C: An Advanced Example
+.PP
+This Appendix gives an example of a grammar using some
+of the advanced features discussed in Section 10.
+The desk calculator example in Appendix A is
+modified to provide a desk calculator that
+does floating point interval arithmetic.
+The calculator understands floating point
+constants, the arithmetic operations +, \-, *, /,
+unary \-, and = (assignment), and has 26 floating
+point variables, ``a'' through ``z''.
+Moreover, it also understands
+.I intervals ,
+written
+.DS
+       ( x , y )
+.DE
+where
+.I x
+is less than or equal to
+.I y .
+There are 26 interval valued variables ``A'' through ``Z''
+that may also be used.
+The usage is similar to that in Appendix A; assignments
+return no value, and print nothing, while expressions print
+the (floating or interval) value.
+.PP
+This example explores a number of interesting features
+of Yacc and C.
+Intervals are represented by a structure, consisting of the
+left and right endpoint values, stored as
+.I double 's.
+This structure is given a type name, INTERVAL, by using
+.I typedef .
+The Yacc value stack can also contain floating point scalars, and
+integers (used to index into the arrays holding the variable values).
+Notice that this entire strategy depends strongly on being able to
+assign structures and unions in C.
+In fact, many of the actions call functions that return structures
+as well.
+.PP
+It is also worth noting the use of YYERROR to handle error conditions:
+division by an interval containing 0, and an interval presented in
+the wrong order.
+In effect, the error recovery mechanism of Yacc is used to throw away the
+rest of the offending line.
+.PP
+In addition to the mixing of types on the value stack,
+this grammar also demonstrates an interesting use of syntax to
+keep track of the type (e.g. scalar or interval) of intermediate
+expressions.
+Note that a scalar can be automatically promoted to an interval if
+the context demands an interval value.
+This causes a large number of conflicts when the grammar is run through
+Yacc: 18 Shift/Reduce and 26 Reduce/Reduce.
+The problem can be seen by looking at the two input lines:
+.DS
+       2.5 + ( 3.5 \- 4. )
+.DE
+and
+.DS
+       2.5 + ( 3.5 , 4. )
+.DE
+Notice that the 2.5 is to be used in an interval valued expression
+in the second example, but this fact is not known until
+the ``,'' is read; by this time, 2.5 is finished, and the parser cannot go back
+and change its mind.
+More generally, it might be necessary to look ahead an arbitrary number of
+tokens to decide whether to convert a scalar to an interval.
+This problem is evaded by having two rules for each binary interval
+valued operator: one when the left operand is a scalar, and one when
+the left operand is an interval.
+In the second case, the right operand must be an interval,
+so the conversion will be applied automatically.
+Despite this evasion, there are still many cases where the
+conversion may be applied or not, leading to the above
+conflicts.
+They are resolved by listing the rules that yield scalars first
+in the specification file; in this way, the conflicts will
+be resolved in the direction of keeping scalar
+valued expressions scalar valued until they are forced to become
+intervals.
+.PP
+This way of handling multiple types is very instructive, but not very general.
+If there were many kinds of expression types, instead of just two,
+the number of rules needed would increase dramatically, and the conflicts
+even more dramatically.
+Thus, while this example is instructive, it is better practice in a
+more normal programming language environment to
+keep the type information as part of the value, and not as part
+of the grammar.
+.PP
+Finally, a word about the lexical analysis.
+The only unusual feature is the treatment of floating point constants.
+The C library routine
+.I atof
+is used to do the actual conversion from a character string
+to a double precision value.
+If the lexical analyzer detects an error,
+it responds by returning a token that
+is illegal in the grammar, provoking a syntax error
+in the parser, and thence error recovery.
+.DS L
+
+%{
+
+#  include  <stdio.h>
+#  include  <ctype.h>
+
+typedef  struct  interval  {
+       double  lo,  hi;
+       }  INTERVAL;
+
+INTERVAL  vmul(),  vdiv();
+
+double  atof();
+
+double  dreg[ 26 ];
+INTERVAL  vreg[ 26 ];
+
+%}
+
+%start    lines
+
+%union    {
+       int  ival;
+       double  dval;
+       INTERVAL  vval;
+       }
+
+%token  <ival>  DREG  VREG     /*  indices  into  dreg,  vreg  arrays  */
+
+%token  <dval>  CONST          /*  floating  point  constant  */
+
+%type  <dval>  dexp            /*  expression  */
+
+%type  <vval>  vexp            /*  interval  expression  */
+
+       /*  precedence  information  about  the  operators  */
+
+%left  \'+\'  \'\-\'
+%left  \'*\'  \'/\'
+%left  UMINUS        /*  precedence  for  unary  minus  */
+
+%%
+
+lines  :       /*  empty  */
+       |       lines  line
+       ;
+
+line   :       dexp  \'\en\'
+                       {       printf(  "%15.8f\en",  $1  );  }
+       |       vexp  \'\en\'
+                       {       printf(  "(%15.8f  ,  %15.8f  )\en",  $1.lo,  $1.hi  );  }
+       |       DREG  \'=\'  dexp  \'\en\'
+                       {       dreg[$1]  =  $3;  }
+       |       VREG  \'=\'  vexp  \'\en\'
+                       {       vreg[$1]  =  $3;  }
+       |       error  \'\en\'
+                       {       yyerrok;  }
+       ;
+
+dexp   :       CONST
+       |       DREG
+                       {       $$  =  dreg[$1];  }
+       |       dexp  \'+\'  dexp
+                       {       $$  =  $1  +  $3;  }
+       |       dexp  \'\-\'  dexp
+                       {       $$  =  $1  \-  $3;  }
+       |       dexp  \'*\'  dexp
+                       {       $$  =  $1  *  $3;  }
+       |       dexp  \'/\'  dexp
+                       {       $$  =  $1  /  $3;  }
+       |       \'\-\'  dexp    %prec  UMINUS
+                       {       $$  =  \- $2;  }
+       |       \'(\'  dexp  \')\'
+                       {       $$  =  $2;  }
+       ;
+
+vexp   :       dexp
+                       {       $$.hi  =  $$.lo  =  $1;  }
+       |       \'(\'  dexp  \',\'  dexp  \')\'
+                       {
+                       $$.lo  =  $2;
+                       $$.hi  =  $4;
+                       if(  $$.lo  >  $$.hi  ){
+                               printf(  "interval  out  of  order\en"  );
+                               YYERROR;
+                               }
+                       }
+       |       VREG
+                       {       $$  =  vreg[$1];    }
+       |       vexp  \'+\'  vexp
+                       {       $$.hi  =  $1.hi  +  $3.hi;
+                               $$.lo  =  $1.lo  +  $3.lo;    }
+       |       dexp  \'+\'  vexp
+                       {       $$.hi  =  $1  +  $3.hi;
+                               $$.lo  =  $1  +  $3.lo;    }
+       |       vexp  \'\-\'  vexp
+                       {       $$.hi  =  $1.hi  \-  $3.lo;
+                               $$.lo  =  $1.lo  \-  $3.hi;    }
+       |       dexp  \'\-\'  vexp
+                       {       $$.hi  =  $1  \-  $3.lo;
+                               $$.lo  =  $1  \-  $3.hi;    }
+       |       vexp  \'*\'  vexp
+                       {       $$  =  vmul(  $1.lo,  $1.hi,  $3  );  }
+       |       dexp  \'*\'  vexp
+                       {       $$  =  vmul(  $1,  $1,  $3  );  }
+       |       vexp  \'/\'  vexp
+                       {       if(  dcheck(  $3  )  )  YYERROR;
+                               $$  =  vdiv(  $1.lo,  $1.hi,  $3  );  }
+       |       dexp  \'/\'  vexp
+                       {       if(  dcheck(  $3  )  )  YYERROR;
+                               $$  =  vdiv(  $1,  $1,  $3  );  }
+       |       \'\-\'  vexp    %prec  UMINUS
+                       {       $$.hi  =  \-$2.lo;    $$.lo  =  \-$2.hi;    }
+       |       \'(\'  vexp  \')\'
+                       {       $$  =  $2;  }
+       ;
+
+%%
+
+#  define  BSZ  50        /*  buffer  size  for  floating  point  numbers  */
+
+       /*  lexical  analysis  */
+
+yylex(){
+       register  c;
+
+       while(  (c=getchar())  ==  \' \'  ){  /*  skip  over  blanks  */  }
+
+       if(  isupper(  c  )  ){
+               yylval.ival  =  c  \-  \'A\';
+               return(  VREG  );
+               }
+       if(  islower(  c  )  ){
+               yylval.ival  =  c  \-  \'a\';
+               return(  DREG  );
+               }
+
+       if(  isdigit(  c  )  ||  c==\'.\'  ){
+               /*  gobble  up  digits,  points,  exponents  */
+
+               char  buf[BSZ+1],  *cp  =  buf;
+               int  dot  =  0,  exp  =  0;
+
+               for(  ;  (cp\-buf)<BSZ  ;  ++cp,c=getchar()  ){
+
+                       *cp  =  c;
+                       if(  isdigit(  c  )  )  continue;
+                       if(  c  ==  \'.\'  ){
+                               if(  dot++  ||  exp  )  return(  \'.\'  );    /*  will  cause  syntax  error  */
+                               continue;
+                               }
+
+                       if(  c  ==  \'e\'  ){
+                               if(  exp++  )  return(  \'e\'  );    /*  will  cause  syntax  error  */
+                               continue;
+                               }
+
+                       /*  end  of  number  */
+                       break;
+                       }
+               *cp  =  \'\e0\';
+               if(  (cp\-buf)  >=  BSZ  )  printf(  "constant  too  long:  truncated\en"  );
+               else  ungetc(  c,  stdin  );    /*  push  back  last  char  read  */
+               yylval.dval  =  atof(  buf  );
+               return(  CONST  );
+               }
+       return(  c  );
+       }
+
+INTERVAL  hilo(  a,  b,  c,  d  )  double  a,  b,  c,  d;  {
+       /*  returns  the  smallest  interval  containing  a,  b,  c,  and  d  */
+       /*  used  by  *,  /  routines  */
+       INTERVAL  v;
+
+       if(  a>b  )  {  v.hi  =  a;    v.lo  =  b;  }
+       else  {  v.hi  =  b;    v.lo  =  a;  }
+
+       if(  c>d  )  {
+               if(  c>v.hi  )  v.hi  =  c;
+               if(  d<v.lo  )  v.lo  =  d;
+               }
+       else  {
+               if(  d>v.hi  )  v.hi  =  d;
+               if(  c<v.lo  )  v.lo  =  c;
+               }
+       return(  v  );
+       }
+
+INTERVAL  vmul(  a,  b,  v  )  double  a,  b;    INTERVAL  v;  {
+       return(  hilo(  a*v.hi,  a*v.lo,  b*v.hi,  b*v.lo  )  );
+       }
+
+dcheck(  v  )  INTERVAL  v;  {
+       if(  v.hi  >=  0.  &&  v.lo  <=  0.  ){
+               printf(  "divisor  interval  contains  0.\en"  );
+               return(  1  );
+               }
+       return(  0  );
+       }
+
+INTERVAL  vdiv(  a,  b,  v  )  double  a,  b;    INTERVAL  v;  {
+       return(  hilo(  a/v.hi,  a/v.lo,  b/v.hi,  b/v.lo  )  );
+       }
+.DE
+.bp
diff --git a/usr/doc/yacc/ssd b/usr/doc/yacc/ssd

new file mode 100644 (file)

index 0000000..ef51f13
--- /dev/null
+++ b/usr/doc/yacc/ssd
@@ -0,0 +1,38 @@
+.SH
+Appendix D: Old Features Supported but not Encouraged
+.PP
+This Appendix mentions synonyms and features which are supported for historical
+continuity, but, for various reasons, are not encouraged.
+.IP 1.
+Literals may also be delimited by double quotes ``"''.
+.IP 2.
+Literals may be more than one character long.
+If all the characters are alphabetic, numeric, or \_, the type number of the literal is defined,
+just as if the literal did not have the quotes around it.
+Otherwise, it is difficult to find the value for such literals.
+.IP
+The use of multi-character literals is likely to mislead those unfamiliar with
+Yacc, since it suggests that Yacc is doing a job which must be actually done by the lexical analyzer.
+.IP 3.
+Most places where % is legal, backslash ``\e'' may be used.
+In particular, \e\e is the same as %%, \eleft the same as %left, etc.
+.IP 4.
+There are a number of other synonyms:
+.DS
+%< is the same as %left
+%> is the same as %right
+%binary and %2 are the same as %nonassoc
+%0 and %term are the same as %token
+%= is the same as %prec
+.DE
+.IP 5.
+Actions may also have the form
+.DS
+={ . . . }
+.DE
+and the curly braces can be dropped if the action is a
+single C statement.
+.IP 6.
+C code between %{ and %} used to be permitted at the
+head of the rules section, as well as in the
+declaration section.
author	Stephen C. Johnson <scj@research.uucp>
	Wed, 10 Jan 1979 20:23:00 +0000 (15:23 -0500)
committer	Stephen C. Johnson <scj@research.uucp>
	Wed, 10 Jan 1979 20:23:00 +0000 (15:23 -0500)
usr/doc/lint	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss..	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss0	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss1	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss2	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss3	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss4	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss5	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss6	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss7	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss8	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ss9	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ssA	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ssB	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ssa	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ssb	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ssc	[new file with mode: 0644]	patch \| blob
usr/doc/yacc/ssd	[new file with mode: 0644]	patch \| blob