git.subgeniuskitty.com - unix-history/.git/blame_incremental

... / ...

Commit	Line	Data
	1	.hc ~
	2	.bd I 2
	3	.de TS
	4	.br
	5	.nf
	6	.SP 1v
	7	.ul 0
	8	..
	9	.de TE
	10	.SP 1v
	11	.fi
	12	..
	13	.de PT
	14	.if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
	15	.if \\n%>1 'sp
	16	..
	17	.ND July 21, 1975
	18	.RP
	19	.TM 75-1274-15 39199 39199-11
	20	.TL
	21	Lex \- A Lexical Analyzer ~Generator~
	22	.AU ``MH 2C-569'' 6377
	23	M. E. Lesk and E. Schmidt
	24	.AI
	25	.MH
	26	.AB
	27	.sp
	28	.bd I 2
	29	.nr PS 8
	30	.nr VS 9
	31	.ps 8
	32	.vs 9p
	33	Lex helps write programs whose control flow
	34	is directed by instances of regular
	35	expressions in the input stream.
	36	It is well suited for editor-script type transformations and
	37	for segmenting input in preparation for
	38	a parsing routine.
	39	.PP
	40	Lex source is a table of regular expressions and corresponding program fragments.
	41	The table is translated to a program
	42	which reads an input stream, copying it to an output stream
	43	and partitioning the input
	44	into strings which match the given expressions.
	45	As each such string is recognized the corresponding
	46	program fragment is executed.
	47	The recognition of the expressions
	48	is performed by a deterministic finite automaton
	49	generated by Lex.
	50	The program fragments written by the user are executed in the order in which the
	51	corresponding regular expressions occur in the input stream.
	52	.if n .if \n(tm .ig
	53	.PP
	54	The lexical analysis
	55	programs written with Lex accept ambiguous specifications
	56	and choose the longest
	57	match possible at each input point.
	58	If necessary, substantial look~ahead
	59	is performed on the input, but the
	60	input stream will be backed up to the
	61	end of the current partition, so that the user
	62	has general freedom to manipulate it.
	63	.PP
	64	Lex can generate analyzers in either C or Ratfor, a language
	65	which can be translated automatically to portable Fortran.
	66	It is available on the PDP-11 UNIX, Honeywell GCOS,
	67	and IBM OS systems.
	68	This manual, however, will only discuss generating analyzers
	69	in C on the UNIX system, which is the only supported
	70	form of Lex under UNIX Version 7.
	71	Lex is designed to simplify
	72	interfacing with Yacc, for those
	73	with access to this compiler-compiler system.
	74	..
	75	.nr PS 9
	76	.nr VS 11
	77	.AE
	78	.SH
	79	.ce 1
	80	Table of Contents
	81	.LP
	82	.ce 100
	83	.TS
	84	r 1l 2r .
	85	1. Introduction. 1
	86	2. Lex Source. 3
	87	3. Lex Regular Expressions. 3
	88	4. Lex Actions. 5
	89	5. Ambiguous Source Rules. 7
	90	6. Lex Source Definitions. 8
	91	7. Usage. 8
	92	8. Lex and Yacc. 9
	93	9. Examples. 10
	94	10. Left Context Sensitivity. 11
	95	11. Character Set. 12
	96	12. Summary of Source Format. 12
	97	13. Caveats and Bugs. 13
	98	14. Acknowledgments. 13
	99	15. References. 13
	100	.TE
	101	.ce 0
	102	.2C
	103	.NH
	104	Introduction.
	105	.PP
	106	Lex is a program generator designed for
	107	lexical processing of character input streams.
	108	It accepts a high-level, problem oriented specification
	109	for character string matching,
	110	and
	111	produces a program in a general purpose language which recognizes
	112	regular expressions.
	113	The regular expressions are specified by the user in the
	114	source specifications given to Lex.
	115	The Lex written code recognizes these expressions
	116	in an input stream and partitions the input stream into
	117	strings matching the expressions. At the bound~aries
	118	between strings
	119	program sections
	120	provided by the user are executed.
	121	The Lex source file associates the regular expressions and the
	122	program fragments.
	123	As each expression appears in the input to the program written by Lex,
	124	the corresponding fragment is executed.
	125	.PP
	126	.de MH
	127	Bell Laboratories, Murray Hill, NJ 07974.
	128	..
	129	The user supplies the additional code
	130	beyond expression matching
	131	needed to complete his tasks, possibly
	132	including code written by other generators.
	133	The program that recognizes the expressions is generated in the
	134	general purpose programming language employed for the
	135	user's program fragments.
	136	Thus, a high level expression
	137	language is provided to write the string expressions to be
	138	matched while the user's freedom to write actions
	139	is unimpaired.
	140	This avoids forcing the user who wishes to use a string manipulation
	141	language for input analysis to write processing programs in the same
	142	and often inappropriate string handling language.
	143	.PP
	144	Lex is not a complete language, but rather a generator representing
	145	a new language feature which can be added to
	146	different programming languages, called ``host languages.''
	147	Just as general purpose languages
	148	can produce code to run on different computer hardware,
	149	Lex can write code in different host languages.
	150	The host language is used for the output code generated by Lex
	151	and also for the program fragments added by the user.
	152	Compatible run-time libraries for the different host languages
	153	are also provided.
	154	This makes Lex adaptable to different environments and
	155	different users.
	156	Each application
	157	may be directed to the combination of hardware and host language appropriate
	158	to the task, the user's background, and the properties of local
	159	implementations.
	160	At present, the only supported host language is C,
	161	although Fortran (in the form of Ratfor [2] has been available
	162	in the past.
	163	Lex itself exists on UNIX, GCOS, and OS/370; but the
	164	code generated by Lex may be taken anywhere the appropriate
	165	compilers exist.
	166	.PP
	167	Lex turns the user's expressions and actions
	168	(called
	169	.ul
	170	source
	171	in this memo) into the host general-purpose language;
	172	the generated program is named
	173	.ul
	174	yylex.
	175	The
	176	.ul
	177	yylex
	178	program
	179	will recognize expressions
	180	in a stream
	181	(called
	182	.ul
	183	input
	184	in this memo)
	185	and perform the specified actions for each expression as it is detected.
	186	See Figure 1.
	187	.GS
	188	.TS
	189	center;
	190	l _ r
	191	l\|c\|r
	192	l _ r
	193	l _ r
	194	l\|c\|r
	195	l _ r
	196	c s s
	197	c s s.
	198
	199	Source \(-> Lex \(-> yylex
	200
	201	.sp 2
	202
	203	Input \(-> yylex \(-> Output
	204
	205	.sp
	206	An overview of Lex
	207	.sp
	208	Figure 1
	209	.TE
	210	.GE
	211	.PP
	212	For a trivial example, consider a program to delete
	213	from the input
	214	all blanks or tabs at the ends of lines.
	215	.TS
	216	center;
	217	l l.
	218	%%
	219	[ \et]+$ ;
	220	.TE
	221	is all that is required.
	222	The program
	223	contains a %% delimiter to mark the beginning of the rules, and
	224	one rule.
	225	This rule contains a regular expression
	226	which matches one or more
	227	instances of the characters blank or tab
	228	(written \et for visibility, in accordance with the C language convention)
	229	just prior to the end of a line.
	230	The brackets indicate the character
	231	class made of blank and tab; the + indicates ``one or more ...'';
	232	and the $ indicates ``end of line,'' as in QED.
	233	No action is specified,
	234	so the program generated by Lex (yylex) will ignore these characters.
	235	Everything else will be copied.
	236	To change any remaining
	237	string of blanks or tabs to a single blank,
	238	add another rule:
	239	.TS
	240	center;
	241	l l.
	242	%%
	243	[ \et]+$ ;
	244	[ \et]+ printf(" ");
	245	.TE
	246	The finite automaton generated for this
	247	source will scan for both rules at once,
	248	observing at
	249	the termination of the string of blanks or tabs
	250	whether or not there is a newline character, and executing
	251	the desired rule action.
	252	The first rule matches all strings of blanks or tabs
	253	at the end of lines, and the second
	254	rule all remaining strings of blanks or tabs.
	255	.PP
	256	Lex can be used alone for simple transformations, or
	257	for analysis and statistics gathering on a lexical level.
	258	Lex can also be used with a parser generator
	259	to perform the lexical analysis phase; it is particularly
	260	easy to interface Lex and Yacc [3].
	261	Lex programs recognize only regular expressions;
	262	Yacc writes parsers that accept a large class of context free grammars,
	263	but require a lower level analyzer to recognize input tokens.
	264	Thus, a combination of Lex and Yacc is often appropriate.
	265	When used as a preprocessor for a later parser generator,
	266	Lex is used to partition the input stream,
	267	and the parser generator assigns structure to
	268	the resulting pieces.
	269	The flow of control
	270	in such a case (which might be the first half of a compiler,
	271	for example) is shown in Figure 2.
	272	Additional programs,
	273	written by other generators
	274	or by hand, can
	275	be added easily to programs written by Lex.
	276	.BS 2
	277	.TS
	278	center;
	279	l c c c l
	280	l c c c l
	281	l c c c l
	282	l _ c _ l
	283	l\|c\|c\|c\|l
	284	l _ c _ l
	285	l c c c l
	286	l _ c _ l
	287	l\|c\|c\|c\|l
	288	l _ c _ l
	289	l c s s l
	290	l c s s l.
	291	lexical grammar
	292	rules rules
	293	\(da \(da
	294
	295	Lex Yacc
	296
	297	\(da \(da
	298
	299	Input \(-> yylex \(-> yyparse \(-> Parsed input
	300
	301	.sp
	302	Lex with Yacc
	303	.sp
	304	Figure 2
	305	.TE
	306	.BE
	307	Yacc users
	308	will realize that the name
	309	.ul
	310	yylex
	311	is what Yacc expects its lexical analyzer to be named,
	312	so that the use of this name by Lex simplifies
	313	interfacing.
	314	.PP
	315	Lex generates a deterministic finite automaton from the regular expressions
	316	in the source [4].
	317	The automaton is interpreted, rather than compiled, in order
	318	to save space.
	319	The result is still a fast analyzer.
	320	In particular, the time taken by a Lex program
	321	to recognize and partition an input stream is
	322	proportional to the length of the input.
	323	The number of Lex rules or
	324	the complexity of the rules is
	325	not important in determining speed,
	326	unless rules which include
	327	forward context require a significant amount of re~scanning.
	328	What does increase with the number and complexity of rules
	329	is the size of the finite
	330	automaton, and therefore the size of the program
	331	generated by Lex.
	332	.PP
	333	In the program written by Lex, the user's fragments
	334	(representing the
	335	.ul
	336	actions
	337	to be performed as each regular expression
	338	is found)
	339	are gathered
	340	as cases of a switch.
	341	The automaton interpreter directs the control flow.
	342	Opportunity is provided for the user to insert either
	343	declarations or additional statements in the routine containing
	344	the actions, or to
	345	add subroutines outside this action routine.
	346	.PP
	347	Lex is not limited to source which can
	348	be interpreted on the basis of one character
	349	look~ahead.
	350	For example,
	351	if there are two rules, one looking for
	352	.I ab
	353	and another for
	354	.I abcdefg ,
	355	and the input stream is
	356	.I abcdefh ,
	357	Lex will recognize
	358	.I ab
	359	and leave
	360	the input pointer just before
	361	.I "cd. . ."
	362	Such backup is more costly
	363	than the processing of simpler languages.
	364	.2C
	365	.NH
	366	Lex Source.
	367	.PP
	368	The general format of Lex source is:
	369	.TS
	370	center;
	371	l.
	372	{definitions}
	373	%%
	374	{rules}
	375	%%
	376	{user subroutines}
	377	.TE
	378	where the definitions and the user subroutines
	379	are often omitted.
	380	The second
	381	.I %%
	382	is optional, but the first is required
	383	to mark the beginning of the rules.
	384	The absolute minimum Lex program is thus
	385	.TS
	386	center;
	387	l.
	388	%%
	389	.TE
	390	(no definitions, no rules) which translates into a program
	391	which copies the input to the output unchanged.
	392	.PP
	393	In the outline of Lex programs shown above, the
	394	.I
	395	rules
	396	.R
	397	represent the user's control
	398	decisions; they are a table, in which the left column
	399	contains
	400	.I
	401	regular expressions
	402	.R
	403	(see section 3)
	404	and the right column contains
	405	.I
	406	actions,
	407	.R
	408	program fragments to be executed when the expressions
	409	are recognized.
	410	Thus an individual rule might appear
	411	.TS
	412	center;
	413	l l.
	414	integer printf("found keyword INT");
	415	.TE
	416	to look for the string
	417	.I integer
	418	in the input stream and
	419	print the message ``found keyword INT'' whenever it appears.
	420	In this example the host procedural language is C and
	421	the C library function
	422	.I
	423	printf
	424	.R
	425	is used to print the string.
	426	The end
	427	of the expression is indicated by the first blank or tab character.
	428	If the action is merely a single C expression,
	429	it can just be given on the right side of the line; if it is
	430	compound, or takes more than a line, it should be enclosed in
	431	braces.
	432	As a slightly more useful example, suppose it is desired to
	433	change a number of words from British to American spelling.
	434	Lex rules such as
	435	.TS
	436	center;
	437	l l.
	438	colour printf("color");
	439	mechanise printf("mechanize");
	440	petrol printf("gas");
	441	.TE
	442	would be a start. These rules are not quite enough,
	443	since
	444	the word
	445	.I petroleum
	446	would become
	447	.I gaseum ;
	448	a way of dealing
	449	with this will be described later.
	450	.2C
	451	.NH
	452	Lex Regular Expressions.
	453	.PP
	454	The definitions of regular expressions are very similar to those
	455	in QED [5].
	456	A regular
	457	expression specifies a set of strings to be matched.
	458	It contains text characters (which match the corresponding
	459	characters in the strings being compared)
	460	and operator characters (which specify
	461	repetitions, choices, and other features).
	462	The letters of the alphabet and the digits are
	463	always text characters; thus the regular expression
	464	.TS
	465	center;
	466	l l.
	467	integer
	468	.TE
	469	matches the string
	470	.ul
	471	integer
	472	wherever it appears
	473	and the expression
	474	.TS
	475	center;
	476	l.
	477	a57D
	478	.TE
	479	looks for the string
	480	.ul
	481	a57D.
	482	.PP
	483	.I
	484	Operators.
	485	.R
	486	The operator characters are
	487	.TS
	488	center;
	489	l.
	490	" \e [ ] ^ \- ? . \(** + \| ( ) $ / { } % < >
	491	.TE
	492	and if they are to be used as text characters, an escape
	493	should be used.
	494	The quotation mark operator (")
	495	indicates that whatever is contained between a pair of quotes
	496	is to be taken as text characters.
	497	Thus
	498	.TS
	499	center;
	500	l.
	501	xyz"++"
	502	.TE
	503	matches the string
	504	.I xyz++
	505	when it appears. Note that a part of a string may be quoted.
	506	It is harmless but unnecessary to quote an ordinary
	507	text character; the expression
	508	.TS
	509	center;
	510	l.
	511	"xyz++"
	512	.TE
	513	is the same as the one above.
	514	Thus by quoting every non-alphanumeric character
	515	being used as a text character, the user can avoid remembering
	516	the list above of current
	517	operator characters, and is safe should further extensions to Lex
	518	lengthen the list.
	519	.PP
	520	An operator character may also be turned into a text character
	521	by preceding it with \e as in
	522	.TS
	523	center;
	524	l.
	525	xyz\e+\e+
	526	.TE
	527	which
	528	is another, less readable, equivalent of the above expressions.
	529	Another use of the quoting mechanism is to get a blank into
	530	an expression; normally, as explained above, blanks or tabs end
	531	a rule.
	532	Any blank character not contained within [\\|] (see below) must
	533	be quoted.
	534	Several normal C escapes with \e
	535	are recognized: \en is newline, \et is tab, and \eb is backspace.
	536	To enter \e itself, use \e\e.
	537	Since newline is illegal in an expression, \en must be used;
	538	it is not
	539	required to escape tab and backspace.
	540	Every character but blank, tab, newline and the list above is always
	541	a text character.
	542	.PP
	543	.I
	544	Character classes.
	545	.R
	546	Classes of characters can be specified using the operator pair [\\|].
	547	The construction
	548	.I [abc]
	549	matches a
	550	single character, which may be
	551	.I a ,
	552	.I b ,
	553	or
	554	.I c .
	555	Within square brackets,
	556	most operator meanings are ignored.
	557	Only three characters are special:
	558	these are \e \(mi and ^. The \(mi character
	559	indicates ranges. For example,
	560	.TS
	561	center;
	562	l.
	563	[a\(miz0\(mi9<>_]
	564	.TE
	565	indicates the character class containing all the lower case letters,
	566	the digits,
	567	the angle brackets, and underline.
	568	Ranges may be given in either order.
	569	Using \(mi between any pair of characters which are
	570	not both upper case letters, both lower case letters, or both digits
	571	is implementation dependent and will get a warning message.
	572	(E.g., [0\-z] in ASCII is many more characters
	573	than it is in EBCDIC).
	574	If it is desired to include the
	575	character \(mi in a character class, it should be first or
	576	last; thus
	577	.TS
	578	center;
	579	l.
	580	[\(mi+0\(mi9]
	581	.TE
	582	matches all the digits and the two signs.
	583	.PP
	584	In character classes,
	585	the ^ operator must appear as the first character
	586	after the left bracket; it indicates that the resulting string
	587	is to be complemented with respect to the computer character set.
	588	Thus
	589	.TS
	590	center;
	591	l.
	592	[^abc]
	593	.TE
	594	matches all characters except a, b, or c, including
	595	all special or control characters; or
	596	.TS
	597	center;
	598	l.
	599	[^a\-zA\-Z]
	600	.TE
	601	is any character which is not a letter.
	602	The \e character provides the usual escapes within
	603	character class brackets.
	604	.PP
	605	.I
	606	Arbitrary character.
	607	.R
	608	To match almost any character, the operator character
	609	.TS
	610	center;
	611	l.
	612	\&.
	613	.TE
	614	is the class of all characters except newline.
	615	Escaping into octal is possible although non-portable:
	616	.TS
	617	center;
	618	l.
	619	[\e40\-\e176]
	620	.TE
	621	matches all printable characters in the ASCII character set, from octal
	622	40 (blank) to octal 176 (tilde).
	623	.PP
	624	.I
	625	Optional expressions.
	626	.R
	627	The operator
	628	.I ?
	629	indicates
	630	an optional element of an expression.
	631	Thus
	632	.TS
	633	center;
	634	l.
	635	ab?c
	636	.TE
	637	matches either
	638	.I ac
	639	or
	640	.I abc .
	641	.PP
	642	.I
	643	Repeated expressions.
	644	.R
	645	Repetitions of classes are indicated by the operators
	646	.I \(**
	647	and
	648	.I + .
	649	.TS
	650	center;
	651	l.
	652	\f2a\(**\f1
	653	.TE
	654	is any number of consecutive
	655	.I a
	656	characters, including zero; while
	657	.TS
	658	center;
	659	l.
	660	a+
	661	.TE
	662	is one or more instances of
	663	.I a.
	664	For example,
	665	.TS
	666	center;
	667	l.
	668	[a\-z]+
	669	.TE
	670	is all strings of lower case letters.
	671	And
	672	.TS
	673	center;
	674	l.
	675	[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
	676	.TE
	677	indicates all alphanumeric strings with a leading
	678	alphabetic character.
	679	This is a typical expression for recognizing identifiers in
	680	computer languages.
	681	.PP
	682	.I
	683	Alternation and Grouping.
	684	.R
	685	The operator \|
	686	indicates alternation:
	687	.TS
	688	center;
	689	l.
	690	(ab\\|\|\\|cd)
	691	.TE
	692	matches either
	693	.ul
	694	ab
	695	or
	696	.ul
	697	cd.
	698	Note that parentheses are used for grouping, although
	699	they are
	700	not necessary on the outside level;
	701	.TS
	702	center;
	703	l.
	704	ab\\|\|\\|cd
	705	.TE
	706	would have sufficed.
	707	Parentheses
	708	can be used for more complex expressions:
	709	.TS
	710	center;
	711	l.
	712	(ab\\|\|\\|cd+)?(ef)\(**
	713	.TE
	714	matches such strings as
	715	.I abefef ,
	716	.I efefef ,
	717	.I cdef ,
	718	or
	719	.I cddd\\| ;
	720	but not
	721	.I abc ,
	722	.I abcd ,
	723	or
	724	.I abcdef .
	725	.PP
	726	.I
	727	Context sensitivity.
	728	.R
	729	Lex will recognize a small amount of surrounding
	730	context. The two simplest operators for this are
	731	.I ^
	732	and
	733	.I $ .
	734	If the first character of an expression is
	735	.I ^ ,
	736	the expression will only be matched at the beginning
	737	of a line (after a newline character, or at the beginning of
	738	the input stream).
	739	This can never conflict with the other meaning of
	740	.I ^ ,
	741	complementation
	742	of character classes, since that only applies within
	743	the [\\|] operators.
	744	If the very last character is
	745	.I $ ,
	746	the expression will only be matched at the end of a line (when
	747	immediately followed by newline).
	748	The latter operator is a special case of the
	749	.I /
	750	operator character,
	751	which indicates trailing context.
	752	The expression
	753	.TS
	754	center;
	755	l.
	756	ab/cd
	757	.TE
	758	matches the string
	759	.I ab ,
	760	but only if followed by
	761	.ul
	762	cd.
	763	Thus
	764	.TS
	765	center;
	766	l.
	767	ab$
	768	.TE
	769	is the same as
	770	.TS
	771	center;
	772	l.
	773	ab/\en
	774	.TE
	775	Left context is handled in Lex by
	776	.I
	777	start conditions
	778	.R
	779	as explained in section 10. If a rule is only to be executed
	780	when the Lex automaton interpreter is in start condition
	781	.I
	782	x,
	783	.R
	784	the rule should be prefixed by
	785	.TS
	786	center;
	787	l.
	788	<x>
	789	.TE
	790	using the angle bracket operator characters.
	791	If we considered ``being at the beginning of a line'' to be
	792	start condition
	793	.I ONE ,
	794	then the ^ operator
	795	would be equivalent to
	796	.TS
	797	center;
	798	l.
	799	<ONE>
	800	.TE
	801	Start conditions are explained more fully later.
	802	.PP
	803	.I
	804	Repetitions and Definitions.
	805	.R
	806	The operators {} specify
	807	either repetitions (if they enclose numbers)
	808	or
	809	definition expansion (if they enclose a name). For example
	810	.TS
	811	center;
	812	l.
	813	{digit}
	814	.TE
	815	looks for a predefined string named
	816	.I digit
	817	and inserts it
	818	at that point in the expression.
	819	The definitions are given in the first part of the Lex
	820	input, before the rules.
	821	In contrast,
	822	.TS
	823	center;
	824	l.
	825	a{1,5}
	826	.TE
	827	looks for 1 to 5 occurrences of
	828	.I a .
	829	.PP
	830	Finally, initial
	831	.I %
	832	is special, being the separator
	833	for Lex source segments.
	834	.2C
	835	.NH
	836	Lex Actions.
	837	.PP
	838	When an expression written as above is matched, Lex
	839	executes the corresponding action. This section describes
	840	some features of Lex which aid in writing actions. Note
	841	that there is a default action, which
	842	consists of copying the input to the output. This
	843	is performed on all strings not otherwise matched. Thus
	844	the Lex user who wishes to absorb the entire input, without
	845	producing any output, must provide rules to match everything.
	846	When Lex is being used with Yacc, this is the normal
	847	situation.
	848	One may consider that actions are what is done instead of
	849	copying the input to the output; thus, in general,
	850	a rule which merely copies can be omitted.
	851	Also, a character combination
	852	which is omitted from the rules
	853	and which appears as input
	854	is likely to be printed on the output, thus calling
	855	attention to the gap in the rules.
	856	.PP
	857	One of the simplest things that can be done is to ignore
	858	the input. Specifying a C null statement, \fI;\fR as an action
	859	causes this result. A frequent rule is
	860	.TS
	861	center;
	862	l l.
	863	[ \et\en] ;
	864	.TE
	865	which causes the three spacing characters (blank, tab, and newline)
	866	to be ignored.
	867	.PP
	868	Another easy way to avoid writing actions is the action character
	869	\|, which indicates that the action for this rule is the action
	870	for the next rule.
	871	The previous example could also have been written
	872	.TS
	873	center;
	874	l l.
	875	" " \|
	876	"\et" \|
	877	"\en" ;
	878	.TE
	879	with the same result, although in different style.
	880	The quotes around \en and \et are not required.
	881	.PP
	882	In more complex actions, the user
	883	will
	884	often want to know the actual text that matched some expression
	885	like
	886	.I [a\(miz]+ .
	887	Lex leaves this text in an external character
	888	array named
	889	.I
	890	yytext.
	891	.R
	892	Thus, to print the name found,
	893	a rule like
	894	.TS
	895	center;
	896	l l.
	897	[a\-z]+ printf("%s", yytext);
	898	.TE
	899	will print
	900	the string in
	901	.I
	902	yytext.
	903	.R
	904	The C function
	905	.I
	906	printf
	907	.R
	908	accepts a format argument and data to be printed;
	909	in this case, the format is ``print string'' (% indicating
	910	data conversion, and
	911	.I s
	912	indicating string type),
	913	and the data are the characters
	914	in
	915	.I
	916	yytext.
	917	.R
	918	So this just places
	919	the matched string
	920	on the output.
	921	This action
	922	is so common that
	923	it may be written as ECHO:
	924	.TS
	925	center;
	926	l l.
	927	[a\-z]+ ECHO;
	928	.TE
	929	is the same as the above.
	930	Since the default action is just to
	931	print the characters found, one might ask why
	932	give a rule, like this one, which merely specifies
	933	the default action?
	934	Such rules are often required
	935	to avoid matching some other rule
	936	which is not desired. For example, if there is a rule
	937	which matches
	938	.I read
	939	it will normally match the instances of
	940	.I read
	941	contained in
	942	.I bread
	943	or
	944	.I readjust ;
	945	to avoid
	946	this,
	947	a rule
	948	of the form
	949	.I [a\(miz]+
	950	is needed.
	951	This is explained further below.
	952	.PP
	953	Sometimes it is more convenient to know the end of what
	954	has been found; hence Lex also provides a count
	955	.I
	956	yyleng
	957	.R
	958	of the number of characters matched.
	959	To count both the number
	960	of words and the number of characters in words in the input, the user might write
	961	.TS
	962	center;
	963	l l.
	964	[a\-zA\-Z]+ {words++; chars += yyleng;}
	965	.TE
	966	which accumulates in
	967	.ul
	968	chars
	969	the number
	970	of characters in the words recognized.
	971	The last character in the string matched can
	972	be accessed by
	973	.TS
	974	center;
	975	l.
	976	yytext[yyleng\-1]
	977	.TE
	978	.PP
	979	Occasionally, a Lex
	980	action may decide that a rule has not recognized the correct
	981	span of characters.
	982	Two routines are provided to aid with this situation.
	983	First,
	984	.I
	985	yymore()
	986	.R
	987	can be called to indicate that the next input expression recognized is to be
	988	tacked on to the end of this input. Normally,
	989	the next input string would overwrite the current
	990	entry in
	991	.I
	992	yytext.
	993	.R
	994	Second,
	995	.I
	996	yyless (n)
	997	.R
	998	may be called to indicate that not all the characters matched
	999	by the currently successful expression are wanted right now.
	1000	The argument
	1001	.I
	1002	n
	1003	.R
	1004	indicates the number of characters
	1005	in
	1006	.I
	1007	yytext
	1008	.R
	1009	to be retained.
	1010	Further characters previously matched
	1011	are
	1012	returned to the input. This provides the same sort of
	1013	look~ahead offered by the / operator,
	1014	but in a different form.
	1015	.PP
	1016	.I
	1017	Example:
	1018	.R
	1019	Consider a language which defines
	1020	a string as a set of characters between quotation (") marks, and provides that
	1021	to include a " in a string it must be preceded by a \e. The
	1022	regular expression which matches that is somewhat confusing,
	1023	so that it might be preferable to write
	1024	.TS
	1025	center;
	1026	l l.
	1027	\e"[^"]\(** {
	1028	if (yytext[yyleng\-1] == \(fm\e\e\(fm)
	1029	yymore();
	1030	else
	1031	... normal user processing
	1032	}
	1033	.TE
	1034	which will, when faced with a string such as
	1035	.I
	1036	"abc\e"def\\|"
	1037	.R
	1038	first match
	1039	the five characters
	1040	\fI"abc\e\\|\fR;
	1041	then
	1042	the call to
	1043	.I yymore()
	1044	will
	1045	cause the next part of the string,
	1046	\fI"def\\|\fR,
	1047	to be tacked on the end.
	1048	Note that the final quote terminating the string should be picked
	1049	up in the code labeled ``normal processing''.
	1050	.PP
	1051	The function
	1052	.I
	1053	yyless()
	1054	.R
	1055	might be used to reprocess
	1056	text in various circumstances. Consider the C problem of distinguishing
	1057	the ambiguity of ``=\(mia''.
	1058	Suppose it is desired to treat this as ``=\(mi a''
	1059	but print a message. A rule might be
	1060	.TS
	1061	center;
	1062	l l.
	1063	=\(mi[a\-zA\-Z] {
	1064	printf("Operator (=\(mi) ambiguous\en");
	1065	yyless(yyleng\-1);
	1066	... action for =\(mi ...
	1067	}
	1068	.TE
	1069	which prints a message, returns the letter after the
	1070	operator to the input stream, and treats the operator as ``=\(mi''.
	1071	Alternatively it might be desired to treat this as ``= \(mia''.
	1072	To do this, just return the minus
	1073	sign as well as the letter to the input:
	1074	.TS
	1075	center;
	1076	l l.
	1077	=\(mi[a\-zA\-Z] {
	1078	printf("Operator (=\(mi) ambiguous\en");
	1079	yyless(yyleng\-2);
	1080	... action for = ...
	1081	}
	1082	.TE
	1083	will perform the other interpretation.
	1084	Note that the expressions for the two cases might more easily
	1085	be written
	1086	.TS
	1087	center;
	1088	l l.
	1089	=\(mi/[A\-Za\-z]
	1090	.TE
	1091	in the first case and
	1092	.TS
	1093	center;
	1094	l.
	1095	=/\-[A\-Za\-z]
	1096	.TE
	1097	in the second;
	1098	no backup would be required in the rule action.
	1099	It is not necessary to recognize the whole identifier
	1100	to observe the ambiguity.
	1101	The
	1102	possibility of ``=\(mi3'', however, makes
	1103	.TS
	1104	center;
	1105	l.
	1106	=\(mi/[^ \et\en]
	1107	.TE
	1108	a still better rule.
	1109	.PP
	1110	In addition to these routines, Lex also permits
	1111	access to the I/O routines
	1112	it uses.
	1113	They are:
	1114	.IP 1)
	1115	.I
	1116	input()
	1117	.R
	1118	which returns the next input character;
	1119	.IP 2)
	1120	.I
	1121	output(c)
	1122	.R
	1123	which writes the character
	1124	.I
	1125	c
	1126	.R
	1127	on the output; and
	1128	.IP 3)
	1129	.I
	1130	unput(c)
	1131	.R
	1132	pushes the character
	1133	.I
	1134	c
	1135	.R
	1136	back onto the input stream to be read later by
	1137	.I
	1138	input().
	1139	.R
	1140	.LP
	1141	By default these routines are provided as macro definitions,
	1142	but the user can override them and supply private versions.
	1143	These routines
	1144	define the relationship between external files and
	1145	internal characters, and must all be retained
	1146	or modified consistently.
	1147	They may be redefined, to
	1148	cause input or output to be transmitted to or from strange
	1149	places, including other programs or internal memory;
	1150	but the character set used must be consistent in all routines;
	1151	a value of zero returned by
	1152	.I
	1153	input
	1154	.R
	1155	must mean end of file; and
	1156	the relationship between
	1157	.I
	1158	unput
	1159	.R
	1160	and
	1161	.I
	1162	input
	1163	.R
	1164	must be retained
	1165	or the Lex look~ahead will not work.
	1166	Lex does not look ahead at all if it does not have to,
	1167	but every rule ending in
	1168	.ft I
	1169	+ \(** ?
	1170	.ft R
	1171	or
	1172	.ft I
	1173	$
	1174	.ft R
	1175	or containing
	1176	.ft I
	1177	/
	1178	.ft R
	1179	implies look~ahead.
	1180	Look~ahead is also necessary to match an expression that is a prefix
	1181	of another expression.
	1182	See below for a discussion of the character set used by Lex.
	1183	The standard Lex library imposes
	1184	a 100 character limit on backup.
	1185	.PP
	1186	Another Lex library routine that the user will sometimes want
	1187	to redefine is
	1188	.I
	1189	yywrap()
	1190	.R
	1191	which is called whenever Lex reaches an end-of-file.
	1192	If
	1193	.I
	1194	yywrap
	1195	.R
	1196	returns a 1, Lex continues with the normal wrapup on end of input.
	1197	Sometimes, however, it is convenient to arrange for more
	1198	input to arrive
	1199	from a new source.
	1200	In this case, the user should provide
	1201	a
	1202	.I
	1203	yywrap
	1204	.R
	1205	which
	1206	arranges for new input and
	1207	returns 0. This instructs Lex to continue processing.
	1208	The default
	1209	.I
	1210	yywrap
	1211	.R
	1212	always returns 1.
	1213	.PP
	1214	This routine is also a convenient place
	1215	to print tables, summaries, etc. at the end
	1216	of a program. Note that it is not
	1217	possible to write a normal rule which recognizes
	1218	end-of-file; the only access to this condition is
	1219	through
	1220	.I
	1221	yywrap.
	1222	.R
	1223	In fact, unless a private version of
	1224	.I
	1225	input()
	1226	.R
	1227	is supplied
	1228	a file containing nulls
	1229	cannot be handled,
	1230	since a value of 0 returned by
	1231	.I
	1232	input
	1233	.R
	1234	is taken to be end-of-file.
	1235	.PP
	1236	.2C
	1237	.NH
	1238	Ambiguous Source Rules.
	1239	.PP
	1240	Lex can handle ambiguous specifications.
	1241	When more than one expression can match the
	1242	current input, Lex chooses as follows:
	1243	.IP 1)
	1244	The longest match is preferred.
	1245	.IP 2)
	1246	Among rules which matched the same number of characters,
	1247	the rule given first is preferred.
	1248	.LP
	1249	Thus, suppose the rules
	1250	.TS
	1251	center;
	1252	l l.
	1253	integer keyword action ...;
	1254	[a\-z]+ identifier action ...;
	1255	.TE
	1256	to be given in that order. If the input is
	1257	.I integers ,
	1258	it is taken as an identifier, because
	1259	.I [a\-z]+
	1260	matches 8 characters while
	1261	.I integer
	1262	matches only 7.
	1263	If the input is
	1264	.I integer ,
	1265	both rules match 7 characters, and
	1266	the keyword rule is selected because it was given first.
	1267	Anything shorter (e.g. \fIint\fR\\|) will
	1268	not match the expression
	1269	.I integer
	1270	and so the identifier interpretation is used.
	1271	.PP
	1272	The principle of preferring the longest
	1273	match makes rules containing
	1274	expressions like
	1275	.I \&.\(**
	1276	dangerous.
	1277	For example,
	1278	.TS
	1279	center;
	1280	l.
	1281	\&\(fm.\(**\(fm
	1282	.TE
	1283	might seem a good way of recognizing
	1284	a string in single quotes.
	1285	But it is an invitation for the program to read far
	1286	ahead, looking for a distant
	1287	single quote.
	1288	Presented with the input
	1289	.TS
	1290	center;
	1291	l l.
	1292	\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
	1293	.TE
	1294	the above expression will match
	1295	.TS
	1296	center;
	1297	l l.
	1298	\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
	1299	.TE
	1300	which is probably not what was wanted.
	1301	A better rule is of the form
	1302	.TS
	1303	center;
	1304	l.
	1305	\&\(fm[^\(fm\en]\(**\(fm
	1306	.TE
	1307	which, on the above input, will stop
	1308	after
	1309	.I \(fmfirst\(fm .
	1310	The consequences
	1311	of errors like this are mitigated by the fact
	1312	that the
	1313	.I \&.
	1314	operator will not match newline.
	1315	Thus expressions like
	1316	.I \&.\(**
	1317	stop on the
	1318	current line.
	1319	Don't try to defeat this with expressions like
	1320	.I [.\en]+
	1321	or
	1322	equivalents;
	1323	the Lex generated program will try to read
	1324	the entire input file, causing
	1325	internal buffer overflows.
	1326	.PP
	1327	Note that Lex is normally partitioning
	1328	the input stream, not searching for all possible matches
	1329	of each expression.
	1330	This means that each character is accounted for
	1331	once and only once.
	1332	For example, suppose it is desired to
	1333	count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
	1334	Some Lex rules to do this might be
	1335	.TS
	1336	center;
	1337	l l.
	1338	she s++;
	1339	he h++;
	1340	\en \|
	1341	\&. ;
	1342	.TE
	1343	where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
	1344	Remember that . does not include newline.
	1345	Since \fIshe\fR includes \fIhe\fR, Lex will normally
	1346	.I
	1347	not
	1348	.R
	1349	recognize
	1350	the instances of \fIhe\fR included in \fIshe\fR,
	1351	since once it has passed a \fIshe\fR those characters are gone.
	1352	.PP
	1353	Sometimes the user would like to override this choice. The action
	1354	REJECT
	1355	means ``go do the next alternative.''
	1356	It causes whatever rule was second choice after the current
	1357	rule to be executed.
	1358	The position of the input pointer is adjusted accordingly.
	1359	Suppose the user really wants to count the included instances of \fIhe\fR:
	1360	.TS
	1361	center;
	1362	l l.
	1363	she {s++; REJECT;}
	1364	he {h++; REJECT;}
	1365	\en \|
	1366	\&. ;
	1367	.TE
	1368	these rules are one way of changing the previous example
	1369	to do just that.
	1370	After counting each expression, it is rejected; whenever appropriate,
	1371	the other expression will then be counted. In this example, of course,
	1372	the user could note that \fIshe\fR includes \fIhe\fR but not
	1373	vice versa, and omit the REJECT action on \fIhe\fR;
	1374	in other cases, however, it
	1375	would not be possible a priori to tell
	1376	which input characters
	1377	were in both classes.
	1378	.PP
	1379	Consider the two rules
	1380	.TS
	1381	center;
	1382	l l.
	1383	a[bc]+ { ... ; REJECT;}
	1384	a[cd]+ { ... ; REJECT;}
	1385	.TE
	1386	If the input is
	1387	.I ab ,
	1388	only the first rule matches,
	1389	and on
	1390	.I ad
	1391	only the second matches.
	1392	The input string
	1393	.I accb
	1394	matches the first rule for four characters
	1395	and then the second rule for three characters.
	1396	In contrast, the input
	1397	.I accd
	1398	agrees with
	1399	the second rule for four characters and then the first
	1400	rule for three.
	1401	.PP
	1402	In general, REJECT is useful whenever
	1403	the purpose of Lex is not to partition the input
	1404	stream but to detect all examples of some items
	1405	in the input, and the instances of these items
	1406	may overlap or include each other.
	1407	Suppose a digram table of the input is desired;
	1408	normally the digrams overlap, that is the word
	1409	.I the
	1410	is considered to contain
	1411	both
	1412	.I th
	1413	and
	1414	.I he .
	1415	Assuming a two-dimensional array named
	1416	.ul
	1417	digram
	1418	to be incremented, the appropriate
	1419	source is
	1420	.TS
	1421	center;
	1422	l l.
	1423	%%
	1424	[a\-z][a\-z] {digram[yytext[0]][yytext[1]]++; REJECT;}
	1425	. ;
	1426	\en ;
	1427	.TE
	1428	where the REJECT is necessary to pick up
	1429	a letter pair beginning at every character, rather than at every
	1430	other character.
	1431	.2C
	1432	.NH
	1433	Lex Source Definitions.
	1434	.PP
	1435	Remember the format of the Lex
	1436	source:
	1437	.TS
	1438	center;
	1439	l.
	1440	{definitions}
	1441	%%
	1442	{rules}
	1443	%%
	1444	{user routines}
	1445	.TE
	1446	So far only the rules have been described. The user needs
	1447	additional options,
	1448	though, to define variables for use in his program and for use
	1449	by Lex.
	1450	These can go either in the definitions section
	1451	or in the rules section.
	1452	.PP
	1453	Remember that Lex is turning the rules into a program.
	1454	Any source not intercepted by Lex is copied
	1455	into the generated program. There are three classes
	1456	of such things.
	1457	.IP 1)
	1458	Any line which is not part of a Lex rule or action
	1459	which begins with a blank or tab is copied into
	1460	the Lex generated program.
	1461	Such source input prior to the first %% delimiter will be external
	1462	to any function in the code; if it appears immediately after the first
	1463	%%,
	1464	it appears in an appropriate place for declarations
	1465	in the function written by Lex which contains the actions.
	1466	This material must look like program fragments,
	1467	and should precede the first Lex rule.
	1468	.IP
	1469	As a side effect of the above, lines which begin with a blank
	1470	or tab, and which contain a comment,
	1471	are passed through to the generated program.
	1472	This can be used to include comments in either the Lex source or
	1473	the generated code. The comments should follow the host
	1474	language convention.
	1475	.IP 2)
	1476	Anything included between lines containing
	1477	only
	1478	.I %{
	1479	and
	1480	.I %}
	1481	is
	1482	copied out as above. The delimiters are discarded.
	1483	This format permits entering text like preprocessor statements that
	1484	must begin in column 1,
	1485	or copying lines that do not look like programs.
	1486	.IP 3)
	1487	Anything after the third %% delimiter, regardless of formats, etc.,
	1488	is copied out after the Lex output.
	1489	.PP
	1490	Definitions intended for Lex are given
	1491	before the first %% delimiter. Any line in this section
	1492	not contained between %{ and %}, and begining
	1493	in column 1, is assumed to define Lex substitution strings.
	1494	The format of such lines is
	1495	.TS
	1496	center;
	1497	l l.
	1498	name translation
	1499	.TE
	1500	and it
	1501	causes the string given as a translation to
	1502	be associated with the name.
	1503	The name and translation
	1504	must be separated by at least one blank or tab, and the name must begin with a letter.
	1505	The translation can then be called out
	1506	by the {name} syntax in a rule.
	1507	Using {D} for the digits and {E} for an exponent field,
	1508	for example, might abbreviate rules to recognize numbers:
	1509	.TS
	1510	center;
	1511	l l.
	1512	D [0\-9]
	1513	E [DEde][\-+]?{D}+
	1514	%%
	1515	{D}+ printf("integer");
	1516	{D}+"."{D}\(**({E})? \|
	1517	{D}\(**"."{D}+({E})? \|
	1518	{D}+{E} printf("real");
	1519	.TE
	1520	Note the first two rules for real numbers;
	1521	both require a decimal point and contain
	1522	an optional exponent field,
	1523	but the first requires at least one digit before the
	1524	decimal point and the second requires at least one
	1525	digit after the decimal point.
	1526	To correctly handle the problem
	1527	posed by a Fortran expression such as
	1528	.I 35.EQ.I ,
	1529	which does not contain a real number, a context-sensitive
	1530	rule such as
	1531	.TS
	1532	center;
	1533	l l.
	1534	[0\-9]+/"."EQ printf("integer");
	1535	.TE
	1536	could be used in addition to the normal rule for integers.
	1537	.PP
	1538	The definitions
	1539	section may also contain other commands, including the
	1540	selection of a host language, a character set table,
	1541	a list of start conditions, or adjustments to the default
	1542	size of arrays within Lex itself for larger source programs.
	1543	These possibilities
	1544	are discussed below under ``Summary of Source Format,''
	1545	section 12.
	1546	.2C
	1547	.NH
	1548	Usage.
	1549	.PP
	1550	There are two steps in
	1551	compiling a Lex source program.
	1552	First, the Lex source must be turned into a generated program
	1553	in the host general purpose language.
	1554	Then this program must be compiled and loaded, usually with
	1555	a library of Lex subroutines.
	1556	The generated program
	1557	is on a file named
	1558	.I lex.yy.c .
	1559	The I/O library is defined in terms of the C standard
	1560	library [6].
	1561	.PP
	1562	The C programs generated by Lex are slightly different
	1563	on OS/370, because the
	1564	OS compiler is less powerful than the UNIX or GCOS compilers,
	1565	and does less at compile time.
	1566	C programs generated on GCOS and UNIX are the same.
	1567	.PP
	1568	.I
	1569	UNIX.
	1570	.R
	1571	The library is accessed by the loader flag
	1572	.I \-ll .
	1573	So an appropriate
	1574	set of commands is
	1575	.KS
	1576	.in 5
	1577	lex source
	1578	cc lex.yy.c \-ll
	1579	.in 0
	1580	.KE
	1581	The resulting program is placed on the usual file
	1582	.I
	1583	a.out
	1584	.R
	1585	for later execution.
	1586	To use Lex with Yacc see below.
	1587	Although the default Lex I/O routines use the C standard library,
	1588	the Lex automata themselves do not do so;
	1589	if private versions of
	1590	.I
	1591	input,
	1592	output
	1593	.R
	1594	and
	1595	.I unput
	1596	are given, the library can be avoided.
	1597	.PP
	1598	.2C
	1599	.NH
	1600	Lex and Yacc.
	1601	.PP
	1602	If you want to use Lex with Yacc, note that what Lex writes is a program
	1603	named
	1604	.I
	1605	yylex(),
	1606	.R
	1607	the name required by Yacc for its analyzer.
	1608	Normally, the default main program on the Lex library
	1609	calls this routine, but if Yacc is loaded, and its main
	1610	program is used, Yacc will call
	1611	.I
	1612	yylex().
	1613	.R
	1614	In this case each Lex rule should end with
	1615	.TS
	1616	center;
	1617	l.
	1618	return(token);
	1619	.TE
	1620	where the appropriate token value is returned.
	1621	An easy way to get access
	1622	to Yacc's names for tokens is to
	1623	compile the Lex output file as part of
	1624	the Yacc output file by placing the line
	1625	.TS
	1626	center;
	1627	l.
	1628	# include "lex.yy.c"
	1629	.TE
	1630	in the last section of Yacc input.
	1631	Supposing the grammar to be
	1632	named ``good'' and the lexical rules to be named ``better''
	1633	the UNIX command sequence can just be:
	1634	.TS
	1635	center;
	1636	l.
	1637	yacc good
	1638	lex better
	1639	cc y.tab.c \-ly \-ll
	1640	.TE
	1641	The Yacc library (\-ly) should be loaded before the Lex library,
	1642	to obtain a main program which invokes the Yacc parser.
	1643	The generations of Lex and Yacc programs can be done in
	1644	either order.
	1645	.2C
	1646	.NH
	1647	Examples.
	1648	.PP
	1649	As a trivial problem, consider copying an input file while
	1650	adding 3 to every positive number divisible by 7.
	1651	Here is a suitable Lex source program
	1652	.TS
	1653	center;
	1654	l l.
	1655	%%
	1656	int k;
	1657	[0\-9]+ {
	1658	k = atoi(yytext);
	1659	if (k%7 == 0)
	1660	printf("%d", k+3);
	1661	else
	1662	printf("%d",k);
	1663	}
	1664	.TE
	1665	to do just that.
	1666	The rule [0\-9]+ recognizes strings of digits;
	1667	.I
	1668	atoi
	1669	.R
	1670	converts the digits to binary
	1671	and stores the result in
	1672	.ul
	1673	k.
	1674	The operator % (remainder) is used to check whether
	1675	.ul
	1676	k
	1677	is divisible by 7; if it is,
	1678	it is incremented by 3 as it is written out.
	1679	It may be objected that this program will alter such
	1680	input items as
	1681	.I 49.63
	1682	or
	1683	.I X7 .
	1684	Furthermore, it increments the absolute value
	1685	of all negative numbers divisible by 7.
	1686	To avoid this, just add a few more rules after the active one,
	1687	as here:
	1688	.TS
	1689	center;
	1690	l l.
	1691	%%
	1692	int k;
	1693	\-?[0\-9]+ {
	1694	k = atoi(yytext);
	1695	printf("%d", k%7 == 0 ? k+3 : k);
	1696	}
	1697	\-?[0\-9.]+ ECHO;
	1698	[A-Za-z][A-Za-z0-9]+ ECHO;
	1699	.TE
	1700	Numerical strings containing
	1701	a ``.'' or preceded by a letter will be picked up by
	1702	one of the last two rules, and not changed.
	1703	The
	1704	.I if\-else
	1705	has been replaced by
	1706	a C conditional expression to save space;
	1707	the form
	1708	.ul
	1709	a?b:c
	1710	means ``if
	1711	.I a
	1712	then
	1713	.I b
	1714	else
	1715	.I c ''.
	1716	.PP
	1717	For an example of statistics gathering, here
	1718	is a program which histograms the lengths
	1719	of words, where a word is defined as a string of letters.
	1720	.TS
	1721	center;
	1722	l l.
	1723	int lengs[100];
	1724	%%
	1725	[a\-z]+ lengs[yyleng]++;
	1726	\&. \|
	1727	\en ;
	1728	%%
	1729	.T&
	1730	l s.
	1731	yywrap()
	1732	{
	1733	int i;
	1734	printf("Length No. words\en");
	1735	for(i=0; i<100; i++)
	1736	if (lengs[i] > 0)
	1737	printf("%5d%10d\en",i,lengs[i]);
	1738	return(1);
	1739	}
	1740	.TE
	1741	This program
	1742	accumulates the histogram, while producing no output. At the end
	1743	of the input it prints the table.
	1744	The final statement
	1745	.I
	1746	return(1);
	1747	.R
	1748	indicates that Lex is to perform wrapup. If
	1749	.I
	1750	yywrap
	1751	.R
	1752	returns zero (false)
	1753	it implies that further input is available
	1754	and the program is
	1755	to continue reading and processing.
	1756	To provide a
	1757	.I
	1758	yywrap
	1759	.R
	1760	that never
	1761	returns true causes an infinite loop.
	1762	.PP
	1763	As a larger example,
	1764	here are some parts of a program written by N. L. Schryer
	1765	to convert double precision Fortran to single precision Fortran.
	1766	Because Fortran does not distinguish upper and lower case letters,
	1767	this routine begins by defining a set of classes including
	1768	both cases of each letter:
	1769	.TS
	1770	center;
	1771	l l.
	1772	a [aA]
	1773	b [bB]
	1774	c [cC]
	1775	\&...
	1776	z [zZ]
	1777	.TE
	1778	An additional class recognizes white space:
	1779	.TS
	1780	center;
	1781	l l.
	1782	W [ \et]\(**
	1783	.TE
	1784	The first rule changes
	1785	``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
	1786	.TS
	1787	center;
	1788	l.
	1789	{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
	1790	printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
	1791	}
	1792	.TE
	1793	Care is taken throughout this program to preserve the case
	1794	(upper or lower)
	1795	of the original program.
	1796	The conditional operator is used to
	1797	select the proper form of the keyword.
	1798	The next rule copies continuation card indications to
	1799	avoid confusing them with constants:
	1800	.TS
	1801	center;
	1802	l l.
	1803	^" "[^ 0] ECHO;
	1804	.TE
	1805	In the regular expression, the quotes surround the
	1806	blanks.
	1807	It is interpreted as
	1808	``beginning of line, then five blanks, then
	1809	anything but blank or zero.''
	1810	Note the two different meanings of
	1811	.I ^ .
	1812	There follow some rules to change double precision
	1813	constants to ordinary floating constants.
	1814	.TS
	1815	center;
	1816	l.
	1817	[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ \|
	1818	[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ \|
	1819	"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
	1820	/\( convert constants \(/
	1821	for(p=yytext; \(**p != 0; p++)
	1822	{
	1823	if (\(p == \(fmd\(fm \|\| \(p == \(fmD\(fm)
	1824	\(**p=+ \(fme\(fm\- \(fmd\(fm;
	1825	ECHO;
	1826	}
	1827	.TE
	1828	After the floating point constant is recognized, it is
	1829	scanned by the
	1830	.ul
	1831	for
	1832	loop
	1833	to find the letter
	1834	.I d
	1835	or
	1836	.I D .
	1837	The program than adds
	1838	.c
	1839	.I \(fme\(fm\-\(fmd\(fm ,
	1840	which converts
	1841	it to the next letter of the alphabet.
	1842	The modified constant, now single-precision,
	1843	is written out again.
	1844	There follow a series of names which must be respelled to remove
	1845	their initial \fId\fR.
	1846	By using the
	1847	array
	1848	.I
	1849	yytext
	1850	.R
	1851	the same action suffices for all the names (only a sample of
	1852	a rather long list is given here).
	1853	.TS
	1854	center;
	1855	l l.
	1856	{d}{s}{i}{n} \|
	1857	{d}{c}{o}{s} \|
	1858	{d}{s}{q}{r}{t} \|
	1859	{d}{a}{t}{a}{n} \|
	1860	\&...
	1861	{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
	1862	.TE
	1863	Another list of names must have initial \fId\fR changed to initial \fIa\fR:
	1864	.TS
	1865	center;
	1866	l l.
	1867	{d}{l}{o}{g} \|
	1868	{d}{l}{o}{g}10 \|
	1869	{d}{m}{i}{n}1 \|
	1870	{d}{m}{a}{x}1 {
	1871	yytext[0] =+ \(fma\(fm \- \(fmd\(fm;
	1872	ECHO;
	1873	}
	1874	.TE
	1875	And one routine
	1876	must have initial \fId\fR changed to initial \fIr\fR:
	1877	.TS
	1878	center;
	1879	l l.
	1880	{d}1{m}{a}{c}{h} {yytext[0] =+ \(fmr\(fm \- \(fmd\(fm;
	1881	ECHO;
	1882	}
	1883	.TE
	1884	To avoid such names as \fIdsinx\fR being detected as instances
	1885	of \fIdsin\fR, some final rules pick up longer words as identifiers
	1886	and copy some surviving characters:
	1887	.TS
	1888	center;
	1889	l l.
	1890	[A\-Za\-z][A\-Za\-z0\-9]\(** \|
	1891	[0\-9]+ \|
	1892	\en \|
	1893	\&. ECHO;
	1894	.TE
	1895	Note that this program is not complete; it
	1896	does not deal with the spacing problems in Fortran or
	1897	with the use of keywords as identifiers.
	1898	.br
	1899	.2C
	1900	.NH
	1901	Left Context Sensitivity.
	1902	.PP
	1903	Sometimes
	1904	it is desirable to have several sets of lexical rules
	1905	to be applied at different times in the input.
	1906	For example, a compiler preprocessor might distinguish
	1907	preprocessor statements and analyze them differently
	1908	from ordinary statements.
	1909	This requires
	1910	sensitivity
	1911	to prior context, and there are several ways of handling
	1912	such problems.
	1913	The \fI^\fR operator, for example, is a prior context operator,
	1914	recognizing immediately preceding left context just as \fI$\fR recognizes
	1915	immediately following right context.
	1916	Adjacent left context could be extended, to produce a facility similar to
	1917	that for adjacent right context, but it is unlikely
	1918	to be as useful, since often the relevant left context
	1919	appeared some time earlier, such as at the beginning of a line.
	1920	.PP
	1921	This section describes three means of dealing
	1922	with different environments: a simple use of flags,
	1923	when only a few rules change from one environment to another,
	1924	the use of
	1925	.I
	1926	start conditions
	1927	.R
	1928	on rules,
	1929	and the possibility of making multiple lexical analyzers all run
	1930	together.
	1931	In each case, there are rules which recognize the need to change the
	1932	environment in which the
	1933	following input text is analyzed, and set some parameter
	1934	to reflect the change. This may be a flag explicitly tested by
	1935	the user's action code; such a flag is the simplest way of dealing
	1936	with the problem, since Lex is not involved at all.
	1937	It may be more convenient,
	1938	however,
	1939	to have Lex remember the flags as initial conditions on the rules.
	1940	Any rule may be associated with a start condition. It will only
	1941	be recognized when Lex is in
	1942	that start condition.
	1943	The current start condition may be changed at any time.
	1944	Finally, if the sets of rules for the different environments
	1945	are very dissimilar,
	1946	clarity may be best achieved by writing several distinct lexical
	1947	analyzers, and switching from one to another as desired.
	1948	.PP
	1949	Consider the following problem: copy the input to the output,
	1950	changing the word \fImagic\fR to \fIfirst\fR on every line which began
	1951	with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
	1952	which began with the letter \fIb\fR, and changing
	1953	\fImagic\fR to \fIthird\fR on every line which began
	1954	with the letter \fIc\fR. All other words and all other lines
	1955	are left unchanged.
	1956	.PP
	1957	These rules are so simple that the easiest way
	1958	to do this job is with a flag:
	1959	.TS
	1960	center;
	1961	l l.
	1962	int flag;
	1963	%%
	1964	^a {flag = \(fma\(fm; ECHO;}
	1965	^b {flag = \(fmb\(fm; ECHO;}
	1966	^c {flag = \(fmc\(fm; ECHO;}
	1967	\en {flag = 0 ; ECHO;}
	1968	magic {
	1969	switch (flag)
	1970	{
	1971	case \(fma\(fm: printf("first"); break;
	1972	case \(fmb\(fm: printf("second"); break;
	1973	\11\11 case \(fmc\(fm: printf("third"); break;
	1974	default: ECHO; break;
	1975	}
	1976	}
	1977	.TE
	1978	should be adequate.
	1979	.PP
	1980	To handle the same problem with start conditions, each
	1981	start condition must be introduced to Lex in the definitions section
	1982	with a line reading
	1983	.TS
	1984	center;
	1985	l l.
	1986	%Start name1 name2 ...
	1987	.TE
	1988	where the conditions may be named in any order.
	1989	The word \fIStart\fR may be abbreviated to \fIs\fR or \fIS\fR.
	1990	The conditions may be referenced at the
	1991	head of a rule with the <> brackets:
	1992	.TS
	1993	center;
	1994	l.
	1995	<name1>expression
	1996	.TE
	1997	is a rule which is only recognized when Lex is in the
	1998	start condition \fIname1\fR.
	1999	To enter a start condition,
	2000	execute the action statement
	2001	.TS
	2002	center;
	2003	l.
	2004	BEGIN name1;
	2005	.TE
	2006	which changes the start condition to \fIname1\fR.
	2007	To resume the normal state,
	2008	.TS
	2009	center;
	2010	l.
	2011	BEGIN 0;
	2012	.TE
	2013	resets the initial condition
	2014	of the Lex automaton interpreter.
	2015	A rule may be active in several
	2016	start conditions:
	2017	.TS
	2018	center;
	2019	l.
	2020	<name1,name2,name3>
	2021	.TE
	2022	is a legal prefix. Any rule not beginning with the
	2023	<> prefix operator is always active.
	2024	.PP
	2025	The same example as before can be written:
	2026	.TS
	2027	center;
	2028	l l.
	2029	%START AA BB CC
	2030	%%
	2031	^a {ECHO; BEGIN AA;}
	2032	^b {ECHO; BEGIN BB;}
	2033	^c {ECHO; BEGIN CC;}
	2034	\en {ECHO; BEGIN 0;}
	2035	<AA>magic printf("first");
	2036	<BB>magic printf("second");
	2037	<CC>magic printf("third");
	2038	.TE
	2039	where the logic is exactly the same as in the previous
	2040	method of handling the problem, but Lex does the work
	2041	rather than the user's code.
	2042	.2C
	2043	.NH
	2044	Character Set.
	2045	.PP
	2046	The programs generated by Lex handle
	2047	character I/O only through the routines
	2048	.I
	2049	input,
	2050	output,
	2051	.R
	2052	and
	2053	.I
	2054	unput.
	2055	.R
	2056	Thus the character representation
	2057	provided in these routines
	2058	is accepted by Lex and employed to return
	2059	values in
	2060	.I
	2061	yytext.
	2062	.R
	2063	For internal use
	2064	a character is represented as a small integer
	2065	which, if the standard library is used,
	2066	has a value equal to the integer value of the bit
	2067	pattern representing the character on the host computer.
	2068	Normally, the letter
	2069	.I a
	2070	is represented as the same form as the character constant
	2071	.I \(fma\(fm .
	2072	If this interpretation is changed, by providing I/O
	2073	routines which translate the characters,
	2074	Lex must be told about
	2075	it, by giving a translation table.
	2076	This table must be in the definitions section,
	2077	and must be bracketed by lines containing only
	2078	``%T''.
	2079	The table contains lines of the form
	2080	.TS
	2081	center;
	2082	l.
	2083	{integer} {character string}
	2084	.TE
	2085	which indicate the value associated with each character.
	2086	Thus the next example
	2087	.GS 2
	2088	.TS
	2089	center;
	2090	l l.
	2091	%T
	2092	1 Aa
	2093	2 Bb
	2094	\&...
	2095	26 Zz
	2096	27 \en
	2097	28 +
	2098	29 \-
	2099	30 0
	2100	31 1
	2101	\&...
	2102	39 9
	2103	%T
	2104	.TE
	2105	.sp
	2106	.ce 1
	2107	Sample character table.
	2108	.GE
	2109	maps the lower and upper case letters together into the integers 1 through 26,
	2110	newline into 27, + and \- into 28 and 29, and the
	2111	digits into 30 through 39.
	2112	Note the escape for newline.
	2113	If a table is supplied, every character that is to appear either
	2114	in the rules or in any valid input must be included
	2115	in the table.
	2116	No character
	2117	may be assigned the number 0, and no character may be
	2118	assigned a bigger number than the size of the hardware character set.
	2119	.2C
	2120	.NH
	2121	Summary of Source Format.
	2122	.PP
	2123	The general form of a Lex source file is:
	2124	.TS
	2125	center;
	2126	l.
	2127	{definitions}
	2128	%%
	2129	{rules}
	2130	%%
	2131	{user subroutines}
	2132	.TE
	2133	The definitions section contains
	2134	a combination of
	2135	.IP 1)
	2136	Definitions, in the form ``name space translation''.
	2137	.IP 2)
	2138	Included code, in the form ``space code''.
	2139	.IP 3)
	2140	Included code, in the form
	2141	.TS
	2142	center;
	2143	l.
	2144	%{
	2145	code
	2146	%}
	2147	.TE
	2148	.ns
	2149	.IP 4)
	2150	Start conditions, given in the form
	2151	.TS
	2152	center;
	2153	l.
	2154	%S name1 name2 ...
	2155	.TE
	2156	.ns
	2157	.IP 5)
	2158	Character set tables, in the form
	2159	.TS
	2160	center;
	2161	l.
	2162	%T
	2163	number space character-string
	2164	\&...
	2165	%T
	2166	.TE
	2167	.ns
	2168	.IP 6)
	2169	Changes to internal array sizes, in the form
	2170	.TS
	2171	center;
	2172	l.
	2173	%\fIx\fR\0\0\fInnn\fR
	2174	.TE
	2175	where \fInnn\fR is a decimal integer representing an array size
	2176	and \fIx\fR selects the parameter as follows:
	2177	.TS
	2178	center;
	2179	c c
	2180	c l.
	2181	Letter Parameter
	2182	p positions
	2183	n states
	2184	e tree nodes
	2185	a transitions
	2186	k packed character classes
	2187	o output array size
	2188	.TE
	2189	.LP
	2190	Lines in the rules section have the form ``expression action''
	2191	where the action may be continued on succeeding
	2192	lines by using braces to delimit it.
	2193	.PP
	2194	Regular expressions in Lex use the following
	2195	operators:
	2196	.br
	2197	.TS
	2198	center;
	2199	l l.
	2200	x the character "x"
	2201	"x" an "x", even if x is an operator.
	2202	\ex an "x", even if x is an operator.
	2203	[xy] the character x or y.
	2204	[x\-z] the characters x, y or z.
	2205	[^x] any character but x.
	2206	\&. any character but newline.
	2207	^x an x at the beginning of a line.
	2208	<y>x an x when Lex is in start condition y.
	2209	x$ an x at the end of a line.
	2210	x? an optional x.
	2211	x\(** 0,1,2, ... instances of x.
	2212	x+ 1,2,3, ... instances of x.
	2213	x\|y an x or a y.
	2214	(x) an x.
	2215	x/y an x but only if followed by y.
	2216	{xx} the translation of xx from the definitions section.
	2217	x{m,n} \fIm\fR through \fIn\fR occurrences of x
	2218	.TE
	2219	.NH
	2220	Caveats and Bugs.
	2221	.PP
	2222	There are pathological expressions which
	2223	produce exponential growth of the tables when
	2224	converted to deterministic machines;
	2225	fortunately, they are rare.
	2226	.PP
	2227	REJECT does not rescan the input; instead it remembers the results of the previous
	2228	scan. This means that if a rule with trailing context is found, and
	2229	REJECT executed, the user
	2230	must not have used
	2231	.ul
	2232	unput
	2233	to change the characters forthcoming
	2234	from the input stream.
	2235	This is the only restriction on the user's ability to manipulate
	2236	the not-yet-processed input.
	2237	.PP
	2238	.2C
	2239	.NH
	2240	Acknowledgments.
	2241	.PP
	2242	As should
	2243	be obvious from the above, the outside of Lex
	2244	is patterned
	2245	on Yacc and the inside on Aho's string matching routines.
	2246	Therefore, both S. C. Johnson and A. V. Aho
	2247	are really originators
	2248	of much of Lex,
	2249	as well as debuggers of it.
	2250	Many thanks are due to both.
	2251	.PP
	2252	The code of the current version of Lex was designed, written,
	2253	and debugged by Eric Schmidt.
	2254	.SG MH-1274-MEL-unix
	2255	.sp 1
	2256	.2C
	2257	.NH
	2258	References.
	2259	.SP 1v
	2260	.IP 1.
	2261	B. W. Kernighan and D. M. Ritchie,
	2262	.I
	2263	The C Programming Language,
	2264	.R
	2265	Prentice-Hall, N. J. (1978).
	2266	.IP 2.
	2267	B. W. Kernighan,
	2268	.I
	2269	Ratfor: A Preprocessor for a Rational Fortran,
	2270	.R
	2271	Software \- Practice and Experience,
	2272	\fB5\fR, pp. 395-496 (1975).
	2273	.IP 3.
	2274	S. C. Johnson,
	2275	.I
	2276	Yacc: Yet Another Compiler Compiler,
	2277	.R
	2278	Computing Science Technical Report No. 32,
	2279	1975,
	2280	.MH
	2281	.if \n(tm (also TM 75-1273-6)
	2282	.IP 4.
	2283	A. V. Aho and M. J. Corasick,
	2284	.I
	2285	Efficient String Matching: An Aid to Bibliographic Search,
	2286	.R
	2287	Comm. ACM
	2288	.B
	2289	18,
	2290	.R
	2291	333-340 (1975).
	2292	.IP 5.
	2293	B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
	2294	.I
	2295	QED Text Editor,
	2296	.R
	2297	Computing Science Technical Report No. 5,
	2298	1972,
	2299	.MH
	2300	.IP 6.
	2301	D. M. Ritchie,
	2302	private communication.
	2303	See also
	2304	M. E. Lesk,
	2305	.I
	2306	The Portable C Library,
	2307	.R
	2308	Computing Science Technical Report No. 31,
	2309	.MH
	2310	.if \n(tm (also TM 75-1274-11)