4.4BSD snapshot (revision 8.1)
.\" Copyright (c) 1992 Henry Spencer.
.\" Copyright (c) 1992 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer of the University of Toronto.
.\"
.\" %sccs.include.redist.roff%
.\"
.\" @(#)regex.3 8.1 (Berkeley) %G%
.\"
.TH REGEX 3 ""
.de ZR
.\" one other place knows this name: the SEE ALSO section
.IR re_format (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
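.PP
The effect of the portable flags above can be sketched as follows.
This is an illustration only, not part of the library:
the helper name
.I matches_with
and its return convention are hypothetical,
and only flags specified by POSIX 1003.2 are used.
.nf
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>).
 * Returns 1 if pattern matches somewhere in text, 0 if it does not,
 * and -1 if the pattern fails to compile.  extra_cflags is ORed with
 * REG_EXTENDED and REG_NOSUB.
 */
static int
matches_with(const char *pattern, const char *text, int extra_cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is wanted, not offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_cflags) != 0)
        return -1;
    /* nmatch is 0, so the pmatch argument is ignored. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```
.fi
Called with REG_NEWLINE, an RE such as `^foo$' can match a line in the
middle of a multi-line string; called with REG_ICASE, `HELLO' can match
`hello'.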
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
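.PP
As a minimal sketch of checking the
.I regcomp
result and reading
.IR re_nsub ,
assuming an extended RE compiled without REG_NOSUB
(the function name
.I count_subexpressions
is hypothetical):
.nf
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>).
 * Compiles an extended RE and returns its number of parenthesized
 * subexpressions, or -1 if regcomp() reports an error.
 */
static long
count_subexpressions(const char *pattern)
{
    regex_t re;
    long nsub;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;      /* compile failed: re holds no compiled RE */
    nsub = (long)re.re_nsub;
    regfree(&re);
    return nsub;
}
```
.fi
Subexpressions are counted whether or not they are nested,
so `(a)(b(c))' reports three.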
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
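.PP
One way the return value and REG_NOTBOL work together is a loop that
finds successive matches in one string.
The following is a sketch under stated assumptions, not a prescribed idiom:
the name
.I count_matches
is hypothetical,
matches are taken to be non-overlapping,
and any return other than 0 or REG_NOMATCH is treated as an error.
.nf
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>).
 * Counts non-overlapping matches of an extended RE in text.
 * Returns -1 on a compile error or an unexpected regexec() error.
 */
static int
count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int count = 0;
    int eflags = 0;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while ((rc = regexec(&re, p, 1, &m, eflags)) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step one character so the loop advances. */
            if (p[m.rm_eo] == 0)
                break;          /* reached the terminating NUL */
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;
        }
        /* Later searches do not start at the beginning of a line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return (rc == 0 || rc == REG_NOMATCH) ? count : -1;
}
```
.fi
Because REG_NOTBOL is set after the first search,
an RE such as `^a' is counted once in `aaa', not three times.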
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
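.PP
The offset conventions above can be illustrated by a small routine that
copies out the text of one subexpression.
This is a sketch only:
the name
.IR extract_group ,
the fixed array of 10 entries,
and the return convention are assumptions made for the example.
.nf
```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of <regex.h>).
 * Copies the text matched by subexpression idx (0 = the whole match)
 * of an extended RE into out.  Returns 0 on success, or -1 if the RE
 * does not compile, does not match, the subexpression is unused,
 * or out is too small.
 */
static int
extract_group(const char *pattern, const char *text, size_t idx,
    char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[10];
    size_t npm = sizeof(pm) / sizeof(pm[0]);
    size_t len;
    int rc = -1;

    if (idx >= npm)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* Unused entries have rm_so and rm_eo set to -1. */
    if (regexec(&re, text, npm, pm, 0) == 0 && pm[idx].rm_so != -1) {
        len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
        if (len < outsize) {
            memcpy(out, text + pm[idx].rm_so, len);
            out[len] = 0;       /* terminating NUL */
            rc = 0;
        }
    }
    regfree(&re);
    return rc;
}
```
.fi
With the RE `([a-z]+)@([a-z]+)' and the string `mail bob@example now',
index 0 yields `bob@example', index 1 yields `bob',
and index 2 yields `example'.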
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
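.PP
A sketch of searching only part of a string with REG_STARTEND follows.
Since REG_STARTEND is an extension,
the example assumes it may be absent and guards it with a preprocessor test;
the name
.I find_in_range
and the \-2 return for ``not supported'' are hypothetical.
.nf
```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>).
 * Searches for an extended RE only within string[from, to).
 * Returns the offset of the match start, measured from string,
 * -1 if there is no match or the RE does not compile,
 * or -2 if this system's <regex.h> lacks REG_STARTEND.
 */
static long
find_in_range(const char *pattern, const char *string,
    size_t from, size_t to)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t pm[1];
    long result = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* pmatch[0] carries the input offsets in, and the match out. */
    pm[0].rm_so = (regoff_t)from;
    pm[0].rm_eo = (regoff_t)to;
    if (regexec(&re, string, 1, pm, REG_STARTEND) == 0)
        result = (long)pm[0].rm_so;
    regfree(&re);
    return result;
#else
    (void)pattern;
    (void)string;
    (void)from;
    (void)to;
    return -2;
#endif
}
```
.fi
Searching `one two one' for `one' with offsets 3 and 11 should skip the
first occurrence and report offset 8,
since offsets are measured from the beginning of the
.I string
argument.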
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
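.PP
Because the return value is the full size needed,
a caller can size the buffer with a first call and fill it with a second.
A minimal sketch, assuming heap allocation is acceptable
(the name
.I regex_message
is hypothetical):
.nf
```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of <regex.h>).
 * Returns a malloc()ed copy of the message for errcode, or NULL if
 * allocation fails.  The caller frees the result.  preg may be NULL.
 */
static char *
regex_message(int errcode, const regex_t *preg)
{
    size_t need;
    char *msg;

    /* With errbuf_size 0 the buffer is ignored; only the size is used. */
    need = regerror(errcode, preg, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        (void)regerror(errcode, preg, msg, need);
    return msg;
}
```
.fi
The wording of the message is implementation-defined,
so callers should not compare it against fixed text.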
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
grep(1), re_format(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
.SH HISTORY
Written by Henry Spencer at University of Toronto,
henry@zoo.toronto.edu.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.