Commit | Line | Data |
---|---|---|
d596f7c0 KB |
1 | .\" Copyright (c) 1992 Henry Spencer. |
2 | .\" Copyright (c) 1992 The Regents of the University of California. | |
3 | .\" All rights reserved. | |
4 | .\" | |
5 | .\" This code is derived from software contributed to Berkeley by | |
6 | .\" Henry Spencer of the University of Toronto. | |
7 | .\" | |
8 | .\" %sccs.include.redist.roff% | |
9 | .\" | |
6175ca7c | 10 | .\" @(#)regex.3 5.2 (Berkeley) %G% |
d596f7c0 KB |
11 | .\" |
12 | .TH REGEX 3 "" | |
13 | .SH NAME | |
14 | regcomp, regexec, regerror, regfree \- regular-expression library | |
6175ca7c KB |
15 | .de ZR |
16 | .\" one other place knows this name: the SEE ALSO section | |
17 | .IR re_format (7) \\$1 | |
18 | .. | |
d596f7c0 KB |
21 | .SH SYNOPSIS |
22 | .ft B | |
6175ca7c | 23 | .\".na |
d596f7c0 KB |
24 | #include <sys/types.h> |
25 | .br | |
26 | #include <regex.h> | |
6175ca7c KB |
27 | .HP 10 |
28 | int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags); | |
29 | .HP | |
30 | int\ regexec(const\ regex_t\ *preg, const\ char\ *string, | |
31 | size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags); | |
32 | .HP | |
33 | size_t\ regerror(int\ errcode, const\ regex_t\ *preg, | |
34 | char\ *errbuf, size_t\ errbuf_size); | |
35 | .HP | |
36 | void\ regfree(regex_t\ *preg); | |
37 | .\".ad | |
d596f7c0 KB |
38 | .ft |
39 | .SH DESCRIPTION | |
40 | These routines implement POSIX 1003.2 regular expressions (``RE''s); | |
41 | see | |
6175ca7c | 42 | .ZR . |
d596f7c0 KB |
43 | .I Regcomp |
44 | compiles an RE written as a string into an internal form, | |
45 | .I regexec | |
46 | matches that internal form against a string and reports results, | |
47 | .I regerror | |
48 | transforms error codes from either into human-readable messages, | |
49 | and | |
50 | .I regfree | |
51 | frees any dynamically-allocated storage used by the internal form | |
52 | of an RE. | |
53 | .PP | |
54 | The header | |
55 | .I <regex.h> | |
56 | declares two structure types, | |
57 | .I regex_t | |
58 | and | |
59 | .IR regmatch_t , | |
60 | the former for compiled internal forms and the latter for match reporting. | |
61 | It also declares the four functions, | |
62 | a type | |
63 | .IR regoff_t , | |
64 | and a number of constants with names starting with ``REG_''. | |
65 | .PP | |
66 | .I Regcomp | |
67 | compiles the regular expression contained in the | |
68 | .I pattern | |
69 | string, | |
70 | subject to the flags in | |
71 | .IR cflags , | |
72 | and places the results in the | |
73 | .I regex_t | |
74 | structure pointed to by | |
75 | .IR preg . | |
76 | .I Cflags | |
77 | is the bitwise OR of zero or more of the following flags: | |
78 | .IP REG_EXTENDED \w'REG_EXTENDED'u+2n | |
79 | Compile modern (``extended'') REs, | |
80 | rather than the obsolete (``basic'') REs that | |
81 | are the default. | |
6175ca7c KB |
82 | .IP REG_BASIC |
83 | This is a synonym for 0, | |
84 | provided as a counterpart to REG_EXTENDED to improve readability. | |
85 | .IP REG_NOSPEC | |
86 | Compile with recognition of all special characters turned off. | |
87 | All characters are thus considered ordinary, | |
88 | so the ``RE'' is a literal string. | |
89 | This is an extension, | |
90 | compatible with but not specified by POSIX 1003.2, | |
91 | and should be used with | |
92 | caution in software intended to be portable to other systems. | |
93 | REG_EXTENDED and REG_NOSPEC may not be used | |
94 | in the same call to | |
95 | .IR regcomp . | |
d596f7c0 KB |
96 | .IP REG_ICASE |
97 | Compile for matching that ignores upper/lower case distinctions. | |
98 | See | |
6175ca7c | 99 | .ZR . |
d596f7c0 KB |
100 | .IP REG_NOSUB |
101 | Compile for matching that need only report success or failure, | |
102 | not what was matched. | |
103 | .IP REG_NEWLINE | |
104 | Compile for newline-sensitive matching. | |
105 | By default, newline is a completely ordinary character with no special | |
106 | meaning in either REs or strings. | |
107 | With this flag, | |
108 | `[^' bracket expressions and `.' never match newline, | |
109 | a `^' anchor matches the null string after any newline in the string | |
110 | in addition to its normal function, | |
111 | and the `$' anchor matches the null string before any newline in the | |
112 | string in addition to its normal function. | |
6175ca7c KB |
113 | .IP REG_PEND |
114 | The regular expression ends, | |
115 | not at the first NUL, | |
116 | but just before the character pointed to by the | |
117 | .I re_endp | |
118 | member of the structure pointed to by | |
119 | .IR preg . | |
120 | The | |
121 | .I re_endp | |
122 | member is of type | |
123 | .IR const\ char\ * . | |
124 | This flag permits inclusion of NULs in the RE; | |
125 | they are considered ordinary characters. | |
126 | This is an extension, | |
127 | compatible with but not specified by POSIX 1003.2, | |
128 | and should be used with | |
129 | caution in software intended to be portable to other systems. | |
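The effect of the portable compile-time flags can be checked with a small C sketch. The helper name `re_matches` is hypothetical and not part of the library; it ORs REG_NOSUB into the caller's flags because only success or failure is wanted, and it avoids the REG_NOSPEC and REG_PEND extensions so that it should build against any POSIX <regex.h>.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): returns 1 if pattern
 * matches somewhere in text, 0 if it does not, -1 if regcomp() fails.
 */
int re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only success or failure is needed, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    /* With REG_NOSUB the nmatch and pmatch arguments are ignored. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With this sketch, `re_matches("^bar$", "foo\nbar\nbaz", REG_EXTENDED)` is expected to fail because the anchors apply only to the ends of the string, while adding REG_NEWLINE lets the same RE match the middle line.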
d596f7c0 KB |
130 | .PP |
131 | When successful, | |
132 | .I regcomp | |
133 | returns 0 and fills in the structure pointed to by | |
134 | .IR preg . | |
6175ca7c KB |
135 | One member of that structure |
136 | (other than | |
137 | .IR re_endp ) | |
138 | is publicized: | |
d596f7c0 KB |
139 | .IR re_nsub , |
140 | of type | |
141 | .IR size_t , | |
142 | contains the number of parenthesized subexpressions within the RE | |
143 | (except that the value of this member is undefined if the | |
144 | REG_NOSUB flag was used). | |
145 | If | |
146 | .I regcomp | |
147 | fails, it returns a non-zero error code; | |
148 | see DIAGNOSTICS. | |
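As a minimal sketch of the success and failure paths just described, the following C fragment compiles an extended RE and reads the re_nsub member. The function name `count_subexpressions` and its -1 failure convention are illustrative assumptions, not part of the library.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: compile pattern as an extended RE and return the
 * number of parenthesized subexpressions, or -1 if regcomp() fails.
 */
long count_subexpressions(const char *pattern)
{
    regex_t re;
    long nsub;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* re is not a valid compiled RE here */
    nsub = (long)re.re_nsub;    /* defined because REG_NOSUB was not used */
    regfree(&re);
    return nsub;
}
```

A caller that intends to pass a pmatch array to regexec can size it as re_nsub + 1 entries, one for the whole match plus one per subexpression.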
149 | .PP | |
150 | .I Regexec | |
151 | matches the compiled RE pointed to by | |
152 | .I preg | |
153 | against the | |
154 | .IR string , | |
155 | subject to the flags in | |
156 | .IR eflags , | |
157 | and reports results using | |
158 | .IR nmatch , | |
159 | .IR pmatch , | |
160 | and the returned value. | |
161 | The RE must have been compiled by a previous invocation of | |
162 | .IR regcomp . | |
163 | The compiled form is not altered during execution of | |
164 | .IR regexec , | |
165 | so a single compiled RE can be used simultaneously by multiple threads. | |
166 | .PP | |
167 | By default, | |
168 | the NUL-terminated string pointed to by | |
169 | .I string | |
170 | is considered to be the text of an entire line, minus any terminating | |
171 | newline. | |
172 | The | |
173 | .I eflags | |
174 | argument is the bitwise OR of zero or more of the following flags: | |
175 | .IP REG_NOTBOL \w'REG_STARTEND'u+2n | |
176 | The first character of | |
177 | the string | |
178 | is not the beginning of a line, so the `^' anchor should not match before it. | |
179 | This does not affect the behavior of newlines under REG_NEWLINE. | |
180 | .IP REG_NOTEOL | |
181 | The NUL terminating | |
182 | the string | |
183 | does not end a line, so the `$' anchor should not match before it. | |
184 | This does not affect the behavior of newlines under REG_NEWLINE. | |
185 | .IP REG_STARTEND | |
186 | The string is considered to start at | |
187 | \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR | |
188 | and to have a terminating NUL located at | |
189 | \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR | |
190 | (there need not actually be a NUL at that location), | |
191 | regardless of the value of | |
192 | .IR nmatch . | |
193 | See below for the definition of | |
194 | .I pmatch | |
195 | and | |
196 | .IR nmatch . | |
197 | This is an extension, | |
198 | compatible with but not specified by POSIX 1003.2, | |
199 | and should be used with | |
200 | caution in software intended to be portable to other systems. | |
6175ca7c KB |
201 | Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL; |
202 | REG_STARTEND affects only the location of the string, | |
203 | not how it is matched. | |
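One common use of REG_NOTBOL is scanning a string for successive matches: after the first match, each later call starts in the middle of the line, so `^' must not match at the new starting point. The sketch below uses only the portable REG_NOTBOL flag rather than the REG_STARTEND extension; the name `count_matches` and its handling of empty matches (skip one character) are assumptions made for illustration.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count successive non-overlapping matches of an
 * extended RE in text.  Returns 0 if the RE does not compile.
 */
size_t count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    size_t n = 0;
    int eflags = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    while (regexec(&re, p, 1, &m, eflags) == 0) {
        n++;
        if (m.rm_eo > m.rm_so) {
            p += m.rm_eo;           /* resume just past this match */
        } else {
            if (p[m.rm_eo] == '\0')
                break;              /* empty match at end of string */
            p += m.rm_eo + 1;       /* step over one char to make progress */
        }
        /* Later calls do not start at the beginning of the line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return n;
}
```

Without REG_NOTBOL on the later calls, an RE such as `^a' would match again at every restart position in `aaa' instead of only once.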
d596f7c0 KB |
204 | .PP |
205 | See | |
6175ca7c | 206 | .ZR |
d596f7c0 KB |
207 | for a discussion of what is matched in situations where an RE or a |
208 | portion thereof could match any of several substrings of | |
209 | .IR string . | |
210 | .PP | |
211 | Normally, | |
212 | .I regexec | |
213 | returns 0 for success and the non-zero code REG_NOMATCH for failure. | |
214 | Other non-zero error codes may be returned in exceptional situations; | |
215 | see DIAGNOSTICS. | |
216 | .PP | |
217 | If REG_NOSUB was specified in the compilation of the RE, | |
218 | or if | |
219 | .I nmatch | |
220 | is 0, | |
221 | .I regexec | |
222 | ignores the | |
223 | .I pmatch | |
224 | argument (but see below for the case where REG_STARTEND is specified). | |
225 | Otherwise, | |
226 | .I pmatch | |
227 | points to an array of | |
228 | .I nmatch | |
229 | structures of type | |
230 | .IR regmatch_t . | |
231 | Such a structure has at least the members | |
232 | .I rm_so | |
233 | and | |
234 | .IR rm_eo , | |
235 | both of type | |
236 | .I regoff_t | |
237 | (a signed arithmetic type at least as large as an | |
238 | .I off_t | |
239 | and a | |
240 | .IR ssize_t ), | |
241 | containing respectively the offset of the first character of a substring | |
242 | and the offset of the first character after the end of the substring. | |
243 | Offsets are measured from the beginning of the | |
244 | .I string | |
245 | argument given to | |
246 | .IR regexec . | |
247 | An empty substring is denoted by equal offsets, | |
248 | both indicating the character following the empty substring. | |
249 | .PP | |
250 | The 0th member of the | |
251 | .I pmatch | |
252 | array is filled in to indicate what substring of | |
253 | .I string | |
254 | was matched by the entire RE. | |
255 | Remaining members report what substring was matched by parenthesized | |
256 | subexpressions within the RE; | |
257 | member | |
258 | .I i | |
259 | reports subexpression | |
260 | .IR i , | |
261 | with subexpressions counted (starting at 1) by the order of their opening | |
262 | parentheses in the RE, left to right. | |
263 | Unused entries in the array\(emcorresponding either to subexpressions that | |
264 | did not participate in the match at all, or to subexpressions that do not | |
265 | exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both | |
266 | .I rm_so | |
267 | and | |
268 | .I rm_eo | |
269 | set to \-1. | |
270 | If a subexpression participated in the match several times, | |
271 | the reported substring is the last one it matched. | |
272 | (Note, in particular, that when the RE `(b*)+' matches `bbb', | |
273 | the parenthesized subexpression matches each of the three `b's and then | |
274 | an infinite number of empty strings following the last `b', | |
275 | so the reported substring is one of the empties.) | |
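The offset conventions above can be illustrated with a short C sketch that copies one subexpression out of a match. The helper name `capture`, the fixed array of 10 entries, and the truncating copy are all assumptions for the example, not library behavior.

```c
#include <sys/types.h>
#include <regex.h>
#include <string.h>

#define NMATCH 10   /* illustrative fixed size: whole match + 9 subexpressions */

/*
 * Hypothetical helper: match text against an extended RE and copy the
 * substring for pmatch entry idx into out (truncated to fit).
 * Returns the copied length, or -1 on error, no match, or unused entry.
 */
int capture(const char *pattern, const char *text, size_t idx,
            char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[NMATCH];
    int len = -1;

    if (idx >= NMATCH || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    /* rm_so == -1 marks an entry that did not participate or does not exist. */
    if (regexec(&re, text, NMATCH, pm, 0) == 0 && pm[idx].rm_so != -1) {
        size_t n = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);

        if (n >= outsz)
            n = outsz - 1;
        memcpy(out, text + pm[idx].rm_so, n);   /* offsets are relative to text */
        out[n] = '\0';
        len = (int)n;
    }
    regfree(&re);
    return len;
}
```

For instance, with the RE `([a-z]+)=([0-9]+)' and the string `width=42', entry 0 covers the whole string and entry 2 covers `42'; with `(a)|(b)' matched against `b', entry 1 is unused and the helper returns -1.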
276 | .PP | |
277 | If REG_STARTEND is specified, | |
278 | .I pmatch | |
279 | must point to at least one | |
280 | .I regmatch_t | |
6175ca7c | 281 | (even if |
d596f7c0 | 282 | .I nmatch |
6175ca7c KB |
283 | is 0 or REG_NOSUB was specified), |
284 | to hold the input offsets for REG_STARTEND. | |
285 | Use for output is still entirely controlled by | |
286 | .IR nmatch ; | |
287 | if | |
288 | .I nmatch | |
289 | is 0 or REG_NOSUB was specified, | |
290 | the value of | |
d596f7c0 | 291 | .IR pmatch [0] |
6175ca7c | 292 | will not be changed by a successful |
d596f7c0 KB |
293 | .IR regexec . |
294 | .PP | |
295 | .I Regerror | |
296 | maps a non-zero | |
297 | .I errcode | |
298 | from either | |
299 | .I regcomp | |
300 | or | |
301 | .I regexec | |
6175ca7c | 302 | to a human-readable, printable message. |
d596f7c0 KB |
303 | If |
304 | .I preg | |
305 | is non-NULL, | |
306 | the error code should have arisen from use of | |
307 | the | |
308 | .I regex_t | |
309 | pointed to by | |
310 | .IR preg , | |
311 | and if the error code came from | |
312 | .IR regcomp , | |
313 | it should have been the result from the most recent | |
314 | .I regcomp | |
315 | using that | |
316 | .IR regex_t . | |
317 | .RI ( Regerror | |
318 | may be able to supply a more detailed message using information | |
319 | from the | |
320 | .IR regex_t .) | |
321 | .I Regerror | |
322 | places the NUL-terminated message into the buffer pointed to by | |
323 | .IR errbuf , | |
324 | limiting the length (including the NUL) to at most | |
325 | .I errbuf_size | |
326 | bytes. | |
327 | If the whole message won't fit, | |
328 | as much of it as will fit before the terminating NUL is supplied. | |
329 | In any case, | |
330 | the returned value is the size of buffer needed to hold the whole | |
331 | message (including terminating NUL). | |
332 | If | |
333 | .I errbuf_size | |
334 | is 0, | |
335 | .I errbuf | |
336 | is ignored but the return value is still correct. | |
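Because the return value is the full size needed regardless of errbuf_size, a caller can ask for the size first and allocate an exact-fit buffer. The following C sketch shows that two-call pattern; the function name `error_message` is hypothetical, and the caller is assumed to free the result.

```c
#include <sys/types.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc'ed copy of the message for
 * errcode, or NULL if allocation fails.  preg may be NULL.
 */
char *error_message(int errcode, const regex_t *preg)
{
    /* First call: errbuf_size 0, so errbuf is ignored; get the size. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf = malloc(need);

    /* Second call: the buffer is large enough for message plus NUL. */
    if (buf != NULL)
        regerror(errcode, preg, buf, need);
    return buf;
}
```

A fixed-size stack buffer also works when truncated messages are acceptable, since regerror never writes more than errbuf_size bytes.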
337 | .PP | |
6175ca7c KB |
338 | If the |
339 | .I errcode | |
340 | given to | |
341 | .I regerror | |
342 | is first ORed with REG_ITOA, | |
343 | the ``message'' that results is the printable name of the error code, | |
344 | e.g. ``REG_NOMATCH'', | |
345 | rather than an explanation thereof. | |
346 | If | |
347 | .I errcode | |
348 | is REG_ATOI, | |
349 | then | |
350 | .I preg | |
351 | shall be non-NULL and the | |
352 | .I re_endp | |
353 | member of the structure it points to | |
354 | must point to the printable name of an error code; | |
355 | in this case, the result in | |
356 | .I errbuf | |
357 | is the decimal digits of | |
358 | the numeric value of the error code | |
359 | (0 if the name is not recognized). | |
360 | REG_ITOA and REG_ATOI are intended primarily as debugging facilities; | |
361 | they are extensions, | |
362 | compatible with but not specified by POSIX 1003.2, | |
363 | and should be used with | |
364 | caution in software intended to be portable to other systems. | |
365 | Be warned also that they are considered experimental and changes are possible. | |
366 | .PP | |
d596f7c0 KB |
367 | .I Regfree |
368 | frees any dynamically-allocated storage associated with the compiled RE | |
369 | pointed to by | |
370 | .IR preg . | |
371 | The remaining | |
372 | .I regex_t | |
373 | is no longer a valid compiled RE | |
374 | and the effect of supplying it to | |
375 | .I regexec | |
376 | or | |
377 | .I regerror | |
378 | is undefined. | |
379 | .PP | |
380 | None of these functions references global variables except for tables | |
381 | of constants; | |
382 | all are safe for use from multiple threads if the arguments are safe. | |
383 | .SH IMPLEMENTATION CHOICES | |
384 | There are a number of decisions that 1003.2 leaves up to the implementor, | |
385 | either by explicitly saying ``undefined'' or by virtue of their being | |
386 | forbidden by the RE grammar. | |
387 | This implementation treats them as follows. | |
388 | .PP | |
389 | See | |
6175ca7c | 390 | .ZR |
d596f7c0 KB |
391 | for a discussion of the definition of case-independent matching. |
392 | .PP | |
393 | There is no particular limit on the length of REs, | |
394 | except insofar as memory is limited. | |
395 | Memory usage is approximately linear in RE size, and largely insensitive | |
396 | to RE complexity, except for bounded repetitions. | |
397 | See BUGS for one short RE using them | |
398 | that will run almost any system out of memory. | |
399 | .PP | |
6175ca7c KB |
400 | A backslashed character other than one specifically given a magic meaning |
401 | by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs) | |
402 | is taken as an ordinary character. | |
d596f7c0 KB |
403 | .PP |
404 | Any unmatched [ is a REG_EBRACK error. | |
405 | .PP | |
406 | Equivalence classes cannot begin or end bracket-expression ranges. | |
407 | The endpoint of one range cannot begin another. | |
408 | .PP | |
409 | RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255. | |
410 | .PP | |
411 | A repetition operator (?, *, +, or bounds) cannot follow another | |
412 | repetition operator. | |
413 | A repetition operator cannot begin an expression or subexpression | |
414 | or follow `^' or `|'. | |
415 | .PP | |
416 | `|' cannot appear first or last in a (sub)expression or after another `|', | |
417 | i.e. an operand of `|' cannot be an empty subexpression. | |
418 | An empty parenthesized subexpression, `()', is legal and matches an | |
419 | empty (sub)string. | |
420 | An empty string is not a legal RE. | |
421 | .PP | |
422 | A `{' followed by a digit is considered the beginning of bounds for a | |
423 | bounded repetition, which must then follow the syntax for bounds. | |
424 | A `{' \fInot\fR followed by a digit is considered an ordinary character. | |
425 | .PP | |
426 | `^' and `$' beginning and ending subexpressions in obsolete (``basic'') | |
427 | REs are anchors, not ordinary characters. | |
428 | .SH SEE ALSO | |
429 | grep(1), re_format(7) | |
430 | .PP | |
431 | POSIX 1003.2, sections 2.8 (Regular Expression Notation) | |
432 | and | |
433 | B.5 (C Binding for Regular Expression Matching). | |
434 | .SH DIAGNOSTICS | |
435 | Non-zero error codes from | |
436 | .I regcomp | |
437 | and | |
438 | .I regexec | |
439 | include the following: | |
440 | .PP | |
441 | .nf | |
442 | .ta \w'REG_ECOLLATE'u+3n | |
443 | REG_NOMATCH regexec() failed to match | |
444 | REG_BADPAT invalid regular expression | |
445 | REG_ECOLLATE invalid collating element | |
446 | REG_ECTYPE invalid character class | |
447 | REG_EESCAPE \e applied to unescapable character | |
448 | REG_ESUBREG invalid backreference number | |
449 | REG_EBRACK brackets [ ] not balanced | |
450 | REG_EPAREN parentheses ( ) not balanced | |
451 | REG_EBRACE braces { } not balanced | |
452 | REG_BADBR invalid repetition count(s) in { } | |
453 | REG_ERANGE invalid character range in [ ] | |
454 | REG_ESPACE ran out of memory | |
455 | REG_BADRPT ?, *, or + operand invalid | |
456 | REG_EMPTY empty (sub)expression | |
457 | REG_ASSERT ``can't happen''\(emyou found a bug | |
6175ca7c | 458 | REG_INVARG invalid argument, e.g. negative-length string |
d596f7c0 KB |
459 | .fi |
460 | .SH HISTORY | |
461 | Written by Henry Spencer at University of Toronto, | |
462 | henry@zoo.toronto.edu. | |
463 | .SH BUGS | |
464 | This is an alpha release with known defects. | |
465 | Please report problems. | |
466 | .PP | |
467 | There is one known functionality bug. | |
468 | The implementation of internationalization is incomplete: | |
469 | the locale is always assumed to be the default one of 1003.2, | |
470 | and only the collating elements etc. of that locale are available. | |
471 | .PP | |
472 | The back-reference code is subtle and doubts linger about its correctness | |
473 | in complex cases. | |
474 | .PP | |
475 | .I Regexec | |
476 | performance is poor. | |
477 | This will improve with later releases. | |
478 | .I Nmatch | |
479 | exceeding 0 is expensive; | |
480 | .I nmatch | |
481 | exceeding 1 is worse. | |
482 | .I Regexec | |
483 | is largely insensitive to RE complexity \fIexcept\fR that back | |
484 | references are massively expensive. | |
485 | RE length does matter; in particular, there is a strong speed bonus | |
486 | for keeping RE length under about 30 characters, | |
487 | with most special characters counting roughly double. | |
488 | .PP | |
489 | .I Regcomp | |
490 | implements bounded repetitions by macro expansion, | |
491 | which is costly in time and space if counts are large | |
492 | or bounded repetitions are nested. | |
493 | An RE like, say, | |
494 | `((((a{1,100}){1,100}){1,100}){1,100}){1,100}' | |
495 | will (eventually) run almost any existing machine out of swap space. | |
496 | .PP | |
497 | There are suspected problems with response to obscure error conditions. | |
498 | Notably, | |
499 | certain kinds of internal overflow, | |
500 | produced only by truly enormous REs or by multiply nested bounded repetitions, | |
501 | are probably not handled well. | |
502 | .PP | |
503 | Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is | |
504 | a special character only in the presence of a previous unmatched `('. | |
505 | This can't be fixed until the spec is fixed. | |
506 | .PP | |
507 | The standard's definition of back references is vague. | |
508 | For example, does | |
509 | `a\e(\e(b\e)*\e2\e)*d' match `abbbd'? | |
510 | Until the standard is clarified, | |
511 | behavior in such cases should not be relied on. | |
6175ca7c KB |
512 | .PP |
513 | The implementation of word-boundary matching is a bit of a kludge, | |
514 | and bugs may lurk in combinations of word-boundary matching and anchoring. |