Commit | Line | Data |
---|---|---|
1 | REGEXP(3) BSD Programmer's Manual REGEXP(3) |
2 | ||
3 | NAME | |
4 | regcomp, regexec, regsub, regerror - regular expression handlers | |
5 | ||
6 | SYNOPSIS | |
7 | #include <regexp.h> | |
8 | ||
9 | regexp * | |
10 | regcomp(const char *exp); | |
11 | ||
12 | int | |
13 | regexec(const regexp *prog, const char *string); | |
14 | ||
15 | void | |
16 | regsub(const regexp *prog, const char *source, char *dest); | |
17 | ||
18 | DESCRIPTION | |
19 | This interface is made obsolete by regex(3). | |
20 | ||
21 | The regcomp(), regexec(), regsub(), and regerror() functions implement | |
22 | egrep(1)-style regular expressions and supporting facilities. | |
23 | ||
24 | The regcomp() function compiles a regular expression into a structure of | |
25 | type regexp, and returns a pointer to it. The space has been allocated | |
26 | using malloc(3) and may be released by free(3). | |
27 | ||
28 | The regexec() function matches a NUL-terminated string against the | |
29 | compiled regular expression in prog. It returns 1 for success and 0 for | |
30 | failure, and adjusts the contents of prog's startp and endp (see below) | |
31 | accordingly. | |
32 | ||
33 | The members of a regexp structure include at least the following (not | |
34 | necessarily in order): | |
35 | ||
36 | char *startp[NSUBEXP]; | |
37 | char *endp[NSUBEXP]; | |
38 | ||
39 | where NSUBEXP is defined (as 10) in the header file. Once a successful | |
40 | regexec() has been done using the regexp structure, each startp-endp pair | |
41 | describes one substring within the string, with the startp pointing to the | |
42 | first character of the substring and the endp pointing to the first | |
43 | character following the substring. The 0th substring is the substring of | |
44 | string that matched the whole regular expression. The others are those | |
45 | substrings that matched parenthesized expressions within the regular | |
46 | expression, with parenthesized expressions numbered in left-to-right order | |
47 | of their opening parentheses. | |
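The pointer arithmetic behind a startp-endp pair can be sketched as follows. This is only an illustration: `struct demo_regexp`, `DEMO_NSUBEXP`, and `demo_substring()` are hypothetical names made up for this sketch and are not part of <regexp.h>. The structure mirrors just the two members listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NSUBEXP 10   /* assumption: mirrors NSUBEXP from the text */

/* Hypothetical stand-in holding only the two members described above. */
struct demo_regexp {
    const char *startp[DEMO_NSUBEXP];
    const char *endp[DEMO_NSUBEXP];
};

/*
 * Copy substring n into out (capacity cap), NUL-terminating it on success.
 * Returns the substring length, or -1 if n is out of range, the pair is
 * unset, or out is too small.
 */
int demo_substring(const struct demo_regexp *m, int n, char *out, size_t cap)
{
    assert(m != NULL && out != NULL);
    if (n < 0 || n >= DEMO_NSUBEXP ||
        m->startp[n] == NULL || m->endp[n] == NULL)
        return -1;

    /* endp points one past the last matched character. */
    size_t len = (size_t)(m->endp[n] - m->startp[n]);
    if (len + 1 > cap)
        return -1;

    memcpy(out, m->startp[n], len);
    out[len] = '\0';
    return (int)len;
}
```

Because endp points just past the substring, the length is simply endp[n] minus startp[n], and no terminator is stored in the matched string itself.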
48 | ||
49 | The regsub() function copies source to dest, making substitutions | |
50 | according to the most recent regexec() performed using prog. Each | |
51 | instance of `&' in source is replaced by the substring indicated by | |
52 | startp[0] and endp[0]. Each instance of `\n', where n is a digit, is | |
53 | replaced by the substring indicated by startp[n] and endp[n]. To get a | |
54 | literal `&' or `\n' into dest, prefix it with `\'; to get a literal `\' | |
55 | preceding `&' or `\n', prefix it with another `\'. | |
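The substitution rules for `&', `\n' and the `\' escapes can be sketched in a few lines of C. This is a minimal illustration written from the description above, not the library's source. `struct demo_regexp` and `demo_regsub()` are hypothetical names. The sketch assumes the caller has made dest large enough, and that an unset startp/endp pair contributes nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NSUBEXP 10   /* assumption: mirrors NSUBEXP from the text */

/* Hypothetical stand-in for the startp/endp members of a regexp. */
struct demo_regexp {
    const char *startp[DEMO_NSUBEXP];
    const char *endp[DEMO_NSUBEXP];
};

/* Expand source into dest; the caller must make dest large enough. */
void demo_regsub(const struct demo_regexp *m, const char *source, char *dest)
{
    assert(m != NULL && source != NULL && dest != NULL);
    const char *src = source;
    char *dst = dest;

    while (*src != '\0') {
        char c = *src++;
        int n = -1;                      /* -1 means "ordinary character" */

        if (c == '&') {
            n = 0;                       /* whole match */
        } else if (c == '\\' && *src >= '0' && *src <= '9') {
            n = *src++ - '0';            /* \n: parenthesized substring n */
        } else if (c == '\\' && (*src == '\\' || *src == '&')) {
            c = *src++;                  /* \\ or \&: emit the next char as-is */
        }

        if (n < 0) {
            *dst++ = c;
        } else if (m->startp[n] != NULL && m->endp[n] != NULL) {
            size_t len = (size_t)(m->endp[n] - m->startp[n]);
            memcpy(dst, m->startp[n], len);
            dst += len;
        }
    }
    *dst = '\0';
}
```

With substring 0 set to "abc" and substring 1 set to "b", the source text `[&] \1 \& \\1` expands to `[abc] b & \1`.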
56 | ||
57 | The regerror() function is called whenever an error is detected in | |
58 | regcomp(), regexec(), or regsub(). The default regerror() writes the | |
59 | string msg, with a suitable indicator of origin, on the standard error | |
60 | output and invokes exit(3). The regerror() function can be replaced by | |
61 | the user if other actions are desirable. | |
62 | ||
63 | REGULAR EXPRESSION SYNTAX | |
64 | A regular expression is zero or more branches, separated by `|'. It | |
65 | matches anything that matches one of the branches. | |
66 | ||
67 | A branch is zero or more pieces, concatenated. It matches a match for | |
68 | the first, followed by a match for the second, etc. | |
69 | ||
70 | A piece is an atom possibly followed by `*', `+', or `?'. An atom | |
71 | followed by `*' matches a sequence of 0 or more matches of the atom. An | |
72 | atom followed by `+' matches a sequence of 1 or more matches of the atom. | |
73 | An atom followed by `?' matches a match of the atom, or the null string. | |
74 | ||
75 | An atom is a regular expression in parentheses (matching a match for the | |
76 | regular expression), a range (see below), `.' (matching any single | |
77 | character), `^' (matching the null string at the beginning of the input | |
78 | string), `$' (matching the null string at the end of the input string), a | |
79 | `\' followed by a single character (matching that character), or a single | |
80 | character with no other significance (matching that character). | |
81 | ||
82 | A range is a sequence of characters enclosed in `[]'. It normally | |
83 | matches any single character from the sequence. If the sequence begins with | |
84 | `^', it matches any single character not from the rest of the sequence. | |
85 | If two characters in the sequence are separated by `-', this is shorthand | |
86 | for the full list of ASCII characters between them (e.g. `[0-9]' matches | |
87 | any decimal digit). To include a literal `]' in the sequence, make it | |
88 | the first character (following a possible `^'). To include a literal | |
89 | `-', make it the first or last character. | |
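The range rules above can be illustrated with a small membership test. This is a sketch under stated assumptions, not the library's implementation: `demo_range_match()` is a hypothetical helper, and it takes the text between the brackets with an explicit length, so a leading literal `]' is already part of that text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Does c match the range whose body (the text between `[' and `]')
 * is body[0..len)?  A leading `^' negates the result; "x-y" with a
 * character on both sides of the `-' is a span; a `-' that is first
 * or last is an ordinary character.
 */
bool demo_range_match(const char *body, size_t len, char c)
{
    assert(body != NULL);
    size_t i = 0;
    bool negate = false;
    bool found = false;
    unsigned char uc = (unsigned char)c;

    if (len > 0 && body[0] == '^') {
        negate = true;
        i = 1;
    }

    while (i < len) {
        unsigned char lo = (unsigned char)body[i];
        if (i + 2 < len && body[i + 1] == '-') {
            /* Span: every character code from lo through hi. */
            unsigned char hi = (unsigned char)body[i + 2];
            if (lo <= uc && uc <= hi)
                found = true;
            i += 3;
        } else {
            if (uc == lo)
                found = true;
            i += 1;
        }
    }
    return negate ? !found : found;
}
```

For example, the body `0-9` accepts `5`, the body `^0-9` rejects it, and the bodies `-a` and `a-` both accept a literal `-`.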
90 | ||
91 | AMBIGUITY | |
92 | If a regular expression could match two different parts of the input | |
93 | string, it will match the one which begins earliest. If both begin in | |
94 | the same place but match different lengths, or match the same length in | |
95 | different ways, life gets messier, as follows. | |
96 | ||
97 | In general, the possibilities in a list of branches are considered in | |
98 | left-to-right order, the possibilities for `*', `+', and `?' are consid- | |
99 | ered longest-first, nested constructs are considered from the outermost | |
100 | in, and concatenated constructs are considered leftmost-first. The match | |
101 | that will be chosen is the one that uses the earliest possibility in the | |
102 | first choice that has to be made. If there is more than one choice, the | |
103 | next will be made in the same manner (earliest possibility) subject to | |
104 | the decision on the first choice. And so forth. | |
105 | ||
106 | For example, `(ab|a)b*c' could match `abc' in one of two ways. The first | |
107 | choice is between `ab' and `a'; since `ab' is earlier, and does lead to a | |
108 | successful overall match, it is chosen. Since the `b' is already spoken | |
109 | for, the `b*' must match its last possibility--the empty string--since it | |
110 | must respect the earlier choice. | |
111 | ||
112 | In the particular case where no `|'s are present and there is only one | |
113 | `*', `+', or `?', the net effect is that the longest possible match will | |
114 | be chosen. So `ab*', presented with `xabbbby', will match `abbbb'. Note | |
115 | that if `ab*' is tried against `xabyabbbz', it will match `ab' just | |
116 | after `x', due to the begins-earliest rule. (In effect, the decision on | |
117 | where to start the match is the first choice to be made, hence subsequent | |
118 | choices must respect it even if this leads them to less-preferred alter- | |
119 | natives.) | |
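The `ab*' examples can be reproduced with a hand-written matcher for that one fixed pattern. This sketch does not use the library and is not a general matcher. `demo_match_ab_star()` is a hypothetical function that hard-codes the begins-earliest rule (outer loop) and the longest-first rule for `*' (inner loop).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Find the match of the fixed pattern ab* in s.  Returns a pointer to
 * the start of the match and stores its length in *len, or returns
 * NULL if s contains no 'a'.
 */
const char *demo_match_ab_star(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);

    /* Begins-earliest: try start positions left to right. */
    for (const char *p = s; *p != '\0'; p++) {
        if (*p != 'a')
            continue;

        /* Longest-first for `*': consume every following 'b'. */
        const char *q = p + 1;
        while (*q == 'b')
            q++;

        *len = (size_t)(q - p);
        return p;   /* the first start position that works wins */
    }
    return NULL;
}
```

Against `xabbbby' this yields the 5-character match `abbbb'. Against `xabyabbbz' it yields the 2-character match `ab' after the `x', even though a longer match starts later.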
120 | ||
121 | RETURN VALUES | |
122 | The regcomp() function returns NULL for a failure (regerror() | |
123 | permitting), where failures are syntax errors, exceeding implementation | |
124 | limits, or applying `+' or `*' to a possibly-null operand. | |
125 | ||
126 | SEE ALSO | |
127 | ed(1), ex(1), expr(1), egrep(1), fgrep(1), grep(1), regex(3) | |
128 | ||
129 | HISTORY | |
130 | Both code and manual page for regcomp(), regexec(), regsub(), and | |
131 | regerror() were written at the University of Toronto and appeared in | |
132 | 4.3BSD-Tahoe. They are intended to be compatible with the Bell V8 | |
133 | regexp(3), but are not derived from Bell code. | |
134 | ||
135 | BUGS | |
136 | Empty branches and empty regular expressions are not portable to V8. | |
137 | ||
138 | The restriction against applying `*' or `+' to a possibly-null operand is | |
139 | an artifact of the simplistic implementation. | |
140 | ||
141 | Does not support egrep's newline-separated branches; neither does the V8 | |
142 | regexp(3), though. | |
143 | ||
144 | Due to emphasis on compactness and simplicity, it's not strikingly fast. | |
145 | It does give special attention to handling simple cases quickly. | |
146 | ||
147 | 4.4BSD June 4, 1993 3 |