Commit | Line | Data |
---|---|---|
1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
2 | <html> | |
3 | <head> | |
4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> | |
5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> | |
6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> | |
7 | <link rel="first" href="lib.html" title='Python Library Reference' /> | |
8 | <link rel='contents' href='contents.html' title="Contents" /> | |
9 | <link rel='index' href='genindex.html' title='Index' /> | |
10 | <link rel='last' href='about.html' title='About this document...' /> | |
11 | <link rel='help' href='about.html' title='About this document...' /> | |
12 | <link rel="next" href="matching-searching.html" /> | |
13 | <link rel="prev" href="module-re.html" /> | |
14 | <link rel="parent" href="module-re.html" /> | |
16 | <meta name='aesop' content='information' /> | |
17 | <title>4.2.1 Regular Expression Syntax </title> | |
18 | </head> | |
19 | <body> | |
20 | <DIV CLASS="navigation"> | |
21 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> | |
22 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> | |
23 | <tr> | |
24 | <td class='online-navigation'><a rel="prev" title="4.2 re " | |
25 | href="module-re.html"><img src='../icons/previous.png' | |
26 | border='0' height='32' alt='Previous Page' width='32' /></A></td> | |
27 | <td class='online-navigation'><a rel="parent" title="4.2 re " | |
28 | href="module-re.html"><img src='../icons/up.png' | |
29 | border='0' height='32' alt='Up One Level' width='32' /></A></td> | |
30 | <td class='online-navigation'><a rel="next" title="4.2.2 Matching vs Searching" | |
31 | href="matching-searching.html"><img src='../icons/next.png' | |
32 | border='0' height='32' alt='Next Page' width='32' /></A></td> | |
33 | <td align="center" width="100%">Python Library Reference</td> | |
34 | <td class='online-navigation'><a rel="contents" title="Table of Contents" | |
35 | href="contents.html"><img src='../icons/contents.png' | |
36 | border='0' height='32' alt='Contents' width='32' /></A></td> | |
37 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' | |
38 | border='0' height='32' alt='Module Index' width='32' /></a></td> | |
39 | <td class='online-navigation'><a rel="index" title="Index" | |
40 | href="genindex.html"><img src='../icons/index.png' | |
41 | border='0' height='32' alt='Index' width='32' /></A></td> | |
42 | </tr></table> | |
43 | <div class='online-navigation'> | |
44 | <b class="navlabel">Previous:</b> | |
45 | <a class="sectref" rel="prev" href="module-re.html">4.2 re </A> | |
46 | <b class="navlabel">Up:</b> | |
47 | <a class="sectref" rel="parent" href="module-re.html">4.2 re </A> | |
48 | <b class="navlabel">Next:</b> | |
49 | <a class="sectref" rel="next" href="matching-searching.html">4.2.2 Matching vs Searching</A> | |
50 | </div> | |
51 | <hr /></div> | |
52 | </DIV> | |
53 | <!--End of Navigation Panel--> | |
54 | ||
55 | <H2><A NAME="SECTION006210000000000000000"></A><A NAME="re-syntax"></A> | |
56 | <BR> | |
57 | 4.2.1 Regular Expression Syntax | |
58 | </H2> | |
59 | ||
60 | <P> | |
61 | A regular expression (or RE) specifies a set of strings that matches | |
62 | it; the functions in this module let you check if a particular string | |
63 | matches a given regular expression (or if a given regular expression | |
64 | matches a particular string, which comes down to the same thing). | |
65 | ||
66 | <P> | |
67 | Regular expressions can be concatenated to form new regular | |
68 | expressions; if <em>A</em> and <em>B</em> are both regular expressions, | |
69 | then <em>AB</em> is also a regular expression. In general, if a string | |
70 | <em>p</em> matches <em>A</em> and another string <em>q</em> matches <em>B</em>, | |
71 | the string <em>pq</em> will match <em>AB</em>. This holds unless <em>A</em> or | |
72 | <em>B</em> contains low-precedence operations, boundary conditions between | |
73 | <em>A</em> and <em>B</em>, or numbered group references. Thus, complex | |
74 | expressions can easily be constructed from simpler primitive | |
75 | expressions like the ones described here. For details of the theory | |
76 | and implementation of regular expressions, consult the Friedl book | |
77 | referenced above, or almost any textbook about compiler construction. | |
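<P>
The concatenation rule and its precedence caveat can be sketched as follows. This is an illustration rather than part of the reference: the helper <code>matches_whole()</code> and the sample patterns are hypothetical names chosen for this example, and the code assumes a current Python interpreter.

```python
import re

def matches_whole(pattern, string):
    """Hypothetical helper: True if `pattern` matches all of `string`."""
    # The non-capturing group makes \Z apply to the whole pattern.
    return re.match(r"(?:%s)\Z" % pattern, string) is not None

# Two simple REs and their concatenation.
A = r"ab+"
B = r"c?d"
p, q = "abb", "d"

# p matches A, q matches B, so pq matches AB.
rule_holds = (matches_whole(A, p)
              and matches_whole(B, q)
              and matches_whole(A + B, p + q))

# A low-precedence operator breaks the rule:
# "x|y" + "z" is parsed as x|(yz), not (x|y)z.
A2, B2 = r"x|y", r"z"
naive = matches_whole(A2 + B2, "xz")
grouped = matches_whole("(?:%s)%s" % (A2, B2), "xz")

print(rule_holds, naive, grouped)
```

Wrapping each operand in <code>(?:...)</code> before concatenating is one way to keep the rule valid when an operand contains <code>|</code>.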
78 | ||
79 | <P> | |
80 | A brief explanation of the format of regular expressions follows. For | |
81 | further information and a gentler presentation, consult the Regular | |
82 | Expression HOWTO, accessible from <a class="url" href="http://www.python.org/doc/howto/">http://www.python.org/doc/howto/</a>. | |
83 | ||
84 | <P> | |
85 | Regular expressions can contain both special and ordinary characters. | |
86 | Most ordinary characters, like "<tt class="character">A</tt>", "<tt class="character">a</tt>", or | |
87 | "<tt class="character">0</tt>", are the simplest regular expressions; they simply match | |
88 | themselves. You can concatenate ordinary characters, so <tt class="regexp">last</tt> | |
89 | matches the string <code>'last'</code>. (In the rest of this section, we'll | |
90 | write REs in <tt class="regexp">this special style</tt>, usually without quotes, and | |
91 | strings to be matched <code>'in single quotes'</code>.) | |
92 | ||
93 | <P> | |
94 | Some characters, like "<tt class="character">|</tt>" or "<tt class="character">(</tt>", are special. | |
95 | Special characters either stand for classes of ordinary characters, or | |
96 | affect how the regular expressions around them are interpreted. | |
97 | ||
98 | <P> | |
99 | The special characters are: | |
100 | <DL> | |
101 | <DT><STRONG>"<tt class="character">.</tt>"</STRONG></DT> | |
102 | <DD>(Dot.) In the default mode, this matches any | |
103 | character except a newline. If the <tt class="constant">DOTALL</tt> flag has been | |
104 | specified, this matches any character including a newline. | |
105 | ||
106 | <P> | |
107 | </DD> | |
108 | <DT><STRONG>"<tt class="character">^</tt>"</STRONG></DT> | |
109 | <DD>(Caret.) Matches the start of the | |
110 | string, and in <tt class="constant">MULTILINE</tt> mode also matches immediately | |
111 | after each newline. | |
112 | ||
113 | <P> | |
114 | </DD> | |
115 | <DT><STRONG>"<tt class="character">$</tt>"</STRONG></DT> | |
116 | <DD>Matches the end of the string or just before the | |
117 | newline at the end of the string, and in <tt class="constant">MULTILINE</tt> mode | |
118 | also matches before a newline. <tt class="regexp">foo</tt> matches both 'foo' and | |
119 | 'foobar', while the regular expression <tt class="regexp">foo$</tt> matches only | |
120 | 'foo'. More interestingly, searching for <tt class="regexp">foo.$</tt> in | |
121 | 'foo1\nfoo2\n' matches 'foo2' normally, | |
122 | but 'foo1' in <tt class="constant">MULTILINE</tt> mode. | |
123 | ||
124 | <P> | |
125 | </DD> | |
126 | <DT><STRONG>"<tt class="character">*</tt>"</STRONG></DT> | |
127 | <DD>Causes the resulting RE to | |
128 | match 0 or more repetitions of the preceding RE, as many repetitions | |
129 | as are possible. <tt class="regexp">ab*</tt> will | |
130 | match 'a', 'ab', or 'a' followed by any number of 'b's. | |
131 | ||
132 | <P> | |
133 | </DD> | |
134 | <DT><STRONG>"<tt class="character">+</tt>"</STRONG></DT> | |
135 | <DD>Causes the | |
136 | resulting RE to match 1 or more repetitions of the preceding RE. | |
137 | <tt class="regexp">ab+</tt> will match 'a' followed by any non-zero number of 'b's; it | |
138 | will not match just 'a'. | |
139 | ||
140 | <P> | |
141 | </DD> | |
142 | <DT><STRONG>"<tt class="character">?</tt>"</STRONG></DT> | |
143 | <DD>Causes the resulting RE to | |
144 | match 0 or 1 repetitions of the preceding RE. <tt class="regexp">ab?</tt> will | |
145 | match either 'a' or 'ab'. | |
146 | ||
147 | <P> | |
148 | </DD> | |
149 | <DT><STRONG><code>*?</code>, <code>+?</code>, <code>??</code></STRONG></DT> | |
150 | <DD>The "<tt class="character">*</tt>", | |
151 | "<tt class="character">+</tt>", and "<tt class="character">?</tt>" qualifiers are all <i class="dfn">greedy</i>; they | |
152 | match as much text as possible. Sometimes this behaviour isn't | |
153 | desired; if the RE <tt class="regexp">&lt;.*&gt;</tt> is matched against | |
154 | <code>'&lt;H1&gt;title&lt;/H1&gt;'</code>, it will match the entire string, and not just | |
155 | <code>'&lt;H1&gt;'</code>. Adding "<tt class="character">?</tt>" after the qualifier makes it | |
156 | perform the match in <i class="dfn">non-greedy</i> or <i class="dfn">minimal</i> fashion; as | |
157 | <em>few</em> characters as possible will be matched. Using <tt class="regexp">.*?</tt> | |
158 | in the previous expression will match only <code>'&lt;H1&gt;'</code>. | |
159 | ||
160 | <P> | |
161 | </DD> | |
162 | <DT><STRONG><code>{<var>m</var>}</code></STRONG></DT> | |
163 | <DD>Specifies that exactly <var>m</var> copies of the previous RE should be | |
164 | matched; fewer matches cause the entire RE not to match. For example, | |
165 | <tt class="regexp">a{6}</tt> will match exactly six "<tt class="character">a</tt>" characters, but | |
166 | not five. | |
167 | ||
168 | <P> | |
169 | </DD> | |
170 | <DT><STRONG><code>{<var>m</var>,<var>n</var>}</code></STRONG></DT> | |
171 | <DD>Causes the resulting RE to match from | |
172 | <var>m</var> to <var>n</var> repetitions of the preceding RE, attempting to | |
173 | match as many repetitions as possible. For example, <tt class="regexp">a{3,5}</tt> | |
174 | will match from 3 to 5 "<tt class="character">a</tt>" characters. Omitting <var>m</var> | |
175 | specifies a lower bound of zero, | |
176 | and omitting <var>n</var> specifies an infinite upper bound. As an | |
177 | example, <tt class="regexp">a{4,}b</tt> will match <code>aaaab</code> or a thousand | |
178 | "<tt class="character">a</tt>" characters followed by a <code>b</code>, but not <code>aaab</code>. | |
179 | The comma may not be omitted or the modifier would be confused with | |
180 | the previously described form. | |
181 | ||
182 | <P> | |
183 | </DD> | |
184 | <DT><STRONG><code>{<var>m</var>,<var>n</var>}?</code></STRONG></DT> | |
185 | <DD>Causes the resulting RE to | |
186 | match from <var>m</var> to <var>n</var> repetitions of the preceding RE, | |
187 | attempting to match as <em>few</em> repetitions as possible. This is | |
188 | the non-greedy version of the previous qualifier. For example, on the | |
189 | 6-character string <code>'aaaaaa'</code>, <tt class="regexp">a{3,5}</tt> will match 5 | |
190 | "<tt class="character">a</tt>" characters, while <tt class="regexp">a{3,5}?</tt> will only match 3 | |
191 | characters. | |
192 | ||
193 | <P> | |
194 | </DD> | |
195 | <DT><STRONG>"<tt class="character">\</tt>"</STRONG></DT> | |
196 | <DD>Either escapes special characters (permitting | |
197 | you to match characters like "<tt class="character">*</tt>", "<tt class="character">?</tt>", and so | |
198 | forth), or signals a special sequence; special sequences are discussed | |
199 | below. | |
200 | ||
201 | <P> | |
202 | If you're not using a raw string to | |
203 | express the pattern, remember that Python also uses the | |
204 | backslash as an escape character in string literals; if the escape | |
205 | sequence isn't recognized by Python's parser, the backslash and | |
206 | subsequent character are included in the resulting string. However, | |
207 | if Python would recognize the resulting sequence, the backslash should | |
208 | be doubled. This is complicated and hard to understand, so | |
209 | it's highly recommended that you use raw strings for all but the | |
210 | simplest expressions. | |
211 | ||
212 | <P> | |
213 | </DD> | |
214 | <DT><STRONG><code>[]</code></STRONG></DT> | |
215 | <DD>Used to indicate a set of characters. Characters can | |
216 | be listed individually, or a range of characters can be indicated by | |
217 | giving two characters and separating them by a "<tt class="character">-</tt>". Special | |
218 | characters are not active inside sets. For example, <tt class="regexp">[akm$]</tt> | |
219 | will match any of the characters "<tt class="character">a</tt>", "<tt class="character">k</tt>", | |
220 | "<tt class="character">m</tt>", or "<tt class="character">$</tt>"; <tt class="regexp">[a-z]</tt> | |
221 | will match any lowercase letter, and <code>[a-zA-Z0-9]</code> matches any | |
222 | letter or digit. Character classes such as <code>\w</code> or <code>\S</code> | |
223 | (defined below) are also acceptable inside a set. If you want to | |
224 | include a "<tt class="character">]</tt>" or a "<tt class="character">-</tt>" inside a set, precede it with a | |
225 | backslash, or place it as the first character. The | |
226 | pattern <tt class="regexp">[]]</tt> will match <code>']'</code>, for example. | |
227 | ||
228 | <P> | |
229 | You can match the characters not within a set by <i class="dfn">complementing</i> | |
230 | the set. This is indicated by including a | |
231 | "<tt class="character">^</tt>" as the first character of the set; | |
232 | "<tt class="character">^</tt>" elsewhere will simply match the | |
233 | "<tt class="character">^</tt>" character. For example, | |
234 | <tt class="regexp">[^5]</tt> will match | |
235 | any character except "<tt class="character">5</tt>", and | |
236 | <tt class="regexp">[^<code>^</code>]</tt> will match any character | |
237 | except "<tt class="character">^</tt>". | |
238 | ||
239 | <P> | |
240 | </DD> | |
241 | <DT><STRONG>"<tt class="character">|</tt>"</STRONG></DT> | |
242 | <DD><code>A|B</code>, where A and B can be arbitrary REs, | |
243 | creates a regular expression that will match either A or B. An | |
244 | arbitrary number of REs can be separated by the "<tt class="character">|</tt>" in this | |
245 | way. This can be used inside groups (see below) as well. As the target | |
246 | string is scanned, REs separated by "<tt class="character">|</tt>" are tried from left to | |
247 | right. When one pattern completely matches, that branch is accepted. | |
248 | This means that once <code>A</code> matches, <code>B</code> will not be tested further, | |
249 | even if it would produce a longer overall match. In other words, the | |
250 | "<tt class="character">|</tt>" operator is never greedy. To match a literal "<tt class="character">|</tt>", | |
251 | use <tt class="regexp">\|</tt>, or enclose it inside a character class, as in <tt class="regexp">[|]</tt>. | |
252 | ||
253 | <P> | |
254 | </DD> | |
255 | <DT><STRONG><code>(...)</code></STRONG></DT> | |
256 | <DD>Matches whatever regular expression is inside the | |
257 | parentheses, and indicates the start and end of a group; the contents | |
258 | of a group can be retrieved after a match has been performed, and can | |
259 | be matched later in the string with the <tt class="regexp">\<var>number</var></tt> special | |
260 | sequence, described below. To match the literals "<tt class="character">(</tt>" or | |
261 | "<tt class="character">)</tt>", use <tt class="regexp">\(</tt> or <tt class="regexp">\)</tt>, or enclose them | |
262 | inside a character class: <tt class="regexp">[(] [)]</tt>. | |
263 | ||
264 | <P> | |
265 | </DD> | |
266 | <DT><STRONG><code>(?...)</code></STRONG></DT> | |
267 | <DD>This is an extension notation (a "<tt class="character">?</tt>" | |
268 | following a "<tt class="character">(</tt>" is not meaningful otherwise). The first | |
269 | character after the "<tt class="character">?</tt>" | |
270 | determines what the meaning and further syntax of the construct is. | |
271 | Extensions usually do not create a new group; | |
272 | <tt class="regexp">(?P&lt;<var>name</var>&gt;...)</tt> is the only exception to this rule. | |
273 | Following are the currently supported extensions. | |
274 | ||
275 | <P> | |
276 | </DD> | |
277 | <DT><STRONG><code>(?iLmsux)</code></STRONG></DT> | |
278 | <DD>(One or more letters from the set "<tt class="character">i</tt>", | |
279 | "<tt class="character">L</tt>", "<tt class="character">m</tt>", "<tt class="character">s</tt>", "<tt class="character">u</tt>", | |
280 | "<tt class="character">x</tt>".) The group matches the empty string; the letters set | |
281 | the corresponding flags (<tt class="constant">re.I</tt>, <tt class="constant">re.L</tt>, | |
282 | <tt class="constant">re.M</tt>, <tt class="constant">re.S</tt>, <tt class="constant">re.U</tt>, <tt class="constant">re.X</tt>) | |
283 | for the entire regular expression. This is useful if you wish to | |
284 | include the flags as part of the regular expression, instead of | |
285 | passing a <var>flag</var> argument to the <tt class="function">compile()</tt> function. | |
286 | ||
287 | <P> | |
288 | Note that the <tt class="regexp">(?x)</tt> flag changes how the expression is parsed. | |
289 | It should be used first in the expression string, or after one or more | |
290 | whitespace characters. If there are non-whitespace characters before | |
291 | the flag, the results are undefined. | |
292 | ||
293 | <P> | |
294 | </DD> | |
295 | <DT><STRONG><code>(?:...)</code></STRONG></DT> | |
296 | <DD>A non-grouping version of regular parentheses. | |
297 | Matches whatever regular expression is inside the parentheses, but the | |
298 | substring matched by the | |
299 | group <em>cannot</em> be retrieved after performing a match or | |
300 | referenced later in the pattern. | |
301 | ||
302 | <P> | |
303 | </DD> | |
304 | <DT><STRONG><code>(?P&lt;<var>name</var>&gt;...)</code></STRONG></DT> | |
305 | <DD>Similar to regular parentheses, but | |
306 | the substring matched by the group is accessible via the symbolic group | |
307 | name <var>name</var>. Group names must be valid Python identifiers, and | |
308 | each group name must be defined only once within a regular expression. A | |
309 | symbolic group is also a numbered group, just as if the group were not | |
310 | named. So the group named 'id' in the example below can also be | |
311 | referenced as the numbered group 1. | |
312 | ||
313 | <P> | |
314 | For example, if the pattern is | |
315 | <tt class="regexp">(?P&lt;id&gt;[a-zA-Z_]\w*)</tt>, the group can be referenced by its | |
316 | name in arguments to methods of match objects, such as | |
317 | <code>m.group('id')</code> or <code>m.end('id')</code>, and also by name in | |
318 | pattern text (for example, <tt class="regexp">(?P=id)</tt>) and replacement text | |
319 | (such as <code>\g&lt;id&gt;</code>). | |
320 | ||
321 | <P> | |
322 | </DD> | |
323 | <DT><STRONG><code>(?P=<var>name</var>)</code></STRONG></DT> | |
324 | <DD>Matches whatever text was matched by the | |
325 | earlier group named <var>name</var>. | |
326 | ||
327 | <P> | |
328 | </DD> | |
329 | <DT><STRONG><code>(?#...)</code></STRONG></DT> | |
330 | <DD>A comment; the contents of the parentheses are | |
331 | simply ignored. | |
332 | ||
333 | <P> | |
334 | </DD> | |
335 | <DT><STRONG><code>(?=...)</code></STRONG></DT> | |
336 | <DD>Matches if <tt class="regexp">...</tt> matches next, but doesn't | |
337 | consume any of the string. This is called a lookahead assertion. For | |
338 | example, <tt class="regexp">Isaac (?=Asimov)</tt> will match <code>'Isaac '</code> only if it's | |
339 | followed by <code>'Asimov'</code>. | |
340 | ||
341 | <P> | |
342 | </DD> | |
343 | <DT><STRONG><code>(?!...)</code></STRONG></DT> | |
344 | <DD>Matches if <tt class="regexp">...</tt> doesn't match next. This | |
345 | is a negative lookahead assertion. For example, | |
346 | <tt class="regexp">Isaac (?!Asimov)</tt> will match <code>'Isaac '</code> only if it's <em>not</em> | |
347 | followed by <code>'Asimov'</code>. | |
348 | ||
349 | <P> | |
350 | </DD> | |
351 | <DT><STRONG><code>(?&lt;=...)</code></STRONG></DT> | |
352 | <DD>Matches if the current position in the string | |
353 | is preceded by a match for <tt class="regexp">...</tt> that ends at the current | |
354 | position. This is called a <i class="dfn">positive lookbehind assertion</i>. | |
355 | <tt class="regexp">(?&lt;=abc)def</tt> will find a match in "<tt class="samp">abcdef</tt>", since the | |
356 | lookbehind will back up 3 characters and check if the contained | |
357 | pattern matches. The contained pattern must only match strings of | |
358 | some fixed length, meaning that <tt class="regexp">abc</tt> or <tt class="regexp">a|b</tt> are | |
359 | allowed, but <tt class="regexp">a*</tt> and <tt class="regexp">a{3,4}</tt> are not. Note that | |
360 | patterns which start with positive lookbehind assertions will never | |
361 | match at the beginning of the string being searched; you will most | |
362 | likely want to use the <tt class="function">search()</tt> function rather than the | |
363 | <tt class="function">match()</tt> function: | |
364 | ||
365 | <P> | |
366 | <div class="verbatim"><pre> | |
367 | &gt;&gt;&gt; import re | |
368 | &gt;&gt;&gt; m = re.search('(?&lt;=abc)def', 'abcdef') | |
369 | &gt;&gt;&gt; m.group(0) | |
370 | 'def' | |
371 | </pre></div> | |
372 | ||
373 | <P> | |
374 | This example looks for a word following a hyphen: | |
375 | ||
376 | <P> | |
377 | <div class="verbatim"><pre> | |
378 | &gt;&gt;&gt; m = re.search(r'(?&lt;=-)\w+', 'spam-egg') | |
379 | &gt;&gt;&gt; m.group(0) | |
380 | 'egg' | |
381 | </pre></div> | |
382 | ||
383 | <P> | |
384 | </DD> | |
385 | <DT><STRONG><code>(?&lt;!...)</code></STRONG></DT> | |
386 | <DD>Matches if the current position in the string | |
387 | is not preceded by a match for <tt class="regexp">...</tt>. This is called a | |
388 | <i class="dfn">negative lookbehind assertion</i>. Similar to positive lookbehind | |
389 | assertions, the contained pattern must only match strings of some | |
390 | fixed length. Patterns which start with negative lookbehind | |
391 | assertions may match at the beginning of the string being searched. | |
392 | ||
393 | <P> | |
394 | </DD> | |
395 | <DT><STRONG><code>(?(<var>id/name</var>)yes-pattern|no-pattern)</code></STRONG></DT> | |
396 | <DD>Will try to match | |
397 | with <tt class="regexp">yes-pattern</tt> if the group with given <var>id</var> or <var>name</var> | |
398 | exists, and with <tt class="regexp">no-pattern</tt> if it doesn't. <tt class="regexp">|no-pattern</tt> | |
399 | is optional and can be omitted. For example, | |
400 | <tt class="regexp">(&lt;)?(\w+@\w+(?:\.\w+)+)(?(1)&gt;)</tt> is a poor email matching | |
401 | pattern, which will match with <code>'&lt;user@host.com&gt;'</code> as well as | |
402 | <code>'user@host.com'</code>, but not with <code>'&lt;user@host.com'</code>. | |
403 | ||
404 | <span class="versionnote">New in version 2.4.</span> | |
405 | ||
406 | <P> | |
407 | </DD> | |
408 | </DL> | |
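<P>
The greedy and non-greedy qualifiers, and the effect of the <tt class="constant">MULTILINE</tt> and <tt class="constant">DOTALL</tt> flags on "<tt class="character">$</tt>" and "<tt class="character">.</tt>", can be checked with a short sketch. It reuses the sample strings from the entries above; the variable names are illustrative, and a current Python interpreter is assumed.

```python
import re

html = "<H1>title</H1>"

# Greedy: .* takes as much as possible, so the whole string matches.
greedy = re.match(r"<.*>", html).group(0)
# Non-greedy: .*? stops at the first ">" that lets the match succeed.
minimal = re.match(r"<.*?>", html).group(0)

# {m,n} versus {m,n}? on a six-character string.
most = re.match(r"a{3,5}", "aaaaaa").group(0)
fewest = re.match(r"a{3,5}?", "aaaaaa").group(0)

# "$" matches only at the end (or before a final newline) by default,
# and before every newline in MULTILINE mode.
text = "foo1\nfoo2\n"
default_hit = re.search(r"foo.$", text).group(0)
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group(0)

# "." skips newlines unless DOTALL is given.
dot_default = re.match(r"a.b", "a\nb") is not None
dot_all = re.match(r"a.b", "a\nb", re.DOTALL) is not None

print(greedy, minimal, most, fewest, default_hit, multiline_hit)
```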
409 | ||
410 | <P> | |
411 | The special sequences consist of "<tt class="character">\</tt>" and a character from the | |
412 | list below. If the ordinary character is not on the list, then the | |
413 | resulting RE will match the second character. For example, | |
414 | <tt class="regexp">\$</tt> matches the character "<tt class="character">$</tt>". | |
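<P>
As a brief illustration of escaping, the sketch below matches special characters literally, first with explicit backslashes and then with <code>re.escape()</code>. The sample text and variable names are assumptions made for this example.

```python
import re

price = "cost: $5.00 (approx.)"

# Backslash-escaped "$" and "." are matched literally.
amount = re.search(r"\$\d+\.\d+", price).group(0)

# An unescaped "." matches any character, an escaped one only a dot.
loose = re.match(r"5.00", "5x00") is not None
strict = re.match(r"5\.00", "5x00") is not None

# re.escape() backslash-escapes the special characters of a literal string.
literal = "(approx.)"
found = re.search(re.escape(literal), price).group(0)

print(amount, loose, strict, found)
```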
415 | <DL> | |
416 | <DT><STRONG><code>\<var>number</var></code></STRONG></DT> | |
417 | <DD>Matches the contents of the group of the | |
418 | same number. Groups are numbered starting from 1. For example, | |
419 | <tt class="regexp">(.+) \1</tt> matches <code>'the the'</code> or <code>'55 55'</code>, but not | |
420 | <code>'the end'</code> (note | |
421 | the space after the group). This special sequence can only be used to | |
422 | match one of the first 99 groups. If the first digit of <var>number</var> | |
423 | is 0, or <var>number</var> is 3 octal digits long, it will not be interpreted | |
424 | as a group match, but as the character with octal value <var>number</var>. | |
425 | Inside the "<tt class="character">[</tt>" and "<tt class="character">]</tt>" of a character class, all numeric | |
426 | escapes are treated as characters. | |
427 | ||
428 | <P> | |
429 | </DD> | |
430 | <DT><STRONG><code>\A</code></STRONG></DT> | |
431 | <DD>Matches only at the start of the string. | |
432 | ||
433 | <P> | |
434 | </DD> | |
435 | <DT><STRONG><code>\b</code></STRONG></DT> | |
436 | <DD>Matches the empty string, but only at the | |
437 | beginning or end of a word. A word is defined as a sequence of | |
438 | alphanumeric or underscore characters, so the end of a word is indicated by | |
439 | whitespace or a non-alphanumeric, non-underscore character. Note that | |
440 | <code>\b</code> is defined as the boundary between <code>\w</code> and <code>\W</code>, | |
441 | so the precise set of characters deemed to be alphanumeric depends on the | |
442 | values of the <code>UNICODE</code> and <code>LOCALE</code> flags. Inside a character | |
443 | range, <tt class="regexp">\b</tt> represents the backspace character, for compatibility | |
444 | with Python's string literals. | |
445 | ||
446 | <P> | |
447 | </DD> | |
448 | <DT><STRONG><code>\B</code></STRONG></DT> | |
449 | <DD>Matches the empty string, but only when it is <em>not</em> | |
450 | at the beginning or end of a word. This is just the opposite of <code>\b</code>, | |
451 | so is also subject to the settings of <code>LOCALE</code> and <code>UNICODE</code>. | |
452 | ||
453 | <P> | |
454 | </DD> | |
455 | <DT><STRONG><code>\d</code></STRONG></DT> | |
456 | <DD>When the <tt class="constant">UNICODE</tt> flag is not specified, matches | |
457 | any decimal digit; this is equivalent to the set <tt class="regexp">[0-9]</tt>. | |
458 | With <tt class="constant">UNICODE</tt>, it will match whatever is classified as a digit | |
459 | in the Unicode character properties database. | |
460 | ||
461 | <P> | |
462 | </DD> | |
463 | <DT><STRONG><code>\D</code></STRONG></DT> | |
464 | <DD>When the <tt class="constant">UNICODE</tt> flag is not specified, matches | |
465 | any non-digit character; this is equivalent to the set | |
466 | <tt class="regexp">[^0-9]</tt>. With <tt class="constant">UNICODE</tt>, it will match | |
467 | anything other than characters marked as digits in the Unicode character | |
468 | properties database. | |
469 | ||
470 | <P> | |
471 | </DD> | |
472 | <DT><STRONG><code>\s</code></STRONG></DT> | |
473 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> | |
474 | flags are not specified, matches any whitespace character; this is | |
475 | equivalent to the set <tt class="regexp">[ \t\n\r\f\v]</tt>. | |
476 | With <tt class="constant">LOCALE</tt>, it will match this set plus whatever characters | |
477 | are defined as space for the current locale. If <tt class="constant">UNICODE</tt> is set, | |
478 | this will match the characters <tt class="regexp">[ \t\n\r\f\v]</tt> plus | |
479 | whatever is classified as space in the Unicode character properties | |
480 | database. | |
481 | ||
482 | <P> | |
483 | </DD> | |
484 | <DT><STRONG><code>\S</code></STRONG></DT> | |
485 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> | |
486 | flags are not specified, matches any non-whitespace character; this is | |
487 | equivalent to the set <tt class="regexp">[^ \t\n\r\f\v]</tt>. | |
488 | With <tt class="constant">LOCALE</tt>, it will match any character not in this set, | |
489 | and not defined as space in the current locale. If <tt class="constant">UNICODE</tt> | |
490 | is set, this will match anything other than <tt class="regexp">[ \t\n\r\f\v]</tt> | |
491 | and characters marked as space in the Unicode character properties database. | |
492 | ||
493 | <P> | |
494 | </DD> | |
495 | <DT><STRONG><code>\w</code></STRONG></DT> | |
496 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> | |
497 | flags are not specified, matches any alphanumeric character and the | |
498 | underscore; this is equivalent to the set | |
499 | <tt class="regexp">[a-zA-Z0-9_]</tt>. With <tt class="constant">LOCALE</tt>, it will match the set | |
500 | <tt class="regexp">[0-9_]</tt> plus whatever characters are defined as alphanumeric for | |
501 | the current locale. If <tt class="constant">UNICODE</tt> is set, this will match the | |
502 | characters <tt class="regexp">[0-9_]</tt> plus whatever is classified as alphanumeric | |
503 | in the Unicode character properties database. | |
504 | ||
505 | <P> | |
506 | </DD> | |
507 | <DT><STRONG><code>\W</code></STRONG></DT> | |
508 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> | |
509 | flags are not specified, matches any non-alphanumeric character; this | |
510 | is equivalent to the set <tt class="regexp">[^a-zA-Z0-9_]</tt>. With | |
511 | <tt class="constant">LOCALE</tt>, it will match any character not in the set | |
512 | <tt class="regexp">[0-9_]</tt>, and not defined as alphanumeric for the current locale. | |
513 | If <tt class="constant">UNICODE</tt> is set, this will match anything other than | |
514 | <tt class="regexp">[0-9_]</tt> and characters marked as alphanumeric in the Unicode | |
515 | character properties database. | |
516 | ||
517 | <P> | |
518 | </DD> | |
519 | <DT><STRONG><code>\Z</code></STRONG></DT> | |
520 | <DD>Matches only at the end of the string. | |
521 | ||
522 | <P> | |
523 | </DD> | |
524 | </DL> | |
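<P>
Several of the special sequences listed above can be tried out with the sketch below. The sample strings beyond those quoted in the entries are illustrative, the variable names are assumptions, and the code assumes a current Python interpreter with default flags.

```python
import re

# \1 must repeat exactly what group 1 matched.
same_word = re.match(r"(.+) \1", "the the") is not None
same_number = re.match(r"(.+) \1", "55 55") is not None
different = re.match(r"(.+) \1", "the end") is not None

# \b matches only at word boundaries; \B only away from them.
whole_words = re.findall(r"\bcat\b", "cat concat cat's category")
inside_word = re.findall(r"\Bcat", "cat concat")

# \d matches digits and \s whitespace.
numbers = re.findall(r"\d+", "2005-09-28")
fields = re.split(r"\s+", "a \t b\nc")

# ^ follows MULTILINE, while \A always means the start of the string.
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)
string_start = re.findall(r"\A\w+", "one\ntwo", re.MULTILINE)

print(whole_words, inside_word, numbers, fields, line_starts, string_start)
```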
525 | ||
526 | <P> | |
527 | Most of the standard escapes supported by Python string literals are | |
528 | also accepted by the regular expression parser: | |
529 | ||
530 | <P> | |
531 | <div class="verbatim"><pre> | |
532 | \a \b \f \n | |
533 | \r \t \v \x | |
534 | \\ | |
535 | </pre></div> | |
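<P>
Because the regular expression parser accepts these escapes itself, a raw string and an ordinary string literal often produce equivalent patterns, with <code>\b</code> as the notable exception. The sketch below illustrates this; the variable names and sample strings are assumptions made for the example.

```python
import re

# r"\n" is two characters that the RE parser turns into a newline;
# "\n" is already a newline when the RE parser sees it.
raw_is_two_chars = len(r"\n") == 2
raw_newline = re.match(r"\n", "\n") is not None
plain_newline = re.match("\n", "\n") is not None

# "\b" in an ordinary literal is a backspace character, so the pattern
# looks for a backspace; r"\b" reaches the RE parser as a word boundary.
backspace_pattern = re.search("\bword", "a word")
boundary_pattern = re.search(r"\bword", "a word")

# \x takes a hexadecimal character code.
hex_escape = re.match(r"\x41", "A") is not None

# Matching one literal backslash needs four backslashes without a raw string.
same_pattern = "\\\\" == r"\\"
backslash = re.match(r"\\", "\\") is not None

print(raw_newline, plain_newline, backspace_pattern, boundary_pattern)
```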
536 | ||
537 | <P> | |
538 | Octal escapes are included in a limited form: If the first digit is a | |
539 | 0, or if there are three octal digits, it is considered an octal | |
540 | escape. Otherwise, it is a group reference. As for string literals, | |
541 | octal escapes are always at most three digits in length. | |
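<P>
The distinction between octal escapes and group references can be illustrated as follows. This is a sketch with illustrative variable names, assuming a current Python interpreter.

```python
import re

# Three octal digits: \101 is the character with octal value 101, "A".
octal_three = re.match(r"\101", "A") is not None

# A leading 0 also marks an octal escape: \07 is the BEL character.
octal_zero = re.match(r"\07", "\a") is not None

# One or two digits without a leading 0 refer to a group.
group_ref = re.match(r"(a)\1", "aa") is not None
not_a_char = re.match(r"(a)\1", "a\x01") is not None

# Inside a character class, numeric escapes are always characters.
in_class = re.match(r"[\1]", "\x01") is not None

print(octal_three, octal_zero, group_ref, not_a_char, in_class)
```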
542 | ||
543 | <P> | |
544 | ||
545 | <DIV CLASS="navigation"> | |
578 | <hr /> | |
579 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> | |
580 | </DIV> | |
581 | <!--End of Navigation Panel--> | |
582 | <ADDRESS> | |
583 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. | |
584 | </ADDRESS> | |
585 | </BODY> | |
586 | </HTML> |