| 1 | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> |
| 2 | <html> |
| 3 | <head> |
| 4 | <link rel="STYLESHEET" href="lib.css" type='text/css' /> |
| 5 | <link rel="SHORTCUT ICON" href="../icons/pyfav.png" type="image/png" /> |
| 6 | <link rel='start' href='../index.html' title='Python Documentation Index' /> |
| 7 | <link rel="first" href="lib.html" title='Python Library Reference' /> |
| 8 | <link rel='contents' href='contents.html' title="Contents" /> |
| 9 | <link rel='index' href='genindex.html' title='Index' /> |
| 10 | <link rel='last' href='about.html' title='About this document...' /> |
| 11 | <link rel='help' href='about.html' title='About this document...' /> |
| 12 | <link rel="next" href="matching-searching.html" /> |
| 13 | <link rel="prev" href="module-re.html" /> |
| 14 | <link rel="parent" href="module-re.html" /> |
| 16 | <meta name='aesop' content='information' /> |
| 17 | <title>4.2.1 Regular Expression Syntax </title> |
| 18 | </head> |
| 19 | <body> |
| 20 | <DIV CLASS="navigation"> |
| 21 | <div id='top-navigation-panel' xml:id='top-navigation-panel'> |
| 22 | <table align="center" width="100%" cellpadding="0" cellspacing="2"> |
| 23 | <tr> |
| 24 | <td class='online-navigation'><a rel="prev" title="4.2 re " |
| 25 | href="module-re.html"><img src='../icons/previous.png' |
| 26 | border='0' height='32' alt='Previous Page' width='32' /></A></td> |
| 27 | <td class='online-navigation'><a rel="parent" title="4.2 re " |
| 28 | href="module-re.html"><img src='../icons/up.png' |
| 29 | border='0' height='32' alt='Up One Level' width='32' /></A></td> |
| 30 | <td class='online-navigation'><a rel="next" title="4.2.2 Matching vs Searching" |
| 31 | href="matching-searching.html"><img src='../icons/next.png' |
| 32 | border='0' height='32' alt='Next Page' width='32' /></A></td> |
| 33 | <td align="center" width="100%">Python Library Reference</td> |
| 34 | <td class='online-navigation'><a rel="contents" title="Table of Contents" |
| 35 | href="contents.html"><img src='../icons/contents.png' |
| 36 | border='0' height='32' alt='Contents' width='32' /></A></td> |
| 37 | <td class='online-navigation'><a href="modindex.html" title="Module Index"><img src='../icons/modules.png' |
| 38 | border='0' height='32' alt='Module Index' width='32' /></a></td> |
| 39 | <td class='online-navigation'><a rel="index" title="Index" |
| 40 | href="genindex.html"><img src='../icons/index.png' |
| 41 | border='0' height='32' alt='Index' width='32' /></A></td> |
| 42 | </tr></table> |
| 43 | <div class='online-navigation'> |
| 44 | <b class="navlabel">Previous:</b> |
| 45 | <a class="sectref" rel="prev" href="module-re.html">4.2 re </A> |
| 46 | <b class="navlabel">Up:</b> |
| 47 | <a class="sectref" rel="parent" href="module-re.html">4.2 re </A> |
| 48 | <b class="navlabel">Next:</b> |
| 49 | <a class="sectref" rel="next" href="matching-searching.html">4.2.2 Matching vs Searching</A> |
| 50 | </div> |
| 51 | <hr /></div> |
| 52 | </DIV> |
| 53 | <!--End of Navigation Panel--> |
| 54 | |
| 55 | <H2><A NAME="SECTION006210000000000000000"></A><A NAME="re-syntax"></A> |
| 56 | <BR> |
| 57 | 4.2.1 Regular Expression Syntax |
| 58 | </H2> |
| 59 | |
| 60 | <P> |
| 61 | A regular expression (or RE) specifies a set of strings that matches |
| 62 | it; the functions in this module let you check if a particular string |
| 63 | matches a given regular expression (or if a given regular expression |
| 64 | matches a particular string, which comes down to the same thing). |
| 65 | |
| 66 | <P> |
| 67 | Regular expressions can be concatenated to form new regular |
| 68 | expressions; if <em>A</em> and <em>B</em> are both regular expressions, |
| 69 | then <em>AB</em> is also a regular expression. In general, if a string |
| 70 | <em>p</em> matches <em>A</em> and another string <em>q</em> matches <em>B</em>, |
| 71 | the string <em>pq</em> will match <em>AB</em>. This holds unless <em>A</em> or |
| 72 | <em>B</em> contains low-precedence operations, boundary conditions between |
| 73 | <em>A</em> and <em>B</em>, or numbered group references. Thus, complex |
| 74 | expressions can easily be constructed from simpler primitive |
| 75 | expressions like the ones described here. For details of the theory |
| 76 | and implementation of regular expressions, consult the Friedl book |
| 77 | referenced above, or almost any textbook about compiler construction. |
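<P>
The concatenation rule and one of its exceptions can be checked directly. This is a minimal sketch, assuming a recent Python 3 interpreter (<code>re.fullmatch</code> is not available in older releases); the patterns and variable names are chosen only for illustration.

```python
import re

# Two simple REs and strings that match them.
A, p = r"ab", "ab"
B, q = r"c+", "ccc"

# Concatenating the REs matches the concatenated strings.
simple = re.fullmatch(A + B, p + q)

# Exception: "|" has low precedence, so "a|b" + "c" parses as "a" or "bc".
A2, p2 = r"a|b", "a"
B2, q2 = r"c", "c"
broken = re.fullmatch(A2 + B2, p2 + q2)                 # "a|bc" vs "ac"
fixed = re.fullmatch("(?:" + A2 + ")" + B2, p2 + q2)    # grouping restores the rule
```

Wrapping the low-precedence part in <code>(?:...)</code> before concatenating is one way to keep the rule intact.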
| 78 | |
| 79 | <P> |
| 80 | A brief explanation of the format of regular expressions follows. For |
| 81 | further information and a gentler presentation, consult the Regular |
| 82 | Expression HOWTO, accessible from <a class="url" href="http://www.python.org/doc/howto/">http://www.python.org/doc/howto/</a>. |
| 83 | |
| 84 | <P> |
| 85 | Regular expressions can contain both special and ordinary characters. |
| 86 | Most ordinary characters, like "<tt class="character">A</tt>", "<tt class="character">a</tt>", or |
| 87 | "<tt class="character">0</tt>", are the simplest regular expressions; they simply match |
| 88 | themselves. You can concatenate ordinary characters, so <tt class="regexp">last</tt> |
| 89 | matches the string <code>'last'</code>. (In the rest of this section, we'll |
| 90 | write REs in <tt class="regexp">this special style</tt>, usually without quotes, and |
| 91 | strings to be matched <code>'in single quotes'</code>.) |
| 92 | |
| 93 | <P> |
| 94 | Some characters, like "<tt class="character">|</tt>" or "<tt class="character">(</tt>", are special. |
| 95 | Special characters either stand for classes of ordinary characters, or |
| 96 | affect how the regular expressions around them are interpreted. |
| 97 | |
| 98 | <P> |
| 99 | The special characters are: |
| 100 | <DL> |
| 101 | <DT><STRONG>"<tt class="character">.</tt>"</STRONG></DT> |
| 102 | <DD>(Dot.) In the default mode, this matches any |
| 103 | character except a newline. If the <tt class="constant">DOTALL</tt> flag has been |
| 104 | specified, this matches any character including a newline. |
| 105 | |
| 106 | <P> |
| 107 | </DD> |
| 108 | <DT><STRONG>"<tt class="character">^</tt>"</STRONG></DT> |
| 109 | <DD>(Caret.) Matches the start of the |
| 110 | string, and in <tt class="constant">MULTILINE</tt> mode also matches immediately |
| 111 | after each newline. |
| 112 | |
| 113 | <P> |
| 114 | </DD> |
| 115 | <DT><STRONG>"<tt class="character">$</tt>"</STRONG></DT> |
| 116 | <DD>Matches the end of the string or just before the |
| 117 | newline at the end of the string, and in <tt class="constant">MULTILINE</tt> mode |
| 118 | also matches before a newline. <tt class="regexp">foo</tt> matches both 'foo' and |
| 119 | 'foobar', while the regular expression <tt class="regexp">foo$</tt> matches only |
| 120 | 'foo'. More interestingly, searching for <tt class="regexp">foo.$</tt> in |
| 121 | 'foo1\nfoo2\n' matches 'foo2' normally, |
| 122 | but 'foo1' in <tt class="constant">MULTILINE</tt> mode. |
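<P>
The difference between the default and <tt class="constant">MULTILINE</tt> behaviour can be reproduced with a short sketch (assuming a current Python interpreter; the variable names are illustrative only):

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: "$" matches at the very end or just before the final newline.
default = re.search(r'foo.$', text).group()

# MULTILINE mode: "$" also matches before every newline.
multiline = re.search(r'foo.$', text, re.MULTILINE).group()

# "^" behaves analogously at the start of each line.
starts = re.findall(r'^foo.', text, re.MULTILINE)
```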
| 123 | |
| 124 | <P> |
| 125 | </DD> |
| 126 | <DT><STRONG>"<tt class="character">*</tt>"</STRONG></DT> |
| 127 | <DD>Causes the resulting RE to |
| 128 | match 0 or more repetitions of the preceding RE, as many repetitions |
| 129 | as are possible. <tt class="regexp">ab*</tt> will |
| 130 | match 'a', 'ab', or 'a' followed by any number of 'b's. |
| 131 | |
| 132 | <P> |
| 133 | </DD> |
| 134 | <DT><STRONG>"<tt class="character">+</tt>"</STRONG></DT> |
| 135 | <DD>Causes the |
| 136 | resulting RE to match 1 or more repetitions of the preceding RE. |
| 137 | <tt class="regexp">ab+</tt> will match 'a' followed by any non-zero number of 'b's; it |
| 138 | will not match just 'a'. |
| 139 | |
| 140 | <P> |
| 141 | </DD> |
| 142 | <DT><STRONG>"<tt class="character">?</tt>"</STRONG></DT> |
| 143 | <DD>Causes the resulting RE to |
| 144 | match 0 or 1 repetitions of the preceding RE. <tt class="regexp">ab?</tt> will |
| 145 | match either 'a' or 'ab'. |
| 146 | |
| 147 | <P> |
| 148 | </DD> |
| 149 | <DT><STRONG><code>*?</code>, <code>+?</code>, <code>??</code></STRONG></DT> |
| 150 | <DD>The "<tt class="character">*</tt>", |
| 151 | "<tt class="character">+</tt>", and "<tt class="character">?</tt>" qualifiers are all <i class="dfn">greedy</i>; they |
| 152 | match as much text as possible. Sometimes this behaviour isn't |
| 153 | desired; if the RE <tt class="regexp">&lt;.*&gt;</tt> is matched against |
| 154 | <code>'&lt;H1&gt;title&lt;/H1&gt;'</code>, it will match the entire string, and not just |
| 155 | <code>'&lt;H1&gt;'</code>. Adding "<tt class="character">?</tt>" after the qualifier makes it |
| 156 | perform the match in <i class="dfn">non-greedy</i> or <i class="dfn">minimal</i> fashion; as |
| 157 | <em>few</em> characters as possible will be matched. Using <tt class="regexp">.*?</tt> |
| 158 | in the previous expression will match only <code>'&lt;H1&gt;'</code>. |
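<P>
A small sketch of greedy versus non-greedy matching on the string used above (the variable names are hypothetical):

```python
import re

text = '<H1>title</H1>'

# Greedy: ".*" runs to the last ">" in the string.
greedy = re.match(r'<.*>', text).group()

# Non-greedy: ".*?" stops at the first ">" that lets the match succeed.
minimal = re.match(r'<.*?>', text).group()
```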
| 159 | |
| 160 | <P> |
| 161 | </DD> |
| 162 | <DT><STRONG><code>{<var>m</var>}</code></STRONG></DT> |
| 163 | <DD>Specifies that exactly <var>m</var> copies of the previous RE should be |
| 164 | matched; fewer matches cause the entire RE not to match. For example, |
| 165 | <tt class="regexp">a{6}</tt> will match exactly six "<tt class="character">a</tt>" characters, but |
| 166 | not five. |
| 167 | |
| 168 | <P> |
| 169 | </DD> |
| 170 | <DT><STRONG><code>{<var>m</var>,<var>n</var>}</code></STRONG></DT> |
| 171 | <DD>Causes the resulting RE to match from |
| 172 | <var>m</var> to <var>n</var> repetitions of the preceding RE, attempting to |
| 173 | match as many repetitions as possible. For example, <tt class="regexp">a{3,5}</tt> |
| 174 | will match from 3 to 5 "<tt class="character">a</tt>" characters. Omitting <var>m</var> |
| 175 | specifies a lower bound of zero, |
| 176 | and omitting <var>n</var> specifies an infinite upper bound. As an |
| 177 | example, <tt class="regexp">a{4,}b</tt> will match <code>aaaab</code> or a thousand |
| 178 | "<tt class="character">a</tt>" characters followed by a <code>b</code>, but not <code>aaab</code>. |
| 179 | The comma may not be omitted or the modifier would be confused with |
| 180 | the previously described form. |
| 181 | |
| 182 | <P> |
| 183 | </DD> |
| 184 | <DT><STRONG><code>{<var>m</var>,<var>n</var>}?</code></STRONG></DT> |
| 185 | <DD>Causes the resulting RE to |
| 186 | match from <var>m</var> to <var>n</var> repetitions of the preceding RE, |
| 187 | attempting to match as <em>few</em> repetitions as possible. This is |
| 188 | the non-greedy version of the previous qualifier. For example, on the |
| 189 | 6-character string <code>'aaaaaa'</code>, <tt class="regexp">a{3,5}</tt> will match 5 |
| 190 | "<tt class="character">a</tt>" characters, while <tt class="regexp">a{3,5}?</tt> will only match 3 |
| 191 | characters. |
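<P>
The counted forms can be compared side by side. This is only a sketch, assuming a Python 3 interpreter for <code>re.fullmatch</code>; the names are made up for the example.

```python
import re

# Greedy and non-greedy bounded repetition on a 6-character string.
greedy = re.match(r'a{3,5}', 'aaaaaa').group()
minimal = re.match(r'a{3,5}?', 'aaaaaa').group()

# {m} requires exactly m copies; five "a" characters are not enough for a{6}.
exact_short = re.fullmatch(r'a{6}', 'aaaaa')

# Omitting the upper bound: at least four "a" characters, then "b".
open_ok = re.fullmatch(r'a{4,}b', 'aaaab')
open_long = re.fullmatch(r'a{4,}b', 'a' * 1000 + 'b')
open_short = re.fullmatch(r'a{4,}b', 'aaab')
```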
| 192 | |
| 193 | <P> |
| 194 | </DD> |
| 195 | <DT><STRONG>"<tt class="character">\</tt>"</STRONG></DT> |
| 196 | <DD>Either escapes special characters (permitting |
| 197 | you to match characters like "<tt class="character">*</tt>", "<tt class="character">?</tt>", and so |
| 198 | forth), or signals a special sequence; special sequences are discussed |
| 199 | below. |
| 200 | |
| 201 | <P> |
| 202 | If you're not using a raw string to |
| 203 | express the pattern, remember that Python also uses the |
| 204 | backslash as an escape character in string literals; if the escape |
| 205 | sequence isn't recognized by Python's parser, the backslash and |
| 206 | subsequent character are included in the resulting string. However, |
| 207 | if Python would recognize the resulting sequence, the backslash should |
| 208 | be doubled. This is complicated and hard to understand, so |
| 209 | it's highly recommended that you use raw strings for all but the |
| 210 | simplest expressions. |
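<P>
To illustrate why raw strings are recommended, the sketch below writes the same RE, one that matches a literal backslash followed by <code>section</code>, both ways. The example text and names are assumptions made for illustration.

```python
import re

# Target text: one backslash followed by the word "section".
text = '\\section'

# The RE needs two backslashes to match one literal backslash.
raw_pattern = r'\\section'      # raw string: written exactly as the RE reads
plain_pattern = '\\\\section'   # ordinary string: every backslash doubled again

same = raw_pattern == plain_pattern
m = re.match(raw_pattern, text)
```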
| 211 | |
| 212 | <P> |
| 213 | </DD> |
| 214 | <DT><STRONG><code>[]</code></STRONG></DT> |
| 215 | <DD>Used to indicate a set of characters. Characters can |
| 216 | be listed individually, or a range of characters can be indicated by |
| 217 | giving two characters and separating them by a "<tt class="character">-</tt>". Special |
| 218 | characters are not active inside sets. For example, <tt class="regexp">[akm$]</tt> |
| 219 | will match any of the characters "<tt class="character">a</tt>", "<tt class="character">k</tt>", |
| 220 | "<tt class="character">m</tt>", or "<tt class="character">$</tt>"; <tt class="regexp">[a-z]</tt> |
| 221 | will match any lowercase letter, and <code>[a-zA-Z0-9]</code> matches any |
| 222 | letter or digit. Character classes such as <code>\w</code> or <code>\S</code> |
| 223 | (defined below) are also acceptable inside a set. If you want to |
| 224 | include a "<tt class="character">]</tt>" or a "<tt class="character">-</tt>" inside a set, precede it with a |
| 225 | backslash, or place it as the first character. The |
| 226 | pattern <tt class="regexp">[]]</tt> will match <code>']'</code>, for example. |
| 227 | |
| 228 | <P> |
| 229 | You can match the characters not within a set by <i class="dfn">complementing</i> |
| 230 | the set. This is indicated by including a |
| 231 | "<tt class="character">^</tt>" as the first character of the set; |
| 232 | "<tt class="character">^</tt>" elsewhere will simply match the |
| 233 | "<tt class="character">^</tt>" character. For example, |
| 234 | <tt class="regexp">[^5]</tt> will match |
| 235 | any character except "<tt class="character">5</tt>", and |
| 236 | <tt class="regexp">[^<code>^</code>]</tt> will match any character |
| 237 | except "<tt class="character">^</tt>". |
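<P>
The set forms described above can be tried out as follows; this is a rough sketch with illustrative input strings, not an exhaustive test of set syntax.

```python
import re

# Characters listed individually; "$" is not special inside a set.
listed = re.findall(r'[akm$]', 'make $5')

# A range of characters.
ranged = re.findall(r'[a-z]', 'aB3c')

# "]" placed first in the set is taken literally.
bracket = re.match(r'[]]', ']')

# A class such as \w may appear inside a set; a trailing "-" is literal.
with_class = re.findall(r'[\w-]', 'a-b c')

# Complemented sets.
not_five = re.findall(r'[^5]', '1525')
not_caret = re.findall(r'[^^]', 'a^b')
```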
| 238 | |
| 239 | <P> |
| 240 | </DD> |
| 241 | <DT><STRONG>"<tt class="character">|</tt>"</STRONG></DT> |
| 242 | <DD><code>A|B</code>, where A and B can be arbitrary REs, |
| 243 | creates a regular expression that will match either A or B. An |
| 244 | arbitrary number of REs can be separated by the "<tt class="character">|</tt>" in this |
| 245 | way. This can be used inside groups (see below) as well. As the target |
| 246 | string is scanned, REs separated by "<tt class="character">|</tt>" are tried from left to |
| 247 | right. When one pattern completely matches, that branch is accepted. |
| 248 | This means that once <code>A</code> matches, <code>B</code> will not be tested further, |
| 249 | even if it would produce a longer overall match. In other words, the |
| 250 | "<tt class="character">|</tt>" operator is never greedy. To match a literal "<tt class="character">|</tt>", |
| 251 | use <tt class="regexp">\|</tt>, or enclose it inside a character class, as in <tt class="regexp">[|]</tt>. |
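<P>
Because alternatives are tried from left to right, their order matters. A minimal sketch (names chosen for illustration):

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first = re.match(r'a|ab', 'ab').group()
second = re.match(r'ab|a', 'ab').group()

# Two ways to match a literal "|".
escaped = re.search(r'\|', 'a|b').group()
in_class = re.split(r'[|]', 'a|b|c')
```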
| 252 | |
| 253 | <P> |
| 254 | </DD> |
| 255 | <DT><STRONG><code>(...)</code></STRONG></DT> |
| 256 | <DD>Matches whatever regular expression is inside the |
| 257 | parentheses, and indicates the start and end of a group; the contents |
| 258 | of a group can be retrieved after a match has been performed, and can |
| 259 | be matched later in the string with the <tt class="regexp">\<var>number</var></tt> special |
| 260 | sequence, described below. To match the literals "<tt class="character">(</tt>" or |
| 261 | "<tt class="character">)</tt>", use <tt class="regexp">\(</tt> or <tt class="regexp">\)</tt>, or enclose them |
| 262 | inside a character class: <tt class="regexp">[(] [)]</tt>. |
| 263 | |
| 264 | <P> |
| 265 | </DD> |
| 266 | <DT><STRONG><code>(?...)</code></STRONG></DT> |
| 267 | <DD>This is an extension notation (a "<tt class="character">?</tt>" |
| 268 | following a "<tt class="character">(</tt>" is not meaningful otherwise). The first |
| 269 | character after the "<tt class="character">?</tt>" |
| 270 | determines what the meaning and further syntax of the construct is. |
| 271 | Extensions usually do not create a new group; |
| 272 | <tt class="regexp">(?P<<var>name</var>>...)</tt> is the only exception to this rule. |
| 273 | Following are the currently supported extensions. |
| 274 | |
| 275 | <P> |
| 276 | </DD> |
| 277 | <DT><STRONG><code>(?iLmsux)</code></STRONG></DT> |
| 278 | <DD>(One or more letters from the set "<tt class="character">i</tt>", |
| 279 | "<tt class="character">L</tt>", "<tt class="character">m</tt>", "<tt class="character">s</tt>", "<tt class="character">u</tt>", |
| 280 | "<tt class="character">x</tt>".) The group matches the empty string; the letters set |
| 281 | the corresponding flags (<tt class="constant">re.I</tt>, <tt class="constant">re.L</tt>, |
| 282 | <tt class="constant">re.M</tt>, <tt class="constant">re.S</tt>, <tt class="constant">re.U</tt>, <tt class="constant">re.X</tt>) |
| 283 | for the entire regular expression. This is useful if you wish to |
| 284 | include the flags as part of the regular expression, instead of |
| 285 | passing a <var>flag</var> argument to the <tt class="function">compile()</tt> function. |
| 286 | |
| 287 | <P> |
| 288 | Note that the <tt class="regexp">(?x)</tt> flag changes how the expression is parsed. |
| 289 | It should be used first in the expression string, or after one or more |
| 290 | whitespace characters. If there are non-whitespace characters before |
| 291 | the flag, the results are undefined. |
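<P>
As a sketch of the equivalence between inline flags and a flag argument (assuming a current Python interpreter, where the inline flag group is placed at the very start of the pattern):

```python
import re

# Flag given inside the pattern.
inline = re.match(r'(?i)spam', 'SPAM')

# The same flag passed as an argument.
explicit = re.match(r'spam', 'SPAM', re.I)

# Several letters may be combined: case-insensitive and DOTALL.
combined = re.match(r'(?is)a.b', 'A\nB')
```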
| 292 | |
| 293 | <P> |
| 294 | </DD> |
| 295 | <DT><STRONG><code>(?:...)</code></STRONG></DT> |
| 296 | <DD>A non-capturing version of regular parentheses. |
| 297 | Matches whatever regular expression is inside the parentheses, but the |
| 298 | substring matched by the |
| 299 | group <em>cannot</em> be retrieved after performing a match or |
| 300 | referenced later in the pattern. |
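<P>
The effect on the retrievable groups can be seen in this small sketch (the patterns are illustrative):

```python
import re

# Ordinary parentheses create numbered groups.
capturing = re.match(r'(ab)+(c)', 'ababc').groups()

# (?:...) still groups for repetition but creates no numbered group.
non_capturing = re.match(r'(?:ab)+(c)', 'ababc').groups()
```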
| 301 | |
| 302 | <P> |
| 303 | </DD> |
| 304 | <DT><STRONG><code>(?P<<var>name</var>>...)</code></STRONG></DT> |
| 305 | <DD>Similar to regular parentheses, but |
| 306 | the substring matched by the group is accessible via the symbolic group |
| 307 | name <var>name</var>. Group names must be valid Python identifiers, and |
| 308 | each group name must be defined only once within a regular expression. A |
| 309 | symbolic group is also a numbered group, just as if the group were not |
| 310 | named. So the group named 'id' in the example below can also be |
| 311 | referenced as the numbered group 1. |
| 312 | |
| 313 | <P> |
| 314 | For example, if the pattern is |
| 315 | <tt class="regexp">(?P&lt;id&gt;[a-zA-Z_]\w*)</tt>, the group can be referenced by its |
| 316 | name in arguments to methods of match objects, such as |
| 317 | <code>m.group('id')</code> or <code>m.end('id')</code>, and also by name in |
| 318 | pattern text (for example, <tt class="regexp">(?P=id)</tt>) and replacement text |
| 319 | (such as <code>\g&lt;id&gt;</code>). |
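<P>
The three ways of referring to a named group might look like this in practice. The input strings and variable names are assumptions made for the sketch.

```python
import re

pattern = re.compile(r'(?P<id>[a-zA-Z_]\w*)')
m = pattern.match('spam_1 = 2')

# By name and by number in match object methods.
by_name = m.group('id')
by_number = m.group(1)
end = m.end('id')

# By name in pattern text: the same identifier repeated after a space.
doubled = re.match(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'the the')

# By name in replacement text.
wrapped = pattern.sub(r'<\g<id>>', 'x y')
```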
| 320 | |
| 321 | <P> |
| 322 | </DD> |
| 323 | <DT><STRONG><code>(?P=<var>name</var>)</code></STRONG></DT> |
| 324 | <DD>Matches whatever text was matched by the |
| 325 | earlier group named <var>name</var>. |
| 326 | |
| 327 | <P> |
| 328 | </DD> |
| 329 | <DT><STRONG><code>(?#...)</code></STRONG></DT> |
| 330 | <DD>A comment; the contents of the parentheses are |
| 331 | simply ignored. |
| 332 | |
| 333 | <P> |
| 334 | </DD> |
| 335 | <DT><STRONG><code>(?=...)</code></STRONG></DT> |
| 336 | <DD>Matches if <tt class="regexp">...</tt> matches next, but doesn't |
| 337 | consume any of the string. This is called a lookahead assertion. For |
| 338 | example, <tt class="regexp">Isaac (?=Asimov)</tt> will match <code>'Isaac '</code> only if it's |
| 339 | followed by <code>'Asimov'</code>. |
| 340 | |
| 341 | <P> |
| 342 | </DD> |
| 343 | <DT><STRONG><code>(?!...)</code></STRONG></DT> |
| 344 | <DD>Matches if <tt class="regexp">...</tt> doesn't match next. This |
| 345 | is a negative lookahead assertion. For example, |
| 346 | <tt class="regexp">Isaac (?!Asimov)</tt> will match <code>'Isaac '</code> only if it's <em>not</em> |
| 347 | followed by <code>'Asimov'</code>. |
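<P>
Both lookahead forms can be sketched with the patterns from the text (the test strings are illustrative):

```python
import re

# Positive lookahead: 'Asimov' must follow but is not consumed.
followed = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')

# Negative lookahead: fails when 'Asimov' follows, succeeds otherwise.
rejected = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
accepted = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
```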
| 348 | |
| 349 | <P> |
| 350 | </DD> |
| 351 | <DT><STRONG><code>(?<=...)</code></STRONG></DT> |
| 352 | <DD>Matches if the current position in the string |
| 353 | is preceded by a match for <tt class="regexp">...</tt> that ends at the current |
| 354 | position. This is called a <i class="dfn">positive lookbehind assertion</i>. |
| 355 | <tt class="regexp">(?<=abc)def</tt> will find a match in "<tt class="samp">abcdef</tt>", since the |
| 356 | lookbehind will back up 3 characters and check if the contained |
| 357 | pattern matches. The contained pattern must only match strings of |
| 358 | some fixed length, meaning that <tt class="regexp">abc</tt> or <tt class="regexp">a|b</tt> are |
| 359 | allowed, but <tt class="regexp">a*</tt> and <tt class="regexp">a{3,4}</tt> are not. Note that |
| 360 | patterns which start with positive lookbehind assertions will never |
| 361 | match at the beginning of the string being searched; you will most |
| 362 | likely want to use the <tt class="function">search()</tt> function rather than the |
| 363 | <tt class="function">match()</tt> function: |
| 364 | |
| 365 | <P> |
| 366 | <div class="verbatim"><pre> |
| 367 | >>> import re |
| 368 | >>> m = re.search('(?<=abc)def', 'abcdef') |
| 369 | >>> m.group(0) |
| 370 | 'def' |
| 371 | </pre></div> |
| 372 | |
| 373 | <P> |
| 374 | This example looks for a word following a hyphen: |
| 375 | |
| 376 | <P> |
| 377 | <div class="verbatim"><pre> |
| 378 | >>> m = re.search(r'(?<=-)\w+', 'spam-egg') |
| 379 | >>> m.group(0) |
| 380 | 'egg' |
| 381 | </pre></div> |
| 382 | |
| 383 | <P> |
| 384 | </DD> |
| 385 | <DT><STRONG><code>(?<!...)</code></STRONG></DT> |
| 386 | <DD>Matches if the current position in the string |
| 387 | is not preceded by a match for <tt class="regexp">...</tt>. This is called a |
| 388 | <i class="dfn">negative lookbehind assertion</i>. Similar to positive lookbehind |
| 389 | assertions, the contained pattern must only match strings of some |
| 390 | fixed length. Patterns which start with negative lookbehind |
| 391 | assertions may match at the beginning of the string being searched. |
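<P>
A minimal sketch of a negative lookbehind, reusing the <tt class="regexp">abc</tt>/<tt class="regexp">def</tt> patterns from the positive case (variable names are hypothetical):

```python
import re

# 'def' preceded by 'abc' is rejected.
blocked = re.search(r'(?<!abc)def', 'abcdef')

# 'def' preceded by anything else is accepted.
allowed = re.search(r'(?<!abc)def', 'xyzdef')

# Nothing precedes 'def' here, so the assertion holds at the start.
at_start = re.match(r'(?<!abc)def', 'def')
```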
| 392 | |
| 393 | <P> |
| 394 | </DD> |
| 395 | <DT><STRONG><code>(?(<var>id/name</var>)yes-pattern|no-pattern)</code></STRONG></DT> |
| 396 | <DD>Tries to match |
| 397 | <tt class="regexp">yes-pattern</tt> if the group with the given <var>id</var> or <var>name</var> |
| 398 | exists, and <tt class="regexp">no-pattern</tt> if it doesn't. <tt class="regexp">|no-pattern</tt> |
| 399 | is optional and can be omitted. For example, |
| 400 | <tt class="regexp">(&lt;)?(\w+@\w+(?:\.\w+)+)(?(1)&gt;)</tt> is a poor email matching |
| 401 | pattern, which will match <code>'&lt;user@host.com&gt;'</code> as well as |
| 402 | <code>'user@host.com'</code>, but not <code>'&lt;user@host.com'</code>. |
| 403 | |
| 404 | <span class="versionnote">New in version 2.4.</span> |
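<P>
The email pattern above can be exercised with <tt class="function">match()</tt>; this is a sketch only, and the variable names are not part of the module.

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

# Opening bracket matched, so the closing bracket is required and present.
bracketed = email.match('<user@host.com>')

# No opening bracket, so group 1 is unset and no closing bracket is required.
bare = email.match('user@host.com')

# Opening bracket without a closing one: no match at the start of the string.
unbalanced = email.match('<user@host.com')
```

Note that <tt class="function">search()</tt> would still find <code>'user@host.com'</code> inside the last string, since it may start after the opening bracket.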
| 405 | |
| 406 | <P> |
| 407 | </DD> |
| 408 | </DL> |
| 409 | |
| 410 | <P> |
| 411 | The special sequences consist of "<tt class="character">\</tt>" and a character from the |
| 412 | list below. If the ordinary character is not on the list, then the |
| 413 | resulting RE will match the second character. For example, |
| 414 | <tt class="regexp">\$</tt> matches the character "<tt class="character">$</tt>". |
| 415 | <DL> |
| 416 | <DT><STRONG><code>\<var>number</var></code></STRONG></DT> |
| 417 | <DD>Matches the contents of the group of the |
| 418 | same number. Groups are numbered starting from 1. For example, |
| 419 | <tt class="regexp">(.+) \1</tt> matches <code>'the the'</code> or <code>'55 55'</code>, but not |
| 420 | <code>'the end'</code> (note |
| 421 | the space after the group). This special sequence can only be used to |
| 422 | match one of the first 99 groups. If the first digit of <var>number</var> |
| 423 | is 0, or <var>number</var> is 3 octal digits long, it will not be interpreted |
| 424 | as a group match, but as the character with octal value <var>number</var>. |
| 425 | Inside the "<tt class="character">[</tt>" and "<tt class="character">]</tt>" of a character class, all numeric |
| 426 | escapes are treated as characters. |
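<P>
A short sketch of a numbered backreference, assuming a Python 3 interpreter for <code>fullmatch</code> (the names are illustrative):

```python
import re

repeat = re.compile(r'(.+) \1')

# The text after the space must repeat what group 1 matched.
words = repeat.fullmatch('the the')
digits = repeat.fullmatch('55 55')
different = repeat.fullmatch('the end')
```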
| 427 | |
| 428 | <P> |
| 429 | </DD> |
| 430 | <DT><STRONG><code>\A</code></STRONG></DT> |
| 431 | <DD>Matches only at the start of the string. |
| 432 | |
| 433 | <P> |
| 434 | </DD> |
| 435 | <DT><STRONG><code>\b</code></STRONG></DT> |
| 436 | <DD>Matches the empty string, but only at the |
| 437 | beginning or end of a word. A word is defined as a sequence of |
| 438 | alphanumeric or underscore characters, so the end of a word is indicated by |
| 439 | whitespace or a non-alphanumeric, non-underscore character. Note that |
| 440 | <code>\b</code> is defined as the boundary between <code>\w</code> and |
| 441 | <code>\W</code>, so the precise set of characters deemed to be alphanumeric depends on the |
| 442 | values of the <code>UNICODE</code> and <code>LOCALE</code> flags. Inside a character |
| 443 | range, <tt class="regexp">\b</tt> represents the backspace character, for compatibility |
| 444 | with Python's string literals. |
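<P>
The two meanings of <code>\b</code> can be sketched as follows (the sample strings are made up for illustration):

```python
import re

# Word boundary: 'foo' counts only where it stands as a whole word.
# 'foobar' and 'foo_1' do not qualify, since 'b' and '_' are word characters.
whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) foo_1')

# Inside a set, \b is the backspace character instead.
backspace = re.search(r'[\b]', 'a\bc')
```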
| 445 | |
| 446 | <P> |
| 447 | </DD> |
| 448 | <DT><STRONG><code>\B</code></STRONG></DT> |
| 449 | <DD>Matches the empty string, but only when it is <em>not</em> |
| 450 | at the beginning or end of a word. This is just the opposite of |
| 451 | <code>\b</code>, so is also subject to the settings of <code>LOCALE</code> and <code>UNICODE</code>. |
| 452 | |
| 453 | <P> |
| 454 | </DD> |
| 455 | <DT><STRONG><code>\d</code></STRONG></DT> |
| 456 | <DD>When the <tt class="constant">UNICODE</tt> flag is not specified, matches |
| 457 | any decimal digit; this is equivalent to the set <tt class="regexp">[0-9]</tt>. |
| 458 | With <tt class="constant">UNICODE</tt>, it will match whatever is classified as a digit |
| 459 | in the Unicode character properties database. |
| 460 | |
| 461 | <P> |
| 462 | </DD> |
| 463 | <DT><STRONG><code>\D</code></STRONG></DT> |
| 464 | <DD>When the <tt class="constant">UNICODE</tt> flag is not specified, matches |
| 465 | any non-digit character; this is equivalent to the set |
| 466 | <tt class="regexp">[^0-9]</tt>. With <tt class="constant">UNICODE</tt>, it will match |
| 467 | anything other than characters marked as digits in the Unicode character |
| 468 | properties database. |
| 469 | |
| 470 | <P> |
| 471 | </DD> |
| 472 | <DT><STRONG><code>\s</code></STRONG></DT> |
| 473 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> |
| 474 | flags are not specified, matches any whitespace character; this is |
| 475 | equivalent to the set <tt class="regexp">[ \t\n\r\f\v]</tt>. |
| 476 | With <tt class="constant">LOCALE</tt>, it will match this set plus whatever characters |
| 477 | are defined as space for the current locale. If <tt class="constant">UNICODE</tt> is set, |
| 478 | this will match the characters <tt class="regexp">[ \t\n\r\f\v]</tt> plus |
| 479 | whatever is classified as space in the Unicode character properties |
| 480 | database. |
| 481 | |
| 482 | <P> |
| 483 | </DD> |
| 484 | <DT><STRONG><code>\S</code></STRONG></DT> |
| 485 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> |
| 486 | flags are not specified, matches any non-whitespace character; this is |
| 487 | equivalent to the set <tt class="regexp">[^ \t\n\r\f\v]</tt>. |
| 488 | With <tt class="constant">LOCALE</tt>, it will match any character not in this set, |
| 489 | and not defined as space in the current locale. If <tt class="constant">UNICODE</tt> |
| 490 | is set, this will match anything other than <tt class="regexp">[ \t\n\r\f\v]</tt> |
| 491 | and characters marked as space in the Unicode character properties database. |
| 492 | |
| 493 | <P> |
| 494 | </DD> |
| 495 | <DT><STRONG><code>\w</code></STRONG></DT> |
| 496 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> |
| 497 | flags are not specified, matches any alphanumeric character and the |
| 498 | underscore; this is equivalent to the set |
| 499 | <tt class="regexp">[a-zA-Z0-9_]</tt>. With <tt class="constant">LOCALE</tt>, it will match the set |
| 500 | <tt class="regexp">[0-9_]</tt> plus whatever characters are defined as alphanumeric for |
| 501 | the current locale. If <tt class="constant">UNICODE</tt> is set, this will match the |
| 502 | characters <tt class="regexp">[0-9_]</tt> plus whatever is classified as alphanumeric |
| 503 | in the Unicode character properties database. |
| 504 | |
| 505 | <P> |
| 506 | </DD> |
| 507 | <DT><STRONG><code>\W</code></STRONG></DT> |
| 508 | <DD>When the <tt class="constant">LOCALE</tt> and <tt class="constant">UNICODE</tt> |
| 509 | flags are not specified, matches any non-alphanumeric character; this |
| 510 | is equivalent to the set <tt class="regexp">[^a-zA-Z0-9_]</tt>. With |
| 511 | <tt class="constant">LOCALE</tt>, it will match any character not in the set |
| 512 | <tt class="regexp">[0-9_]</tt>, and not defined as alphanumeric for the current locale. |
| 513 | If <tt class="constant">UNICODE</tt> is set, this will match anything other than |
| 514 | <tt class="regexp">[0-9_]</tt> and characters marked as alphanumeric in the Unicode |
| 515 | character properties database. |
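<P>
The character-class sequences can be sketched as below. One assumption to note: on Python 3, string patterns use Unicode matching by default, so <code>re.ASCII</code> is passed here to get the behaviour described for the case where <tt class="constant">UNICODE</tt> is not specified.

```python
import re

# \d: runs of decimal digits.
digits = re.findall(r'\d+', 'tel 555 1234', re.ASCII)

# \s: split on any run of whitespace.
fields = re.split(r'\s+', 'a b\tc\nd')

# \w: alphanumerics and the underscore.
words = re.findall(r'\w+', 'foo_bar baz-1', re.ASCII)

# \W: everything that \w does not match.
gaps = re.findall(r'\W', 'a-b c', re.ASCII)
```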
| 516 | |
| 517 | <P> |
| 518 | </DD> |
| 519 | <DT><STRONG><code>\Z</code></STRONG></DT> |
| 520 | <DD>Matches only at the end of the string. |
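<P>
The difference between <code>\A</code>, <code>\Z</code> and the "<tt class="character">^</tt>" and "<tt class="character">$</tt>" characters shows up with newlines. A rough sketch, with illustrative strings:

```python
import re

text = 'foo\n'

# "$" also matches just before a trailing newline; \Z does not.
dollar = re.search(r'foo$', text)
strict_end = re.search(r'foo\Z', text)

lines = 'one\ntwo'

# In MULTILINE mode "^" matches at every line start; \A only at the string start.
caret = re.findall(r'^\w+', lines, re.MULTILINE)
strict_start = re.findall(r'\A\w+', lines, re.MULTILINE)
```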
| 521 | |
| 522 | <P> |
| 523 | </DD> |
| 524 | </DL> |
| 525 | |
| 526 | <P> |
| 527 | Most of the standard escapes supported by Python string literals are |
| 528 | also accepted by the regular expression parser: |
| 529 | |
| 530 | <P> |
| 531 | <div class="verbatim"><pre> |
| 532 | \a \b \f \n |
| 533 | \r \t \v \x |
| 534 | \\ |
| 535 | </pre></div> |
| 536 | |
| 537 | <P> |
| 538 | Octal escapes are included in a limited form: if the first digit is a |
| 539 | 0, or if there are three octal digits, it is considered an octal |
| 540 | escape. Otherwise, it is a group reference. As for string literals, |
| 541 | octal escapes are always at most three digits in length. |
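<P>
The rule that separates octal escapes from group references can be sketched as follows (names chosen for illustration):

```python
import re

# Leading 0: octal escape. \012 is the newline character.
newline = re.match(r'\012', '\n')

# Three octal digits: octal escape. \101 is 'A'.
capital_a = re.match(r'\101', 'A')

# A lone \0 is the NUL character.
nul = re.match(r'\0', '\x00')

# A single non-zero digit: reference to group 1.
backref = re.match(r'(a)\1', 'aa')
```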
| 542 | |
| 543 | <P> |
| 544 | |
| 545 | <DIV CLASS="navigation"> |
| 578 | <hr /> |
| 579 | <span class="release-info">Release 2.4.2, documentation updated on 28 September 2005.</span> |
| 580 | </DIV> |
| 581 | <!--End of Navigation Panel--> |
| 582 | <ADDRESS> |
| 583 | See <i><a href="about.html">About this document...</a></i> for information on suggesting changes. |
| 584 | </ADDRESS> |
| 585 | </BODY> |
| 586 | </HTML> |