| 1 | =head1 NAME |
| 2 | |
| 3 | perlrequick - Perl regular expressions quick start |
| 4 | |
| 5 | =head1 DESCRIPTION |
| 6 | |
| 7 | This page covers the very basics of understanding, creating and |
| 8 | using regular expressions ('regexes') in Perl. |
| 9 | |
| 10 | |
| 11 | =head1 The Guide |
| 12 | |
| 13 | =head2 Simple word matching |
| 14 | |
| 15 | The simplest regex is simply a word, or more generally, a string of |
| 16 | characters. A regex consisting of a word matches any string that |
| 17 | contains that word: |
| 18 | |
| 19 | "Hello World" =~ /World/; # matches |
| 20 | |
| 21 | In this statement, C<World> is a regex and the C<//> enclosing |
| 22 | C</World/> tells perl to search a string for a match. The operator |
| 23 | C<=~> associates the string with the regex match and produces a true |
| 24 | value if the regex matched, or false if the regex did not match. In |
| 25 | our case, C<World> matches the second word in C<"Hello World">, so the |
| 26 | expression is true. This idea has several variations. |
| 27 | |
| 28 | Expressions like this are useful in conditionals: |
| 29 | |
| 30 | print "It matches\n" if "Hello World" =~ /World/; |
| 31 | |
| 32 | The sense of the match can be reversed by using the C<!~> operator: |
| 33 | |
| 34 | print "It doesn't match\n" if "Hello World" !~ /World/; |
| 35 | |
| 36 | The literal string in the regex can be replaced by a variable: |
| 37 | |
| 38 | $greeting = "World"; |
| 39 | print "It matches\n" if "Hello World" =~ /$greeting/; |
| 40 | |
| 41 | If you're matching against C<$_>, the C<$_ =~> part can be omitted: |
| 42 | |
| 43 | $_ = "Hello World"; |
| 44 | print "It matches\n" if /World/; |
| 45 | |
| 46 | Finally, the C<//> default delimiters for a match can be changed to |
| 47 | arbitrary delimiters by putting an C<'m'> out front: |
| 48 | |
| 49 | "Hello World" =~ m!World!; # matches, delimited by '!' |
| 50 | "Hello World" =~ m{World}; # matches, note the matching '{}' |
| 51 | "/usr/bin/perl" =~ m"/perl"; # matches after '/usr/bin', |
| 52 | # '/' becomes an ordinary char |
| 53 | |
| 54 | Regexes must match a part of the string I<exactly> in order for the |
| 55 | statement to be true: |
| 56 | |
| 57 | "Hello World" =~ /world/; # doesn't match, case sensitive |
| 58 | "Hello World" =~ /o W/; # matches, ' ' is an ordinary char |
| 59 | "Hello World" =~ /World /; # doesn't match, no ' ' at end |
| 60 | |
| 61 | perl will always match at the earliest possible point in the string: |
| 62 | |
| 63 | "Hello World" =~ /o/; # matches 'o' in 'Hello' |
| 64 | "That hat is red" =~ /hat/; # matches 'hat' in 'That' |
| 65 | |
| 66 | Not all characters can be used 'as is' in a match. Some characters, |
| 67 | called B<metacharacters>, are reserved for use in regex notation. |
| 68 | The metacharacters are |
| 69 | |
| 70 | {}[]()^$.|*+?\ |
| 71 | |
| 72 | A metacharacter can be matched by putting a backslash before it: |
| 73 | |
| 74 | "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter |
| 75 | "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary + |
| 76 | 'C:\WIN32' =~ /C:\\WIN/; # matches |
| 77 | "/usr/bin/perl" =~ /\/usr\/bin\/perl/; # matches |
| 78 | |
| 79 | In the last regex, the forward slash C<'/'> is also backslashed, |
| 80 | because it is used to delimit the regex. |
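Backslashing by hand becomes error-prone when the text to match arrives in a variable. As a minimal sketch, assuming a hypothetical variable C<$literal> that holds text to be matched verbatim, Perl's built-in C<quotemeta> function (or the equivalent C<\Q...\E> escape inside the regex) adds the backslashes for you:

```perl
use strict;
use warnings;

# $literal is a hypothetical string we want to match verbatim.
my $literal = '2+2';
my $text    = '2+2=4';

# quotemeta() backslashes every non-word character, so '+' becomes '\+'.
my $escaped = quotemeta $literal;

my $plain  = $text =~ /$literal/     ? 1 : 0;  # '+' acts as a quantifier here
my $quoted = $text =~ /$escaped/     ? 1 : 0;  # '+' is matched literally
my $inline = $text =~ /\Q$literal\E/ ? 1 : 0;  # same effect, written inline

print "plain=$plain quoted=$quoted inline=$inline\n";
```

Only the two quoted forms match, because the unquoted C<+> asks for one or more C<2> characters followed by another C<2>.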
| 81 | |
| 82 | Non-printable ASCII characters are represented by B<escape sequences>. |
| 83 | Common examples are C<\t> for a tab, C<\n> for a newline, and C<\r> |
| 84 | for a carriage return. Arbitrary bytes are represented by octal |
| 85 | escape sequences, e.g., C<\033>, or hexadecimal escape sequences, |
| 86 | e.g., C<\x1B>: |
| 87 | |
| 88 | "1000\t2000" =~ m(0\t2); # matches |
| 89 | "cat" =~ /\143\x61\x74/; # matches, but a weird way to spell cat |
| 90 | |
| 91 | Regexes are treated mostly as double quoted strings, so variable |
| 92 | substitution works: |
| 93 | |
| 94 | $foo = 'house'; |
| 95 | 'cathouse' =~ /cat$foo/; # matches |
| 96 | 'housecat' =~ /${foo}cat/; # matches |
| 97 | |
| 98 | With all of the regexes above, if the regex matched anywhere in the |
| 99 | string, it was considered a match. To specify I<where> it should |
| 100 | match, we would use the B<anchor> metacharacters C<^> and C<$>. The |
| 101 | anchor C<^> means match at the beginning of the string and the anchor |
| 102 | C<$> means match at the end of the string, or before a newline at the |
| 103 | end of the string. Some examples: |
| 104 | |
| 105 | "housekeeper" =~ /keeper/; # matches |
| 106 | "housekeeper" =~ /^keeper/; # doesn't match |
| 107 | "housekeeper" =~ /keeper$/; # matches |
| 108 | "housekeeper\n" =~ /keeper$/; # matches |
| 109 | "housekeeper" =~ /^housekeeper$/; # matches |
| 110 | |
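Using both anchors together is a common way to test a whole string rather than a part of it. The sketch below, with made-up inputs, illustrates that C<$> tolerates a single trailing newline but nothing more:

```perl
use strict;
use warnings;

# Hypothetical inputs: the same word with and without trailing newlines.
my @inputs = ("keeper", "keeper\n", "housekeeper", "keeper\n\n");

my @whole;   # inputs that are exactly 'keeper', optionally followed by one "\n"
for my $s (@inputs) {
    push @whole, $s if $s =~ /^keeper$/;
}

printf "%d of %d inputs matched\n", scalar @whole, scalar @inputs;
```

Here only the first two inputs match: C<"housekeeper"> fails at C<^>, and C<"keeper\n\n"> fails because C<$> only matches before a newline that ends the string.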
| 111 | =head2 Using character classes |
| 112 | |
| 113 | A B<character class> allows a set of possible characters, rather than |
| 114 | just a single character, to match at a particular point in a regex. |
| 115 | Character classes are denoted by brackets C<[...]>, with the set of |
| 116 | characters to be possibly matched inside. Here are some examples: |
| 117 | |
| 118 | /cat/; # matches 'cat' |
| 119 | /[bcr]at/; # matches 'bat', 'cat', or 'rat' |
| 120 | "abc" =~ /[cab]/; # matches 'a' |
| 121 | |
| 122 | In the last statement, even though C<'c'> is the first character in |
| 123 | the class, the earliest point at which the regex can match is C<'a'>. |
| 124 | |
| 125 | /[yY][eE][sS]/; # match 'yes' in a case-insensitive way |
| 126 | # 'yes', 'Yes', 'YES', etc. |
| 127 | /yes/i; # also match 'yes' in a case-insensitive way |
| 128 | |
| 129 | The last example shows a match with an C<'i'> B<modifier>, which makes |
| 130 | the match case-insensitive. |
| 131 | |
| 132 | Character classes also have ordinary and special characters, but the |
| 133 | sets of ordinary and special characters inside a character class are |
| 134 | different from those outside a character class. The special |
| 135 | characters for a character class are C<-]\^$> and are matched using an |
| 136 | escape: |
| 137 | |
| 138 | /[\]c]def/; # matches ']def' or 'cdef' |
| 139 | $x = 'bcr'; |
| 140 | /[$x]at/; # matches 'bat', 'cat', or 'rat' |
| 141 | /[\$x]at/; # matches '$at' or 'xat' |
| 142 | /[\\$x]at/; # matches '\at', 'bat', 'cat', or 'rat' |
| 143 | |
| 144 | The special character C<'-'> acts as a range operator within character |
| 145 | classes, so that the unwieldy C<[0123456789]> and C<[abc...xyz]> |
| 146 | become the svelte C<[0-9]> and C<[a-z]>: |
| 147 | |
| 148 | /item[0-9]/; # matches 'item0' or ... or 'item9' |
| 149 | /[0-9a-fA-F]/; # matches a hexadecimal digit |
| 150 | |
| 151 | If C<'-'> is the first or last character in a character class, it is |
| 152 | treated as an ordinary character. |
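A short sketch of the two roles of C<'-'>, using illustrative sample strings and the built-in C<grep> to keep the strings that match:

```perl
use strict;
use warnings;

# '-' comes first in the class, so it is an ordinary character here.
my @signed = grep { /^[-+][0-9]/ } ('-5', '+7', '42', 'x-1');

# Between two characters, '-' forms a range: [a-c] is 'a', 'b', or 'c'.
my @ranged = grep { /^[a-c]/ } ('apple', 'cherry', 'date', '-dash');

print "signed: @signed\n";   # signed: -5 +7
print "ranged: @ranged\n";   # ranged: apple cherry
```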
| 153 | |
| 154 | The special character C<^> in the first position of a character class |
| 155 | denotes a B<negated character class>, which matches any character but |
| 156 | those in the brackets. Both C<[...]> and C<[^...]> must match a |
| 157 | character, or the match fails. For example: |
| 158 | |
| 159 | /[^a]at/; # doesn't match 'aat' or 'at', but matches |
| 160 | # all other 'bat', 'cat', '0at', '%at', etc. |
| 161 | /[^0-9]/; # matches a non-numeric character |
| 162 | /[a^]at/; # matches 'aat' or '^at'; here '^' is ordinary |
| 163 | |
| 164 | Perl has several abbreviations for common character classes: |
| 165 | |
| 166 | =over 4 |
| 167 | |
| 168 | =item * |
| 169 | |
| 170 | \d is a digit and represents [0-9] |
| 171 | |
| 172 | =item * |
| 173 | |
| 174 | \s is a whitespace character and represents [\ \t\r\n\f] |
| 175 | |
| 176 | =item * |
| 177 | |
| 178 | \w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_] |
| 179 | |
| 180 | =item * |
| 181 | |
| 182 | \D is a negated \d; it represents any character but a digit [^0-9] |
| 183 | |
| 184 | =item * |
| 185 | |
| 186 | \S is a negated \s; it represents any non-whitespace character [^\s] |
| 187 | |
| 188 | =item * |
| 189 | |
| 190 | \W is a negated \w; it represents any non-word character [^\w] |
| 191 | |
| 192 | =item * |
| 193 | |
| 194 | The period '.' matches any character but "\n" |
| 195 | |
| 196 | =back |
| 197 | |
| 198 | The C<\d\s\w\D\S\W> abbreviations can be used both inside and outside |
| 199 | of character classes. Here are some in use: |
| 200 | |
| 201 | /\d\d:\d\d:\d\d/; # matches a hh:mm:ss time format |
| 202 | /[\d\s]/; # matches any digit or whitespace character |
| 203 | /\w\W\w/; # matches a word char, followed by a |
| 204 | # non-word char, followed by a word char |
| 205 | /..rt/; # matches any two chars, followed by 'rt' |
| 206 | /end\./; # matches 'end.' |
| 207 | /end[.]/; # same thing, matches 'end.' |
| 208 | |
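The abbreviations describe the shape of the text, not its meaning. As a small illustration with invented sample strings, an anchored C<\d\d:\d\d:\d\d> accepts anything that looks like a time, including values no clock would show:

```perl
use strict;
use warnings;

# Hypothetical candidate strings.
my @candidates = ('12:34:56', '1:23:45', 'ab:cd:ef', '12:34:56 PM', '99:99:99');

# Anchored so that the whole string must have the hh:mm:ss shape.
my @ok = grep { /^\d\d:\d\d:\d\d$/ } @candidates;

print "accepted: @ok\n";   # accepted: 12:34:56 99:99:99
```

Range checks such as "hours below 24" would need extra logic on top of the regex.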
| 209 | The S<B<word anchor> > C<\b> matches a boundary between a word |
| 210 | character and a non-word character C<\w\W> or C<\W\w>: |
| 211 | |
| 212 | $x = "Housecat catenates house and cat"; |
| 213 | $x =~ /\bcat/; # matches cat in 'catenates' |
| 214 | $x =~ /cat\b/; # matches cat in 'Housecat' |
| 215 | $x =~ /\bcat\b/; # matches 'cat' at end of string |
| 216 | |
| 217 | In the last example, the end of the string is considered a word |
| 218 | boundary. |
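To see where each of these regexes matches, one option is to inspect the built-in variable C<$-[0]>, which holds the offset of the start of the last successful match. This is only a sketch; the C<qr//> operator is used here simply to keep the three regexes in a list:

```perl
use strict;
use warnings;

my $x = "Housecat catenates house and cat";

my @starts;
for my $re (qr/\bcat/, qr/cat\b/, qr/\bcat\b/) {
    # $-[0] is the offset at which the whole match began.
    push @starts, $-[0] if $x =~ $re;
}

print "match offsets: @starts\n";   # match offsets: 9 5 29
```

Offset 9 is the start of C<catenates>, offset 5 is the C<cat> inside C<Housecat>, and offset 29 is the final word.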
| 219 | |
| 220 | =head2 Matching this or that |
| 221 | |
| 222 | We can match different character strings with the B<alternation> |
| 223 | metacharacter C<'|'>. To match C<dog> or C<cat>, we form the regex |
| 224 | C<dog|cat>. As before, perl will try to match the regex at the |
| 225 | earliest possible point in the string. At each character position, |
| 226 | perl will first try to match the first alternative, C<dog>. If |
| 227 | C<dog> doesn't match, perl will then try the next alternative, C<cat>. |
| 228 | If C<cat> doesn't match either, then the match fails and perl moves to |
| 229 | the next position in the string. Some examples: |
| 230 | |
| 231 | "cats and dogs" =~ /cat|dog|bird/; # matches "cat" |
| 232 | "cats and dogs" =~ /dog|cat|bird/; # matches "cat" |
| 233 | |
| 234 | Even though C<dog> is the first alternative in the second regex, |
| 235 | C<cat> is able to match earlier in the string. |
| 236 | |
| 237 | "cats" =~ /c|ca|cat|cats/; # matches "c" |
| 238 | "cats" =~ /cats|cat|ca|c/; # matches "cats" |
| 239 | |
| 240 | At a given character position, the first alternative that allows the |
| 241 | regex match to succeed will be the one that matches. Here, all the |
| 242 | alternatives match at the first string position, so the first matches. |
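The effect of ordering can be checked with the built-in variable C<$&>, which holds the text matched by the last successful regex. A minimal sketch:

```perl
use strict;
use warnings;

my $text = "cats";
my ($short_first, $long_first);

# The condition runs first, so $& already holds the matched text.
$short_first = $& if $text =~ /c|ca|cat|cats/;
$long_first  = $& if $text =~ /cats|cat|ca|c/;

print "short first: $short_first\n";   # short first: c
print "long first:  $long_first\n";    # long first:  cats
```

When one alternative is a prefix of another, listing the longer one first is the usual way to let it win.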
| 243 | |
| 244 | =head2 Grouping things and hierarchical matching |
| 245 | |
| 246 | The B<grouping> metacharacters C<()> allow a part of a regex to be |
| 247 | treated as a single unit. Parts of a regex are grouped by enclosing |
| 248 | them in parentheses. The regex C<house(cat|keeper)> means match |
| 249 | C<house> followed by either C<cat> or C<keeper>. Some more examples |
| 250 | are |
| 251 | |
| 252 | /(a|b)b/; # matches 'ab' or 'bb' |
| 253 | /(^a|b)c/; # matches 'ac' at start of string or 'bc' anywhere |
| 254 | |
| 255 | /house(cat|)/; # matches either 'housecat' or 'house' |
| 256 | /house(cat(s|)|)/; # matches either 'housecats' or 'housecat' or |
| 257 | # 'house'. Note groups can be nested. |
| 258 | |
| 259 | "20" =~ /(19|20|)\d\d/; # matches the null alternative '()\d\d', |
| 260 | # because '20\d\d' can't match |
| 261 | |
| 262 | =head2 Extracting matches |
| 263 | |
| 264 | The grouping metacharacters C<()> also allow the extraction of the |
| 265 | parts of a string that matched. For each grouping, the part that |
| 266 | matched inside goes into the special variables C<$1>, C<$2>, etc. |
| 267 | They can be used just as ordinary variables: |
| 268 | |
| 269 | # extract hours, minutes, seconds |
| 270 | $time =~ /(\d\d):(\d\d):(\d\d)/; # match hh:mm:ss format |
| 271 | $hours = $1; |
| 272 | $minutes = $2; |
| 273 | $seconds = $3; |
| 274 | |
| 275 | In list context, a match C</regex/> with groupings will return the |
| 276 | list of matched values C<($1,$2,...)>. So we could rewrite it as |
| 277 | |
| 278 | ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/); |
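Because a failed match leaves C<$1>, C<$2>, ... holding whatever an earlier successful match put there, it is usually safer to test the match before using the captured values. The sketch below wraps the list-context form in a hypothetical helper, C<parse_time>, which is not part of Perl:

```perl
use strict;
use warnings;

# parse_time() is a hypothetical helper written for this example.
sub parse_time {
    my ($time) = @_;
    # In list context the match returns ($1, $2, $3), or an empty list on
    # failure; the list assignment is then false and we return nothing.
    my ($hours, $minutes, $seconds) = $time =~ /^(\d\d):(\d\d):(\d\d)$/
        or return;
    return ($hours, $minutes, $seconds);
}

my @good = parse_time('12:34:56');
my @bad  = parse_time('noon');

print "good: @good\n";                      # good: 12 34 56
print "bad count: ", scalar(@bad), "\n";    # bad count: 0
```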
| 279 | |
| 280 | If the groupings in a regex are nested, C<$1> gets the group with the |
| 281 | leftmost opening parenthesis, C<$2> the next opening parenthesis, |
| 282 | etc. For example, here is a complex regex and the matching variables |
| 283 | indicated below it: |
| 284 | |
| 285 | /(ab(cd|ef)((gi)|j))/; |
| 286 | # $1 is the whole '(ab...)', $2 is '(cd|ef)', $3 is '((gi)|j)', $4 is '(gi)' |
| 287 | |
| 288 | Associated with the matching variables C<$1>, C<$2>, ... are |
| 289 | the B<backreferences> C<\1>, C<\2>, ... Backreferences are |
| 290 | matching variables that can be used I<inside> a regex: |
| 291 | |
| 292 | /(\w\w\w)\s\1/; # find sequences like 'the the' in string |
| 293 | |
| 294 | C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>, |
| 295 | C<\2>, ... only inside a regex. |
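A backreference matches the same text wherever it occurs, including inside longer words. The following sketch, with made-up sample strings, shows the three-letter pattern from above and a variant that uses C<\b> to insist on whole words:

```perl
use strict;
use warnings;

# Finds the doubled word and reports it through $1.
my $doubled = "Paris in the the spring" =~ /\b(\w\w\w)\s\1\b/ ? $1 : 'none';

# Without word anchors, 'the the' is also found inside 'bathe then'.
my $loose  = "bathe then" =~ /(\w\w\w)\s\1/     ? 1 : 0;
my $strict = "bathe then" =~ /\b(\w\w\w)\s\1\b/ ? 1 : 0;

print "doubled=$doubled loose=$loose strict=$strict\n";
# doubled=the loose=1 strict=0
```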
| 296 | |
| 297 | =head2 Matching repetitions |
| 298 | |
| 299 | The B<quantifier> metacharacters C<?>, C<*>, C<+>, and C<{}> allow us |
| 300 | to determine the number of repeats of a portion of a regex we |
| 301 | consider to be a match. Quantifiers are put immediately after the |
| 302 | character, character class, or grouping that we want to specify. They |
| 303 | have the following meanings: |
| 304 | |
| 305 | =over 4 |
| 306 | |
| 307 | =item * |
| 308 | |
| 309 | C<a?> = match 'a' 1 or 0 times |
| 310 | |
| 311 | =item * |
| 312 | |
| 313 | C<a*> = match 'a' 0 or more times, i.e., any number of times |
| 314 | |
| 315 | =item * |
| 316 | |
| 317 | C<a+> = match 'a' 1 or more times, i.e., at least once |
| 318 | |
| 319 | =item * |
| 320 | |
| 321 | C<a{n,m}> = match at least C<n> times, but not more than C<m> |
| 322 | times. |
| 323 | |
| 324 | =item * |
| 325 | |
| 326 | C<a{n,}> = match at least C<n> times |
| 327 | |
| 328 | =item * |
| 329 | |
| 330 | C<a{n}> = match exactly C<n> times |
| 331 | |
| 332 | =back |
| 333 | |
| 334 | Here are some examples: |
| 335 | |
| 336 | /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and |
| 337 | # any number of digits |
| 338 | /(\w+)\s+\1/; # match doubled words of arbitrary length |
| 339 | $year =~ /^\d{2,4}$/; # make sure year is at least 2 but not more |
| 340 | # than 4 digits |
| 341 | $year =~ /^(\d{4}|\d{2})$/; # better match; throw out 3 digit dates |
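Since a regex may match anywhere in a string, a quantifier limits the length of the whole string only when the regex is anchored with C<^> and C<$>. A small comparison on illustrative inputs:

```perl
use strict;
use warnings;

my @years = ('99', '1999', '123', '12345');   # hypothetical inputs

# Unanchored: any string containing two digits in a row matches.
my @loose = grep { /\d{4}|\d{2}/ } @years;

# Anchored: the whole string must be exactly 4 or exactly 2 digits.
# The parentheses keep '^' and '$' applying to both alternatives.
my @strict = grep { /^(\d{4}|\d{2})$/ } @years;

print "loose:  @loose\n";    # loose:  99 1999 123 12345
print "strict: @strict\n";   # strict: 99 1999
```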
| 342 | |
| 343 | These quantifiers will try to match as much of the string as possible, |
| 344 | while still allowing the regex to match. So we have |
| 345 | |
| 346 | $x = 'the cat in the hat'; |
| 347 | $x =~ /^(.*)(at)(.*)$/; # matches, |
| 348 | # $1 = 'the cat in the h' |
| 349 | # $2 = 'at' |
| 350 | # $3 = '' (0 matches) |
| 351 | |
| 352 | The first quantifier C<.*> grabs as much of the string as possible |
| 353 | while still having the regex match. The second quantifier C<.*> has |
| 354 | no string left to it, so it matches 0 times. |
| 355 | |
| 356 | =head2 More matching |
| 357 | |
| 358 | There are a few more things you might want to know about matching |
| 359 | operators. In the code |
| 360 | |
| 361 | $pattern = 'Seuss'; |
| 362 | while (<>) { |
| 363 | print if /$pattern/; |
| 364 | } |
| 365 | |
| 366 | perl has to re-evaluate C<$pattern> each time through the loop. If |
| 367 | C<$pattern> won't be changing, use the C<//o> modifier, to only |
| 368 | perform variable substitutions once. If you don't want any |
| 369 | substitutions at all, use the special delimiter C<m''>: |
| 370 | |
| 371 | $pattern = 'Seuss'; |
| 372 | m'$pattern'; # matches '$pattern', not 'Seuss' |
| 373 | |
| 374 | The global modifier C<//g> allows the matching operator to match |
| 375 | within a string as many times as possible. In scalar context, |
| 376 | successive matches against a string will have C<//g> jump from match |
| 377 | to match, keeping track of position in the string as it goes along. |
| 378 | You can get or set the position with the C<pos()> function. |
| 379 | For example, |
| 380 | |
| 381 | $x = "cat dog house"; # 3 words |
| 382 | while ($x =~ /(\w+)/g) { |
| 383 | print "Word is $1, ends at position ", pos $x, "\n"; |
| 384 | } |
| 385 | |
| 386 | prints |
| 387 | |
| 388 | Word is cat, ends at position 3 |
| 389 | Word is dog, ends at position 7 |
| 390 | Word is house, ends at position 13 |
| 391 | |
| 392 | A failed match or changing the target string resets the position. If |
| 393 | you don't want the position reset after failure to match, add the |
| 394 | C<//c> modifier, as in C</regex/gc>. |
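The difference between C<//g> and C<//gc> after a failed match can be observed with C<pos()>. This is a sketch only; the variable names are illustrative:

```perl
use strict;
use warnings;

my $x = "cat dog house";

my $ok = $x =~ /(\w+)/g;         # scalar context: matches 'cat'
my $pos_after_match = pos $x;    # 3

$ok = $x =~ /\d+/g;              # no digits: fails and resets the position
my $pos_after_fail = pos $x;     # undef

$ok = $x =~ /(\w+)/g;            # starts over: matches 'cat' again
$ok = $x =~ /\d+/gc;             # fails, but //c keeps the position
my $pos_after_gc = pos $x;       # 3

$ok = $x =~ /(\w+)/g;            # resumes after 'cat': matches 'dog'
my $next_word = $1;

printf "after match: %s, after failure: %s, after gc failure: %s, next: %s\n",
    $pos_after_match,
    defined $pos_after_fail ? $pos_after_fail : 'undef',
    $pos_after_gc, $next_word;
```

Note that the position belongs to the string C<$x>, not to any particular regex, so different regexes can take turns consuming the same string.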
| 395 | |
| 396 | In list context, C<//g> returns a list of matched groupings, or if |
| 397 | there are no groupings, a list of matches to the whole regex. So |
| 398 | |
| 399 | @words = ($x =~ /(\w+)/g); # matches, |
| 400 | # $words[0] = 'cat' |
| 401 | # $words[1] = 'dog' |
| 402 | # $words[2] = 'house' |
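When the regex has more than one grouping, list-context C<//g> returns the groups of every match in order. One use, sketched here on an invented C<key=value> string, is to fill a hash directly:

```perl
use strict;
use warnings;

my $config = "host=localhost port=8080 user=guest";   # hypothetical input

# Each match contributes ($1, $2), so the flat list is key, value, key, value...
my %setting = $config =~ /(\w+)=(\w+)/g;

print join(", ", map { "$_=$setting{$_}" } sort keys %setting), "\n";
# host=localhost, port=8080, user=guest
```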
| 403 | |
| 404 | =head2 Search and replace |
| 405 | |
| 406 | Search and replace is performed using C<s/regex/replacement/modifiers>. |
| 407 | The C<replacement> is a Perl double quoted string that replaces in the |
| 408 | string whatever is matched with the C<regex>. The operator C<=~> is |
| 409 | also used here to associate a string with C<s///>. If matching |
| 410 | against C<$_>, the S<C<$_ =~> > can be dropped. If there is a match, |
| 411 | C<s///> returns the number of substitutions made, otherwise it returns |
| 412 | false. Here are a few examples: |
| 413 | |
| 414 | $x = "Time to feed the cat!"; |
| 415 | $x =~ s/cat/hacker/; # $x contains "Time to feed the hacker!" |
| 416 | $y = "'quoted words'"; |
| 417 | $y =~ s/^'(.*)'$/$1/; # strip single quotes, |
| 418 | # $y contains "quoted words" |
| 419 | |
| 420 | With the C<s///> operator, the matched variables C<$1>, C<$2>, etc. |
| 421 | are immediately available for use in the replacement expression. With |
| 422 | the global modifier, C<s///g> will search and replace all occurrences |
| 423 | of the regex in the string: |
| 424 | |
| 425 | $x = "I batted 4 for 4"; |
| 426 | $x =~ s/4/four/; # $x contains "I batted four for 4" |
| 427 | $x = "I batted 4 for 4"; |
| 428 | $x =~ s/4/four/g; # $x contains "I batted four for four" |
| 429 | |
| 430 | The evaluation modifier C<s///e> wraps an C<eval{...}> around the |
| 431 | replacement string and the evaluated result is substituted for the |
| 432 | matched substring. Some examples: |
| 433 | |
| 434 | # reverse all the words in a string |
| 435 | $x = "the cat in the hat"; |
| 436 | $x =~ s/(\w+)/reverse $1/ge; # $x contains "eht tac ni eht tah" |
| 437 | |
| 438 | # convert percentage to decimal |
| 439 | $x = "A 39% hit rate"; |
| 440 | $x =~ s!(\d+)%!$1/100!e; # $x contains "A 0.39 hit rate" |
| 441 | |
| 442 | The last example shows that C<s///> can use other delimiters, such as |
| 443 | C<s!!!> and C<s{}{}>, and even C<s{}//>. If single quotes are used |
| 444 | C<s'''>, then the regex and replacement are treated as single quoted |
| 445 | strings. |
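A short sketch tying together the return value of C<s///> and the C<//e> modifier; the sample strings are illustrative:

```perl
use strict;
use warnings;

my $x = "I batted 4 for 4";
my $count = ($x =~ s/4/four/g);       # number of substitutions made
my $none  = ($x =~ s/zebra/horse/);   # no match: false, $x is unchanged

# With //e the replacement is Perl code; here each number is doubled.
my $prices = "apple 3, pear 5";       # hypothetical sample text
$prices =~ s/(\d+)/$1 * 2/ge;

print "$count substitutions: $x\n";   # 2 substitutions: I batted four for four
print "doubled: $prices\n";           # doubled: apple 6, pear 10
```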
| 446 | |
| 447 | =head2 The split operator |
| 448 | |
| 449 | C<split /regex/, string> splits C<string> into a list of substrings |
| 450 | and returns that list. The regex determines the character sequence |
| 451 | that C<string> is split with respect to. For example, to split a |
| 452 | string into words, use |
| 453 | |
| 454 | $x = "Calvin and Hobbes"; |
| 455 | @word = split /\s+/, $x; # $word[0] = 'Calvin' |
| 456 | # $word[1] = 'and' |
| 457 | # $word[2] = 'Hobbes' |
| 458 | |
| 459 | To extract a comma-delimited list of numbers, use |
| 460 | |
| 461 | $x = "1.618,2.718, 3.142"; |
| 462 | @const = split /,\s*/, $x; # $const[0] = '1.618' |
| 463 | # $const[1] = '2.718' |
| 464 | # $const[2] = '3.142' |
| 465 | |
| 466 | If the empty regex C<//> is used, the string is split into individual |
| 467 | characters. If the regex has groupings, then the list produced contains |
| 468 | the matched substrings from the groupings as well: |
| 469 | |
| 470 | $x = "/usr/bin"; |
| 471 | @parts = split m!(/)!, $x; # $parts[0] = '' |
| 472 | # $parts[1] = '/' |
| 473 | # $parts[2] = 'usr' |
| 474 | # $parts[3] = '/' |
| 475 | # $parts[4] = 'bin' |
| 476 | |
| 477 | Since the first character of C<$x> matched the regex, C<split> prepended |
| 478 | an empty initial element to the list. |
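The three behaviours described in this section can be compared side by side. A minimal sketch with invented sample strings:

```perl
use strict;
use warnings;

my @chars  = split //, "abc";                  # empty regex: one char each
my @fields = split /,\s*/, "red, green,blue";  # separators are discarded
my @tokens = split /([-+])/, "1+2-3";          # grouping keeps the separators

print join('|', @chars),  "\n";   # a|b|c
print join('|', @fields), "\n";   # red|green|blue
print join('|', @tokens), "\n";   # 1|+|2|-|3
```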
| 479 | |
| 480 | =head1 BUGS |
| 481 | |
| 482 | None. |
| 483 | |
| 484 | =head1 SEE ALSO |
| 485 | |
| 486 | This is just a quick start guide. For a more in-depth tutorial on |
| 487 | regexes, see L<perlretut> and for the reference page, see L<perlre>. |
| 488 | |
| 489 | =head1 AUTHOR AND COPYRIGHT |
| 490 | |
| 491 | Copyright (c) 2000 Mark Kvale |
| 492 | All rights reserved. |
| 493 | |
| 494 | This document may be distributed under the same terms as Perl itself. |
| 495 | |
| 496 | =head2 Acknowledgments |
| 497 | |
| 498 | The author would like to thank Mark-Jason Dominus, Tom Christiansen, |
| 499 | Ilya Zakharevich, Brad Hughes, and Mike Giroux for all their helpful |
| 500 | comments. |
| 501 | |
| 502 | =cut |
| 503 | |