
.Go 4 "REGULAR EXPRESSIONS"

.PP
\*E uses regular expressions for searching and substitutions.
A regular expression is a text string in which some characters have
special meanings.
This is much more powerful than simple text matching.
.SH
Syntax
.PP
\*E' regexp package treats the following one- or two-character
strings (called metacharacters) in special ways:
.IP "\e(\fIsubexpression\fP\e)" 0.8i
The \e( and \e) metacharacters are used to delimit subexpressions.
When the regular expression matches a particular chunk of text,
\*E will remember which portion of that chunk matched the \fIsubexpression\fP.
The :s/regexp/newtext/ command makes use of this feature.
.IP "^" 0.8i
The ^ metacharacter matches the beginning of a line.
If, for example, you wanted to find "foo" at the beginning of a line,
you would use a regular expression such as /^foo/.
Note that ^ is only a metacharacter if it occurs
at the beginning of a regular expression;
anyplace else, it is treated as a normal character.
.IP "$" 0.8i
The $ metacharacter matches the end of a line.
It is only a metacharacter when it occurs at the end of a regular expression;
elsewhere, it is treated as a normal character.
For example, the regular expression /$$/ will search for a dollar sign at
the end of a line.
.IP "\e<" 0.8i
The \e< metacharacter matches a zero-length string at the beginning of
a word.
A word is considered to be a string of 1 or more letters and digits.
A word can begin at the beginning of a line
or after 1 or more non-alphanumeric characters.
.IP "\e>" 0.8i
The \e> metacharacter matches a zero-length string at the end of a word.
A word can end at the end of the line
or before 1 or more non-alphanumeric characters.
For example, /\e<end\e>/ would find any instance of the word "end",
but would ignore any instances of e-n-d inside another word
such as "calendar".
.IP "\&." 0.8i
The . metacharacter matches any single character.
.IP "[\fIcharacter-list\fP]" 0.8i
This matches any single character from the \fIcharacter-list\fP.
Inside the \fIcharacter-list\fP, you can denote a span of characters
by writing only the first and last characters, with a hyphen between
them.
If the \fIcharacter-list\fP is preceded by a ^ character, then the
list is inverted -- it will match any character that \fIisn't\fP mentioned
in the list.
For example, /[a-zA-Z]/ matches any letter, and /[^ ]/ matches anything
other than a blank.
.IP "\e{\fIn\fP\e}" 0.8i
This is a closure operator,
which means that it can only be placed after something that matches a
single character.
It controls the number of times that the single-character expression
should be repeated.
.IP "" 0.8i
The \e{\fIn\fP\e} operator, in particular, means that the preceding
expression should be repeated exactly \fIn\fP times.
For example, /^-\e{80\e}$/ matches a line of eighty hyphens, and
/\e<[a-zA-Z]\e{4\e}\e>/ matches any four-letter word.
.IP "\e{\fIn\fP,\fIm\fP\e}" 0.8i
This is a closure operator which means that the preceding single-character
expression should be repeated between \fIn\fP and \fIm\fP times, inclusive.
If the \fIm\fP is omitted (but the comma is present) then \fIm\fP is
taken to be infinity.
For example, /"[^"]\e{3,5\e}"/ matches any pair of quotes which contains
three, four, or five non-quote characters.
.IP "*" 0.8i
The * metacharacter is a closure operator which means that the preceding
single-character expression can be repeated zero or more times.
It is equivalent to \e{0,\e}.
For example, /.*/ matches a whole line.
.IP "\e+" 0.8i
The \e+ metacharacter is a closure operator which means that the preceding
single-character expression can be repeated one or more times.
It is equivalent to \e{1,\e}.
For example, /.\e+/ matches a whole line, but only if the line contains
at least one character.
It doesn't match empty lines.
.IP "\e?" 0.8i
The \e? metacharacter is a closure operator which indicates that the
preceding single-character expression is optional -- that is, that it
can occur 0 or 1 times.
It is equivalent to \e{0,1\e}.
For example, /no[ -]\e?one/ matches "no one", "no-one", or "noone".
.PP
Anything else is treated as a normal character which must exactly match
a character from the scanned text.
The special strings may all be preceded by a backslash to
force them to be treated normally.
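.PP
Several of the example patterns above can be tried out in Python's re module,
which spells the closure operators without backslashes ({n}, +, ?).
This is only a sketch: the EXAMPLES table and the matches() helper are names
invented for this illustration, and Python's engine is not \*E's,
so agreement is assumed only for these simple patterns.
The word example uses explicit letter-and-digit checks to mirror the
definition of a word given above.
.LD
```python
import re

# Python spellings of patterns from this section (illustrative names).
EXAMPLES = {
    "eighty_hyphens": "^-{80}$",
    "quoted_3_to_5": '"[^"]{3,5}"',
    "optional_separator": "no[ -]?one",
    "nonempty_line": "^.+$",
    # A word is letters and digits, so "end" must not touch either.
    "whole_word_end": "(?<![a-zA-Z0-9])end(?![a-zA-Z0-9])",
}

def matches(name, text):
    """Return True if the named example pattern is found in text."""
    return re.search(EXAMPLES[name], text) is not None
```
.DE
.PP
For instance, matches("whole_word_end", "calendar") is False,
while matches("optional_separator", "no-one") is True.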
.SH
Substitutions
.PP
The :s command has at least two arguments: a regular expression,
and a substitution string.
The text that matched the regular expression is replaced by text
which is derived from the substitution string.
.br
.ne 15 \" so we don't mess up the table
.PP
Most characters in the substitution string are copied into the
text literally, but a few have special meanings:
.LD
.ta 0.75i 1.3i
	&	Insert a copy of the original text
	~	Insert a copy of the previous replacement text
	\e1	Insert a copy of that portion of the original text which
		matched the first set of \e( \e) parentheses
	\e2-\e9	Do the same for the second (etc.) pair of \e( \e)
	\eU	Convert all chars of any later & or \e# to uppercase
	\eL	Convert all chars of any later & or \e# to lowercase
	\eE	End the effect of \eU or \eL
	\eu	Convert the first char of the next & or \e# to uppercase
	\el	Convert the first char of the next & or \e# to lowercase
.TA
.DE
.PP
These may be preceded by a backslash to force them to be treated normally.
If "nomagic" mode is in effect,
then & and ~ will be treated normally,
and you must write them as \e& and \e~ for them to have special meaning.
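.PP
The roles of &, \e1 and the case-conversion escapes can be imitated in
Python with a replacement function.
The sketch below is an analogy, not \*E's implementation:
the function names are hypothetical, Python's upper() is assumed to be an
adequate stand-in for \eU, and count=1 is used because :s without the g flag
changes only the first match on a line.
.LD
```python
import re

def upper_match(pattern, text):
    # m.group(0) plays the role of & ; upper() stands in for \eU
    return re.sub(pattern, lambda m: m.group(0).upper(), text, count=1)

def capitalize_match(pattern, text):
    # Stands in for \eu& : only the first character is uppercased.
    def repl(m):
        s = m.group(0)
        return s[:1].upper() + s[1:]
    return re.sub(pattern, repl, text, count=1)

def keep_first_group(pattern, text):
    # m.group(1) plays the role of \e1 in the substitution string.
    return re.sub(pattern, lambda m: m.group(1), text, count=1)
```
.DE
.PP
For instance, capitalize_match("[a-z]+", "foo bar") gives "Foo bar".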
.SH
Options
.PP
\*E has two options which affect the way regular expressions are used.
These options may be examined or set via the :set command.
.PP
The first option is called "[no]magic".
This is a boolean option, and it is "magic" (TRUE) by default.
While in magic mode, all of the metacharacters behave as described above.
In nomagic mode, only ^ and $ retain their special meaning.
.PP
The second option is called "[no]ignorecase".
This is a boolean option, and it is "noignorecase" (FALSE) by default.
While in ignorecase mode, the searching mechanism will not distinguish between
an uppercase letter and its lowercase form.
In noignorecase mode, uppercase and lowercase are treated as being different.
.PP
Also, the "[no]wrapscan" option affects searches:
it controls whether a search that reaches the end of the file
wraps around to the beginning.
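.PP
The effect of the magic and ignorecase options can be approximated in
Python as follows.
This is a sketch under stated assumptions: search() and nomagic_to_python()
are hypothetical helpers, and the nomagic translation assumes the pattern
contains no backslashes, so it covers only the rule that ^ and $ alone
stay special.
.LD
```python
import re

def search(pattern, text, ignorecase=False):
    # ignorecase corresponds to Python's re.IGNORECASE flag.
    flags = re.IGNORECASE if ignorecase else 0
    return re.search(pattern, text, flags) is not None

def nomagic_to_python(pattern):
    # Assumption: the pattern has no backslashes.
    # Keep a leading ^ and a trailing $ special; escape everything else.
    head = "^" if pattern.startswith("^") else ""
    body = pattern[len(head):]
    tail = ""
    if body.endswith("$"):
        tail = "$"
        body = body[:-1]
    return head + re.escape(body) + tail
```
.DE
.PP
For instance, search("end", "The END") is False,
but search("end", "The END", ignorecase=True) is True.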
.SH
Examples
.PP
This example changes every occurrence of "utilize" to "use":
.sp
.ti +1i
:%s/utilize/use/g
.PP
This example deletes all whitespace that occurs at the end of a line anywhere
in the file.
(The brackets contain a single space and a single tab.):
.sp
.ti +1i
:%s/[ 	]\e+$//
.PP
This example converts the current line to uppercase:
.sp
.ti +1i
:s/.*/\eU&/
.PP
This example underlines each letter in the current line,
by changing it into an "underscore backspace letter" sequence.
(The ^H is entered as "control-V backspace".):
.sp
.ti +1i
:s/[a-zA-Z]/_^H&/g
.PP
This example locates the last colon in a line,
and swaps the text before the colon with the text after the colon.
The first \e( \e) pair delimits the text before the colon,
and the second pair delimits the text after it.
In the substitution text, \e1 and \e2 are given in reverse order
to perform the swap:
.sp
.ti +1i
:s/\e(.*\e):\e(.*\e)/\e2:\e1/
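.PP
For comparison, three of the examples above can be imitated in Python.
This is a sketch rather than a description of \*E's internals:
the function names are hypothetical, each function works on a single line
of text, and BACKSPACE is written as chr(8) to stand for the ^H character.
.LD
```python
import re

BACKSPACE = chr(8)  # the ^H character

def use_for_utilize(line):
    # Every occurrence is replaced, as with the g flag.
    return re.sub("utilize", "use", line)

def underline_letters(line):
    # Each letter becomes underscore, backspace, letter.
    return re.sub("[a-zA-Z]",
                  lambda m: "_" + BACKSPACE + m.group(0), line)

def swap_around_last_colon(line):
    # The first group is greedy, so it runs up to the last colon.
    return re.sub("(.*):(.*)",
                  lambda m: m.group(2) + ":" + m.group(1),
                  line, count=1)
```
.DE
.PP
For instance, swap_around_last_colon("a:b:c") gives "c:a:b",
because the greedy first group takes "a:b".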