<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>奔向远方 &#187; 正则表达式</title>
	<atom:link href="http://www.tisswb.com/archives/tag/%e6%ad%a3%e5%88%99%e8%a1%a8%e8%be%be%e5%bc%8f/feed" rel="self" type="application/rss+xml" />
	<link>http://www.tisswb.com</link>
	<description>结婚开始倒计时了，高兴~</description>
	<lastBuildDate>Tue, 19 Jul 2011 09:30:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>正则表达式元字符列表</title>
		<link>http://www.tisswb.com/archives/66.html</link>
		<comments>http://www.tisswb.com/archives/66.html#comments</comments>
		<pubDate>Tue, 03 Jun 2008 16:59:28 +0000</pubDate>
		<dc:creator>笨二十一</dc:creator>
				<category><![CDATA[Linux/Unix]]></category>
		<category><![CDATA[PHP]]></category>
		<category><![CDATA[Web技术]]></category>
		<category><![CDATA[服务器]]></category>
		<category><![CDATA[正则表达式]]></category>

		<guid isPermaLink="false">http://www.tisswb.cn/?p=66</guid>
		<description><![CDATA[字符 描述 

将下一个字符标记为一个特殊字符、或一个原义字符、或一个后向引用、或一个八进制转义符。例如，&#8217;n&#8217; 匹配字符 &#8220;n&#8221;。&#8217;n&#8217; 匹配一个换行符。序列 &#8221; 匹配 &#8220;&#8221; 而 &#8220;(&#8221; 则匹配 &#8220;(&#8220;。
^
匹配输入字符串的开始位置。如果设置了 RegExp 对象的 Multiline 属性，^ 也匹配 &#8216;n&#8217; 或 &#8216;r&#8217; 之后的位置。
$
匹配输入字符串的结束位置。如果设置了RegExp 对象的 Multiline 属性，$ 也匹配 &#8216;n&#8217; 或 &#8216;r&#8217; 之前的位置。
*
匹配前面的子表达式零次或多次。例如，zo* 能匹配 &#8220;z&#8221; 以及 &#8220;zoo&#8221;。 * 等价于{0,}。
<span class="readmore"><a href="http://www.tisswb.com/archives/66.html" title="正则表达式元字符列表" target="_blank">阅读全文——共2765字</a></span>]]></description>
			<content:encoded><![CDATA[<div style="font-size: 9pt; font-family: Courier New;"><strong>字符 描述 </strong><br />
<strong></strong><br />
将下一个字符标记为一个特殊字符、或一个原义字符、或一个后向引用、或一个八进制转义符。例如，&#8217;n&#8217; 匹配字符 &#8220;n&#8221;。&#8217;n&#8217; 匹配一个换行符。序列 &#8221; 匹配 &#8220;&#8221; 而 &#8220;(&#8221; 则匹配 &#8220;(&#8220;。<br />
<strong>^</strong><br />
匹配输入字符串的开始位置。如果设置了 RegExp 对象的 Multiline 属性，^ 也匹配 &#8216;n&#8217; 或 &#8216;r&#8217; 之后的位置。<br />
<strong>$</strong><br />
匹配输入字符串的结束位置。如果设置了RegExp 对象的 Multiline 属性，$ 也匹配 &#8216;n&#8217; 或 &#8216;r&#8217; 之前的位置。<br />
<strong>*</strong><br />
匹配前面的子表达式零次或多次。例如，zo* 能匹配 &#8220;z&#8221; 以及 &#8220;zoo&#8221;。 * 等价于{0,}。<br />
+ 匹配前面的子表达式一次或多次。例如，&#8217;zo+&#8217; 能匹配 &#8220;zo&#8221; 以及 &#8220;zoo&#8221;，但不能匹配 &#8220;z&#8221;。+ 等价于 {1,}。<br />
<strong>?</strong><br />
匹配前面的子表达式零次或一次。例如，&#8221;do(es)?&#8221; 可以匹配 &#8220;do&#8221; 或 &#8220;does&#8221; 中的&#8221;do&#8221; 。? 等价于 {0,1}。<br />
<strong>{n}</strong><br />
n 是一个非负整数。匹配确定的 n 次。例如，&#8217;o{2}&#8217; 不能匹配 &#8220;Bob&#8221; 中的 &#8216;o&#8217;，但是能匹配 &#8220;food&#8221; 中的两个 o。<br />
<strong>{n,}</strong><br />
n 是一个非负整数。至少匹配n 次。例如，&#8217;o{2,}&#8217; 不能匹配 &#8220;Bob&#8221; 中的 &#8216;o&#8217;，但能匹配 &#8220;foooood&#8221; 中的所有 o。&#8217;o{1,}&#8217; 等价于 &#8216;o+&#8217;。&#8217;o{0,}&#8217; 则等价于 &#8216;o*&#8217;。<br />
<strong>{n,m}</strong><br />
m 和 n 均为非负整数，其中n &lt;= m。最少匹配 n 次且最多匹配 m 次。刘， &#8220;o{1,3}&#8221; 将匹配 &#8220;fooooood&#8221; 中的前三个 o。&#8217;o{0,1}&#8217; 等价于 &#8216;o?&#8217;。请注意在逗号和两个数之间不能有空格。<br />
<strong>?</strong><br />
当该字符紧跟在任何一个其他限制符 (*, +, ?, {n}, {n,}, {n,m}) 后面时，匹配模式是非贪婪的。非贪婪模式尽可能少的匹配所搜索的字符串，而默认的贪婪模式则尽可能多的匹配所搜索的字符串。例如，对于字符串 &#8220;oooo&#8221;，&#8217;o+?&#8217; 将匹配单个 &#8220;o&#8221;，而 &#8216;o+&#8217; 将匹配所有 &#8216;o&#8217;。<br />
<strong>. </strong><br />
匹配除 &#8220;n&#8221; 之外的任何单个字符。要匹配包括 &#8216;n&#8217; 在内的任何字符，请使用象 &#8216;[.n]&#8216; 的模式。<br />
<strong>(pattern)</strong><br />
匹配pattern 并获取这一匹配。所获取的匹配可以从产生的 Matches 集合得到，在VBScript 中使用 SubMatches 集合，在JScript 中则使用 {CONTENT}… 属性。要匹配圆括号字符，请使用 &#8216;(&#8216; 或 &#8216;)&#8217;。<br />
<strong>(?:pattern)</strong><br />
匹配 pattern 但不获取匹配结果，也就是说这是一个非获取匹配，不进行存储供以后使用。这在使用 &#8220;或&#8221; 字符 (|) 来组合一个模式的各个部分是很有用。例如， &#8216;industr(?:y|ies) 就是一个比 &#8216;industry|industries&#8217; 更简略的表达式。<br />
<strong>(?=pattern)</strong><br />
正向预查，在任何匹配 pattern 的字符串开始处匹配查找字符串。这是一个非获取匹配，也就是说，该匹配不需要获取供以后使用。例如， &#8216;Windows (?=95|98|NT|2000)&#8217; 能匹配 &#8220;Windows 2000&#8243; 中的 &#8220;Windows&#8221; ，但不能匹配 &#8220;Windows 3.1&#8243; 中的 &#8220;Windows&#8221;。预查不消耗字符，也就是说，在一个匹配发生后，在最后一次匹配之后立即开始下一次匹配的搜索，而不是从包含预查的字符之后开始。<br />
<strong>(?!pattern)</strong><br />
负向预查，在任何不匹配Negative lookahead matches the search string at any point where a string not matching pattern 的字符串开始处匹配查找字符串。这是一个非获取匹配，也就是说，该匹配不需要获取供以后使用。例如&#8217;Windows (?!95|98|NT|2000)&#8217; 能匹配 &#8220;Windows 3.1&#8243; 中的 &#8220;Windows&#8221;，但不能匹配 &#8220;Windows 2000&#8243; 中的 &#8220;Windows&#8221;。预查不消耗字符，也就是说，在一个匹配发生后，在最后一次匹配之后立即开始下一次匹配的搜索，而不是从包含预查的字符之后开始<br />
<strong>x|y </strong><br />
匹配 x 或 y。例如，&#8217;z|food&#8217; 能匹配 &#8220;z&#8221; 或 &#8220;food&#8221;。&#8217;(z|f)ood&#8217; 则匹配 &#8220;zood&#8221; 或 &#8220;food&#8221;。<br />
<strong>[xyz]</strong><br />
字符集合。匹配所包含的任意一个字符。例如， &#8216;[abc]&#8216; 可以匹配 &#8220;plain&#8221; 中的 &#8216;a&#8217;。<br />
<strong>[^xyz]</strong><br />
负值字符集合。匹配未包含的任意字符。例如， &#8216;[^abc]&#8216; 可以匹配 &#8220;plain&#8221; 中的&#8217;p'。<br />
<strong>[a-z]</strong><br />
字符范围。匹配指定范围内的任意字符。例如，&#8217;[a-z]&#8216; 可以匹配 &#8216;a&#8217; 到 &#8216;z&#8217; 范围内的任意小写字母字符。<br />
<strong>[^a-z]</strong><br />
负值字符范围。匹配任何不在指定范围内的任意字符。例如，&#8217;[^a-z]&#8216; 可以匹配任何不在 &#8216;a&#8217; 到 &#8216;z&#8217; 范围内的任意字符。<br />
<strong>b</strong><br />
匹配一个单词边界，也就是指单词和空格间的位置。例如， &#8216;erb&#8217; 可以匹配&#8221;never&#8221; 中的 &#8216;er&#8217;，但不能匹配 &#8220;verb&#8221; 中的 &#8216;er&#8217;。<br />
<strong>B</strong><br />
匹配非单词边界。&#8217;erB&#8217; 能匹配 &#8220;verb&#8221; 中的 &#8216;er&#8217;，但不能匹配 &#8220;never&#8221; 中的 &#8216;er&#8217;。<br />
<strong>cx</strong><br />
匹配由x指明的控制字符。例如， cM 匹配一个 Control-M 或回车符。 x 的值必须为 A-Z 或 a-z 之一。否则，将 c 视为一个原义的 &#8216;c&#8217; 字符。<br />
<strong>d</strong><br />
匹配一个数字字符。等价于 [0-9]。<br />
<strong>D</strong><br />
匹配一个非数字字符。等价于 [^0-9]。<br />
<strong>f</strong><br />
匹配一个换页符。等价于 x0c 和 cL。<br />
<strong>n</strong><br />
匹配一个换行符。等价于 x0a 和 cJ。<br />
<strong>r</strong><br />
匹配一个回车符。等价于 x0d 和 cM。<br />
<strong>s</strong><br />
匹配任何空白字符，包括空格、制表符、换页符等等。等价于 [ fnrtv]。<br />
<strong>S</strong><br />
匹配任何非空白字符。等价于 [^ fnrtv]。<br />
<strong>t</strong><br />
匹配一个制表符。等价于 x09 和 cI。<br />
<strong>v</strong><br />
匹配一个垂直制表符。等价于 x0b 和 cK。<br />
<strong>w</strong><br />
匹配包括下划线的任何单词字符。等价于&#8217;[A-Za-z0-9_]&#8216;。<br />
<strong>W</strong><br />
匹配任何非单词字符。等价于 &#8216;[^A-Za-z0-9_]&#8216;。<br />
<strong>xn</strong><br />
匹配 n，其中 n 为十六进制转义值。十六进制转义值必须为确定的两个数字长。例如， &#8216;x41&#8242; 匹配 &#8220;A&#8221;。&#8217;x041&#8242; 则等价于 &#8216;x04&#8242; &amp; &#8220;1&#8243;。正则表达式中可以使用 ASCII 编码。.<br />
<strong>num</strong><br />
匹配 num，其中 num 是一个正整数。对所获取的匹配的引用。例如，&#8217;(.)&#8217; 匹配两个连续的相同字符。<br />
<strong>n</strong><br />
标识一个八进制转义值或一个后向引用。如果 n 之前至少 n 个获取的子表达式，则 n 为后向引用。否则，如果 n 为八进制数字 (0-7)，则 n 为一个八进制转义值。<br />
<strong>nm</strong><br />
标识一个八进制转义值或一个后向引用。如果 nm 之前至少有is preceded by at least nm 个获取得子表达式，则 nm 为后向引用。如果 nm 之前至少有 n 个获取，则 n 为一个后跟文字 m 的后向引用。如果前面的条件都不满足，若  n 和 m 均为八进制数字 (0-7)，则 nm 将匹配八进制转义值 nm。<br />
<strong>nml</strong><br />
如果 n 为八进制数字 (0-3)，且 m 和 l 均为八进制数字 (0-7)，则匹配八进制转义值 nml。<br />
<strong>un</strong><br />
匹配 n，其中 n 是一个用四个十六进制数字表示的 Unicode 字符。例如，u00A9 匹配版权符号 (?)。</div>
]]></content:encoded>
			<wfw:commentRss>http://www.tisswb.com/archives/66.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>深入浅出学正则表达式教程</title>
		<link>http://www.tisswb.com/archives/64.html</link>
		<comments>http://www.tisswb.com/archives/64.html#comments</comments>
		<pubDate>Tue, 03 Jun 2008 14:34:59 +0000</pubDate>
		<dc:creator>笨二十一</dc:creator>
				<category><![CDATA[Linux/Unix]]></category>
		<category><![CDATA[Web技术]]></category>
		<category><![CDATA[服务器]]></category>
		<category><![CDATA[正则表达式]]></category>

		<guid isPermaLink="false">http://www.tisswb.cn/?p=64</guid>
		<description><![CDATA[       本文是Jan Goyvaerts为RegexBuddy写的教程的译文，版权归原作者所有，欢迎转载。但是为了尊重原作者和译者的劳动，请注明出处！谢谢！
 
1.      什么是正则表达式
基本说来，正则表达式是一种用来描述一定数量文本的模式。Regex代表Regular Express。本文将用&#60;&#60;regex&#62;&#62;来表示一段具体的正则表达式。
一段文本就是最基本的模式，简单的匹配相同的文本。
 
2.      不同的正则表达式引擎
正则表达式引擎是一种可以处理正则表达式的软件。通常，引擎是更大的应用程序的一部分。在软件世界，不同的正则表达式并不互相兼容。本教程会集中讨论Perl 5 类型的引擎，因为这种引擎是应用最广泛的引擎。同时我们也会提到一些和其他引擎的区别。许多近代的引擎都很类似，但不完全一样。例如.NET正则库，JDK正则包。
 
3.      文字符号
<span class="readmore"><a href="http://www.tisswb.com/archives/64.html" title="深入浅出学正则表达式教程" target="_blank">阅读全文——共14783字</a></span>]]></description>
			<content:encoded><![CDATA[<p>       <span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">本文是</span>Jan Goyvaerts<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">为</span>RegexBuddy<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">写的教程的译文，版权归原作者所有，欢迎转载。但是为了尊重原作者和译者的劳动，请注明出处！谢谢！</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><strong style="mso-bidi-font-weight: normal"><span style="FONT-SIZE: 18pt"> </span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">1.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">什么是正则表达式</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">基本说来，正则表达式是一种用来描述一定数量文本的模式。</span>Regex<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代表</span>Regular Express<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。本文将用</span>&lt;&lt;regex&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">来表示一段具体的正则表达式。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">一段文本就是最基本的模式，简单的匹配相同的文本。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">2.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">不同的正则表达式引擎</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则表达式引擎是一种可以处理正则表达式的软件。通常，引擎是更大的应用程序的一部分。在软件世界，不同的正则表达式并不互相兼容。本教程会集中讨论</span>Perl 5 <span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">类型的引擎，因为这种引擎是应用最广泛的引擎。同时我们也会提到一些和其他引擎的区别。许多近代的引擎都很类似，但不完全一样。例如</span>.NET<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则库，</span>JDK<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则包。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">3.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">文字符号</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">最基本的正则表达式由单个文字符号组成。如</span>&lt;&lt;a&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，它将匹配字符串中第一次出现的字符“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。如对字符串“</span>Jack is a boy<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。“</span>J<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”后的“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”将被匹配。而第二个“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”将不会被匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则表达式也可以匹配第二个“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，这必须是你告诉正则表达式引擎从第一次匹配的地方开始搜索。在文本编辑器中，你可以使用“查找下一个”。在编程语言中，会有一个函数可以使你从前一次匹配的位置开始继续向后搜索。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">类似的，</span>&lt;&lt;cat&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">会匹配“</span>About cats and dogs<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的“</span>cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。这等于是告诉正则表达式引擎，找到一个</span>&lt;&lt;c&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，紧跟一个</span>&lt;&lt;a&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，再跟一个</span>&lt;&lt;t&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">要注意，正则表达式引擎缺省是大小写敏感的。除非你告诉引擎忽略大小写，否则</span>&lt;&lt;cat&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">不会匹配“</span>Cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">特殊字符</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">对于文字字符，有</span>11<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个字符被保留作特殊用途。他们是：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.5in; TEXT-INDENT: 0.5in">[ ]  ^ $ . | ? * + ( )</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">这些特殊字符也被称作元字符。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果你想在正则表达式中将这些字符用作文本字符，你需要用反斜杠“</span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”对其进行换码</span><span lang="ZH-CN"> </span>(escape)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。例如你想匹配“</span>1+1=2<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，正确的表达式为</span>&lt;&lt;1+1=2&gt;&gt;.</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">需要注意的是，</span>&lt;&lt;1+1=2&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">也是有效的正则表达式。但它不会匹配“</span>1+1=2<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，而会匹配“</span>123+111=234<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的“</span>111=2<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。因为“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”在这里表示特殊含义（重复</span>1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">次到多次）。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在编程语言中，要注意，一些特殊的字符会先被编译器处理，然后再传递给正则引擎。因此正则表达式</span>&lt;&lt;1+2=2&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在</span>C++<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中要写成“</span>1\+1=2<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。为了匹配“</span>C:temp<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，你要用正则表达式</span>&lt;&lt;C:\temp&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。而在</span>C++<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中，正则表达式则变成了“</span>C:\\temp<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">不可显示字符</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">可以使用特殊字符序列来代表某些不可显示字符：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;t&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代表</span>Tab(0&#215;09)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;r&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代表回车符</span>(0x0D)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;n&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代表换行符</span>(0x0A)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">要注意的是</span>Windows<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中文本文件使用“</span>rn<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”来结束一行而</span>Unix<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">使用“</span>n<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">4.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则表达式引擎的内部工作机制</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">知道正则表达式引擎是如何工作的有助于你很快理解为何某个正则表达式不像你期望的那样工作。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">有两种类型的引擎：文本导向</span>(text-directed)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的引擎和正则导向</span>(regex-directed)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的引擎。</span>Jeffrey Friedl<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">把他们称作</span>DFA<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>NFA<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">引擎。本文谈到的是正则导向的引擎。这是因为一些非常有用的特性，如“惰性”量词</span>(lazy quantifiers)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和反向引用</span>(backreferences)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，只能在正则导向的引擎中实现。所以毫不意外这种引擎是目前最流行的引擎。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">你可以轻易分辨出所使用的引擎是文本导向还是正则导向。如果反向引用或“惰性”量词被实现，则可以肯定你使用的引擎是正则导向的。你可以作如下测试：将正则表达式</span>&lt;&lt;regex|regex not&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用到字符串“</span>regex not<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。如果匹配的结果是</span>regex<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，则引擎是正则导向的。如果结果是</span>regex not<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，则是文本导向的。因为正则导向的引擎是“猴急”的，它会很急切的进行表功，报告它找到的第一个匹配</span><span lang="ZH-CN"> </span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则导向的引擎总是返回最左边的匹配</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">这是需要你理解的很重要的一点：即使以后有可能发现一个“更好”的匹配，正则导向的引擎也总是返回最左边的匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">当把</span>&lt;&lt;cat&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用到“</span>He captured a catfish for his cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，引擎先比较</span>&lt;&lt;c&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和“</span>H<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，结果失败了。于是引擎再比较</span>&lt;&lt;c&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和“</span>e<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，也失败了。直到第四个字符，</span>&lt;&lt;c&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配了“</span>c<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span>&lt;&lt;a&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配了第五个字符。到第六个字符</span>&lt;&lt;t&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">没能匹配“</span>p<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，也失败了。引擎再继续从第五个字符重新检查匹配性。直到第十五个字符开始，</span>&lt;&lt;cat&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配上了“</span>catfish<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的“</span>cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，正则表达式引擎急切的返回第一个匹配的结果，而不会再继续查找是否有其他更好的匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">5.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符集</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符集是由一对方括号“</span>[]<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”括起来的字符集合。使用字符集，你可以告诉正则表达式引擎仅仅匹配多个字符中的一个。如果你想匹配一个“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”或一个“</span>e<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，使用</span>&lt;&lt;[ae]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。你可以使用</span>&lt;&lt;gr[ae]y&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配</span>gray<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>grey<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。这在你不确定你要搜索的字符是采用美国英语还是英国英语时特别有用。相反，</span>&lt;&lt;gr[ae]y&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将不会匹配</span>graay<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>graey<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。字符集中的字符顺序并没有什么关系，结果都是相同的。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">你可以使用连字符“</span>-<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”定义一个字符范围作为字符集。</span>&lt;&lt;[0-9]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">到</span>9<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之间的单个数字。你可以使用不止一个范围。</span>&lt;&lt;[0-9a-fA-F] &gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配单个的十六进制数字，并且大小写不敏感。你也可以结合范围定义与单个字符定义。</span>&lt;&lt;[0-9a-fxA-FX]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配一个十六进制数字或字母</span>X<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。再次强调一下，字符和范围定义的先后顺序对结果没有影响。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符集的一些应用</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">查找一个可能有拼写错误的单词，比如</span>&lt;&lt;sep[ae]r[ae]te&gt;&gt; <span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span><span lang="ZH-CN"> </span>&lt;&lt;li[cs]en[cs]e&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">查找程序语言的标识符，</span>&lt;&lt;A-Za-z_][A-Za-z_0-9]*&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span>(*<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">表示重复</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或多次</span>)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">查找</span>C<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">风格的十六进制数</span>&lt;&lt;0[xX][A-Fa-f0-9]+&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span>(+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">表示重复一次或多次</span>)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">取反字符集</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在左方括号“</span>[<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”后面紧跟一个尖括号“</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，将会对字符集取反。结果是字符集将匹配任何不在方括号中的字符。不像“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，取反字符集是可以匹配回车换行符的。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">需要记住的很重要的一点是，取反字符集必须要匹配一个字符。</span>&lt;&lt;q[^u]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">并不意味着：匹配一个</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，后面没有</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">跟着。它意味着：匹配一个</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，后面跟着一个不是</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的字符。所以它不会匹配“</span>Iraq<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，而会匹配“</span>Iraq is a country<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和一个空格符。事实上，空格符是匹配中的一部分，因为它是一个“不是</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的字符”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果你只想匹配一个</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，条件是</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">后面有一个不是</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的字符，我们可以用后面将讲到的向前查看来解决。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符集中的元字符</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">需要注意的是，在字符集中只有</span>4<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个</span><span lang="ZH-CN"> </span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符具有特殊含义。它们是：“</span>]  ^ -<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。“</span>]<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”代表字符集定义的结束；“</span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”代表转义；“</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”代表取反；“</span>-<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”代表范围定义。其他常见的元字符在字符集定义内部都是正常字符，不需要转义。例如，要搜索星号</span>*<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或加号</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，你可以用</span>&lt;&lt;[+*]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。当然，如果你对那些通常的元字符进行转义，你的正则表达式一样会工作得很好，但是这会降低可读性。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在字符集定义中为了将反斜杠“</span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”作为一个文字字符而非特殊含义的字符，你需要用另一个反斜杠对它进行转义。</span>&lt;&lt;[\x]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将会匹配一个反斜杠和一个</span>X<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。“</span>]^-<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”都可以用反斜杠进行转义，或者将他们放在一个不可能使用到他们特殊含义的位置。我们推荐后者，因为这样可以增加可读性。比如对于字符“</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，将它放在除了左括号“</span>[<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”后面的位置，使用的都是文字字符含义而非取反含义。如</span>&lt;&lt;[x^]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">会匹配一个</span>x<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span>&lt;&lt;[]x]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">会匹配一个“</span>]<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”或“</span>x<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span>&lt;&lt;[-x]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>&lt;&lt;[x-]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">都会匹配一个“</span>-<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”或“</span>x<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符集的简写</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">因为一些字符集非常常用，所以有一些简写方式。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;d&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代表</span>&lt;&lt;[0-9]&gt;&gt;;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;w&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代表单词字符。这个是随正则表达式实现的不同而有些差异。绝大多数的正则表达式实现的单词字符集都包含了</span>&lt;&lt;A-Za-z0-9_]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;s&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代表“白字符”。这个也是和不同的实现有关的。在绝大多数的实现中，都包含了空格符和</span>Tab<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">符，以及回车换行符</span>&lt;&lt;rn&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符集的缩写形式可以用在方括号之内或之外。</span>&lt;&lt;sd&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配一个白字符后面紧跟一个数字。</span>&lt;&lt;[sd]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配单个白字符或数字。</span>&lt;&lt;[da-fA-F]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将匹配一个十六进制数字。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">取反字符集的简写</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;[S]&gt;&gt; = &lt;&lt;[^s]&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;[W]&gt;&gt; = &lt;&lt;[^w]&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;[D]&gt;&gt; = &lt;&lt;[^d]&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符集的重复</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果你用“</span>?*+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”操作符来重复一个字符集，你将会重复整个字符集。而不仅是它匹配的那个字符。正则表达式</span>&lt;&lt;[0-9]+&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">会匹配</span>837<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">以及</span>222<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果你仅仅想重复被匹配的那个字符，可以用向后引用达到目的。我们以后将讲到向后引用。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">6.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">使用</span>?*</strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>+ </strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">进行重复</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">?<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">：告诉引擎匹配前导字符</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">次或一次。事实上是表示前导字符是可选的。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">：告诉引擎匹配前导字符</span>1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">次或多次</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">*<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">：告诉引擎匹配前导字符</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">次或多次</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">&lt;[A-Za-z][A-Za-z0-9]*&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配没有属性的</span>HTML<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">标签，“</span>&lt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”以及“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”是文字符号。第一个字符集匹配一个字母，第二个字符集匹配一个字母或数字。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我们似乎也可以用</span>&lt;[A-Za-z0-9]+&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。但是它会匹配</span>&lt;1&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。但是这个正则表达式在你知道你要搜索的字符串不包含类似的无效标签时还是足够有效的。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">限制性重复</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">许多现代的正则表达式实现，都允许你定义对一个字符重复多少次。词法是：</span>{min,max}<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span>min<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>max<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">都是非负整数。如果逗号有而</span>max<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">被忽略了，则</span>max<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">没有限制。如果逗号和</span>max<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">都被忽略了，则重复</span>min<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">次。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">因此</span>{0,}<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>*<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">一样，</span>{1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，</span>}<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>+ <span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的作用一样。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">你可以用</span>&lt;&lt;b[1-9][0-9]{3}b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配</span>1000~9999<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之间的数字</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">“</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”表示单词边界</span>)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span>&lt;&lt;b[1-9][0-9]{2,4}b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配一个在</span>100~99999<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之间的数字。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">注意贪婪性</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">假设你想用一个正则表达式匹配一个</span>HTML<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">标签。你知道输入将会是一个有效的</span>HTML<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">文件，因此正则表达式不需要排除那些无效的标签。所以如果是在两个尖括号之间的内容，就应该是一个</span>HTML<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">标签。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">许多正则表达式的新手会首先想到用正则表达式</span>&lt;&lt; &lt;.+&gt; &gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，他们会很惊讶的发现，对于测试字符串，“</span>This is a &lt;EM&gt;first&lt;/EM&gt; test<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，你可能期望会返回</span>&lt;EM&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，然后继续进行匹配的时候，返回</span>&lt;/EM&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">但事实是不会。正则表达式将会匹配“</span>&lt;EM&gt;first&lt;/EM&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。很显然这不是我们想要的结果。原因在于“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”是贪婪的。也就是说，“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”会导致正则表达式引擎试图尽可能的重复前导字符。只有当这种重复会引起整个正则表达式匹配失败的情况下，引擎会进行回溯。也就是说，它会放弃最后一次的“重复”，然后处理正则表达式余下的部分。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”类似，“</span>?*<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的重复也是贪婪的。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">深入正则表达式引擎内部</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">让我们来看看正则引擎如何匹配前面的例子。第一个记号是“</span>&lt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，这是一个文字符号。第二个符号是“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，匹配了字符“</span>E<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，然后“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”一直可以匹配其余的字符，直到一行的结束。然后到了换行符，匹配失败</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”不匹配换行符</span>)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。于是引擎开始对下一个正则表达式符号进行匹配。也即试图匹配“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。到目前为止，“</span>&lt;.+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”已经匹配了“</span>&lt;EM&gt;first&lt;/EM&gt; test<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。引擎会试图将“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”与换行符进行匹配，结果失败了。于是引擎进行回溯。结果是现在“</span>&lt;.+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配“</span>&lt;EM&gt;first&lt;/EM&gt; tes<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。于是引擎将“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”与“</span>t<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”进行匹配。显然还是会失败。这个过程继续，直到“</span>&lt;.+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配“</span>&lt;EM&gt;first&lt;/EM<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”与“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配。于是引擎找到了一个匹配“</span>&lt;EM&gt;first&lt;/EM&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。记住，正则导向的引擎是“急切的”，所以它会急着报告它找到的第一个匹配。而不是继续回溯，即使可能会有更好的匹配，例如“</span>&lt;EM&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。所以我们可以看到，由于“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的贪婪性，使得正则表达式引擎返回了一个最左边的最长的匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">用懒惰性取代贪婪性</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">一个用于修正以上问题的可能方案是用“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的惰性代替贪婪性。你可以在“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”后面紧跟一个问号“</span>?<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”来达到这一点。“</span>*<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，“</span>{}<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”和“</span>?<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”表示的重复也可以用这个方案。因此在上面的例子中我们可以使用“</span>&lt;.+?&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。让我们再来看看正则表达式引擎的处理过程。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">再一次，正则表达式记号“</span>&lt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”会匹配字符串的第一个“</span>&lt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。下一个正则记号是“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。这次是一个懒惰的“</span>+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”来重复上一个字符。这告诉正则引擎，尽可能少的重复上一个字符。因此引擎匹配“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”和字符“</span>E<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，然后用“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配“</span>M<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，结果失败了。引擎会进行回溯，和上一个例子不同，因为是惰性重复，所以引擎是扩展惰性重复而不是减少，于是“</span>&lt;.+<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”现在被扩展为“</span>&lt;EM<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。引擎继续匹配下一个记号“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。这次得到了一个成功匹配。引擎于是报告“</span>&lt;EM&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”是一个成功的匹配。整个过程大致如此。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">惰性扩展的一个替代方案</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我们还有一个更好的替代方案。可以用一个贪婪重复与一个取反字符集：“</span>&lt;[^&gt;]+&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。之所以说这是一个更好的方案在于使用惰性重复时，引擎会在找到一个成功匹配前对每一个字符进行回溯。而使用取反字符集则不需要进行回溯。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">最后要记住的是，本教程仅仅谈到的是正则导向的引擎。文本导向的引擎是不回溯的。但是同时他们也不支持惰性重复操作。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">7.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">使用“</span>.</strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配几乎任意字符</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在正则表达式中，“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”是最常用的符号之一。不幸的是，它也是最容易被误用的符号之一。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配一个单个的字符而不用关心被匹配的字符是什么。唯一的例外是新行符。在本教程中谈到的引擎，缺省情况下都是不匹配新行符的。因此在缺省情况下，“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”等于是字符集</span>[^nr](Window)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>[^n]( Unix)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的简写。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">这个例外是因为历史的原因。因为早期使用正则表达式的工具是基于行的。它们都是一行一行的读入一个文件，将正则表达式分别应用到每一行上去。在这些工具中，字符串是不包含新行符的。因此“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”也就从不匹配新行符。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">现代的工具和语言能够将正则表达式应用到很大的字符串甚至整个文件上去。本教程讨论的所有正则表达式实现都提供一个选项，可以使“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配所有的字符，包括新行符。在</span>RegexBuddy, EditPad Pro<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>PowerGREP<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">等工具中，你可以简单的选中“点号匹配新行符”。在</span>Perl<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中，“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”可以匹配新行符的模式被称作“单行模式”。很不幸，这是一个很容易混淆的名词。因为还有所谓“多行模式”。多行模式只影响行首行尾的锚定</span>(anchor)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，而单行模式只影响“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">其他语言和正则表达式库也采用了</span>Perl<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的术语定义。当在</span>.NET Framework<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中使用正则表达式类时，你可以用类似下面的语句来激活单行模式：</span>Regex.Match(“string”,”regex”,RegexOptions.SingleLine)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">保守的使用点号“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">点号可以说是最强大的元字符。它允许你偷懒：用一个点号，就能匹配几乎所有的字符。但是问题在于，它也常常会匹配不该匹配的字符。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我会以一个简单的例子来说明。让我们看看如何匹配一个具有“</span>mm/dd/yy<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”格式的日期，但是我们想允许用户来选择分隔符。很快能想到的一个方案是</span>&lt;&lt;dd.dd.dd&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。看上去它能匹配日期“</span>02/12/03<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。问题在于</span>02512703<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">也会被认为是一个有效的日期。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;dd[-/.]dd[-/.]dd&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">看上去是一个好一点的解决方案。记住点号在一个字符集里不是元字符。这个方案远不够完善，它会匹配“</span>99/99/99<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。而</span>&lt;&lt;[0-1]d[-/.][0-3]d[-/.]dd&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">又更进一步。尽管他也会匹配“</span>19/39/99<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。你想要你的正则表达式达到如何完美的程度取决于你想达到什么样的目的。如果你想校验用户输入，则需要尽可能的完美。如果你只是想分析一个已知的源，并且我们知道没有错误的数据，用一个比较好的正则表达式来匹配你想要搜寻的字符就已经足够。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"><strong style="mso-bidi-font-weight: normal"> </strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">8.<span style="font-family: 'Times New Roman';">      </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">字符串开始和结束的锚定</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">锚定和一般的正则表达式符号不同，它不匹配任何字符。相反，他们匹配的是字符之前或之后的位置。“</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配一行字符串第一个字符前的位置。</span>&lt;&lt;^a&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将会匹配字符串“</span>abc<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span>&lt;&lt;^b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将不会匹配“</span>abc<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的任何字符。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">类似的，</span>$<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配字符串中最后一个字符的后面的位置。所以</span>&lt;&lt;c$&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配“</span>abc<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的</span>c<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">锚定的应用</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在编程语言中校验用户输入时，使用锚定是非常重要的。如果你想校验用户的输入为整数，用</span>&lt;&lt;^d+$&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">用户输入中，常常会有多余的前导空格或结束空格。你可以用</span>&lt;&lt;^s*&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>&lt;&lt;s*$&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">来匹配前导空格或结束空格。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">使用“</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”和“</span>$<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”作为行的开始和结束锚定</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果你有一个包含了多行的字符串。例如：“</span>first linenrsecond line<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">其中</span>nr<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">表示一个新行符</span>)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。常常需要对每行分别处理而不是整个字符串。因此，几乎所有的正则表达式引擎都提供一个选项，可以扩展这两种锚定的含义。“</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”可以匹配字串的开始位置</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在</span>f<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之前</span>)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，以及每一个新行符的后面位置</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在</span>nr<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>s<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之间</span>)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。类似的，</span>$<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">会匹配字串的结束位置</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">最后一个</span>e<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之后</span>)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，以及每个新行符的前面</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在</span>e<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">与</span>nr<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之间</span>)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在</span>.NET<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中，当你使用如下代码时，将会定义锚定匹配每一个新行符的前面和后面位置：</span>Regex.Match(&#8220;string&#8221;, &#8220;regex&#8221;, RegexOptions.Multiline)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用：</span>string str = Regex.Replace(Original, &#8220;^&#8221;, &#8220;&gt; &#8220;, RegexOptions.Multiline)&#8211;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将会在每行的行首插入“</span>&gt; <span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">绝对锚定</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;A&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">只匹配整个字符串的开始位置，</span>&lt;&lt;Z&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">只匹配整个字符串的结束位置。即使你使用了“多行模式”，</span>&lt;&lt;A&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>&lt;&lt;Z&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">也从不匹配新行符。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">即使</span>Z<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>$<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">只匹配字符串的结束位置，仍然有一个例外的情况。如果字符串以新行符结束，则</span>Z<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>$<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将会匹配新行符前面的位置，而不是整个字符串的最后面。这个“改进”是由</span>Perl<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">引进的，然后被许多的正则表达式实现所遵循，包括</span>Java<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，</span>.NET<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">等。如果应用</span>&lt;&lt;^[a-z]+$&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">到“</span>joen<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，则匹配结果是“</span>joe<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”而不是“</span>joen<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore"><strong>9.</strong><span style="font-family: 'Times New Roman';">      </span></span></span><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">单词边界</span></strong> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">元字符</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">也是一种对位置进行匹配的“锚”。这种匹配是</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">长度匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; mso-para-margin-left: 1.5gd"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">有</span>4<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">种位置被认为是“单词边界”：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 57pt; TEXT-INDENT: -21pt; mso-list: l0 level2 lfo1; tab-stops: list 57.0pt"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">1)<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在字符串的第一个字符前的位置</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果字符串的第一个字符是一个“单词字符”</span>)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 57pt; TEXT-INDENT: -21pt; mso-list: l0 level2 lfo1; tab-stops: list 57.0pt"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">2)<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在字符串的最后一个字符后的位置</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果字符串的最后一个字符是一个“单词字符”</span>)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 57pt; TEXT-INDENT: -21pt; mso-list: l0 level2 lfo1; tab-stops: list 57.0pt"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">3)<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在一个“单词字符”和“非单词字符”之间，其中“非单词字符”紧跟在“单词字符”之后</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 57pt; TEXT-INDENT: -21pt; mso-list: l0 level2 lfo1; tab-stops: list 57.0pt"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">4)<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在一个“非单词字符”和“单词字符”之间，其中“单词字符”紧跟在“非单词字符”后面</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="mso-spacerun: yes"> </span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">“单词字符”是可以用“</span>w<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配的字符，“非单词字符”是可以用“</span>W<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配的字符。在大多数的正则表达式实现中，“单词字符”通常包括</span>&lt;&lt;[a-zA-Z0-9_]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">例如：</span>&lt;&lt;b4b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">能够匹配单个的</span>4<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">而不是一个更大数的一部分。这个正则表达式不会匹配“</span>44<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的</span>4<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">换种说法，几乎可以说</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配一个“字母数字序列”的开始和结束的位置。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">“单词边界”的取反集为</span>&lt;&lt;B&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，他要匹配的位置是两个“单词字符”之间或者两个“非单词字符”之间的位置。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">深入正则表达式引擎内部</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">让我们看看把正则表达式</span>&lt;&lt;bisb&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用到字符串“</span>This island is beautiful<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。引擎先处理符号</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。因为</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">是</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">长度</span><span lang="ZH-CN"> </span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，所以第一个字符</span>T<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">前面的位置会被考察。因为</span>T<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">是一个“单词字符”，而它前面的字符是一个空字符</span>(void)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，所以</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配了单词边界。接着</span>&lt;&lt;i&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和第一个字符“</span>T<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配失败。匹配过程继续进行，直到第五个空格符，和第四个字符“</span>s<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”之间又匹配了</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。然而空格符和</span>&lt;&lt;i&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">不匹配。继续向后，到了第六个字符“</span>i<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，和第五个空格字符之间匹配了</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，然后</span>&lt;&lt;is&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和第六、第七个字符都匹配了。然而第八个字符和第二个“单词边界”不匹配，所以匹配又失败了。到了第</span>13<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个字符</span>i<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，因为和前面一个空格符形成“单词边界”，同时</span>&lt;&lt;is&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和“</span>is<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配。引擎接着尝试匹配第二个</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。因为第</span>15<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个空格符和“</span>s<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”形成单词边界，所以匹配成功。引擎“急着”返回成功匹配的结果。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">10.<span style="font-family: 'Times New Roman';">  </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">选择符</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则表达式中“</span>|<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”表示选择。你可以用选择符匹配多个可能的正则表达式中的一个。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果你想搜索文字“</span>cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”或“</span>dog<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，你可以用</span>&lt;&lt;cat|dog&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。如果你想有更多的选择，你只要扩展列表</span>&lt;&lt;cat|dog|mouse|fish&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">选择符在正则表达式中具有最低的优先级，也就是说，它告诉引擎要么匹配选择符左边的所有表达式，要么匹配右边的所有表达式。你也可以用圆括号来限制选择符的作用范围。如</span>&lt;&lt;b(cat|dog)b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，这样告诉正则引擎把</span>(cat|dog)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">当成一个正则表达式单位来处理。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">注意正则引擎的“急于表功”性</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则引擎是急切的，当它找到一个有效的匹配时，它会停止搜索。因此在一定条件下，选择符两边的表达式的顺序对结果会有影响。假设你想用正则表达式搜索一个编程语言的函数列表：</span>Get<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，</span>GetValue<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，</span>Set<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>SetValue<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。一个明显的解决方案是</span>&lt;&lt;Get|GetValue|Set|SetValue&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。让我们看看当搜索</span>SetValue<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">时的结果。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">因为</span>&lt;&lt;Get&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>&lt;&lt;GetValue&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">都失败了，而</span>&lt;&lt;Set&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配成功。因为正则导向的引擎都是“急切”的，所以它会返回第一个成功的匹配，就是“</span>Set<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，而不去继续搜索是否有其他更好的匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和我们期望的相反，正则表达式并没有匹配整个字符串。有几种可能的解决办法。一是考虑到正则引擎的“急切”性，改变选项的顺序，例如我们使用</span>&lt;&lt;GetValue|Get|SetValue|Set&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，这样我们就可以优先搜索最长的匹配。我们也可以把四个选项结合起来成两个选项：</span>&lt;&lt;Get(Value)?|Set(Value)?&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。因为问号重复符是贪婪的，所以</span>SetValue<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">总会在</span>Set<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之前被匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">一个更好的方案是使用单词边界：</span>&lt;&lt;b(Get|GetValue|Set|SetValue)b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>&lt;&lt;b(Get(Value)?|Set(Value)?b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。更进一步，既然所有的选择都有相同的结尾，我们可以把正则表达式优化为</span>&lt;&lt;b(Get|Set)(Value)?b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">11.<span style="font-family: 'Times New Roman';">  </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">组与向后引用</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">把正则表达式的一部分放在圆括号内，你可以将它们形成组。然后你可以对整个组使用一些正则操作，例如重复操作符。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">要注意的是，只有圆括号“</span>()<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”才能用于形成组。“</span>[]<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”用于定义字符集。“</span>{}<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”用于定义重复操作。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">当用“</span>()<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”定义了一个正则表达式组后，正则引擎则会把被匹配的组按照顺序编号，存入缓存。当对被匹配的组进行向后引用的时候，可以用“</span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">数字”的方式进行引用。</span>&lt;&lt;1&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">引用第一个匹配的后向引用组，</span>&lt;&lt;2&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">引用第二个组，以此类推，</span>&lt;&lt;n&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">引用第</span>n<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个组。而</span>&lt;&lt; &gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">则引用整个被匹配的正则表达式本身。我们看一个例子。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">假设你想匹配一个</span>HTML<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">标签的开始标签和结束标签，以及标签中间的文本。比如</span>&lt;B&gt;This is a test&lt;/B&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，我们要匹配</span>&lt;B&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和</span>&lt;/B&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">以及中间的文字。我们可以用如下正则表达式：“</span>&lt;([A-Z][A-Z0-9]*)[^&gt;]*&gt;.*?&lt;/1&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">首先，“</span>&lt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”将会匹配“</span>&lt;B&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的第一个字符“</span>&lt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。然后</span>[A-Z]<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配</span>B<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，</span>[A-Z0-9]*<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将会匹配</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">到多次字母数字，后面紧接着</span>0<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">到多个非“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的字符。最后正则表达式的“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”将会匹配“</span>&lt;B&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的“</span>&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。接下来正则引擎将对结束标签之前的字符进行惰性匹配，直到遇到一个“</span>&lt;/<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”符号。然后正则表达式中的“</span>1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”表示对前面匹配的组“</span>([A-Z][A-Z0-9]*)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”进行引用，在本例中，被引用的是标签名“</span>B<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。所以需要被匹配的结尾标签为“</span>&lt;/B&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">你可以对相同的后向引用组进行多次引用，</span>&lt;&lt;([a-c])x1x1&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将匹配“</span>axaxa<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”、“</span>bxbxb<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”以及“</span>cxcxc<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。如果用数字形式引用的组没有有效的匹配，则引用到的内容简单的为空。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">一个后向引用不能用于它自身。</span>&lt;&lt;([abc]1)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">是错误的。因此你不能将</span>&lt;&lt; &gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">用于一个正则表达式匹配本身，它只能用于替换操作中。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">后向引用不能用于字符集内部。</span>&lt;&lt;(a)[1b]&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中的</span>&lt;&lt;1&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">并不表示后向引用。在字符集内部，</span>&lt;&lt;1&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">可以被解释为八进制形式的转码。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">向后引用会降低引擎的速度，因为它需要存储匹配的组。如果你不需要向后引用，你可以告诉引擎对某个组不存储。例如：</span>&lt;&lt;Get(?:Value)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。其中“</span>(<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”后面紧跟的“</span>?:<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”会告诉引擎对于组</span>(Value)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，不存储匹配的值以供后向引用。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">重复操作与后向引用</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">当对组使用重复操作符时，缓存里后向引用内容会被不断刷新，只保留最后匹配的内容。例如：</span>&lt;&lt;([abc]+)=1&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将匹配“</span>cab=cab<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，但是</span>&lt;&lt;([abc])+=1&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">却不会。因为</span>([abc])<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">第一次匹配“</span>c<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”时，“</span>1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”代表“</span>c<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”；然后</span>([abc])<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">会继续匹配“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”和“</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。最后“</span>1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”代表“</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，所以它会匹配“</span>cab=b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用：检查重复单词</span>&#8211;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">当编辑文字时，很容易就会输入重复单词，例如“</span>the the<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。使用</span>&lt;&lt;b(w+)s+1b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">可以检测到这些重复单词。要删除第二个单词，只要简单的利用替换功能替换掉“</span>1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”就可以了。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">组的命名和引用</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在</span>PHP<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，</span>Python<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中，可以用</span>&lt;&lt;(?P&lt;name&gt;group)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">来对组进行命名。在本例中，词法</span>?P&lt;name&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">就是对组</span>(group)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">进行了命名。其中</span>name<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">是你对组的起的名字。你可以用</span>(?P=name)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">进行引用。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">.NET<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的命名组</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">.NET framework<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">也支持命名组。不幸的是，微软的程序员们决定发明他们自己的语法，而不是沿用</span>Perl<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">、</span>Python<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">的规则。目前为止，还没有任何其他的正则表达式实现支持微软发明的语法。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">下面是</span>.NET<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">中的例子：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">(?&lt;first&gt;group)(?’second’group)</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正如你所看到的，</span>.NET<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">提供两种词法来创建命名组：一是用尖括号“</span>&lt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，或者用单引号“</span>’’<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。尖括号在字符串中使用更方便，单引号在</span>ASP<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代码中更有用，因为</span>ASP<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">代码中“</span>&lt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”被用作</span>HTML<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">标签。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">要引用一个命名组，使用</span>k&lt;name&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>k’name’.</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">当进行搜索替换时，你可以用“</span>${name}<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”来引用一个命名组。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">12.<span style="font-family: 'Times New Roman';">  </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则表达式的匹配模式</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">本教程所讨论的正则表达式引擎都支持三种匹配模式：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">&lt;&lt;/i&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">使正则表达式对大小写不敏感，</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">&lt;&lt;/s&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">开启“单行模式”，即点号“</span>.<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配新行符</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">&lt;&lt;/m&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">开启“多行模式”，即“</span>^<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”和“</span>$<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配新行符的前面和后面的位置。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在正则表达式内部打开或关闭模式</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果你在正则表达式内部插入修饰符</span>(?ism)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，则该修饰符只对其右边的正则表达式起作用。</span>(?-i)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">是关闭大小写不敏感。你可以很快的进行测试。</span>&lt;&lt;(?i)te(?-i)st&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应该匹配</span>TEst<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，但是不能匹配</span>teST<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">或</span>TEST.</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">13.<span style="font-family: 'Times New Roman';">  </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">原子组与防止回溯</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在一些特殊情况下，因为回溯会使得引擎的效率极其低下。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">让我们看一个例子：要匹配这样的字串，字串中的每个字段间用逗号做分隔符，第</span>12<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个字段由</span>P<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">开头。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我们容易想到这样的正则表达式</span>&lt;&lt;^(.*?,){11}P&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。这个正则表达式在正常情况下工作的很好。但是在极端情况下，如果第</span>12<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个字段不是由</span>P<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">开头，则会发生灾难性的回溯。如要搜索的字串为“</span>1,2,3,4,5,6,7,8,9,10,11,12,13<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。首先，正则表达式一直成功匹配直到第</span>12<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个字符。这时，前面的正则表达式消耗的字串为“</span>1,2,3,4,5,6,7,8,9,10,11,<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，到了下一个字符，</span>&lt;&lt;P&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">并不匹配“</span>12<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。所以引擎进行回溯，这时正则表达式消耗的字串为“</span>1,2,3,4,5,6,7,8,9,10,11<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。继续下一次匹配过程，下一个正则符号为点号</span>&lt;&lt;.&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，可以匹配下一个逗号“</span>,<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。然而</span>&lt;&lt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，</span>&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">并不匹配字符“</span>12<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”中的“</span>1<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。匹配失败，继续回溯。大家可以想象，这样的回溯组合是个非常大的数量。因此可能会造成引擎崩溃。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">用于阻止这样巨大的回溯有几种方案：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">一种简单的方案是尽可能的使匹配精确。用取反字符集代替点号。例如我们用如下正则表达式</span>&lt;&lt;^([^,rn]*,){11}P&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，这样可以使失败回溯的次数下降到</span>11<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">次。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">另一种方案是使用原子组。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">原子组的目的是使正则引擎失败的更快一点。因此可以有效的阻止海量回溯。原子组的语法是</span>&lt;&lt;(?&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则表达式</span>)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。位于</span>(?&gt;)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">之间的所有正则表达式都会被认为是一个单一的正则符号。一旦匹配失败，引擎将会回溯到原子组前面的正则表达式部分。前面的例子用原子组可以表达成</span>&lt;&lt;^(?&gt;(.*?,){11})P&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。一旦第十二个字段匹配失败，引擎回溯到原子组前面的</span>&lt;&lt;^&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">14.<span style="font-family: 'Times New Roman';">  </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">向前查看与向后查看</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in">Perl 5 <span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">引入了两个强大的正则语法：“向前查看”和“向后查看”。他们也被称作“零长度断言”。他们和锚定一样都是零长度的（所谓零长度即指该正则表达式不消耗被匹配的字符串）。不同之处在于“前后查看”会实际匹配字符，只是他们会抛弃匹配只返回匹配结果：匹配或不匹配。这就是为什么他们被称作“断言”。他们并不实际消耗字符串中的字符，而只是断言一个匹配是否可能。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">几乎本文讨论的所有正则表达式的实现都支持“向前向后查看”。唯一的一个例外是</span>Javascript<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">只支持向前查看。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">肯定和否定式的向前查看</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如我们前面提过的一个例子：要查找一个</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，后面没有紧跟一个</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。也就是说，要么</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">后面没有字符，要么后面的字符不是</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。采用否定式向前查看后的一个解决方案为</span>&lt;&lt;q(?!u)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。否定式向前查看的语法是</span>&lt;&lt;(?!<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">查看的内容</span>)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">肯定式向前查看和否定式向前查看很类似：</span>&lt;&lt;(?=<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">查看的内容</span>)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果在“查看的内容”部分有组，也会产生一个向后引用。但是向前查看本身并不会产生向后引用，也不会被计入向后引用的编号中。这是因为向前查看本身是会被抛弃掉的，只保留匹配与否的判断结果。如果你想保留匹配的结果作为向后引用，你可以用</span>&lt;&lt;(?=(regex))&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">来产生一个向后引用。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">肯定和否定式的先后查看</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">向后查看和向前查看有相同的效果，只是方向相反</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">否定式向后查看的语法是：</span>&lt;&lt;(?&lt;!<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">查看内容</span>)&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">肯定式向后查看的语法是：</span>&lt;&lt;(?&lt;=<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">查看内容</span>)&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我们可以看到，和向前查看相比，多了一个表示方向的左尖括号。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">例：</span>&lt;&lt;(?&lt;!a)b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">将会匹配一个没有“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”作前导字符的“</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">值得注意的是：向前查看从当前字符串位置开始对“查看”正则表达式进行匹配；向后查看则从当前字符串位置开始先后回溯一个字符，然后再开始对“查看”正则表达式进行匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">深入正则表达式引擎内部</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">让我们看一个简单例子。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">把正则表达式</span>&lt;&lt;q(?!u)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用到字符串“</span>Iraq<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。正则表达式的第一个符号是</span>&lt;&lt;q&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。正如我们知道的，引擎在匹配</span>&lt;&lt;q&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">以前会扫过整个字符串。当第四个字符“</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”被匹配后，“</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”后面是空字符</span>(void)<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。而下一个正则符号是向前查看。引擎注意到已经进入了一个向前查看正则表达式部分。下一个正则符号是</span>&lt;&lt;u&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，和空字符不匹配，从而导致向前查看里的正则表达式匹配失败。因为是一个否定式的向前查看，意味着整个向前查看结果是成功的。于是匹配结果“</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”被返回了。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我们在把相同的正则表达式应用到“</span>quit<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span>&lt;&lt;q&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配了“</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。下一个正则符号是向前查看部分的</span>&lt;&lt;u&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，它匹配了字符串中的第二个字符“</span>i<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。引擎继续走到下个字符“</span>i<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。然而引擎这时注意到向前查看部分已经处理完了，并且向前查看已经成功。于是引擎抛弃被匹配的字符串部分，这将导致引擎回退到字符“</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">因为向前查看是否定式的，意味着查看部分的成功匹配导致了整个向前查看的失败，因此引擎不得不进行回溯。最后因为再没有其他的“</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”和</span>&lt;&lt;q&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配，所以整个匹配失败了。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">为了确保你能清楚地理解向前查看的实现，让我们把</span>&lt;&lt;q(?=u)i&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用到“</span>quit<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span>&lt;&lt;q&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">首先匹配“</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。然后向前查看成功匹配“</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，匹配的部分被抛弃，只返回可以匹配的判断结果。引擎从字符“</span>i<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”回退到“</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。由于向前查看成功了，引擎继续处理下一个正则符号</span>&lt;&lt;i&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。结果发现</span>&lt;&lt;i&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">和“</span>u<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”不匹配。因此匹配失败了。由于后面没有其他的“</span>q<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，整个正则表达式的匹配失败了。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">更进一步理解正则表达式引擎内部机制</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">让我们把</span>&lt;&lt;(?&lt;=a)b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">应用到“</span>thingamabob<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。引擎开始处理向后查看部分的正则符号和字符串中的第一个字符。在这个例子中，向后查看告诉正则表达式引擎回退一个字符，然后查看是否有一个“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”被匹配。因为在“</span>t<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”前面没有字符，所以引擎不能回退。因此向后查看失败了。引擎继续走到下一个字符“</span>h<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。再一次，引擎暂时回退一个字符并检查是否有个“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”被匹配。结果发现了一个“</span>t<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。向后查看又失败了。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">向后查看继续失败，直到正则表达式到达了字符串中的“</span>m<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，于是肯定式的向后查看被匹配了。因为它是零长度的，字符串的当前位置仍然是“</span>m<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。下一个正则符号是</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，和“</span>m<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”匹配失败。下一个字符是字符串中的第二个“</span>a<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。引擎向后暂时回退一个字符，并且发现</span>&lt;&lt;a&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">不匹配“</span>m<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在下一个字符是字符串中的第一个“</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。引擎暂时性的向后退一个字符发现向后查看被满足了，同时</span>&lt;&lt;b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">匹配了“</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。因此整个正则表达式被匹配了。作为结果，正则表达式返回字符串中的第一个“</span>b<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.3in; TEXT-INDENT: -0.25in; mso-list: l1 level1 lfo2; tab-stops: list .3in"><span style="font-family: Symbol; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;"><span style="mso-list: Ignore">·<span style="font-family: 'Times New Roman';">        </span></span></span><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">向前向后查看的应用</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我们来看这样一个例子：查找一个具有</span>6<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">位字符的，含有“</span>cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的单词。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">首先，我们可以不用向前向后查看来解决问题，例如：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.5in">&lt;&lt; catw{3}|wcatw{2}|w{2}catw|w{3}cat&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">足够简单吧！但是当需求变成查找一个具有</span>6-12<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">位字符，含有“</span>cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”，“</span>dog<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”或“</span>mouse<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”的单词时，这种方法就变得有些笨拙了。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">我们来看看使用向前查看的方案。在这个例子中，我们有两个基本需求要满足：一是我们需要一个</span>6<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">位的字符，二是单词含有“</span>cat<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">满足第一个需求的正则表达式为</span>&lt;&lt;bw{6}b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。满足第二个需求的正则表达式为</span>&lt;&lt;bw*catw*b&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">把两者结合起来，我们可以得到如下的正则表达式：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="mso-tab-count: 1">     </span>&lt;&lt;(?=bw{6}b)bw*catw*b&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">具体的匹配过程留给读者。但是要注意的一点是，向前查看是不消耗字符的，因此当判断单词满足具有</span>6<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">个字符的条件后，引擎会从开始判断前的位置继续对后面的正则表达式进行匹配。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">最后作些优化，可以得到下面的正则表达式：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in">&lt;&lt;b(?=w{6}b)w{0,3}catw*&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.3in"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">15.<span style="font-family: 'Times New Roman';">  </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">正则表达式中的条件测试</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">条件测试的语法为</span>&lt;&lt;(?ifthen|else)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">。“</span>if<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">”部分可以是向前向后查看表达式。如果用向前查看，则语法变为：</span>&lt;&lt;(?(?=regex)then|else)&gt;&gt;<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，其中</span>else<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">部分是可选的。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">如果</span>if<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">部分为</span>true<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">，则正则引擎会试图匹配</span>then<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">部分，否则引擎会试图匹配</span>else<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">部分。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">需要记住的是，向前先后查看并不实际消耗任何字符，因此后面的</span>then<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">与</span>else<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">部分的匹配时从</span>if<span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">测试前的部分开始进行尝试。</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt"> </p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt 0.25in; TEXT-INDENT: -0.25in; mso-list: l0 level1 lfo1; tab-stops: list .25in"><strong style="mso-bidi-font-weight: normal"><span style="mso-fareast-font-family: 'Times New Roman'"><span style="mso-list: Ignore">16.<span style="font-family: 'Times New Roman';">  </span></span></span></strong><strong style="mso-bidi-font-weight: normal"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">为正则表达式添加注释</span></strong></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">在正则表达式中添加注释的语法是：</span>&lt;&lt;(?#comment)&gt;&gt;</p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="font-family: 宋体; mso-ascii-font-family: 'Times New Roman'; mso-hansi-font-family: 'Times New Roman';">例：为用于匹配有效日期的正则表达式添加注释：</span></p>
<p class="MsoNormal" style="MARGIN: 0in 0in 0pt; TEXT-INDENT: 0.25in"><span style="mso-spacerun: yes"> </span>(?#year)(19|20)dd[- /.](?#month)(0[1-9]|1[012])[- /.](?#day)(0[1-9]|[12][0-9]|3[01])<br />
 </p>
<p></span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.tisswb.com/archives/64.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

