Regular Expressions in JavaScript

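(The content of this article is mainly excerpted from "JavaScript: The Definitive Guide", 5th edition.)
Validating page input on the client side with regular expressions, using the methods JavaScript provides, is a common and efficient practice. To test text against a regular-expression pattern, you can use either the methods provided by strings or the methods of the regular-expression object (RegExp).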

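Defining regular expressions and their syntax

    In JavaScript, you can create a regular-expression object with the RegExp() constructor or, more commonly, define one with literal syntax. Just as string literals are enclosed in quotation marks, the body of a regular-expression literal is enclosed in a pair of slashes (/).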
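    As a minimal sketch of the two forms described above, the snippet below builds the same pattern both ways. The pattern `s$` and the sample string are only illustrative choices.

```javascript
// Two ways to build the same pattern: an "s" at the end of a string.
var literal = /s$/;                   // literal syntax, delimited by slashes
var constructed = new RegExp("s$");   // RegExp() constructor, pattern given as a string

console.log(literal.test("regular expressions"));      // true
console.log(constructed.test("regular expressions"));  // true
console.log(literal.source === constructed.source);    // true: both hold the pattern "s$"
```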

    Literal characters

    Letters and digits match themselves. Sequences that begin with a backslash have special meanings:

Character	Matches
Alphanumeric character	Itself
`\0`	The NUL character (`\u0000`)
`\t`	Tab (`\u0009`)
`\n`	Newline (`\u000A`)
`\v`	Vertical tab (`\u000B`)
`\f`	Form feed (`\u000C`)
`\r`	Carriage return (`\u000D`)
`\xnn`	The Latin character specified by the hexadecimal number `nn`; for example, `\x0A` is the same as `\n`
`\uxxxx`	The Unicode character specified by the hexadecimal number `xxxx`; for example, `\u0009` is the same as `\t`
`\cX`	The control character `^X`; for example, `\cJ` is equivalent to the newline character `\n`

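    The following punctuation characters also have special meanings. To match one of them literally, precede it with a backslash:
       ^ $ . * + ? = ! : | \ / ( ) [ ] { }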
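    The following sketch illustrates why escaping matters for the punctuation characters listed above. The sample patterns and strings are illustrative assumptions, not taken from the book.

```javascript
// "." is special (it matches any character), so an unescaped dot is too permissive.
var loose = /3.14/;
var strict = /3\.14/;   // "\." matches only a literal period

console.log(loose.test("3x14"));    // true
console.log(strict.test("3x14"));   // false
console.log(strict.test("3.14"));   // true

// A literal slash must be escaped inside a regular-expression literal.
var slash = /\//;
console.log(slash.test("a/b"));     // true

// "\x41" and "\u0041" both denote the letter "A".
var twoAs = /\x41\u0041/;
console.log(twoAs.test("AA"));      // true
```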

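    Character classes

    Individual characters can be combined into a character class by placing them within square brackets. A character class matches any one of the characters it contains, and only one character. For example, /[abc]/ matches any one of the letters a, b, or c. A caret (^) as the first character inside the brackets negates the class: /[^abc]/ matches any one character other than a, b, or c. A hyphen (-) indicates a range of characters: /[a-z]/ matches any one lowercase letter from a to z.
    Because certain character classes are used so often, JavaScript defines escape sequences to represent them.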

Character	Matches
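`[...]`	Any one character between the brackets.
`[^...]`	Any one character not between the brackets.
`.`	Any character except newline or another Unicode line terminator.
`\w`	Any ASCII word character. Equivalent to `[a-zA-Z0-9_]`.
`\W`	Any character that is not an ASCII word character. Equivalent to `[^a-zA-Z0-9_]`.
`\s`	Any Unicode whitespace character.
`\S`	Any character that is not Unicode whitespace. Note that `\w` (lowercase) and `\S` are not the same thing.
`\d`	Any ASCII digit. Equivalent to `[0-9]`.
`\D`	Any character other than an ASCII digit. Equivalent to `[^0-9]`.
`[\b]`	A literal backspace (special case).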

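    Escape sequences can be used inside [ ]. Note the special case of \b: inside square brackets it means the backspace character, but outside them it matches a word boundary.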
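    A short sketch of the character classes above in action. The sample strings are illustrative assumptions.

```javascript
// Each class matches exactly one character from the set it describes.
var firstOfSet = "cab".match(/[abc]/)[0];
var firstOutside = "abcd".match(/[^abc]/)[0];
var twoDigits = "Room 42".match(/\d\d/)[0];
var spaceAt = "a b".search(/\s/);

console.log(firstOfSet);     // "c": the first character that is a, b, or c
console.log(firstOutside);   // "d": the first character that is not a, b, or c
console.log(twoDigits);      // "42"
console.log(spaceAt);        // 1

// Escape sequences work inside brackets: whitespace or a digit.
console.log(/[\s\d]/.test("x7"));            // true

// \b means backspace inside brackets, and a word boundary outside them.
console.log(/[\b]/.test("\b"));              // true: the string holds a backspace
console.log(/\bcat\b/.test("a cat sat"));    // true: "cat" stands alone as a word
```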

    Repetition

With the regular expression syntax you've learned so far, you can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. But you don't have any way to describe, for example, a number that can have any number of digits or a string of three letters followed by an optional digit. These more complex patterns use regular-expression syntax that specifies how many times an element of a regular expression may be repeated.

The characters that specify repetition always follow the pattern to which they are being applied. Because certain types of repetition are quite commonly used, there are special characters to represent these cases. For example, + matches one or more occurrences of the previous pattern. Table 11-3 summarizes the repetition syntax.

Table 11-3. Regular expression repetition characters
Character	Meaning
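`{n,m}`	Match the previous item at least n times but no more than m times.
`{n,}`	Match the previous item n or more times.
`{n}`	Match exactly n occurrences of the previous item (no more and no fewer).
`?`	Match zero or one occurrence of the previous item. That is, the previous item is optional. Equivalent to `{0,1}`.
`+`	Match one or more occurrences of the previous item. Equivalent to `{1,}`.
`*`	Match zero or more occurrences of the previous item. Equivalent to `{0,}`.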

Here are some examples:

/\d{2,4}/     // Match between two and four digits
/\w{3}\d?/    // Match exactly three word characters and an optional digit
/\s+java\s+/  // Match "java" with one or more spaces before and after
/[^"]*/       // Match zero or more non-quote characters

Be careful when using the * and ? repetition characters. Since these characters may match zero instances of whatever precedes them, they are allowed to match nothing. For example, the regular expression /a*/ actually matches the string "bbbb" because the string contains zero occurrences of the letter a!
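The zero-occurrence pitfall described above can be observed directly. This is a minimal sketch; the extra sample strings are illustrative assumptions.

```javascript
// /a*/ "matches" a string with no a's by matching the empty string at index 0.
var empty = "bbbb".match(/a*/);
console.log(empty[0].length);   // 0: the match is the empty string
console.log(empty.index);       // 0

// /a+/ requires at least one "a", so it does not match at all.
var none = "bbbb".match(/a+/);
console.log(none);              // null

// Repetition is greedy: {2,4} takes as many digits as it is allowed to.
var greedy = "123456".match(/\d{2,4}/)[0];
console.log(greedy);            // "1234"

// Three word characters followed by an optional digit.
var optional = "abc1 xyz".match(/\w{3}\d?/g);
console.log(optional);          // ["abc1", "xyz"]
```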

Alternation, grouping, and references

The regular-expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab" or the string "cd" or the string "ef". And /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are considered left to right until a match is found. If the left alternative matches, the right alternative is ignored, even if it would have produced a "better" match. Thus, when the pattern /a|ab/ is applied to the string "ab", it matches only the first letter.
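A small sketch of the left-to-right rule for alternatives. The reordered pattern /ab|a/ is an illustrative addition, not from the book.

```javascript
// The left alternative wins as soon as it matches, even if a longer match exists.
var leftFirst = "ab".match(/a|ab/)[0];
console.log(leftFirst);    // "a"

// Putting the longer alternative first changes the result.
var longerFirst = "ab".match(/ab|a/)[0];
console.log(longerFirst);  // "ab"

// Either three digits or four lowercase letters.
var digits = "1234".match(/\d{3}|[a-z]{4}/)[0];
var letters = "wxyz".match(/\d{3}|[a-z]{4}/)[0];
console.log(digits);       // "123"
console.log(letters);      // "wxyz"
```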

Parentheses have several purposes in regular expressions. One purpose is to group separate items into a single subexpression so that the items can be treated as a single unit by |, *, +, ?, and so on. For example, /java(script)?/ matches "java" followed by the optional "script". And /(ab|cd)+|ef/ matches either the string "ef" or one or more repetitions of either of the strings "ab" or "cd".

Another purpose of parentheses in regular expressions is to define subpatterns within the complete pattern. When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string that matched any particular parenthesized subpattern. (You'll see how these matching substrings are obtained later in the chapter.) For example, suppose you are looking for one or more lowercase letters followed by one or more digits. You might use the pattern /[a-z]+\d+/. But suppose you only really care about the digits at the end of each match. If you put that part of the pattern in parentheses (/[a-z]+(\d+)/), you can extract the digits from any matches you find, as explained later.

A related use of parenthesized subexpressions is to allow you to refer back to a subexpression later in the same regular expression. This is done by following a \ character by a digit or digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers back to the first subexpression, and \3 refers to the third. Note that, because subexpressions can be nested within others, it is the position of the left parenthesis that is counted. In the following regular expression, for example, the nested subexpression ([Ss]cript) is referred to as \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/

A reference to a previous subexpression of a regular expression does not refer to the pattern for that subexpression but rather to the text that matched the pattern. Thus, references can be used to enforce a constraint that separate portions of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotes. However, it does not require the opening and closing quotes to match (i.e., both single quotes or both double quotes):

/['"][^'"]*['"]/

To require the quotes to match, use a reference:

/(['"])[^'"]*\1/

The \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quote match the opening quote. This regular expression does not allow single quotes within double-quoted strings or vice versa. It is not legal to use a reference within a character class, so you cannot write:

/(['"])[^\1]*\1/

Later in this chapter, you'll see that this kind of reference to a parenthesized subexpression is a powerful feature of regular-expression search-and-replace operations.
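The quote-matching patterns and the group-numbering rule discussed above can be tried out as follows. The test strings are illustrative assumptions.

```javascript
var anyQuotes = /['"][^'"]*['"]/;     // opening and closing quotes may differ
var sameQuotes = /(['"])[^'"]*\1/;    // \1 must repeat the text matched by group 1

var mismatched = "'mismatched\"";     // opens with ' but closes with "
console.log(anyQuotes.test(mismatched));      // true
console.log(sameQuotes.test(mismatched));     // false
console.log(sameQuotes.test("\"matched\""));  // true

// Group numbers follow the order of the left parentheses.
var nested = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;
var groups = "JavaScript is funny".match(nested);
console.log(groups[1]);   // "JavaScript"
console.log(groups[2]);   // "Script" (the nested group)
console.log(groups[3]);   // "funny"
```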

In JavaScript 1.5 (but not JavaScript 1.2), it is possible to group items in a regular expression without creating a numbered reference to those items. Instead of simply grouping the items within ( and ), begin the group with (?: and end it with ). Consider the following pattern, for example:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/

Here, the subexpression (?:[Ss]cript) is used simply for grouping, so the ? repetition character can be applied to the group. These modified parentheses do not produce a reference, so in this regular expression, \2 refers to the text matched by (fun\w*).
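To see how (?:...) changes the numbering, the sketch below runs both versions of the pattern against the same illustrative string.

```javascript
var capturing = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;
var nonCapturing = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/;
var sentence = "JavaScript is fun";

var withGroup = sentence.match(capturing);
var withoutGroup = sentence.match(nonCapturing);

console.log(withGroup.length);      // 4: the whole match plus three groups
console.log(withGroup[2]);          // "Script"
console.log(withoutGroup.length);   // 3: the whole match plus two groups
console.log(withoutGroup[2]);       // "fun": (fun\w*) is now group 2
```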

Table 11-4 summarizes the regular-expression alternation, grouping, and referencing operators.

Table 11-4. Regular expression alternation, grouping, and reference characters
Character	Meaning
`|`	Alternation. Match either the subexpression to the left or the subexpression to the right.
`(...)`	Grouping. Group items into a single unit that can be used with `*`, `+`, `?`, `|`, and so on. Also remember the characters that match this group for use with later references.
`(?:...)`	Grouping only. Group items into a single unit, but do not remember the characters that match this group.
`\n`	Match the same characters that were matched when group number `n` was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with `(?:` are not numbered.

    Specifying match position

    Pinning down where a match may start and end is also essential for exact matching.

As described earlier, many elements of a regular expression match a single character in a string. For example, \s matches a single character of whitespace. Other regular expression elements match the positions between characters, instead of actual characters. \b, for example, matches a word boundary: the boundary between a \w (ASCII word character) and a \W (nonword character), or the boundary between an ASCII word character and the beginning or end of a string.^[*] Elements such as \b do not specify any characters to be used in a matched string; what they do specify, however, is legal positions at which a match can occur. Sometimes these elements are called regular-expression anchors because they anchor the pattern to a specific position in the search string. The most commonly used anchor elements are ^, which ties the pattern to the beginning of the string, and $, which anchors the pattern to the end of the string.

^[*] Except within a character class (square brackets), where \b matches the backspace character.

For example, to match the word "JavaScript" on a line by itself, you can use the regular expression /^JavaScript$/. If you want to search for "Java" used as a word by itself (not as a prefix, as it is in "JavaScript"), you can try the pattern /\sJava\s/, which requires a space before and after the word. But there are two problems with this solution. First, it does not match "Java" if that word appears at the beginning or the end of a string, but only if it appears with space on either side. Second, when this pattern does find a match, the matched string it returns has leading and trailing spaces, which is not quite what's needed. So instead of matching actual space characters with \s, match (or anchor to) word boundaries with \b. The resulting expression is /\bJava\b/. The element \B anchors the match to a location that is not a word boundary. Thus, the pattern /\B[Ss]cript/ matches "JavaScript" and "postscript", but not "script" or "Scripting".
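The difference between matching spaces and anchoring to word boundaries can be checked with a short sketch. The sample sentences are illustrative assumptions.

```javascript
var spaced = /\sJava\s/;     // requires real whitespace on both sides
var bounded = /\bJava\b/;    // requires only word boundaries

console.log(spaced.test("Java is fun"));          // false: nothing precedes "Java"
console.log(bounded.test("Java is fun"));         // true
console.log(bounded.test("JavaScript is fun"));   // false: "Java" is only a prefix

var withSpaces = "I like Java a lot".match(spaced)[0];
var wordOnly = "I like Java a lot".match(bounded)[0];
console.log(withSpaces);     // " Java " (includes the surrounding spaces)
console.log(wordOnly);       // "Java"

// \B requires a position that is NOT a word boundary.
var inner = /\B[Ss]cript/;
console.log(inner.test("JavaScript"));   // true
console.log(inner.test("postscript"));   // true
console.log(inner.test("script"));       // false
console.log(inner.test("Scripting"));    // false
```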

In JavaScript 1.5 (but not JavaScript 1.2), you can also use arbitrary regular expressions as anchor conditions. If you include an expression within (?= and ) characters, it is a lookahead assertion, and it specifies that the enclosed characters must match, without actually matching them. For example, to match the name of a common programming language, but only if it is followed by a colon, you could use /[Jj]ava([Ss]cript)?(?=\:)/. This pattern matches the word "JavaScript" in "JavaScript: The Definitive Guide", but it does not match "Java" in "Java in a Nutshell" because it is not followed by a colon.

If you instead introduce an assertion with (?!, it is a negative lookahead assertion, which specifies that the following characters must not match. For example, /Java(?!Script)([A-Z]\w*)/ matches "Java" followed by a capital letter and any number of additional ASCII word characters, as long as "Java" is not followed by "Script". It matches "JavaBeans" but not "Javanese", and it matches "JavaScrip" but not "JavaScript" or "JavaScripter".
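Both kinds of lookahead can be exercised with the strings mentioned above. This is a minimal sketch using the book's own patterns.

```javascript
// Positive lookahead: the colon must follow but is not part of the match.
var beforeColon = /[Jj]ava([Ss]cript)?(?=\:)/;
var title = "JavaScript: The Definitive Guide".match(beforeColon);
console.log(title[0]);                                  // "JavaScript" (no colon)
console.log("Java in a Nutshell".match(beforeColon));   // null

// Negative lookahead: "Java" must not be followed by "Script".
var notScript = /Java(?!Script)([A-Z]\w*)/;
console.log(notScript.test("JavaBeans"));      // true
console.log(notScript.test("Javanese"));       // false: no capital letter follows
console.log(notScript.test("JavaScrip"));      // true
console.log(notScript.test("JavaScript"));     // false
console.log(notScript.test("JavaScripter"));   // false
```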

Table 11-5 summarizes regular-expression anchors.

Table 11-5. Regular-expression anchor characters
Character	Meaning
`^`	Match the beginning of the string and, in multiline searches, the beginning of a line.
`$`	Match the end of the string and, in multiline searches, the end of a line.
`\b`	Match a word boundary. That is, match the position between a `\w` character and a `\W` character or between a `\w` character and the beginning or end of a string. (Note, however, that `[\b]` matches backspace.)
`\B`	Match a position that is not a word boundary.
`(?=p)`	A positive lookahead assertion. Require that the following characters match the pattern `p`, but do not include those characters in the match.
`(?!p)`	A negative lookahead assertion. Require that the following characters do not match the pattern `p`.

Flags

The final element of regular-expression syntax is flags. This article covers three of them, listed in the table below:

Character	Meaning
`i`	Perform case-insensitive matching.
`g`	Perform a global match; that is, find all matches rather than stopping after the first match.
`m`	Multiline mode. `^` matches beginning of line or beginning of string, and `$` matches end of line or end of string.
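
A brief sketch of what each flag in the table changes. The sample strings are illustrative assumptions.

```javascript
// i: case-insensitive matching
console.log(/java/.test("JavaScript"));    // false
console.log(/java/i.test("JavaScript"));   // true

// g: find all matches instead of stopping after the first
var all = "Java, java, JAVA".match(/java/gi);
var first = "Java, java, JAVA".match(/java/i)[0];
console.log(all);      // ["Java", "java", "JAVA"]
console.log(first);    // "Java"

// m: ^ and $ also match at the start and end of each line
var lines = "first\nJava\nlast";
console.log(/^Java$/.test(lines));     // false
console.log(/^Java$/m.test(lines));    // true
```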

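String methods for pattern matching

    JavaScript strings provide four methods that use regular expressions:
    String.search();
    String.replace();
    String.match();
    String.split();
    search() takes a regular expression as its argument. If the argument is not a regular expression, it is first passed to the RegExp() constructor to convert it into one.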

        "JavaScript".search(/script/i);

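    search() ignores the g flag and never performs a global search. It returns the position of the start of the first matching substring, or -1 if there is no match. The example above returns 4.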
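
    The return values described above can be checked with a few calls. The extra patterns and strings are illustrative assumptions.

```javascript
var found = "JavaScript".search(/script/i);
var missing = "JavaScript".search(/python/);
var fromString = "JavaScript".search("Script");   // the string is converted to a RegExp
var globalIgnored = "a-b-c".search(/-/g);         // g has no effect on search()

console.log(found);           // 4
console.log(missing);         // -1
console.log(fromString);      // 4
console.log(globalIgnored);   // 1: still the position of the first match
```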

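    replace() performs a search-and-replace operation. Its first argument is a regular expression and its second is the replacement string. If the regular expression has the g flag, every match is replaced; otherwise only the first match is replaced. In the replacement string, $1, $2, and so on stand for the text matched by the corresponding parenthesized subexpression.
    replace() is very useful. For example, the following code replaces the double quotes around each quoted passage with two single quotes on each side.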

        var quote = /"([^"]*)"/g;
        text.replace(quote, "''$1''");
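
    A runnable version of the quote replacement, as a sketch. The sample `text` value and the name-swapping example are illustrative assumptions.

```javascript
var quote = /"([^"]*)"/g;
var text = 'He said "hello" and then "goodbye".';

// $1 stands for whatever the first parenthesized subexpression matched.
var allReplaced = text.replace(quote, "''$1''");
console.log(allReplaced);    // He said ''hello'' and then ''goodbye''.

// Without the g flag only the first match is replaced.
var firstReplaced = text.replace(/"([^"]*)"/, "''$1''");
console.log(firstReplaced);  // He said ''hello'' and then "goodbye".

// Groups can also be reordered in the replacement string.
var swapped = "Doe, John".replace(/(\w+),\s*(\w+)/, "$2 $1");
console.log(swapped);        // John Doe
```

    Note that replace() returns a new string; the original `text` is left unchanged.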

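    match() is the most general of these methods. It takes a regular expression as its only argument and returns an array of the matches: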
       "1 plus 2 equals 3".match(/\d+/g) // returns ["1", "2", "3"]
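    If the regular expression does not have the g flag, match() does not perform a global search. It stops at the first match and returns an array. The first element, array[0], holds the matching string. The next element, array[1], holds the substring that matched the first parenthesized expression, and so on for the remaining elements.
    To draw a parallel with replace(), array[n] holds the contents of $n.
    For example: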

            var url = /(\w+):\/\/([\w.]+)\/(\S*)/;
            var text = "Visit my blog at http://www.example.com/~david";
            var result = text.match(url);
            if (result != null)
            {
              var fullurl = result[0]; // Contains "http://www.example.com/~david"
              var protocol = result[1]; // Contains "http"
              var host = result[2]; // Contains "www.example.com"
              var path = result[3]; // Contains "~david"
            }

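    If the regular expression has the g flag, match() performs a global search, and each element of the returned array holds one substring that matched the whole pattern. In that case the text matched by parenthesized subexpressions is not reported.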
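
    The difference between the two modes can be seen by running one pattern with and without g. The date string is an illustrative assumption.

```javascript
var date = "2007-12-18";

// Without g: the first match, followed by its parenthesized groups.
var single = date.match(/(\d+)-/);
console.log(single[0]);      // "2007-"
console.log(single[1]);      // "2007"
console.log(single.index);   // 0

// With g: every complete match, but no group details.
var every = date.match(/(\d+)-/g);
console.log(every);          // ["2007-", "12-"]

// In both modes the result is null when nothing matches.
var nothing = "abc".match(/\d/g);
console.log(nothing);        // null
```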

    split() usually takes a delimiter string as its argument and breaks the string apart at each occurrence of it. For example:

       "123,456,789".split(","); // Returns ["123","456","789"]

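    The argument can also be a regular expression, which makes the method far more powerful. For example, you can use a regular expression to discard the whitespace on either side of each delimiter: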

       "1, 2, 3, 4, 5".split(/\s*,\s*/); // Returns ["1","2","3","4","5"]

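RegExp methods for pattern matching

    RegExp objects can also be created with the RegExp() constructor, which is a good way to build regular expressions dynamically. It takes one or two string arguments. The first is the body of the regular expression; the optional second argument gives the flags, such as g, i, and m.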

            // Find all five-digit numbers in a string. Note the double \\ in this case.
            var zipcode = new RegExp("\\d{5}", "g");
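
    A sketch of why the backslash must be doubled, and of building a pattern at run time. The helper `wordPattern` is hypothetical (it is not part of any library) and assumes that `word` contains no characters that are special in regular expressions.

```javascript
var zipcode = new RegExp("\\d{5}", "g");
var zips = "Mail to 02134 or 90210".match(zipcode);
console.log(zips);                  // ["02134", "90210"]

// With a single backslash the string literal drops it before RegExp() sees it.
var broken = new RegExp("\d{5}");
console.log(broken.source);         // "d{5}": matches five letter d's, not digits

// Hypothetical helper: build a whole-word, case-insensitive pattern at run time.
// Assumes `word` contains only letters and digits.
function wordPattern(word)
{
    return new RegExp("\\b" + word + "\\b", "i");
}

console.log(wordPattern("java").test("I like Java"));         // true
console.log(wordPattern("java").test("JavaScript is fun"));   // false
```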

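    RegExp objects have two methods for testing whether a string matches the pattern. The first is exec(), which is similar to match(). Unlike match(), exec() returns the same kind of array whether or not the g flag is set. The first element, array[0], holds the complete match, and the following elements hold, in order, the substrings that matched the parenthesized subexpressions. When the pattern has the g flag, each call to exec() sets the RegExp object's lastIndex property to the position immediately after the matched substring.
    When exec() is called again on the same regular expression, the search starts at lastIndex rather than at position 0. If exec() finds no match, it resets lastIndex to 0. This behavior makes it easy to loop through a string and find every matching substring.
    If you stop before the last match has been found, set lastIndex back to 0 yourself before using the same RegExp object on a different string.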

    var pattern = /Java/g;
    var text = "JavaScript is more fun than Java!";
    var result;
    while((result = pattern.exec(text)) != null)
    {
        alert("Matched '" + result[0] + "'" + " at position " + result.index + "; next search begins at " + pattern.lastIndex);
    }

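    The other RegExp matching method is test(), which is much simpler than exec(). It takes a string as its only argument and returns true if the string contains a match, or false if it does not. When the RegExp has the g flag, test() updates lastIndex in the same way that exec() does.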
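
    Because test() shares lastIndex with exec(), reusing a global pattern can give surprising results. The following sketch, with an illustrative pattern and string, shows the effect.

```javascript
var digits = /\d+/g;

var firstCall = digits.test("a1");
var indexAfterFirst = digits.lastIndex;
console.log(firstCall);         // true
console.log(indexAfterFirst);   // 2: just past the matched "1"

// The second call starts searching at index 2, finds nothing, and resets lastIndex.
var secondCall = digits.test("a1");
var indexAfterSecond = digits.lastIndex;
console.log(secondCall);        // false
console.log(indexAfterSecond);  // 0

// With lastIndex back at 0 the pattern matches again.
var thirdCall = digits.test("a1");
console.log(thirdCall);         // true

// A failed test() returns false, not null.
console.log(/\d+/.test("abc")); // false
```

    When only a yes-or-no answer is needed, a pattern without the g flag avoids this state entirely.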

Examples:

    Pass the textbox to checkDate() as its argument. The function checks whether the text typed into the textbox is a valid date in mm/dd/yyyy format:

function checkDate(textbox)
{
    var strValue = textbox.value;
    // A regular-expression literal cannot span several lines or contain
    // comments, so each alternative is written as a string (note the
    // doubled backslashes) and the pieces are joined with "|".
    var pattern = new RegExp([
        // Months with 31 days
        "(^(10|12|0?[13578])([/])(3[01]|[12][0-9]|0?[1-9])([/])((1[8-9]\\d{2})|([2-9]\\d{3}))$)",
        // Months with 30 days
        "(^(11|0?[469])([/])(30|[12][0-9]|0?[1-9])([/])((1[8-9]\\d{2})|([2-9]\\d{3}))$)",
        // February 1 to 28, in any year from 1800 to 9999
        "(^(0?2)([/])(2[0-8]|1[0-9]|0?[1-9])([/])((1[8-9]\\d{2})|([2-9]\\d{3}))$)",
        // February 29, leap years only
        "(^(0?2)([/])(29)([/])([2468][048]00)$)",
        "(^(0?2)([/])(29)([/])([3579][26]00)$)",
        "(^(0?2)([/])(29)([/])([1][89][0][48])$)",
        "(^(0?2)([/])(29)([/])([2-9][0-9][0][48])$)",
        "(^(0?2)([/])(29)([/])([1][89][2468][048])$)",
        "(^(0?2)([/])(29)([/])([2-9][0-9][2468][048])$)",
        "(^(0?2)([/])(29)([/])([1][89][13579][26])$)",
        "(^(0?2)([/])(29)([/])([2-9][0-9][13579][26])$)"
    ].join("|"));

    // test() returns true only when the whole input is a valid date
    return pattern.test(strValue);
}
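
As a point of comparison, and not the article's method, the sketch below uses a much shorter regular expression only to extract the three fields and then lets the built-in Date object decide whether the day exists. The function name `isValidDate` is hypothetical, and unlike checkDate() it takes a plain string and does not restrict the year to 1800 or later.

```javascript
// Hypothetical alternative: the pattern only checks the shape mm/dd/yyyy;
// the calendar rules (month lengths, leap years) are left to Date.
function isValidDate(str)
{
    var fields = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(str);
    if (fields === null)
    {
        return false;
    }
    var month = Number(fields[1]);
    var day = Number(fields[2]);
    var year = Number(fields[3]);

    // Date rolls impossible dates forward (Feb 30 becomes Mar 1 or 2),
    // so the date is valid only if every field survives the round trip.
    var d = new Date(year, month - 1, day);
    return d.getFullYear() === year &&
           d.getMonth() === month - 1 &&
           d.getDate() === day;
}

console.log(isValidDate("02/29/2008"));   // true: 2008 is a leap year
console.log(isValidDate("02/29/2007"));   // false
console.log(isValidDate("4/31/2008"));    // false: April has 30 days
console.log(isValidDate("13/01/2008"));   // false: there is no month 13
```

The trade-off is that the regular expression stays readable, while the validation logic moves into ordinary code.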

Check whether the text entered in the textbox has the form of an email address:

function checkEmail(textbox)
{
    var strValue = textbox.value;

    // ^ and $ anchor the pattern so that the whole input, not just
    // part of it, must look like an email address
    var pattern = /^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$/;

    return pattern.test(strValue);
}

posted on 2007-12-18 12:39 Willson
