July 22, 2007

An HREF link can be written in several ways and still work in a browser, for example:
<a href=http://www.yahoo.com target=_blank>link</a>
<a href=http://www.yahoo.com>link</a>
<a href="http://www.yahoo.com">link</a>
<a href='http://www.yahoo.com'>link</a>
<a href="http://www.yahoo.com" >link</a>
Microsoft’s example regular expression, found on MSDN, only works reliably for links with double quotes. The pattern string, shown here with the doubled quotes it needs inside a C# verbatim string, looks like this:
href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))
Not very helpful. Of course we ended up having to write our own from scratch. Here is what we came up with (all on one line):
(?:[hH][rR][eE][fF]\s*=)
(?:[\s""']*)
(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)
(.*?)(?:[\s>""'])
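A minimal sketch of how the pattern above might be applied from C#. The names HrefSketch and ExtractHrefs are made up for illustration. Skipping empty captures is an assumption of this sketch: for an excluded target such as href="#top", the engine can backtrack past the opening quote and still report a match whose group 1 is empty.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefSketch
{
    // The four pattern pieces from the post, joined into one pattern.
    // Inside a verbatim string, "" stands for a single double quote.
    private const string Pattern =
        @"(?:[hH][rR][eE][fF]\s*=)(?:[\s""']*)" +
        @"(?!#|[Mm]ailto|[lL]ocation.|[jJ]avascript|.*css|.*this\.)" +
        @"(.*?)(?:[\s>""'])";

    // Hypothetical helper: returns the link targets captured in group 1.
    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            string target = m.Groups[1].Value;
            // Assumption: an empty capture means the link was excluded.
            if (target.Length > 0)
                links.Add(target);
        }
        return links;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a href=http://www.yahoo.com target=_blank>link</a>",
            "<a href=http://www.yahoo.com>link</a>",
            "<a href=\"http://www.yahoo.com\">link</a>",
            "<a href='http://www.yahoo.com'>link</a>",
            "<a href=\"http://www.yahoo.com\" >link</a>"
        };
        foreach (string sample in samples)
            foreach (string link in ExtractHrefs(sample))
                Console.WriteLine(link); // http://www.yahoo.com each time
    }
}
```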
More interesting RegEx links:
Regular Expression Library :: ASPSmith.com -- an online compendium of regexs commonly needed
Regular Expressions in .NET :: LarkWare.com
Regular Expressions in .NET :: Windows Developer Magazine :: Michael Weinhardt and Chris Sells
posted @ 2007-07-22 21:30 烧烤

Simple Matches

Let's start with simple expressions using the Regex and the Match class.

    Match m = Regex.Match("abracadabra", "(a|b|r)+");

You now have an instance of a Match that can be tested for success, as in:

    if (m.Success)
...

without even looking at the contents of the matched string.

If you want to use the matched string, you can simply convert it to a string:

    Console.WriteLine("Match="+m.ToString());

This example gives us the output:

Match=abra

which is the portion of the string that was successfully matched.
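As a small sketch of the same idea, a Match also exposes where the matched text sits in the input through its Value, Index, and Length properties. The class and method names here (SimpleMatchSketch, Describe) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleMatchSketch
{
    // Hypothetical helper: summarizes the first match of pattern in input.
    public static string Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        // Success can be tested without looking at the matched text.
        if (!m.Success)
            return "no match";
        return "Match=" + m.Value + " Index=" + m.Index + " Length=" + m.Length;
    }

    public static void Main()
    {
        // prints "Match=abra Index=0 Length=4"
        Console.WriteLine(Describe("abracadabra", "(a|b|r)+"));
        // prints "no match"
        Console.WriteLine(Describe("xyz", "(a|b|r)+"));
    }
}
```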

Replacing Strings

Simple string replacements are very straightforward. For example, the statement:

  string s = Regex.Replace("abracadabra", "abra", "zzzz");

returns the string zzzzcadzzzz, in which all occurrences of the matching pattern are replaced by the replacement string zzzz.

Now let's look at a more complex expression:

  string s = Regex.Replace("  abra  ", @"^\s*(.*?)\s*$", "$1");

This returns the string abra, with preceding and trailing spaces removed.

The above pattern is generally useful for removing leading and trailing spaces from any string. It also uses the C# verbatim string literal, @"...". Within a verbatim string, the compiler does not process \ as an escape character, which makes @"..." very useful for regular expressions that escape metacharacters with a \. Also of note is the use of $1 as the replacement string. Besides literal text, the replacement string can contain substitutions, which are references to capture groups in the regular expression.
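To illustrate substitutions, here is a short sketch that mixes literal text with group references in the replacement string, using both numbered ($1) and named (${name}) forms. The class name, method names, and the "key = value" input are assumptions chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionSketch
{
    // $1 and $2 refer to the capture groups; " <- " is literal text.
    public static string Swap(string input)
    {
        return Regex.Replace(input, @"(\w+)\s*=\s*(\w+)", "$2 <- $1");
    }

    // The same replacement written with named groups.
    public static string SwapNamed(string input)
    {
        return Regex.Replace(input,
            @"(?<key>\w+)\s*=\s*(?<value>\w+)", "${value} <- ${key}");
    }

    // The trimming pattern from the text above.
    public static string Trim(string input)
    {
        return Regex.Replace(input, @"^\s*(.*?)\s*$", "$1");
    }

    public static void Main()
    {
        Console.WriteLine(Swap("myval = 3"));            // 3 <- myval
        Console.WriteLine(SwapNamed("myval = 3"));       // 3 <- myval
        Console.WriteLine("[" + Trim("  abra  ") + "]"); // [abra]
    }
}
```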

Engine Details

Now let's try to understand a slightly more complex sample by doing a walk-through of a grouping structure. Given the following sample:

string text = "abracadabra1abracadabra2abracadabra3";
string pat = @"
(       # start the first group
  abra  # match the literal 'abra'
  (     # start the second (inner) group
    cad # match the literal 'cad'
  )?    # end the second (optional) group
)       # end the first group
+       # match one or more occurrences
";
// IgnorePatternWhitespace (the 'x' option) ignores comments and unescaped whitespace
Regex r = new Regex(pat, RegexOptions.IgnorePatternWhitespace);
// get the list of group numbers
int[] gnums = r.GetGroupNumbers();
// get first match
Match m = r.Match(text);
while (m.Success)
{
    // start at group 1
    for (int i = 1; i < gnums.Length; i++)
    {
        // get the group for this match
        Group g = m.Groups[gnums[i]];
        Console.WriteLine("Group"+gnums[i]+"=["+g.ToString()+"]");
        // get caps for this group
        CaptureCollection cc = g.Captures;
        for (int j = 0; j < cc.Count; j++)
        {
            Capture c = cc[j];
            Console.WriteLine("	Capture" + j + "=["+c.ToString()
                + "] Index=" + c.Index + " Length=" + c.Length);
        }
    }
    // get next match
    m = m.NextMatch();
}

the output of this sample would be:

Group1=[abra]
    Capture0=[abracad] Index=0 Length=7
    Capture1=[abra] Index=7 Length=4
Group2=[cad]
    Capture0=[cad] Index=4 Length=3
Group1=[abra]
    Capture0=[abracad] Index=12 Length=7
    Capture1=[abra] Index=19 Length=4
Group2=[cad]
    Capture0=[cad] Index=16 Length=3
Group1=[abra]
    Capture0=[abracad] Index=24 Length=7
    Capture1=[abra] Index=31 Length=4
Group2=[cad]
    Capture0=[cad] Index=28 Length=3

Let's start by examining the string pat, which contains the regular expression. The first capture group opens at the first parenthesis, and the expression then matches the literal abra. The second capture group opens at the second parenthesis while the first group is still open, so the first group can match abracad while the second group matches just cad. Because the ? metacharacter makes the cad group optional, the first group matches either abra or abracad. Finally, the first group is closed, and the + metacharacter asks the engine to match one or more occurrences of it.

Now let's examine what happens during the matching process. First, create an instance of the expression by calling the Regex constructor, which is also where you specify your options. In this case, I'm using the x option (RegexOptions.IgnorePatternWhitespace in the released .NET API), because I have included comments and some formatting whitespace in the regular expression itself. With this option turned on, the engine ignores the comments and all whitespace that is not explicitly escaped.

Next, get the list of group numbers (gnums) defined in this regular expression. You could also have used these numbers explicitly, but this provides you with a programmatic method. This method is also useful if you have specified named groups, as a way of quickly indexing through the set of groups.

Next, perform the first match. Then enter a loop testing for success of the current match. The next step is to iterate through the list of groups starting at group 1. The reason you do not use group 0 in this sample is that group 0 is the fully captured match string, and what you usually (but not always) want to pick out of a string is a subgroup. You might use group 0 if you wanted to collect the fully matched string as a single string.

Within each group, iterate through the CaptureCollection. There is usually only one capture per match, per group, but in this case Group1 shows two captures: Capture0 and Capture1. If you had asked only for the ToString of Group1, you would have received abra, even though the group also matched abracad, because a group's ToString value is the value of the last Capture in its CaptureCollection. This is the expected behavior. If you want the match to stop after a single occurrence of the group, remove the + from the expression so that the group is matched only once.
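The walkthrough mentions that the same approach helps with named groups. Here is a minimal sketch of that idea using Regex.GetGroupNames. The group names outer and inner, and the class and method names, are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupNamesSketch
{
    // Hypothetical helper: one summary line per named group of the first match.
    public static List<string> Describe(string text)
    {
        // Same structure as the walkthrough pattern, with names on the groups.
        var r = new Regex(@"(?<outer>abra(?<inner>cad)?)+");
        var lines = new List<string>();
        Match m = r.Match(text);
        foreach (string name in r.GetGroupNames())
        {
            // Group "0" is the whole match, so skip it here.
            if (name == "0")
                continue;
            Group g = m.Groups[name];
            // g.Value is the last capture; g.Captures holds every capture.
            lines.Add(name + "=" + g.Value + " (" + g.Captures.Count + " captures)");
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("abracadabra1"))
            Console.WriteLine(line);
        // outer=abra (2 captures)
        // inner=cad (1 captures)
    }
}
```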

Procedural-Based vs. Expression-Based

Generally, the users of regular expressions will tend to fall into one of two groups.

The first group tends to use minimal regular expressions that provide matching or grouping behaviors, and then write procedural code to perform some iterative behavior.

The second group tries to utilize the maximum power and functionality of the expression-processing engine itself, with as little procedural logic as possible.

For most of us, the best answer is somewhere in between, and I hope this article outlines both the capabilities of the .NET regexp classes, as well as the trade-offs in complexity and performance of the solution.

Procedural-Based Patterns

A common processing need is to match certain parts of a string and perform some processing. So, here's an example that matches words within a string and capitalizes them:

string text = "the quick red fox jumped over the lazy brown dog.";
System.Console.WriteLine("text=[" + text + "]");
string result = "";
string pattern = @"\w+|\W+";
foreach (Match m in Regex.Matches(text, pattern))
{
    // get the matched string
    string x = m.ToString();
    // if the first char is lower case
    if (char.IsLower(x[0]))
        // capitalize it
        x = char.ToUpper(x[0]) + x.Substring(1, x.Length-1);
    // collect all text
    result += x;
}
System.Console.WriteLine("result=[" + result + "]");

As you can see, you use the C# foreach statement to iterate over the set of matches found and process each one, in this case building up a new result string.

The output of the sample is:

text=[the quick red fox jumped over the lazy brown dog.]
result=[The Quick Red Fox Jumped Over The Lazy Brown Dog.]

Expression-Based Patterns

Another way to implement the above example is to provide a MatchEvaluator, a delegate that Regex.Replace calls once for each match, so the whole transformation is expressed as a single Replace call.

So the new sample looks like:

class Test
{
    static string CapText(Match m)
    {
        // get the matched string
        string x = m.ToString();
        // if the first char is lower case
        if (char.IsLower(x[0]))
            // capitalize it
            return char.ToUpper(x[0]) + x.Substring(1, x.Length-1);
        return x;
    }

    static void Main()
    {
        string text = "the quick red fox jumped over the lazy brown dog.";
        System.Console.WriteLine("text=[" + text + "]");
        string pattern = @"\w+";
        string result = Regex.Replace(text, pattern,
            new MatchEvaluator(Test.CapText));
        System.Console.WriteLine("result=[" + result + "]");
    }
}

Also of note is that the pattern was simplified since I only needed to modify the words and not the non-words.
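In later versions of C#, the evaluator can also be written inline as a lambda expression instead of a separate method. The following is a sketch of that variant, not part of the original sample, and the class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CapitalizeSketch
{
    // Capitalizes each word; the lambda is converted to a MatchEvaluator.
    public static string Capitalize(string text)
    {
        return Regex.Replace(text, @"\w+", m =>
            char.IsLower(m.Value[0])
                ? char.ToUpper(m.Value[0]) + m.Value.Substring(1)
                : m.Value);
    }

    public static void Main()
    {
        // prints "The Quick Red Fox"
        Console.WriteLine(Capitalize("the quick red fox"));
    }
}
```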

Cookbook Expressions

To wrap up this overview of how regular expressions are used in the C# environment, I'll leave you with a set of useful expressions that have been used in other environments. I got them from a great book, the Perl Cookbook, by Tom Christiansen and Nathan Torkington, and updated them for C# programmers. I hope you find them useful.

Roman Numbers

string p1 = "^m*(d?c{0,3}|c[dm])"
+ "(l?x{0,3}|x[lc])(v?i{0,3}|i[vx])$";
string t1 = "vii";
Match m1 = Regex.Match(t1, p1);

Swapping First Two Words

string t2 = "the quick brown fox";
string p2 = @"(\S+)(\s+)(\S+)";
Regex x2 = new Regex(p2);
string r2 = x2.Replace(t2, "$3$2$1", 1);

Keyword = Value

string t3 = "myval = 3";
string p3 = @"(\w+)\s*=\s*(.*)\s*$";
Match m3 = Regex.Match(t3, p3);

Line of at Least 80 Characters

string t4 = "********************"
+ "******************************"
+ "******************************";
string p4 = ".{80,}";
Match m4 = Regex.Match(t4, p4);

MM/DD/YY HH:MM:SS

string t5 = "01/01/01 16:10:01";
string p5 =
@"(\d+)/(\d+)/(\d+) (\d+):(\d+):(\d+)";
Match m5 = Regex.Match(t5, p5);

Changing Directories (for Windows)

string t6 =
@"C:\Documents and Settings\user1\Desktop\";
// the replacement string is not a pattern, so its backslashes are not doubled
string r6 = Regex.Replace(t6,
@"\\user1\\",
@"\user2\");

Expanding (%nn) Hex Escapes

string t7 = "%41"; // capital A
string p7 = "%([0-9A-Fa-f][0-9A-Fa-f])";
// uses a MatchEvaluator delegate
string r7 = Regex.Replace(t7, p7,
HexConvert);
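The snippet above refers to a HexConvert evaluator that is not shown. One possible implementation is sketched here as an assumption about what it does: it parses the two captured hex digits and returns the corresponding character. The wrapper class and the Expand helper are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class HexEscapeSketch
{
    // Assumed behavior: turn the two hex digits in group 1 into one character.
    public static string HexConvert(Match m)
    {
        int code = int.Parse(m.Groups[1].Value,
            NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return ((char)code).ToString();
    }

    // Hypothetical helper that applies the cookbook pattern to a string.
    public static string Expand(string input)
    {
        return Regex.Replace(input, "%([0-9A-Fa-f][0-9A-Fa-f])", HexConvert);
    }

    public static void Main()
    {
        Console.WriteLine(Expand("%41"));   // A
        Console.WriteLine(Expand("a%20b")); // a b
    }
}
```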

Deleting C Comments (Imperfectly)

string t8 = @"
/*
* this is an old cstyle comment block
*/
";
string p8 = @"
/\*  # match the opening delimiter
.*?  # match a minimal number of characters
\*/  # match the closing delimiter
";
string r8 = Regex.Replace(t8, p8, "",
RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

Removing Leading and Trailing Whitespace

string t9a = "   leading";
string p9a = @"^\s+";
string r9a = Regex.Replace(t9a, p9a, "");
string t9b = "trailing  ";
string p9b = @"\s+$";
string r9b = Regex.Replace(t9b, p9b, "");

Turning '\' Followed by 'n' Into a Real Newline

string t10 = @"\ntest\n";
string r10 = Regex.Replace(t10, @"\\n", "\n");

IP Address

string t11 = "55.54.53.52";
string p11 = "^" +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])\." +
@"([01]?\d\d?|2[0-4]\d|25[0-5])" +
"$";
Match m11 = Regex.Match(t11, p11);

Removing Leading Path from Filename

string t12 = @"c:\file.txt";
string p12 = @"^.*\\";
string r12 = Regex.Replace(t12, p12, "");

Joining Lines in Multiline Strings

string t13 = @"this is
a split line";
string p13 = @"\s*\r?\n\s*";
string r13 = Regex.Replace(t13, p13, " ");

Extracting All Numbers from a String

string t14 = @"
test 1
test 2.3
test 47
";
string p14 = @"(\d+\.?\d*|\.\d+)";
MatchCollection mc14 = Regex.Matches(t14, p14);
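Most of these cookbook snippets stop once the Match or MatchCollection is created. As a sketch of the remaining step, the following iterates over the matches and converts each one to a number. The class and method names are made up, and parsing with the invariant culture is an assumption so that "2.3" is read with a decimal point.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumbersSketch
{
    // Hypothetical helper: returns every number found in text, in order.
    public static List<double> ExtractNumbers(string text)
    {
        var numbers = new List<double>();
        foreach (Match m in Regex.Matches(text, @"(\d+\.?\d*|\.\d+)"))
        {
            // m.Value is the matched text, for example "2.3".
            numbers.Add(double.Parse(m.Value, CultureInfo.InvariantCulture));
        }
        return numbers;
    }

    public static void Main()
    {
        foreach (double n in ExtractNumbers("test 1\ntest 2.3\ntest 47\n"))
            Console.WriteLine(n.ToString(CultureInfo.InvariantCulture));
        // 1
        // 2.3
        // 47
    }
}
```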

Finding All Caps Words

string t15 = "This IS a Test OF ALL Caps";
string p15 = @"(\b[^\Wa-z0-9_]+\b)";
MatchCollection mc15 = Regex.Matches(t15, p15);

Finding All Lowercase Words

string t16 = "This is A Test of lowercase";
string p16 = @"(\b[^\WA-Z0-9_]+\b)";
MatchCollection mc16 = Regex.Matches(t16, p16);

Finding All Initial Caps

string t17 = "This is A Test of Initial Caps";
string p17 = @"(\b[^\Wa-z0-9_][^\WA-Z0-9_]*\b)";
MatchCollection mc17 = Regex.Matches(t17, p17);

Finding Links in Simple HTML

string t18 = @"
<html>
<a href=""first.htm"">first tag text</a>
<a href=""next.htm"">next tag text</a>
</html>
";
string p18 = @"<A[^>]*?HREF\s*=\s*[""']?"
+ @"([^'"" >]+?)[ '""]?>";
MatchCollection mc18 = Regex.Matches(t18, p18,
RegexOptions.Singleline | RegexOptions.IgnoreCase);

Finding Middle Initial

string t19 = "Hanley A. Strappman";
string p19 = @"^\S+\s+(\S)\S*\s+\S";
Match m19 = Regex.Match(t19, p19);

Changing Inch Marks to Quotes

string t20 = @"2' 2"" ";
string p20 = "\"([^\"]*)";
string r20 = Regex.Replace(t20, p20, "``$1''");

Download the source code for these Cookbook Expressions.

 
posted @ 2007-07-22 21:08 烧烤