[2019.04.22]正则表达式的失控——回溯循环

1.什么是回溯循环

Consider the regular expression (x+x+)+y. Before you scream in horror and say this contrived example should be written as xx+y to match exactly the same without those terribly nested quantifiers: just assume that each "x" represents something more complex, with certain strings being matched by both "x". See the section on HTML files below for a real example.

Let's see what happens when you apply this regex to xxxxxxxxxxy. The first x+ will match all 10 x characters. The second x+ fails. The first x+ then backtracks to 9 matches, and the second one picks up the remaining x. The group has now matched once. The group repeats, but fails at the first x+. Since one repetition was sufficient, the group matches. y matches y and an overall match is found. The regex is declared functional, the code is shipped to the customer, and his computer explodes. Almost.

The above regex turns ugly when the y is missing from the subject string. When y fails, the regex engine backtracks. The group has one iteration it can backtrack into. The second x+ matched only one x, so it can't backtrack. But the first x+ can give up one x. The second x+ promptly matches xx. The group again has one iteration, fails the next one, and the y fails. Backtracking again, the second x+ now has one backtracking position, reducing itself to match x. The group tries a second iteration. The first x+ matches but the second is stuck at the end of the string. Backtracking again, the first x+ in the group's first iteration reduces itself to 7 characters. The second x+ matches xxx. Failing y, the second x+ is reduced to xx and then x. Now, the group can match a second iteration, with one x for each x+. But this (7,1),(1,1) combination fails too. So it goes to (6,4) and then (6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think you start to get the drift.

If you try this regex on a 10x string in RegexBuddy's debugger, it'll take 2558 steps to figure out the final y is missing. For an 11x string, it needs 5118 steps. For 12, it takes 10238 steps. Clearly we have an exponential complexity of O(2^n) here. At 21x the debugger bows out at 2.8 million steps, diagnosing a bad case of catastrophic backtracking.

RegexBuddy is forgiving in that it detects it's going in circles, and aborts the match attempt. Other regex engines (like .NET) will keep going forever, while others will crash with a stack overflow (like Perl, before version 5.10). Stack overflows are particularly nasty on Windows, since they tend to make your application vanish without a trace or explanation. Be very careful if you run a web service that allows users to supply their own regular expressions. People with little regex experience have surprising skill at coming up with exponentially complex regular expressions.

RegexBuddy 会自己发现死循环并放弃匹配尝试，可是有些表达式引擎（比如.NET）会一直试下去，或者直接因为栈溢出崩溃掉（比如5.10版本以前的Perl）。在Win上发生堆栈溢出是很烦人的，因为你的应用程序会突然“无缘无故没有理由地”闪退。所以在运营使用自定义正则表达式的Web服务的时候一定要小心，新手有惊人的写出指数级别的复杂表达式的能力（译者按：说的是我）。

1get the drift -- Understand the general meaning or purport

2Bow out -- To leave a job or stop doing an activity, usually after a long time

1.正则表达式是回溯式判定的：当包含“+”这种“to unlimitied times”的符号时，一次判定到字符串结尾，表达式却没有判定完，引擎会回溯并减少先前的判定次数，重试判定。

2.没有长时间循环限制的引擎会一直尝试下去直到所有的判定组合都被尝试（暴力测试啊！）

1.消除掉表达式中的嵌套量词，可选项不要用+这种东西：比如([\w]+[\w.-]+) -->([\w][\w.-]+)，这尤其重要

使用一个只匹配一次的量词可以避免灾难性的回溯

2.[我还没研究明白dalao想表达什么]

Regex: ([A-Za-z0-9]+([\w.-]?[A-Za-z0-9]+)*)+@([\w-]+\.)+([A-Za-z0-9]+)