Should a WSGI application narrow the set of candidate regexes before URL regex matching?
Background
For business reasons I planned to implement a WSGI application on top of existing logic. While implementing it, I ran into the problem of URL matching. Someone suggested that, much like a prepared query in SQL, we could preprocess the patterns to narrow the set of regexes that actually have to be matched.
There are a few things worth considering here:
- How to narrow the set of candidate regexes.
- Whether the cost of the extra string comparison (startswith) is actually lower than the savings from the narrowed regex matching.
Verification
First, process each regular expression to extract the "leading few" characters that can be used for plain string matching. This requires parsing the regex itself. Fortunately, Python's mature standard library ships with a module that parses regular expressions.
Parsing the regular expression
re.sre_parse parses a regular expression and returns the type of each node: "literal" for ordinary characters, "in" for character classes, "max_repeat" for greedy repetition, and so on.
In [28]: re.sre_parse.parse(r"/asdas\wd\sa\Wsssss\Safadf\Dsd\d+a[^a]*dasdasd")
With re.sre_parse the task becomes simple: scan the parsed nodes, and on the first non-literal node, cut the pattern at that position and use the prefix as the string-match key.
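The prefix-extraction step can be sketched as follows. Note that `sre_parse` is CPython's internal regex parser (renamed to `re._parser` in Python 3.11+), so this leans on an implementation detail rather than a public API; the helper name `literal_prefix` is mine, not from the original post.

```python
# Rely on CPython's internal regex parser; its location moved in 3.11.
try:
    from re import _parser as sre_parse  # Python 3.11+
except ImportError:
    import sre_parse                     # older versions

def literal_prefix(pattern):
    """Return the run of plain characters at the start of `pattern`."""
    chars = []
    for op, arg in sre_parse.parse(pattern):
        # Stop at the first node that is not an ordinary character.
        if op is not sre_parse.LITERAL:
            break
        chars.append(chr(arg))
    return "".join(chars)

print(literal_prefix(r"/ddd\d+"))  # -> /ddd
print(literal_prefix(r"/sss"))     # -> /sss
```

A pattern that starts with a metacharacter (e.g. `r"\d+"`) yields an empty prefix, so every URL would pass the startswith pre-check for it.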
Implementing the rest of the logic
The rest of the logic is straightforward: simulate the URL-matching step of a WSGI application directly. str_match and regex_match took 29s and 21s respectively.
import re

match_list = [
    (r"/sss", "dataA"),
    (r"/ddd\d+", "dataB"),
    (r"/asd\W+", "dataC"),
    (r"/bbb", "dataD"),
]
match_list_start = []
for n, i in enumerate(match_list):
    # re.sre_parse is an internal module (moved to re._parser in Python 3.11+).
    # Mark each parsed node: 0 for a plain character, 1 for anything else.
    reg_nodes = [0 if str(node[0]).lower() == "literal" else 1
                 for node in re.sre_parse.parse(i[0])]
    position = -1
    try:
        position = reg_nodes.index(1)
    except ValueError:
        pass
    # The literal prefix of the pattern becomes the string-match key.
    match_list_start.append(i[0][:position] if position > -1 else i[0])
    match_list[n] = (re.compile(i[0]), i[1])

def str_match(ls):
    for i in ls:
        for n, j in enumerate(match_list_start):
            if i.startswith(j):
                match = match_list[n][0].match(i)
                if match:
                    # do matching work here
                    pass

def regex_match(ls):
    for i in ls:
        for j in match_list:
            match = j[0].match(i)
            if match:
                # do matching work here
                pass

example_str = [
    "/ssssssssssss",
    "/ddd123123",
    "/asd4134134asdasd",
    "/bbbbbbbbbbbbbbb",
    "/bbb",
    "/sss",
    "/dddwww",
    "/asdWWWWWW",
] * 800000

str_match(example_str)
regex_match(example_str)
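To time the two strategies without running the full 6.4-million-URL loop above, a smaller self-contained benchmark along these lines can be used (this is my sketch, not the original measurement; the patterns and iteration counts are illustrative):

```python
import re
import timeit

# Routes: (compiled pattern, payload), plus the literal prefix of each pattern.
routes = [(re.compile(p), data) for p, data in
          [(r"/sss", "dataA"), (r"/ddd\d+", "dataB"),
           (r"/asd\W+", "dataC"), (r"/bbb", "dataD")]]
prefixes = ["/sss", "/ddd", "/asd", "/bbb"]

urls = ["/ssssssssssss", "/ddd123123", "/asd4134134asdasd", "/bbb"] * 1000

def with_prefix():
    # startswith pre-check before running the regex.
    for url in urls:
        for (regex, _), prefix in zip(routes, prefixes):
            if url.startswith(prefix) and regex.match(url):
                break

def regex_only():
    # Run every compiled regex directly.
    for url in urls:
        for regex, _ in routes:
            if regex.match(url):
                break

print("with prefix:", timeit.timeit(with_prefix, number=10))
print("regex only:", timeit.timeit(regex_only, number=10))
```

Unlike the listing above, this version stops at the first matching route, which is closer to how a real URL router behaves.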
You can measure the result yourself; my run here took 95s for str_match.
Conclusion
In the data reported by cProfile, the string startswith calls alone took 16s; a later run executing only regex.match took 21s. So matching with the regex directly is more efficient.
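The per-call breakdown above comes from cProfile; a minimal sketch of how to get such numbers (pattern and URLs here are illustrative, not the original benchmark):

```python
import cProfile
import io
import pstats
import re

pattern = re.compile(r"/ddd\d+")
urls = ["/ddd123123", "/nomatch"] * 10000

def run():
    # The startswith pre-check followed by the regex match, as in str_match.
    for url in urls:
        if url.startswith("/ddd"):
            pattern.match(url)

profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Print cumulative time per function; str.startswith shows up as its own row.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats()
print(out.getvalue())
```

The report lists `method 'startswith' of 'str' objects` and `method 'match' of 're.Pattern' objects` separately, which is how the 16s startswith figure above was isolated from the total.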