Functions in Python regular expressions - 司徒轩宇

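Python's re module provides many convenient functions for working with strings through regular expressions. Each function has its own behavior and use cases, and knowing them well will help a lot in your day-to-day work.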

- compile(pattern, flags=0)

Given a regular expression pattern and the flags to use (flags defaults to 0, meaning no flags), compile returns an SRE_Pattern object (see section 4, on the built-in objects of re).

import re

regex = re.compile(".+")
print regex
# output> <_sre.SRE_Pattern object at 0x00000000026BB0B8>

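This object exposes the other functions as methods for matching. In general, precompiling a pattern with compile is recommended, because the compiled object is easy to reuse later in the code. Most functions can also be called directly without compile; see the findall function for an example.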

s = '''first line
second line
third line'''
#
regex = re.compile(".+")
# call the findall method
print regex.findall(s)
# output> ['first line', 'second line', 'third line']
# call the search method
print regex.search(s).group()
# output> first line
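
As a small sketch of the reuse argument above: one compiled pattern object serves several strings, and flags are passed once at compile time. It is written for Python 3 (unlike the Python 2 snippets in this post), and the names word_regex, ci_regex and lines are made up for illustration.

```python
import re

# Compile once, reuse many times.
word_regex = re.compile(r"\w+")

# Flags are given at compile time; here matching ignores case.
ci_regex = re.compile(r"first", re.IGNORECASE)

lines = ["first line", "second line", "third line"]

# The same compiled object handles every string.
first_words = [word_regex.match(line).group() for line in lines]
word_counts = [len(word_regex.findall(line)) for line in lines]
ci_hit = ci_regex.search("FIRST line").group()

print(first_words)
print(word_counts)
print(ci_hit)
```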

- escape(pattern)

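Escaping: if the text you want to match contains regex metacharacters, you have to prefix each of them with a backslash \ so that it matches itself. When there are many such characters, the resulting pattern is messy to read and tedious to write. In that case you can use this function, as follows: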

s = ".+\d123"
#
regex_str = re.escape(".+\d123")
# show the escaped pattern string
print regex_str
# output> \.\+\\d123

# show what the escaped pattern matches
for g in re.findall(regex_str, s):
    print g
# output> .+\d123
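
To make the motivation concrete, here is a rough sketch (Python 3, with an invented literal and text) showing that a string full of metacharacters fails to match itself until it is passed through re.escape.

```python
import re

# Hypothetical literal text that happens to contain regex metacharacters.
literal = "1+1=2 (really?)"
text = "Everyone knows that 1+1=2 (really?) holds."

# Unescaped, "+", "(", ")" and "?" are read as regex syntax,
# so the pattern does not match the literal text.
unescaped_hit = re.search(literal, text)

# Escaped, every metacharacter matches itself.
escaped_hit = re.search(re.escape(literal), text)

print(unescaped_hit)
print(escaped_hit.group())
```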

- findall(pattern, string, flags=0)

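pattern is the regular expression, string is the string to search, and flags selects the matching flags. The function finds every substring of string that matches the pattern and returns them as a list; if nothing matches, it returns an empty list.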

s = '''first line
second line
third line'''

# precompile with compile, then call findall
regex = re.compile(r"\w+")
print regex.findall(s)
# output> ['first', 'line', 'second', 'line', 'third', 'line']

# call findall directly, without compile
print re.findall(r"\w+", s)
# output> ['first', 'line', 'second', 'line', 'third', 'line']
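
The two remaining points from the description, the empty list on no match and the flags argument, can be sketched like this (Python 3; re.MULTILINE is just one flag chosen for illustration).

```python
import re

s = "first line\nsecond line\nthird line"

# No digits in s, so findall returns an empty list rather than None.
no_digits = re.findall(r"\d+", s)

# With re.MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", s, flags=re.MULTILINE)

print(no_digits)
print(line_starts)
```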

- finditer(pattern, string, flags=0)

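The parameters and purpose are the same as for findall. The difference is that findall returns a list while finditer returns an iterator (see http://www.cnblogs.com/huxi/archive/2011/07/01/2095931.html ), and each value the iterator yields is not a string but an SRE_Match object (see section 4, on the built-in objects of re). See the match function for how to use this object.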

s = '''first line
second line
third line'''

regex = re.compile("\w+")
print regex.finditer(s)
# output> <callable-iterator object at 0x0000000001DF3B38>
for i in regex.finditer(s):
    print i
# output> <_sre.SRE_Match object at 0x0000000002B7A920>
#         <_sre.SRE_Match object at 0x0000000002B7A8B8>
#         <_sre.SRE_Match object at 0x0000000002B7A920>
#         <_sre.SRE_Match object at 0x0000000002B7A8B8>
#         <_sre.SRE_Match object at 0x0000000002B7A920>
#         <_sre.SRE_Match object at 0x0000000002B7A8B8>
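
Printing the match objects is not very informative, so here is a short sketch (Python 3) that pulls the matched text and its position out of each one with group() and span().

```python
import re

s = "first line\nsecond line\nthird line"
regex = re.compile(r"\w+")

# Each item yielded by finditer is a match object, not a string.
words_with_spans = [(m.group(), m.span()) for m in regex.finditer(s)]

for word, span in words_with_spans:
    print(word, span)
```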

- match(pattern, string, flags=0)

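Tries to match the given pattern against the string and returns only the first match, without searching further. Note that match starts at the beginning of the string: if the pattern does not match there, it does not look anywhere else. The return value is an SRE_Match object (see section 4, on the built-in objects of re), or None when there is no match.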

s = '''first line
second line
third line'''

# compile
regex = re.compile("\w+")
m = regex.match(s)
print m
# output> <_sre.SRE_Match object at 0x0000000002BCA8B8>
print m.group()
# output> first

# s starts with "f", but the pattern requires the match to start with i, so nothing is found
regex = re.compile("^i\w+")
print regex.match(s)
# output> None
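
Because match returns None on failure, calling .group() on the result directly can raise an AttributeError. A minimal sketch of the usual guard (Python 3; leading_word is a made-up helper name):

```python
import re

def leading_word(text):
    """Return the word at the very start of text, or None if there is none."""
    m = re.match(r"\w+", text)
    # match only tries at position 0, so m is None when text starts with a non-word character.
    return m.group() if m is not None else None

print(leading_word("first line"))
print(leading_word("  first line"))
```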

- purge()

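Whenever you use the re module, whether you call compile first or call a function such as findall directly, the module compiles the regular expression and stores the compiled result in a cache. The next time the same expression is used it does not have to be compiled again, which improves efficiency because compilation is fairly expensive. By default the cache holds 100 regular expressions. When you use a small number of expressions frequently, the cache helps; when you use a large number of different expressions, its benefit becomes much less noticeable (see "python re.compile对性能的影响", http://blog.trytofix.com/article/detail/13/). purge clears the compiled expressions from this cache, which may come in handy when you need to reduce memory usage.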
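
A minimal sketch of calling purge (Python 3). It only shows that the call takes no arguments, returns nothing, and leaves both the module-level functions and already compiled pattern objects usable afterwards; it does not measure memory.

```python
import re

# Both calls compile r"\w+"; the module keeps the compiled form in its cache.
regex = re.compile(r"\w+")
before = re.findall(r"\w+", "first line")

# Clear the module's internal cache of compiled expressions.
purge_result = re.purge()

# Everything still works; the pattern is simply compiled again when needed.
after = re.findall(r"\w+", "first line")
still_usable = regex.findall("second line")

print(purge_result, after, still_usable)
```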

- search(pattern, string, flags=0)

Similar to match, except that the match is not required to start at the beginning of the string.

s = '''first line
second line
third line'''

# the match has to start at the beginning, so nothing is found
print re.match(r'i\w+', s)
# output> None

# no restriction on where the match starts
print re.search(r'i\w+', s)
# output> <_sre.SRE_Match object at 0x0000000002C6A920>

print re.search('i\w+', s).group()
# output> irst
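
The same contrast can be shown with a word that only appears on the second line, together with start() to see where search found it (a sketch in Python 3).

```python
import re

s = "first line\nsecond line\nthird line"

# match only tries position 0, where the string reads "first".
at_start = re.match(r"second", s)

# search scans forward until it finds a match.
anywhere = re.search(r"second", s)
position = anywhere.start()

print(at_start)
print(anywhere.group(), position)
```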

- split(pattern, string, maxsplit=0, flags=0)

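maxsplit sets the maximum number of splits. The function uses the pattern to find the positions at which to split the string and returns a list of the resulting substrings; if the pattern does not match, it returns a list containing only the original string.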

s = '''first 111 line
second 222 line
third 333 line'''

# split on digits
print re.split(r'\d+', s)
# output> ['first ', ' line\nsecond ', ' line\nthird ', ' line']

# \.+ does not match, so the list contains only the original string
print re.split(r'\.+', s, 1)
# output> ['first 111 line\nsecond 222 line\nthird 333 line']

# the maxsplit parameter
print re.split(r'\d+', s, 1)
# output> ['first ', ' line\nsecond 222 line\nthird 333 line']
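
One more sketch (Python 3, with an invented input string) of a case where re.split is handier than str.split: several different delimiters handled by one pattern, with and without maxsplit.

```python
import re

# Hypothetical input mixing two delimiters and optional spaces.
fields = "a, b;c"

# Split on a comma or semicolon followed by optional whitespace.
all_parts = re.split(r"[,;]\s*", fields)

# maxsplit=1 stops after the first split and keeps the rest intact.
first_split_only = re.split(r"[,;]\s*", fields, maxsplit=1)

print(all_parts)
print(first_split_only)
```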

- sub(pattern, repl, string, count=0, flags=0)

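Replacement function: replaces the substrings matched by pattern with the string given by repl. The count parameter sets the maximum number of replacements (the default 0 means replace all matches).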

s = "the sum of 7 and 9 is [7+9]."

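# basic usage: replace the target with a fixed string
print re.sub(r'\[7\+9\]', '16', s)
# output> the sum of 7 and 9 is 16.

# advanced usage 1: reuse what the pattern captured; \1 is the content of the first capture group in pattern, \2 the second
print re.sub(r'\[(7)\+(9)\]', r'\2\1', s)
# output> the sum of 7 and 9 is 97.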


# advanced usage 2: pass a function as repl; it receives each SRE_Match object and returns the replacement
def replacement(m):
    p_str = m.group()
    if p_str == '7':
        return '77'
    if p_str == '9':
        return '99'
    return ''
print re.sub('\d', replacement, s)
# output> the sum of 77 and 99 is [77+99].


# advanced usage 3: pass a function as repl that evaluates the matched text inside a shared scope
scope = {}
example_string_1 = "the sum of 7 and 9 is [7+9]."
example_string_2 = "[name = 'Mr.Gumby']Hello,[name]"

def replacement(m):
    code = m.group(1)
    st = ''
    try:
        st = str(eval(code, scope))
    except SyntaxError:
        exec code in scope
    return st

# walkthrough: code='7+9'
#       str(eval(code, scope))='16'
print re.sub('\[(.+?)\]', replacement, example_string_1)
# output> the sum of 7 and 9 is 16.


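# two replacements happen here
# walkthrough 1: code="name = 'Mr.Gumby'"
#       eval(code, scope)
#       raises SyntaxError, because an assignment is not an expression
#       exec code in scope
#       assigns "Mr.Gumby" to the variable name in the namespace scope
#       the replacement is the empty string

# walkthrough 2: code="name"
#       eval(code, scope) returns the value of the variable name, Mr.Gumby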
print re.sub('\[(.+?)\]', replacement, example_string_2)
# output> Hello,Mr.Gumby
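
The snippet above relies on the Python 2 form "exec code in scope". As a sketch of how the same idea might look in Python 3, where exec is a function, assuming the same two example strings:

```python
import re

scope = {}  # shared namespace that survives between replacements

def replacement(m):
    code = m.group(1)
    try:
        # Expressions such as "7+9" or "name" are evaluated and substituted.
        return str(eval(code, scope))
    except SyntaxError:
        # Statements such as "name = 'Mr.Gumby'" are executed for their
        # side effect on scope and replaced with an empty string.
        exec(code, scope)
        return ""

result_1 = re.sub(r"\[(.+?)\]", replacement, "the sum of 7 and 9 is [7+9].")
result_2 = re.sub(r"\[(.+?)\]", replacement, "[name = 'Mr.Gumby']Hello,[name]")

print(result_1)
print(result_2)
```

Since eval and exec run whatever code the string contains, this trick only makes sense for text you trust.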

- subn(pattern, repl, string, count=0, flags=0)

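Works the same as sub; the only difference is that it returns a tuple whose first element is the string after replacement and whose second element is the number of replacements made.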
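
A short sketch of the returned tuple, reusing the string from the sub examples (Python 3; the "#" replacement is arbitrary).

```python
import re

s = "the sum of 7 and 9 is [7+9]."

# Replace every digit and also learn how many replacements happened.
new_s, n = re.subn(r"\d", "#", s)

# count limits the replacements, and the returned number reflects that.
limited_s, limited_n = re.subn(r"\d", "#", s, count=2)

print(new_s, n)
print(limited_s, limited_n)
```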

- template(pattern, flags=0)

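At first glance this one looks much like compile, except that it does not support metacharacters such as +, ?, * and {}: any metacharacter that expresses repetition is unsupported. I looked around, and it seems nobody really knows what this function is actually for...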

posted on 2022-05-18 18:04 司徒轩宇
