Regular Expressions

I. Syntax

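Regular expression syntax is largely independent of the programming language, although engines differ in some details. The notes below follow Python's re module.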

1. Anchors (position characters)

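^: matches at the start of the string (and right after each newline with re.MULTILINE)
$: matches at the end of the string or just before a trailing newline (and before each newline with re.MULTILINE)
|: alternation; 中|美国 matches "中" or "美国", while (中|美)国 matches "中国" or "美国"
\A: matches only at the start of the string
\Z: matches only at the end of the string
\b: matches the empty string at a word boundary, i.e. at the start or end of a word
\B: matches the empty string anywhere that is not a word boundary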
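
The differences between these anchors are easiest to see on a string that ends with a newline. The following is a minimal sketch, assuming Python 3's re module; the sample strings are made up for illustration.

```python
import re

text = "first line\nsecond line\n"

# Without re.MULTILINE, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# $ may match just before a trailing newline; \Z matches only at the true end.
dollar_match = re.search(r"line$", text)
z_match = re.search(r"line\Z", text)

# \b is a zero-width word boundary: "cat" as a whole word,
# not the "cat" inside "concatenate".
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(dollar_match is not None, z_match is None)  # True True
print(whole_words)       # ['cat', 'cat']
```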

2. Metacharacters

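\: escapes a special character so that it loses its special meaning. In Python, print("\\") shows \ and print(r"\\") shows \\, so to match a literal backslash the pattern must be written as "\\\\" in an ordinary string or as r"\\" in a raw string.
.: matches any character except a newline; with the DOTALL flag it matches every character, newlines included.
\d: a decimal digit; [0-9] with re.ASCII, otherwise any Unicode decimal digit
\D: any character that is not a digit; [^0-9] with re.ASCII
\s: a whitespace character; [ \t\n\r\f\v] with re.ASCII, otherwise other Unicode whitespace as well
\S: any character that is not whitespace; [^ \t\n\r\f\v] with re.ASCII
\w: a word character; [a-zA-Z0-9_] with re.ASCII, otherwise also Unicode letters and digits, such as Chinese characters
\W: any character that is not a word character; [^a-zA-Z0-9_] with re.ASCII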
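
To make the escaping and flag behaviour above concrete, here is a small sketch, assuming Python 3 and str (not bytes) patterns. The sample strings, including the full-width digits, are chosen only for illustration.

```python
import re

# A literal backslash needs the two-character regex \\, which is spelled
# "\\\\" as an ordinary string or r"\\" as a raw string.
assert "\\\\" == r"\\"
path = "C:\\temp"                      # the actual text is C:\temp
backslash = re.search(r"\\", path)
print(backslash.span())                # (2, 3)

# . skips newlines unless re.DOTALL is given.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", flags=re.DOTALL)
print(dot_default is None, dot_all is not None)  # True True

# For str patterns, \d and \w are Unicode-aware unless re.ASCII is set.
fullwidth = "１２３"                   # full-width digits, not ASCII
digits_unicode = re.fullmatch(r"\d+", fullwidth)
digits_ascii = re.fullmatch(r"\d+", fullwidth, flags=re.ASCII)
print(digits_unicode is not None, digits_ascii is None)  # True True

words_unicode = re.findall(r"\w+", "正则 regex_1")
words_ascii = re.findall(r"\w+", "正则 regex_1", flags=re.ASCII)
print(words_unicode)  # ['正则', 'regex_1']
print(words_ascii)    # ['regex_1']
```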

3. Character sets `[]`

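Set: [abc] matches "a", "b", or "c";
Range: [a-z], [0-9], [A-Z];
Special characters such as ( ) + * lose their special meaning inside a set, while character classes such as \W keep theirs.
Negation: a ^ as the first character negates the set, so [^^] matches any character except "^". To match a literal "^" inside a set, put it anywhere but first, as in [a^], or escape it, as in [\^].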
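
The set rules above can be checked quickly with re.findall. This is a minimal sketch with made-up inputs, assuming Python 3.

```python
import re

# Inside a set, ( ) + * are ordinary characters.
symbols = re.findall(r"[()+*]", "(1+2)*3")
print(symbols)       # ['(', '+', ')', '*']

# A leading ^ negates the set: everything except ASCII digits.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)    # ['a', 'b']

# A literal ^ must not come first, or must be escaped.
carets = re.findall(r"[a^]", "a^b")
escaped = re.findall(r"[\^]", "a^b")
print(carets)        # ['a', '^']
print(escaped)       # ['^']

# [^^] means "any character except ^".
not_caret = re.findall(r"[^^]", "a^b")
print(not_caret)     # ['a', 'b']

# Character classes such as \W keep their meaning inside a set.
class_in_set = re.findall(r"[\W_]", "a_b c")
print(class_in_set)  # ['_', ' ']
```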

4. Quantifiers

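*: zero or more times
+: one or more times
?: zero or one time
*?, +?, ??: *, + and ? are greedy and match as much text as possible; for example, <.*> matches the whole string <a>b<c>. Appending ? makes them lazy (minimal), so <.*?> matches only <a>.
{m}: exactly m times
{m,n}, {m,n}?: from m to n times; the former is greedy, the latter lazy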
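
A short sketch of greedy versus lazy matching, reusing the <a>b<c> string from the list above; the digit string is an assumption added for the {m,n} cases.

```python
import re

html = "<a>b<c>"
greedy = re.search(r"<.*>", html).group()
lazy = re.search(r"<.*?>", html).group()
all_tags = re.findall(r"<.*?>", html)
print(greedy)    # <a>b<c>
print(lazy)      # <a>
print(all_tags)  # ['<a>', '<c>']

digits = "1234567"
exact = re.search(r"\d{3}", digits).group()
greedy_range = re.search(r"\d{2,4}", digits).group()
lazy_range = re.search(r"\d{2,4}?", digits).group()
print(exact)         # 123
print(greedy_range)  # 1234
print(lazy_range)    # 12
```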

5. Groups

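(...): capturing group; the text matched inside it can be extracted afterwards and matched again later in the pattern with \number. To match a literal parenthesis such as (, use \( or [(].
(?...): extension notation; the first character after ? determines the meaning. Extensions usually do not create a new group; (?P<name>...) is the only exception.
(?aiLmsux): matches nothing itself; it sets flags for the whole pattern. Unlike passing flags to re.compile(), the flags are written inside the regular expression. Use one or more letters, which correspond in order to re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (. matches every character), re.U (Unicode matching), and re.X (verbose).
(?:...): non-capturing group; its content is not extracted
(?P<name>...): named capturing group
(?P=name): backreference to an earlier named group; it matches only the same text that the group matched
(?#...): comment; the content is ignored
(?=...) and (?!...): lookahead; succeeds only if the text that follows matches (does not match) the given pattern, without consuming it
(?<=...) and (?<!...): lookbehind; succeeds only if the text immediately before matches (does not match) the given pattern, without consuming it
(?(id/name)yes-pattern|no-pattern): conditional; tries yes-pattern if the group with the given id or name matched, otherwise no-pattern, which is optional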
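
A few of the less common constructs above are sketched here, assuming Python 3. The e-mail-like pattern is a toy used only to show the conditional construct, not a real address validator.

```python
import re

# findall returns the group content when the pattern has a capturing group...
captured = re.findall(r"(ab)+c", "ababc abc")
# ...and the whole match when the group is non-capturing.
uncaptured = re.findall(r"(?:ab)+c", "ababc abc")
print(captured)    # ['ab', 'ab']
print(uncaptured)  # ['ababc', 'abc']

# An inline flag at the start of the pattern behaves like flags=re.I.
inline = re.search(r"(?i)python", "I like PYTHON")
print(inline.group())  # PYTHON

# Named groups can be read back as a dict.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "2021-02")
print(named.groupdict())  # {'year': '2021', 'month': '02'}

# Conditional: require a closing > only if an opening < was captured.
email = re.compile(r"(<)?\w+@\w+(?:\.\w+)+(?(1)>|$)")
samples = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = [bool(email.match(s)) for s in samples]
print(results)  # [True, True, False, False]
```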

II. Common Python APIs

1. Searching

re.search(pattern, string, flags=0)  # returns the first match anywhere in the string, or None
re.findall(pattern, string, flags=0)  # returns a list
re.finditer(pattern, string, flags=0)  # returns an iterator of match objects
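
The three search functions differ mainly in what they return. A minimal sketch on a made-up string, assuming Python 3:

```python
import re

text = "a1 b22 c333"

# search: the first match only (or None if nothing matches).
first = re.search(r"\d+", text)
print(first.group(), first.span())  # 1 (1, 2)

# findall: a list of strings, or a list of tuples when the pattern
# has more than one capturing group.
numbers = re.findall(r"\d+", text)
pairs = re.findall(r"([a-z])(\d+)", text)
print(numbers)  # ['1', '22', '333']
print(pairs)    # [('a', '1'), ('b', '22'), ('c', '333')]

# finditer: match objects, so positions are available as well.
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(spans)    # [('1', (1, 2)), ('22', (4, 6)), ('333', (8, 11))]
```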

2. Matching and validation

re.match()  # matches only at the start of the string
re.fullmatch()  # the whole string must match
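
A small sketch contrasting match, fullmatch and search on the same input, assuming Python 3. The timestamp string and the variable names are illustrative.

```python
import re

text = "2021-02-23 19:36"
date = r"\d{4}-\d{2}-\d{2}"
clock = r"\d{2}:\d{2}"

# match anchors at the start but does not require the pattern
# to cover the whole string.
prefix = re.match(date, text)
print(prefix.group())  # 2021-02-23

# fullmatch requires the pattern to cover the entire string.
whole = re.fullmatch(date, text)
print(whole)           # None

# match fails if the pattern only occurs later; search still finds it.
late_match = re.match(clock, text)
late_search = re.search(clock, text)
print(late_match)            # None
print(late_search.group())   # 19:36
```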

3. Substitution

re.sub(pattern, repl, string, count=0, flags=0)
re.subn(pattern, repl, string, count=0, flags=0)
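
As a rough illustration of the two functions, the sketch below rewrites dates in a made-up string. The helper next_year is hypothetical and only shows that repl may also be a function.

```python
import re

text = "2021-02-23 and 2020-12-31"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Backreferences in the replacement reorder the captured parts.
swapped = re.sub(pattern, r"\3/\2/\1", text)
print(swapped)     # 23/02/2021 and 31/12/2020

# count limits the number of replacements.
first_only = re.sub(pattern, r"\3/\2/\1", text, count=1)
print(first_only)  # 23/02/2021 and 2020-12-31

# subn also reports how many replacements were made.
masked, n = re.subn(pattern, "<date>", text)
print(masked, n)   # <date> and <date> 2

# repl can be a function that receives each match object.
def next_year(m):
    return f"{int(m.group(1)) + 1}-{m.group(2)}-{m.group(3)}"

bumped = re.sub(pattern, next_year, text)
print(bumped)      # 2022-02-23 and 2021-12-31
```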

4. Splitting

re.split(pattern, string, maxsplit=0, flags=0)
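
A minimal sketch of re.split on made-up input, assuming Python 3. It shows that a capturing group in the pattern keeps the separators in the result.

```python
import re

text = "a, b;c  d"

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r"[,;\s]+", text)
print(parts)      # ['a', 'b', 'c', 'd']

# maxsplit stops after the given number of splits.
limited = re.split(r"[,;\s]+", text, maxsplit=1)
print(limited)    # ['a', 'b;c  d']

# With a capturing group, the separators are returned too.
with_seps = re.split(r"([,;])", "a,b;c")
print(with_seps)  # ['a', ',', 'b', ';', 'c']
```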

III. Examples

In Python, the r prefix creates a raw string, in which a backslash is not treated as an escape character.

import re

print('With and without the r prefix:')
print(r'\n')  # raw string: prints the two characters \ and n
print('\n')  # prints two newlines: one from the string, one added by print
print('='*50)
res = re.match(r'\d+', '3333')
print(res)
res = re.match('\\d+', '3333')  # \\ becomes a single \ before the pattern reaches re
print(res)
print('='*50)

Output:

With and without the r prefix:
\n


==================================================
<re.Match object; span=(0, 4), match='3333'>
<re.Match object; span=(0, 4), match='3333'>
==================================================

Capturing groups

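group() with no argument returns the text matched by the whole pattern, which is also group 0;
group() also accepts a group number or name and returns the content of that group;
groups() returns the content of all capturing groups as a tuple;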

# Sample text: '欧文打败了杜兰特' ("Irving beat Durant")
res = re.match(r'(欧文|杜兰特)打败了(哈登|杜兰特)', '欧文打败了杜兰特')
print(res)
# start(), end() and span() take the same group argument as group()
print('group():', res.group(), ', start():', res.start(), ', end():', res.end(), ', span():', res.span())
print('group(1):', res.group(1), ', start(1):', res.start(1), ', end(1):', res.end(1), ', span(1):', res.span(1))
print('group(2):', res.group(2), ', start(2):', res.start(2), ', end(2):', res.end(2), ', span(2):', res.span(2))
print('groups():', res.groups())
print('='*50)

Output:

<re.Match object; span=(0, 8), match='欧文打败了杜兰特'>
group(): 欧文打败了杜兰特 , start(): 0 , end(): 8 , span(): (0, 8)
group(1): 欧文 , start(1): 0 , end(1): 2 , span(1): (0, 2)
group(2): 杜兰特 , start(2): 5 , end(2): 8 , span(2): (5, 8)
groups(): ('欧文', '杜兰特')
==================================================

Next, an example with named capturing groups.

print('Reusing a group by name inside the pattern:')
res = re.search(r'(?P<player>欧文|杜兰特)打败了(?P=player)', '欧文打败了欧文')  # the repeated text must be identical
print('Method 1:', res)
res = re.search(r'(?P<player>欧文|杜兰特)打败了\1', '欧文打败了欧文')  # a named group can also be referenced by number
print('Method 2:', res)

print('Referencing named groups in a replacement:')
# Raw strings keep \g from being read as an invalid string escape
res = re.sub(r'(?P<winner>欧文|杜兰特)打败了(?P<loser>欧文|杜兰特)', r'winner is \g<winner>, loser is \g<loser>', '欧文打败了杜兰特')
print('Method 1:', res)
res = re.sub(r'(?P<winner>欧文|杜兰特)打败了(?P<loser>欧文|杜兰特)', r'winner is \g<1>, loser is \g<2>', '欧文打败了杜兰特')
print('Method 2:', res)
res = re.sub(r'(?P<winner>欧文|杜兰特)打败了(?P<loser>欧文|杜兰特)', 'winner is \\1, loser is \\2', '欧文打败了杜兰特')  # same as r'\1' and r'\2'
print('Method 3:', res)

Output:

Reusing a group by name inside the pattern:
Method 1: <re.Match object; span=(0, 7), match='欧文打败了欧文'>
Method 2: <re.Match object; span=(0, 7), match='欧文打败了欧文'>
Referencing named groups in a replacement:
Method 1: winner is 欧文, loser is 杜兰特
Method 2: winner is 欧文, loser is 杜兰特
Method 3: winner is 欧文, loser is 杜兰特

Next come the lookahead and lookbehind assertions.

# Lookahead assertion: digits followed by 岁 ("years old") or 周岁 ("full years old")
texts = ['15岁', '15周岁', '15年']
for text in texts:
    res = re.search(r'\d+(?=周?岁)', text)
    print(res.group() if res else res)  # the lookahead text is not part of the match

print('='*50)

# Negative lookahead assertion: a tumor test whose result is not 阴性 ("negative")
texts = ['检测肝瘤为阳性', '检测脑瘤为阴性']
for text in texts:
    res = re.search(r'检测(.+瘤)为(?!阴性)', text)
    print(res.group() if res else res)

Output:

15
15
None
==================================================
检测肝瘤为
None

And a lookbehind example:

# Lookbehind assertion: letters and digits immediately preceded by @
texts = ['@qq.com', '@163.com', 'google.com']
for text in texts:
    res = re.search(r'(?<=@)[a-zA-Z0-9]+', text)
    print(res.group() if res else res)  # the @ itself is not part of the match

Output:

qq
163
None

Posted 2021-02-23 19:36 by YoungF
