Regular Expressions

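Regular expressions provide the foundation for advanced text pattern matching, extraction, and/or text-based search and replace.
A regular expression (regex) is a string made up of ordinary characters and special characters. It describes a pattern that can match a whole set of strings sharing similar features. A regular expression that can match only one string is of little use.
Python supports regular expressions through the re module in the standard library.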

Your first regular expression

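The pattern foo matches only the string foo. The power of regular expressions comes from using special characters to define character sets, subgroups, and repetition, so that a pattern matches a set of strings rather than one single string.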

Special symbols and characters

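Notation	Description	Example regex
Symbols
`literal`	Matches the literal string value literal	`foo`
`re1|re2`	Matches regular expression re1 or re2
`.`	Matches any character (except \n)	`b.b`
`^`	Matches the start of the string	`^Dear`
`$`	Matches the end of the string	`/bin/sh$`
`*`	Matches 0 or more occurrences of the preceding regex	`[A-Za-z0-9]*`
`+`	Matches 1 or more occurrences of the preceding regex	`[A-Z]+`
`?`	Matches 0 or 1 occurrence of the preceding regex	`[A-Z]?`
`{N}`	Matches exactly N occurrences of the preceding regex	`[0-9]{N}`
`{N,M}`	Matches from N to M occurrences of the preceding regex
`[...]`	Matches any single character in the set	`[aeiou]`
`[x-y]`	Matches any single character in the range x to y	`[0-9]`
`[^...]`	Matches any single character not in the set	`[^aeiou]`
Special characters
`\d`	Matches any decimal digit, such as 0-9	`data\d+.txt`
`\w`	Matches any alphanumeric character or underscore
`\s`	Matches any whitespace character
Extension notations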

Matching more than one pattern with alternation (|)

Regex pattern	Strings matched
`at|home`	at, home
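As a quick check of the alternation pattern above, here is a minimal sketch. It uses `re.fullmatch()` so that only whole-string matches count; the sample word `work` is an illustrative non-match that does not come from the original table.

```python
import re

# Alternation: the pattern matches either "at" or "home".
pattern = re.compile(r'at|home')

# fullmatch() succeeds only if the entire string matches the pattern.
matches = [bool(pattern.fullmatch(s)) for s in ('at', 'home', 'work')]
print(matches)  # [True, True, False]
```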

Matching any single character (.) except the newline `\n`

Regex pattern	Strings matched
f.o	Any single character between f and o, e.g. fao, f9o
`\.`	A literal dot (.)
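The two rows above can be tried out with a short sketch. The extra sample strings (`f#o`, a string with an embedded newline, and `fo`) are assumptions added to show what the dot does and does not match.

```python
import re

# "." matches any single character except a newline, and it must match exactly one.
candidates = ['fao', 'f9o', 'f#o', 'f\no', 'fo']
dot_any = [s for s in candidates if re.fullmatch(r'f.o', s)]
print(dot_any)  # ['fao', 'f9o', 'f#o']

# An escaped dot matches only a literal period.
literal_dot = re.findall(r'\.', 'a.b.c')
print(literal_dot)  # ['.', '.']
```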

Matching from the start or end of a string, or at word boundaries

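Regex pattern	Strings matched
^From	Any string that starts with From
/bin/tcsh$	Any string that ends with /bin/tcsh
^Subject:hi$	Any string consisting solely of Subject:hi
the	Any string containing the
\bthe	Any word that starts with the
\bthe\b	Only the word the
\Bthe	Any string that contains the but not at the start of a word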
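To see how `\b` and `\B` differ in practice, here is a small sketch. The sample sentence is made up for illustration; `\w*` is added around the patterns only so that `findall()` returns whole words.

```python
import re

text = 'the theory bathe the'

# \bthe: "the" at the start of a word (the rest of the word is captured by \w*).
starts_with = re.findall(r'\bthe\w*', text)
print(starts_with)  # ['the', 'theory', 'the']

# \bthe\b: only the standalone word "the".
whole_word = re.findall(r'\bthe\b', text)
print(whole_word)  # ['the', 'the']

# \Bthe: "the" that is not at the start of a word.
inside = re.findall(r'\w*\Bthe', text)
print(inside)  # ['bathe']
```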

Creating character classes

Regex pattern	Strings matched
b[aeiu]t	bat, bet, bit, but
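A character class matches exactly one character from the set. The sketch below checks the pattern from the table against a few words; `bot` and `boat` are extra samples added here to show non-matches.

```python
import re

words = ['bat', 'bet', 'bit', 'but', 'bot', 'boat']

# [aeiu] stands for exactly one of a, e, i, u ("o" is not in the set).
matched = [w for w in words if re.fullmatch(r'b[aeiu]t', w)]
print(matched)  # ['bat', 'bet', 'bit', 'but']
```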

Denoting ranges and negation

Matching existence and repetition with closure operators
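The closure operators `*`, `+`, `?`, `{N}` and `{N,M}` control how many times the preceding expression may repeat. The following is only a sketch with made-up sample strings; it uses `re.fullmatch()` so each result depends on the whole string.

```python
import re

samples = ['', 'A', 'AB', 'ABC']

def full_matches(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full_matches(r'[A-Z]*')             # 0 or more
plus = full_matches(r'[A-Z]+')             # 1 or more
optional = full_matches(r'[A-Z]?')         # 0 or 1
exactly_two = full_matches(r'[A-Z]{2}')    # exactly 2
two_or_three = full_matches(r'[A-Z]{2,3}') # between 2 and 3

print(star)          # ['', 'A', 'AB', 'ABC']
print(plus)          # ['A', 'AB', 'ABC']
print(optional)      # ['', 'A']
print(exactly_two)   # ['AB']
print(two_or_three)  # ['AB', 'ABC']
```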

Special characters representing character sets

Using parentheses to designate groups

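Wrapping any regular expression in a pair of parentheses does two things:
- it groups the regular expression
- it matches a subgroup whose text can be retrieved later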

Regex pattern	Strings matched
`\d+(\.\d*)?`	Strings representing simple floating-point numbers
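Here is a brief sketch of how the optional group in this pattern behaves. The input strings are invented for illustration. When the optional group does not take part in the match, `group(1)` returns `None`.

```python
import re

float_re = re.compile(r'\d+(\.\d*)?')

with_fraction = float_re.match('3.14 is pi')
print(with_fraction.group())   # 3.14
print(with_fraction.group(1))  # .14

# The fractional part is optional, so a plain integer still matches.
integer_only = float_re.match('42 apples')
print(integer_only.group())    # 42
print(integer_only.group(1))   # None
```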

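Extension notations

Regular expressions and Python

The re module: core functions and methods

`compile(pattern, flags=0)` compiles a pattern and returns a regular expression object.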

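re module functions and regular expression object methods
- match(pattern, string, flags=0) tries to match the pattern at the start of the string; returns a match object on success, None on failure
- search(pattern, string, flags=0) scans the string for the first place where the pattern matches; returns a match object on success, None on failure
- findall(pattern, string[, flags]) returns a list of all non-overlapping matches
- finditer(pattern, string[, flags]) same as findall, but returns an iterator that yields a match object for each match
- split(pattern, string, maxsplit=0) splits the string into a list using the pattern as the delimiter, performing at most maxsplit splits (0 means no limit)
- sub(pattern, repl, string, count=0) replaces occurrences of the pattern with repl and returns the new string; count limits the number of replacements (0 means all)
Common match object methods
- group(num=0) returns the entire match, or the subgroup numbered num
- groups(default=None) returns a tuple containing all matched subgroups
- groupdict(default=None) returns a dictionary containing all matched named subgroups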
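The three match object methods listed above can be compared side by side. This is a small sketch; the group names `user` and `host` are made up for illustration.

```python
import re

# Two named subgroups: everything before "@" and the host name before ".com".
m = re.match(r'(?P<user>\w+)@(?P<host>\w+)\.com', 'yujun@qq.com')

print(m.group())      # yujun@qq.com
print(m.groups())     # ('yujun', 'qq')
print(m.groupdict())  # {'user': 'yujun', 'host': 'qq'}
```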

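Compiling regular expressions

- re.compile() compiles a pattern string into a regular expression object that can be reused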
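A compiled pattern exposes the same operations as methods, so the pattern does not have to be passed on every call. A minimal sketch, with sample strings chosen only for illustration:

```python
import re

# Compile once, then call the methods on the pattern object.
digits = re.compile(r'\d+')

at_start = digits.match('123abc').group()
anywhere = digits.search('abc456').group()
all_runs = digits.findall('a1b22c333')

print(at_start)  # 123
print(anywhere)  # 456
print(all_runs)  # ['1', '22', '333']
```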

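Match objects and the group() and groups() methods

Matching strings with match()

- match() and search() return a match object on success; the match object's group() method returns the text that matched.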

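import re
r = re.compile('foo')  # returns a regular expression object
m = re.match(r, 'foo')  # returns a match object
if m is not None:
    print(m.group())  # foo
m = re.match(r, 'food on the table foo')  # match() only looks at the start of the string
if m is not None:
    print(m.group())  # foo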

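Finding a pattern within a string with search() (searching versus matching)

search() finds the pattern even when it appears in the middle of a string, whereas match() only matches at the start.

m = re.match('foo', 'seafood')  # no match: m is None
m = re.search('foo', 'seafood')  # match found: m.group() is 'foo'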

Repetition, special characters, and grouping

Simple regular expressions for email addresses
\w+@\w+\.com ---> yujun@qq.com
\w+@(\w+\.)?\w+\.com ---> yujun@qq.com, yujun@www.qq.com
\w+@(\w+\.)*\w+\.com ---> nobody@www.xxx.yyy.zzz.com
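The sketch below runs each of the three patterns against all three sample addresses, to show what `?` and `*` add after the group. The dictionary labels are descriptive names chosen here, not terms from the original notes.

```python
import re

patterns = {
    'single host': r'\w+@\w+\.com',
    'optional subdomain': r'\w+@(\w+\.)?\w+\.com',
    'any number of subdomains': r'\w+@(\w+\.)*\w+\.com',
}
addresses = ['yujun@qq.com', 'yujun@www.qq.com', 'nobody@www.xxx.yyy.zzz.com']

# For each pattern, keep the addresses it matches in full.
results = {
    label: [a for a in addresses if re.fullmatch(p, a)]
    for label, p in patterns.items()
}
for label, matched in results.items():
    print(label, matched)
# single host ['yujun@qq.com']
# optional subdomain ['yujun@qq.com', 'yujun@www.qq.com']
# any number of subdomains ['yujun@qq.com', 'yujun@www.qq.com', 'nobody@www.xxx.yyy.zzz.com']
```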
Groups make it possible to extract the matched content:

m = re.match('(a)(b)', 'ab')
print(m.group())   # ab
print(m.group(1))  # a
print(m.group(2))  # b
print(m.groups())  # ('a', 'b')

m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')
print(m.group())   # abc-123
print(m.group(1))  # subgroup 1: abc
print(m.group(2))  # subgroup 2: 123
print(m.groups())  # ('abc', '123')

re module functions: findall(), sub(), subn(), split()

Finding every occurrence with findall() and finditer()

re.findall('car', 'car')  # ['car']
re.findall('car', 'scary')  # ['car']
re.findall('car', 'carry the barcard to the car')  # ['car', 'car', 'car']
s = 'That and this'
print(re.findall(r'(th\w+)', s, re.I))  # ['That', 'this']
it = re.finditer(r'(th\w+)', s, re.I)
g = next(it)
print(g.groups())  # ('That',)
g = next(it)
print(g.groups())  # ('this',)
print(g.group(1))  # this

Searching and replacing with sub() and subn()

print(re.sub('[ae]', 'X', 'abcedf'))   # XbcXdf
print(re.subn('[ae]', 'X', 'abcedf'))  # ('XbcXdf', 2)
# \N in the replacement string refers to group N: MM/DD/YY --> DD/MM/YY
print(re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})', r'\2/\1/\3', '2/20/91'))  # 20/2/91

Splitting strings on a delimiter pattern with split()

print(re.split(':', 'str1:str2:str3'))  # ['str1', 'str2', 'str3']
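With a single fixed delimiter, `str.split()` would do the same job. `re.split()` becomes useful when the delimiter is itself a pattern. A small sketch with made-up input:

```python
import re

# Split on a colon, comma, or semicolon, followed by any amount of whitespace.
parts = re.split(r'[:,;]\s*', 'str1: str2,str3;  str4')
print(parts)  # ['str1', 'str2', 'str3', 'str4']

# maxsplit limits the number of splits; the rest of the string stays intact.
head_tail = re.split(':', 'str1:str2:str3', maxsplit=1)
print(head_tail)  # ['str1', 'str2:str3']
```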

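Extension notations

Getting started with regular expressions

What regular expressions are for

- Checking whether a string conforms to the logic of a regular expression
- Extracting the specific parts we need from a given string by means of a regular expression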

How do we check whether a string is a mobile phone number?

# Pattern for a mobile number: 13 followed by nine digits
print(re.match(r'13[\d]{9}', '13362158971').group())  # 13362158971
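One caveat: `match()` only anchors at the start, so a longer string that merely begins with a valid number would also pass. A possible stricter check, sketched here with a hypothetical helper named `is_mobile`, uses `re.fullmatch()`; it assumes the same simplified rule (13 plus nine digits) as above.

```python
import re

def is_mobile(number):
    """Return True if number is exactly '13' followed by nine digits (simplified rule)."""
    return re.fullmatch(r'13\d{9}', number) is not None

print(is_mobile('13362158971'))   # True
print(is_mobile('133621589712'))  # False: one digit too many
print(is_mobile('1336215897'))    # False: one digit too few
```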

The match method

match_object = re.match('regular expression', 'string to match')
match_object.group()  # returns the part of the string that matched

Matching single characters

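Character	Function	Example
.	Matches any single character except \n	`print(re.match('.','你好').group()) #你`
[]	Matches any one of the characters listed inside []
\d	Matches a digit
\D	Matches a non-digit
\s	Matches a whitespace character (space, tab, newline, etc.)
\S	Matches a non-whitespace character
\w	Matches a word character (letter, digit, or underscore)
\W	Matches a non-word character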
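Each lowercase class has an uppercase counterpart that matches the complement. The sketch below applies all six to one short sample string, which is made up for illustration.

```python
import re

text = 'a1 _!'

# Collect every single character that each class matches.
classes = {p: re.findall(p, text) for p in (r'\d', r'\D', r'\s', r'\S', r'\w', r'\W')}
for pattern, found in classes.items():
    print(pattern, found)
# \d ['1']
# \D ['a', ' ', '_', '!']
# \s [' ']
# \S ['a', '1', '_', '!']
# \w ['a', '1', '_']
# \W [' ', '!']
```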

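Expressing quantity

Character	Function
*	Matches 0 or more occurrences of the preceding character
+	Matches 1 or more occurrences of the preceding character
?	Matches 0 or 1 occurrence of the preceding character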

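s = '\\nabc'  # the five characters \nabc (a literal backslash, not a newline)
print(re.match('\\\\nabc', s))  # matches; without a raw string, each literal backslash needs four backslashes
print(re.match(r'\\nabc', s))   # same pattern as a raw string; prints a match object with span=(0, 5), match='\\nabc'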

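Boundaries

Character	Function
$	Matches the end of the string, e.g. re.match(r'1[3]\d{9}$','13888888888')
^	Matches the start of the string; has no visible effect with match(), which already anchors at the start, but matters with findall() and search()
\b	Matches a word boundary
\B	Matches a position that is not a word boundary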
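The anchors are easiest to observe with `search()` and `findall()`, which scan the whole string. A short sketch with sample strings chosen for illustration:

```python
import re

# search() scans the whole string, so ^ makes a visible difference.
unanchored = re.search(r'foo', 'seafood') is not None
anchored = re.search(r'^foo', 'seafood') is not None
print(unanchored, anchored)  # True False

# findall() with ^ returns only the match at the very start.
first_word = re.findall(r'^\w+', 'first second')
print(first_word)  # ['first']

# $ rejects trailing characters that match() alone would ignore.
number = '13888888888' + '9999'
prefix_ok = re.match(r'13\d{9}', number) is not None
whole_ok = re.match(r'13\d{9}$', number) is not None
print(prefix_ok, whole_ok)  # True False
```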

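Groups

Character	Function
(ab)	Treats the characters inside the parentheses as a group
\num	Refers back to the string matched by group number num
(?P<name>)	Gives the group an alias (name)
(?P=name)	Refers back to the string matched by the group named name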

result = re.match(r'(<h1>)(.*)(</h1>)', '<h1>匹配数组</h1>')
print(result.groups())  # ('<h1>', '匹配数组', '</h1>')
print(result.group(0), result.group(1), result.group(2), result.group(3))
# \num refers back to a numbered group
s = '<html><h1>itcast</h1></html>'
print(re.match(r'<(.+)><(.+)>.+</\2></\1>', s))
# (?P<name>...) defines a named group; (?P=name) refers back to it
print(re.match(r'<(?P<k1>.+)><(?P<k2>.+)>.+</(?P=k2)></(?P=k1)>', s))
# Matching an email address
p = r'(\w+)@(163|126|gmail|qq)\.(com|cn|net)$'
r = re.match(p, 'itcast@qq.com')
print(r.group())  # itcast@qq.com

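Advanced usage of the re module

re.match()
re.search()
re.findall()
re.sub() for bulk replacement

Greedy mode

By default, regular expression quantifiers are greedy: they match as much text as possible.
Adding ? after *, ?, + or {m,n} makes the quantifier non-greedy, so it matches as little as possible.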

s = 'This is a number 234-235-22'
r = re.match(r'.+(\d+-\d+-\d+)', s)
print(r.group(1))  # 4-235-22 (the greedy .+ leaves only one digit for the first \d+)
r = re.match(r'.+?(\d+-\d+-\d+)', s)
print(r.group(1))  # 234-235-22
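The same effect shows up when matching markup, which is a common place to trip over greediness. A minimal sketch, using an invented HTML snippet:

```python
import re

html = '<h1>title</h1>'

# Greedy: .+ runs to the last ">" in the string.
greedy = re.match(r'<.+>', html).group()
print(greedy)  # <h1>title</h1>

# Non-greedy: .+? stops at the first ">".
lazy = re.match(r'<.+?>', html).group()
print(lazy)  # <h1>
```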

posted @ 2020-02-10 12:52 salary_01
