Python Basics - Regular Expressions

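A regular expression describes a rule for matching strings. The idea is not tied to any particular programming language, and it is very powerful. In Python, the built-in re module provides strong support for regular expressions.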

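Basic regular expression syntax

. any single character except \n
* the preceding subpattern zero or more times
+ the preceding subpattern one or more times
- a range inside [ ], e.g. [0-9]
| either the pattern before it or the pattern after it
^ the pattern after it must be at the start of the string
$ the pattern before it must be at the end of the string
? the preceding subpattern zero or one time; after a quantifier, makes it non-greedy
\ escape character
\num backreference to the subpattern numbered num
\f form feed
\n newline
\r carriage return (back to the start of the line)
\b word boundary (start or end of a word)
\B any position that is not a word boundary
\d a digit, [0-9]
\D a non-digit, [^0-9]
\s a whitespace character: space, \t, \n, \r, \f, \v
\S a non-whitespace character
\w a letter, digit, or underscore; for str patterns in Python 3 this also includes Chinese characters
\W any character not matched by \w
() treats the enclosed content as a single unit
{m} the preceding subpattern exactly m times
{m,n} the preceding subpattern m to n times (inclusive)
[ ] any single character listed inside
[^x] any single character not listed inside
[a-zA-Z0-9_] any letter, digit, or underscore
[^a-z] any single character other than a lowercase letter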
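
A few of the symbols above can be tried out with re.findall and re.search. This is only a small sketch; the sample string and variable names are made up for illustration.

```python
import re

# Made-up sample text for illustration
sample = "cat bat rat 2024 a_b c-d"

# [cbr]at: one character from the set, then the literal "at"
three_letter = re.findall(r"[cbr]at", sample)

# \d{4}: exactly four digits in a row
year = re.findall(r"\d{4}", sample)

# \w+ stops at "-" because "-" is not a word character
words = re.findall(r"\w+", sample)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_cat = re.search(r"^cat", sample) is not None
ends_with_digit = re.search(r"\d$", sample) is not None

print(three_letter)   # ['cat', 'bat', 'rat']
print(year)           # ['2024']
print(words)          # ['cat', 'bat', 'rat', '2024', 'a_b', 'c', 'd']
print(starts_with_cat, ends_with_digit)   # True False
```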

Extended syntax

( ) defines a subpattern; its content is treated as a single unit

import re
re.match(r'(cj){3}', 'cjcjcjcjcjcjxxx').group()  # always write patterns as raw strings r''

'cjcjcj'

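(?P<groupname>...) gives the subpattern a name
(?#...) a comment
(?:...) matches the expression but does not capture it
(?<=...) lookbehind: matches only if preceded by ..., which is not returned as part of the match
(?=...) lookahead: matches only if followed by ..., which is not returned as part of the match
(?<!...) negative lookbehind: matches only if not preceded by ...
(?!...) negative lookahead: matches only if not followed by ...
These are used less often and can be skipped for now.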
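
Even if these are rarely needed, a short sketch may make the list easier to remember. The sample strings below are invented for illustration; only standard re features are used.

```python
import re

# (?P<name>...) names a group; (?:...) groups without capturing
m = re.match(r"(?P<user>\w+)@(?P<domain>(?:\w+\.)+\w+)", "cjsd_marketing@163.com")
user = m.group("user")
domain = m.group("domain")

# (?<=...) lookbehind: digits directly after "$", without returning the "$"
prices = re.findall(r"(?<=\$)\d+", "apple $30, pear 45, plum $7")

# (?=...) lookahead: names directly followed by "(", without returning the "("
calls = re.findall(r"\w+(?=\()", "print(len(x)) + y")

# (?!...) negative lookahead: words starting with "foo" not followed by "bar"
not_foobar = re.findall(r"foo(?!bar)\w*", "foobar foobaz food")

print(user, domain)   # cjsd_marketing 163.com
print(prices)         # ['30', '7']
print(calls)          # ['print', 'len']
print(not_foobar)     # ['foobaz', 'food']
```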

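A collection of regular expressions

abcde matches abcde
[cj]python matches cpython and jpython
[a-zA-Z0-9_] an upper- or lowercase letter, a digit, or an underscore
python|java matches python or java
[^abc] any single character other than a, b, c
r'(http://)?(www\.)?python\.org' matches python.org, www.python.org, http://python.org, http://www.python.org
^(http) starts with http
(\.com)$ ends with .com
(pattern)* zero or more times
(pattern)+ one or more times
(pattern)? zero or one time
(pattern){m} exactly m times
(pattern){m,n} m to n times, inclusive
(a|b)*c zero or more a or b, followed immediately by the letter c
ab{1,} equivalent to ab+
^[a-zA-Z]{1}([a-zA-Z0-9._]){4,19}$ a string of length 5 to 20 that starts with a letter, followed by letters, digits, underscores, or dots
^(\w){6,20}$ a string of length 6 to 20 made of letters, digits, underscores (and Chinese characters in Python 3)
^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$ checks that a string has the shape of an IP address (it does not check that each part is at most 255)
(13[4-9]\d{8})|(15[01289]\d{8}) China Mobile phone numbers starting with 134-139 or 150, 151, 152, 158, 159
\w+@\w+(\.\w+)+ an email-shaped address such as cjsd_marketing@163.com
(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[,.]).{8,} strong password check: must contain lowercase letters, uppercase letters, digits, and a special character (here , or .), with a length of at least 8
(?!.*['";=%?]).+ fails to match if the string contains any of ' " ; = % ?
(.)\1+ any character followed by one or more repeats of itself
More to come...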
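
Two of the patterns above can be wrapped into small validators. This is a sketch under assumptions: the function names looks_like_ipv4 and is_strong_password are made up here, and re.fullmatch is used so the whole string has to match.

```python
import re

# Hypothetical helper names; neither function appears in the original post.
IP_FORM = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
STRONG_PASSWORD = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[,.]).{8,}")


def looks_like_ipv4(text):
    # The regex checks only the shape; the 0-255 range check is done in Python.
    if IP_FORM.fullmatch(text) is None:
        return False
    return all(int(part) <= 255 for part in text.split("."))


def is_strong_password(text):
    # Each lookahead demands one character class; .{8,} demands the length.
    return STRONG_PASSWORD.fullmatch(text) is not None


print(looks_like_ipv4("192.168.1.1"))    # True
print(looks_like_ipv4("999.1.1.1"))      # False
print(is_strong_password("Abcdef1,"))    # True
print(is_strong_password("abcdefgh"))    # False
```

Doing the range check outside the regex keeps the pattern readable; a pure-regex range check is possible but much longer.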

re.match('\w+@(\w+\.)+\w+ ', 'cjsd_marketing@163.com').group()

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-14-4b9af62021f0> in <module>
----> 1 re.match('\w+@(\w+\.)+\w+ ', 'cjsd_marketing@163.com').group()


AttributeError: 'NoneType' object has no attribute 'group'

re.match('\w+@\w+(\.\w+)+', 'cjsd_marketing@163.com').group()

'cjsd_marketing@163.com'
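
The AttributeError above appears because the first pattern ends with a space, so re.match returns None and None has no group(). One possible way to guard against this is sketched below; first_match is a made-up helper name, not part of re.

```python
import re


def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if nothing matches."""
    m = re.match(pattern, text)
    return m.group() if m else None


# Trailing space in the pattern: no match, but no exception either
failed = first_match(r'\w+@(\w+\.)+\w+ ', 'cjsd_marketing@163.com')

# Same address with the working pattern
ok = first_match(r'\w+@\w+(\.\w+)+', 'cjsd_marketing@163.com')

print(failed)   # None
print(ok)       # cjsd_marketing@163.com
```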

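The re module

match(pattern, string) matches from the start of the string and returns a match object or None; call group() on the match object to see the matched text

compile(pattern[, flags]) creates a pattern object

search(pattern, string) scans from left to right and returns a match object for the first match, otherwise None
findall(pattern, string) returns all matches as a list
sub(pat, repl, string[, count=0]) replaces the substrings of string matched by pat with repl (a string or a function); the default count=0 replaces every match
split(pattern, string[, maxsplit=0]) splits string at each match of pattern and returns a list
finditer(pattern, string) returns an iterator over the match objects of all matches
purge() clears the regular expression cache
escape(string) escapes the special regex characters in string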
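
To see how these functions differ, here is a small sketch that runs one compiled pattern through search, match, findall, and finditer. The sample text is made up.

```python
import re

pattern = re.compile(r"\d+")          # compile once, reuse many times
text = "a1bb22ccc333"                 # made-up sample

first = pattern.search(text)          # first match anywhere in the string
at_start = pattern.match(text)        # None: the string does not start with a digit
all_numbers = pattern.findall(text)   # list of matched strings

# finditer yields match objects, so positions are available as well
spans = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]

print(first.group())   # 1
print(at_start)        # None
print(all_numbers)     # ['1', '22', '333']
print(spans)           # [('1', 1, 2), ('22', 4, 6), ('333', 9, 12)]
```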

import re 
re.findall('\d', 'sfs8sdfjsd8sdfjsd90dfs8')

['8', '8', '9', '0', '8']

text = 'alpha.beta....gama delta'

re.split('[\.]+', text )  # split the string on the pattern (one or more dots)

['alpha', 'beta', 'gama delta']

re.split('[\.]+', text, maxsplit=1)  # split at most once

['alpha', 'beta....gama delta']

re.findall('[a-zA-Z]+', text)  # find all words

['alpha', 'beta', 'gama', 'delta']

re.sub('{name}', 'chenjie', 'Dear {name}')  # replace the matched text in the string with chenjie

'Dear chenjie'

re.sub('a|s|d', 'good', 'as')

'goodgood'

s = 'it is a good good good idea idea'
re.sub(r'\b(\w+)( \1\b)+', r'\1', s)  # collapse consecutive repeated words into one

'it is a good idea'

re.sub('a', lambda x: x.group(0).upper(), 'aaa, aab, abcdas')   # turn every lowercase a into uppercase A

'AAA, AAb, AbcdAs'

re.sub('[a-zA-Z]', lambda x: chr(ord(x.group(0))^32), 'aaa Bbc agbDs')  # swap the case of English letters

'AAA bBC AGBdS'

re.subn('a', 'chenjie', 'aasksdf afk jfej ak fsd ')   # returns the new string and the number of replacements

('chenjiechenjiesksdf chenjiefk jfej chenjiek fsd ', 4)

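re.escape('http://www.python.org')  # escape special characters (the output below is from Python 3.6 or earlier; 3.7+ leaves : and / unescaped)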

'http\\:\\/\\/www\\.python\\.org'

re.match('yes|no', 'yesnofsfsdfdsf')  # matches from the start of the string; returns a match object on success

<_sre.SRE_Match object; span=(0, 3), match='yes'>

re.match('\w+@\w+(\.\w+)+','cjsd_marketing@163.com').group()  # match() succeeds and group() returns the matched text

'cjsd_marketing@163.com'

1. Remove extra spaces: collapse consecutive spaces into one, and strip all whitespace from both ends of the string

s = 'aaa     bbb,    cd, fff   fs    ,  '

' '.join(s.split())   # split on whitespace, then join; clearly not enough, since there are commas as well as spaces

'aaa bbb, cd, fff fs ,'

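' '.join(re.split(',|\s+', s.strip()))  # re.split(',|\s+', s.strip()) splits on a comma or on whitespace
# a comma next to spaces produces empty strings, so the join leaves doubled spaces

'aaa bbb  cd  fff fs  '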

re.sub(',|\s+', ' ', s.strip())    # replace each comma or run of whitespace with a space directly

'aaa bbb  cd  fff fs  '
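
The attempts above still leave doubled spaces wherever a comma sits next to spaces. One possible fix, sketched here as a suggestion rather than the original author's method, is to treat any run of commas and whitespace as a single separator and strip afterwards. Like the re.sub call above, it drops the commas.

```python
import re

s = 'aaa     bbb,    cd, fff   fs    ,  '

# [,\s]+ matches a whole run of commas and whitespace at once,
# so each run becomes exactly one space
cleaned = re.sub(r'[,\s]+', ' ', s).strip()

print(repr(cleaned))   # 'aaa bbb cd fff fs'
```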

2. Delete specified content from a string

email = 'cjsd_marketing@163.com'  # we want to delete _marketing

m = re.search('_marketing', email)  # locate _marketing
email[:m.start()] + email[m.end(): ]   # the old way: string slicing and concatenation

'cjsd@163.com'

re.sub('_marketing', '', email)  # substitute directly: every _marketing found is replaced with an empty string

'cjsd@163.com'

re.sub('a', 'b', 'aa,aa,gdsdaa')

'bb,bb,gdsdbb'

Summary: sub() is extremely powerful; it works just like find-and-replace in Word.

3. Searching for specific text

text = 'Beautiful is better than ugly.'

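re.findall('\\bb.*?\\b', text)  # whole words starting with the letter b, wrapped as \bxxx\b; without a raw string each backslash must be doubled: \\b xxx \\b

['better']

re.findall(r'\bb.*?\b', text)  # from now on always add r; wrapped as \bxx\b, non-greedy here

['better']

re.findall(r'\bb.*\b', text)  # wrapped as \b xxx \b, greedy mode

['better than ugly']

re.findall(r'\Bh.+?\b', text)  # the rest of a word, starting from an h that is not the first letter of the word

['han']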

['han']

import re
re.findall(r'\b\w+?\b', text)  # all words

['Beautiful', 'is', 'better', 'than', 'ugly']

re.findall(r'\w+', text)

['Beautiful', 'is', 'better', 'than', 'ugly']

re.match('^B.*l$', text) # ^ and $ match the start and end of the whole string, not of the words inside it, so this returns None

re.split('\s', text)  #  split on any whitespace character

['Beautiful', 'is', 'better', 'than', 'ugly.']

re.findall(r'\d+\.\d+\.\d+', 'Python 3.6.1')  # numbers of the form x.x.x

['3.6.1']

Match objects

m = re.match(r'(\w+) (\w+)', 'chen jie will be a great man')  # the space in between is matched as well

m.group(0)  # the whole match

'chen jie'

m.group(1)  # the first subpattern

'chen'

m.group(1,2)

('chen', 'jie')
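
Besides group(), a match object also offers groups(), groupdict(), span(), start(), and end(). The sketch below reuses the same sentence; the group names first and last are chosen here for illustration.

```python
import re

# Same sentence as above, with named groups added for illustration
m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'chen jie will be a great man')

all_groups = m.groups()        # every subpattern as a tuple
by_name = m.groupdict()        # named subpatterns as a dict
whole_span = m.span()          # (start, end) of the whole match
last_span = (m.start(2), m.end(2))   # position of the second subpattern

print(all_groups)   # ('chen', 'jie')
print(by_name)      # {'first': 'chen', 'last': 'jie'}
print(whole_span)   # (0, 8)
print(last_span)    # (5, 8)
```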

Examples

Extract a phone number from a string

import re 

tel_number = '''Suppose my Phone No. is 0606-1234666, 
                Yours number is 010-123456,
                his number is 025-8799342.'''

pattern = re.compile(r'(\d{3,4})-(\d{7,8})')   # no space is allowed after the comma in {m,n}

result = pattern.search(tel_number)   # search() returns only the first match
if not result:
    print('no match')
else:
    print('=='*20)
    print('success', result.group())

========================================
success 0606-1234666
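
search() stops at the first match. To collect every number in the text, finditer can be used instead; a minimal sketch with the same pattern and text follows. Note that 010-123456 is skipped because it has only six digits after the hyphen.

```python
import re

tel_number = '''Suppose my Phone No. is 0606-1234666,
                Yours number is 010-123456,
                his number is 025-8799342.'''

pattern = re.compile(r'(\d{3,4})-(\d{7,8})')

# Each match object exposes the area code as group 1 and the number as group 2
numbers = [(m.group(1), m.group(2)) for m in pattern.finditer(tel_number)]

print(numbers)   # [('0606', '1234666'), ('025', '8799342')]
```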

Use a regular expression to batch-check whether web page files contain an iframe (inline frame)

import os
import re


def detect_iframe(file):

    content = []  # list holding the lines of the web page
    with open(file, encoding='utf-8') as f:
        # read every line, strip whitespace from both ends, and add it to the list
        for line in f:
            content.append(line.strip())

    # join all lines into a single string
    content = ''.join(content)
    # regex: non-greedy, so each iframe is matched separately
    result = re.findall(r'<iframe\s+src=.*?></iframe>', content)
    if result:
        return {file: result}
    return False


for file in (f for f in os.listdir('.') if f.endswith(('.html', '.htm'))):   # iterate over a generator of file names
    result = detect_iframe(file)
    if not result:
        continue
    # print the check results
    for k, v in result.items():
        print(k)
        for vv in v:
            print('\t', vv)
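
The iframe pattern can also be tried on an in-memory string without touching any files. The HTML snippet and the example.com URL below are invented for illustration.

```python
import re

# Made-up HTML snippet containing two iframes
html = ('<html><body><iframe src="http://example.com/a"></iframe>'
        '<p>text</p><iframe  src="b.html" width="0"></iframe></body></html>')

# .*? is non-greedy, so each iframe is returned as its own match
frames = re.findall(r'<iframe\s+src=.*?></iframe>', html)

for frame in frames:
    print(frame)
```

With a greedy .* the two iframes and everything between them would come back as a single match.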


posted @ 2020-02-20 22:44 致于数据科学家的小陈
