Regular Expressions

1. Common matching rules

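Pattern	Description
\w	Matches a letter, digit, or underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character
\S	Matches any non-whitespace character
\d	Matches any digit; for ASCII text, equivalent to [0-9]
\D	Matches any non-digit character
\A	Matches the start of the string
\Z	Matches the end of the string; in Perl-style engines it also matches just before a trailing newline, whereas Python's re matches only at the very end
\z	Matches the very end of the string, after any trailing newline (Perl-style syntax; older versions of Python's re do not accept it)
\G	Matches at the position where the previous match ended (Perl-style syntax; not supported by Python's re)
\n	Matches a newline
\t	Matches a tab
^	Matches the start of the string (with re.M, the start of each line)
$	Matches the end of the string or just before a trailing newline (with re.M, the end of each line)
.	Matches any character except a newline
[...]	Matches any one character from the set; for example, [ink] matches i, n, or k
[^...]	Matches any one character not in the set
*	Matches zero or more repetitions of the preceding expression
+	Matches one or more repetitions of the preceding expression
?	Matches zero or one occurrence of the preceding expression; placed after another quantifier (as in *? or +?), it makes that quantifier non-greedy
{n}	Matches exactly n repetitions of the preceding expression
{n,m}	Matches n to m repetitions of the preceding expression, greedily
a|b	Matches either a or b
()	Matches the expression inside the parentheses and captures it as a group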
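
A few of the rules in the table can be checked directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not part of the original article.

```python
import re

# {n}: exactly n repetitions; {n,m}: between n and m repetitions (greedy)
is_year = re.fullmatch(r'\d{4}', '2021') is not None
runs = re.findall(r'\d{2,3}', '1 22 333 4444')

# [...] matches one character from the set; [^...] one character outside it
in_set = re.findall(r'[ink]', 'linking')
not_in_set = re.findall(r'[^ink]', 'linking')

# a|b matches either alternative
pets = re.findall(r'cat|dog', 'cat, dog, bird')

print(is_year)      # True
print(runs)         # ['22', '333', '444']
print(in_set)       # ['i', 'n', 'k', 'i', 'n']
print(not_in_set)   # ['l', 'g']
print(pets)         # ['cat', 'dog']
```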

2. match

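The match method tries to match the regular expression starting at the beginning of the string. If the match succeeds, it returns a match object; otherwise it returns None.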

import re

content = 'Hello 123 456 789 world_this is a demo'
result = re.match(r'^Hello\s\d{3}\s\d{3}\s\d{3}', content)
print(result)
print(result.group())

---------------------
Output:
<re.Match object; span=(0, 17), match='Hello 123 456 789'>
Hello 123 456 789

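The first argument to match is the regular expression, and the second is the string to match against.
On the resulting match object, group() returns the matched text and span() returns the (start, end) range of the match.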
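
The example above prints the match object but never calls span() itself. As a small sketch (the shortened pattern and the variable names are chosen here for illustration), this shows group(), span(), and the None returned when the start of the string does not match:

```python
import re

content = 'Hello 123 456 789 world_this is a demo'

result = re.match(r'Hello\s\d{3}', content)
matched_text = result.group()   # the matched substring
matched_span = result.span()    # (start, end) indices of the match

# match only tries the start of the string, so this returns None
no_match = re.match(r'\d{3}', content)

print(matched_text)   # Hello 123
print(matched_span)   # (0, 9)
print(no_match)       # None
```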

2.1 Extracting target substrings

match tells you whether the string matches. To also extract part of the string, wrap the part you want in parentheses.

import re

content = 'Hello 123456789 world_this is a demo'
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)
print(result)
print(result.group())
print(result.group(1))
print(result.group(2))


----------------------------------
Output:
<re.Match object; span=(0, 26), match='Hello 123456789 world_this'>
Hello 123456789 world_this
123456789
world_this

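group() returns the entire match, and group(1) returns the text captured by the first pair of parentheses. If the pattern contains more parenthesized groups, retrieve them with group(2), group(3), and so on.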
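
Beyond numbered access, the match object can return all groups at once, and groups can be given names. The sketch below reuses the string from this section; the group names number and word are assumptions introduced here for illustration.

```python
import re

content = 'Hello 123456789 world_this is a demo'
result = re.match(r'^Hello\s(?P<number>\d+)\s(?P<word>\w+)', content)

all_groups = result.groups()        # every captured group as a tuple
number = result.group('number')     # same as result.group(1)
word = result.group('word')         # same as result.group(2)

print(all_groups)   # ('123456789', 'world_this')
print(number)       # 123456789
print(word)         # world_this
```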

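2.2 Generic matching

Spelling out \s for whitespace and \d for digits, as above, gets tedious. The catch-all pattern ".*" can be used instead: "." matches any character except a newline, and "*" repeats the preceding element zero or more times, so together they match any run of characters.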

import re

content = 'Hello 123456789 world_this is a demo'
result = re.match(r'^Hello.*demo$', content)
print(result)
print(result.group())


------------------------------------
Output:
<re.Match object; span=(0, 36), match='Hello 123456789 world_this is a demo'>
Hello 123456789 world_this is a demo

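2.3 Greedy and non-greedy matching

As convenient as ".*" is, it sometimes captures more than we want. This is where the difference between greedy and non-greedy matching comes in.

Greedy matching (.*) matches as many characters as possible.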
```
import re

content = 'Hello 123456789 world_this is a demo'
result = re.match(r'^Hello\s(.*)\s', content)
print(result.group(1))


------------------------------------
Output:
123456789 world_this is a
```
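With the greedy pattern (.*)\s, the group consumes as many characters as possible and stops only at the last whitespace character in the string, the one just before "demo".
Non-greedy matching (.*?) matches as few characters as possible.

# Extract the number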
import re

content = 'Hello 123456789 world_this is a demo'
result = re.match(r'^Hello\s(.*?)\s', content)
print(result.group(1))

-----------------------------
Output:
123456789

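This is much more concise than the way the number was extracted in 2.1. The non-greedy pattern (.*?)\s stops at the nearest whitespace character, that is, the first one after the digit "9".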
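
The difference is easiest to see side by side. The following sketch uses the same string with patterns written for this illustration; it also shows a common pitfall, where a non-greedy group at the very end of a pattern matches nothing at all.

```python
import re

content = 'Hello 123456789 world_this is a demo'

# Greedy: .* swallows as much as possible, leaving (\d+) only the last digit
greedy = re.match(r'^Hello.*(\d+)\sworld', content).group(1)

# Non-greedy: .*? stops as early as possible, so (\d+) captures the whole number
lazy = re.match(r'^Hello.*?(\d+)\sworld', content).group(1)

# A non-greedy group at the very end of the pattern can match the empty string
tail = re.match(r'^Hello\s(.*?)', content).group(1)

print(greedy)       # 9
print(lazy)         # 123456789
print(repr(tail))   # ''
```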

2.4 Modifiers (flags)

import re

content = """Hello 123456789 world_this 
is a demo"""
result = re.match(r'^H.*?(\d+).*?demo$', content)
print(result)
print(result.group())
print(result.group(1))


--------------------------
Output:
None
AttributeError: 'NoneType' object has no attribute 'group'

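Once the string being matched (content) contains a newline, the regular expression no longer matches and match returns None, so calling group() raises an AttributeError. This happens because "." does not match a newline by default. Passing the flag re.S makes "." match every character, including newlines.

With the flag added, the example above produces the following result: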

import re

content = """Hello 123456789 world_this 
is a demo"""
result = re.match(r'^H.*?(\d+).*?demo$', content, re.S)
print(result.group(1))
-----------------------
Output:
123456789
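
To avoid the AttributeError seen earlier, it helps to check for None before calling group(). The sketch below runs the same pattern with and without the flag; re.DOTALL is the long name for re.S, and the loop with its labels is added here only for illustration.

```python
import re

content = 'Hello 123456789 world_this\nis a demo'
pattern = r'^H.*?(\d+).*?demo$'

without_flag = re.match(pattern, content)          # '.' stops at the newline
with_flag = re.match(pattern, content, re.DOTALL)  # re.DOTALL is the long name of re.S

# Check for None before calling group() to avoid the AttributeError
for label, result in (('no flag', without_flag), ('re.S', with_flag)):
    if result is None:
        print(label, '-> no match')
    else:
        print(label, '->', result.group(1))
```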

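2.5 Escaping

Sometimes the target string itself contains a character that has a special meaning in a pattern, such as ".". To match it literally, escape it with a backslash: "\.".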

import re

content = 'www.baidu.com'
# Extract the domain name
result = re.match(r'^w.*?\.(.*)', content)
print(result.group(1))


----------------------------
Output:
baidu.com
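
When a whole literal string needs escaping, the standard library's re.escape can do it instead of adding backslashes by hand. This is a small sketch; the test string 'wwwxbaiduxcom' is invented to show what an unescaped "." would let through.

```python
import re

domain = 'www.baidu.com'

# An unescaped '.' matches any character, so this pattern is too permissive
loose = re.fullmatch('www.baidu.com', 'wwwxbaiduxcom') is not None

# re.escape backslash-escapes the metacharacters in a literal string
escaped = re.escape(domain)
strict = re.fullmatch(escaped, 'wwwxbaiduxcom') is not None
exact = re.fullmatch(escaped, domain) is not None

print(escaped)   # www\.baidu\.com
print(loose)     # True
print(strict)    # False
print(exact)     # True
```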

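3. search

match starts at the beginning of the string, and if the beginning does not match, the whole match fails. Using match therefore means accounting for whatever the target string starts with.

search scans through the string and returns the first match it finds, like looking through a pile of content for the first item that fits.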

import re

content = """<ul class="col5">
    
    <li class="clearfix">
        <span class="green-num-box">1</span>
        <a class="face" href="https://site.douban.com/mrblack/" target="_blank">
            <img src="https://img2.doubanio.com/view/site/small/public/da8bd2ba2268e62.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764845">
              <a href="javascript:;">say88</a>
            </h3>

            <p>布布布莱克&nbsp;/&nbsp;2716次播放</p>
        </div>
        <span class="days">(上榜38天)</span>
        <span class="trend arrow-stay"> 0 </span>
    </li>
    
    <li class="clearfix">
        <span class="green-num-box">2</span>
        <a class="face" href="https://site.douban.com/zhangxianzhi/" target="_blank">
            <img src="https://img1.doubanio.com/view/site/small/public/3aaddb0b16d9547.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764862">
              <a href="javascript:;">74-见一面吧-1500</a>
            </h3>

            <p>张弦织&nbsp;/&nbsp;1318次播放</p>
        </div>
        <span class="days">(上榜35天)</span>
        <span class="trend arrow-up"> 1 </span>
    </li>
    
    <li class="clearfix">
        <span class="green-num-box">3</span>
        <a class="face" href="https://site.douban.com/HOPE/" target="_blank">
            <img src="https://img2.doubanio.com/view/site/small/public/9e14bce1c82f343.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764780">
              <a href="javascript:;">四九城</a>
            </h3>

            <p>啸&nbsp;/&nbsp;3753次播放</p>
        </div>
        <span class="days">(上榜42天)</span>
        <span class="trend arrow-down"> 1 </span>
    </li>
    
    <li class="clearfix">
        <span class="green-num-box">4</span>
        <a class="face" href="https://site.douban.com/Post80sG/" target="_blank">
            <img src="https://img1.doubanio.com/view/site/small/public/23b2b858f9949fa.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764887">
              <a href="javascript:;">Check It Out Y&#39;All</a>
            </h3>

            <p>Post80s&nbsp;/&nbsp;1813次播放</p>
        </div>
        <span class="days">(上榜33天)</span>
        <span class="trend arrow-up"> 7 </span>
    </li>"""

result = re.search(r'<li.*?javascript.*?>(.*?)</a>', content, re.S)
if result:
    print(result.group(1))
----------------------------
Output:
say88

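This example is a fragment of the source of the Douban Music chart page. Each song title sits inside an <li> element, between <a href="javascript:;"> and the closing </a>, so the pattern <li.*?javascript.*?>(.*?)</a> is used to match it. However, search returns only the first match.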
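
The contrast between match and search is easier to see on a short string. In this sketch, the sample string (with some made-up leading text) and the variable names are assumptions for illustration.

```python
import re

content = 'Extra text Hello 123456789 world_this is a demo'

# match anchors at the start of the string, so the leading text makes it fail
by_match = re.match(r'Hello\s(\d+)', content)

# search scans forward and returns the first position where the pattern fits
by_search = re.search(r'Hello\s(\d+)', content)

print(by_match)              # None
print(by_search.group(1))    # 123456789
print(by_search.span())      # (11, 26)
```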

4. findall

In the previous section, search found only the first matching string. To extract every song title in the whole HTML fragment, use findall.

import re

content = """<ul class="col5">

    <li class="clearfix">
        <span class="green-num-box">1</span>
        <a class="face" href="https://site.douban.com/mrblack/" target="_blank">
            <img src="https://img2.doubanio.com/view/site/small/public/da8bd2ba2268e62.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764845">
              <a href="javascript:;">say88</a>
            </h3>

            <p>布布布莱克&nbsp;/&nbsp;2716次播放</p>
        </div>
        <span class="days">(上榜38天)</span>
        <span class="trend arrow-stay"> 0 </span>
    </li>

    <li class="clearfix">
        <span class="green-num-box">2</span>
        <a class="face" href="https://site.douban.com/zhangxianzhi/" target="_blank">
            <img src="https://img1.doubanio.com/view/site/small/public/3aaddb0b16d9547.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764862">
              <a href="javascript:;">74-见一面吧-1500</a>
            </h3>

            <p>张弦织&nbsp;/&nbsp;1318次播放</p>
        </div>
        <span class="days">(上榜35天)</span>
        <span class="trend arrow-up"> 1 </span>
    </li>

    <li class="clearfix">
        <span class="green-num-box">3</span>
        <a class="face" href="https://site.douban.com/HOPE/" target="_blank">
            <img src="https://img2.doubanio.com/view/site/small/public/9e14bce1c82f343.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764780">
              <a href="javascript:;">四九城</a>
            </h3>

            <p>啸&nbsp;/&nbsp;3753次播放</p>
        </div>
        <span class="days">(上榜42天)</span>
        <span class="trend arrow-down"> 1 </span>
    </li>

    <li class="clearfix">
        <span class="green-num-box">4</span>
        <a class="face" href="https://site.douban.com/Post80sG/" target="_blank">
            <img src="https://img1.doubanio.com/view/site/small/public/23b2b858f9949fa.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764887">
              <a href="javascript:;">Check It Out Y&#39;All</a>
            </h3>

            <p>Post80s&nbsp;/&nbsp;1813次播放</p>
        </div>
        <span class="days">(上榜33天)</span>
        <span class="trend arrow-up"> 7 </span>
    </li>"""

result = re.findall(r'<li.*?javascript.*?>(.*?)</a>', content, re.S)
print(result)
for name in result:
    print(name)
---------------------
Output:
['say88', '74-见一面吧-1500', '四九城', 'Check It Out Y&#39;All']
say88
74-见一面吧-1500
四九城
Check It Out Y&#39;All

As shown, findall returns every matching string, and it returns them as a list.
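
The shape of that list depends on the number of groups in the pattern: with one group each item is a string, and with several groups each item is a tuple. The sketch below uses a hypothetical, shortened version of the chart markup to show both cases.

```python
import re

# Hypothetical, shortened markup in the spirit of the chart page above
html = '''
<li><span class="green-num-box">1</span><a href="javascript:;">say88</a></li>
<li><span class="green-num-box">2</span><a href="javascript:;">四九城</a></li>
'''

# One group: a list of strings. Several groups: a list of tuples.
titles = re.findall(r'<a href="javascript:;">(.*?)</a>', html)
ranked = re.findall(r'green-num-box">(\d+)</span>.*?javascript:;">(.*?)</a>', html)

print(titles)   # ['say88', '四九城']
print(ranked)   # [('1', 'say88'), ('2', '四九城')]

for rank, title in ranked:
    print(rank, title)
```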

5. sub

Besides extracting matched text, regular expressions can also replace it.

import re

content = '3dfa4fasdf9gsadgfds455'
content = re.sub(r'\d+', '', content)
print(content)
---------------
Output:
dfafasdfgsadgfds

To remove the matched text with sub, pass an empty string as the second argument (the replacement).
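
The replacement does not have to be empty. As a sketch on the same string, the following shows a non-empty replacement, the count argument, and a backreference to the matched text; the variable names are chosen here for illustration.

```python
import re

content = '3dfa4fasdf9gsadgfds455'

masked = re.sub(r'\d+', '#', content)                # replace every run of digits
first_only = re.sub(r'\d+', '', content, count=1)    # limit the number of replacements
wrapped = re.sub(r'(\d+)', r'[\1]', content)         # reuse the matched text via \1

print(masked)       # #dfa#fasdf#gsadgfds#
print(first_only)   # dfa4fasdf9gsadgfds455
print(wrapped)      # [3]dfa[4]fasdf[9]gsadgfds[455]
```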

Likewise, for the song-title example in section 4, sub can first strip out the unneeded markup to simplify the HTML, and findall can then do the extraction.

import re

html = """<ul class="col5">

    <li class="clearfix">
        <span class="green-num-box">1</span>
        <a class="face" href="https://site.douban.com/mrblack/" target="_blank">
            <img src="https://img2.doubanio.com/view/site/small/public/da8bd2ba2268e62.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764845">
              <a href="javascript:;">say88</a>
            </h3>

            <p>布布布莱克&nbsp;/&nbsp;2716次播放</p>
        </div>
        <span class="days">(上榜38天)</span>
        <span class="trend arrow-stay"> 0 </span>
    </li>

    <li class="clearfix">
        <span class="green-num-box">2</span>
        <a class="face" href="https://site.douban.com/zhangxianzhi/" target="_blank">
            <img src="https://img1.doubanio.com/view/site/small/public/3aaddb0b16d9547.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764862">
              <a href="javascript:;">74-见一面吧-1500</a>
            </h3>

            <p>张弦织&nbsp;/&nbsp;1318次播放</p>
        </div>
        <span class="days">(上榜35天)</span>
        <span class="trend arrow-up"> 1 </span>
    </li>

    <li class="clearfix">
        <span class="green-num-box">3</span>
        <a class="face" href="https://site.douban.com/HOPE/" target="_blank">
            <img src="https://img2.doubanio.com/view/site/small/public/9e14bce1c82f343.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764780">
              <a href="javascript:;">四九城</a>
            </h3>

            <p>啸&nbsp;/&nbsp;3753次播放</p>
        </div>
        <span class="days">(上榜42天)</span>
        <span class="trend arrow-down"> 1 </span>
    </li>

    <li class="clearfix">
        <span class="green-num-box">4</span>
        <a class="face" href="https://site.douban.com/Post80sG/" target="_blank">
            <img src="https://img1.doubanio.com/view/site/small/public/23b2b858f9949fa.jpg">
        </a>
        <div class="intro">
            <h3 class="icon-play" data-sid="764887">
              <a href="javascript:;">Check It Out Y&#39;All</a>
            </h3>

            <p>Post80s&nbsp;/&nbsp;1813次播放</p>
        </div>
        <span class="days">(上榜33天)</span>
        <span class="trend arrow-up"> 7 </span>
    </li>"""


html = re.sub(r'<a.*?>', '', html)
# print(html)
result = re.findall(r'<h3.*?>(.*?)</a>', html, re.S)
for name in result:
    print(name.lstrip())

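sub removes the <a ...> opening tags in front of each song title, leaving the title text directly after the <h3> tag, which makes the subsequent match simpler.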

6. compile

compile turns a pattern string into a regular expression object so that it can be reused in later matches.

import re

content1 = '2021-12-11 12:00'
content2 = '2021-12-12 14:00'
content3 = '2021-12-13 16:00'
content4 = '2021-12-14 18:00'
pattern = re.compile(r'\d{2}:\d{2}')
result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)
result4 = re.sub(pattern, '', content4)
print(result1)
print(result2)
print(result3)
print(result4)

--------------
Output:
2021-12-11 
2021-12-12 
2021-12-13 
2021-12-14
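
A compiled pattern also has its own methods, so it can be called directly rather than passed to re.sub. In this sketch the pattern is a slight variation on the one above: it additionally removes the space before the time, which is a choice made here for illustration and not part of the original example.

```python
import re

timestamps = ['2021-12-11 12:00', '2021-12-12 14:00', '2021-12-13 16:00']

# Compile once, then call methods on the pattern object itself
time_pattern = re.compile(r'\s*\d{2}:\d{2}$')

dates = [time_pattern.sub('', ts) for ts in timestamps]
has_time = time_pattern.search(timestamps[0]) is not None

print(dates)      # ['2021-12-11', '2021-12-12', '2021-12-13']
print(has_time)   # True
```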

Posted 2021-12-13 17:16 by 写代码的小灰
