Learning Web Scraping: Regular Expressions (re) in Practice

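Brief workflow summary:
1. Import the libraries: re and requests

2. Set up a disguised request header (User-Agent)

3. Define the function that extracts the information

Fetch the page with requests.get and store the response in res

Extract the matches with re.findall('', res.text, re.S)

Loop over the matches and print them

4. Define the main entry point and build the list of page URLs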
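
The steps above can be tried offline before pointing them at a real site. The following is a minimal sketch, assuming a made-up HTML snippet (`sample_html`) and a hypothetical helper named `extract_paragraphs`; neither comes from the scripts in this post. It only illustrates how `re.findall` with a non-greedy group and the `re.S` flag captures text that spans several lines.

```python
import re

# Hypothetical HTML used only for illustration; it is not taken from a real page.
sample_html = """<div>
<p>first
paragraph</p>
<p>second paragraph</p>
</div>"""


def extract_paragraphs(html):
    """Return the text inside every <p>...</p> pair."""
    # (.*?) is non-greedy, so each match stops at the nearest </p>.
    # re.S lets "." match newlines, so multi-line paragraphs are captured too.
    return re.findall(r'<p>(.*?)</p>', html, re.S)


for paragraph in extract_paragraphs(sample_html):
    print(paragraph)
```

Without `re.S`, the first paragraph in this sample would be skipped, because `.` does not match a newline by default.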

The practice code follows:

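# Script 1: print the paragraph text of several pages on doupoxs.com
import re
import requests
import time

# Browser-like User-Agent so the request does not look like a script
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}


def duanzi(url):
    # Send the disguised header along with the GET request
    res = requests.get(url, headers=headers)
    # Capture the text of every <p>...</p>; re.S lets "." match newlines
    contents = re.findall('<p>(.*?)</p>', res.content.decode('utf-8'), re.S)
    for content in contents:
        print(content)


if __name__ == '__main__':
    # Builds the URLs for pages 3735.html through 3739.html
    urls = ['http://www.doupoxs.com/qiuyu/373{0}.html'.format(i)
            for i in range(5, 10)]
    for url in urls:
        duanzi(url)
        time.sleep(1)  # pause briefly between requests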
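
The list comprehension in the main block builds the page URLs by appending one digit to `373`. A small sketch, assuming a hypothetical helper named `build_urls` that is not part of the original script, makes the resulting addresses explicit without sending any requests:

```python
def build_urls(start=5, stop=10):
    """Build the page URLs the same way the script's list comprehension does."""
    template = 'http://www.doupoxs.com/qiuyu/373{0}.html'
    # range(start, stop) excludes stop, so the defaults give digits 5 to 9.
    return [template.format(i) for i in range(start, stop)]


for url in build_urls():
    print(url)
```

Because `range` excludes its upper bound, `range(5, 10)` yields five pages, from `3735.html` to `3739.html`.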

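# Script 2: print the joke text from the text pages of qiushibaike.com
import re
import requests

# Browser-like User-Agent so the request does not look like a script
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}


def get_info(url):
    # Send the disguised header along with the GET request
    req = requests.get(url, headers=headers)
    # Capture the first <span> inside each <div class="content">
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', req.text, re.S)
    for content in contents:
        print(content)


if __name__ == '__main__':
    # Builds the URLs for pages 1 through 14
    urls = ['https://www.qiushibaike.com/text/page/{0}'.format(i)
            for i in range(1, 15)]
    for url in urls:
        get_info(url)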
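
Text captured this way often still carries surrounding whitespace and `<br/>` tags. The following is a rough sketch of one way to tidy it, assuming a made-up HTML sample (`sample_html`) and a hypothetical helper named `clean_jokes`; the real page markup may differ, so the sample is only an approximation of the structure the pattern above expects.

```python
import re

# Hypothetical markup for illustration; the real page may be structured differently.
sample_html = '''
<div class="content">
<span>

line one<br/>line two

</span>
</div>
<div class="content">
<span>another joke</span>
</div>
'''


def clean_jokes(html):
    """Extract each joke and remove leftover tags and padding."""
    raw = re.findall(r'<div class="content">.*?<span>(.*?)</span>', html, re.S)
    cleaned = []
    for text in raw:
        # Turn <br>, <br/> and <br /> into real line breaks.
        text = re.sub(r'<br\s*/?>', '\n', text)
        # Drop the blank lines and spaces around the joke.
        cleaned.append(text.strip())
    return cleaned


for joke in clean_jokes(sample_html):
    print(joke)
```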

posted on 2018-03-19 19:36 GhostAatrox