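Brief workflow summary:
1. Import the libraries: re and requests
2. Spoof the request headers (set a browser User-Agent)
3. Define the function that extracts the information
   res = requests.get(url)  # send the GET request and keep the response
   re.findall('', res.text, re.S)  # the pattern goes inside the quotes; re.S lets '.' match newlines
   Loop over the matches and print each one
4. Define the main entry point and build the list of page URLs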
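Step 3 depends on the re.S flag. The following is a minimal sketch of why it matters, run against a made-up HTML string (sample_html is an assumption for illustration, not real site output):

```python
import re

# Hypothetical HTML snippet; the second paragraph spans two lines
sample_html = "<p>first line</p>\n<p>second\nline</p>"

# Without re.S, '.' does not match a newline, so the multi-line paragraph is missed
without_flag = re.findall('<p>(.*?)</p>', sample_html)

# With re.S, '.' also matches newlines, so both paragraphs are captured
with_flag = re.findall('<p>(.*?)</p>', sample_html, re.S)

print(without_flag)
print(with_flag)
```

The non-greedy (.*?) stops at the first closing tag, which keeps each paragraph a separate match.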
Working code:
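import re
import requests

# Present a desktop browser User-Agent so the site is less likely to reject the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}

def duanzi(url):
    # Pass the headers explicitly; defining them alone has no effect
    res = requests.get(url, headers=headers)
    # re.S lets '.' match newlines, so paragraphs spanning several lines are captured
    contents = re.findall('<p>(.*?)</p>', res.content.decode('utf-8'), re.S)
    for content in contents:
        print(content)

if __name__ == '__main__':
    # Builds the URLs for pages 3735.html through 3739.html
    urls = ['http://www.doupoxs.com/qiuyu/373{0}.html'.format(i) for i in range(5, 10)]
    for url in urls:
        duanzi(url)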
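To check which pages the list comprehension covers before sending any requests, the URL list can be built and printed on its own. A small sketch using the same template:

```python
# Same URL template as the script above; range(5, 10) yields 5, 6, 7, 8, 9
urls = ['http://www.doupoxs.com/qiuyu/373{0}.html'.format(i) for i in range(5, 10)]

for url in urls:
    print(url)
```

Because range excludes its upper bound, the last URL ends in 3739.html, not 37310.html.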
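import re
import requests

# Present a desktop browser User-Agent so the site is less likely to reject the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

def get_info(url):
    # Pass the headers explicitly; defining them alone has no effect
    req = requests.get(url, headers=headers)
    # Capture the text of the first <span> inside each content <div>
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', req.text, re.S)
    for content in contents:
        print(content)

if __name__ == '__main__':
    # Builds the URLs for pages 1 through 14
    urls = ['https://www.qiushibaike.com/text/page/{0}'.format(i) for i in range(1, 15)]
    for url in urls:
        get_info(url)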
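If the extracted text should be kept rather than only printed, the parsing can be separated from the fetching and return a list. This is a sketch under stated assumptions: parse_jokes, CONTENT_RE and sample_html are hypothetical names, and the sample markup is only shaped to fit the pattern, so it is not real site output. In actual use the html argument would presumably be the text of a requests.get response.

```python
import re

# Same pattern as get_info; compiled once so it can be reused for every page
CONTENT_RE = re.compile('<div class="content">.*?<span>(.*?)</span>', re.S)

def parse_jokes(html):
    """Return the text of every content span in html, whitespace-trimmed."""
    return [text.strip() for text in CONTENT_RE.findall(html)]

# Hypothetical markup shaped like what the pattern expects, not real site output
sample_html = '''
<div class="content">
<span>
first joke
</span>
</div>
<div class="content">
<span>second joke</span>
</div>
'''

jokes = parse_jokes(sample_html)
print(jokes)
```

Keeping the parser free of network calls also makes it easy to test against saved pages.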