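Web Scraping Notes

Web Scraping

1. requests module basics

1.1 Miscellaneous notes

note

Network request modules
    - urllib module
    - requests module

requests module: a third-party Python library for sending HTTP requests (urllib is the one that ships with the standard library). It is powerful, and its API is simple and convenient.
Purpose: simulate a browser sending requests.

How to use it (the requests coding workflow):
    - Specify the URL
        - UA spoofing
        - Handle the request parameters
    - Send the request
    - Get the response data
    - Persist the data

Installation:
    pip install requests

Hands-on coding:
    - Requirement: scrape the page data of the Sogou home page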
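
The workflow listed above can be sketched roughly as follows. This is a minimal sketch rather than the original course code: `save_page` and its `http_get` parameter are hypothetical names introduced here so that the function can also be exercised without network access, and the User-Agent string is just one browser example.

```python
import requests

# A browser-like User-Agent; any current browser string would do.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36"
}


def save_page(url, path, params=None, http_get=requests.get):
    """Fetch url, write the decoded HTML to path, return the character count."""
    # Steps 1-2: URL, UA header and query parameters go into one GET request.
    response = http_get(url, headers=HEADERS, params=params, timeout=10)
    # Step 3: .text is the response body decoded to str.
    page_text = response.text
    # Step 4: persist the data to disk.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(page_text)
    return len(page_text)


# Usage (performs a real network request):
# save_page("https://www.sogou.com/", "sogou.html")
```

Passing the getter as a parameter is only a convenience for testing; in a one-off script, calling `requests.get` directly is fine.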



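Practice exercises
    - Requirement: scrape the Sogou search results page for a given keyword (a simple web page collector)
        - UA detection
        - UA spoofing
    - Requirement: scrape Baidu Translate
        - POST request (carries parameters)
        - The response is JSON data
    - Requirement: scrape the movie detail data from the Douban movie category rankings at https://movie.douban.com/

    - Homework: scrape the restaurant data for a given location from the KFC restaurant finder at http://www.kfc.com.cn/kfccda/index.aspx

    - Requirement: scrape the People's Republic of China cosmetics production license data from the National Medical Products Administration site
                http://125.35.6.84:81/xk/
        - Dynamically loaded data
        - The company information on the home page is fetched dynamically through an ajax request.

        http://125.35.6.84:81/xk/itownet/portal/dzpz.jsp?id=e6c1aa332b274282b04659a6ea30430a
        http://125.35.6.84:81/xk/itownet/portal/dzpz.jsp?id=f63f61fe04684c46a016a45eac8754fe
        - Observations about the detail page URLs:
            - The domain is the same for every URL; only the parameter (id) differs
            - The id values can be taken from the JSON string returned by the home page's ajax request
            - Joining the domain and an id value gives the complete detail page URL for one company
        - The company details on the detail page are also loaded dynamically
            - http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
            - http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById
            - Observations:
                - Every POST request uses the same URL; only the value of the id parameter differs.
                - Once the ids of many companies have been collected in bulk, each id can be paired with this URL to form the ajax request that returns that company's detail data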
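
The last observation can be sketched as a small helper that turns one page of listing JSON into the detail requests. The JSON layout is an assumption made for illustration: the keys `"list"` and `"ID"` and the helper name `build_detail_requests` are hypothetical, and only the URL and the two id values come from the notes above.

```python
DETAIL_URL = "http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById"


def build_detail_requests(listing_json):
    """Return (url, form_data) pairs, one per company in the listing JSON."""
    # Assumed layout: {"list": [{"ID": "..."}, ...]}
    return [(DETAIL_URL, {"id": item["ID"]}) for item in listing_json.get("list", [])]


# A stand-in for the JSON returned by the home page's ajax request.
sample_listing = {
    "list": [
        {"ID": "e6c1aa332b274282b04659a6ea30430a"},
        {"ID": "f63f61fe04684c46a016a45eac8754fe"},
    ]
}

detail_requests = build_detail_requests(sample_listing)
for url, form_data in detail_requests:
    # Each pair would be sent as requests.post(url, data=form_data, headers=...)
    print(url, form_data)
```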

Data parsing:
    Focused crawler
    Regular expressions
    bs4
    xpath

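UA spoofing

# UA: User-Agent (the identity of the client that sends the request)
# UA detection: a site's server checks the client identity carried by each request. If it identifies a browser,
# the request is treated as normal. If it does not look like a browser, the request is treated as
# abnormal (a crawler), and the server will very likely reject it.

# UA spoofing: make the crawler's client identity look like that of a browser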
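
To see why UA detection catches an unmodified script, it helps to compare what requests sends by default with a spoofed header. The sketch below builds a request without sending it; the Sogou search path `/web` and the `query` parameter name are assumptions used for illustration.

```python
import requests

# The User-Agent requests sends when no header is supplied.
default_ua = requests.utils.default_user_agent()

# A browser-like header dict; the exact string is only an example.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36"
}

# prepare() builds the request without sending it, so it can be inspected offline.
prepared = requests.Request(
    "GET", "https://www.sogou.com/web", headers=headers, params={"query": "python"}
).prepare()

print(default_ua)                      # identifies itself as python-requests
print(prepared.headers["User-Agent"])  # the spoofed browser identity
print(prepared.url)                    # params are appended as a query string
```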

Using a proxy

# Example code
import requests
url = 'https://www.baidu.com/s?wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

page_text = requests.get(url=url,headers=headers,proxies={"https":'222.110.147.50:3128'}).text

with open('ip.html','w',encoding='utf-8') as fp:
    fp.write(page_text)

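proxies={"https":'222.110.147.50:3128'} is what routes the request through a proxy. Note that it is passed as a dictionary that maps the URL scheme to the proxy address.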

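Sending requests that carry cookies

# The key lines

# Create a session object
session = requests.Session()
# Send the POST request through the session; cookies set by the response are kept on the session and sent with later requests
response = session.post(url=login_url, headers=headers, data=data)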

1.2 Using json.dump

Usage

import requests
import json
if __name__ == '__main__':
    url = "https://fanyi.baidu.com/sug"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49"
    }
    word = input("enter a word:")
    data = {
        "kw": word
    }
    response = requests.post(url=url, headers=headers, data=data)
    content_dic = response.json()
    # The with block closes the file even if json.dump raises
    with open(word + ".json", "w", encoding="utf-8") as fp:
        json.dump(content_dic, fp=fp, ensure_ascii=False)
    print("Get it!!!")

Parameter notes

First argument: the Python object to serialize (here the dict returned by response.json()), not a JSON string

fp: a writable file object, i.e. the handle returned by open()

ensure_ascii=False: writes Chinese and other non-ASCII characters as they are. With the default, True, they are escaped as \uXXXX sequences
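
The effect of ensure_ascii is easiest to see with json.dumps, which takes the same parameter as json.dump but returns a string. A small sketch; the `record` dict is made-up sample data.

```python
import json

record = {"kw": "狗", "count": 2}

escaped = json.dumps(record)                       # ensure_ascii defaults to True
readable = json.dumps(record, ensure_ascii=False)  # keeps non-ASCII characters

print(escaped)
print(readable)

# Both forms decode back to the same object; only the file's readability differs.
assert json.loads(escaped) == json.loads(readable) == record
```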

2. Data cleaning

2.0 Response data types

text (string), content (bytes), json() (deserialized object)

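2.1 re

Regex basics

Single characters:
 . : any character except a newline
 [] : [aoe] [a-w] matches any one character in the set
 \d : a digit  [0-9]
 \D : a non-digit
 \w : digits, letters, underscore, and Chinese characters (Unicode word characters)
 \W : anything not matched by \w
 \s : any whitespace character, including space, tab, form feed and so on. Equivalent to [ \f\n\r\t\v].
 \S : non-whitespace
Quantifiers:
    * : any number of times  >=0
    + : at least once   >=1
    ? : optional  0 or 1 time
    {m} : exactly m times, e.g. hello{3}
    {m,} : at least m times, e.g. hello{3,}
    {m,n} : between m and n times
Anchors:
    $ : ends with
    ^ : starts with
Grouping:
    (ab)
Greedy mode: .*
Non-greedy (lazy) mode: .*?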
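
The difference between greedy and lazy matching matters a lot when pulling values out of HTML. A short sketch on a made-up snippet:

```python
import re

html = '<a href="/a">first</a><a href="/b">second</a>'

# Greedy: .* runs to the last possible closing quote.
greedy = re.findall(r'href="(.*)"', html)
# Lazy: .*? stops at the first closing quote, once per link.
lazy = re.findall(r'href="(.*?)"', html)

print(greedy)  # ['/a">first</a><a href="/b']
print(lazy)    # ['/a', '/b']

# Quantifiers: {3} is exactly three, {3,} is three or more.
exactly_three = re.fullmatch(r"\d{3}", "2022")
three_or_more = re.fullmatch(r"\d{3,}", "2022")
```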

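Before looking at how to search and match text with regular expressions, here is a quick overview of the basic regex syntax in Python.

Character	Meaning
.	Matches any single character except a newline
*	Matches the preceding subexpression 0 or more times
+	Matches the preceding subexpression 1 or more times
?	Matches the preceding subexpression at most once
|	Alternation: matches the subexpression before or after it
-	Range operator inside brackets, e.g. 0-9, a-z
^	Matches the start of the string (the start of each line with re.M)
$	Matches the end of the string (the end of each line with re.M)
( )	Parentheses group a subexpression
{ }	Braces give the number of times the preceding subexpression repeats
[ ]	Brackets introduce a character set and match any one character in it
[^ ]	Brackets that start with ^ form a negated set and match any one character not in it

There are also some special escape sequences, for example:

Escape	Matches
\.	A literal .
\\	A literal \
\r	Carriage return
\n	Newline
\b	Word boundary
\B	Non-word boundary
\s	Any whitespace character, equivalent to [ \f\n\r\t\v]
\S	Any non-whitespace character, equivalent to [^ \f\n\r\t\v]
\w	Any letter, digit or underscore; equivalent to [a-zA-Z_0-9] in ASCII mode
\W	Equivalent to [^a-zA-Z_0-9] in ASCII mode
\d	Equivalent to [0-9] in ASCII mode
\D	Equivalent to [^0-9] in ASCII mode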

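Common functions

The commonly used regex functions are:

compile(): compiles a regular expression into a pattern object;

match(): tries to match the regex at the beginning of the string; returns a match object on success, otherwise None;

search(): scans the string for the first place where the regex matches; returns a match object on success, otherwise None;

fullmatch(): tries to match the entire string; returns a match object on success, otherwise None;

findall(): returns a list of all substrings of the given string that match the pattern (if the pattern contains groups, the list holds the group contents instead);

sub(): replaces every substring that matches the expression with a given string or with the result of a replacement function.

A match object mainly provides:

group(): group(0) is the whole match, group(1) is the first subgroup, group(2) the second, and so on;

span(): the matched range as a (start, end) tuple; the two parts are also available as start() and end().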
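
The functions above can be tried out on a small string. The text and the date pattern below are made up for illustration:

```python
import re

text = "posted 2022-08-04, edited 2022-09-15"
date_re = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

at_start = date_re.match(text)           # None: the text does not start with a date
first = date_re.search(text)             # first date anywhere in the text
whole = date_re.fullmatch("2022-08-04")  # the entire string must be a date
all_dates = date_re.findall(text)        # group tuples, because the pattern has groups
swapped = date_re.sub(r"\3/\2/\1", text) # backreferences reorder the groups

print(first.group(0), first.group(1), first.span())
print(all_dates)
print(swapped)
```

Here `first.group(0)` is `'2022-08-04'`, `first.group(1)` is `'2022'`, and `first.span()` is `(7, 17)`.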

Hands-on code

# -*- coding:utf-8 -*-
import requests
import re
import os

# Scrape every image in the picture section of Qiushibaike
if __name__ == "__main__":
    # Create a folder to hold all the images
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')

    url = 'https://www.qiushibaike.com/pic/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    # General crawl: fetch the whole page behind the url
    page_text = requests.get(url=url, headers=headers).text

    # Focused crawl: parse out every image address on the page
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    # print(img_src_list)
    for src in img_src_list:
        # Build the complete image url
        src = 'https:' + src
        # Request the binary data of the image
        img_data = requests.get(url=src, headers=headers).content
        # Derive the image file name
        img_name = src.split('/')[-1]
        # Path where the image is stored
        imgPath = './qiutuLibs/' + img_name
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded successfully!')

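The re.S flag

Suppose a string a contains newline characters \n:

Without re.S, . does not match a newline, so a pattern such as .*? can only match within a single line; if one line has no match, matching starts over on the next line.

With re.S, . also matches newlines, so the regex treats the string as a whole and can match across lines.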
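
A small sketch of the difference, using a made-up HTML fragment shaped like the one in the scraper above (the image address is a placeholder):

```python
import re

page = '<div class="thumb">\n<img src="//pic.example.com/1.jpg" alt="x">\n</div>'
pattern = r'<div class="thumb">.*?<img src="(.*?)" alt'

# Without re.S the .*? cannot cross the newline after the <div>, so nothing matches.
without_flag = re.findall(pattern, page)
# With re.S the dot also matches "\n", so the match spans lines.
with_flag = re.findall(pattern, page, re.S)

print(without_flag)  # []
print(with_flag)     # ['//pic.example.com/1.jpg']
```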

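2.2 bs4

bs4 basics

Data parsing with bs4
    - How data parsing works in general:
        - 1. Locate the tag
        - 2. Extract the data stored in the tag or in its attributes
    - How bs4 parsing works:
        - 1. Instantiate a BeautifulSoup object and load the page source into it
        - 2. Call the object's attributes and methods to locate tags and extract data
    - Installation:
        - pip install bs4
        - pip install lxml
    - How to instantiate a BeautifulSoup object:
        - from bs4 import BeautifulSoup
        - Instantiation:
            - 1. Load the data of a local html document into the object
                    fp = open('./test.html','r',encoding='utf-8')
                    soup = BeautifulSoup(fp,'lxml')
            - 2. Load page source fetched from the internet into the object
                    page_text = response.text
                    soup = BeautifulSoup(page_text,'lxml')
        - Methods and attributes for data parsing:
            - soup.tagName: returns the first tagName tag in the document
            - soup.find():
                - find('div'): same as soup.div
                - Locating by attribute:
                    - soup.find('div', class_='song'); use id='song' or attr='song' for other attributes
            - soup.find_all('tagName'): returns all matching tags (a list)
        - select:
            - select('a CSS selector (id, class, tag, ...)') returns a list.
            - Hierarchy selectors:
                - soup.select('.tang > ul > li > a'): > means one level
                - soup.select('.tang > ul a'): a space means any number of levels
        - Getting the text inside a tag:
            - soup.a.text/string/get_text()
            - text/get_text(): gets all the text inside a tag, including text in nested tags
            - string: only gets the text directly inside the tag
        - Getting an attribute value:
            - soup.a['href']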
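
The calls listed above can be tried on a small inline document. This sketch uses the built-in 'html.parser' instead of 'lxml' so that it needs nothing beyond bs4 itself; the HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="tang">
  <ul>
    <li><a href="/libai" title="poet">Li Bai</a></li>
    <li><a href="/dufu">Du <b>Fu</b></a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                                 # first <a> in the document
all_links = soup.find_all("a")                      # every <a>, as a list
tang_div = soup.find("div", class_="tang")          # locate by attribute
child_links = soup.select(".tang > ul > li > a")    # > walks one level at a time
descendant_links = soup.select(".tang a")           # a space spans any depth

second_link = all_links[1]
print(first_link.string)       # direct text of the first link
print(second_link.string)      # None: this tag mixes text and a nested <b>
print(second_link.get_text())  # all text, nested tags included
print(first_link["href"])      # attribute value
```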

Example

# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup

# Scrape every chapter title and chapter text of Romance of the Three Kingdoms from http://www.shicimingju.com/book/sanguoyanyi.html
if __name__ == "__main__":
    # Fetch the page data of the index page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=headers).text

    # Parse the chapter titles and detail page urls out of the index page
    # 1. Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')
    # Parse the chapter titles and the detail page urls
    li_list = soup.select('.book-mulu > ul > li')
    # The with block closes the file once every chapter has been written
    with open('./sanguo.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.a.string
            detail_url = 'http://www.shicimingju.com' + li.a['href']
            # Request the detail page and parse out the chapter text
            detail_page_text = requests.get(url=detail_url, headers=headers).text
            # Parse the chapter content out of the detail page
            detail_soup = BeautifulSoup(detail_page_text, 'lxml')
            div_tag = detail_soup.find('div', class_='chapter_content')
            # The chapter text
            content = div_tag.text
            fp.write(title + ':' + content + '\n')
            print(title, 'scraped successfully!')

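2.3 xpath(*)

xpath basics

xpath parsing: the most commonly used parsing approach, convenient and efficient, and portable across tools.
    - How xpath parsing works:
        - 1. Instantiate an etree object and load the page source to be parsed into it.
        - 2. Call the etree object's xpath method with an xpath expression to locate tags and capture content.
    - Installation:
        - pip install lxml
    - How to instantiate an etree object: from lxml import etree
        - 1. Load the source of a local html document into the etree object:
            etree.parse(filePath)
        - 2. Load source fetched from the internet into the object:
            etree.HTML(page_text)
        - xpath('xpath expression')
    - xpath expressions:
        - /: at the start of an expression, locate from the root node; elsewhere, one level.
        - //: multiple levels; at the start of an expression, locate from any position.
        - Locating by attribute: //div[@class='song']   general form: tag[@attrName="attrValue"]
        - Locating by index: //div[@class="song"]/p[3]   indexes start at 1.
        - Getting text:
            - /text() gets the text directly inside the tag
            - //text() gets all the text inside the tag, including text in nested tags
        - Getting an attribute:
            /@attrName     e.g. img/@src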
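
The expressions above can be checked against a small inline page. A minimal sketch; the HTML is made up for illustration:

```python
from lxml import etree

html = (
    '<html><body><div class="song">'
    '<p>one</p><p>two</p><p>three <span>deep</span></p>'
    '<img src="/a.jpg" alt="A"/>'
    '</div></body></html>'
)
tree = etree.HTML(html)  # pass the source string itself, not a quoted name

first_p = tree.xpath('//div[@class="song"]/p[1]/text()')      # indexes start at 1
direct_text = tree.xpath('//div[@class="song"]/p[3]/text()')  # direct text only
all_text = tree.xpath('//div[@class="song"]/p[3]//text()')    # nested text too
img_src = tree.xpath('//div[@class="song"]/img/@src')         # attribute value

print(first_p, direct_text, all_text, img_src)
```

Every xpath() call returns a list, which is why the scraping code in these notes indexes the result with [0].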

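xpath basics (detailed version)

XPath is a language for finding information in XML documents. Although it was designed for searching XML, it also works on HTML documents, and most browsers support querying nodes with XPath. Python crawlers frequently use XPath to find and extract information from web pages, so it is well worth learning.

XPath uses path expressions to select nodes or node sets in an XML document. Nodes are selected by following a path or steps. The commonly used path expressions are listed below:

Expression	Description
nodename	Selects all child nodes of the named node
/	Selects from the root node
//	Selects matching nodes anywhere in the document
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes

Applying these rules to a sample bookstore XML document gives the selections below.

XPath expression	Meaning
bookstore	Selects all child nodes of the bookstore element
/bookstore	Selects the root element bookstore
/bookstore/book/text()	Selects the text directly inside the book elements that are children of bookstore
//book	Selects all book elements, wherever they are in the document
//@eng	Selects all attributes named eng

Every example above selects all nodes that satisfy the path. To select one particular node, or nodes that contain a given value, use predicates. Predicates are written in square brackets, as shown below.

XPath expression	Meaning
/bookstore/book[1]	Selects the first book element that is a child of bookstore
/bookstore/book[last()]	Selects the last book element that is a child of bookstore
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore
//title[@lang]	Selects all title elements that have an attribute named lang
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng
//title[@lang='eng' and @class="good"]	Selects all title elements whose lang attribute is eng and whose class attribute is good
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of bookstore whose price is greater than 35.00

When selecting nodes, XPath can use the wildcard * to match unknown elements and the operator | to select several paths at once. Examples are shown below.

XPath expression	Meaning
/bookstore/*	Selects all child elements of bookstore
//*	Selects all elements in the document
//title[@*]	Selects all title elements that have at least one attribute
//book/title | //book/price	Selects all title and price elements of book elements
//title | //price	Selects all title and price elements in the document
/bookstore/book/title | //price	Selects all title elements of book elements under bookstore, plus all price elements in the document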
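
The predicate and union expressions can be tried with lxml on a small bookstore document. The notes do not include the XML these examples were written for, so the document below is a made-up stand-in with the same element names.

```python
from lxml import etree

xml = b"""<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title lang="zh">Sanguo Yanyi</title><price>45.00</price></book>
</bookstore>"""
root = etree.fromstring(xml)

first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
first_two = root.xpath("/bookstore/book[position()<3]/title/text()")
english = root.xpath("//title[@lang='eng']/text()")
expensive = root.xpath("/bookstore/book[price>35.00]/title/text()")
# | returns the union of both paths, in document order.
titles_and_prices = root.xpath("//title | //price")

print(first, last, first_two)
print(english, expensive)
print([node.tag for node in titles_and_prices])
```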

Example

# -*- coding:utf-8 -*-
# Parse and download image data from http://pic.netbian.com/4kmeinv/
import requests
from lxml import etree
import os

if __name__ == "__main__":
    url = 'http://pic.netbian.com/4kmeinv/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    # Manually set the encoding of the response data
    # response.encoding = 'utf-8'
    page_text = response.text

    # Data parsing: the src attribute value and the alt attribute
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')

    # Create a folder
    if not os.path.exists('./picLibs'):
        os.mkdir('./picLibs')

    for li in li_list:
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # Fix for garbled Chinese: undo the wrong iso-8859-1 decoding, then decode the bytes as gbk
        img_name = img_name.encode('iso-8859-1').decode('gbk')

        # print(img_name,img_src)
        # Request the image and persist it
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = 'picLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded successfully!')

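3. Captcha handling

Using Chaojiying

Website: Chaojiying, a cloud-based captcha recognition service (chaojiying.com)

1. Register an account and create a software ID

2. Download the sample code for Python (or another language)

3. Replace the account and password in the code, upload the image you want recognized, and run the recognition

Note: this description is brief, but the process is not complicated; search online if anything is unclear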

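4. High-performance asynchronous crawlers

Ways to build an asynchronous crawler:
    - 1. Multithreading, multiprocessing (not recommended):
        Pro: each blocking operation gets its own thread or process, so blocking operations run asynchronously.
        Con: threads or processes cannot be created without limit.
    - 2. Thread pool, process pool (use in moderation):
        Pro: lowers how often the system creates and destroys threads or processes, which reduces overhead.
        Con: the number of threads or processes in the pool has an upper limit.

    - 3. Single thread + asynchronous coroutines (recommended):
        event_loop: the event loop, essentially an infinite loop. Functions can be registered on it,
        and when certain conditions are met the loop runs them.

        coroutine: a coroutine object. It can be registered on the event loop, which will then call it.
        A function defined with the async keyword is not executed when called; the call returns
        a coroutine object instead.

        task: a further wrapper around a coroutine object that tracks the states of the task.

        future: represents a task that will run later or has not run yet; in practice it is not fundamentally different from a task.

        async defines a coroutine.

        await suspends the current coroutine at a blocking operation so that the loop can run other tasks.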

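4.1 Basic thread pool usage

import time
# Import the class that provides the thread pool
from multiprocessing.dummy import Pool

# Run using a thread pool
start_time = time.time()


def get_page(name):
    print("Downloading:", name)
    time.sleep(2)
    print('Downloaded:', name)


name_list = ['xiaozi', 'aa', 'bb', 'cc']

# Instantiate a thread pool object
pool = Pool(4)
# Pass each element of the list to get_page for processing.
pool.map(get_page, name_list)
pool.close()
pool.join()
end_time = time.time()
print(end_time - start_time)

pool.close tells the pool not to accept any new work.
pool.join tells the pool to wait until all jobs are done and then exit, which effectively cleans up the pool.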
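
A variation on the example above, as a sketch: pool.map also returns the workers' return values in input order, and because map blocks until every item is done, the pool can be used as a context manager. `fake_download` is a hypothetical stand-in for a blocking download.

```python
import time
from multiprocessing.dummy import Pool  # thread-based pool with the Pool API


def fake_download(name):
    """Stand-in for a blocking download; sleeps, then returns a result."""
    time.sleep(0.2)
    return name.upper()


names = ["xiaozi", "aa", "bb", "cc"]

start = time.time()
# map() blocks until all four items are processed, so leaving the with block is safe.
with Pool(4) as pool:
    results = pool.map(fake_download, names)
elapsed = time.time() - start

print(results)            # results come back in input order
print(round(elapsed, 2))  # roughly one sleep, not four, since the threads overlap
```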

Crawler example

import requests
from lxml import etree
import re
from multiprocessing.dummy import Pool

# Requirement: scrape video data from Pear Video
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}
# Principle: use the thread pool for operations that block and take a long time

# Request the url below and parse out each video's detail page url and name
url = 'https://www.pearvideo.com/category_5'
page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')
urls = []  # Stores the url and name of every video
for li in li_list:
    detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
    name = li.xpath('./div/a/div[2]/text()')[0] + '.mp4'
    # Request the detail page url
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    # Parse the video address (url) out of the detail page
    ex = 'srcUrl="(.*?)",vdoUrl'
    video_url = re.findall(ex, detail_page_text)[0]
    dic = {
        'name': name,
        'url': video_url
    }
    urls.append(dic)


# Request the video url, get the binary video data, and save it to disk
def get_video_data(dic):
    url = dic['url']
    print(dic['name'], 'downloading...')
    data = requests.get(url=url, headers=headers).content
    # Persist the data
    with open(dic['name'], 'wb') as fp:
        fp.write(data)
        print(dic['name'], 'downloaded successfully!')


# Use the thread pool to request the video data (a slow, blocking operation)
pool = Pool(4)
pool.map(get_video_data, urls)

pool.close()
pool.join()

Note: read this for its structure only; it no longer works against the site

4.2 Coroutines

import asyncio

async def request(url):
    print('Requesting url:', url)
    print('Request succeeded:', url)
    return url

# Calling a function defined with async returns a coroutine object
c = request('www.baidu.com')

# # Create an event loop object
# loop = asyncio.get_event_loop()
#
# # Register the coroutine object with the loop, then start the loop
# loop.run_until_complete(c)

# Using a task
# loop = asyncio.get_event_loop()
# # Create a task object from the loop
# task = loop.create_task(c)
# print(task)
#
# loop.run_until_complete(task)
#
# print(task)

# Using a future
# loop = asyncio.get_event_loop()
# task = asyncio.ensure_future(c)
# print(task)
# loop.run_until_complete(task)
# print(task)

def callback_func(task):
    # result() returns the return value of the coroutine function wrapped by the task
    print(task.result())

# Bind a callback
# An explicit loop avoids relying on an implicit current loop, which newer Python versions deprecate
loop = asyncio.new_event_loop()
task = asyncio.ensure_future(c, loop=loop)
# Attach the callback function to the task object
task.add_done_callback(callback_func)
loop.run_until_complete(task)

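Multi-task coroutines

import asyncio
import time


async def request(url):
    print('Downloading', url)
    # Synchronous code inside a coroutine blocks the event loop, so nothing runs concurrently.
    # time.sleep(2)
    # Blocking operations in asyncio have to be suspended explicitly with await
    await asyncio.sleep(2)
    print('Finished downloading', url)


start = time.time()
urls = [
    'www.baidu.com',
    'www.sogou.com',
    'www.goubanjia.com'
]

# An explicit loop avoids relying on an implicit current loop, which newer Python versions deprecate
loop = asyncio.new_event_loop()
# Task list: holds multiple task objects
tasks = []
for url in urls:
    c = request(url)
    task = asyncio.ensure_future(c, loop=loop)
    tasks.append(task)

# The task list has to be wrapped in asyncio.wait
loop.run_until_complete(asyncio.wait(tasks))

print(time.time() - start)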
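
On recent Python versions the same idea is usually written with asyncio.run and asyncio.gather, which create and close the event loop for you. This is an alternative sketch, not the form used in these notes; `fake_request` is a hypothetical stand-in for a real request.

```python
import asyncio
import time


async def fake_request(url, delay=0.2):
    """Stand-in for a network request: waits without blocking the loop."""
    await asyncio.sleep(delay)
    return f"done: {url}"


async def main(urls):
    # gather schedules all coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(fake_request(u) for u in urls))


urls = ["www.baidu.com", "www.sogou.com", "www.goubanjia.com"]

start = time.time()
results = asyncio.run(main(urls))
elapsed = time.time() - start

print(results)
print(round(elapsed, 2))  # close to one delay, not three
```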

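Multi-task asynchronous coroutines

import requests
import asyncio
import time

start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom'
]


async def get_page(url):
    print('Downloading', url)
    # requests.get is synchronous, so the three requests below still run one after another.
    # An asynchronous network request module is required to send requests concurrently.
    # aiohttp: a module for asynchronous network requests
    response = requests.get(url=url)
    print('Finished downloading:', response.text)


# An explicit loop avoids relying on an implicit current loop, which newer Python versions deprecate
loop = asyncio.new_event_loop()
tasks = []

for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c, loop=loop)
    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))

end = time.time()

print('Total time:', end - start)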

Multi-task asynchronous coroutines with aiohttp

# Installation: pip install aiohttp
# Uses ClientSession from that module
import asyncio
import time
import aiohttp

start = time.time()
# urls = [
#     'http://127.0.0.1:5000/bobo','http://127.0.0.1:5000/jay','http://127.0.0.1:5000/tom',
#     'http://127.0.0.1:5000/bobo', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom',
#     'http://127.0.0.1:5000/bobo', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom',
#     'http://127.0.0.1:5000/bobo', 'http://127.0.0.1:5000/jay', 'http://127.0.0.1:5000/tom',
#
# ]

urls = []
for i in range(10):
    urls.append('http://127.0.0.1:5000/bobo')
print(urls)


async def get_page(url):
    async with aiohttp.ClientSession() as session:
        # get(), post():
        # headers, params/data, proxy='http://ip:port'
        async with session.get(url) as response:
            # text() returns the response data as a string
            # read() returns the response data as bytes
            # json() returns the deserialized JSON object
            # Note: reading the response data must be awaited
            page_text = await response.text()
            print(page_text)


# An explicit loop avoids relying on an implicit current loop, which newer Python versions deprecate
loop = asyncio.new_event_loop()
tasks = []

for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c, loop=loop)
    tasks.append(task)

loop.run_until_complete(asyncio.wait(tasks))

end = time.time()

print('Total time:', end - start)

Async crawler practice

import requests
from lxml import etree
import time
import os
start = time.time()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
if not os.path.exists('./libs'):
    os.mkdir('./libs')
url = 'http://pic.netbian.com/4kmeinv/index_%d.html'
a = []
for page in range(2, 50):
    new_url = url % page
    page_text = requests.get(url=new_url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    for li in li_list:
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        name = img_src.split('/')[-1]
        # Send the same headers for the image request as for the page request
        data = requests.get(url=img_src, headers=headers).content
        path = './libs/' + name
        with open(path, 'wb') as fp:
            fp.write(data)
            print(name, 'downloaded successfully')
        a.append(name)
print(len(a))
print('Total time:', time.time() - start)

To be continued...

posted @ 2022-08-04 21:44 ihuahua1415
