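Learning Python: Web Scraping with Python

Python Web Scraping

A web scraper (crawler) is an automated script that runs against websites and collects information from them. Crawlers are either targeted or non-targeted: a targeted crawler is told which kind of site to scrape, while a non-targeted crawler decides for itself which sites to visit.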

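The scraping process

1. Request the site's URL and get back the page's HTML.

2. Filter out the links and other information you need.

Use data = requests.get(url) to fetch a page; data.text then holds the page's HTML.

Use the BeautifulSoup module to turn the HTML text into an object that can be searched.

Install the required modules:

Install requests: pip3 install requests

Install beautifulsoup4: pip3 install beautifulsoup4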
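
The two steps can be tried without touching the network. The sketch below is only an illustration: the html string is a made-up stand-in for what requests.get(url).text would return, and the ids and links in it are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the text that requests.get(url).text would return.
html = """
<html><body>
  <ul id="news">
    <li><a href="/a/1.html">First story</a></li>
    <li><a href="/a/2.html">Second story</a></li>
  </ul>
</body></html>
"""

# Step 1 would normally be: html = requests.get(url).text
# Step 2: turn the text into a searchable object and pull out the links.
soup = BeautifulSoup(html, features="html.parser")
links = [(a.text, a.attrs["href"]) for a in soup.find_all("a")]
print(links)
```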

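Example 1:

Create pachong.py

# Scraper
# Scrape car news from autohome.com.cn
import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.autohome.com.cn/all/")
# Decode the body with the encoding detected from the page content
response.encoding = response.apparent_encoding
print(response.text)

# Use BeautifulSoup to turn the text into an object
# features: the parser to use; "html.parser" is built into Python, "lxml" also works once it is installed
soup = BeautifulSoup(response.text, features="html.parser")
print(soup)

# Find the div node whose id is "auto-channel-lazyload-article"
target = soup.find(id="auto-channel-lazyload-article")
# Find the first li tag
li_obj = target.find("li")
# Find all li tags
li_list = target.find_all("li")

# Loop over the list
for item in li_list:
    # Find the a tag inside each li
    a_obj = item.find("a")
    if a_obj:
        # The tag's attribute dictionary
        print(a_obj.attrs)
        print(a_obj.attrs["href"])
    # Find the h3 tag
    h3_obj = item.find("h3")
    if h3_obj:
        # The tag's text
        print(h3_obj.text)
    # Find the img tag (this must stay inside the loop, otherwise only the last item's image is saved)
    img_obj = item.find("img")
    if img_obj:
        # The src value is assumed to be protocol-relative ("//..."), so a scheme is prepended
        img_url = "http:" + img_obj.attrs["src"]
        img_data = requests.get(img_url)
        # Generate a file name
        img_name = str(uuid.uuid4()) + ".jpg"
        with open(img_name, "wb") as f:
            # content returns the raw bytes; text is the decoded string
            f.write(img_data.content)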

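Using the modules

The requests module

data = requests.get(url): fetch the URL with a GET request;

data = requests.post(url, cookies={}, data={}): POST request; data is the payload to send, cookies are the cookies to send;

data.text: the response body as text (the page's HTML);

data.content: the response body as raw bytes;

data.encoding: the encoding used to decode data.text; set it to data.apparent_encoding to use the encoding detected from the page content;

data.cookies.get_dict(): the cookies as a dictionary;

The beautifulsoup4 module

Import it with from bs4 import BeautifulSoup;

soup = BeautifulSoup(data.text, features="") turns the HTML text returned by requests into an object; features selects the parser;

soup.find(""): find the first matching tag;

soup.find_all(""): find all matching tags;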

Example 2:

Create pachong1.py

# Scraper
# Log in to GitHub automatically
import requests
from bs4 import BeautifulSoup

# Simulate the GitHub login flow
# GET https://github.com/login
data_get = requests.get("https://github.com/login")
# Look at the cookies dictionary
dict_cookies = data_get.cookies.get_dict()
# print(dict_cookies)
# print(data_get.text)

# Turn the page into an object
soup = BeautifulSoup(data_get.text, features="html.parser")

# Collect the login form fields
form_dict = {}
target_form = soup.find("form")
input_list = target_form.find_all("input")
for item in input_list:
    # Keep only inputs that have both a name and a pre-filled value
    if item.attrs.get("name") and item.attrs.get("value"):
        form_dict[item.attrs["name"]] = item.attrs["value"]

# Fill in the login details
form_dict["login"] = "your account"
form_dict["password"] = "your password"

print(form_dict)
# Submit the login request
data_post = requests.post(
    url="https://github.com/session",
    data=form_dict,
    cookies=dict_cookies
)
# Print the body of the login response
print(data_post.text)
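
The form-collecting loop is the part most worth checking by hand, so here is a small offline sketch of it. The form markup and its field names (authenticity_token, commit) are invented for illustration and are not taken from the real GitHub page.

```python
from bs4 import BeautifulSoup

# Hypothetical login form; the field names are illustrative only.
login_html = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

soup = BeautifulSoup(login_html, features="html.parser")
form_dict = {}
for item in soup.find("form").find_all("input"):
    name = item.attrs.get("name")
    value = item.attrs.get("value")
    # Keep only inputs that already carry a value (hidden token, submit button).
    if name and value:
        form_dict[name] = value

# Fill in the credentials on top of the pre-filled fields.
form_dict["login"] = "account"
form_dict["password"] = "password"
print(form_dict)
```

The empty login and password inputs are skipped by the loop and then filled in explicitly, which is why the hidden token survives into the submitted data.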

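The requests module in detail

Convenience methods, one per HTTP method:

requests.get();

requests.post();

requests.put();

requests.delete();

The request() method

requests.request():

Parameters:

1. method: the HTTP method;

2. url: the URL to request;

3. params: parameters passed in the URL's query string (the ?name=value part); takes a dictionary;

4. data: the data to send as the body of a POST request; takes a dictionary, a string ("k1=v1&k2=v2"), or a file object;

5. json: data to send in the POST body, serialized as a whole to JSON;

Submitting with json

Request header: Content-Type: application/json

Request body: '{"k1": "v1", "k2": "v2"}'

Submitting with data

Request header: Content-Type: application/x-www-form-urlencoded

Request body: k1=v1&k2=v2

6. cookies: cookies, sent in the Cookie request header;

7. files: files to upload; the value can be a dictionary.

Each dictionary value is a tuple: the first item is the file name reported to the server, the second is a file object opened for reading in binary mode, open("a.txt", "rb"). (Opening with "wb" would truncate the file instead of reading it.)

{

"k1": ("test.txt", open("a.txt", "rb"))

}

8. auth: credentials for HTTP authentication; a (username, password) tuple is sent as HTTP Basic auth, which Base64-encodes the credentials rather than encrypting them;

9. timeout: the timeout; takes a float or a (connect timeout, read timeout) tuple;

10. allow_redirects: whether to follow redirects; a boolean;

11. proxies: proxies; takes a dictionary

{

"http": "address"

}

12. stream: True or False; when True the body is not downloaded all at once but read incrementally, like a stream, which suits large files;

13. verify: True or False; False skips verification of the server's TLS certificate;

14. cert: a client-side certificate, for HTTPS servers that reject requests made without one;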
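
To see how params, data and json differ, a request can be prepared without being sent. This is a minimal sketch; https://example.com is a placeholder host and the k1/k2 values are the same dummy keys used in the list above.

```python
import json

import requests

# Nothing is sent over the network: the requests are only prepared so that
# their URL, headers and body can be inspected.
get_req = requests.Request(
    "GET", "https://example.com/search", params={"k1": "v1", "k2": "v2"}
).prepare()
print(get_req.url)

form_req = requests.Request(
    "POST", "https://example.com/submit", data={"k1": "v1", "k2": "v2"}
).prepare()
print(form_req.headers["Content-Type"], form_req.body)

json_req = requests.Request(
    "POST", "https://example.com/submit", json={"k1": "v1", "k2": "v2"}
).prepare()
# The JSON body is parsed back so the comparison does not depend on spacing.
json_body = json.loads(json_req.body)
print(json_req.headers["Content-Type"], json_body)
```

params ends up in the URL, data becomes a form-encoded body, and json becomes a JSON body with the matching Content-Type header.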

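The requests.Session() class

Keeps the client's state across requests

session = requests.Session()

With session.get() and session.post() there is no need to pass cookies by hand: a session automatically carries the cookies received from earlier responses, and any headers set on the session, into the next request.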
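
The carrying-over can be observed offline by preparing a request through a session. In this sketch the cookie name and value, the User-Agent string and the example.com URL are all made up; in real use the cookie would come from an earlier response.

```python
import requests

session = requests.Session()
# Pretend an earlier response set this cookie; the name and value are made up.
session.cookies.set("sessionid", "abc123")
# Headers set on the session are also reused by every later request.
session.headers.update({"User-Agent": "demo-crawler/0.1"})

# Prepare (but do not send) a request through the session.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)
print(prepared.headers["Cookie"])
print(prepared.headers["User-Agent"])
```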

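The BeautifulSoup module in detail

soup = BeautifulSoup(html returned by requests, features="")

Arguments for instantiating BeautifulSoup:

Argument 1: the HTML text fetched with requests, i.e. requests.get(url).text (text is an attribute, not a method)

Argument 2: features, the parser. html.parser is built into Python, has moderate speed and tolerates bad markup well; lxml is fast and tolerant, and needs a C library installed; xml is fast, supports XML parsing, and needs the same C library;

To use lxml, install it with pip install lxml

Finding tags (the result is a tag object):

Find a given tag

soup.find("tag name")

find argument recursive = True or False: search all descendants, or only direct children

find argument attrs = {"class": "c1"}: a dictionary of attribute conditions

find argument text = "": match on the tag's text (newer versions of Beautiful Soup call this argument string)

find argument name = "": the tag name

find argument class_ = "": the CSS class

Usually attrs alone is used as the condition, with the other attribute conditions placed in the attrs dictionary.

Find all matching tags

soup.find_all("tag name")

Every find argument also applies to find_all()

A condition can also be a list

soup.find_all(name=["div", "a"])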
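
A short sketch of these arguments on a made-up fragment (the ids and class names are hypothetical):

```python
from bs4 import BeautifulSoup

html = """
<div id="app">
  <p class="c1">one</p>
  <section><p class="c1 c2">two</p></section>
  <span class="c1">three</span>
</div>
"""
soup = BeautifulSoup(html, features="html.parser")
app = soup.find(id="app")

# attrs: first <p> whose class list contains "c1"
by_attrs = soup.find("p", attrs={"class": "c1"})
# class_: every tag that has "c1" among its classes
by_class = soup.find_all(class_="c1")
# recursive=False: only direct children of <div id="app">
direct_only = app.find_all("p", recursive=False)
# a list of names matches any of them
by_list = soup.find_all(name=["p", "span"])
print(by_attrs.text, len(by_class), len(direct_only), len(by_list))
```

With recursive=False the <p> nested inside <section> is not found, because it is not a direct child of the div.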

select() finds tags by CSS selector, for example by id; it always returns a list

soup.select("#id")

soup.select(".cls")

soup.select("body a")

soup.select("body > a")

soup.select("span,a")
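
The five selector forms behave as follows on a small invented page (the id top and class cls are placeholders):

```python
from bs4 import BeautifulSoup

html = """
<body>
  <a id="top" href="#">top</a>
  <div class="cls"><a href="/x">nested</a></div>
  <span>note</span>
</body>
"""
soup = BeautifulSoup(html, features="html.parser")

by_id = soup.select("#top")           # id selector, still a list
by_class = soup.select(".cls")        # class selector
descendants = soup.select("body a")   # any <a> below <body>
children = soup.select("body > a")    # only <a> directly under <body>
either = soup.select("span,a")        # <span> or <a>
print(len(by_id), len(by_class), len(descendants), len(children), len(either))
```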

Finding with a regular expression (the pattern is applied to each tag name with search())

import re

rep = re.compile("div.")

soup.find(name=rep)
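
Because the pattern is matched against the tag name itself, it pays to check what it really matches. In this sketch on a made-up fragment, "div." needs one more character after "div", so a plain div tag is not matched, whereas an anchored heading pattern matches h1 and h2.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><h1>a</h1><h2>b</h2><p>c</p></div>", features="html.parser"
)

# Tag names h1, h2, ... match; "p" and "div" do not.
headings = soup.find_all(name=re.compile(r"^h\d$"))
# "div." requires a fourth character, so the name "div" does not match.
no_match = soup.find(name=re.compile("div."))
print([t.name for t in headings], no_match)
```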

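Finding a tag's index within its parent

body_tag = soup.find("body")

Find the index of a div inside body (the div must be a direct child of body, otherwise index() raises ValueError)

body_tag.index(soup.find("div"))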

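The tag object, <class 'bs4.element.Tag'>

Attributes:

tag.name: the tag name, e.g. "a" or "div";

tag.attrs: the attribute dictionary; assign to its keys to add or change attributes;

tag.string: the tag's content; assigning tag.string = "" replaces the content

Attributes for related nodes

Nodes after the current tag

tag.next

tag.next_element

tag.next_elements

tag.next_sibling

tag.next_siblings

Nodes before the current tag

tag.previous

tag.previous_element

tag.previous_elements

tag.previous_sibling

tag.previous_siblings

Parents of the current tag

tag.parent

tag.parents

tag.children: iterates over all direct children, including text nodes such as line breaks;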
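
The remark about line breaks is easy to trip over, so here is a small sketch on a made-up list. It shows that the sibling right after a tag can be a whitespace string, and how to skip such nodes.

```python
from bs4 import BeautifulSoup, Tag

html = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
soup = BeautifulSoup(html, features="html.parser")
first = soup.find("li")

# The immediate sibling is the "\n" text between the two <li> tags.
sibling = first.next_sibling
# find_next_sibling() skips text nodes and returns the next matching tag.
next_li = first.find_next_sibling("li")
# .children yields text nodes too, so filter for Tag objects.
all_children = list(first.parent.children)
child_tags = [c for c in all_children if isinstance(c, Tag)]
print(repr(sibling), next_li.text, first.parent.name,
      len(all_children), len(child_tags))
```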

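Methods:

tag.extract(): removes the tag itself and everything inside it from the tree, and returns the removed tag;

tag.decompose(): removes the tag itself and everything inside it, and destroys it;

tag.clear(): removes everything inside the tag and keeps the tag itself;

tag.decode(): converts the tag to a string; decode_contents() converts only what is inside the tag, without the tag itself;

tag.encode(): converts the tag to bytes;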
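
A minimal sketch of how these methods differ, run on an invented fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="box"><p>keep</p><span>drop</span><b>gone</b></div>',
    features="html.parser",
)
box = soup.find(id="box")

removed = box.find("span").extract()  # taken out of the tree, still usable
box.find("b").decompose()             # taken out of the tree and destroyed
removed_html = removed.decode()
outer_html = box.decode()             # includes the <div> itself
inner_html = box.decode_contents()    # only what is inside the <div>

box.clear()                           # the <div> stays, its contents go
cleared_html = box.decode()
print(removed_html, outer_html, inner_html, cleared_html, sep="\n")
```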

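Adding content to a tag

tag.append(): appends content to the end of the tag; if the content is a tag that already exists in the tree, it is moved there rather than copied;

tag.insert(index, tag): inserts content into the tag at the position given by index;

tag.insert_after(): inserts content right after the tag;

tag.insert_before(): inserts content right before the tag;

tag.replace_with(): replaces the tag with the given content;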
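
The sketch below exercises these methods on a made-up fragment. It creates tags with soup.new_tag(), which is one way of building them; the tag names and text are placeholders.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><h1>title</h1><p>text</p></div>", features="html.parser"
)
div = soup.find("div")

link = soup.new_tag("a", href="#")
link.string = "hello world"
div.append(link)             # a new tag goes last
div.append(soup.find("h1"))  # an existing tag is moved, not copied
after_append = div.decode()

note = soup.new_tag("em")
note.string = "note"
soup.find("p").insert_before(note)
soup.find("p").replace_with("plain text")
after_edit = div.decode()
print(after_append)
print(after_edit)
```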

Finding related tags

tag.find_next();

tag.find_next_sibling();

tag.find_next_siblings();

tag.find_previous();

tag.find_previous_sibling();

tag.find_previous_siblings();

tag.find_parent();

tag.find_parents();

Wrapping tags

tag.wrap(): wraps the current tag inside the given tag

tag.unwrap(): removes the current tag but keeps its contents
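
A small sketch of the direction of each operation, using an invented fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello <b>world</b></p>", features="html.parser")

# wrap(): the <p> ends up inside the new <div>.
soup.find("p").wrap(soup.new_tag("div"))
wrapped = str(soup)

# unwrap(): the <b> tag disappears, its text stays in place.
soup.find("b").unwrap()
unwrapped = str(soup)
print(wrapped)
print(unwrapped)
```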

Example:

# Scraper
# BeautifulSoup in detail
import re

from bs4 import BeautifulSoup, Tag

# Read the HTML file
with open("test.html") as f:
    data = f.read()
# Build the BeautifulSoup object
soup = BeautifulSoup(data, features="html.parser")

# Finding tags with soup
# find()
# Find by tag name; recursive=True searches all descendants
tag = soup.find(name="p", recursive=True)
# Find by text content
tag = soup.find(text="Apache Cordova")
# Find by id
tag = soup.find(id="deviceready")
# Find by class
tag = soup.find(class_="event listening")
# Find by attribute values; attrs takes a dictionary, and all conditions must match
tag = soup.find(attrs={"class": "event listening", "about": "test"})

# find_all()
tags = soup.find_all(name="p", recursive=True)
# A condition can be a list
tags = soup.find_all(name=["p", "span"])

# select(): find by CSS selector
# id selector; select() returns a list even for an id
tags = soup.select("#deviceready")
# Class selector; returns a list
tags = soup.select(".app")
# Combined selector
tags = soup.select("div .event")

# Matching with a regular expression
rep = re.compile(r"h(\d)")
# Find the first tag whose name contains "h" followed by a digit (h1, h2, ...)
tag = soup.find(name=rep)

# Attributes and methods of the Tag object
body_tag = soup.find("body")
# Tag attributes
# # Tag name
# print(body_tag.name)
# # Tag attributes
# print(body_tag.attrs)
# # Tag content; None when the tag contains more than one child
# print(body_tag.string)
# print(soup.find("span").string)
# # The node that follows this one
# print(soup.find("h1").next)
# # The parent tag, returned as a Tag object
# print(soup.find("h1").parent)

# Tag methods
# Remove everything inside the tag
# body_tag.clear()
# Remove the tag, including itself, and return the removed tag
# rest = body_tag.extract()
# Remove the tag, including itself
# body_tag.decompose()
# Convert the tag to a string and return it
# rest = body_tag.decode()
# Convert the tag to bytes
# rest = body_tag.encode()

# Adding tags
# Create a Tag object
tag_a = Tag(name="a", attrs={"id": "append_a", "class": "ctest", "href": "#"})
tag_a.string = "hello world"
# Append the tag to the end of body
# body_tag.append(tag_a)
# Insert the tag into body at a given position
# body_tag.insert(0, tag_a)
# Insert before or after another tag
# h1_tag = soup.find("h1")
# Insert after the current tag
# h1_tag.insert_after(tag_a)
# Insert before the current tag
# h1_tag.insert_before(tag_a)

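Scraper performance optimization

With one request at a time, sites are scraped one after another;

with several requests issued concurrently, they are scraped in parallel.

Concurrent requests with multiple threads

# Scraper
# Performance optimization: concurrent requests with multiple threads
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

import requests

# Computation needs the CPU. Each process has one GIL, so only one thread
# in a process runs Python code at a time,
# which is why extra threads do not help CPU-bound work.
# For I/O-bound work, use multiple threads.
# For CPU-bound work, use multiple processes.

# Create the thread pool
pool = ThreadPoolExecutor(5)
# Create a process pool
# ppool = ProcessPoolExecutor(5)

# URLs to request
url_list = [
    "https://www.baidu.com",
    "https://www.bootcss.com/",
    "https://www.runoob.com/",
    "https://www.cnblogs.com/"
]

# Request a page
def task(url):
    response = requests.get(url)
    print(url, response)
    # Return the response so the callback can read it from future.result()
    return response

# Callback run when a request has finished
def result_back(future, *args, **kwargs):
    # future is a Future object; result() returns whatever task() returned
    # (if task() had no return statement, result() would be None)
    print(type(future))
    print(future.result())

if __name__ == "__main__":
    # Send the URL list through the thread pool
    for item in url_list:
        # Start a task on a pool thread
        v = pool.submit(task, item)
        # Register the callback
        v.add_done_callback(result_back)
    pool.shutdown(wait=True)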
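
The relationship between the task's return value and future.result() can be checked without any network access. In this sketch fake_fetch is a hypothetical stand-in for requests.get that just sleeps, and the example.com URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

url_list = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]
results = []

def fake_fetch(url):
    # Stand-in for requests.get(url): sleep to simulate waiting on the network.
    time.sleep(0.1)
    return f"response for {url}"

def on_done(future):
    # result() is whatever the task returned; without a return it would be None.
    results.append(future.result())

start = time.perf_counter()
# Leaving the with block waits for all submitted tasks to finish.
with ThreadPoolExecutor(max_workers=3) as pool:
    for url in url_list:
        pool.submit(fake_fetch, url).add_done_callback(on_done)
elapsed = time.perf_counter() - start
print(sorted(results), round(elapsed, 2))
```

Since the three simulated waits overlap, the elapsed time should be close to one wait rather than three, which is the whole point of using threads for I/O-bound work.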

Concurrent requests with asynchronous I/O: many tasks on a single thread

# # Using gevent + requests
# from gevent import monkey
# monkey.patch_all()
# import gevent
# import requests

# # Function that performs one request
# def func_async(method, url, req_kwargs):
#     print(url)
#     response = requests.request(method=method, url=url, **req_kwargs)
#     print(response.url)
#     print(response.content.decode(encoding="utf-8"))

# # Create a coroutine pool (maximum number of coroutines);
# # note that the gevent.spawn calls below do not go through this pool
# from gevent.pool import Pool
# pool = Pool(5)
# gevent.joinall([
#     gevent.spawn(func_async, method="get", url="https://www.baidu.com/", req_kwargs={}),
#     gevent.spawn(func_async, method="get", url="https://www.cnblogs.com/", req_kwargs={}),
#     gevent.spawn(func_async, method="get", url="https://www.runoob.com/", req_kwargs={}),
# ])


# # Using the grequests module
# import grequests
# # List of requests
# request_lists = [
#     grequests.get("https://www.baidu.com"),
#     grequests.get("https://www.cnblogs.com/"),
# ]
# # Send all requests in the list
# response_lists = grequests.map(request_lists)
# print(response_lists)
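
The same single-thread, many-tasks idea can be sketched with the standard library's asyncio instead of gevent or grequests. This is a different technique from the one in the code above, shown only to illustrate the concept; fake_fetch is a hypothetical stand-in that sleeps instead of making an HTTP call, and the URLs are placeholders.

```python
import asyncio

url_list = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

async def fake_fetch(url):
    # Stand-in for a non-blocking HTTP call; while this task waits,
    # the event loop is free to run the other tasks.
    await asyncio.sleep(0.1)
    return f"response for {url}"

async def main():
    # gather() runs the coroutines concurrently on one thread and
    # returns the results in the order the coroutines were passed in.
    return await asyncio.gather(*(fake_fetch(u) for u in url_list))

responses = asyncio.run(main())
print(responses)
```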

Posted 2021-01-07 20:33 by 渔歌晚唱
