Web Crawler Basics

1. Overview of Web Crawlers

What is a crawler:

The process of writing a program that simulates a browser going online and then letting it crawl/scrape data from the internet.
Simulation: the browser itself is a natural, primitive crawling tool.

Types of crawlers:

General-purpose crawler: scrapes the data of an entire page; the crawling system (crawler program).
Focused crawler: scrapes a specific portion of a page; always built on top of a general-purpose crawler.
Incremental crawler: monitors a website for data updates so that newly published data can be scraped.

Risk analysis

Use crawlers responsibly.
How crawler risks manifest:
The crawler interferes with the normal operation of the target website;
The crawler scrapes specific types of data or information protected by law.
Avoiding the risks:
Strictly follow the robots protocol set by the website;
When working around anti-crawling measures, optimize your own code so that it does not disrupt the normal operation of the target website;
When using or distributing scraped information, review the content; if it includes users' personal information, private data, or others' trade secrets, stop immediately and delete it.

Anti-crawling mechanisms

Counter-anti-crawling strategies
robots.txt protocol: a plain-text protocol that specifies which data may and may not be crawled (see the sketch below).
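A minimal sketch (not from the original notes) of checking robots.txt programmatically with the standard library's urllib.robotparser; the site URL and the user-agent name are illustrative:

import requests  # not needed here, shown only because the rest of the notes use it
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.sogou.com/robots.txt")   # illustrative site
rp.read()

# can_fetch() answers whether the given user agent may crawl the given URL
print(rp.can_fetch("MyCrawler", "https://www.sogou.com/web?query=python"))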

Common request headers

User-Agent: identifies the client making the request
Connection: close
Content-Type
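For illustration, these headers can be attached to a requests call like this (the target URL is a placeholder):

import requests

headers = {
    # identify the request as coming from a regular browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
    # ask the server to close the connection once the response is sent
    "Connection": "close",
}
response = requests.get("https://www.example.com", headers=headers)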

How do you determine whether a page contains dynamically loaded data?

Local search / global search (search one captured response vs. search across all captured requests in the browser's network panel).

What is the first thing to do before crawling an unfamiliar website?
Determine whether the data you want to crawl is dynamically loaded!!! (A rough programmatic check is sketched below.)
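A rough sketch of that check (the URL and keyword are hypothetical placeholders): request the page's own URL and see whether a sample of the target data appears in the raw HTML; if it does not, the data is loaded by a separate request.

import requests

page_url = "https://www.example.com/list"            # hypothetical page
keyword = "text you can see rendered on the page"    # hypothetical sample of the target data

html = requests.get(page_url, headers={"User-Agent": "Mozilla/5.0"}).text
if keyword in html:
    print("Found in the page source: the data is NOT dynamically loaded")
else:
    print("Not in the page source: look for an XHR/ajax request in the dev tools")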

2. Basic Use of the requests Module

The requests module
Concept: a module built on top of network requests; it is used to simulate a browser sending requests.
Coding workflow:
Specify the URL
Send the request
Get the response data (the scraped data)
Persist it to storage
import requests
url = 'https://www.sogou.com'
# the return value is a Response object
response = requests.get(url=url)
# .text returns the response data as a string
data = response.text
with open('./sogou.html',"w",encoding='utf-8') as f:
    f.write(data)

Build a simple web page collector based on Sogou

Fix the garbled-text (encoding) problem

Defeat User-Agent detection

import requests

wd = input('输入key:')
url = 'https://www.sogou.com/web'
# holds the dynamic request parameters
params = {
    'query': wd
}
# the params argument packages the query parameters appended to the URL
# headers defeats the anti-crawling check by spoofing the User-Agent
headers = {
    'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params,headers=headers)
# manually set the response encoding to fix garbled Chinese text
response.encoding = 'utf-8'

data = response.text
filename = wd + '.html'
with open(filename, "w", encoding='utf-8') as f:
    f.write(data)
print(wd, "下载成功")

1. Scraping Douban movie details

Analysis

When the page is scrolled to the bottom, an ajax request is fired and returns a batch of movie data.
Dynamically loaded data: data fetched by a separate, additional request.
Dynamically loaded data produced by ajax
Dynamically loaded data produced by js
import requests
limit = input("排行榜前多少的数据:::")
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the response body deserialized into Python objects
data_list = response.json()

with open('douban.txt', "w", encoding='utf-8') as f:
    for i in data_list:
        name = i['title']
        score = i['score']
        f.write(name + " " + score + "\n")
print("成功")

2. Scraping KFC store location data

import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "青岛",
    "pageIndex": "1",
    "pageSize": "10"
}

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the response body deserialized into Python objects
data_list = response.json()
with open('kedeji.txt', "w", encoding='utf-8') as f:
    for i in data_list["Table1"]:
        name = i['storeName']
        addres = i['addressDetail']
        f.write(name + "," + addres  + "\n")
print("成功")

3. Tencent Classroom (ke.qq.com)

import requests

res = requests.post(
    url='https://ke.qq.com/cgi-proxy/course_list/search_course_list?bkn=&r=0.1427',
    # request payload: pass json= here rather than data=; data= is usually used for form submissions
    json={"word":"python","page":"2","visitor_id":"9283127513403748","finger_id":"f62c588e17fd13645de79684bdcc3017","platform":3,"source":"search","count":24,"need_filter_contact_labels":1},
    headers={
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
        "Referer":"https://ke.qq.com/course/list/python?page=2"
    }
)

print(res.json())
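A side note (a minimal sketch, not part of the original example): json= serializes the dict into a JSON request body (Content-Type: application/json), while data= sends it form-encoded; pick whichever matches what the browser's own request shows in the dev tools. The test URL below is httpbin.org, used purely for illustration:

import requests

payload = {"word": "python", "page": 2}   # illustrative payload

# JSON body, Content-Type: application/json
requests.post("https://httpbin.org/post", json=payload)

# form-encoded body, Content-Type: application/x-www-form-urlencoded
requests.post("https://httpbin.org/post", data=payload)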

4. Bilibili account info and cookies

Fetch your Bilibili account's login info; the cookie here is the one issued after you have logged in.


import requests

res = requests.get(
    url='https://api.bilibili.com/x/member/web/account?web_location=333.33',
    headers={
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
        'Cookie':"buvid3=4D2C694C-3D16-25A9-AD4A-ACB2D65C90C918352infoc; b_nut=1705987918; _uuid=51064FE53-451E-6E74-E16F-6F8B92624F9718713infoc; buvid_fp=c00b689fc9db5c29f662cf2fb6bd01a6; buvid4=18A11919-E414-FBFC-1B8D-8BA88C1BCCB619628-024012305-nX8zov%2FsHs%2FB%2BKaOt7gP%2BA%3D%3D; CURRENT_FNVAL=4048; rpdid=|(J~|~~m)l|J0J'u~|lm)mmRJ; b_lsid=225B1026E_18D34F399DC; bsource=search_baidu; enable_web_push=DISABLE; header_theme_version=CLOSE; csrf_state=f9a3e7fd70aeb8dd0ee9ec51bb9a0e67; bili_ticket=eyJhbGciOiJIUzI1NiIsImtpZCI6InMwMyIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3MDYyNDk2MTksImlhdCI6MTcwNTk5MDM1OSwicGx0IjotMX0.HZcDZSE0m-wZuevYHxD_f7SH-EkBX5ktM86m92q0SP4; bili_ticket_expires=1706249559; SESSDATA=9ba7b60b%2C1721542419%2C457da%2A11CjAb6kWlfz4IljS2QO8Sh_ohT9JXLK_vgAx0PTRmispeHAUur9NoA4gyvptuJ2pC-UcSVjhUajBkblptWHc2ZUNvN2NMMHRJeUdhR3F4SzRQU0tPRjUxMkJObWRzZGdTY3Z1UFk3bk1CRmhFdlo4am0yNnFndi02SzlfaWZ3V3NqaEtCLVhFNDZBIIEC; bili_jct=2312bb4ef19779446149d14e74fabb37; DedeUserID=388839130; DedeUserID__ckMd5=4cfa0a5f7f7d631f; sid=gfnbaeos; PVID=2; home_feed_column=5; browser_resolution=1440-791"
}
)

print(res.json())

#{'code': 0, 'message': '0', 'ttl': 1, 'data': {'mid': 388839130, 'uname': '追梦nan', 'userid': 'bili_43698205152', 'sign': '', 'birthday': '1970-01-01', 'sex': '男', 'nick_free': False, 'rank': '正式会员'}}
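Alternatively (a sketch, assuming you have copied the relevant cookie values out of the logged-in browser), requests accepts a cookies= dict, so the cookie does not have to live inside the header string:

import requests

cookies = {
    "SESSDATA": "<your SESSDATA value>",   # placeholder
    "bili_jct": "<your bili_jct value>",   # placeholder
}
res = requests.get(
    url="https://api.bilibili.com/x/member/web/account",
    headers={"User-Agent": "Mozilla/5.0"},
    cookies=cookies,
)
print(res.json())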

5. Response body formats

res.content   # raw response body as bytes: videos, files, images
res.text      # response body decoded as text
res.json()    # response body parsed as JSON
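A small sketch (the URLs are placeholders) of when each accessor fits:

import requests

# bytes: binary content such as an image
img = requests.get("https://www.example.com/cover.jpg")
with open("cover.jpg", "wb") as f:
    f.write(img.content)

# str: an HTML page
html = requests.get("https://www.example.com/").text

# dict/list: a JSON API
data = requests.get("https://www.example.com/api/list").json()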

3. Data Parsing

Parsing: extracting data according to specified rules.

Purpose: to implement a focused crawler.

Coding workflow of a focused crawler:

Specify the URL
Send the request
Get the response data
Parse the data
Persist it to storage

Ways to parse data:

Regular expressions
bs4
xpath
pyquery (optional extension)

What is the general principle behind data parsing?

Parsing operates on the page source (a collection of HTML tags).

What is the core role of HTML?

Displaying data.

How does HTML display data?

The data HTML displays is always placed inside HTML tags or in their attributes.

General principle:

1. Locate the tag
2. Take its text or take its attribute (see the sketch below)
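A minimal sketch of this two-step principle with BeautifulSoup (the HTML fragment is made up):

from bs4 import BeautifulSoup

html = '<div><a href="https://www.example.com" class="link">example</a></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("a", attrs={"class": "link"})  # step 1: locate the tag
print(tag.text)            # step 2a: take its text      -> example
print(tag.attrs["href"])   # step 2b: take its attribute -> https://www.example.com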

1. Regex parsing

1. Scraping image posts from Qiushibaike

Scrape a single image

import requests

url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# .content returns the data as bytes
img_data = response.content
with open('./123.jpg', "wb") as f:
    f.write(img_data)
print("成功")



Scrape a single page

<div class="thumb">

<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>

import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # request the image URL to get the image bytes
    response = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(response)
print("成功")


Scrape multiple pages

import re
import os
import requests

dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1,5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"正在爬取第{i}页的图片")
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # request the image URL to get the image bytes
        response = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(response)
print("成功")

2. bs4 parsing

Environment setup

pip install beautifulsoup4

How bs4 parsing works

Instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
then call the BeautifulSoup object's properties and methods to locate tags and extract data.

Get a tag by tag name (only the first match is returned)

from bs4 import BeautifulSoup

html_string = """<div>
    <h1 class="item">zz</h1>
    <ul class="item">
        <li>篮球</li>
        <li>足球</li>
    </ul>
    <div id='x3'>
        <span>5xclass.cn</span>
        <a href="www.xxx.com" class='info'>pythonav.com</a>
    </div>
</div>"""

soup = BeautifulSoup(html_string, features="html.parser")

tag = soup.find(name='a')

print(tag)       # the Tag object
print(tag.name)  # tag name: a
print(tag.text)  # tag text: pythonav.com
print(tag.attrs) # tag attributes: {'href': 'www.xxx.com', 'class': ['info']}

Get a tag by attribute (only the first match is returned)

from bs4 import BeautifulSoup
html_string = """<div>
    <h1 class="item">zz</h1>
    <ul class="item">
        <li>篮球</li>
        <li>足球</li>
    </ul>
    <div id='x3'>
        <span>5xclass.cn</span>
        <a href="www.xxx.com" class='info'>pythonav.com</a>
    </div>
</div>"""

soup = BeautifulSoup(html_string, features="html.parser")

tag = soup.find(name='div', attrs={"id": "x3"})

print(tag)

Nested lookup: find a tag first, then search within its child tags.

from bs4 import BeautifulSoup

html_string = """<div>
    <h1 class="item">zz</h1>
    <ul class="item">
        <li>篮球</li>
        <li>足球</li>
    </ul>
    <div id='x3'>
        <span>5xclass.cn</span>
        <a href="www.xxx.com" class='info'>pythonav.com</a>
        <span class='xx1'>zz</span>
    </div>
</div>"""
soup = BeautifulSoup(html_string, features="html.parser")
parent_tag = soup.find(name='div', attrs={"id": "x3"})

child_tag = parent_tag.find(name="span", attrs={"class": "xx1"})

print(child_tag)

Get all matching tags (multiple results)

html_string = """<div>
    <h1 class="item">zz</h1>
    <ul class="item">
        <li>篮球</li>
        <li>足球</li>
    </ul>
    <div id='x3'>
        <span>5xclass.cn</span>
        <a href="www.xxx.com" class='info'>pythonav.com</a>
        <span class='xx1'>zz</span>
    </div>
</div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, features="html.parser")
tag_list = soup.find_all(name="li")  # recursive search: all descendants
# tag_list = soup.find_all(recursive=False)  # direct children only
print(tag_list)

# Output
# [<li>篮球</li>, <li>足球</li>]
html_string = """<div>
    <h1 class="item">zz</h1>
    <ul class="item">
        <li>篮球</li>
        <li>足球</li>
    </ul>
    <div id='x3'>
        <span>5xclass.cn</span>
        <a href="www.xxx.com" class='info'>pythonav.com</a>
        <span class='xx1'>zz</span>
    </div>
</div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_string, features="html.parser")
tag_list = soup.find_all(name="li")   # find all matches
for tag in tag_list:
    print(tag.text)

# Output
篮球
足球

1. Scraping data from Yiche (car.yiche.com)

from bs4 import BeautifulSoup
import requests

url = 'https://car.yiche.com/'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
res = requests.get(url, headers=headers)

soup = BeautifulSoup(res.text,features="html.parser")

tag_list = soup.find_all(name="div",attrs={"class":"item-brand"})

# for tag  in  tag_list:
#     print(tag.attrs["data-name"])

for  tag in tag_list:
    child = tag.find(name='div',attrs={'class':'brand-name'})
    print(child.text)

2. Scraping NetEase Cloud Music data

import requests
from bs4 import BeautifulSoup

res = requests.get(
    url="https://music.163.com/discover/playlist/?cat=%E5%8D%8E%E8%AF%AD",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Referer": "https://music.163.com/"
    }
)

soup = BeautifulSoup(res.text, features="html.parser")

parent_tag = soup.find(name='ul', attrs={"id": "m-pl-container"})

for child in parent_tag.find_all(recursive=False):
    title = child.find(name="a", attrs={"class": "tit f-thide s-fc0"}).text
    image_url = child.find(name='img').attrs['src']
    print(title, image_url)

    # download each cover image
    img_res = requests.get(url=image_url)
    file_name = title.split()[0]
    with open(f"{file_name}.jpg", mode='wb') as f:
        f.write(img_res.content)

3. xpath parsing

Environment setup

pip install lxml

How xpath parsing works

Instantiate an etree object and load the page source data into it,
then call the object's xpath method with different forms of xpath expressions to locate tags and extract data.

Instantiating an etree object

tree = etree.parse(fileName)
tree = etree.HTML(page_text)
The xpath method always returns a list.

Locating tags

tree.xpath("")
A / at the far left of an xpath expression means the tag must be located starting from the root node.
A // at the far left means the tag can be located starting from anywhere in the document.
A // that is not at the far left means any number of levels down.
A / that is not at the far left means exactly one level down.

Locating by attribute: //div[@class='ddd']

Locating by index: //div[@class='ddd']/li[3]   # indexing starts at 1
Locating by index: //div[@class='ddd']//li[2]  # indexing starts at 1

Extracting data (a compact sketch of these expressions follows this list)

Taking text:
tree.xpath("//p[1]/text()"): takes only the tag's direct text content
tree.xpath("//div[@class='ddd']/li[2]//text()"): takes all text content, including that of descendants
Taking an attribute:
tree.xpath('//a[@id="feng"]/@href')
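A compact sketch of the expressions above on a made-up HTML fragment (tag names, classes, and the URL are illustrative):

from lxml import etree

page_text = """
<html><body>
  <ul class="ddd">
    <li>one</li><li>two</li><li>three</li>
  </ul>
  <div class="info">
    <p>hello <b>world</b></p>
    <a id="feng" href="https://www.example.com">link</a>
  </div>
</body></html>"""

tree = etree.HTML(page_text)
print(tree.xpath('//ul[@class="ddd"]/li[3]/text()'))    # ['three']            index starts at 1
print(tree.xpath('//div[@class="info"]/p/text()'))      # ['hello ']           direct text only
print(tree.xpath('//div[@class="info"]/p//text()'))     # ['hello ', 'world']  all text, any depth
print(tree.xpath('//a[@id="feng"]/@href'))              # ['https://www.example.com']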

1. Scraping Qiushibaike

Scrape the author and the post text; note that authors may be anonymous or registered.

from lxml import etree
import requests


url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)

for div in div_list:
    # usernames are either anonymous or registered users
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)


2. Scraping images from a website

from lxml import etree
import requests
import os
dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"

    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        # fix garbled Chinese in the image name
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        response = requests.get(img_src).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(response)
    print(f"第{i}页成功")
