Web scraping

Basic scraping workflow

1. Specify the URL

url = "https://www.aqistudy.cn/historydata/"

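2. Spoof the User-Agent and set the Referer (anti-hotlinking)

Make the request look like it comes from a browser.

headers = {
    # User-Agent spoofing
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
    # Referer: some sites reject requests that lack it (anti-hotlinking)
    'Referer': '.....',
}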

3. Handle the request parameters

data = {
    "page":"",
    "query": ""
}

4. Send the request

import requests

response = requests.post(url=url, data=data, headers=headers)

# For simulated login, send requests through a Session so cookies persist
session = requests.Session()
session.post(url=url, headers=headers)

5. Read the response data

# Page source (text)
data = response.text

# Image or other binary data
data = response.content

# JSON data
data = response.json()

6. Persist the data

# Write text data to the file "filename"
with open("filename", "w", encoding="utf-8") as f:
    f.write(data)

# Write binary data to the file "filename"
with open("filename", "wb") as f:
    f.write(data)
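As a minimal sketch of the two write modes, the snippet below saves a string in text mode and bytes in binary mode. The helper names, the temporary directory, and the sample payloads are assumptions made for illustration; in a real scraper the data would come from response.text or response.content.

```python
import os
import tempfile


def save_text(path, text):
    # Text mode: a str goes in and is encoded as UTF-8 on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def save_bytes(path, payload):
    # Binary mode: bytes are written unchanged, with no encoding step.
    with open(path, "wb") as f:
        f.write(payload)


# Hypothetical sample data standing in for response.text / response.content.
tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "page.html")
bin_path = os.path.join(tmp_dir, "image.bin")
save_text(text_path, "<p>你好</p>")
save_bytes(bin_path, b"\x89PNG\r\n")
```

Passing a str to a file opened with "wb", or bytes to one opened with "w", raises a TypeError, which is a quick way to notice that the wrong response attribute was used.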

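Fixing garbled Chinese text

1. Check the encoding of the page:

Open the browser's developer tools and enter "document.charset" in the console to see the page encoding.

2. Fix:

If the encoding is UTF-8:

response = requests.get(url=url, headers=headers)
response.encoding = "utf-8"  # the key step: set this before reading response.text
tree = etree.HTML(response.text)

If the encoding is GBK:

response = requests.get(url=url, headers=headers)
response.encoding = "GBK"
tree = etree.HTML(response.text)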
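The effect of choosing the wrong encoding can be reproduced without any network access. This is a small sketch using a made-up sample string: the same bytes are decoded once with a mismatched codec and once with the matching one.

```python
# Bytes as a server using GBK would send them (sample text is made up).
raw = "空气质量".encode("gbk")

# Decoding with the wrong codec produces garbled text.
wrong = raw.decode("utf-8", errors="replace")
# Decoding with the codec the server actually used restores the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

Setting response.encoding plays the same role as picking the codec in decode(): it tells requests how to turn the raw bytes into response.text.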

bs4

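How the parsing works

1. Instantiate a BeautifulSoup object and load the page source into it.
2. Call the object's attributes and methods to locate tags and extract data.

Installation

pip install bs4
pip install lxml

Parsing methods and attributes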

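Locating tags

soup.tagName  # returns the first tagName tag in the document

soup.find():

soup.find('tagName')  # same as soup.tagName, e.g. soup.find('div') is soup.div

Locating by attribute:

soup.find('div', class_='song')  # id= and other attributes work the same way

soup.find_all('tagName')  # returns a list of all matching tags

select:

soup.select('selector')  # takes a CSS selector (id, class, tag, ...) and returns a list

Hierarchical selectors:

soup.select('.tang > ul > li > a')  # ">" means one level (direct child)

soup.select('.tang > ul a')  # a space means any number of levels (descendant)

Getting the text inside a tag:

soup.a.text  /  soup.a.string  /  soup.a.get_text()

text / get_text()  # all the text inside the tag, including nested tags

string  # only the text directly inside the tag; None if the tag has more than one child

Getting an attribute value:

soup.a['href']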

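Usage

Instantiating a BeautifulSoup object

Load a local HTML file into the object:

from bs4 import BeautifulSoup

with open('./test.html', 'r', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'lxml')

Load page source fetched from the internet into the object:

page_text = response.text
soup = BeautifulSoup(page_text, 'lxml')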

Example

import os

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    url = "http://www.shicimingju.com/book/sanguoyanyi.html"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
    }
    os.makedirs("三国", exist_ok=True)  # the output directory must exist before writing
    data = requests.get(url=url, headers=headers)
    data.encoding = "utf-8"
    soup = BeautifulSoup(data.text, "lxml")
    # Every chapter link in the table of contents
    all_title = soup.select(".book-mulu>ul a")
    for title in all_title:
        url2 = "https://www.shicimingju.com" + title["href"]
        res = requests.get(url=url2, headers=headers)
        res.encoding = "utf-8"
        detail_soup = BeautifulSoup(res.text, "lxml")
        # Keep the part of the link text after "·" (assumes the separator is present)
        chapter_name = title.text.split("·")[1]
        with open(f"三国/{chapter_name}", "w", encoding="utf-8") as f:
            f.write(detail_soup.select(".chapter_content")[0].text)

xpath

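How the parsing works

1. Instantiate an etree object and load the page source to be parsed into it.
2. Call the object's xpath method with an XPath expression to locate tags and capture content.

Installation

- pip install lxml

Instantiating an etree object

from lxml import etree

1. Load the source of a local HTML file into an etree object:
etree.parse(filePath)  # pass etree.HTMLParser() as the second argument if the file is not well-formed XML
2. Load source fetched from the internet into the object:
etree.HTML(page_text)

XPath expressions

/: at the start of an expression, locate from the root node; elsewhere it means one level.

//: any number of levels; at the start of an expression, locate from anywhere in the document.

# Locating by attribute
//div[@class='song']    general form: //tag[@attrName="attrValue"]

# Locating by index
//div[@class="song"]/p[3]    indexes start at 1

# Getting text
    /text()    the text directly inside the tag
    //text()   all text inside the tag, including nested tags

# Getting an attribute
    /@attrName    e.g. //img/@src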
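The attribute predicate and the 1-based index can be tried out on a tiny document. The sketch below deliberately uses the standard library's xml.etree.ElementTree instead of lxml so that it runs with no installation; ElementTree supports only a subset of XPath (no /text() or /@attr steps), so the text and the attribute are read through .text and .get(). The sample markup is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document.
doc = ET.fromstring(
    "<html><body>"
    "<div class='song'><p>one</p><p>two</p><p>three</p>"
    "<a href='/a.html'><img src='/img/1.jpg'/></a></div>"
    "<div class='other'><p>skip</p></div>"
    "</body></html>"
)

# Attribute predicate plus a 1-based index: the third <p> inside div.song.
third_p = doc.find(".//div[@class='song']/p[3]")

# "//" in the middle of a path matches at any depth below div.song.
img_src = doc.find(".//div[@class='song']//img").get("src")

# Every <p> in the document, in document order.
all_p = [p.text for p in doc.findall(".//p")]

print(third_p.text, img_src, all_p)
```

With lxml the same lookups would be written as tree.xpath('//div[@class="song"]/p[3]/text()') and tree.xpath('//div[@class="song"]//img/@src'), each returning a list.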

Example

Scraping images from pic.netbian.com (彼岸图)

import os

import requests
from lxml import etree

if __name__ == "__main__":
    url = 'https://pic.netbian.com/4k/index_61.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
    }
    os.makedirs("彼岸图", exist_ok=True)  # the output directory must exist before writing
    response = requests.get(url=url, headers=headers)
    response.encoding = "GBK"
    tree = etree.HTML(response.text)
    # The src attribute of every image in the listing
    img_all_url = tree.xpath('//ul[@class="clearfix"]//img/@src')
    for img_url in img_all_url:
        img_full_url = "https://pic.netbian.com" + img_url
        img_content = requests.get(url=img_full_url, headers=headers).content
        # Use the last path segment as the file name
        img_name = img_full_url.split("/")[-1]
        with open(f"彼岸图/{img_name}", "wb") as f:
            f.write(img_content)

    print("over")
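Building the absolute image URL by string concatenation works as long as every src starts with "/". As a hedged alternative, the sketch below uses urllib.parse.urljoin, which also copes with src values that are already absolute. The helper name and the sample src values are assumptions for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

page_url = "https://pic.netbian.com/4k/index_61.html"
# Made-up src values: one site-relative, one already absolute.
srcs = ["/uploads/allimg/a.jpg", "https://pic.netbian.com/uploads/b.jpg"]


def absolute_and_name(base, src):
    # Resolve src against the page URL, then take the last path segment
    # as the local file name.
    full = urljoin(base, src)
    name = posixpath.basename(urlsplit(full).path)
    return full, name


results = [absolute_and_name(page_url, s) for s in srcs]
for full, name in results:
    print(full, "->", name)
```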

Simulated login

1. Create a session object:

session = requests.Session()

2. Send the login POST request through the session object:

session.post(url=url, headers=headers)

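Proxy IPs

Proxies get around the anti-scraping measure of banning an IP.

Proxy servers

Purpose:

1. Get past access limits placed on your own IP

2. Hide your real IP

Proxy types:

http: applied to URLs that use the http protocol

https: applied to URLs that use the https protocol

Proxy anonymity levels:

Transparent: the server knows the request came through a proxy and also knows the real IP

Anonymous: the server knows a proxy was used but not the real IP

Elite (high anonymity): the server knows neither that a proxy was used nor the real IP

Usage

# requests uses the entry whose key matches the scheme of the requested URL
proxies = {
    "http": "http://49.85.112.173:7890"
}

requests.get(url=url, headers=headers, proxies=proxies, params=params)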
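Because the proxy entry is chosen by the scheme of the target URL, a proxies dict that contains only an "http" key has no effect on an https URL. The helper below is a simplified, hypothetical model of that lookup (requests itself also understands host-specific and "all" keys); it is only meant to show why both keys are usually supplied.

```python
from urllib.parse import urlsplit


def pick_proxy(url, proxies):
    # Simplified model: look the proxy up by the URL's scheme only.
    scheme = urlsplit(url).scheme
    return proxies.get(scheme)


http_only = {"http": "http://49.85.112.173:7890"}
both = {
    "http": "http://49.85.112.173:7890",
    "https": "http://49.85.112.173:7890",
}

print(pick_proxy("https://www.baidu.com/s", http_only))  # no entry for https
print(pick_proxy("https://www.baidu.com/s", both))
```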

Example:

import requests

url = 'https://www.baidu.com/s'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}

# The URL above is https, so an "https" entry is needed for the proxy to be used
proxies = {
    "http": "http://49.85.112.173:7890",
    "https": "http://49.85.112.173:7890",
}
wd = input("Enter a search term: ")
params = {
    'ie': 'utf-8',
    'f': '8',
    'rsv_bp': 1,
    'rsv_idx': 1,
    'tn': 'baidu',
    'wd': wd,
}
page_text = requests.get(url=url, headers=headers, proxies=proxies, params=params).text
print(page_text)
print("over")

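High-performance asynchronous scraping

Use asynchronous techniques to make data scraping fast.

Multithreading, multiprocessing (not recommended)

Pros: each blocking operation can get its own thread or process, so blocking operations run asynchronously.

Cons: threads and processes cannot be created without limit.

Thread pools, process pools (use in moderation)

Pros: they reduce how often the system creates and destroys threads or processes, which lowers overhead.

Cons: the number of threads or processes in a pool has an upper limit.

Thread pool example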

# Scraping videos from pearvideo.com (梨视频)
import os
import time

import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread-based Pool

start_time = time.time()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}

url = "https://www.pearvideo.com/category_5"
response = requests.get(url=url, headers=headers).text
tree = etree.HTML(response)
all_href = tree.xpath('//ul[@id="categoryList"]/li/div/a/@href')
video_all_list = []
for href in all_href:
    video_url = "https://www.pearvideo.com/" + href
    video_dic = {"contId": href.split("_")[1], "url": video_url}
    video_all_list.append(video_dic)


def down_img(data_list):
    # data_list is one dict from video_all_list: {"contId": ..., "url": ...}
    json_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36",
        "Referer": data_list["url"]  # anti-hotlinking
    }
    contId = data_list["contId"]
    video_json_url = f"https://www.pearvideo.com/videoStatus.jsp?contId={contId}"
    response_json = requests.get(url=video_json_url, headers=json_headers).json()
    mp4_url = response_json["videoInfo"]["videos"]["srcUrl"]
    cont = f"cont-{contId}"
    # Replace the start of the file name (the text before the first "-") with
    # "cont-<contId>" to get the real video URL
    real_url = mp4_url.replace(mp4_url.split("-")[0].split("/")[-1], cont)
    print(f"Downloading {contId}")
    mp4_content = requests.get(url=real_url, headers=json_headers).content
    with open(f"梨视频/{contId}.mp4", "wb") as f:
        f.write(mp4_content)
        print(f"Finished {contId}")


os.makedirs("梨视频", exist_ok=True)  # the output directory must exist before writing
pool = Pool(8)  # a thread pool with 8 worker threads
pool.map(down_img, video_all_list)
pool.close()  # no more tasks will be submitted
pool.join()  # block the main thread until every worker has finished
now = time.time()
print(f"Elapsed: {now - start_time}")

"""
How down_img receives its argument:

pool.map(down_img, video_all_list) calls down_img once for each element of
video_all_list, roughly like

    for data_list in video_all_list:
        down_img(data_list)

but spread across the pool's threads. Each data_list is therefore a dict, so
its values can be read by key, as in data_list["url"].
"""

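The benefit of the pool is easiest to see with the network taken out. In this sketch a hypothetical fake_download function sleeps for 0.2 seconds in place of a real request; eight such calls run on eight threads, so the total time stays close to one call rather than eight.

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool


def fake_download(item):
    # Stand-in for a blocking network call.
    time.sleep(0.2)
    return f"{item['contId']}.mp4"


# Made-up work items shaped like the dicts a scraper might build.
items = [{"contId": str(n)} for n in range(8)]

start = time.time()
with Pool(8) as pool:
    # map() hands each dict to fake_download and keeps the input order.
    names = pool.map(fake_download, items)
elapsed = time.time() - start

print(names)
print(f"elapsed: {elapsed:.2f}s")
```

Run sequentially, the same eight calls would take about 1.6 seconds.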
Coroutines

1. Mark a function with async; calling it returns a coroutine object.

2. Wrap the coroutine in a Task object and add it to the event loop's task list to wait for execution.

3. Create an event loop object.

4. Register the tasks with the loop and start the loop.

import asyncio
import time

import aiohttp

start = time.time()
urls = ['http://127.0.0.1:5000/bobo' for _ in range(10)]
print(urls)


async def get_page(url):  # async marks the function as a coroutine function
    async with aiohttp.ClientSession() as session:
        # get() and post() accept:
        # headers, params/data, proxy='http://ip:port'
        async with session.get(url) as response:
            # text() returns the response body as a string
            # read() returns the response body as bytes
            # json() returns the parsed JSON object
            #
            # Note: reading the response body must be awaited
            page_text = await response.text()
            print(page_text)


async def main():
    tasks = []
    for url in urls:
        c = get_page(url)  # returns a coroutine object
        task = asyncio.ensure_future(c)  # wrap the coroutine in a Task
        tasks.append(task)
    await asyncio.wait(tasks)  # wait until every task has finished


# Method 1: create the event loop yourself, then run main() on it
# loop = asyncio.new_event_loop()
# loop.run_until_complete(main())
# loop.close()

# Method 2 (Python 3.7+): shorthand for method 1. asyncio.run creates the
# loop, runs main() to completion, and closes the loop.
asyncio.run(main())

end = time.time()

print('Total time:', end - start)
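The same pattern can be tried without a local server or aiohttp. In this sketch asyncio.sleep stands in for waiting on a response, and fake_fetch is a hypothetical name; ten coroutines that each wait 0.2 seconds finish in roughly 0.2 seconds overall because they wait concurrently.

```python
import asyncio
import time


async def fake_fetch(i):
    # asyncio.sleep stands in for awaiting a network response.
    await asyncio.sleep(0.2)
    return f"page {i}"


async def main():
    # gather schedules all coroutines and returns results in input order.
    return await asyncio.gather(*(fake_fetch(i) for i in range(10)))


start = time.time()
pages = asyncio.run(main())
elapsed = time.time() - start

print(pages)
print(f"elapsed: {elapsed:.2f}s")
```

Replacing asyncio.sleep with time.sleep would block the event loop and make the calls run one after another, which is why blocking libraries such as requests are not used inside coroutines.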

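Locating and interacting with elements

from time import sleep

from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# bro is a webdriver instance; see the selenium section for how it is created

# Run a piece of JavaScript
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')

# Click the search button
bro.find_element(By.CSS_SELECTOR, '.btn-search').click()
# Go back
bro.back()
sleep(2)
# Go forward
bro.forward()

# Interacting with an element
search_input = bro.find_element(By.ID, 'q')
search_input.send_keys('Iphone')  # types "Iphone" into the element if it is an input

# Action chains
action = ActionChains(bro)
action.click_and_hold(div)    # press and hold on an element located earlier
action.move_by_offset(17, 0)  # move 17 px along x and 0 along y
action.release()              # release the mouse button
action.perform()              # run the queued actions in order

# An element inside an iframe can only be located after switching into that iframe
bro.switch_to.frame('iframeResult')  # by the iframe's id or name; changes the scope of element lookups
bro.switch_to.frame(iframe_element)  # or by an iframe element located earlier

# Close the browser: quit() closes every window and ends the session,
# close() closes only the current window
bro.quit()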

selenium

from time import sleep

from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.chrome.service import Service

# One options object carries both sets of flags
options = ChromeOptions()
# Run without a visible browser window (headless)
options.add_argument('--headless')
options.add_argument('--disable-gpu')
# Reduce the risk of being detected as an automated browser
options.add_experimental_option('excludeSwitches', ['enable-automation'])

# Selenium 4 style; with Selenium 3 pass executable_path='./chromedriver' instead of service
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver'), options=options)

# In headless mode no browser window opens (PhantomJS was an older way to do this)
bro.get('https://www.baidu.com')

print(bro.page_source)
sleep(2)
bro.quit()

Simulated login to 12306

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support import wait

# Selenium 4 style; with Selenium 3 pass the driver path directly
driver = webdriver.Chrome(service=Service(executable_path=r"chromedriver_win32/chromedriver.exe"))
driver.get("https://kyfw.12306.cn/otn/resources/login.html")
driver.find_element(By.XPATH, '//*[@id="toolbar_Div"]/div[2]/div[2]/ul/li[2]/a').click()
driver.find_element(By.ID, "J-userName").send_keys("15237674912")
time.sleep(0.5)
driver.find_element(By.ID, 'J-password').send_keys("*******")
driver.find_element(By.ID, 'J-login').click()
# Hide navigator.webdriver so the page does not disable the slider under selenium
script = 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined,});'
driver.execute_script(script)
# Wait up to 5 seconds for the slider button (class nc_iconfont.btn_slide) to appear
slide_btn = wait.WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'nc_iconfont.btn_slide')))
# click_and_hold: press the left mouse button on the element without releasing it
ActionChains(driver).click_and_hold(on_element=slide_btn).perform()
# move_by_offset: drag right by x=300, y=0
ActionChains(driver).move_by_offset(xoffset=300, yoffset=0).perform()
# Release the mouse button to finish the drag
ActionChains(driver).release().perform()

12306 login (older version) with Chaojiying CAPTCHA recognition

# The code below is the sample client provided by Chaojiying
import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: question type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
                          headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose recognition result is being reported as wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


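# chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')   # User Center >> Software ID: generate one and use it in place of 96001
# im = open('12306.jpg', 'rb').read()                                      # path of the local image file, in place of a.jpg; on Windows the path may need //
# print(chaojiying.PostPic(im, 9004)['pic_str'])
# The code above is the sample client provided by Chaojiying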

# Open the login page with selenium
import time

from PIL import Image
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style; with Selenium 3 pass executable_path='./chromedriver' instead of service
bro = webdriver.Chrome(service=Service(executable_path='./chromedriver'))
bro.get('https://kyfw.12306.cn/otn/login/init')
time.sleep(1)

# save_screenshot captures the current page and saves it to a file
bro.save_screenshot('aa.png')

# Find the top-left and bottom-right corners of the CAPTCHA image (this fixes the crop region)
code_img_ele = bro.find_element(By.XPATH, '//*[@id="loginForm"]/div/ul[2]/li[4]/div/div/div[3]/img')
location = code_img_ele.location  # x, y of the image's top-left corner
print('location:', location)
size = code_img_ele.size  # width and height of the CAPTCHA element
print('size:', size)
# Top-left and bottom-right coordinates
rangle = (
    int(location['x']), int(location['y']), int(location['x'] + size['width']), int(location['y'] + size['height']))
# The CAPTCHA region is now determined

i = Image.open('./aa.png')
code_img_name = './code.png'
# crop cuts the image down to the given region
frame = i.crop(rangle)
frame.save(code_img_name)

# Submit the CAPTCHA image to Chaojiying for recognition
chaojiying = Chaojiying_Client('bobo328410948', 'bobo328410948', '899370')  # User Center >> Software ID: generate one and use it in place of 96001
im = open('code.png', 'rb').read()  # path of the local image file
# Call PostPic once and reuse the result; every call submits the image again
result = chaojiying.PostPic(im, 9004)['pic_str']
print(result)
all_list = []  # coordinates of the points to click: [[x1,y1],[x2,y2]]
if '|' in result:
    list_1 = result.split('|')
    count_1 = len(list_1)
    for i in range(count_1):
        xy_list = []
        x = int(list_1[i].split(',')[0])
        y = int(list_1[i].split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
else:
    x = int(result.split(',')[0])
    y = int(result.split(',')[1])
    xy_list = []
    xy_list.append(x)
    xy_list.append(y)
    all_list.append(xy_list)
print(all_list)
# Walk the list and use an action chain to click at each (x, y) offset on the CAPTCHA image.
# The offsets are assumed to be measured from the image's top-left corner; newer
# Selenium versions may measure them from the element's center instead.
for l in all_list:
    x = l[0]
    y = l[1]
    ActionChains(bro).move_to_element_with_offset(code_img_ele, x, y).click().perform()
    time.sleep(0.5)

bro.find_element(By.ID, 'username').send_keys('www.zhangbowudi@qq.com')
time.sleep(2)
bro.find_element(By.ID, 'password').send_keys('bobo_15027900535')
time.sleep(2)
bro.find_element(By.ID, 'loginSub').click()
time.sleep(30)
bro.quit()
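The coordinate handling in the script can be isolated into two small pure functions, which makes it testable without a browser. This is a sketch: parse_points and crop_box are hypothetical helper names, and the input format ("x,y" pairs separated by "|") is taken from how the script above splits the returned pic_str.

```python
def parse_points(pic_str):
    # "x1,y1|x2,y2" -> [[x1, y1], [x2, y2]]. split("|") on a string with no
    # "|" returns a one-element list, so a single point needs no special case.
    points = []
    for pair in pic_str.split("|"):
        x, y = pair.split(",")
        points.append([int(x), int(y)])
    return points


def crop_box(location, size):
    # (left, top, right, bottom) in the order Image.crop expects.
    return (
        int(location["x"]),
        int(location["y"]),
        int(location["x"] + size["width"]),
        int(location["y"] + size["height"]),
    )


# Made-up sample values.
print(parse_points("10,20|30,40"))
print(parse_points("5,6"))
print(crop_box({"x": 100, "y": 50}, {"width": 300, "height": 190}))
```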

The Scrapy framework

Installation

   - mac or linux:
          pip install scrapy
   - windows:
          pip install wheel
          # Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
          # Install Twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
          pip install pywin32
          pip install scrapy

Basic commands

Create a project:
    scrapy startproject xxxPro
    cd xxxPro  # switch into the project
Create a spider file in the spiders subdirectory:
    scrapy genspider spiderName www.xxx.com
Run the project:
    scrapy crawl spiderName

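Persistent storage

Via the terminal

Using a terminal command:
        - Requirement: only what the parse method returns can be stored in a local file
        - Note: the output file type can only be one of 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
        - Command: scrapy crawl xxx -o filePath
        - Pros: concise, efficient, and convenient
        - Cons: fairly limited (data can only be stored in files with the listed extensions)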

Via pipelines

Define the item fields in items.py

import scrapy
# Declares two fields: content and time

class DuanziproItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()
    time = scrapy.Field()

Store the parsed values in an item object

duanzi.py

item = DuanziproItem()
item['time'] = time
item['content'] = content

Write the storage code in the pipeline file

pipelines.py

class DuanziproPipeline:
    fp = None

    # Hook method: called only once, when the spider starts
    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./duanzi.txt', 'w', encoding='utf-8')

    # Handles item objects
    # Receives each item the spider file submits
    # Called once for every item received
    def process_item(self, item, spider):
        time = item['time']
        content = item['content']

        self.fp.write(time + ':' + content + '\n')

        return item  # passed on to the next pipeline class to be executed

    def close_spider(self, spider):
        print('Spider finished!')
        self.fp.close()

Enable the pipeline in the settings file settings.py

ITEM_PIPELINES = {
    'duanziPro.pipelines.DuanziproPipeline': 300,
}
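The order in which the three pipeline hooks are called can be imitated in plain Python. In this sketch run_pipeline is a hypothetical stand-in for the Scrapy engine (it is not a Scrapy API), the items are ordinary dicts with made-up values, and the output goes to a temporary file.

```python
import os
import tempfile


class TextFilePipeline:
    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once, before any item arrives.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per item; returning the item passes it along.
        self.fp.write(f"{item['time']}:{item['content']}\n")
        return item

    def close_spider(self, spider):
        # Called once, after the last item.
        self.fp.close()


def run_pipeline(pipeline, items, spider=None):
    # Hypothetical driver that mimics the call order: open, process each, close.
    pipeline.open_spider(spider)
    try:
        return [pipeline.process_item(item, spider) for item in items]
    finally:
        pipeline.close_spider(spider)


out_path = os.path.join(tempfile.mkdtemp(), "duanzi.txt")
items = [
    {"time": "2022-03-14", "content": "first"},
    {"time": "2022-03-15", "content": "second"},
]
passed_on = run_pipeline(TextFilePipeline(out_path), items)
```

The number attached to each class in ITEM_PIPELINES is its priority: lower numbers run first, and each class receives whatever the previous one returned from process_item.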

Storing to a database

duanzi.py

import scrapy
from ..items import DuanziproItem


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://duanzixing.com/']

    def parse(self, response):
        all_content = response.xpath('//article[@class="excerpt"]//h2/a/text()').extract()
        all_time = response.xpath('//p[@class="meta"]/time/text()').extract()
        for i in range(0, len(all_time)):
            time = all_time[i]
            content = all_content[i]
            item = DuanziproItem()
            item['time'] = time
            item['content'] = content
            # Submit the item object to the pipeline
            yield item

Define the item fields in items.py

import scrapy
# Declares two fields: content and time

class DuanziproItem(scrapy.Item):
    # define the fields for your item here like:
    content = scrapy.Field()
    time = scrapy.Field()

Add a class to pipelines.py

import pymysql


class QiubaiproPipeline(object):
    conn = None  # MySQL connection object
    cursor = None  # MySQL cursor object

    def open_spider(self, spider):
        print('Spider started')

        # Connect to the database
        # host is the IP address of the database server (127.0.0.1 is this machine;
        # run ipconfig on the command line to see the machine's other addresses)
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='', db='duanzi', charset='utf8')
        # Get a cursor once and reuse it for every item
        self.cursor = self.conn.cursor()

    # Receives each item submitted by the spider file and persists its data
    # item: the item object that was received
    def process_item(self, item, spider):
        # Insert the row; %s placeholders let the driver escape the values
        sql = 'insert into db1(time, content) values(%s, %s)'
        try:
            self.cursor.execute(sql, (item['time'], item['content']))
            # Commit the transaction
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()

        return item

    # Called only once, when the spider finishes
    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
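Scraped text often contains quotes, which break SQL that is assembled with string formatting. The sketch below shows the placeholder approach on an in-memory SQLite database from the standard library, used here only as a stand-in so the example runs without a MySQL server. The table layout mirrors db1(time, content) and the sample item is made up. SQLite writes placeholders as "?", whereas pymysql writes them as "%s".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db1 (time TEXT, content TEXT)")

# Made-up item whose content contains both kinds of quotes.
item = {"time": "2022-03-14", "content": 'He said "hi"; it\'s fine'}

# The values travel separately from the SQL text, so no manual escaping is needed.
conn.execute(
    "INSERT INTO db1 (time, content) VALUES (?, ?)",
    (item["time"], item["content"]),
)
conn.commit()

rows = conn.execute("SELECT time, content FROM db1").fetchall()
print(rows)
conn.close()
```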

Crawling a whole site with Spider

Define a URL template.

Fill in the template to build each new URL, then request it with parse() as the callback:

yield scrapy.Request(url=new_url, callback=self.parse)

# Scraping pic.netbian.com (彼岸图)
import scrapy


class BiantuSpider(scrapy.Spider):
    name = 'biantu'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://pic.netbian.com/index_1.html']
    # URL template
    url = "https://pic.netbian.com/index_%d.html"
    page_num = 2

    def parse(self, response):
        alt_list = response.xpath('//ul[@class="clearfix"]/li/a/img/@alt').extract()
        for title in alt_list:
            print(title)

        if self.page_num <= 5:
            new_url = self.url % self.page_num
            self.page_num += 1
            # Request the next page and parse it with this same method
            yield scrapy.Request(url=new_url, callback=self.parse)
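The sequence of URLs that the spider walks through can be previewed without running Scrapy. In this small sketch page_urls is a hypothetical helper that fills the same "%d" template for pages 2 through 5, the pages the spider requests after its start URL.

```python
url_template = "https://pic.netbian.com/index_%d.html"


def page_urls(template, first, last):
    # Yield one URL per page number, first and last included.
    for page in range(first, last + 1):
        yield template % page


urls = list(page_urls(url_template, 2, 5))
for u in urls:
    print(u)
```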

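The five core components

Engine (Scrapy)
Handles the data flow of the whole system and triggers events (the core of the framework).
Scheduler
Accepts requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to fetch): it decides which URL is fetched next and removes duplicate URLs.
Downloader
Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).
Spiders
The spiders do the main work: they extract the information you need from specific pages, the so-called items. They can also extract links so that Scrapy goes on to fetch the next page.
Item Pipeline
Processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a page is parsed by a spider, its items are sent to the item pipeline and processed by several components in a specific order.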
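The scheduler's two jobs, queueing requests and dropping duplicates, can be modelled in a few lines. ToyScheduler below is a hypothetical teaching model, not Scrapy's implementation: it compares raw URL strings and serves them first-in, first-out, while the real scheduler works with request fingerprints and priorities.

```python
from collections import deque


class ToyScheduler:
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        # Refuse URLs that were already scheduled once.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        # Hand back the oldest pending URL, or None when the queue is empty.
        return self.queue.popleft() if self.queue else None


scheduler = ToyScheduler()
accepted = [
    scheduler.enqueue("https://pic.netbian.com/index_2.html"),
    scheduler.enqueue("https://pic.netbian.com/index_3.html"),
    scheduler.enqueue("https://pic.netbian.com/index_2.html"),  # duplicate
]
first = scheduler.next_request()
second = scheduler.next_request()
third = scheduler.next_request()
print(accepted, first, second, third)
```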

Posted 2022-03-14 16:25 by Gentry-Yang
