Types of Data Parsing for Web Crawlers

I. Regular expression parsing

For details, see: https://www.cnblogs.com/cyberkid/p/15876547.html

II. BS4 parsing (Beautiful Soup 4)

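Definition: BS4 is short for Beautiful Soup 4. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract from it. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.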
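As a minimal sketch of the encoding behavior described above, the snippet below feeds Beautiful Soup some bytes and states their original encoding through the `from_encoding` argument. The GBK-encoded bytes are made up for illustration, and `html.parser` (the parser bundled with Python) is used only so the snippet has no extra dependency; `lxml` could be passed instead.

```python
from bs4 import BeautifulSoup

# Hypothetical page bytes: GBK-encoded, with no charset declaration inside.
raw = '<p>三国演义</p>'.encode('gbk')

# from_encoding states the original encoding of the incoming bytes.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

# Inside the soup, all text is a Unicode str.
print(soup.p.string)

# encode() serializes the document back to bytes, here as UTF-8.
utf8_bytes = soup.encode('utf-8')
print(utf8_bytes)
```

When the input is already a `str` (for example `response.text` from requests), no decoding happens in Beautiful Soup and `from_encoding` is not needed.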

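The four kinds of objects in BS4:

Tag object: a tag in the HTML. Beautiful Soup exposes it as `soup.name`, where name is the name of a tag in the document.
BeautifulSoup object: the whole parsed HTML document; it can be treated as a Tag object.
NavigableString object: the text inside a tag.
Comment object: a special NavigableString. If a tag contains an HTML comment, it holds the comment text without the comment markers.

The BeautifulSoup and Tag objects are the ones used most often.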
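The four object types can be seen side by side in a short sketch. The HTML string is invented for this illustration, and `html.parser` is used as the parser only to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, made up for this illustration.
html = '<div class="song"><p>hello</p><b><!-- a hidden note --></b></div>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p              # Tag: the first <p> element
text = soup.p.string      # NavigableString: the text inside <p>
comment = soup.b.string   # Comment: the comment text without <!-- and -->

for obj in (soup, tag, text, comment):
    print(type(obj).__name__)

# The BeautifulSoup object behaves like a Tag, and a Comment is a
# specialized NavigableString.
print(isinstance(soup, Tag), isinstance(comment, NavigableString))
print(repr(str(comment)))
```

Because a Comment is also a NavigableString, code that collects `.string` values may want to check `isinstance(value, Comment)` before treating the value as visible page text.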

1. Environment setup

2.1.1 Installing `bs4` and `lxml`

pip install bs4   # the Beautiful Soup module
pip install lxml  # the parser

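2. How data parsing works

Instantiating the object:
- 1. Load the data from a local HTML file into the object: fp = open('./test.html','r',encoding='utf-8') soup = BeautifulSoup(fp,'lxml')
- 2. Load page source fetched from the internet into the object: page_text = response.text soup = BeautifulSoup(page_text,'lxml')
Methods and attributes used for parsing:
- soup.tagName: returns the first tagName tag that appears in the document
- soup.find():
  - find('tagName'): for example, find('div') is equivalent to soup.div
  - Locating by attribute:
    - soup.find('div', class_='song'), soup.find('div', id='song'), or any other attribute passed as a keyword argument
- soup.find_all('tagName'): returns all matching tags (as a list)
select:
- select('a CSS selector (id, class, tag, ...)') returns a list.
  - Hierarchy selectors:
  - soup.select('.tang > ul > li > a'): > means exactly one level (direct child)
  - soup.select('.tang > ul a'): a space means any number of levels (descendant)
- Getting the text between tags:
  - soup.a.text/string/get_text()
  - text/get_text(): returns all text inside a tag, including the text of its descendants
  - string: returns the text only when the tag has a single child (a string, or a tag that itself resolves to a single string); otherwise it returns None
- Getting an attribute value from a tag:
  - soup.a['href']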
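The lookups listed above can be tried without a local file. The sketch below uses an inline HTML string invented for this purpose (it only borrows the `song` and `tang` class names from the list), and it assumes `html.parser` as the parser so that nothing beyond `bs4` is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this illustration.
html = '''
<div class="song"><p>plain text</p></div>
<div class="tang">
  <ul>
    <li><a href="/a1" id="first">one</a></li>
    <li><b><a href="/a2">two</a></b></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_a = soup.a                              # first <a> in the document
same_a = soup.find('a', id='first')           # locate by attribute
all_hrefs = [a['href'] for a in soup.find_all('a')]

direct = soup.select('.tang > ul > li > a')   # '>': direct children only
nested = soup.select('.tang > ul a')          # space: any depth

song = soup.find('div', class_='song')
ul = soup.find('ul')
print(song.text)     # text of the tag and all of its descendants
print(song.string)   # works here because <div> has exactly one child
print(ul.string)     # None: <ul> has several children
```

The second link sits inside a `<b>`, so the child selector skips it while the descendant selector finds it.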

Example code:

from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Load the data from a local HTML file into the object
    with open('test.html', 'r', encoding='utf-8') as fp:
        soup = BeautifulSoup(fp, 'lxml')

    # Load page source fetched from the internet into the object
    # page_text = requests.get(url=url, params=param, headers=headers).text
    # soup = BeautifulSoup(page_text, 'lxml')

    # soup.tagName returns the first tagName tag that appears in the HTML
    print(soup.a)

    # find('tagName') is equivalent to soup.tagName
    print(soup.find('a'))
    print(soup.find('div', class_='song'))  # locate by attribute

    # soup.find_all('tagName') returns all matching tags (as a list)
    print(soup.find_all('a'))

    # select('a CSS selector: id (#), class (.), or tag') returns a list
    print(soup.select('.tang'))
    print(soup.select('.tang > ul > li > a')[0])  # hierarchy selector: ">" is one level, a space is any number of levels

    # Get the text inside a tag
    # soup.a.text/string/get_text()
    # text/get_text() return all text inside a tag, including the text of descendants;
    # string returns the text only when the tag has a single child, otherwise None
    print(soup.a.text)

    # Get an attribute value from a tag
    print(soup.a['href'])

BS4 case study:

# Requirement: scrape the full text of Romance of the Three Kingdoms from https://www.shicimingju.com/book/sanguoyanyi.html
import os

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Spoof the User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }
    # Fetch the table-of-contents page
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'

    # Send the request, get the page source, then parse out each chapter title and detail-page URL
    # 1. Instantiate a BeautifulSoup object and load the page source into it
    page_text = requests.get(url=url, headers=headers)
    page_text.encoding = 'utf-8'
    soup = BeautifulSoup(page_text.text, 'lxml')

    # Parse the chapter titles and detail-page URLs
    li_list = soup.select('.book-mulu > ul > li')
    # print(li_list)
    os.makedirs('download', exist_ok=True)  # open() does not create missing directories

    # The with block closes the file even if a request fails midway
    with open('download/三国演义.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.a.string
            detail_url = 'https://www.shicimingju.com/' + li.a['href']
            # Request the detail page and parse out the chapter content
            detail_page_text = requests.get(url=detail_url, headers=headers)
            detail_page_text.encoding = 'utf-8'
            detail_soup = BeautifulSoup(detail_page_text.text, 'lxml')
            detail_content = detail_soup.find('div', class_='chapter_content').text

            # Write the chapter title and content
            fp.write(title + '\n' + detail_content + '\n\n')
            print(title, 'scraped successfully!')

    print('Done!')
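The case study builds each detail URL by string concatenation, which can produce a double slash when an `href` already starts with `/`. A possible alternative, sketched below, is `urllib.parse.urljoin`. The HTML excerpt, its `href` values, and the helper name `parse_chapters` are assumptions made for this illustration; the real page may be structured differently.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.shicimingju.com/book/sanguoyanyi.html'

# Hypothetical excerpt of a table-of-contents page.
html = '''
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
</ul></div>
'''


def parse_chapters(page_source, base_url):
    """Return (title, absolute_url) pairs found in a table-of-contents page."""
    soup = BeautifulSoup(page_source, 'html.parser')
    chapters = []
    for a in soup.select('.book-mulu > ul > li > a'):
        # urljoin resolves the href against the page URL, whether it is
        # root-relative ("/book/...") or relative to the current path.
        chapters.append((a.get_text(strip=True), urljoin(base_url, a['href'])))
    return chapters


chapters = parse_chapters(html, BASE_URL)
for title, url in chapters:
    print(title, url)
```

Keeping the parsing in a function that takes the page source as a string also makes it possible to test the selectors offline, without sending requests.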

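III. `xpath` parsing (recommended)

Definition: XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents, so a crawler can use XPath for information extraction.

XPath's selection capabilities are powerful, and its path expressions are concise. It also provides more than 100 built-in functions for matching strings, numbers, and times and for processing nodes and sequences, so almost any node you want to locate can be selected with XPath. (That count refers to XPath 2.0 and later; lxml implements XPath 1.0, whose core function library is smaller.)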
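To give a feel for the built-in functions, here is a small sketch that calls a few XPath 1.0 functions through lxml. The markup and the example.com URLs are invented for this illustration.

```python
from lxml import etree

# Hypothetical markup, made up for this illustration.
html = etree.HTML('''
<ul>
  <li class="item-0"><a href="https://example.com/a">  first  link </a></li>
  <li class="item-1"><a href="http://example.com/b">second</a></li>
</ul>
''')

# count() returns a number (a float in lxml).
li_count = html.xpath('count(//li)')

# starts-with() filters on the beginning of a string.
https_links = html.xpath('//a[starts-with(@href, "https")]/@href')

# normalize-space() trims and collapses whitespace; given a node set,
# it uses the first node in document order.
clean_text = html.xpath('normalize-space(//a[1])')

# boolean() reports whether a node set is non-empty.
has_item1 = html.xpath('boolean(//li[@class="item-1"])')

print(li_count, https_links, clean_text, has_item1)
```

Note that an expression ending in a function call returns a single value (number, string, or boolean) instead of a list.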

Official documentation: https://www.w3.org/TR/xpath/

XPath usage rules:

1. Environment setup

3.1.1 Install command for Python 3

pip install lxml

2. Usage steps

3.2.1 Loading data

Loading a local HTML file

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
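The snippet above expects a `./test.html` file to exist. A self-contained variant, sketched below, writes a tiny page to a temporary directory first; the page content is made up, and one `<li>` is deliberately left unclosed to show that `etree.HTMLParser()` tolerates it.

```python
import os
import tempfile

from lxml import etree

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.html')
    # Hypothetical page; the second <li> has no closing tag.
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write('<html><body><ul><li>item 0</li><li>item 1</ul></body></html>')

    # etree.parse returns an ElementTree; HTMLParser repairs broken markup.
    tree = etree.parse(path, etree.HTMLParser())
    items = tree.xpath('//li/text()')

print(tree.getroot().tag)
print(items)
```

Without the `etree.HTMLParser()` argument, `etree.parse` uses the strict XML parser, which would reject markup like this.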

Parsing a string as HTML

from lxml import etree

text = '''
<p>
    <ul>
        <li class="item-0"><a href="https://s1.bdstatic.com/">item 0 </a></li>
        <li class="item-1"><a href="https://s2.bdstatic.com/">item 1 </a></li>
        <li class="item-2"><a href="https://s3.bdstatic.com/">item 2 </a></li>
        <li class="item-3"><a href="https://s4.bdstatic.com/">item 3 </a></li>
        <li class="item-4"><a href="https://s5.bdstatic.com/">item 4 </a></li>
        <li class="item-5"><a href="https://s6.bdstatic.com/">item 5 </a></li>
    </ul>     
</p>
'''
# Use etree.HTML to parse the string into an HTML element tree (missing tags are filled in)
html = etree.HTML(text)
# tostring() returns bytes; decode() converts them to str
s = etree.tostring(html).decode()
print(s)
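What "filling in missing tags" means can be checked on a small fragment. In this sketch the fragment (a single `<li>` with its closing tag left out) is an illustration made up for the purpose; it reuses one URL from the example above.

```python
from lxml import etree

# Fragment with no <html>/<body> wrapper and no closing </li>.
fragment = '<li class="item-0"><a href="https://s1.bdstatic.com/">item 0</a>'
root = etree.HTML(fragment)

# etree.HTML returns the root <html> element and adds <body> around the fragment.
print(root.tag)
print(root[0].tag)

# encoding='unicode' makes tostring() return str directly instead of bytes.
serialized = etree.tostring(root, encoding='unicode')
print(serialized)
```

This is why absolute paths in the following sections start with `/html/body/` even though the input string contains neither tag.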

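3.2.2 Absolute path lookup

Getting the content of a tag. Note: to select the a tags themselves, do not add a trailing slash after a, otherwise an error is raised.

# Approach 1: select the elements, then read .text
html_data = html.xpath('/html/body/ul/li/a')
for i in html_data:
    print(i.text)

# Approach 2: select the text nodes directly
html_data = html.xpath('/html/body/ul/li/a/text()')
print(html_data)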
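The two approaches, and the trailing-slash error mentioned in the note, can be compared in one sketch. The markup here is a simplified stand-in (a `<div>` wrapper with two links) made up for illustration, so the absolute path differs from the one used above.

```python
from lxml import etree

# Hypothetical markup, made up for this illustration.
html = etree.HTML(
    '<div><ul>'
    '<li><a href="/x">item 0</a></li>'
    '<li><a href="/y">item 1</a></li>'
    '</ul></div>'
)

# Approach 1 returns Element objects, which also expose attributes.
elements = html.xpath('/html/body/div/ul/li/a')
for a in elements:
    print(a.tag, a.text, a.get('href'))

# Approach 2 returns the text nodes as strings.
texts = html.xpath('/html/body/div/ul/li/a/text()')

# A trailing slash makes the expression invalid, so lxml raises an XPath error.
try:
    html.xpath('/html/body/div/ul/li/a/')
    trailing_slash_ok = True
except etree.XPathError as exc:
    trailing_slash_ok = False
    print('invalid path:', exc)
```

Selecting elements is usually the more flexible choice when both the text and an attribute of the same tag are needed.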

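Getting tag attributes
- Use @attribute_name as the last step of the path to get an attribute's value for the a tags on the given path; iterate over the result to print each value.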
```
html = etree.HTML(text)
html_data = html.xpath('/html/body/ul/li/a/@href')
for i in html_data:
    print(i)
```
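Getting the content of a tag with a given attribute value
- xpath() always returns a list (Element objects for element paths, strings for text() and @attribute paths), so you still need to iterate over it to read the data. The example finds the text of the a tag on the absolute path whose href equals https://s4.bdstatic.com/.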
```
html = etree.HTML(text)
html_data = html.xpath('/html/body/ul/li/a[@href="https://s4.bdstatic.com/"]/text()')
for i in html_data:
    print(i)
```
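Because a path expression always yields a list, indexing with `[0]` raises `IndexError` when nothing matches. One way to guard against that is a small helper like the one sketched here; `first_or_default` is a hypothetical name introduced for this illustration, not part of lxml, and the `nowhere.invalid` URL is made up.

```python
from lxml import etree

html = etree.HTML('<ul><li><a href="https://s4.bdstatic.com/">item 3</a></li></ul>')


def first_or_default(node, path, default=None):
    """Hypothetical helper: first XPath match, or default when nothing matches."""
    matches = node.xpath(path)
    return matches[0] if matches else default


hit = first_or_default(html, '//a[@href="https://s4.bdstatic.com/"]/text()')
miss = first_or_default(html, '//a[@href="https://nowhere.invalid/"]/text()', default='')
element = first_or_default(html, '//a')

# text() results are str objects; element paths give Element objects.
print(repr(hit), repr(miss), isinstance(hit, str), etree.iselement(element))
```

In a crawler this keeps one missing field on a page from aborting the whole run.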

3.2.3 Relative path lookup (commonly used)

Find the text of the a tags under all li tags

html = etree.HTML(text)
html_data = html.xpath('//li/a/text()')  # a leading // matches at any depth in the document
print(html_data)

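Find the href attribute values of the a tags under li tags using a relative path. Note that a//@href also collects href attributes from any descendants of a; a/@href is enough to read the attribute of a itself.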
```
html = etree.HTML(text)
html_data = html.xpath('//li/a//@href')
print(html_data)
```
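The difference between `/@href` and `//@href` only shows up when a descendant of the a tag carries the same attribute. The sketch below uses contrived markup (a `<span>` with its own `href`) invented purely to make that difference visible.

```python
from lxml import etree

# Contrived markup: the <span> inside <a> carries its own href attribute.
html = etree.HTML(
    '<ul><li><a href="/outer"><span href="/inner">item</span></a></li></ul>'
)

# Single slash: only the attribute of <a> itself.
own = html.xpath('//li/a/@href')

# Double slash: <a> itself plus every descendant of <a>.
with_descendants = html.xpath('//li/a//@href')

print(own)
print(with_descendants)
```

On typical pages, where links have no nested `href` attributes, both expressions return the same list, which is why either form appears to work.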

Find the text of the a tag whose href attribute equals https://s4.bdstatic.com/

html = etree.HTML(text)
html_data = html.xpath('//li/a[@href="https://s4.bdstatic.com/"]/text()')
print(html_data)

Attribute matching: use @ inside a predicate to filter nodes by attribute
```
//li[@class="item-5"]
```

Attribute retrieval: used as the last step of a path, @ returns the attribute values of the nodes directly

result = html.xpath('//li/a/@href')
print(result)
# Output: ['https://s1.bdstatic.com/', 'https://s2.bdstatic.com/', 'https://s3.bdstatic.com/', 'https://s4.bdstatic.com/', 'https://s5.bdstatic.com/', 'https://s6.bdstatic.com/']

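Getting text

from lxml import etree
# Option 1: text of the a tag only
html_data = html.xpath('//li[@class="item-1"]/a/text()')
print(html_data)
# Option 2: every text node under the li
html_data = html.xpath('//li[@class="item-1"]//text()')   # may also return whitespace such as the line breaks between tags
print(html_data)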
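When `//text()` brings back whitespace along with the real text, the results can be cleaned in Python or collapsed inside XPath. This is a minimal sketch on made-up markup (an `<li>` holding a link and a `<span>` on separate lines).

```python
from lxml import etree

# Hypothetical markup with line breaks and indentation between the tags.
html = etree.HTML('''
<ul>
  <li class="item-1">
    <a href="https://s2.bdstatic.com/">item 1</a>
    <span>new</span>
  </li>
</ul>
''')

li = html.xpath('//li[@class="item-1"]')[0]

# All text nodes under the li; whitespace-only strings may be included.
raw = li.xpath('.//text()')

# Option A: strip each string in Python and drop the empty ones.
cleaned = [t.strip() for t in raw if t.strip()]

# Option B: let XPath collapse everything into one normalized string.
joined = li.xpath('normalize-space(.)')

print(raw)
print(cleaned)
print(joined)
```

Option A keeps the pieces separate, which helps when each piece maps to its own field; Option B is convenient when only the combined text matters.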

Matching multi-valued attributes: an attribute of a node may hold several values, and contains() matches any of them:

from lxml import etree

text = '''
<li class="zxc  asd  wer"><a href="https://s2.bdstatic.com/">1 item</a></li>
<li class="ddd  asd  eee"><a href="https://s3.bdstatic.com/">2 item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "asd")]/a/text()')
print(result)

# Output: ['1 item', '2 item']
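One caveat: `contains()` is a plain substring test, so `"asd"` would also match a class such as `asdf`. A common XPath 1.0 idiom for matching a whole class token is sketched below; the markup is invented for this illustration.

```python
from lxml import etree

# Hypothetical markup: the second class merely starts with "asd".
html = etree.HTML('''
<ul>
  <li class="zxc asd wer"><a href="#">1 item</a></li>
  <li class="asdf"><a href="#">2 item</a></li>
</ul>
''')

# Substring match: "asdf" contains "asd", so both items match.
loose = html.xpath('//li[contains(@class, "asd")]/a/text()')

# Token match: pad the normalized class list with spaces, then search for " asd ".
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " asd ")]/a/text()'
)

print(loose)
print(strict)
```

The padded form is longer, but it behaves like the CSS selector `li.asd`.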

Matching multiple attributes: when a node has several attributes, combine the conditions with and to match them together

from lxml import etree

text = '''
<li class="zxc  asd  wer" name="222"><a href="https://s2.bdstatic.com/">1 item</a></li>
<li class="ddd  zxc  eee" name="111"><a href="https://s3.bdstatic.com/">2 item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "zxc") and @name="111"]/a/text()')
print(result)

# Output: ['2 item']
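Besides `and`, predicates can use `or` and `not()`, and several predicates can be chained. The sketch below extends the example with a third, made-up `<li>` that has no class attribute.

```python
from lxml import etree

# Hypothetical markup; the third <li> has no class attribute.
html = etree.HTML('''
<ul>
  <li class="zxc asd" name="222"><a>1 item</a></li>
  <li class="ddd zxc" name="111"><a>2 item</a></li>
  <li name="333"><a>3 item</a></li>
</ul>
''')

both = html.xpath('//li[contains(@class, "zxc") and @name="111"]/a/text()')
either = html.xpath('//li[@name="222" or @name="333"]/a/text()')
no_class = html.xpath('//li[not(@class)]/a/text()')

# Chained predicates filter one after another, which here equals "and".
chained = html.xpath('//li[contains(@class, "zxc")][@name="111"]/a/text()')

print(both, either, no_class, chained)
```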

XPath case study:

# Requirement: extract all city names from https://www.aqistudy.cn/historydata/
import os

import requests
from lxml import etree

if __name__ == '__main__':
    # Spoof the User-Agent
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    page_text = requests.get(url=url, headers=headers).text
    hottest_city_name = []
    entire_city_name = {}
    # Parse the data
    tree = etree.HTML(page_text)
    hot_li_list = tree.xpath('//div[@class="bottom"]/ul/li')
    whole_ul_list = tree.xpath('//div[@class="all"]/div[2]/ul')
    # Popular cities
    for li in hot_li_list:
        hot_city_name = li.xpath('./a/text()')[0]
        hottest_city_name.append(hot_city_name)
        # print(hot_city_name)

    # All cities: each <ul> holds a group label and the cities in that group
    for ul in whole_ul_list:
        for key in ul.xpath('./div[1]/b/text()'):
            entire_city_name[key] = ul.xpath('./div[2]/li/a/text()')
    # print(entire_city_name)

    os.makedirs('download', exist_ok=True)  # open() does not create missing directories
    with open('download/cities.txt', 'w', encoding='utf-8') as fp:
        fp.write('Popular cities:\n')
        for item in hottest_city_name:
            fp.write('{}\t'.format(item))
        fp.write('\nAll cities:')
        for group, city in entire_city_name.items():
            fp.write('\n{} '.format(group))
            for item in city:
                fp.write('{} '.format(item))

    print('Done!')
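The case study depends on calling `xpath()` on a sub-element with a path that starts with `./`. The sketch below shows why the leading dot matters. The markup is a simplified stand-in invented for this illustration (the `group` class and the city names are assumptions), not the structure of the real page.

```python
from lxml import etree

# Hypothetical grouped list, made up for this illustration.
html = etree.HTML('''
<div class="all">
  <div class="group"><b>A.</b><ul><li><a>Aba</a></li><li><a>Anshan</a></li></ul></div>
  <div class="group"><b>B.</b><ul><li><a>Baoding</a></li></ul></div>
</div>
''')

groups = {}
for group in html.xpath('//div[@class="group"]'):
    label = group.xpath('./b/text()')[0]
    # './' keeps the search inside the current group element.
    groups[label] = group.xpath('./ul/li/a/text()')

# A path starting with '//' ignores the context element and
# searches the whole document again.
first_group = html.xpath('//div[@class="group"]')[0]
everything = first_group.xpath('//li/a/text()')

print(groups)
print(everything)
```

To search at any depth but still only inside the current element, `.//li/a/text()` combines both behaviors.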

References:

XPath tutorial: http://www.w3school.com.cn/xpath/index.asp

Python lxml documentation: http://lxml.de

Official specification: https://www.w3.org/TR/xpath/

posted @ 2022-02-09 20:11 楽仔
