Web Scraping Data Extraction: The lxml Module

7. Installing and Using the lxml Module

The lxml module is a third-party package used to extract data from the XML or HTML response content obtained by sending requests. It must be installed before use.

7.1 lxml模块的安装


pip/pip3 install lxml

Key point: understand how to install the lxml module

 

7.2 What a Scraper Extracts from HTML

  • The text content inside a tag

  • The value of an attribute on a tag

    • For example, extract the value of an a tag's href attribute to get a url, then send further requests to it (both cases are demonstrated in the sketch under 7.3)

7.3 Using the lxml Module

  1. Import the etree library from lxml

    from lxml import etree
  2. Use etree.HTML to convert an html string (bytes or str) into an Element object; the Element object has an xpath method that returns a list of results

    html = etree.HTML(text)
    ret_list = html.xpath("xpath expression string")
  3. The xpath method's return list falls into one of three cases (see the sketch after this list)
    • An empty list: the xpath expression did not match any element
    • A list of strings: the xpath expression matched text content or an attribute value
    • A list of Element objects: the xpath expression matched tags; each Element in the list can be queried with xpath again
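
A minimal sketch of the three cases, using a hypothetical one-line snippet rather than the example from 7.4:

from lxml import etree

html = etree.HTML('<div><a href="link1.html">first item</a></div>')

print(html.xpath('//p'))        # [] -- nothing matched
print(html.xpath('//a/@href'))  # ['link1.html'] -- a list of strings
print(html.xpath('//a'))        # [<Element a at 0x...>] -- a list of Elements

# An Element from the result list supports further, relative xpath calls:
a = html.xpath('//a')[0]
print(a.xpath('./text()'))      # ['first item']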

 

 

7.4 lxml Usage Example

Run the code below and examine the printed output.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree

text = ''' 
<div> 
  <ul> 
    <li class="item-1">
      <a href="link1.html">first item</a>
    </li> 
    <li class="item-1">
      <a href="link2.html">second item</a>
    </li> 
    <li class="item-inactive">
      <a href="link3.html">third item</a>
    </li> 
    <li class="item-1">
      <a href="link4.html">fourth item</a>
    </li> 
    <li class="item-0">
      <a href="link5.html">fifth item</a>
  </ul> 
</div>
'''

# 1. Create the Element object
html = etree.HTML(text)

# 2. Get href and text()
# print(html.xpath("//a[@href='link1.html']/text()"))

# text_list = html.xpath("//a/text()")
# link_list = html.xpath("//a/@href")
# all_list = html.xpath("//a/text() | //a/@href ")
# print(text_list)
# print(link_list)
# print(all_list)

# Naive approach: find each link by the text's index (fails if two links share the same text)
# for text in text_list:
#     myindex = text_list.index(text)
#     link = link_list[myindex]
#     print(text, link)

# Approach 1: pair the texts and links with zip
# for text, link in zip(text_list, link_list):
#     print(text,link)

# Approach 2: iterate over the a Elements and use relative xpath on each
el_list = html.xpath('//a')
for el in el_list:
    # print(el.xpath('./text()'))
    # print(el.xpath('.//text()'))
    # print(el.xpath('text()'))
    # print(el.xpath('//text()'))   # absolute path: matches text from the whole document, not just this element

    print(el.xpath('./@href')[0], el.xpath('./text()')[0])

 

 

8 Exercises

Exercise 1: From the 传智播客 Tieba forum (传智播客贴吧), get the title and href of every post (note: the crawler must move to the next page automatically, all the way to the last page)

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from lxml import etree

"""
HTML代码有部分被注释掉,导致xpath无法识别的解决:
方案一:修改User-Agent,改为low一点的浏览器
self.headers = {
            # "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36" #高端浏览器
            "User-Agent": "Mozilla/4.0 (Macintosh; compatieba; MSIE 5.01; Windows NT 5.0; DigExt)"  # 低端浏览器
        }
# UA为高端浏览器,获得的HTML代码可能有部分被注释掉,而高端浏览器仍然可以成功渲染,但是我们通过xpath获取数据时无法获取被注释的数据,所以获得的数据长度为0
# UA为低端浏览器时,获得的HTML没有被注释掉,xpath可以直接获取所有数据,所以获得的长度不为0

方案二:继续使用高端浏览器,对获得的byte类型的数据进行处理
data = response.content.decode().replace("<!--","").replace("-->","")
"""


class Tieba(object):

    def __init__(self,name):
        self.url = "https://tieba.baidu.com/f?ie=utf-8&kw={}&fr=search".format(name)
        print(self.url)

        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
            # "User-Agent": "Mozilla/4.0 (Macintosh; compatieba; MSIE 5.01; Windows NT 5.0; DigExt)"
        }

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        # save the raw page locally to make debugging xpath rules easier
        with open('temp.html', 'wb') as f:
            f.write(response.content)
        return response.content

    def parse_data(self, data):
        # Strip the comment markers, then create the Element object
        data = data.decode().replace("<!--", "").replace("-->", "")
        html = etree.HTML(data)
        el_list = html.xpath('//li[@class=" j_thread_list clearfix thread_item_box"]/div/div[2]/div[1]/div[1]/a')

        data_list = []
        for el in el_list:
            temp = {}
            temp["title"] = el.xpath("./text()")[0]
            temp["href"] = 'https://tieba.baidu.com' + el.xpath("./@href")[0]
            data_list.append(temp)

        # Get the url of the next page
        try:
            next_url = 'https:' + html.xpath('//a[contains(text(),"下一页>")]/@href')[0]
        except IndexError:  # no "next page" link on the last page
            next_url = None

        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        # 1. start url
        # 2. headers (both set in __init__)
        next_url = self.url

        while True:
            # 3. Send the request and get the response
            data = self.get_data(next_url)
            # 4. Extract data from the response (the records and the next-page url)
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)

            print('next_url:', next_url)
            # 5. Stop once there is no next page
            if next_url is None:
                break


if __name__ == '__main__':
    tieba = Tieba('传智播客')
    tieba.run()
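
The commented-out HTML problem described in the docstring can be reproduced in isolation. A minimal sketch, using hypothetical markup rather than the real Tieba response:

from lxml import etree

raw = b'<div><!--<ul><li><a href="/p/1">post 1</a></li></ul>--></div>'

# lxml parses the hidden list as a comment node, so xpath finds nothing:
print(etree.HTML(raw.decode()).xpath('//a/@href'))  # []

# Stripping the comment markers first exposes the markup to xpath:
cleaned = raw.decode().replace('<!--', '').replace('-->', '')
print(etree.HTML(cleaned).xpath('//a/@href'))       # ['/p/1']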

 

Key point: master using xpath syntax in the lxml module to locate elements and extract attribute values or text content

 

9 Using the etree.tostring Function of the lxml Module

> Run the code below and compare the printed output with the original html string

from lxml import etree
html_str = ''' <div> <ul>
<li class="item-1"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul> </div> '''

html = etree.HTML(html_str)

handled_html_str = etree.tostring(html).decode()
print(handled_html_str)

Observations and Conclusions

Compared with the original string, the printed result shows:

  1. The missing closing li tag is completed automatically

  2. The html, body and other wrapper tags are added automatically

Output

<html><body><div> <ul> 
<li class="item-1"><a href="link1.html">first item</a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> 
</li></ul> </div> </body></html>

 

Conclusions

  • lxml.etree.HTML(html_str) auto-completes missing tags

  • The lxml.etree.tostring function serializes an Element object back into an html string

  • A scraper that uses lxml to extract data should write its xpath rules against the output of lxml.etree.tostring, because that reflects the tree lxml actually parsed (see the sketch below)
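
A minimal sketch of that workflow, dumping the auto-completed tree before writing any xpath rules:

from lxml import etree

# a fragment with a missing closing </li> tag
html = etree.HTML('<li class="item-0"><a href="link5.html">fifth item</a>')

# serialize what lxml actually parsed; note the added </li> and the
# <html>/<body> wrappers that xpath rules must account for
print(etree.tostring(html).decode())
# <html><body><li class="item-0"><a href="link5.html">fifth item</a></li></body></html>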



Key point: master the use of the etree.tostring function in the lxml module

 
