Data Extraction for Web Scrapers: the lxml Module
7. Installing and Using the lxml Module
lxml is a third-party module for extracting data from the XML or HTML response content returned by HTTP requests; it must be installed before use.

7.1 Installing lxml

```
pip/pip3 install lxml
```
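To confirm the install succeeded, a quick sanity check (a minimal sketch; `LXML_VERSION` is lxml's built-in version tuple):

```python
# Import check: fails with ImportError if lxml is not installed
from lxml import etree

print(etree.LXML_VERSION)  # e.g. (4, 9, 3, 0)
```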
Key point: know how to install the lxml module
7.2 What a Scraper Extracts from HTML
- the text content inside tags
- the values of tag attributes
  - for example, the value of an a tag's href attribute, which yields a URL for a follow-up request (see the sketch below)
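A minimal sketch of that last point, using a made-up page URL and an inline HTML snippet rather than a real response:

```python
from urllib.parse import urljoin

from lxml import etree

page_url = "https://example.com/list/index.html"  # assumed URL of the fetched page
html = etree.HTML('<a href="detail/42.html">detail page</a>')  # stand-in for the response body

relative_href = html.xpath('//a/@href')[0]
# Resolve the relative href against the page URL to get the next request target
print(urljoin(page_url, relative_href))  # https://example.com/list/detail/42.html
```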
7.3 Using lxml
- Import the etree library from lxml:

  ```python
  from lxml import etree
  ```

- Use etree.HTML to convert an HTML string (bytes or str) into an Element object; the Element object has an xpath method that returns a list of results:

  ```python
  html = etree.HTML(text)
  ret_list = html.xpath("xpath expression string")
  ```

- The list the xpath method returns falls into three cases (demonstrated in the sketch after this list):
  - a list of strings: the expression matched text content or attribute values
  - a list of Element objects: the expression matched tags; each Element supports further xpath calls
  - an empty list: the expression matched nothing
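A small sketch of the three cases against a throwaway HTML snippet:

```python
from lxml import etree

html = etree.HTML('<div><a href="link1.html">first item</a></div>')

# Case 1: matching text content or attribute values returns strings
print(html.xpath('//a/text()'))  # ['first item']
print(html.xpath('//a/@href'))   # ['link1.html']

# Case 2: matching tags returns Element objects, each queryable with a relative xpath
a_el = html.xpath('//a')[0]
print(a_el.xpath('./@href'))     # ['link1.html']

# Case 3: an expression that matches nothing returns an empty list
print(html.xpath('//span/text()'))  # []
```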
7.4 lxml Usage Examples
Run the code below and inspect the printed output:
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from lxml import etree

text = '''
<div>
  <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
  </ul>
</div>
'''

# 1. Build the Element object
html = etree.HTML(text)

# 2. Get href values and text()
# print(html.xpath("//a[@href='link1.html']/text()"))
# text_list = html.xpath("//a/text()")
# link_list = html.xpath("//a/@href")
# all_list = html.xpath("//a/text() | //a/@href")
# print(text_list)
# print(link_list)
# print(all_list)

# Naive approach: look up each text's index and fetch the link at the same position
# (breaks when two links share the same text)
# for text in text_list:
#     myindex = text_list.index(text)
#     link = link_list[myindex]
#     print(text, link)

# Approach 1: zip the two flat lists together
# for text, link in zip(text_list, link_list):
#     print(text, link)

# Approach 2: select the a elements, then read text and href from each one
el_list = html.xpath('//a')
for el in el_list:
    # print(el.xpath('./text()'))   # text of the current element
    # print(el.xpath('.//text()'))  # text of the current element and its descendants
    # print(el.xpath('text()'))     # same as ./text()
    # print(el.xpath('//text()'))   # beware: // ignores the current element and searches the whole document
    print(el.xpath('./@href')[0], el.xpath('./text()')[0])
```
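Because Approach 2 reads the href and the text from the same a element, the pairing cannot drift the way the index-based lookup can when two links share the same text. The final loop should print:

```
link1.html first item
link2.html second item
link3.html third item
link4.html fourth item
link5.html fifth item
```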
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
from lxml import etree

"""
Parts of the HTML may arrive commented out, so xpath cannot see them. Two fixes:

Fix 1: switch the User-Agent to a legacy ("low-end") browser:
    self.headers = {
        # modern browser UA
        # "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
        # legacy browser UA
        "User-Agent": "Mozilla/4.0 (Macintosh; compatieba; MSIE 5.01; Windows NT 5.0; DigExt)"
    }
    With a modern UA the server may wrap parts of the HTML in comments; a real
    browser still renders them, but xpath cannot match commented-out nodes, so
    the result list has length 0. With a legacy UA the HTML comes back without
    the comment markers and xpath matches everything.

Fix 2: keep the modern UA and strip the comment markers from the response bytes:
    data = response.content.decode().replace("<!--", "").replace("-->", "")
"""

class Tieba(object):
    def __init__(self, name):
        self.url = "https://tieba.baidu.com/f?ie=utf-8&kw={}&fr=search".format(name)
        print(self.url)
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
            # "User-Agent": "Mozilla/4.0 (Macintosh; compatieba; MSIE 5.01; Windows NT 5.0; DigExt)"
        }

    def get_data(self, url):
        response = requests.get(url, headers=self.headers)
        # Save a copy of the raw page for debugging xpath expressions
        with open('temp.html', 'wb') as f:
            f.write(response.content)
        return response.content

    def parse_data(self, data):
        # Strip the comment markers (fix 2 above), then build the Element object
        data = data.decode().replace("<!--", "").replace("-->", "")
        html = etree.HTML(data)
        el_list = html.xpath('//li[@class=" j_thread_list clearfix thread_item_box"]/div/div[2]/div[1]/div[1]/a')

        data_list = []
        for el in el_list:
            temp = {}
            temp["title"] = el.xpath("./text()")[0]
            temp["href"] = 'https://tieba.baidu.com' + el.xpath("./@href")[0]
            data_list.append(temp)

        # Get the next-page URL; "下一页" is the link text for "next page"
        try:
            next_url = 'https:' + html.xpath('//a[contains(text(),"下一页>")]/@href')[0]
        except IndexError:  # no match on the last page
            next_url = None
        return data_list, next_url

    def save_data(self, data_list):
        for data in data_list:
            print(data)

    def run(self):
        # 1. url
        # 2. headers
        next_url = self.url
        while True:
            # 3. send the request and get the response
            data = self.get_data(next_url)
            # 4. extract the records and the next-page url from the response
            data_list, next_url = self.parse_data(data)
            self.save_data(data_list)
            print('next_url:', next_url)
            # 5. stop once there is no next page
            if next_url is None:
                break

if __name__ == '__main__':
    tieba = Tieba('传智播客')
    tieba.run()
```
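Each pass of run prints one dict per post, e.g. {'title': ..., 'href': 'https://tieba.baidu.com/...'}, followed by the next-page URL; the while True loop uses next_url as its sentinel, so the crawl stops on the first page whose next-page xpath matches nothing. Note that the li class value and div nesting in parse_data are tied to Tieba's markup at the time of writing and may need updating if the page structure changes.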
Key point: master using xpath expressions in lxml to locate elements and extract attribute values or text content
9. Using the etree.tostring Function in lxml
> Run the code below and compare the printed result with the original HTML string
```python
from lxml import etree

html_str = '''
<div>
  <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
  </ul>
</div>
'''

html = etree.HTML(html_str)
handeled_html_str = etree.tostring(html).decode()
print(handeled_html_str)
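If the one-line output is hard to read, etree.tostring also accepts a pretty_print argument (a standard lxml parameter) that indents the serialized tree:

```python
# Same serialization as above, but indented for easier inspection
print(etree.tostring(html, pretty_print=True).decode())
```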
Observations and conclusions

Compared with the original string, the printed result has been auto-completed:

- the missing closing li tag has been filled in
- the surrounding html and body tags have been added
Output:
```html
<html><body><div>
  <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
  </li></ul>
</div>
</body></html>
```
Conclusions:

- lxml.etree.HTML(html_str) automatically completes missing tags
- the lxml.etree.tostring function converts the resulting Element object back into an HTML string
- when a scraper uses lxml to extract data, xpath expressions should be based on the return value of lxml.etree.tostring, since that repaired HTML is what the xpath actually runs against (see the sketch below)
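As a short sketch of why the tostring output is the right reference (the malformed snippet here is made up for illustration):

```python
from lxml import etree

raw = '<div><p>broken <b>markup</div>'  # deliberately malformed input
html = etree.HTML(raw)

# Inspect the repaired tree before writing any xpath;
# prints roughly: <html><body><div><p>broken <b>markup</b></p></div></body></html>
print(etree.tostring(html).decode())

# xpath runs against the repaired tree, not the raw source
print(html.xpath('//b/text()'))  # ['markup']
```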