Crawler Basics: Three Ways to Parse Data

Data parsing method 1: regular expressions

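Single characters:
        .  : any character except a newline
        [] : a character set such as [aoe] or [a-w]; matches any one character in the set
        \d : a digit, [0-9]
        \D : a non-digit
        \w : a digit, letter, or underscore; with Python 3 str patterns it also matches other Unicode word characters such as Chinese
        \W : anything \w does not match
        \s : any whitespace character, including space, tab, form feed, and so on; equivalent to [ \f\n\r\t\v] for ASCII text
        \S : any non-whitespace character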

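Quantifiers:
        * : any number of times, >=0
        + : at least once, >=1
        ? : optional, 0 or 1 time
        {m} : exactly m times, e.g. hello{3} (the quantifier applies only to the final o)
        {m,} : at least m times, e.g. hello{3,}
        {m,n} : between m and n times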
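Boundaries:
        $ : the match must end here (end of string)
        ^ : the match must start here (start of string)
Groups:
        (ab) : treats ab as a single unit and captures it
Greedy mode: .* (matches as much as possible)
Non-greedy (lazy) mode: .*? (matches as little as possible)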
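The difference between greedy and lazy matching is easiest to see on markup with repeated closing tags. The snippet below is a minimal sketch using only the standard re module; the sample string is made up for illustration.

```python
import re

# Hypothetical sample markup with two sibling blocks.
html = '<div>first</div><div>second</div>'

# Greedy: .* runs to the LAST </div>, so both blocks collapse into one match.
greedy = re.findall(r'<div>(.*)</div>', html)

# Lazy: .*? stops at the NEAREST </div>, giving one match per block.
lazy = re.findall(r'<div>(.*?)</div>', html)

# {m} binds to the preceding item only: hello{3} means "hell" plus three o's.
exact = re.fullmatch(r'hello{3}', 'hellooo') is not None

print(greedy)  # ['first</div><div>second']
print(lazy)    # ['first', 'second']
print(exact)   # True
```

When scraping, the lazy form is usually the one you want, because a page tends to contain many blocks with the same closing tag.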

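Flags and substitution:
        re.I : ignore case
        re.M : multi-line mode; ^ and $ also match at the start and end of every line
        re.S : "single-line" (DOTALL) mode; . also matches newlines, so the whole string is treated as one line
        re.sub(pattern, replacement, string) : replaces every match of pattern in string with replacement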
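As a rough illustration of the three flags and re.sub, the sketch below runs each of them against one made-up multi-line string. Only the standard re module is used; the text and variable names are assumptions for the example.

```python
import re

# Hypothetical three-line input.
text = "Title: Crawler\nbody line one\nBODY line two"

# re.I: case-insensitive, so "body" and "BODY" both match.
bodies = re.findall(r'body', text, re.I)

# re.M: ^ anchors at the start of every line, not only the start of the string.
line_starts = re.findall(r'^\w+', text, re.M)

# Without re.S the dot stops at a newline, so this cross-line pattern fails.
no_dotall = re.search(r'Title:.*two', text)

# With re.S the dot also matches newlines, so the same pattern succeeds.
dotall = re.search(r'Title:.*two', text, re.S)

# re.sub: collapse each run of whitespace (including newlines) into one space.
flat = re.sub(r'\s+', ' ', text)

print(bodies)       # ['body', 'BODY']
print(line_starts)  # ['Title', 'body', 'BODY']
print(no_dotall)    # None
print(flat)         # Title: Crawler body line one BODY line two
```

re.S matters most for HTML, where the block you want to capture usually spans several lines.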

Data parsing method 2: XPath

XPath is a language for finding information in XML documents. It navigates a document through its elements and attributes.

Expression syntax:

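/bookstore/book           selects every book that is a direct child of the root node bookstore
//book                    selects every book in the document
/bookstore//book          selects every book at any depth under bookstore
/bookstore/book[1]        the first book child of bookstore
/bookstore/book[last()]   the last book child of bookstore
/bookstore/book[position()<3]  the first two book children of bookstore
//title[@lang]            every title node that has a lang attribute
//title[@lang='eng']      every title node whose lang attribute equals eng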
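The expressions above can be tried against a small document. The sketch below assumes lxml is installed and uses a made-up bookstore XML (the titles and the extra shelf element are invented for illustration) to show how many nodes each expression selects.

```python
from lxml import etree

# Hypothetical document: three direct book children plus one nested in <shelf>.
xml = """
<bookstore>
  <book><title lang="eng">Harry Potter</title></book>
  <book><title lang="eng">Learning XML</title></book>
  <book><title>三国演义</title></book>
  <shelf><book><title lang="fr">Le Petit Prince</title></book></shelf>
</bookstore>
"""
root = etree.fromstring(xml)

direct_books = root.xpath('/bookstore/book')        # direct children only
all_books = root.xpath('//book')                    # any depth
first_title = root.xpath('/bookstore/book[1]/title/text()')
last_title = root.xpath('/bookstore/book[last()]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]')
with_lang = root.xpath('//title[@lang]')
eng_only = root.xpath("//title[@lang='eng']")

print(len(direct_books), len(all_books))  # 3 4
print(first_title, last_title)            # ['Harry Potter'] ['三国演义']
print(len(first_two), len(with_lang), len(eng_only))  # 2 3 2
```

Note that xpath() always returns a list, even when only one node matches.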

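Locating by attribute
            //li[@id="hua"]
            //div[@class="song"]
Locating by hierarchy and index
            //div[@id="head"]/div/div[2]/a[@class="toindex"]
            [Note] Indexes start at 1
            //div[@id="head"]//a[@class="toindex"]
            [Note] The double slash selects every matching a node below, at any depth
Logical operators
            //input[@class="s_ipt" and @name="wd"]
Fuzzy matching:
            contains
                //input[contains(@class, "s_i")]
                every input whose class attribute contains s_i
                //input[contains(text(), "爱")]
                every input whose text contains 爱
            starts-with
                //input[starts-with(@class, "s")]
                every input whose class attribute starts with s
Getting text
            //div[@id="u1"]/a[5]/text()  the text directly inside the node
            //div[@id="u1"]//text()      all text under the node, with the tags stripped
Getting an attribute
            //div[@id="u1"]/a[5]/@href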
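To make the locating rules concrete, here is a minimal sketch that runs them on a small HTML fragment. It assumes lxml is installed; the markup is invented to mirror the ids and classes used above, not taken from a real page.

```python
from lxml import etree

# Hypothetical page fragment reusing the ids and classes from the notes.
html = """
<div id="head">
  <div><div>skip</div><div><a class="toindex" href="/index">Home</a></div></div>
</div>
<div id="u1">
  <a href="/news">News</a>
  <a href="/map"><b>Map</b> view</a>
</div>
<input class="s_ipt" name="wd">
<input class="s_btn" name="go">
"""
tree = etree.HTML(html)

# Hierarchy plus index: div[2] is the SECOND div child (indexes start at 1).
by_path = tree.xpath('//div[@id="head"]/div/div[2]/a[@class="toindex"]/@href')
# Double slash: the a node is found regardless of how deep it sits.
by_depth = tree.xpath('//div[@id="head"]//a[@class="toindex"]/text()')

# Logical and, contains, starts-with on attributes.
both = tree.xpath('//input[@class="s_ipt" and @name="wd"]')
contains = tree.xpath('//input[contains(@class, "s_")]/@name')
starts = tree.xpath('//input[starts-with(@class, "s_b")]/@name')

# /text() returns only direct text; //text() also descends into child tags.
direct_text = tree.xpath('//div[@id="u1"]/a[2]/text()')
all_text = tree.xpath('//div[@id="u1"]/a[2]//text()')
href = tree.xpath('//div[@id="u1"]/a[2]/@href')

print(by_path, by_depth)           # ['/index'] ['Home']
print(contains, starts)            # ['wd', 'go'] ['go']
print(''.join(all_text).strip())   # Map view
```

The /text() versus //text() distinction is the one that most often causes missing content when a paragraph contains inline tags.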

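Using XPath in code

1. Import: from lxml import etree
2. Turn the HTML or XML document into an etree object, then call its xpath method to find the nodes you want
2.1 Local file: tree = etree.parse(filename)
    [Note] etree.parse uses an XML parser by default; for HTML that is not well-formed XML, pass etree.HTMLParser() as the second argument
2.2 Data from the network: tree = etree.HTML(page_source_string)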
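The two entry points can be compared side by side. The sketch below assumes lxml is installed; the markup and the temporary file are stand-ins for a downloaded page and a saved local copy.

```python
import os
import tempfile
from lxml import etree

# Hypothetical page with unclosed <li> tags, which is common in real HTML.
page = '<html><body><ul><li id="hua">flower<li>grass</ul></body></html>'

# Network-style input: the page source is already a string in memory.
tree_from_string = etree.HTML(page)

# Local-file input: write the same markup to a temporary file first.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(page)
    path = f.name

try:
    # The markup is not well-formed XML, so an HTML parser is passed explicitly.
    tree_from_file = etree.parse(path, etree.HTMLParser())
finally:
    os.remove(path)

hua = tree_from_string.xpath('//li[@id="hua"]/text()')
items = tree_from_file.xpath('//li/text()')
print(hua)    # ['flower']
print(items)  # ['flower', 'grass']
```

Both objects expose the same xpath method, so the rest of the scraping code does not need to care where the markup came from.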

# Install:
pip install lxml

# Import
from lxml import etree

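Exercise

Task: get the content and author of each joke on a joke site (the task names http://www.haoduanzi.com; the code below targets https://www.xiaohua.com/duanzi/)

# Task: scrape the content and author of each joke from xiaohua.com
# https://www.xiaohua.com/duanzi/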

from lxml import etree
import requests

url = 'https://www.xiaohua.com/duanzi/'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

response = requests.get(url=url,headers=headers)
page_text = response.text

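# Build an etree object from the page source
tree = etree.HTML(page_text)    # the returned object is an Element
# For a local file instead: etree.parse('./douban.html', etree.HTMLParser())
div_list = tree.xpath('//div[@class="content-left"]/div[@class="one-cont"]')
with open('./xiaohua.txt', 'w', encoding='utf-8') as fp:
    for div in div_list:
        # xpath() always returns a list; skip entries missing either field
        author = div.xpath('.//div[@class="one-cont-font clearfix"]/a/i/text()')
        content = div.xpath('./p/a//text()')
        if not author or not content:
            continue
        # Join all text nodes so a joke split across tags is not truncated
        fp.write(author[0] + ":" + "".join(content) + "\n\n")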


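Data parsing method 3: bs4

Environment setup

- Set the pip index to a mirror in China (Aliyun, Douban, NetEase, etc.) to speed up downloads
   - Windows
    (1) Open File Explorer
    (2) Type %appdata% in the address bar
    (3) Create a new folder named pip there
    (4) Inside the pip folder, create a file named pip.ini with the following content
        [global]
        timeout = 6000
        index-url = https://mirrors.aliyun.com/pypi/simple/
        trusted-host = mirrors.aliyun.com
   - Linux
    (1) cd ~
    (2) mkdir ~/.pip
    (3) vi ~/.pip/pip.conf
    (4) Enter the same content as on Windows
  - Install: pip install bs4 (this pulls in the beautifulsoup4 package)
    The 'lxml' parser used below comes from a separate third-party library, so install that too
    pip install lxml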

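Basic usage

from bs4 import BeautifulSoup
    Usage: convert an HTML document into a BeautifulSoup object, then use the object's methods or attributes to find the content you want
     (1) From a local file:
          soup = BeautifulSoup(open('local_file', encoding='utf-8'), 'lxml')
     (2) From network data:
          soup = BeautifulSoup('a str or bytes object', 'lxml')
     (3) Printing the soup object shows the content of the HTML document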
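A short sketch of the construction step, assuming bs4 and lxml are both installed. The markup is invented; it shows that the same constructor accepts either a str or a bytes object, which is why both response.text and response.content can be passed in.

```python
from bs4 import BeautifulSoup

# Hypothetical page source.
markup = '<html><body><a href="/one" title="first">one</a></body></html>'

# str input, e.g. response.text from requests.
soup_from_str = BeautifulSoup(markup, 'lxml')

# bytes input, e.g. response.content; bs4 detects the encoding itself.
soup_from_bytes = BeautifulSoup(markup.encode('utf-8'), 'lxml')

href = soup_from_str.a['href']
text = soup_from_bytes.a.get_text()

print(href)  # /one
print(text)  # one
# prettify() returns the parsed document as an indented string.
print(soup_from_str.prettify())
```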
  
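  (1) Find by tag name
       soup.a   returns only the first matching tag
  (2) Get attributes
       soup.a.attrs  all attributes and values of a, as a dict
       soup.a.attrs['href']   the href attribute
       soup.a['href']   shorthand for the same thing
  (3) Get content
       soup.a.string
       soup.a.text
       soup.a.get_text()
     [Note] If the tag has more than one child node (for example text mixed with nested tags), string returns None, while the other two still return the text
  (4) find: returns the first matching tag
       soup.find('a')  the first a tag
       soup.find('a', title="xxx")
       soup.find('a', alt="xxx")
       soup.find('a', class_="xxx")
       soup.find('a', id="xxx")
  (5) find_all: returns all matching tags
       soup.find_all('a')
       soup.find_all(['a','b']) all a and b tags
       soup.find_all('a', limit=2)  only the first two
  (6) select: soup.select('#feng')
       selects content with a CSS selector
       common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
           hierarchy selectors:
              div .dudu #lala .meme .xixi  descendants at any depth
              div > p > a > .lala          direct children only
      [Note] select always returns a list; index into it to get a specific object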
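The lookups listed above can be checked on a small fragment. This is a minimal sketch assuming bs4 and lxml are installed; the markup, ids, and class names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links and one paragraph inside a list.
html = """
<div class="tang">
  <ul>
    <li><a href="/li-bai" title="poet" id="feng">Li Bai</a></li>
    <li><a href="/du-fu" class="du">Du Fu <b>(sage)</b></a></li>
    <li><p class="du">Bai Juyi</p></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

# Tag-name access and attribute access only see the FIRST matching tag.
first_href = soup.a['href']
first_attrs = soup.a.attrs

# string vs get_text() on a tag that mixes text with a nested tag.
nested = soup.find('a', class_='du')
nested_string = nested.string      # None: the tag has two child nodes
nested_text = nested.get_text()    # text of the tag and its descendants

# find_all with a list of names, and with a limit.
links_and_paras = soup.find_all(['a', 'p'])
limited = soup.find_all('a', limit=1)

# select takes CSS selectors and always returns a list.
by_id = soup.select('#feng')
direct_chain = soup.select('.tang > ul > li > a')   # each step is a direct child
any_depth = soup.select('.tang .du')                # descendants at any depth
direct_only = soup.select('.tang > .du')            # no direct .du children

print(first_href, nested_string, nested_text)
print(len(links_and_paras), len(direct_chain), len(any_depth), len(direct_only))
```

The last two selectors differ only in the combinator: the space form finds both .du elements, while the > form finds none because neither is a direct child of .tang.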

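Exercise

Task: use bs4 to scrape every chapter of the novel Romance of the Three Kingdoms (三国演义) from the shicimingju site and save it to local disk http://www.shicimingju.com/book/sanguoyanyi.html

# Use bs4 to scrape all chapters of Romance of the Three Kingdoms from shicimingju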

import requests
from bs4 import BeautifulSoup

# 1. Target URL and request headers
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

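# 2. Request the table-of-contents page
response = requests.get(url=url,headers=headers)

# 3. Get the page source
page_text = response.text

# 4. Parse the chapter titles and content
# Build a BeautifulSoup object for the table-of-contents page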
soup = BeautifulSoup(page_text,'lxml')

# Each li holds the link to one chapter
li_list = soup.select(".book-mulu > ul > li")
with open('./sanguo.txt','w',encoding='utf-8') as fp:
    for li in li_list:
        title = li.a.string
        content_url = "http://www.shicimingju.com" + li.a["href"]
        content_page_text = requests.get(url=content_url,headers=headers).text
        # Use a separate name so the table-of-contents soup is not overwritten
        detail_soup = BeautifulSoup(content_page_text,"lxml")
        # Text of the chapter
        content = detail_soup.find("div",class_="chapter_content").text

        # End each chapter with a newline so the next title starts on its own line
        fp.write(title+":"+content+"\n")