Spider (Part 2) — Regex & Persistent Data Storage

I. Parsing Module

1. Data categories

     1. Structured data
           Characteristics: has a fixed format, e.g. HTML, XML, JSON
     2. Unstructured data
           Examples: images, audio, video; this kind of data is generally stored as binary

2. Regular expressions (re)

     1. Usage flow
         1. Create a compiled pattern object: p = re.compile("regex")
         2. Match it against a string: r = p.match("string")
         3. Get the match result: print(r.group())
     2. Common methods (a minimal sketch of these appears after the greedy/non-greedy example below)
         1. match(html): matches only at the beginning of the string, returns a match object
         2. search(html): scans forward and returns the first match as a match object
         3. findall(html): returns all matches as a list
         4. group(): extracts the matched text from the object returned by match or search
     3. Pattern syntax
         . : any character (except \n)
         \d: digit
         \s: whitespace character
         \S: non-whitespace character    # [\s\S] matches any character, including \n
         \w: letter, digit or underscore _
         [...]: any one of the characters inside []: A[BCD]E  --> ABE  ACE  ADE

         *  0 or more times
         ?  0 or 1 time
         +  1 or more times
         {m} exactly m times
         {m,n} m to n times: AB{1,3}C --> ABC ABBC ABBBC
     4. Greedy vs. non-greedy matching
         Greedy (.*): as long as the whole pattern still matches, match as much as possible
         Non-greedy (.*?): as long as the whole pattern still matches, match as little as possible

    # re.S lets . match every character, including \n

import re

html = """<div><p>仰天大笑出门去</p></div>
<div><p>成也风云,败也风云</p></div>
"""
# Greedy match; re.S lets . match every character, including \n
p = re.compile(r'<div><p>.*</p></div>', re.S)
r = p.findall(html)
print(r)  #['<div><p>仰天大笑出门去</p></div>\n<div><p>成也风云,败也风云</p></div>']

# Non-greedy match
p = re.compile(r'<div><p>.*?</p></div>', re.S)
r = p.findall(html)
print(r) #['<div><p>仰天大笑出门去</p></div>', '<div><p>成也风云,败也风云</p></div>']
01_贪婪匹配和非贪婪匹配示例.py
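
A minimal sketch (not from the original notes) of the match(), search(), findall() and group() methods listed above, using a made-up string and pattern:

import re

s = "room 12, room 345"
p = re.compile(r"room \d+")   # \d+ : one or more digits

m1 = p.match(s)               # matches only at the start of the string
print(m1.group())             # room 12

m2 = p.search(s)              # first match anywhere in the string
print(m2.group())             # room 12

print(p.findall(s))           # all matches as a list: ['room 12', 'room 345']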

     5. Grouping with findall()
         # Explanation: the pattern is matched as a whole first, then only the () groups are returned
         # If there are 2 or more (), the groups are returned as tuples

import re

s = "A B C D"
p1 = re.compile(r'\w+\s+\w+')
r1 = p1.findall(s)
print(r1)  #['A B', 'C D']

# Step 1: ['A B','C D']
# Step 2: ['A','C']
p2 = re.compile(r'(\w+)\s+\w+')
r2 = p2.findall(s)
print(r2)

# Step 1: ['A B','C D']
# Step 2: [('A','B'),('C','D')]
p3 = re.compile(r'(\w+)\s+(\w+)')
r3 = p3.findall(s)
print(r3)
02_findll分组示例.py

     6. Exercise
         <div class="animal">
           <p class="name">
             <a title="tiger"></a>
           </p>
          
           <p class="contents">
             Two tigers two tigers run fast       
           </p>
         </div>
        
         <div class="animal">
           <p class="name">
             <a title="rabbit"></a>
           </p>
          
           <p class="contents">
             Small white rabbit white and white
           </p>
         </div>
         Step 1:
           [('tiger', 'Two tigers two tigers run fast'), ('rabbit', 'Small white rabbit white and white')]
         Step 2:
           Animal name: tiger
           Animal description: Two tigers ...
           Regex: p = re.compile(r'<div class="animal".*?title="(.*?)">.*?contents">(.*?)</p>', re.S)

import re

html = """<div class="animal">
    <p class="name">
        <a title="tiger"></a>
    </p>

    <p class="contents">
        Two tigers two tigers run fast
    </p>
</div>
      
<div class="animal">
    <p class="name">
        <a title="rabbit"></a>
    </p>

    <p class="contents">
        Small white rabbit white and white
    </p>
</div>"""


p = re.compile(r'<div class="animal">.*?title="(.*?)".*?class="contents">(.*?)</p>.*?</div>', re.S)
r_list = p.findall(html)
#print(r_list)

for animal in r_list:
    print("动物名称:",animal[0].strip())
    print("动物描述:",animal[1].strip())
    print()
03_findall分组练习.py

3. Case 1: scraping brain teasers from neihan8

     Site: www.neihan8.com
     Steps:
         1. Find the URL pattern
           Page 1: https://www.neihan8.com/njjzw/index.html
           Page 2: https://www.neihan8.com/njjzw/index_2.html
           Page 3: https://www.neihan8.com/njjzw/index_3.html
         2. Use a regex to capture the question and the answer
           p = re.compile(r'<div class="text-.*?title="(.*?)".*?<div class="desc">(.*?)</div>', re.S)
         3. Write the code
           1. Send the request
           2. Match with the regex
             <div class="text-column-item.*?title="(.*?)">.*?class="desc">(.*?)</div>
           3. Write the results to a local file
     See: 04_内涵8脑筋急转弯抓取.py

import urllib.request
import re

class NeihanSpider:
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0"}
        self.baseurl = "https://www.neihan8.com/njjzw/"
        self.page = 1
    
    # Fetch a page
    def getPage(self,url):
        req = urllib.request.Request(url,headers=self.headers)
        res = urllib.request.urlopen(req) 
        html = res.read().decode("utf-8")
        self.parsePage(html)
    
    # Parse the page
    def parsePage(self,html):
        p = re.compile(r'<div class="text-column-item.*?title="(.*?)">.*?class="desc">(.*?)</div>', re.S)
        r_list = p.findall(html)
#        print(r_list)
        # r_list: list of (question, answer) tuples
        self.writePage(r_list)
    # Save the data
    def writePage(self,r_list):
        # open the file once, with a single consistent encoding
        with open("急转弯.txt","a",encoding="gb18030") as f:
            for r_tuple in r_list:
                for r_str in r_tuple:
                    f.write(r_str.strip() + "\n")
                # two blank lines between brain teasers
                f.write("\n\n")
            
    # Main entry point
    def workOn(self):
        self.getPage(self.baseurl)
        while True:
            c = input("Done. Continue? (y/n): ")
            if c.strip().lower() == "y":
                self.page += 1
                url = self.baseurl + "index_" + str(self.page) + ".html"
                self.getPage(url)
            else:
                print("爬取结束,谢谢使用本爬虫")
                break
        
if __name__ == "__main__":
    spider = NeihanSpider()
    spider.workOn()
04_内涵8脑筋急转弯.py

4. Maoyan Top 100 movie chart, saved to a CSV file

     Site: Maoyan Movies - Charts - Top 100
     Goal: scrape the movie title, stars and release date
     1. Key points
         1. csv module usage flow
           1. Open the csv file
               with open("test.csv","a",newline="") as f:
           2. Create a writer object
               writer = csv.writer(f)
           3. Write one row (a list)
               writer.writerow([list])

import csv

with open("test.csv","a",newline="") as f:
    # create the writer object
    writer = csv.writer(f)
    # writerow() writes one row (a list of fields)
    writer.writerow(["霸王别姬","张国荣","1993"])
    writer.writerow(["英雄","梁朝伟","2000"])
    writer.writerow(["蜘蛛侠","Foreigner","2000"])
     
05_csv模块示例.py

         2. See: 06_猫眼电影top100.py
     2. Preparation
         1. Find the URL pattern
           Page 1: http://maoyan.com/board/4?offset=0
           Page 2: http://maoyan.com/board/4?offset=10
           Page n: offset = (n-1)*10
         2. Regex
           <div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>
         3. Write the code

import urllib.request
import re
import csv

class MaoyanSpider:
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"}
        self.baseurl = "http://maoyan.com/board/4?offset="
        self.offset = 0
        self.page = 1
        
    # Fetch a page
    def getPage(self,url):
        req = urllib.request.Request(url,headers=self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode("utf-8")
        self.parsePage(html)
    
    # Parse the page
    def parsePage(self,html):
        p = re.compile(r'<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>', re.S)
        r_list = p.findall(html)
        # r_list:[("霸王别姬","张国荣","1993"),(),()]
        self.writeTocsv(r_list)
        
    # Save the data
    def writeTocsv(self,r_list):
        for r_tuple in r_list:
#            L = list(r_tuple)
            L = [r_tuple[0].strip(),r_tuple[1].strip(),r_tuple[2].strip()]
            with open("猫眼电影.csv","a",newline="") as f:
                writer = csv.writer(f)
                writer.writerow(L)
        
    # Main entry point
    def workOn(self):
        while True:
            c = input("Press y to scrape, q to quit: ")
            if c.strip().lower() == "y":  
                url = self.baseurl + str(self.offset)
                self.getPage(url)
                self.page += 1
                self.offset = (self.page - 1)*10 
            else:
                print("爬取结束")
                break
                
if __name__ == "__main__":
    spider = MaoyanSpider()
    spider.workOn()
    
    
06_猫眼电影top100.py

II. Persistent Data Storage

1. Storing into MongoDB (pymongo module recap)

     import pymongo
     # create the client object
     conn = pymongo.MongoClient("localhost",27017)
     # database object
     db = conn.dbname
     # collection object
     myset = db.collection_name
     # insert a document (a dict); newer pymongo versions use insert_one()
     myset.insert_one(dict)
     >>>mongo
     >>>show dbs
     >>>use dbname
     >>>show collections
     >>>db.collection_name.find().pretty()
     >>>db.dropDatabase()
     >>>db.collection_name.count()

import pymongo

# create the client object
conn = pymongo.MongoClient("localhost",27017)
# database object; spiderdb is the database name
db = conn.spiderdb
# collection object; t1 is the collection name
myset = db.t1
# insert a document (a dict); insert() is deprecated, use insert_one()
myset.insert_one({"name":"Jim"})


The following commands are run in the mongo shell:
mongo 
show dbs
use spiderdb
show collections
db.t1.find().pretty()
db.dropDatabase()
07_pymongo回顾.py
import urllib.request
import re
import pymongo

class MaoyanSpider:
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"}
        self.baseurl = "http://maoyan.com/board/4?offset="
        self.offset = 0
        self.page = 1
        # MongoDB client object
        self.conn = pymongo.MongoClient("localhost",27017)
        # database object
        self.db = self.conn.myfilm
        # collection object
        self.tab = self.db.top100

    # Fetch a page
    def getPage(self,url):
        req = urllib.request.Request(url,headers=self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode("utf-8")
        self.parsePage(html)
    
    # Parse the page
    def parsePage(self,html):
        p = re.compile(r'<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>', re.S)
        r_list = p.findall(html)
        # r_list:[("霸王别姬","张国荣","1993"),(),()]
        self.writeTomongo(r_list)
        
    # Save the data
    def writeTomongo(self,r_list):
        for r_tuple in r_list:
            name = r_tuple[0].strip()
            star = r_tuple[1].strip()
            time = r_tuple[2].strip()

            d = {"name":name,"star":star,"time":time}
            self.tab.insert_one(d)   # insert() is deprecated in newer pymongo
            print("Saved to MongoDB")
                   
    # Main entry point
    def workOn(self):
        while True:
            c = input("Press y to scrape, q to quit: ")
            if c.strip().lower() == "y":  
                url = self.baseurl + str(self.offset)
                self.getPage(url)
                self.page += 1
                self.offset = (self.page - 1)*10 
            else:
                print("爬取结束")
                break
                
if __name__ == "__main__":
    spider = MaoyanSpider()
    spider.workOn()
    
    
    
    
08_猫眼电影top100mongo.py

2. Storing into MySQL (pymysql module recap)

'''Create database spiderdb, create table t1, insert one record'''
import pymysql
import warnings

# create the database connection (newer pymysql requires keyword arguments)
db = pymysql.connect(host="localhost",user="root",password="123456",charset="utf8")
# create a cursor object
cursor = db.cursor()
# cursor.execute(sql statement)
c_db = 'create database if not exists spiderdb charset utf8'
u_db = 'use spiderdb'
c_tab = 'create table if not exists t1(id int)'
ins = 'insert into t1 values(1)'

# suppress MySQL warnings
warnings.filterwarnings("ignore")
try:
    cursor.execute(c_db)
    cursor.execute(u_db)
    cursor.execute(c_tab)
    cursor.execute(ins)
    db.commit()  # commit the transaction
except Warning:
    pass

# close the cursor
cursor.close()
# close the database connection
db.close()
09_pymysql回顾.py
import urllib.request
import re
import pymysql
import warnings

class MaoyanSpider:
    def __init__(self):
        self.headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60"}
        self.baseurl = "http://maoyan.com/board/4?offset="
        self.offset = 0
        self.page = 1
        # database connection object (newer pymysql requires keyword arguments)
        self.db = pymysql.connect(host="localhost",user="root",password="123456",charset="utf8")
        # cursor object
        self.cursor = self.db.cursor()


    # Fetch a page
    def getPage(self,url):
        req = urllib.request.Request(url,headers=self.headers)
        res = urllib.request.urlopen(req)
        html = res.read().decode("utf-8")
        self.parsePage(html)
    
    # Parse the page
    def parsePage(self,html):
        p = re.compile(r'<div class="movie-item-info">.*?title="(.*?)".*?class="star">(.*?)</p>.*?class="releasetime">(.*?)</p>', re.S)
        r_list = p.findall(html)
        # r_list:[("霸王别姬","张国荣","1993"),(),()]
        self.writeTomysql(r_list)
        
    # Save the data
    def writeTomysql(self,r_list):
        c_db = 'create database if not exists myfilm charset utf8'
        u_db = 'use myfilm'
        c_tab = 'create table if not exists top100(\
                 id int primary key auto_increment,\
                 name varchar(50),\
                 star varchar(100),\
                 time varchar(50)\
                 )'
        ins = 'insert into top100(name,star,time) \
               values(%s,%s,%s)'
        # suppress MySQL warnings
        warnings.filterwarnings("ignore")
        try:
            self.cursor.execute(c_db)
            self.cursor.execute(u_db)
            self.cursor.execute(c_tab)
        except:
            pass

        # insert the records
        for r_tuple in r_list:
            L = [r_tuple[0].strip(),r_tuple[1].strip(),r_tuple[2].strip()]
            # execute(ins, [list of values])
            self.cursor.execute(ins,L)
            self.db.commit()
            print("存入数据库成功")

    # Main entry point
    def workOn(self):
        while True:
            c = input("Press y to scrape, q to quit: ")
            if c.strip().lower() == "y":  
                url = self.baseurl + str(self.offset)
                self.getPage(url)
                self.page += 1
                self.offset = (self.page - 1)*10 
            else:
                print("Scraping finished")
                # close the cursor and the database connection
                self.cursor.close()
                self.db.close()
                break
                
if __name__ == "__main__":
    spider = MaoyanSpider()
    spider.workOn()
    
    
10_猫眼电影top100mysql.py

III. The requests Module

1. Installation (open Anaconda Prompt as administrator)

     Anaconda   : conda install requests
     Windows cmd: python -m pip install requests
           ## python -m runs pip as a module of the current Python interpreter
     Ubuntu     : sudo pip3 install requests

2. Common methods

     1. requests.get(url,headers=headers)
          Sends the request and returns a response object
     2. Attributes of the response object
          1. res.text   : response body decoded as a string
          2. res.content: response body as raw bytes
            Use case: scraping unstructured data (images, video)

import requests

url = "http://dingyue.nosdn.127.net/XYcTX8guzubcSJvYgIh2x3I0=4NSI9w47v63rX4ACf7m=1541413847099.jpg"
headers = {"User-Agent":"Mozilla/5.0"}

res = requests.get(url,headers=headers)
res.encoding = "utf-8"
html = res.content

with open("颖宝.jpg","wb") as f:
    f.write(html)

print("颖宝到计算机了")
12_你们的颖宝.py

          3. res.encoding: the response encoding (often detected as ISO-8859-1); set it before reading res.text
            response.encoding = "utf-8"
          4. res.status_code: the HTTP status code returned by the server
          5. res.url        : the URL the data was actually fetched from

import requests

url = "http://www.baidu.com/"
headers = {"User-Agent":"Mozilla/5.0"}

res = requests.get(url,headers=headers)
# print(res.encoding)  # check the response encoding; Baidu reports ISO-8859-1 by default
res.encoding = "utf-8"
# response body as a string
print(res.text)
# response body as bytes
print(res.content)
# HTTP status code
print(res.status_code)
# the URL the data actually came from
print(res.url)
11_requests.get方法.py

     3. get() usage
         1. Without query parameters
           res = requests.get(url,headers=headers)
         2. With query parameters (params={})
           res = requests.get(url,params=params,headers=headers)
           Note: params must be a dict; requests URL-encodes it automatically

import requests

url = "http://www.baidu.com/s?"
headers = {"User-Agent":"Mozilla/5.0"}
key = input("Enter a search term: ")
params = {"wd":key}

res = requests.get(url,params=params,headers=headers)
res.encoding = "utf-8"
print(res.text)
13_requests.get查询参数.py

     4. post(), parameter name: data
         1. data = {}
         2. Example: 10_有道翻译post.py (not included in these notes; see the sketch below)
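
A minimal sketch of a POST request with a data dict, since 10_有道翻译post.py is not shown here. The httpbin.org endpoint and the form fields are placeholders for illustration only, not the Youdao API:

import requests

url = "http://httpbin.org/post"   # placeholder endpoint (assumption), not the real Youdao URL
headers = {"User-Agent":"Mozilla/5.0"}
# data must be a dict; requests form-encodes it automatically
data = {"i":"hello","from":"AUTO","to":"AUTO"}

res = requests.post(url,data=data,headers=headers)
res.encoding = "utf-8"
print(res.text)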
