The Scrapy Framework (1)

I. Scraping Mobile App Data

　　Reference blog: https://www.cnblogs.com/bobo-zhang/p/10068994.html

1. Steps

　　- Configure Fiddler

　　- tools -> options -> connection ->

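II. Introduction to Scrapy

1. What is Scrapy?

　　Scrapy is an application framework written for crawling websites and extracting structured data; it is widely used and very capable. A framework, in this sense, is a highly reusable project template that already integrates a range of features (high-performance asynchronous downloading, queues, distributed crawling, parsing, persistence, and so on). When learning a framework, the focus should be on its characteristics and on how to use each of its features.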

2. Installation

　　Linux:

　　　　pip3 install scrapy

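　　Windows:

　　　　a) pip3 install wheel

　　　　b) Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

　　　　c) Change to the download directory and run pip3 install Twisted-18.9.0-cp36-cp36m-win_amd64.whl (this file is for 64-bit CPython 3.6; pick the one that matches your Python version and architecture)

　　　　d) pip3 install pywin32

　　　　e) pip3 install scrapy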

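　　Verification (using the Windows installation as an example): run scrapy version in a terminal; if a version number is printed, the installation succeeded.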
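　　The same check can be made from Python. The sketch below uses only the standard library (importlib.metadata, available from Python 3.8); the helper name installed_version is made up for this illustration.

```python
from importlib import metadata


def installed_version(package="scrapy"):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("scrapy is not installed")
else:
    print("scrapy", version, "is installed")
```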

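3. Usage

1) Create a Scrapy project, for example:

　　scrapy startproject firstblood

2) Enter the project directory and create a spider file, for example:

　　cd firstblood
　　scrapy genspider first www.xxx.com

3) Modify the following files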

# settings.py 

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

ROBOTSTXT_OBEY = False

# first.py

# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        print(response)

4) Run the spider

　　scrapy crawl first

　　It can also be run without log output:

　　scrapy crawl first --nolog

III. Examples

1. Basic parsing with Scrapy (scraping the text section of Qiushibaike as an example)

# firstblood/firstblood/spiders/first.py 

# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
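    # Name of the spider, used on the command line (scrapy crawl first)
    name = 'first'
    # Allowed domains; leave this commented out if you do not want to restrict them
    # allowed_domains = ['www.xxx.com']
    # Start URLs, requested in order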
    start_urls = ['http://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content-left"]/div')
        # print(div_list)
        # [<Selector xpath='//*[@id="content-left"]/div' data='<div class="article block untagged mb15 '>, ...]
        for div in div_list:
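            # Use extract_first() when the XPath is expected to match a single node (it returns
            # the first match, or None if nothing matches); otherwise use extract(), which returns a list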
            author_name = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)
            print(author_name, content)
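　　The difference between the two calls can be tried out without Scrapy. This is only a rough imitation: it uses xml.etree.ElementTree from the standard library instead of Scrapy's selectors, the markup is invented, and the helpers extract and extract_first are hand-written stand-ins that copy the behaviour described in the comment above.

```python
import xml.etree.ElementTree as ET

# Invented markup, loosely shaped like the page structure used in the spider
HTML = """
<div id="content-left">
  <div><h2>author one</h2><span>line 1</span><span>line 2</span></div>
  <div><span>anonymous post</span></div>
</div>
"""


def extract(nodes):
    """Return the text of every matched node as a list."""
    return [node.text for node in nodes]


def extract_first(nodes, default=None):
    """Return the text of the first matched node, or `default` if nothing matched."""
    return nodes[0].text if nodes else default


root = ET.fromstring(HTML.strip())
rows = []
for div in root.findall("./div"):
    author = extract_first(div.findall("./h2"))        # a single value or None
    content = "".join(extract(div.findall("./span")))  # all fragments joined together
    rows.append((author, content))

print(rows)
# [('author one', 'line 1line 2'), (None, 'anonymous post')]
```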

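2. Parsing plus persistent storage with Scrapy

2.1. Persistence via command-line options (this can only write the return value of the parse method to a local file), as follows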

# firstblood/firstblood/spiders/first.py

# -*- coding: utf-8 -*-
import scrapy

class FirstSpider(scrapy.Spider):
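    # Name of the spider, used on the command line (scrapy crawl first)
    name = 'first'
    # Allowed domains; leave this commented out if you do not want to restrict them
    # allowed_domains = ['www.xxx.com']
    # Start URLs, requested in order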
    start_urls = ['http://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//*[@id="content-left"]/div')
        all_data = []
        for div in div_list:
            author_name = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span/text()').extract()
            content = ''.join(content)

            data_dic = {
                "author":author_name,
                "content":content
            }
            all_data.append(data_dic)

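        # Command-line persistence can only write the return value of parse to a local file,
        # so the data to persist is collected into one object (here a list of dicts) and returned
        return all_data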

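　　Run the spider with the output option, for example (the file name is arbitrary):

　　scrapy crawl first -o qiubai.csv

　　The output file extension must be one of the following:

　　json, jsonlines, jl, csv, xml, marshal, pickle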
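　　A small helper can catch an unsupported extension before the crawl is started. This is a sketch under assumptions: the function name crawl_command and the file name are invented, and the set of formats is the default list of Scrapy feed exporters. It only builds the argument list; it does not run anything.

```python
from pathlib import Path

# Default feed export formats, selected by the output file's extension
SUPPORTED_FEED_FORMATS = {"json", "jsonlines", "jl", "csv", "xml", "marshal", "pickle"}


def crawl_command(spider_name, output_path):
    """Build the `scrapy crawl ... -o ...` argument list, rejecting unsupported extensions."""
    suffix = Path(output_path).suffix.lstrip(".")
    if suffix not in SUPPORTED_FEED_FORMATS:
        raise ValueError("unsupported output format: %r" % suffix)
    return ["scrapy", "crawl", spider_name, "-o", output_path]


print(crawl_command("first", "qiubai.csv"))
# ['scrapy', 'crawl', 'first', '-o', 'qiubai.csv']
```

　　The returned list could be handed to subprocess.run from the project directory.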

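2.2. Persistence via item pipelines

　　Create a new Scrapy project and configure its settings.py file; this example scrapes job listings from BOSS Zhipin. The project uses the standard layout generated by scrapy startproject: a top-level bossPro directory holding scrapy.cfg and the bossPro package, which contains items.py, middlewares.py, pipelines.py, settings.py and the spiders directory.

# bossPro/bossPro/items.py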

import scrapy

class BossproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_name = scrapy.Field()
    sylary = scrapy.Field()
    company = scrapy.Field()

# bossPro/bossPro/spiders/boss.py

# -*- coding: utf-8 -*-
import scrapy

from bossPro.items import BossproItem

class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']

    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_name = li.xpath('./div/div[@class="info-primary"]/h3/a/div/text()').extract_first()
            sylary = li.xpath('./div/div[@class="info-primary"]/h3/a/span/text()').extract_first()
            company = li.xpath('./div/div[@class="info-company"]/div/h3/a/text()').extract_first()

            # Instantiate an item object
            item = BossproItem()
            # Store all of the parsed data in the item
            item["job_name"] = job_name
            item["sylary"] = sylary
            item["company"] = company

            # Submit the item to the pipeline
            yield item

　　Example 1: persisting to a file, as follows

# bossPro/bossPro/pipelines.py

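class BossproPipeline(object):
    fp = None

    # Called once when the spider starts
    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./boss.txt', 'w', encoding='utf8')

    # Called once when the spider finishes
    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()

    # Called once for every item the spider submits; the item argument is the item received by the pipeline
    def process_item(self, item, spider):
        # print('This is the item', item)
        # format() also copes with a field that is None (extract_first() found nothing)
        self.fp.write('{}:{}:{}\n'.format(item["job_name"], item["sylary"], item["company"]))
        # Pass the item on to the next pipeline
        return item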

# Note: the pipeline must be registered in bossPro/bossPro/settings.py. The number is the priority: the smaller the number, the earlier the pipeline runs

ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
}
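　　The call order can be pictured with a small simulation. This is a simplified model written for illustration, not Scrapy's own code: RecordingPipeline and run_pipelines are invented names, and closing the pipelines in reverse order is an assumption of the model.

```python
class RecordingPipeline:
    """Stand-in for an item pipeline that records every call it receives."""

    def __init__(self, label, log):
        self.label = label
        self.log = log

    def open_spider(self, spider):
        self.log.append((self.label, "open_spider"))

    def process_item(self, item, spider):
        self.log.append((self.label, "process_item", item["job_name"]))
        return item

    def close_spider(self, spider):
        self.log.append((self.label, "close_spider"))


def run_pipelines(priorities, items, spider=None):
    """Drive the pipelines the way a crawl would: open once, process each item, close once."""
    log = []
    # Smaller numbers come first, as with ITEM_PIPELINES
    labels = sorted(priorities, key=priorities.get)
    ordered = [RecordingPipeline(label, log) for label in labels]
    for pipeline in ordered:
        pipeline.open_spider(spider)
    for item in items:
        for pipeline in ordered:
            item = pipeline.process_item(item, spider)
    for pipeline in reversed(ordered):  # assumed closing order
        pipeline.close_spider(spider)
    return log


log = run_pipelines({"mysql": 301, "file": 300}, [{"job_name": "crawler engineer"}])
for entry in log:
    print(entry)
```

　　In the printed log the pipeline registered with 300 handles the item before the one registered with 301, and open_spider and close_spider each appear once per pipeline.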

　　Example 2: persisting to a MySQL database, as follows

# bossPro/bossPro/pipelines.py

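class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        import pymysql
        self.conn = pymysql.Connect(host='127.0.0.1', user='root', port=3306, password='', db='scrapydb', charset='utf8')

    def close_spider(self, spider):
        # Close the cursor before the connection; the cursor is still None if no item was processed
        if self.cursor is not None:
            self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # Pass the values as query parameters so that the driver escapes quotes in the scraped text
            self.cursor.execute('insert into boss values (%s, %s, %s)', (item["job_name"], item["sylary"], item["company"]))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        # Return the item so that pipelines that run later also receive it
        return item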

# Note: register it in bossPro/bossPro/settings.py

ITEM_PIPELINES = {
    'bossPro.pipelines.MysqlPipeline': 301,
}
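　　Scraped text can contain quotes, and an SQL statement assembled with string formatting breaks on them, whereas values passed as query parameters are stored unchanged. The sketch below shows this with sqlite3 from the standard library as a stand-in for MySQL, so that it runs without a database server. Note that sqlite3 uses ? as its placeholder where pymysql uses %s; the sample item is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the scrapydb database
conn.execute("create table boss (job_name text, sylary text, company text)")

# Invented sample item whose job name contains double quotes
item = {"job_name": 'Python "crawler" engineer', "sylary": "15k-25k", "company": "Example Co."}

# String formatting splices the value into the SQL text, so the embedded quotes break the statement
broken_sql = 'insert into boss values ("%s","%s","%s")' % (item["job_name"], item["sylary"], item["company"])
try:
    conn.execute(broken_sql)
    formatting_failed = False
except sqlite3.Error:
    formatting_failed = True

# Query parameters are sent separately from the SQL text and need no manual quoting
conn.execute("insert into boss values (?, ?, ?)", (item["job_name"], item["sylary"], item["company"]))
conn.commit()

rows = conn.execute("select job_name, sylary, company from boss").fetchall()
print(formatting_failed, rows)
```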

　　Example 3: persisting to a Redis database, as follows

# bossPro/bossPro/pipelines.py

class RedisPipeline(object):
    conn = None
    def open_spider(self, spider):
        from redis import Redis
        self.conn = Redis(host='127.0.0.1',port=6379)


    def process_item(self, item, spider):
        data_dic = {
            'job_name':item["job_name"],
            'sylary':item["sylary"],
            'company':item["company"]
        }
        self.conn.lpush('boss',data_dic)
        # Return the item so that pipelines that run later also receive it
        return item

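# Note: if Redis raises an error when storing the dictionary, the installed redis module does not accept dict values (redis-py 3.0 and later accept only bytes, strings and numbers). Either pin the older release with pip install -U redis==2.10.6, or serialize the dictionary to a string (for example with json.dumps) before pushing it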

# Note: register it in bossPro/bossPro/settings.py

ITEM_PIPELINES = {
    'bossPro.pipelines.RedisPipeline': 302,
}
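　　An alternative to pinning an old redis release is to serialize the dictionary before pushing it. The sketch below illustrates the idea without a Redis server: FakeListClient is a hypothetical in-memory stand-in whose lpush refuses dict values (it raises TypeError here, whereas the real client raises its own error type), and the sample data is invented.

```python
import json


class FakeListClient:
    """Hypothetical stand-in for a Redis client that only accepts bytes, strings and numbers."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, value):
        if not isinstance(value, (bytes, str, int, float)):
            raise TypeError("unsupported value type: %s" % type(value).__name__)
        self.lists.setdefault(name, []).insert(0, value)  # lpush adds at the head of the list
        return len(self.lists[name])


conn = FakeListClient()
data_dic = {"job_name": "爬虫工程师", "sylary": "15k-25k", "company": "Example Co."}

try:
    conn.lpush("boss", data_dic)  # a raw dict is rejected
    rejected = False
except TypeError:
    rejected = True

# Serialize to a JSON string first; ensure_ascii=False keeps Chinese text readable in the stored value
conn.lpush("boss", json.dumps(data_dic, ensure_ascii=False))

restored = json.loads(conn.lists["boss"][0])
print(rejected, restored == data_dic)
# True True
```

　　Whoever reads the list back then calls json.loads on each stored value to get the dictionary again.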

2.3. Sending requests manually (paginated scraping of BOSS Zhipin as an example)
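　　Sending a request manually means that the spider builds the URL of the next page itself and yields a scrapy.Request for it from parse, instead of relying on start_urls alone. The following is a rough, standard-library-only sketch of the URL side of that idea. The helper next_page_urls, the URL template and the page range are placeholders for illustration, not the real BOSS Zhipin pagination scheme.

```python
def next_page_urls(template, first_page, last_page):
    """Yield the URL of every page from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        yield template % page


# Placeholder template; the real site's query string is not given here
PAGE_URL = "https://www.example.com/jobs?page=%d"

urls = list(next_page_urls(PAGE_URL, 2, 4))
for url in urls:
    # Inside a spider's parse method this would become:
    #     yield scrapy.Request(url=url, callback=self.parse)
    print(url)
```

　　Passing parse itself as the callback means every further page is parsed by the same code as the first one.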

IV. Food for Thought

　　What happens if there is no return?
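　　Assuming the question refers to process_item in a pipeline, the effect can be explored with a small model. Each pipeline receives whatever the previous one returned, so a method without a return statement hands None to the next pipeline. All class and function names below are invented for the illustration, and run_chain is a simplification rather than Scrapy's actual machinery.

```python
class NoReturnPipeline:
    def process_item(self, item, spider):
        item["checked"] = True
        # No return statement, so this method returns None


class ReturnPipeline:
    def process_item(self, item, spider):
        item["checked"] = True
        return item


class LastPipeline:
    def process_item(self, item, spider):
        # Raises TypeError if the previous pipeline handed over None
        return item["job_name"]


def run_chain(pipelines, item):
    """Simplified model: each pipeline receives what the previous one returned."""
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider=None)
    return item


print(run_chain([ReturnPipeline(), LastPipeline()], {"job_name": "crawler engineer"}))

try:
    run_chain([NoReturnPipeline(), LastPipeline()], {"job_name": "crawler engineer"})
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)
# crawler engineer
# TypeError
```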

posted @ 2019-03-01 12:17 勇敢的巨蟹座