Using Scrapy, Part 3: Automatic Pagination Crawling of a Website

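Workflow for writing a crawler project:

Create the project: scrapy startproject <project name>
Create the spider: scrapy genspider <spider name> "<allowed domain>"
Define the requirements: write items.py
Write spiders/xxx.py: the spider file, which handles requests and responses and extracts the data (yield item)
Write pipelines.py: the pipeline file, which processes the items returned by the spider, for example by persisting them to local storage
Edit settings.py: enable the pipeline component with ITEM_PIPELINES = {}, and configure other settings such as headers and middlewares
Run the spider: scrapy crawl <spider name>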

1. Create the project

scrapy startproject Tencent
cd Tencent # enter the project directory

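2. Create a spider

scrapy genspider tencent.job "www.tencent.com"

When importing modules or referring to them in settings, the dotted path does not start at the outer project folder; it starts at the package one level below it (for example, Tencent.items refers to the inner Tencent package).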
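As a rough illustration of that rule, the sketch below builds a throwaway copy of the project layout in a temporary directory and imports the inner package by its dotted path. It uses only the standard library and does not need Scrapy; the FIELDS constant is a made-up stand-in for the real item definitions.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_layout():
    """Build Tencent/Tencent/items.py in a temp dir and import it as Tencent.items."""
    with tempfile.TemporaryDirectory() as tmp:
        outer = Path(tmp) / "Tencent"   # outer project folder (where scrapy.cfg lives)
        inner = outer / "Tencent"       # inner package: dotted paths start here
        inner.mkdir(parents=True)
        (inner / "__init__.py").write_text("")
        # FIELDS is hypothetical; it only gives the demo something to read back.
        (inner / "items.py").write_text("FIELDS = ['positionName', 'positionLink']\n")

        sys.path.insert(0, str(outer))  # roughly what running from the project root gives you
        try:
            module = importlib.import_module("Tencent.items")
            return module.__name__, list(module.FIELDS)
        finally:
            # Clean up so the demo does not leak into the rest of the process.
            sys.path.remove(str(outer))
            sys.modules.pop("Tencent.items", None)
            sys.modules.pop("Tencent", None)


module_name, fields = import_from_layout()
print(module_name, fields)
```

The import works because the outer folder is on the search path, so the first component of the dotted name is the inner package rather than the outer folder.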

3. Open the site and inspect the data: https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start=50#a

4. Define the requirements, that is, the fields to scrape, and write Tencent/items.py:

# Tencent/Tencent/items.py
import scrapy
class TencentItem(scrapy.Item):
    # Position name
    positionName = scrapy.Field()
    # Link to the position's detail page
    positionLink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNumber = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishTime = scrapy.Field()

5. Write the spider, Tencent/spiders/tencent_job.py:

import scrapy
class TencentJobSpider(scrapy.Spider):
    name = "tencent.job"
    allowed_domains = ["tencent.com"]
    base_url = "https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start="
    offset = 0  # the first page is start=0; each page advances by 10
    # Scrapy reads the initial requests from start_urls (plural)
    start_urls = [base_url + str(offset)]
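To make the offset scheme concrete, here is a small sketch in plain Python (no Scrapy required) that lists the URLs this pattern produces. The helper name page_urls and the cut-off of 30 are illustrative choices, not part of the original project.

```python
BASE_URL = "https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start="
PAGE_SIZE = 10  # each page advances the start parameter by 10


def page_urls(last_offset):
    """Return the listing URL for every offset from 0 up to last_offset inclusive."""
    return [BASE_URL + str(offset) for offset in range(0, last_offset + 1, PAGE_SIZE)]


urls = page_urls(30)
for url in urls:
    print(url)
```

The offset is an integer, so it has to be converted with str() before it is appended to the base URL.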

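Analyze the page to find the fields we want.

Right-click the information of interest, choose Inspect, and look at the Elements panel. The row nodes are:

//tr[@class='even']

or //tr[@class='odd']

Combining the two: //tr[@class='even'] | //tr[@class='odd']

Under each tr is a list of td cells, which hold the details of one position.

Position name: //tr[@class='even']/td[1]/a/text()

Note: XPath indices start at 1.

Position detail link: //tr[@class='even']/td[1]/a/@href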

............
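The row-then-cell lookup can be tried without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no | union and no text() or @href steps), so the class filter and the value extraction are done in Python instead. The sample markup is made up and much simpler than the real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample; the real page is HTML and far larger.
SAMPLE = """
<table>
  <tr class="h"><td>Position</td><td>Type</td><td>Count</td><td>Location</td><td>Date</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Python Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-10-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Data Analyst</a></td><td>Tech</td><td>1</td><td>Beijing</td><td>2018-10-02</td></tr>
</table>
"""


def parse_rows(markup):
    """Extract a few fields from every tr whose class is 'even' or 'odd'."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.iter("tr"):
        # ElementTree has no "|" union, so filter on the class attribute here.
        if tr.get("class") not in ("even", "odd"):
            continue
        link = tr.find("./td[1]/a")  # "./" = relative to this row; td[1] = first cell
        rows.append({
            "positionName": link.text,
            "positionLink": link.get("href"),
            "positionType": tr.find("./td[2]").text,
            "workLocation": tr.find("./td[4]").text,
        })
    return rows


jobs = parse_rows(SAMPLE)
for job in jobs:
    print(job)
```

The header row is skipped because its class is neither even nor odd, and td[1] selects the first cell because XPath positions are 1-based.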

import scrapy
from Tencent.items import TencentItem
class TencentJobSpider(scrapy.Spider):
    name = "tencent.job"
    allowed_domains = ["tencent.com"]
    base_url = "https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start="
    offset = 0  # the first page is start=0; each page advances by 10
    start_urls = [base_url + str(offset)]  # must be start_urls (plural)

    def parse(self, response):
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        # response.xpath() returns a list of selectors, each of which can be queried with xpath() again
        for node in node_list:
            # xpath() returns selectors; extract() returns their contents as a list of unicode strings
            positionName = node.xpath('./td[1]/a/text()').extract()
            # Note the leading "./": it makes td relative to the current node
            positionLink = node.xpath('./td[1]/a/@href').extract()
            positionType = node.xpath('./td[2]/text()').extract()
            peopleNumber = node.xpath('./td[3]/text()').extract()
            workLocation = node.xpath('./td[4]/text()').extract()
            publishTime = node.xpath('./td[5]/text()').extract()
            item = TencentItem()
            # extract() returns a list, so take its first element, falling back to "" when it is empty.
            # encode("utf-8") is Python 2 style; on Python 3 the values are already str, so drop the encode.
            item['positionName'] = positionName[0].encode("utf-8") if positionName else ""
            item['positionType'] = positionType[0].encode("utf-8") if positionType else ""
            item['positionLink'] = positionLink[0].encode("utf-8") if positionLink else ""
            item['peopleNumber'] = peopleNumber[0].encode("utf-8") if peopleNumber else ""
            item['workLocation'] = workLocation[0].encode("utf-8") if workLocation else ""
            item['publishTime'] = publishTime[0].encode("utf-8") if publishTime else ""
            yield item
        if self.offset < 2190:
            self.offset += 10  # 10 records per page
            url = self.base_url + str(self.offset)  # offset is an int, so convert it before concatenating
            # Build the Request object for the next page
            yield scrapy.Request(url, callback=self.parse)
            # 1. If the next request returns a different kind of page, write a separate callback method and pass it as callback.

            # 2. What is yielded here is a Request object. The engine sees that it is not an item, so it is not sent to the pipeline;
            # because it is a Request, the engine hands it to the scheduler to be queued and processed.

            # 3. yield is used here so that requests keep being produced until self.offset >= 2190.
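The way one callback yields both items and a follow-up request can be modeled with a toy loop in plain Python. This is only a simplified stand-in for the routing described in the comments, not Scrapy's actual engine; the dict-based "request" and "item" markers, the two-rows-per-page assumption, and the limit of 20 are all invented for the demo.

```python
from collections import deque

PAGE_SIZE = 10
LAST_OFFSET = 20  # small made-up limit so the demo finishes quickly


def parse(offset):
    """Toy callback: yield one item per row, then a request for the next page."""
    for row in range(2):  # pretend every page has two rows
        yield {"kind": "item", "offset": offset, "row": row}
    if offset < LAST_OFFSET:
        yield {"kind": "request", "offset": offset + PAGE_SIZE}


def run(start_offset=0):
    """Toy engine loop: items go to the 'pipeline' list, requests go back on the queue."""
    queue = deque([start_offset])
    pipeline_items, visited = [], []
    while queue:
        offset = queue.popleft()
        visited.append(offset)
        for result in parse(offset):
            if result["kind"] == "request":
                queue.append(result["offset"])  # scheduled for a later parse() call
            else:
                pipeline_items.append(result)   # handed to the pipeline
    return visited, pipeline_items


visited, items = run()
print(visited, len(items))
```

Crawling stops on its own once a page yields no further request, which is what the offset check achieves in the spider.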

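Notes on how the pipeline is called:

Each response is handled by one call to parse(). Each iteration of the for loop produces one item object, and every item produced in the loop is passed to the same pipeline's process_item() method.

In other words, all items share one pipeline object: the pipeline class is instantiated only once, and process_item() is called on that same instance many times. Initialization therefore runs only once, and the file only needs to be opened and closed once.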
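The "one instance, many calls" behaviour can be imitated by hand. In this sketch, CountingPipeline is a hypothetical class that only counts how often each hook runs, and the nested loops stand in for three pages of ten rows; Scrapy itself is not involved.

```python
class CountingPipeline:
    """Hypothetical pipeline that records how often each hook runs."""

    instances = 0  # class-level counter of how many pipeline objects were created

    def __init__(self):
        CountingPipeline.instances += 1
        self.processed = 0
        self.closed = False

    def process_item(self, item, spider):
        self.processed += 1
        return item

    def close_spider(self, spider):
        self.closed = True


# Simulate a crawl: one pipeline object, three pages, ten items per page.
pipeline = CountingPipeline()
for page in range(3):
    for row in range(10):
        pipeline.process_item({"page": page, "row": row}, spider=None)
pipeline.close_spider(spider=None)

print(CountingPipeline.instances, pipeline.processed, pipeline.closed)
```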

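6. Write the pipeline file to save the data

1) Write the pipeline class: Tencent/pipelines.py

import json
class TencentPipeline(object):
    def __init__(self):
        # Runs once, when the pipeline is instantiated
        self.f = open("tencent.json", "w")
    def process_item(self, item, spider):
        # To write Chinese text as-is, set ensure_ascii to False
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        # No further encode("utf-8") is needed here, because the item fields were already encoded in the spider.
        # In short, data fetched from the network is encoded to UTF-8 once.
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # Runs once, when the spider finishes
        self.f.close()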
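Writing every item followed by ",\n" leaves a file that is not valid JSON as a whole. One possible alternative, sketched here under the assumption of Python 3, is the JSON Lines layout: one JSON object per line, each parseable on its own. The class name JsonLinesPipeline, the path parameter, and the sample items are illustrative; open_spider and close_spider are the optional hooks Scrapy calls on a pipeline, and the code below drives them by hand so it runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path="tencent.jsonl"):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # An explicit encoding keeps Chinese text intact regardless of the platform default.
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()


# Drive the hooks by hand with made-up items.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"positionName": "Python工程师", "workLocation": "深圳"}, spider=None)
pipeline.process_item({"positionName": "数据分析师", "workLocation": "北京"}, spider=None)
pipeline.close_spider(spider=None)

# Each line can be loaded independently.
with open(demo_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded)
```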

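2) In Tencent/settings.py, enable pipelines and register the pipeline class above:

ITEM_PIPELINES = {
   'Tencent.pipelines.TencentPipeline': 300,
}

3) Make sure the parse() method of the spider class in Tencent/spiders/xx.py yields item objects; only items are passed to the pipeline.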
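The number attached to each entry in ITEM_PIPELINES sets the order in which pipelines run: items pass through lower values first, and values are conventionally chosen in the 0 to 1000 range. The sketch below only mirrors that ordering rule in plain Python; it is not how Scrapy loads the setting, and CleanFieldsPipeline is a hypothetical second pipeline added for the example.

```python
ITEM_PIPELINES = {
    "Tencent.pipelines.TencentPipeline": 300,
    "Tencent.pipelines.CleanFieldsPipeline": 100,  # hypothetical extra pipeline
}


def pipeline_order(setting):
    """Return the pipeline paths sorted by priority, lowest value first."""
    return [path for path, _priority in sorted(setting.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
print(order)
```

With these values, a cleaning step at 100 would see each item before the file-writing pipeline at 300.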

7. Run the spider:

scrapy crawl <spider name>

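8. Optimization:

In the example above the total page count is hard-coded, an approach suited to sites that have no "next page" link.

This site, however, does have a clickable "next page" link, and on the last page that link can no longer be clicked.

So whether the "next page" link is clickable tells us whether we are on the last page.

Inspecting the page, the next-page node is: //a[@id='next']/@href

On the last page its href is "javascript:;" and the element matches: //a[@class='noative']

However, the "previous page" link on the first page also matches: //a[@class='noative']

The ids differ, though: the "previous page" link on the first page has the id 'prev'.

Combining the two conditions: //a[@class='noative' and @id='next']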
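The last-page test can be tried on small made-up fragments. This sketch again uses the standard library's xml.etree.ElementTree instead of Scrapy selectors; since its XPath subset has no "and" inside a predicate, the id is matched in the path and the class is checked in Python. The markup, the href values, and the helper name next_page_href are assumptions for the demo.

```python
import xml.etree.ElementTree as ET

# Made-up pager fragments for a first, middle, and last page.
FIRST_PAGE = """<div>
  <a id="prev" class="noative" href="javascript:;">prev</a>
  <a id="next" href="position.php?start=10">next</a>
</div>"""

MIDDLE_PAGE = """<div>
  <a id="prev" href="position.php?start=0">prev</a>
  <a id="next" href="position.php?start=20">next</a>
</div>"""

LAST_PAGE = """<div>
  <a id="prev" href="position.php?start=10">prev</a>
  <a id="next" class="noative" href="javascript:;">next</a>
</div>"""


def next_page_href(markup):
    """Return the next-page href, or None when the next link is disabled or missing."""
    root = ET.fromstring(markup)
    link = root.find(".//a[@id='next']")
    if link is None or link.get("class") == "noative":
        return None
    return link.get("href")


for name, markup in [("first", FIRST_PAGE), ("middle", MIDDLE_PAGE), ("last", LAST_PAGE)]:
    print(name, next_page_href(markup))
```

Checking the id as well as the class matters on the first page, where the disabled link is the previous-page one and crawling should continue.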

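            yield item
        # Follow the next-page link unless it is disabled, which marks the last page
        if not response.xpath("//a[@class='noative' and @id='next']"):
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            # Build the Request object for the next page; urljoin resolves the href against the current page URL
            yield scrapy.Request(response.urljoin(url), callback=self.parse)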
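How a relative next-page href resolves against the current page URL can be sketched with the standard library's urljoin, which performs the same kind of resolution that Scrapy's response.urljoin applies to the response URL. The exact href format on the site is an assumption here.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed.
page_url = "https://hr.tencent.com/position.php?keywords=python&lid=2218&tid=87&start=0"
# Assumed shape of the href found on the next-page link.
next_href = "position.php?keywords=python&lid=2218&tid=87&start=10#a"

next_url = urljoin(page_url, next_href)
print(next_url)
```

Because the href carries its own path and query string, it replaces those parts of the page URL while the scheme and host are kept.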

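Note: conditions on the same element can be combined inside one predicate with and / or; to combine matches from different elements, use the union operator |.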

posted on 2018-10-04 21:17 myworldworld
