5. Modify the settings.py file
1) Enable DEFAULT_REQUEST_HEADERS
Uncomment it and change it as follows:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
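
As a rough illustration of what "default" means here: Scrapy applies these headers only where a request has not already set the header itself. The sketch below imitates that rule with plain dicts. It does not use Scrapy, the helper name effective_headers is made up for this example, and unlike Scrapy's real header handling it treats header names as case-sensitive.

```python
# Illustrative sketch only: imitates "defaults fill in what the request did not set".
# effective_headers is a hypothetical helper, not part of Scrapy.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}


def effective_headers(request_headers, defaults=DEFAULT_REQUEST_HEADERS):
    """Return the request's headers, filled in with any missing defaults."""
    merged = dict(request_headers)
    for name, value in defaults.items():
        # setdefault keeps a value the request already carries
        merged.setdefault(name, value)
    return merged


# A request that sets its own Accept-Language keeps it; the rest comes from the defaults.
headers = effective_headers({'Accept-Language': 'zh-CN'})
print(headers['Accept-Language'])
print('User-Agent' in headers)
```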

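2) Change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False
Note:
The default is True, which means the spider obeys the rules in the site's robots.txt.
Setting this option to False makes the spider ignore the Robots protocol.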
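
To get a feel for what obeying robots.txt means, here is a minimal sketch using the standard library's urllib.robotparser. It is not how Scrapy implements the check internally. The robots.txt content, the example.com URLs, and the helper name is_allowed are all made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url, user_agent="*", robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# With ROBOTSTXT_OBEY = True, a request like the second one would be filtered out;
# with ROBOTSTXT_OBEY = False, no such check is made.
print(is_allowed("https://example.com/item/Python"))
print(is_allowed("https://example.com/private/page"))
```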

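3) Enable ITEM_PIPELINES
ITEM_PIPELINES = {
    'baidubaike.pipelines.BaidubaikePipeline': 300,
}
Here ITEM_PIPELINES is a dictionary: each key is the path of an ItemPipeline class to enable, and each value is its priority. The ItemPipelines are called in order of priority, and the smaller the value, the higher the priority.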
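
The ordering rule can be sketched in plain Python. Only BaidubaikePipeline comes from this project; the other two class paths and the helper call_order are hypothetical and exist only to show how the priorities sort.

```python
# Only BaidubaikePipeline exists in this project; the other two entries
# are hypothetical and are here just to demonstrate the ordering.
ITEM_PIPELINES = {
    'baidubaike.pipelines.BaidubaikePipeline': 300,
    'baidubaike.pipelines.HypotheticalCleanupPipeline': 100,
    'baidubaike.pipelines.HypotheticalStoragePipeline': 800,
}


def call_order(pipelines):
    """Return pipeline paths sorted by ascending priority value."""
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


# The pipeline with the smallest value handles each item first.
for path in call_order(ITEM_PIPELINES):
    print(ITEM_PIPELINES[path], path)
```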

6. Modify the pipelines.py file
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Approach 1: serialize each item with json.dumps and write one JSON object per line
#import json
#
#class BaidubaikePipeline(object):
#    def __init__(self):
#        self.fp = open('baike.json', 'w', encoding='utf-8')
#
#    def open_spider(self, spider):
#        print('Spider started...')
#
#    def process_item(self, item, spider):
#        item_json = json.dumps(dict(item), ensure_ascii=False)
#        self.fp.write(item_json + '\n')
#        return item
#
#    def close_spider(self, spider):
#        self.fp.close()
#        print('Spider finished...')
#

# Approach 2: use JsonItemExporter, which writes all items as a single JSON array
#from scrapy.exporters import JsonItemExporter
#
#class BaidubaikePipeline(object):
#    def __init__(self):
#        self.fp = open('baike.json', 'wb')
#        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
#        self.exporter.start_exporting()
#
#    def open_spider(self, spider):
#        print('Spider started...')
#
#    def process_item(self, item, spider):
#        self.exporter.export_item(item)
#        return item
#
#    def close_spider(self, spider):
#        self.exporter.finish_exporting()
#        self.fp.close()
#        print('Spider finished...')

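# Approach 3: use JsonLinesItemExporter, which writes one JSON object per line
from scrapy.exporters import JsonLinesItemExporter

class BaidubaikePipeline(object):
    def __init__(self):
        # The exporter writes bytes, so the file is opened in binary mode
        self.fp = open('baike.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def open_spider(self, spider):
        print('Spider started...')

    def process_item(self, item, spider):
        # Write the item as one line, then pass it on to any later pipeline
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished...')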
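
Since the active pipeline writes one JSON object per line, the output can be read back line by line. Below is a minimal sketch of that, separate from pipelines.py. The helper read_json_lines and the 'title' field are assumptions for the demo (the real fields depend on the item class), and the demo writes its own sample file in a temporary directory instead of touching a real baike.json.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    items = []
    with open(path, encoding='utf-8') as fp:
        for line in fp:
            line = line.strip()
            if line:
                items.append(json.loads(line))
    return items


# Demo with made-up records; 'title' is a hypothetical field name.
sample = [{'title': 'Python'}, {'title': 'Scrapy'}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'baike.json')
    with open(demo_path, 'w', encoding='utf-8') as fp:
        for record in sample:
            fp.write(json.dumps(record, ensure_ascii=False) + '\n')
    loaded = read_json_lines(demo_path)

print(loaded)
```

Note that calling json.load on the whole file would fail for this format once it holds more than one item, because the file is a sequence of JSON objects and not a single JSON document.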

7. Run the spider
scrapy crawl <spider name>

d:\pythonCode\spiderProject\baidubaike\baidubaike>scrapy crawl baike

posted on 2019-09-27 18:00 WebLinuxStudy
