Scrapy Beginner Example: My Own Summary

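Building a Scrapy spider takes four steps:

Create the project (scrapy startproject xxx): create a new spider project
Define the target (write items.py): specify the data you want to scrape
Write the spider (spiders/xxspider.py): write the spider that crawls the pages
Store the content (pipelines.py): design a pipeline to store the scraped data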

1. Create the project (scrapy startproject)

scrapy startproject waduanziSpider

2. Define the target (waduanziSpider/items.py)

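We are going to scrape the jokes (duanzi) posted on http://www.waduanzi.com/.

Open items.py in the waduanziSpider directory.
An Item defines structured data fields for holding scraped data. It behaves much like a Python dict but adds some protection against mistakes, such as rejecting fields that were never declared.
To define an Item, create a class that subclasses scrapy.Item and declare class attributes of type scrapy.Field (similar to an ORM mapping).
Next, create a WaduanzispiderItem class to build the item model.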

import scrapy


class WaduanzispiderItem(scrapy.Item):
    # Fields for one joke post
    title = scrapy.Field()    # post title
    content = scrapy.Field()  # post body text
    zan = scrapy.Field()      # number of likes ("zan")

3. Write the spider (spiders/waduanzi.py)

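Run the following command in the current directory. It creates a spider named waduanzi under waduanziSpider/spiders and restricts it to the given domain:

scrapy genspider waduanzi "waduanzi.com"

Use scrapy shell for debugging; it lets you check whether an XPath expression extracts the expected value.

From the project root, start the shell with: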

scrapy shell "http://waduanzi.com/"
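One thing worth knowing when checking extracted text this way: XPath's normalize-space() only collapses the four XML whitespace characters (space, tab, carriage return, line feed). A non-breaking space (\xa0, which is what &nbsp; becomes) is left alone, which is why the spider in this post removes it with a separate replace() call. The sketch below is a pure-Python imitation of that behavior, not Scrapy's own implementation; the function name xpath_normalize_space and the sample string are made up for illustration.

```python
import re


def xpath_normalize_space(text: str) -> str:
    """Hypothetical pure-Python imitation of XPath 1.0 normalize-space().

    Collapses runs of XML whitespace (space, tab, CR, LF) into one space
    and trims both ends. It deliberately ignores \xa0, as XPath does.
    """
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.strip(" \t\r\n")


# Made-up sample: line breaks, indentation, and two non-breaking spaces.
raw = "\n   first line\n\t\xa0\xa0second line  \n"

normalized = xpath_normalize_space(raw)
cleaned = normalized.replace("\xa0", "")  # same cleanup step the spider uses

print(repr(normalized))
print(repr(cleaned))
```

The first printed value still contains the two \xa0 characters; only the second one is fully clean.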

Open waduanzi.py in the waduanziSpider/spiders directory. The code is as follows:

# -*- coding: utf-8 -*-
import scrapy

from ..items import WaduanzispiderItem


class WaduanziSpider(scrapy.Spider):
    name = 'waduanzi'
    allowed_domains = ['waduanzi.com']

    url = 'http://www.waduanzi.com/page/'
    offset = 1  # current page number

    start_urls = [url + str(offset)]

    def parse(self, response):
        duanzi_list = response.xpath("//div[@class='panel panel20 post-item post-box']")
        for each in duanzi_list:
            item = WaduanzispiderItem()
            item['title'] = each.xpath("./div[2]/h2/a/text()").extract()[0]
            # normalize-space() returns the node's whole string value (all descendant
            # text, including text split by <br>), so no /text() step is needed.
            # The path is relative (./) so each post yields its own content; an
            # absolute // path would return the first post's content every time.
            content = each.xpath("normalize-space(./div[2]/div)").extract()[0]
            # &nbsp; arrives as \xa0, which normalize-space() does not strip
            item['content'] = content.replace("\xa0", "")
            item['zan'] = each.xpath("./div[3]/ul/li[1]/a/text()").extract()[0]

            yield item

        # After each page, increment self.offset, build the next page URL, and
        # request it with self.parse as the callback. Stop after page 6.
        if self.offset < 6:
            self.offset += 1
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
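The pagination logic in the spider can be hard to follow inside a callback, so here is a minimal standalone sketch of the same counting scheme with no Scrapy involved. It only lists the page URLs in the order they would be requested; the names page_urls, BASE_URL and LAST_PAGE are illustrative and do not appear in the project.

```python
BASE_URL = "http://www.waduanzi.com/page/"
LAST_PAGE = 6  # the spider stops incrementing its offset at 6


def page_urls(base_url: str = BASE_URL, last_page: int = LAST_PAGE):
    """Yield page URLs in request order: page 1 first, then 2 up to last_page."""
    offset = 1
    yield base_url + str(offset)  # corresponds to start_urls
    while offset < last_page:
        offset += 1
        yield base_url + str(offset)  # corresponds to each follow-up request


urls = list(page_urls())
for u in urls:
    print(u)
```

In the spider the loop is implicit: each call to parse() plays the role of one iteration, and the counter lives on the spider instance.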

4. Store the content (pipelines.py)

To enable an Item Pipeline component, add its class to the ITEM_PIPELINES setting in settings.py:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # Register the class written in pipelines.py
    'waduanziSpider.pipelines.WaduanzispiderPipeline': 300,
}
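The number next to each class sets the order in which pipelines run: items pass through lower values first, and values are conventionally kept in the 0 to 1000 range. The sketch below only illustrates that ordering with plain Python; HypotheticalCleanupPipeline is an invented name that does not exist in this project, and pipeline_order is a helper written for this example, not a Scrapy function.

```python
# Illustrative settings: the second entry is hypothetical.
ITEM_PIPELINES = {
    "waduanziSpider.pipelines.WaduanzispiderPipeline": 300,
    "waduanziSpider.pipelines.HypotheticalCleanupPipeline": 100,
}


def pipeline_order(pipelines: dict) -> list:
    """Return the class paths sorted the way items flow: lowest value first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

With these values, an item would reach the cleanup pipeline (100) before the JSON-writing pipeline (300).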

Set the request headers in settings.py:

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

Write pipelines.py:

import json


class WaduanzispiderPipeline(object):
    # __init__ is optional; it initializes the pipeline
    def __init__(self):
        # Open as UTF-8 so text written with ensure_ascii=False is saved correctly
        self.file = open('waduanzi.json', 'w', encoding='utf-8')

    # process_item is required; it is called for every item
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.file.write(jsontext)
        # Must return the item so that later pipeline components receive it;
        # dropped items are not processed by the components that follow
        return item

    # close_spider is optional; it is called when the spider finishes
    def close_spider(self, spider):
        self.file.close()
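Because the pipeline above appends ",\n" after every object, the resulting file cannot be parsed as a single JSON document. One possible alternative, sketched here under that assumption, is to write JSON Lines (one object per line, no trailing comma) so the file can be read back line by line. JsonLinesPipeline and the waduanzi.jl filename are hypothetical and not part of the original project; the sketch is driven by hand with plain dicts so it runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of WaduanzispiderPipeline that writes JSON Lines."""

    def __init__(self, path="waduanzi.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider opens; UTF-8 keeps Chinese text readable.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line, with no trailing comma.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()


# Drive the pipeline by hand; plain dicts stand in for scraped items.
path = os.path.join(tempfile.mkdtemp(), "waduanzi.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "标题", "content": "内容", "zan": "12"}, spider=None)
pipeline.process_item({"title": "t2", "content": "c2", "zan": "3"}, spider=None)
pipeline.close_spider(spider=None)

# Every line parses on its own.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

Opening the file in open_spider rather than __init__ is a design choice: the file is only created once a crawl actually starts.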

Run the spider from the project root:

scrapy crawl waduanzi

Posted on 2020-03-10 00:46 by cherry_ning
