Scrapy CrawlSpider: crawling URLs, example 2

1. Create a new project (scrapy startproject)

scrapy startproject dongguanSpider

 

2. Define the items (dongguanSpider/items.py)

import scrapy

class DongguanspiderItem(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    number = scrapy.Field()

 

3. Build the spider (spiders/dongguan.py or spiders/xixi.py)

scrapy genspider -t crawl dongguan "wz.sun0769.com"

1. spiders/dongguan.py: extracting URLs with CrawlSpider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DongguanspiderItem

class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=']

    rules = (
        # follow is a boolean that decides whether links extracted by this rule
        # should themselves be crawled. If callback is None, follow defaults to
        # True; otherwise it defaults to False.
        # follow=True means pagination links are followed; with follow=False
        # only the first page would be crawled.
        Rule(LinkExtractor(allow=r'id=1&page=\d+'), follow=True),
        # If the web server rewrites URLs, use process_links to fix the extracted links:
        # Rule(LinkExtractor(allow=r'id=1&page=\d+'), process_links='deal_links'),
        # Watch out: '?' is a regex metacharacter, so it must be escaped here!
        Rule(LinkExtractor(allow=r'index\?id=\d+'), callback='parse_item'),
    )

    # # If the web server rewrites URLs, use process_links to fix the extracted links
    # def deal_links(self, links):  # links is the list of links extracted from the current response
    #     for each in links:
    #         each.url = each.url.replace("?", "&").replace("Type&", "Type?")
    #     return links

    def parse_item(self, response):
        print(response.url)
        item = DongguanspiderItem()
        item['title'] = response.xpath("//div[@class='mr-three']/p/text()").extract()[0]
        item['content'] = response.xpath("//div[@class='details-box']/p/text()").extract()[0].replace("\n", "").strip()  # strip() removes surrounding whitespace
        item['url'] = response.url
        # the field looks like "编号:201234"; keep the digits after the colon
        # (adjust the separator if the site uses the full-width "：")
        item['number'] = response.xpath("//div[@class='mr-three']/div/span[4]/text()").extract()[0].split(":")[-1]

        yield item

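The two `allow` patterns can be checked offline with the `re` module. This is a minimal sketch using example URLs written in the shape the rules expect (not fetched from the site); it also shows why the question mark has to be escaped:

```python
import re

# pagination links matched by the first rule
page_pat = re.compile(r'id=1&page=\d+')
# detail links matched by the second rule; '?' is a regex metacharacter,
# so it must be escaped to match a literal question mark
detail_pat = re.compile(r'index\?id=\d+')

page_url = 'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=2'
detail_url = 'http://wz.sun0769.com/political/politics/index?id=445221'

print(bool(page_pat.search(page_url)))      # True: the pagination rule follows this link
print(bool(detail_pat.search(detail_url)))  # True: the detail rule hands this to parse_item

# Unescaped, 'index?' means "inde" followed by an optional 'x', so the
# pattern no longer matches the literal '?' in the URL at all:
print(re.search(r'index?id=\d+', detail_url))  # None
```

Dropping the backslash makes the rule silently match nothing, which is exactly the kind of bug that is hard to spot in a running crawl.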
2. Alternatively, spiders/xixi.py with a plain Spider (sometimes the URLs extracted by CrawlSpider get rewritten by the server and the requests fail; if you don't want to dig into that, a plain Spider works)

# -*- coding: utf-8 -*-
import scrapy
from ..items import DongguanspiderItem

class XixiSpider(scrapy.Spider):
    name = 'xixi'
    allowed_domains = ['wz.sun0769.com']

    url = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page="
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        # links to all posts on the current page
        links = response.xpath("//li[@class='clear']/span[3]/a/@href").extract()
        # iterate over the extracted links
        for link in links:
            # request each post and handle the response in self.parse_item
            yield scrapy.Request('http://wz.sun0769.com/' + link, callback=self.parse_item)

        # until the stop condition is met, keep incrementing offset and
        # requesting the next page, handled again by parse
        if self.offset <= 20:
            self.offset += 1
            # queue the next page request; the response goes back to self.parse
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

    # handle the response for each post
    def parse_item(self, response):
        print(response.url)
        item = DongguanspiderItem()
        item['title'] = response.xpath("//div[@class='mr-three']/p/text()").extract()[0]
        item['content'] = response.xpath("//div[@class='details-box']/p/text()").extract()[0].replace("\n", "").strip()  # strip() removes surrounding whitespace
        item['url'] = response.url
        item['number'] = response.xpath("//div[@class='mr-three']/div/span[4]/text()").extract()[0].split(":")[-1]  # keep the digits after the colon

        # hand off to the pipeline
        yield item
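The pagination above is just a fixed prefix plus a growing page number. A standalone sketch of the same logic (with a hypothetical limit of 3 instead of the spider's 20):

```python
# minimal sketch of the spider's pagination: url + str(offset), offset checked before increment
base = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page="

def page_urls(base, limit):
    """Return the page URLs the spider would request, in order."""
    offset = 0
    urls = [base + str(offset)]       # start_urls = [url + str(offset)]
    while offset <= limit:            # mirrors `if self.offset <= limit` on each parse call
        offset += 1
        urls.append(base + str(offset))
    return urls

urls = page_urls(base, 3)
print(urls[0])   # ...page=0
print(urls[-1])  # ...page=4
```

Note the off-by-one shape: because the check happens before the increment, a limit of N produces pages 0 through N+1, which matches what the spider's `if self.offset <= 20` actually requests.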

 

4. Store the results (pipelines.py)

Edit settings.py in the following places (this also writes the log to a file):

# name of the log file
LOG_FILE = 'dongguanlog.log'
# log level; messages at or below this level are written to the file
LOG_LEVEL = 'DEBUG'

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'dongguanSpider.pipelines.DongguanspiderPipeline': 300,
}

Write pipelines.py:

import json

class DongguanspiderPipeline(object):
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        # open with an explicit encoding so Chinese text is written correctly
        with open('dongguan.json', 'a', encoding='utf-8') as f:
            f.write(jsontext)
        return item
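The pipeline appends one JSON object per line with a trailing comma; `ensure_ascii=False` keeps Chinese text readable instead of `\uXXXX` escapes. A quick standalone check, using a made-up item dict in place of `dict(item)`:

```python
import json

# a made-up item standing in for dict(item) in the pipeline
item = {'title': '投诉标题', 'url': 'http://example.com/post', 'number': '201234'}

jsontext = json.dumps(item, ensure_ascii=False) + ',\n'
print(jsontext)  # Chinese characters appear as-is

# with ensure_ascii left at its default of True, they would be escaped:
print(json.dumps(item))  # title becomes \uXXXX sequences
```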

Run the crawl:

scrapy crawl dongguan

or:

scrapy crawl xixi

posted on 2020-03-14 14:35 by cherry_ning
