scrapy CrawlSpiders - URL crawling example 2
1. Create a new project (scrapy startproject)
scrapy startproject dongguanSpider
2. Define the targets (dongguanSpider/items.py)
import scrapy

class DongguanspiderItem(scrapy.Item):
    # define the fields for your item here like:
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    number = scrapy.Field()
3. Write the spider (spiders/donguan.py or spiders/xixi.py)
scrapy genspider -t crawl donguan "wz.sun0769.com"
1. spiders/donguan.py: extract URLs with CrawlSpider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import DongguanspiderItem

class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=']

    rules = (
        # follow is a boolean that decides whether links extracted by this rule
        # should themselves be followed. If callback is None, follow defaults to
        # True, otherwise it defaults to False.
        # follow=True keeps following pagination links; with follow=False only
        # the first page would be crawled and later pages would never be reached.
        Rule(LinkExtractor(allow=r'id=1&page=\d+'), follow=True),
        # If the web server mangles the extracted URLs, repair them with process_links:
        # Rule(LinkExtractor(allow=r'id=1&page=\d+'), process_links='deal_links'),
        # Careful with the question mark: '?' is a regex metacharacter and must be
        # escaped, which is easy to miss if you are not fluent with regular expressions.
        Rule(LinkExtractor(allow=r'index\?id=\d+'), callback='parse_item'),
    )

    # # If the web server mangles the URLs, repair them here.
    # def deal_links(self, links):  # links: the list of links extracted from the response
    #     for each in links:
    #         each.url = each.url.replace("?", "&").replace("Type&", "Type?")
    #     return links

    def parse_item(self, response):
        print(response.url)
        item = DongguanspiderItem()
        item['title'] = response.xpath("//div[@class='mr-three']/p/text()").extract()[0]
        item['content'] = response.xpath("//div[@class='details-box']/p/text()").extract()[0].replace("\n", "").strip()  # strip() removes surrounding whitespace
        item['url'] = response.url
        item['number'] = response.xpath("//div[@class='mr-three']/div/span[4]/text()").extract()[0].split(":")[-1]

        yield item
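Two pitfalls in the rules above can be verified outside Scrapy with plain Python. First, `?` is a regex metacharacter, so the detail-page pattern only matches a literal question mark once it is escaped. Second, the replace chain from the commented-out `deal_links` repair can be checked on a hand-made mangled link (the `questionType` URL below is a hypothetical example for illustration, not taken from the site):

```python
import re

# '?' makes the preceding token optional, so the unescaped pattern
# cannot match a literal question mark in the URL.
url = "http://wz.sun0769.com/political/politics/index?id=445736"
print(bool(re.search(r'index\?id=\d+', url)))   # escaped: True
print(bool(re.search(r'index?id=\d+', url)))    # unescaped: False

# The deal_links repair: replace every '?' with '&', then restore
# the query separator after 'Type'. (hypothetical mangled link)
mangled = "http://wz.sun0769.com/index.php/question/questionType?type=4?page=30"
fixed = mangled.replace("?", "&").replace("Type&", "Type?")
print(fixed)  # .../questionType?type=4&page=30
```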
2. Or spiders/xixi.py: crawl with a plain Spider (sometimes the URLs extracted by CrawlSpider get mangled and the requests fail to go through; if you don't want to dig into the cause, a plain Spider works too)
# -*- coding: utf-8 -*-
import scrapy
from ..items import DongguanspiderItem

class XixiSpider(scrapy.Spider):
    name = 'xixi'
    allowed_domains = ['wz.sun0769.com']

    url = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page="
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        # Links to all the posts on the current page
        links = response.xpath("//li[@class='clear']/span[3]/a/@href").extract()
        # Iterate over the extracted links
        for link in links:
            # Schedule a request for each post; the response is handled by self.parse_item
            yield scrapy.Request(r'http://wz.sun0769.com/' + link, callback=self.parse_item)

        # Until the stop condition is met, keep incrementing offset and
        # requesting the next page, handled again by self.parse
        if self.offset <= 20:
            self.offset += 1
            # Schedule the next page request; the response is handled by self.parse
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

    # Parse the response of each individual post
    def parse_item(self, response):
        print(response.url)
        item = DongguanspiderItem()
        item['title'] = response.xpath("//div[@class='mr-three']/p/text()").extract()[0]
        item['content'] = response.xpath("//div[@class='details-box']/p/text()").extract()[0].replace("\n", "").strip()  # strip() removes surrounding whitespace
        item['url'] = response.url
        item['number'] = response.xpath("//div[@class='mr-three']/div/span[4]/text()").extract()[0].split(":")[-1]

        # Hand the item over to the pipeline
        yield item
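The spider above builds detail-page URLs by string concatenation (`'http://wz.sun0769.com/' + link`). If the extracted `href` is site-relative (i.e. it already starts with `/`), `urllib.parse.urljoin` is a safer way to combine it with the page URL. A small sketch, using a hypothetical relative href of the kind the post list might return:

```python
from urllib.parse import urljoin

base = "http://wz.sun0769.com/political/index/politicsNewest?id=1&page=0"
href = "/political/politics/index?id=445736"  # hypothetical relative href

# urljoin resolves the href against the page URL without doubling slashes
print(urljoin(base, href))
# -> http://wz.sun0769.com/political/politics/index?id=445736

# naive concatenation produces '//' when the href already starts with '/'
print("http://wz.sun0769.com/" + href)
# -> http://wz.sun0769.com//political/politics/index?id=445736
```

Inside a Scrapy callback, `response.urljoin(link)` performs the same resolution against `response.url`.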
4. Store the content (pipelines.py)
Edit the following in settings.py (this also writes the log output to a file):
# File name for the log output
LOG_FILE = 'dongguanlog.log'
# Log level: messages at or below this level are saved
LOG_LEVEL = 'DEBUG'

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'dongguanSpider.pipelines.DongguanspiderPipeline': 300,
}
Write the pipelines.py file:
import json

class DongguanspiderPipeline(object):
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        # open with utf-8 so the unescaped Chinese text is written correctly
        with open('dongguan.json', 'a', encoding='utf-8') as f:
            f.write(jsontext)
        return item
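The `ensure_ascii=False` argument is what keeps the Chinese text readable in dongguan.json: with the default `ensure_ascii=True`, `json.dumps` escapes every non-ASCII character as `\uXXXX`. A quick check with a made-up item dict:

```python
import json

item = {"title": "咨询", "number": "445736"}  # made-up sample values

print(json.dumps(item))                      # default: non-ASCII becomes \uXXXX escapes
print(json.dumps(item, ensure_ascii=False))  # keeps the Chinese characters as-is
```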
Run the crawl:

scrapy crawl dongguan

or:

scrapy crawl xixi
posted on 2020-03-14 14:35 by cherry_ning
