Scrapy CrawlSpider: a URL-crawling example
1. Create a new project (scrapy startproject)
scrapy startproject wdzurlSpider
2. Define the targets (wdzurlSpider/items.py)
import scrapy

class WdzurlspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    zan = scrapy.Field()
3. Build the spider (spiders/wdzurl.py)
1. Generate the spider skeleton: scrapy genspider -t crawl wdzurl "waduanzi.com"
Debugging with scrapy shell:


***********************************************************************************************************************************
Debugging LinkExtractor:
scrapy shell "http://www.waduanzi.com/"

from scrapy.linkextractors import LinkExtractor

a = LinkExtractor(allow=(r'archives/\d+',))
a.extract_links(response)

b = LinkExtractor(allow=(r'archives/\d+$',))
b.extract_links(response)
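The difference between the two patterns above is the `$` anchor: without it, `archives/\d+` also matches longer URLs such as comment pages. A minimal stdlib sketch (using `re` directly instead of LinkExtractor, with made-up example URLs) shows the effect:

```python
import re

# hypothetical URLs of the kind LinkExtractor would see on the listing page
urls = [
    "http://www.waduanzi.com/archives/12345",
    "http://www.waduanzi.com/archives/12345/comment-page-1",
]

loose = re.compile(r"archives/\d+")    # matches anywhere in the URL
strict = re.compile(r"archives/\d+$")  # matches only when the URL ends in digits

loose_hits = [u for u in urls if loose.search(u)]    # both URLs
strict_hits = [u for u in urls if strict.search(u)]  # only the first URL
```

This is why the spider below uses the `$`-anchored pattern: it keeps only the post-detail links.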

2. Open wdzurl.py in the wdzurlSpider/spiders directory; the code is as follows:
# -*- coding: utf-8 -*-

# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# Import the link-extraction class, used to pull links matching a pattern
from scrapy.linkextractors import LinkExtractor
from ..items import WdzurlspiderItem

class WdzurlSpider(CrawlSpider):
    name = 'wdzurl'
    allowed_domains = ['waduanzi.com']
    start_urls = ['http://waduanzi.com/']

    # Extraction rule for links in the response; yields the list of links matching the pattern
    pagelink = LinkExtractor(allow=(r'archives/\d+$',))  # only links on the first page matching this regex are crawled

    rules = [
        # Request each extracted link in turn. With follow=True the spider keeps following
        # links from those pages; the callback handles each response
        Rule(pagelink, callback='parseWdz', follow=False)  # detail pages need no further link extraction, so no follow here
    ]

    # The callback named in the rule
    def parseWdz(self, response):
        item = WdzurlspiderItem()
        item['name'] = response.xpath("//div[@class='post-author']/a/text()").extract()[0]

        # title is sometimes empty, hence the guard below
        title = response.xpath("//div[@class='item-title']/h1/text()").extract()
        if len(title) == 0:
            item['title'] = ''
        else:
            item['title'] = title[0]
        item['zan'] = response.xpath("//div[@class='item-toolbar']/ul/li[1]/a/text()").extract()[0]

        yield item

    # def parse(self, response):  # do not define a method named parse in a CrawlSpider
    #     pass
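The empty-title guard in the callback is a common pattern (in newer Scrapy versions, `response.xpath(...).get(default='')` does the same in one call). A stdlib sketch of the same logic, with a hypothetical helper name:

```python
def first_or_default(values, default=''):
    """Return the first element of a list, or a default when the list is empty."""
    return values[0] if values else default

# mimic the two cases extract() can return: a non-empty and an empty list
full_result = first_or_default(['Some title'])
empty_result = first_or_default([])
```

Here `full_result` is `'Some title'` and `empty_result` is `''`, matching the if/else branch in `parseWdz`.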
4. Store the content (pipelines.py)
Modify the following parts of settings.py (log output can be written to a file):
# File name for saved log output
LOG_FILE = 'wdzurllog.log'
# Log level: messages at or below this level are saved
LOG_LEVEL = 'DEBUG'

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'wdzurlSpider.pipelines.WdzurlspiderPipeline': 300,
}
Write the pipelines.py file:
import json

class WdzurlspiderPipeline(object):
    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the output file;
        # note the trailing comma makes each line comma-terminated rather than strict JSON Lines
        jsontext = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        with open('wdz.json', 'a', encoding='utf-8') as f:
            f.write(jsontext)
        return item
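The pipeline appends one serialized item per line. A stdlib sketch of the same write-then-read round trip, using a temporary file and a made-up item (the trailing comma from the pipeline is dropped here so each line parses as standard JSON Lines):

```python
import json
import os
import tempfile

# hypothetical item of the shape the spider yields
item = {'name': 'cherry', 'title': '一个段子', 'zan': '10'}

path = os.path.join(tempfile.mkdtemp(), 'wdz.json')

# append one JSON object per line, as the pipeline does
with open(path, 'a', encoding='utf-8') as f:
    f.write(json.dumps(item, ensure_ascii=False) + '\n')

# read the records back
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]
```

`ensure_ascii=False` writes the Chinese title as-is instead of `\uXXXX` escapes, which is why the file must be opened with an explicit UTF-8 encoding.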
Run the crawl:

scrapy crawl wdzurl



posted on 2020-03-12 23:46 by cherry_ning