Change into the working directory
(blogproject_env) H:\>c:
(blogproject_env) C:\Users\admin>cd C:\Users\admin\Desktop\xiangmu
Create the Scrapy project
(blogproject_env) C:\Users\admin\Desktop\xiangmu>scrapy startproject ArticleSpider
The project is created successfully:
New Scrapy project 'ArticleSpider', using template directory 'h:\\blog\\blogproject_env\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\admin\Desktop\xiangmu\ArticleSpider

You can start your first spider with:
    cd ArticleSpider
    scrapy genspider example example.com
Create the spider for the site we want to crawl
Directory structure:
ArticleSpider
    spiders/            spider directory: create spider files here and write the crawling rules
        __init__.py
    __init__.py
    items.py            data-storage templates for structured data, similar to Django's Model
    middlewares.py
    pipelines.py        data-processing behaviour, e.g. persisting the structured data
    settings.py         configuration: recursion depth, concurrency, download delay, etc.
scrapy.cfg              the project's top-level configuration (the real crawler settings live in settings.py)
(blogproject_env) C:\Users\admin\Desktop\xiangmu>cd ArticleSpider

(blogproject_env) C:\Users\admin\Desktop\xiangmu\ArticleSpider>scrapy genspider jobbole blog.jobbole.com
Created spider 'jobbole' using template 'basic' in module:
  ArticleSpider.spiders.jobbole

(blogproject_env) C:\Users\admin\Desktop\xiangmu\ArticleSpider>

Tip
Scrapy spiders cannot be debugged directly in the IDE, so create your own entry file (main.py) for debugging. The full code:
# _*_ coding:utf-8 _*_
__author__ = 'admin'
__date__ = '2017/10/29'
from scrapy.cmdline import execute   # Scrapy's command-line helper, used to run the crawl from a script
import sys, os

# add the directory containing main.py to sys.path
# os.path.abspath(__file__) is the path of this main.py file
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# print(sys.path.append(os.path.dirname(os.path.abspath(__file__))))
# H:\blog\blogproject_env\Scripts\python.exe C:/Users/admin/Desktop/xiangmu/ArticleSpider/main.py

execute(['scrapy', 'crawl', 'jobbole'])  # with this entry point set up, breakpoints work in the IDE

Run Scrapy
(blogproject_env) C:\Users\admin\Desktop\xiangmu\ArticleSpider>scrapy crawl jobbole
The following error appears:
ImportError: No module named 'win32api'
The cause is the missing win32api module; installing it fixes the error:
pip install -i https://pypi.douban.com/simple pypiwin32
Configure settings.py
# Obey robots.txt rules
# whether to obey the robots.txt protocol
ROBOTSTXT_OBEY = False
Create main.py in the project root
# _*_ coding:utf-8 _*_
__author__ = 'admin'
__date__ = '2017/10/8'
from scrapy.cmdline import execute
import sys, os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# os.path.abspath(__file__)  -- path of the current .py file
# os.path.dirname()          -- its parent directory
# print(os.path.dirname(os.path.abspath(__file__)))
# C:\Users\admin\Desktop\xiangmu\ArticleSpider

execute(['scrapy', 'crawl', 'jobbole'])
Build the spider under spiders/
import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/110287/']  # the URL to crawl

    def parse(self, response):
        re_selector = response.xpath('//*[@id="post-110287"]/div[1]/h1')
How to use the scrapy shell: running and debugging inside PyCharm is fairly inefficient, so Scrapy provides a shell that can be used for debugging from the command line:
scrapy shell http://blog.jobbole.com/110287/
The shell starts up and prints:
2017-10-08 16:53:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-08 16:53:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-08 16:53:04 [scrapy.core.engine] INFO: Spider opened
2017-10-08 16:53:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/110287/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x01CE6830>
[s]   item       {}
[s]   request    <GET http://blog.jobbole.com/110287/>
[s]   response   <200 http://blog.jobbole.com/110287/>
[s]   settings   <scrapy.settings.Settings object at 0x04A86F70>
[s]   spider     <JobboleSpider 'jobbole' at 0x4be8fd0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True])  Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                   Fetch a scrapy.Request and update local objects
[s]   shelp()                      Shell help (print this help)
[s]   view(response)               View response in a browser
Usage:
>>> title = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()')
>>> title
[<Selector xpath='//*[@id="post-110287"]/div[1]/h1/text()' data='2016 腾讯软件开发面试题(部分)'>]
>>> title.extract()
['2016 腾讯软件开发面试题(部分)']
>>>
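A small side note (a sketch, not part of the original shell session): extract()[0] raises IndexError when the XPath matches nothing, while extract_first() lets you supply a default, which is the safer form used in the complete spider later on.

# safer equivalent of title.extract()[0]: returns '' instead of raising when nothing matches
title = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()').extract_first('')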
Extract the publication date
>>> create_time = response.xpath("//p[@class='entry-meta-hide-on-mobile']")
>>> create_time
[<Selector xpath="//p[@class='entry-meta-hide-on-mobile']" data='<p class="entry-meta-hide-on-mobile">\r\n\r'>]
>>> create_time.extract()
['<p class="entry-meta-hide-on-mobile">\r\n\r\n 2017/02/18 · <a href="http://blog.jobbole.com/category/career/" rel="category tag">职场</a>\r\n \r\n · <a href="#article-comment"> 7 评论 </a>\r\n \r\n\r\n \r\n · <a href="http://blog.jobbole.com/tag/%e9%9d%a2%e8%af%95/">面试</a>\r\n \r\n</p>']
>>> create_time = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip()
>>> create_time
'2017/02/18 ·'
>>> create_time.replace('·', ' ')
'2017/02/18 '
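To turn the cleaned string into a real date object, here is a minimal sketch (not in the original post) using only the standard library, assuming the '2017/02/18' format shown above:

import datetime

create_time = '2017/02/18 ·'
# drop the trailing separator and whitespace, then parse as year/month/day
create_date = datetime.datetime.strptime(create_time.replace('·', '').strip(), '%Y/%m/%d').date()
print(create_date)  # 2017-02-18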
Extract the comment count from the anchor selected by its href attribute:
comment_nums = response.xpath('//a[@href="#article-comment"]/span/text()').extract()[0]
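The extracted value is a string such as ' 7 评论 ', so the digits still have to be pulled out. A small sketch of the regular-expression step (the same pattern reappears in the complete code further down):

import re

comment_nums = ' 7 评论 '  # the string extracted above
match_re = re.match(r'.*?(\d+).*', comment_nums)
comment_nums = int(match_re.group(1)) if match_re else 0
print(comment_nums)  # 7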
When an element carries several classes and you want to match on just one of them, use contains():

response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]
List comprehension: keep only the tags that do not end with 评论 (comments):
>>> tag_list = ['职场', ' 7 评论 ', '面试']
>>> [element for element in tag_list if not element.strip().endswith('评论')]
['职场', '面试']
>>> [elt for elt in tag_list if not elt.strip().endswith('评论')]
['职场', '面试']
>>>
Complete code
# jobbole.py
# -*- coding: utf-8 -*-
import scrapy
import re
from scrapy.http import Request
from urllib import parse
from ArticleSpider.items import JobBloleArticleItem
from ArticleSpider.utils.common import get_md5


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        '''1. Extract the article URLs on the list page and hand them to Scrapy to download and parse.
           2. Extract the next-page URL and hand it to Scrapy; once downloaded it comes back to parse.'''
        post_nodes = response.css('#archive .floated-thumb .post-thumb a')
        for post_node in post_nodes:
            image_url = post_node.css('img::attr(src)').extract_first('')
            post_url = post_node.css('::attr(href)').extract_first('')
            yield Request(url=parse.urljoin(response.url, post_url),
                          meta={'front_image_url': image_url},
                          callback=self.parse_detail)

        next_url = response.css('.next.page-numbers::attr(href)').extract_first('')
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        # extract the concrete fields of an article

        # XPath version:
        # title = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()').extract_first()  # title
        # create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace('·', '').strip()  # creation date
        # praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]  # up-votes
        # fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]  # bookmarks
        # match_re = re.match(".*?(\d+).*", fav_nums)
        # if match_re:
        #     fav_nums = int(match_re.group(1))
        # else:
        #     fav_nums = 0
        # comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]  # comments
        # match_re = re.match(".*?(\d+).*", comment_nums)
        # if match_re:
        #     comment_nums = int(match_re.group(1))
        # else:
        #     comment_nums = 0
        # content = response.xpath("//div[@class='entry']").extract()[0]
        # tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
        # tag = ','.join(tag_list)

        # extract the same fields with CSS selectors
        article_item = JobBloleArticleItem()
        front_image_url = response.meta.get('front_image_url', '')  # article cover image
        title = response.css('.entry-header h1::text').extract()[0]
        create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·', '').strip()
        praise_nums = response.css('.vote-post-up h10::text').extract()[0]
        fav_nums = response.css('span.bookmark-btn::text').extract()[0]  # bookmarks
        match_re = re.match(".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0
        comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
        match_re = re.match(".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        content = response.css("div.entry").extract()[0]
        tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
        tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
        tags = ','.join(tag_list)

        article_item['url_object_id'] = get_md5(response.url)
        article_item['title'] = title
        article_item['create_date'] = create_date
        article_item['url'] = response.url
        article_item['front_image_url'] = [front_image_url]
        article_item['praise_nums'] = praise_nums
        article_item['comment_nums'] = comment_nums
        article_item['tags'] = tags
        article_item['fav_nums'] = fav_nums
        article_item['content'] = content

        yield article_item  # yielding the item sends it on to the pipelines
Items configuration (items.py)
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class JobBloleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    tags = scrapy.Field()
    fav_nums = scrapy.Field()
    content = scrapy.Field()
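A quick sketch (not part of the original post) of how such an item behaves: fields are read and written like dictionary keys, and only the declared fields are allowed, so typos in field names surface immediately.

from ArticleSpider.items import JobBloleArticleItem

item = JobBloleArticleItem()
item['title'] = 'some title'   # dict-style assignment to a declared field
item['fav_nums'] = 0
print(item['title'])
# item['titel'] = '...' would raise KeyError because that field is not declared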
settings.py configuration
import os, sys

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'ArticleSpider.pipelines.ArticleImagepiple': 1,
}
# every item flows through these pipelines; the smaller the number, the earlier it runs

IMAGES_URLS_FIELD = 'front_image_url'  # tells the image pipeline which item field holds the image URLs
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')  # images/ directory inside the ArticleSpider package

IMAGES_MIN_HEIGHT = 100  # minimum height and width for downloaded images
IMAGES_MIN_WIDTH = 100
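One easy-to-miss prerequisite: Scrapy's ImagesPipeline relies on Pillow for image handling and the size checks, so it also has to be installed in the virtualenv (using the same mirror as before):

pip install -i https://pypi.douban.com/simple pillow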
pipelines.py: store the image download path on the item
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline


class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item


class ArticleImagepiple(ImagesPipeline):
    def item_completed(self, results, item, info):
        # item_completed is inherited from ImagesPipeline and overridden here;
        # each entry in results carries a dict with the saved file's 'path',
        # which we copy onto the item
        for ok, value in results:
            image_file_path = value['path']
            item['front_image_path'] = image_file_path
        return item
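results is a list of (success, info) tuples, one per URL in front_image_url; when a download fails, the loop above would still try to read value['path'] from a failure entry. A slightly more defensive sketch (my own variant, not the original code) that only keeps successful downloads:

from scrapy.pipelines.images import ImagesPipeline


class ArticleImagepiple(ImagesPipeline):
    def item_completed(self, results, item, info):
        # keep only the paths of images that actually downloaded (ok is True)
        image_paths = [value['path'] for ok, value in results if ok]
        if image_paths:
            item['front_image_path'] = image_paths[0]
        return item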
URL MD5 helper (utils/common.py)

pass
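The helper's body is missing here (only a pass remains). A minimal sketch of what the get_md5 function imported in jobbole.py above could look like, assuming it simply hashes the URL so it can serve as url_object_id:

# ArticleSpider/utils/common.py
import hashlib

def get_md5(url):
    # hashlib works on bytes, so encode str input first
    if isinstance(url, str):
        url = url.encode('utf-8')
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()

if __name__ == '__main__':
    print(get_md5('http://blog.jobbole.com/110287/'))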
