
Navigate to the project folder

(blogproject_env) H:\>c:

(blogproject_env) C:\Users\admin>cd C:\Users\admin\Desktop\xiangmu

Create the Scrapy project

(blogproject_env) C:\Users\admin\Desktop\xiangmu>scrapy startproject ArticleSpider

Project created successfully:

  

New Scrapy project 'ArticleSpider', using template directory 'h:\\blog\\blogproject_env\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Users\admin\Desktop\xiangmu\ArticleSpider

You can start your first spider with:
cd ArticleSpider
scrapy genspider example example.com

Create the spider for the target site

Directory structure:

ArticleSpider
    spiders/        spider directory; the spider files and their crawling rules are created here
        __init__.py
    __init__.py
    items.py        data models for the scraped data, similar to Django's Model
    middlewares.py
    pipelines.py    data-processing behaviour, e.g. persisting the structured data
    settings.py     project configuration, e.g. crawl depth, concurrency, download delay
    scrapy.cfg      the project's top-level config file (the real crawler settings live in settings.py)
 


(blogproject_env) C:\Users\admin\Desktop\xiangmu>cd ArticleSpider

(blogproject_env) C:\Users\admin\Desktop\xiangmu\ArticleSpider>scrapy genspider jobbole blog.jobbole.com
Created spider 'jobbole' using template 'basic' in module:
  ArticleSpider.spiders.jobbole

(blogproject_env) C:\Users\admin\Desktop\xiangmu\ArticleSpider>


Tip

    Breakpoints cannot be hit when a spider is launched with "scrapy crawl" from the command line, so create a small entry script of your own and debug that in the IDE instead. The full code:

# -*- coding: utf-8 -*-
__author__ = 'admin'
__date__ = '2017/10/29'

import os
import sys

from scrapy.cmdline import execute  # lets us invoke the scrapy command line from a script

# add the directory containing this main.py (the project root) to sys.path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# print(os.path.dirname(os.path.abspath(__file__)))  # e.g. C:/Users/admin/Desktop/xiangmu/ArticleSpider

execute(['scrapy', 'crawl', 'jobbole'])  # run/debug this file in the IDE and breakpoints will be hit

 

 

Run the spider

(blogproject_env) C:\Users\admin\Desktop\xiangmu\ArticleSpider>scrapy crawl jobbole

The following error appears:

ImportError: No module named 'win32api'

The cause is that the win32api module is missing; installing pypiwin32 fixes it:

pip install -i https://pypi.douban.com/simple pypiwin32

Configure settings.py

# Obey robots.txt rules   # whether to obey the site's robots.txt
ROBOTSTXT_OBEY = False
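Leaving this at the default True makes Scrapy fetch the site's robots.txt first and silently drop any request the file disallows, so for this exercise it is turned off.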

Create main.py in the project root

# -*- coding: utf-8 -*-
__author__ = 'admin'
__date__ = '2017/10/8'

import os
import sys

from scrapy.cmdline import execute

# os.path.abspath(__file__)  -> path of this .py file
# os.path.dirname(...)       -> its parent directory
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# print(os.path.dirname(os.path.abspath(__file__)))  # C:\Users\admin\Desktop\xiangmu\ArticleSpider

execute(['scrapy', 'crawl', 'jobbole'])

Build the spider under spiders/

import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/110287/']  # the URL to crawl

    def parse(self, response):
        re_selector = response.xpath('//*[@id="post-110287"]/div[1]/h1')

Using the scrapy shell: running and debugging inside PyCharm every time is fairly slow, so Scrapy provides a shell that lets you try selectors interactively from the command line:

scrapy shell  http://blog.jobbole.com/110287/

 

The shell starts and prints:

2017-10-08 16:53:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-08 16:53:04 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-08 16:53:04 [scrapy.core.engine] INFO: Spider opened
2017-10-08 16:53:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/110287/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x01CE6830>
[s]   item       {}
[s]   request    <GET http://blog.jobbole.com/110287/>
[s]   response   <200 http://blog.jobbole.com/110287/>
[s]   settings   <scrapy.settings.Settings object at 0x04A86F70>
[s]   spider     <JobboleSpider 'jobbole' at 0x4be8fd0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

 

Extract the title:

>>> title = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()')
>>> title
[<Selector xpath='//*[@id="post-110287"]/div[1]/h1/text()' data='2016 腾讯软件开发面试题(部分)'>]
>>> title.extract()
['2016 腾讯软件开发面试题(部分)']
>>>

Extract the date:

>>> ctrate_time = response.xpath("//p[@class='entry-meta-hide-on-mobile']")
>>> ctrate_time
[<Selector xpath="//p[@class='entry-meta-hide-on-mobile']" data='<p class="entry-meta-hide-on-mobile">\r\n\r'>]
>>> ctrate_time.extract()
['<p class="entry-meta-hide-on-mobile">\r\n\r\n            2017/02/18 ·  <a href="http://blog.jobbole.com/category/career/" rel="category tag">职场</a>\r\n            \r\n                            · <a href="#article-comment"> 7 评论 </a>\r\n            \r\n\r\n            \r\n             ·  <a href="http://blog.jobbole.com/tag/%e9%9d%a2%e8%af%95/">面 试</a>\r\n            \r\n</p>']

>>> ctrate_time = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip()
>>> ctrate_time
'2017/02/18 ·'


>>> ctrate_time.replace('·',' ')
'2017/02/18 '

Extract the anchor whose href points to #article-comment (the comment count):

comment_nums = response.xpath('//a[@href="#article-comment"]/span/text()').extract()[0]
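The extracted text still contains surrounding characters (for example ' 7 评论 '); the complete spider below pulls the number out with a regular expression. A minimal sketch of that step, using the same pattern as the full code:

import re

comment_nums = ' 7 评论 '                            # raw text taken from the span
match_re = re.match(r".*?(\d+).*", comment_nums)     # grab the first run of digits
comment_nums = int(match_re.group(1)) if match_re else 0
print(comment_nums)  # 7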

 

If an element has multiple classes and you want to match on just one of them, use contains():

 

response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]

 

 

 

List comprehension: keep only the tags that do not end with '评论' (comments):

>>> tag_list=['职场', ' 7 评论 ', '面试']
>>> [element for element in tag_list if not element.strip().endswith('评论')]
['职场', '面试']
>>> tag_list=['职场', ' 7 评论 ', '面试']
>>> [elt for elt in tag_list if not elt.strip().endswith('评论')]
['职场', '面试']
>>>

The complete code

# jobbole.py
# -*- coding: utf-8 -*-
import re
from urllib import parse

import scrapy
from scrapy.http import Request

from ArticleSpider.items import JobBloleArticleItem
from ArticleSpider.utils.common import get_md5


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        """1. Extract the article URLs from the list page and hand them to Scrapy to download and parse.
        2. Extract the next-page URL and hand it to Scrapy; once downloaded it is passed back to parse."""
        post_nodes = response.css('#archive .floated-thumb .post-thumb a')
        for post_node in post_nodes:
            image_url = post_node.css('img::attr(src)').extract_first('')
            post_url = post_node.css('::attr(href)').extract_first('')
            yield Request(url=parse.urljoin(response.url, post_url),
                          meta={'front_image_url': image_url},
                          callback=self.parse_detail)
        next_url = response.css('.next.page-numbers::attr(href)').extract_first('')
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

    def parse_detail(self, response):
        # Extract the article fields with XPath (kept for reference):
        # title = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()').extract_first()  # title
        # create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace('·', '').strip()  # creation date
        # praise_nums = response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0]  # up-votes
        # fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract()[0]  # favourites
        # match_re = re.match(r".*?(\d+).*", fav_nums)
        # if match_re:
        #     fav_nums = int(match_re.group(1))
        # else:
        #     fav_nums = 0
        # comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]  # comments
        # match_re = re.match(r".*?(\d+).*", comment_nums)
        # if match_re:
        #     comment_nums = int(match_re.group(1))
        # else:
        #     comment_nums = 0
        # content = response.xpath("//div[@class='entry']").extract()[0]
        # tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
        # tags = ','.join(tag_list)

        # Extract the same fields with CSS selectors
        article_item = JobBloleArticleItem()
        front_image_url = response.meta.get('front_image_url', '')  # article cover image
        title = response.css('.entry-header h1::text').extract()[0]
        create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·', '').strip()
        praise_nums = response.css('.vote-post-up h10::text').extract()[0]
        fav_nums = response.css('span.bookmark-btn::text').extract()[0]  # favourites
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0
        comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0
        content = response.css("div.entry").extract()[0]
        tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
        tag_list = [element for element in tag_list if not element.strip().endswith('评论')]
        tags = ','.join(tag_list)

        article_item['url_object_id'] = get_md5(response.url)
        article_item['title'] = title
        article_item['create_date'] = create_date
        article_item['url'] = response.url
        article_item['front_image_url'] = [front_image_url]
        article_item['praise_nums'] = praise_nums
        article_item['comment_nums'] = comment_nums
        article_item['tags'] = tags
        article_item['fav_nums'] = fav_nums
        article_item['content'] = content

        yield article_item  # after yield, the item is passed on to the pipelines
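Both the commented-out XPath version and the CSS-selector version extract the same fields; the CSS selectors are kept as the live code because they are shorter and, unlike the XPath copied from the browser, do not hard-code the post ID (post-110287), so the same parse_detail works for every article reached from the list page.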

Configure items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ArticlespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass



class JobBloleArticleItem(scrapy.Item):

    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()

    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    tags = scrapy.Field()
    fav_nums = scrapy.Field()
    content = scrapy.Field()

 

settings.py: pipeline and image download settings

import os, sys

ITEM_PIPELINES = {
    'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'ArticleSpider.pipelines.ArticleImagepiple': 1,
}  # every item flows through these pipelines; the lower the number, the earlier it runs

IMAGES_URLS_FIELD = 'front_image_url'  # the item field the images pipeline reads the image URLs from
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')  # images/ folder under the ArticleSpider package
IMAGES_MIN_HEIGHT = 100  # minimum height and width; smaller images are filtered out
IMAGES_MIN_WIDTH = 100
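One detail to keep in mind: the images pipeline iterates over the field named in IMAGES_URLS_FIELD, which is why the complete spider stores the cover image as a one-element list (article_item['front_image_url'] = [front_image_url]). Assigning a bare string there makes the pipeline treat each character as a URL and the downloads fail.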

pipelines.py: record the image download path on the item

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline
class ArticlespiderPipeline(object):
    def process_item(self, item, spider):
        return item

class ArticleImagepiple(ImagesPipeline):
    def item_completed(self, results, item, info):
        # Overrides ImagesPipeline.item_completed. `results` is a list of (ok, value)
        # tuples; on success value['path'] is the path the image was saved to,
        # which we copy onto the item.
        image_file_path = ''
        for ok, value in results:
            if ok:
                image_file_path = value['path']
        item['front_image_path'] = image_file_path
        return item

URL MD5 helper


pass
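The post leaves this helper as a stub. A minimal sketch of what ArticleSpider/utils/common.py presumably contains, assuming get_md5 simply hashes the URL so it can be stored as the fixed-length url_object_id (the module path and function name follow the import in the complete spider above):

# ArticleSpider/utils/common.py
import hashlib


def get_md5(url):
    # hashlib works on bytes, so encode a str URL first
    if isinstance(url, str):
        url = url.encode('utf-8')
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()  # 32-character hex digest used as url_object_id


if __name__ == '__main__':
    print(get_md5('http://blog.jobbole.com/110287/'))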

 
