scrapy 抓取果壳首页推荐文章

　　一直对爬虫耿耿于怀，今天总算是实现了，编写了一个Python Scrapy的爬虫获取果壳网首页的推荐文章。

　　打开果壳首页的一篇推荐文章，URL如下http://www.guokr.com/article/439791/可以看到果壳的文章都是在/article/下，并且所有文章是以6位数字表示，url问题解决了

　　下来是获取页面的标题，（图像使用的imgur的服务，国内可能需要多加载一段时间，请耐心等待）

我用的chrome浏览器，F12进入开发模式查看到源码

<h1 itemprop="http://purl.org/dc/terms/title"id="articleTitle">网络头像揭示了你的多少个性</h1>

因此使用Xpath可以轻易获得该标题

开始动手：

新建一个spider：scrapy startproject guokr

修改item.py:添加需要的item项

class guokrItem(scrapy.Item):
    name = scrapy.Field()
    url = scrapy.Field()

创建spider：

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from guokr.items import guokrItem

class guokrSpider(CrawlSpider):
    name = "guokr"
    allowed_domains = ["guokr.com"]
    start_urls = [
    "http://www.guokr.com/"
    ]
    rules = (
        Rule(LinkExtractor(allow=r"article/\d+/"),callback='parse_guokr'),#设置rule使用正则从start_urls中匹配符合条件的url并返回给parse_guokr
    )
    def parse_guokr(self,response):
        item = guokrItem()
        item['url'] = response.url
        item['name'] = response.xpath("//h1[@id='articleTitle']/text()").extract()#获取h1标签下id属性为articleTitle的值，也就是标题
        return item

设置pipelines：

import json
import codecs
class GuokrPipeline(object):
    def __init__(self):
        self.file = codecs.open('guokr.json','wb',encoding='utf-8')#构造函数中定义编码
def process_item(self,item,spider):                 #输出json文件
    line = json.dumps(dict(item)) + '\n'
    self.file.write(line.decode('unicode_escape'))
    return item

在setting中添加pipeline定义：

ITEM_PIPELINES = {'guokr.pipelines.GuokrPipeline'}

运行：scrapy crawl guokr
目录下得到guokr.json文件

如果你觉得不过瘾：想看更多文章，可以酱紫：在spider里面Rules下添加一句Follow=True意思就是在获取到的页面接着获取（想想都可怕，多少万的文章，我的天）

我小小改了一下正则表达式，像这样

    rules = (
            Rule(LinkExtractor(allow=r"article/439\d+/"),callback='parse_guokr',follow=True),
        )

然后爬虫跑了10多秒，然后就酱紫了

第一次玩爬虫，感觉自己飞起来了。

posted @ 2015-02-15 13:45 zynick 阅读(329) 评论(0) 收藏举报

刷新页面返回顶部

zynick

scrapy 抓取果壳首页推荐文章

公告