Scraping newspaper names and addresses
Goal: scrape the names and addresses of newspapers across China
Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm
Purpose: practice scraping data with Scrapy
Having covered the basic usage of Scrapy, let's write the simplest possible spider.
Target screenshot: (image omitted)

1. Create the Scrapy project

```
$ cd ~/code/crawler/scrapyProject
$ scrapy startproject newSpapers
```
2. Generate the spider

```
$ cd newSpapers/
$ scrapy genspider nationalNewspaper news.xinhuanet.com
```
3. Define the item fields

```
$ cat items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NewspapersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    addr = scrapy.Field()
```
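A `scrapy.Item` is accessed like a dict whose keys are restricted to the declared fields. As a rough standard-library stand-in for that access pattern (illustration only; the real `NewspapersItem` comes from Scrapy, and this minimal class only mimics the field check):

```python
# Minimal stand-in mimicking scrapy.Item's dict-style field access.
# Illustration only: the real scrapy.Item does much more.
class Item(dict):
    fields = ('name', 'addr')

    def __setitem__(self, key, value):
        # Reject keys that were not declared as fields, like scrapy.Item does.
        if key not in self.fields:
            raise KeyError('unknown field: %s' % key)
        dict.__setitem__(self, key, value)

item = Item()
item['name'] = ['人民日报']
item['addr'] = ['http://paper.people.com.cn/']
print(item)
```

Assigning to an undeclared key (say `item['title']`) raises `KeyError`, which catches typos in field names early.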
4. Write the spider

```
$ cat spiders/nationalNewspaper.py
# -*- coding: utf-8 -*-
import scrapy
from newSpapers.items import NewspapersItem


class NationalnewspaperSpider(scrapy.Spider):
    name = "nationalNewspaper"
    allowed_domains = ["news.xinhuanet.com"]
    start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']

    def parse(self, response):
        # Row 2 of the table holds national papers; row 4 holds local ones
        # (sub2_local is selected but not used in this example).
        sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
        sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')
        tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
        items = []
        for each in tags_a_country:
            item = NewspapersItem()
            item['name'] = each.xpath('./strong/text()').extract()
            item['addr'] = each.xpath('./@href').extract()
            items.append(item)
        return items
```
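The XPath calls above pull each `<a>` tag's bold text (`./strong/text()`) and link target (`./@href`). The same extraction logic can be sketched without Scrapy, using only the standard library's `xml.etree` against a simplified stand-in for the page's table markup (the fragment below is illustrative, not the real page):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the target page's markup (assumed structure,
# not the real xinhuanet table).
html = """
<td><p>
  <a href="http://paper.people.com.cn/rmrb/"><strong>People's Daily</strong></a>
  <a href="http://www.gmw.cn/"><strong>Guangming Daily</strong></a>
</p></td>
"""

root = ET.fromstring(html)
items = []
for a in root.findall('./p/a'):      # plays the role of the '.../p/a' xpath
    strong = a.find('strong')
    items.append({
        'name': strong.text,         # equivalent of './strong/text()'
        'addr': a.get('href'),       # equivalent of './@href'
    })

print(items)
```

The real spider returns a list of `NewspapersItem` objects instead of plain dicts, but the name/href pairing is the same.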
5. Register the pipeline that handles the scraped results

```
$ cat settings.py
……
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
ITEM_PIPELINES = {'newSpapers.pipelines.NewspapersPipeline': 100}
```
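The value `100` is the pipeline's priority: when several pipelines are registered, items flow through them in ascending order of this number (conventionally 0-1000). A small sketch of that ordering, where the second entry is hypothetical:

```python
# Items pass through pipelines in ascending priority order; lower runs first.
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,
    'newSpapers.pipelines.LaterPipeline': 300,  # hypothetical second pipeline
}

order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)  # NewspapersPipeline would see each item before LaterPipeline
```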
6. Write the data-processing pipeline

```
$ cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time


class NewspapersPipeline(object):
    def process_item(self, item, spider):
        now = time.strftime('%Y-%m-%d', time.localtime())  # unused here
        filename = 'newspaper.txt'
        print '================='
        print item
        print '================'
        # Append one tab-separated "name<TAB>address" record per item.
        with open(filename, 'a') as fp:
            fp.write(item['name'][0].encode("utf8") + '\t'
                     + item['addr'][0].encode("utf8") + '\n')
        return item
```
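The listing above is Python 2 (`print` statements, explicit `.encode("utf8")`). Under Python 3, which current Scrapy releases require, an equivalent pipeline might look like this (same tab-separated output; the unused timestamp is dropped):

```python
# A Python 3 sketch of the same pipeline (the original listing is Python 2).
class NewspapersPipeline(object):
    def process_item(self, item, spider):
        filename = 'newspaper.txt'
        print('=================')
        print(item)
        print('=================')
        # Open in text mode with an explicit encoding instead of
        # calling .encode("utf8") on each field.
        with open(filename, 'a', encoding='utf8') as fp:
            fp.write(item['name'][0] + '\t' + item['addr'][0] + '\n')
        return item
```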
7. Check the results

```
$ cat spiders/newspaper.txt
人民日报	http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
海外版	http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
光明日报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
经济日报	http://www.economicdaily.com.cn/no1/
解放军报	http://www.gmw.cn/01gmrb/2007-09/20/default.htm
中国日报	http://pub1.chinadaily.com.cn/cdpdf/cndy/
```
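Because the pipeline writes one tab-separated record per line, the result file can be read back with the standard `csv` module; a sketch over an inline sample in the same format:

```python
import csv
import io

# Two lines in the same tab-separated format the pipeline writes.
sample = (
    "人民日报\thttp://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm\n"
    "光明日报\thttp://www.gmw.cn/01gmrb/2007-09/20/default.htm\n"
)

# Parse with tab as the delimiter; each row becomes [name, addr].
rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))
for name, addr in rows:
    print(name, addr)
```

To read the real file, replace `io.StringIO(sample)` with `open('spiders/newspaper.txt', encoding='utf8')`.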