4.2: Scrapy Crawler
Use the Scrapy framework to crawl the content of a website.
Open a terminal on the desktop and enter:
scrapy startproject bitNews
cd bitNews/bitNews
Edit the items file: run vim items.py, press i to enter insert mode, and change the code to:
# -*- coding: utf-8 -*-
import scrapy


class BitnewsItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
Press Esc, then Shift+ZZ (type ZZ in normal mode) to save and exit. Then enter in the terminal:
scrapy genspider bitnews "www.bit.edu.cn"
cd spiders
vim bitnews.py
Change the code to the following:
# -*- coding: utf-8 -*-
import scrapy
from bitNews.items import BitnewsItem


class BitnewsSpider(scrapy.Spider):
    name = 'bitnews'
    allowed_domains = ['www.bit.edu.cn']
    start_urls = ['http://www.bit.edu.cn/xww/jdgz/index.htm']

    def parse(self, response):
        items = []
        # Select the news list container, then walk its <li> entries.
        div = response.xpath("//div[@class='new_con']")
        for each in div.xpath("ul/li"):
            item = BitnewsItem()
            # extract() returns a list of the link's text nodes.
            item['name'] = each.xpath('a/text()').extract()
            items.append(item)
        return items
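To see what the two XPath expressions in parse select, the following minimal sketch applies the same paths to a small HTML fragment. The fragment and its headlines are made up for illustration (the real page layout may differ), and the standard library's xml.etree.ElementTree stands in for Scrapy's selectors, so it only handles well-formed markup and a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment imitating the list structure the spider expects.
SAMPLE_HTML = """
<html><body>
  <div class="new_con">
    <ul>
      <li><a href="a.htm">Headline one</a></li>
      <li><a href="b.htm">Headline two</a></li>
    </ul>
  </div>
  <div class="other"><ul><li><a href="c.htm">Not news</a></li></ul></div>
</body></html>
"""


def extract_names(html):
    """Mimic the spider's parse(): build one {'name': [...]} dict per <li>."""
    root = ET.fromstring(html)
    items = []
    # Same idea as //div[@class='new_con'] in the spider.
    for div in root.findall(".//div[@class='new_con']"):
        # Same relative path as ul/li in the spider.
        for li in div.findall("ul/li"):
            # a/text() with extract() gives a list of strings, so keep a list here too.
            names = [a.text for a in li.findall("a") if a.text]
            items.append({"name": names})
    return items


items = extract_names(SAMPLE_HTML)
print(items)
```

Only links inside the div with class new_con are collected; the list under the other div is ignored. Each name is a list because extract() returns every matching text node, not a single string.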
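After saving and exiting, enter in the terminal: cd ..
Edit settings.py: vim settings.py
Find ROBOTSTXT_OBEY and change its value to False, then add FEED_EXPORT_ENCODING so that the two settings read as follows: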
ROBOTSTXT_OBEY=False
FEED_EXPORT_ENCODING = "UTF-8"
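The effect of FEED_EXPORT_ENCODING on the JSON feed can be illustrated without Scrapy. When the setting is left unset, Scrapy's JSON output escapes non-ASCII characters as \uXXXX sequences; with "UTF-8" the characters are written as-is. The sketch below imitates the two behaviours with the standard json module. The sample headline is made up, and json.dumps is only a stand-in for Scrapy's exporter, not the exporter itself.

```python
import json

# Hypothetical item, shaped like what the spider returns.
item = {"name": ["北京理工大学新闻"]}

# Comparable to the default JSON feed: non-ASCII becomes \uXXXX escapes.
escaped = json.dumps(item, ensure_ascii=True)

# Comparable to FEED_EXPORT_ENCODING = "UTF-8": characters stay readable.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; the setting only changes how readable the file is when opened in an editor.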
After saving and exiting, enter in the terminal:
scrapy crawl bitnews -o news.json
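Once the crawl finishes, news.json should contain a JSON array with one object per item. The following is a minimal sketch for reading it back, assuming the file was written by the command above and that each item's name field is a list of strings (which is what extract() returns). The helper name load_titles is hypothetical.

```python
import json
from pathlib import Path


def load_titles(path):
    """Return a flat list of non-empty titles from a Scrapy JSON feed file."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    titles = []
    for item in items:
        # Each "name" is a list of text nodes; some entries may be empty.
        for name in item.get("name", []):
            name = name.strip()
            if name:
                titles.append(name)
    return titles


if __name__ == "__main__":
    feed = Path("news.json")
    if feed.exists():
        for title in load_titles(feed):
            print(title)
```

Note that -o appends to an existing file, so running the crawl a second time can leave news.json as invalid JSON; deleting the file before re-running avoids this.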

This article is from cnblogs (博客园), author: 哥们要飞. When reposting, please cite the original link: https://www.cnblogs.com/liujinhui/p/16382553.html
