2023 Data Collection and Fusion Technology — Practical Assignment 3
Assignment 3
一、Assignment content
- Task ①:
- Requirement: pick a website and crawl all of its images, e.g. the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both a single-threaded and a multi-threaded way.
- Be sure to limit the crawl, e.g. cap the total number of pages (last 2 digits of the student ID) and the total number of downloaded images (last 3 digits of the ID).
- Output: print the downloaded URLs to the console, save the images under an images subfolder, and provide screenshots.
- Gitee folder link: Task 1 code and files
1) Code and screenshots:
Code:
```python
import scrapy
from scrapy_weather.items import ScrapyWeatherItem


class WeaSpider(scrapy.Spider):
    name = 'wea'
    allowed_domains = []
    start_urls = ['http://www.weather.com.cn/']
    page = 1

    def parse(self, response):
        # Parse the home page itself for images (dont_filter lets this
        # duplicate of the start URL through the dupe filter)
        yield scrapy.Request(url='http://www.weather.com.cn/',
                             callback=self.parse_second, dont_filter=True)
        # Follow every <a> inside a <p> and parse those pages, too
        for a in response.xpath('//p/a'):
            url = a.xpath('./@href').get()
            yield scrapy.Request(url=url, callback=self.parse_second)
            self.page = self.page + 1
            if self.page > 20:   # page cap (last 2 digits of the student ID)
                break

    def parse_second(self, response):
        # Extract every image URL on the page and hand it to the pipeline
        for img in response.xpath('//img'):
            tupian = ScrapyWeatherItem()
            tupian['lianjiedizhi'] = img.xpath('./@src').extract_first()
            yield tupian
```
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapyWeatherItem(scrapy.Item):
    # Image URL
    lianjiedizhi = scrapy.Field()
```
```python
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class 图片管道类(ImagesPipeline):  # "image pipeline" class
    def get_media_requests(self, item, info):
        # Download every image URL handed over by the spider
        yield scrapy.Request(item['lianjiedizhi'])

    def file_path(self, request, response=None, info=None, *, item=None):
        # Use the last segment of the URL as the file name
        return request.url.split('/')[-1]

    def item_completed(self, results, item, info):
        return item
```
Single-threaded (`settings.py`):
```python
# Scrapy settings for scrapy_weather project
#
# https://docs.scrapy.org/en/latest/topics/settings.html
BOT_NAME = 'scrapy_weather'

SPIDER_MODULES = ['scrapy_weather.spiders']
NEWSPIDER_MODULE = 'scrapy_weather.spiders'

# "Single-threaded": low download concurrency
CONCURRENT_REQUESTS = 8
# # "Multi-threaded": higher concurrency
# CONCURRENT_REQUESTS = 32

# Obey robots.txt rules (left disabled here)
# ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy_weather.pipelines.图片管道类': 300,
}
IMAGES_STORE = 'D:/学习/数据采集与融合技术/爬取图片/'

# (the rest of the file is the unmodified commented-out Scrapy template)
```
Multi-threaded (`settings.py`):
```python
# Scrapy settings for scrapy_weather project
#
# https://docs.scrapy.org/en/latest/topics/settings.html
BOT_NAME = 'scrapy_weather'

SPIDER_MODULES = ['scrapy_weather.spiders']
NEWSPIDER_MODULE = 'scrapy_weather.spiders'

# # "Single-threaded": low download concurrency
# CONCURRENT_REQUESTS = 8
# "Multi-threaded": higher concurrency
CONCURRENT_REQUESTS = 32

# Obey robots.txt rules (left disabled here)
# ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy_weather.pipelines.图片管道类': 300,
}
IMAGES_STORE = 'D:/学习/数据采集与融合技术/爬取图片/'

# (the rest of the file is the unmodified commented-out Scrapy template)
```
Output screenshot:

2) Reflections:
This task uses the Scrapy framework to crawl image links from the China Weather Network and download the pictures. The spider has two parsing methods: parse handles the home page, collects the links to the other pages, and yields requests that are handled by parse_second; parse_second parses each of those pages and extracts the image URLs, wrapping each one in a ScrapyWeatherItem that is yielded to the pipeline for downloading and saving. Switching from the "single-threaded" to the "multi-threaded" configuration only requires raising CONCURRENT_REQUESTS in settings.py from 8 to 32.
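As an alternative to the hand-rolled self.page counter, Scrapy's built-in CloseSpider extension can enforce the page and image caps the assignment asks for. A minimal settings.py sketch (20 and 120 are placeholders standing in for the student-ID digits, not values from this project):

```python
# settings.py fragment (sketch): stop the crawl automatically once a cap
# is reached; both settings belong to Scrapy's CloseSpider extension.
CLOSESPIDER_PAGECOUNT = 20    # stop after this many responses crawled
CLOSESPIDER_ITEMCOUNT = 120   # stop after this many items (images) scraped
```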
- Task ②
- Requirement: become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl stock information using the Scrapy framework + XPath + MySQL storage route.
- Candidate site: Eastmoney: https://www.eastmoney.com/
- Output: MySQL storage and output in the format below:
- English column names (e.g. id for the row number, bStockNo for the stock code, …) are to be designed by each student
| No. | Stock code | Stock name | Latest price | Change % | Change amount | Volume | Amplitude | High | Low | Open | Prev. close |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 688093 | N世华 | 28.47 | 10.92 | 26.13万 | 7.6亿 | 22.34 | 32.0 | 28.08 | 30.20 | 17.55 |
| 2 | … | | | | | | | | | | |
- Gitee folder link: Task 2 code and files
1) Code and screenshots:
Code:
```python
import ast
import re

import scrapy
from scrapy_gupiao.items import ScrapyGupiaoItem


class GuSpider(scrapy.Spider):
    name = 'gu'
    allowed_domains = ['eastmoney.com']
    start_urls = ['https://97.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112406318197805742447_1697478765438&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1697478765439']
    page = 1
    count = 0

    def parse(self, response):
        # Request pages 1-19 of the list API (pn is the page number)
        for a in range(1, 20):
            url = f'https://97.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112406318197805742447_1697478765438&pn={a}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1697478765439'
            yield scrapy.Request(url=url, callback=self.parse_second)
            self.page = self.page + 1

    def parse_second(self, response):
        html = response.text
        # Pull each field out of the JSONP payload with regular expressions
        gupiaodaima = re.findall(r'"f12":".*?"', html)      # stock code
        gupiaomingcheng = re.findall(r'"f14":".*?"', html)  # stock name
        zuixinbaojia = re.findall(r'"f2":[\d.]*', html)     # latest price
        zhangdiefu = re.findall(r'"f3":[\d.]*', html)       # change percent
        zhangdiee = re.findall(r'"f4":[\d.]*', html)        # change amount
        chengjiaoliang = re.findall(r'"f5":[\d.]*', html)   # volume
        chengjiaoe = re.findall(r'"f6":[\d.]*', html)       # turnover
        zhenfu = re.findall(r'"f7":[\d.]*', html)           # amplitude
        zuigao = re.findall(r'"f15":[\d.]*', html)          # high
        zuidi = re.findall(r'"f16":[\d.]*', html)           # low
        jinkai = re.findall(r'"f17":[\d.]*', html)          # open
        zuoshou = re.findall(r'"f18":[\d.]*', html)         # previous close

        for i in range(len(gupiaodaima)):
            # Each match looks like '"f12":"688093"'; keep the value part.
            # ast.literal_eval is a safer drop-in for the original eval().
            gupiaodaima1 = ast.literal_eval(gupiaodaima[i].split(':')[1])
            gupiaomingcheng1 = ast.literal_eval(gupiaomingcheng[i].split(':')[1])
            zuixinbaojia1 = ast.literal_eval(zuixinbaojia[i].split(':')[1])
            zhangdiefu1 = str(ast.literal_eval(zhangdiefu[i].split(':')[1])) + "%"
            zhangdiee1 = ast.literal_eval(zhangdiee[i].split(':')[1])
            chengjiaoliang1 = ast.literal_eval(chengjiaoliang[i].split(':')[1])
            chengjiaoe1 = ast.literal_eval(chengjiaoe[i].split(':')[1])
            zhenfu1 = str(ast.literal_eval(zhenfu[i].split(':')[1])) + "%"
            zuigao1 = ast.literal_eval(zuigao[i].split(':')[1])
            zuidi1 = ast.literal_eval(zuidi[i].split(':')[1])
            jinkai1 = ast.literal_eval(jinkai[i].split(':')[1])
            zuoshou1 = ast.literal_eval(zuoshou[i].split(':')[1])

            yield ScrapyGupiaoItem(
                gupiaodaima=gupiaodaima1, gupiaomingcheng=gupiaomingcheng1,
                zuixinbaojia=zuixinbaojia1, zhangdiefu=zhangdiefu1,
                zhangdiee=zhangdiee1, chengjiaoliang=chengjiaoliang1,
                chengjiaoe=chengjiaoe1, zhenfu=zhenfu1, zuigao=zuigao1,
                zuidi=zuidi1, jinkai=jinkai1, zuoshou=zuoshou1)
```
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapyGupiaoItem(scrapy.Item):
    # One field per column to be stored
    gupiaodaima = scrapy.Field()      # stock code
    gupiaomingcheng = scrapy.Field()  # stock name
    zuixinbaojia = scrapy.Field()     # latest price
    zhangdiefu = scrapy.Field()       # change percent
    zhangdiee = scrapy.Field()        # change amount
    chengjiaoliang = scrapy.Field()   # volume
    chengjiaoe = scrapy.Field()       # turnover
    zhenfu = scrapy.Field()           # amplitude
    zuigao = scrapy.Field()           # high
    zuidi = scrapy.Field()            # low
    jinkai = scrapy.Field()           # open
    zuoshou = scrapy.Field()          # previous close
```
```python
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3


class ScrapyGupiaoPipeline:
    def open_spider(self, spider):
        # Runs once before the spider starts: create the table
        self.conn = sqlite3.connect('gupiao.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS gupiao1(
                股票代码 text, 股票名称 text, 最新报价 text, 涨跌幅 text,
                涨跌额 text, 成交量 text, 成交额 text, 振幅 text,
                最高 text, 最低 text, 今开 text, 昨收 text)
        ''')

    def process_item(self, item, spider):
        # item is the ScrapyGupiaoItem yielded by the spider
        self.cursor.execute(
            "INSERT INTO gupiao1 (股票代码, 股票名称, 最新报价, 涨跌幅, 涨跌额,"
            " 成交量, 成交额, 振幅, 最高, 最低, 今开, 昨收)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (item['gupiaodaima'], item['gupiaomingcheng'], item['zuixinbaojia'],
             item['zhangdiefu'], item['zhangdiee'], item['chengjiaoliang'],
             item['chengjiaoe'], item['zhenfu'], item['zuigao'], item['zuidi'],
             item['jinkai'], item['zuoshou']))
        return item

    def close_spider(self, spider):
        # Runs once after the spider finishes: commit and clean up
        self.conn.commit()
        self.cursor.close()
        self.conn.close()
```
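The pipeline above stores into SQLite even though the requirement names MySQL. A hedged sketch of what a pymysql variant might look like (the connection parameters are placeholders, not this project's real credentials; the main differences are connect() and the %s placeholder style):

```python
# Sketch only: the column list matches the pipeline's table; connection
# parameters are placeholders for illustration.
COLUMNS = ["股票代码", "股票名称", "最新报价", "涨跌幅", "涨跌额",
           "成交量", "成交额", "振幅", "最高", "最低", "今开", "昨收"]

# MySQL drivers use %s placeholders instead of sqlite3's ?
INSERT_SQL = "INSERT INTO gupiao1 ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS)))


def open_mysql():
    # Deferred import so the rest of the module works without pymysql
    import pymysql  # pip install pymysql
    return pymysql.connect(host="localhost", user="root",
                           password="******", database="gupiao",
                           charset="utf8mb4")
```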
```python
# Scrapy settings for scrapy_gupiao project
#
# https://docs.scrapy.org/en/latest/topics/settings.html
BOT_NAME = 'scrapy_gupiao'

SPIDER_MODULES = ['scrapy_gupiao.spiders']
NEWSPIDER_MODULE = 'scrapy_gupiao.spiders'

# Obey robots.txt rules (left disabled here)
# ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy_gupiao.pipelines.ScrapyGupiaoPipeline': 300,
}

# (the rest of the file is the unmodified commented-out Scrapy template)
```
Output screenshot:
2) Reflections:
This task uses the Scrapy framework to crawl stock information from Eastmoney. parse builds the paged API links and yields them to parse_second; parse_second parses each response, extracts the stock fields with regular expressions, wraps them in a ScrapyGupiaoItem, and yields the item to the pipeline, which stores it in the database.
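The regex-plus-eval extraction above works, but the API response is really JSONP wrapping a JSON body, so it can also be parsed structurally. A sketch under that assumption (parse_jsonp is a hypothetical helper; the sample string imitates, but is not, a real response):

```python
import json
import re


def parse_jsonp(text):
    """Strip the jQuery...(...); JSONP wrapper and parse the JSON body."""
    body = re.search(r'\((\{.*\})\)', text, re.S).group(1)
    return json.loads(body)


# Imitation of the Eastmoney response shape (f12 = code, f14 = name, f2 = price)
sample = ('jQuery112406318197805742447_1697478765438('
          '{"data":{"diff":[{"f12":"688093","f14":"N世华","f2":28.47}]}});')
stocks = parse_jsonp(sample)["data"]["diff"]
```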
- Task ③:
- Requirement: become proficient with the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL storage route.
- Candidate site: Bank of China: https://www.boc.cn/sourcedb/whpj/
- Output:
- Gitee folder link: Task 3 code and files
| Currency | TBP | CBP | TSP | CSP | Time |
|---|---|---|---|---|---|
| 阿联酋迪拉姆 (UAE Dirham) | 198.58 | 192.31 | 199.98 | 206.59 | 11:27:14 |
1) Code and screenshots:
Code:
```python
import scrapy
from scrapy_waihui.items import ScrapyWaihuiItem


class WaiSpider(scrapy.Spider):
    name = 'wai'
    allowed_domains = ['boc.cn']
    start_urls = ['https://www.boc.cn/sourcedb/whpj/index.html']
    page = 1

    def parse(self, response):
        # First page (dont_filter lets this duplicate of the start URL through)
        yield scrapy.Request(url='https://www.boc.cn/sourcedb/whpj/index.html',
                             callback=self.parse_second, dont_filter=True)
        # Subsequent pages are index_1.html, index_2.html, ...
        for a in range(1, 10):
            url = f'https://www.boc.cn/sourcedb/whpj/index_{a}.html'
            yield scrapy.Request(url=url, callback=self.parse_second)
            self.page = self.page + 1

    def parse_second(self, response):
        # Note: no tbody in the XPath -- browsers insert it when rendering,
        # but the raw HTML the server sends does not contain it
        rows = response.xpath('//table[@align="left"]//tr')
        for i in rows[1:]:  # skip the header row
            huobimingcheng = i.xpath('./td[1]/text()').extract_first()    # currency
            xianhuimairujia = i.xpath('./td[2]/text()').extract_first()   # TBP
            xianchaomairujia = i.xpath('./td[3]/text()').extract_first()  # CBP
            xianhuimaichujia = i.xpath('./td[4]/text()').extract_first()  # TSP
            xianchaomaichujia = i.xpath('./td[5]/text()').extract_first() # CSP
            fabushijian = i.xpath('./td[8]/text()').extract_first()       # time
            yield ScrapyWaihuiItem(
                huobimingcheng=huobimingcheng,
                xianhuimairujia=xianhuimairujia,
                xianchaomairujia=xianchaomairujia,
                xianhuimaichujia=xianhuimaichujia,
                xianchaomaichujia=xianchaomaichujia,
                fabushijian=fabushijian)
```
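The pagination scheme in parse (page 1 is index.html, later pages are index_1.html, index_2.html, …) can be factored into a small helper; boc_page_urls is a hypothetical name used here for illustration:

```python
def boc_page_urls(n_pages):
    """Build the Bank of China rate-list URLs: page 1 is index.html,
    page k (k >= 2) is index_{k-1}.html, matching the spider above."""
    urls = ['https://www.boc.cn/sourcedb/whpj/index.html']
    urls += [f'https://www.boc.cn/sourcedb/whpj/index_{a}.html'
             for a in range(1, n_pages)]
    return urls
```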
```python
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class ScrapyWaihuiItem(scrapy.Item):
    huobimingcheng = scrapy.Field()     # currency name
    xianhuimairujia = scrapy.Field()    # telegraphic-transfer buying price (TBP)
    xianchaomairujia = scrapy.Field()   # cash buying price (CBP)
    xianhuimaichujia = scrapy.Field()   # telegraphic-transfer selling price (TSP)
    xianchaomaichujia = scrapy.Field()  # cash selling price (CSP)
    fabushijian = scrapy.Field()        # publication time
```
```python
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3


class ScrapyWaihuiPipeline:
    def open_spider(self, spider):
        # Runs once before the spider starts: create the table
        self.conn = sqlite3.connect('waihui.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS waihui1(
                Currency text, TBP text, CBP text,
                TSP text, CSP text, Time text)
        ''')

    def process_item(self, item, spider):
        # item is the ScrapyWaihuiItem yielded by the spider
        self.cursor.execute(
            "INSERT INTO waihui1 (Currency, TBP, CBP, TSP, CSP, Time)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (item['huobimingcheng'], item['xianhuimairujia'],
             item['xianchaomairujia'], item['xianhuimaichujia'],
             item['xianchaomaichujia'], item['fabushijian']))
        return item

    def close_spider(self, spider):
        # Runs once after the spider finishes: commit and clean up
        self.conn.commit()
        self.cursor.close()
        self.conn.close()
```
```python
# Scrapy settings for scrapy_waihui project
#
# https://docs.scrapy.org/en/latest/topics/settings.html
BOT_NAME = 'scrapy_waihui'

SPIDER_MODULES = ['scrapy_waihui.spiders']
NEWSPIDER_MODULE = 'scrapy_waihui.spiders'

# Obey robots.txt rules (left disabled here)
# ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy_waihui.pipelines.ScrapyWaihuiPipeline': 300,
}

# (the rest of the file is the unmodified commented-out Scrapy template)
```
Output screenshot:
2) Reflections:
This task uses the Scrapy framework to crawl foreign-exchange rates from the Bank of China site. parse yields the page links to parse_second; parse_second parses each page and extracts the exchange-rate fields with XPath. At first I did not know that the tbody tag must be left out of the XPath expression, so the extraction came back empty; after removing it the rows were extracted correctly. Each row is then wrapped in a ScrapyWaihuiItem and yielded to the pipeline, which stores it in the database.
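The tbody pitfall can be reproduced in miniature: browsers insert a tbody element when rendering a table, but the raw HTML the server sends usually lacks it, so an XPath copied from browser devtools that includes tbody matches nothing. A toy demonstration on a hand-written fragment (not the real BOC page):

```python
import xml.etree.ElementTree as ET

# Hand-written table fragment with no <tbody>, like the raw server HTML
raw = ('<table align="left">'
       '<tr><td>Currency</td></tr>'
       '<tr><td>198.58</td></tr>'
       '</table>')
tree = ET.fromstring(raw)

with_tbody = tree.findall('./tbody/tr')  # what a devtools XPath implies
without_tbody = tree.findall('.//tr')    # what actually matches
```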
