102102126 吴启严 Data Collection and Fusion Technology Practice, Assignment 3

Assignment Content

Code link: Gitee

Assignment ①:

Requirements:

Pick a website and crawl all of the images on that site, for example the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both single-threaded and multi-threaded modes.
Be sure to limit the crawl, e.g., cap the total number of pages (last 2 digits of your student ID) and the total number of downloaded images (last 3 digits of your student ID).

Output:

Print the downloaded URL information to the console, store the downloaded images in the images subfolder, and include screenshots.

The code is as follows:

  1. Define the spider:
    Edit the weather_images/weather_images/spiders/weather_spider.py file with the following spider definition:

    import scrapy
    from scrapy.exceptions import CloseSpider
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    class WeatherSpider(CrawlSpider):
        name = 'weather_spider'
        allowed_domains = ['weather.com.cn']
        start_urls = ['http://www.weather.com.cn']
        total_pages = 0   # number of pages crawled so far
        total_images = 0  # number of images downloaded so far
        max_pages = 26    # last 2 digits of the student ID
        max_images = 126  # last 3 digits of the student ID
    
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            if self.total_pages >= self.max_pages:
                raise CloseSpider('page limit reached')
            self.total_pages += 1
            for img_url in response.css('img::attr(src)').extract():
                if self.total_images >= self.max_images:
                    raise CloseSpider('image limit reached')
                self.total_images += 1
                full_url = response.urljoin(img_url)
                print(full_url)  # echo the download URL to the console
                # ImagesPipeline looks for a list field named 'image_urls'
                yield {'image_urls': [full_url]}
    
  2. Configure the Scrapy settings:
    Edit the weather_images/weather_images/settings.py file and add the following settings:

    ITEM_PIPELINES = {
        'scrapy.pipelines.images.ImagesPipeline': 1,
    }
    IMAGES_STORE = r'D:\桌面\数据采集与数据融合技术\full'  # directory where ImagesPipeline saves the downloaded images
    CONCURRENT_REQUESTS = 1  # single-threaded crawl
    # for the multi-threaded crawl, set CONCURRENT_REQUESTS to a higher value
    

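With these settings the crawl runs single-threaded; it is started from the project root with scrapy crawl weather_spider. Switching to the multi-threaded (concurrent) run only requires raising the concurrency settings. A minimal sketch, where the value 16 is an arbitrary example:

    CONCURRENT_REQUESTS = 16             # overall number of concurrent requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap, which matters here since only weather.com.cn is crawled
    DOWNLOAD_DELAY = 0                   # no artificial delay between requests
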
The downloaded URLs are as follows:

https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
http://pi.weather.com.cn/i//product/pic/l/sevp_nsmc_wxbl_fy4a_etcc_achn_lno_py_20231024234500000.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/202310240901372B0E8FDF7D8ED736C3BF947676D00ADA.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/2023102409575512407332009374DDA12F1788006B444F.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/2023102411140839FCD91101374DD69FD3ABD5A86259CA.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nmc_stfc_sfer_er24_achn_l88_p9_20231025010002400.jpg
http://i.weather.com.cn/images/cn/video/lssj/2023/10/24/20231024160107F79724290685BB5A07B8E240DD90185B_m.jpg
http://i.weather.com.cn/images/cn/index/2023/08/14/202308141514280EB4780D353FB4038A49F4980BFDA8F4.jpg
http://i.tq121.com.cn/i/weather2014/index_neweather/v.jpg
https://i.i8tq.com/video/index_v2.jpg
https://i.i8tq.com/adImg/tanzhonghe_pc1.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/17/202310171416368174046B224C8139BAD7CC113E3B54B2.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/20231016172856E95A6146BB0849E1C165777507CEAB0D.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/20231016163046E594C5EB7CD06C31B6452B01A097907E.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/2023101610255146894E5E20CD8F84ABB5C4EDA8F7223E.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/09/27/202309271054160CE1AF5F9EC5516A795F3528B5C1A284.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/09/25/202309251837219DDB4EB6B391A780A1E027DA3CA553A6.jpg
http://i.weather.com.cn/images/cn/sjztj/2020/07/20/20200720142523B5F07D41B4AC4336613DA93425B35B5E_xm.jpg
http://pic.weather.com.cn/images/cn/photo/2019/10/28/20191028144048D58023A73C43EC6EEB61610B0AB0AD74_xm.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/20/20231020170222F2252A2BD855BFBEFFC30AE770F716FB.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/20231024152112F91EE473F63AC3360E412346BC26C108.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/202310240901372B0E8FDF7D8ED736C3BF947676D00ADA.jpg
http://pic.weather.com.cn/images/cn/photo/2023/09/20/2023092011133671187DBEE4125031642DBE0404D7020D.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/16/20231016145553827BF524F4F576701FFDEC63F894DD29.jpg
http://i.weather.com.cn/images/cn/news/2021/05/14/20210514192548638D53A47159C5D97689A921C03B6546.jpg
http://i.weather.com.cn/images/cn/news/2021/03/26/20210326150416454FB344B92EC8BD897FA50DF6AD15E8.jpg
http://i.weather.com.cn/images/cn/science/2020/07/28/202007281001285C97A5D6CAD3BC4DD74F98B5EA5187BF.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nmc_stfc_sfer_er24_achn_l88_p9_20231025010002400.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nsmc_wxbl_fy4a_etcc_achn_lno_py_20231024234500000.jpg


Some of the resulting images are shown below:

(screenshot: downloaded images)

Reflections:

This task showed me how the multi-threaded approach differs from single-threaded crawling: in Scrapy the switch comes down to the CONCURRENT_REQUESTS setting, with the framework handling the concurrency itself.

Assignment ②:

Requirements:

Master the serialized output of Item and Pipeline data in Scrapy; crawl stock information using the Scrapy framework + XPath + MySQL database storage.
Candidate website: East Money: https://www.eastmoney.com/
Output: the data is stored in a MySQL database in the format below.
Column headers use English names, e.g., id for the serial number and bStockNo for the stock code.

The concrete code steps are as follows:

Building a stock-information crawler with the Scrapy framework, XPath, and MySQL involves several main steps. Note that developing such a crawler takes some time and requires a certain amount of programming knowledge. The basic steps are outlined below:

  1. Define the Item:

    import scrapy
    
    class StockItem(scrapy.Item):
        id = scrapy.Field()
        bStockNo = scrapy.Field()
        name = scrapy.Field()
        latest_price = scrapy.Field()
        change_percent = scrapy.Field()
        change_amount = scrapy.Field()
        volume = scrapy.Field()
        amplitude = scrapy.Field()
        high = scrapy.Field()
        low = scrapy.Field()
        opening = scrapy.Field()
        closing = scrapy.Field()
    
  2. Create the Spider:

    import scrapy
    from ..items import StockItem  # relative import from the project's items.py
    
    class StockSpider(scrapy.Spider):
        name = "stock"
        start_urls = ['https://www.eastmoney.com/']
    
        def parse(self, response):
            # extract the data with XPath selectors
            for stock in response.xpath('//your_xpath_selector'):
                item = StockItem()
                item['id'] = stock.xpath('your_xpath_selector').extract_first()
                item['bStockNo'] = stock.xpath('your_xpath_selector').extract_first()
                # ...repeat for the remaining fields
                yield item
    
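As a concrete illustration of how the placeholder selectors might be filled in, here is a sketch against a hypothetical static quote table (the table id table_wrapper-table and the column order are assumptions, not the real page structure; East Money renders much of its quote data with JavaScript, so in practice the selectors must be adapted to the actual page or to its underlying data interface):

    def parse(self, response):
        # hypothetical markup: one <tr> per stock, one <td> per column
        rows = response.xpath('//table[@id="table_wrapper-table"]/tbody/tr')
        for index, stock in enumerate(rows, start=1):
            item = StockItem()
            item['id'] = index
            item['bStockNo'] = stock.xpath('./td[2]/a/text()').get()
            item['name'] = stock.xpath('./td[3]/a/text()').get()
            item['latest_price'] = stock.xpath('./td[5]/text()').get()
            # ...the remaining fields follow the same td-index pattern
            yield item
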
  3. Configure the Pipeline to store data in MySQL:

    import mysql.connector
    
    class StockPipeline(object):
        def __init__(self):
            self.conn = mysql.connector.connect(
                host='your_host',
                user='your_user',
                passwd='your_password',
                db='your_database'
            )
            self.cursor = self.conn.cursor()
    
        def process_item(self, item, spider):
            sql = "INSERT INTO your_table_name (id, bStockNo, name, latest_price, change_percent, change_amount, volume, amplitude, high, low, opening, closing) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            values = (item['id'], item['bStockNo'], item['name'], item['latest_price'], item['change_percent'], item['change_amount'], item['volume'], item['amplitude'], item['high'], item['low'], item['opening'], item['closing'])
            self.cursor.execute(sql, values)
            self.conn.commit()
            return item
    
        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()
    

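For this pipeline to run, it also has to be enabled in settings.py. A minimal sketch, assuming the Scrapy project is named stock_spider (adjust the module path to the actual project name):

    ITEM_PIPELINES = {
        'stock_spider.pipelines.StockPipeline': 300,
    }

The target table must also exist before the first insert. One possible definition, with stocks as a hypothetical stand-in for your_table_name:

    CREATE TABLE stocks (
        id VARCHAR(16) PRIMARY KEY,
        bStockNo VARCHAR(16),
        name VARCHAR(64),
        latest_price DECIMAL(10,2),
        change_percent DECIMAL(10,2),
        change_amount DECIMAL(10,2),
        volume VARCHAR(32),
        amplitude DECIMAL(10,2),
        high DECIMAL(10,2),
        low DECIMAL(10,2),
        opening DECIMAL(10,2),
        closing DECIMAL(10,2)
    );
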
The results after running are as follows:

(screenshot: run results)

Reflections:

This task gave me a deeper feel for Scrapy, in particular how Items flow through a Pipeline into external storage.

Assignment ③:

Requirements:

Master the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL database storage.
Candidate website: Bank of China: https://www.boc.cn/sourcedb/whpj/

The concrete code steps are as follows:

  1. Create the Item:
    • Define an Item class in the project's items.py file to hold the exchange-rate data.
import scrapy

class ExchangeRateItem(scrapy.Item):
    currency = scrapy.Field()  # currency name
    tbp = scrapy.Field()       # spot exchange buying rate (现汇买入价)
    cbp = scrapy.Field()       # cash buying rate (现钞买入价)
    tsp = scrapy.Field()       # spot exchange selling rate (现汇卖出价)
    csp = scrapy.Field()       # cash selling rate (现钞卖出价)
    time = scrapy.Field()      # publication time
  2. Create the Spider:
    • Create a new Spider class to crawl the exchange-rate data.
import scrapy
from ..items import ExchangeRateItem  # relative import from the project's items.py

class ExchangeRateSpider(scrapy.Spider):
    name = 'exchange_rate'
    start_urls = ['https://www.boc.cn/sourcedb/whpj/']

    def parse(self, response):
        rows = response.xpath('//table[@class="publish"]/tr')[1:]  # [1:] skips the header row
        for row in rows:
            item = ExchangeRateItem()
            item['currency'] = row.xpath('td[1]/text()').get()
            item['tbp'] = row.xpath('td[2]/text()').get()
            item['cbp'] = row.xpath('td[3]/text()').get()
            item['tsp'] = row.xpath('td[4]/text()').get()
            item['csp'] = row.xpath('td[5]/text()').get()
            item['time'] = row.xpath('td[6]/text()').get()
            yield item
  3. Create the Pipeline:
    • Create a new Pipeline class to save the exchange-rate data to a MySQL database.
import mysql.connector

class MySQLPipeline(object):

    def open_spider(self, spider):
        self.conn = mysql.connector.connect(
            host='your_host',
            user='your_username',
            passwd='your_password',
            db='your_database'
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()  # commit all buffered inserts once, at shutdown
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute("""
            INSERT INTO exchange_rates (currency, tbp, cbp, tsp, csp, time)
            VALUES (%s, %s, %s, %s, %s, %s)
        """, (
            item['currency'],
            item['tbp'],
            item['cbp'],
            item['tsp'],
            item['csp'],
            item['time']
        ))
        return item

Enable the newly created Pipeline in the settings.py file:

ITEM_PIPELINES = {
    'exchange_rate.pipelines.MySQLPipeline': 300,
}
  4. Create the MySQL database and table:
    • Create a new database and table in MySQL to hold the exchange-rate data.
CREATE DATABASE your_database;
USE your_database;
CREATE TABLE exchange_rates (
    id INT AUTO_INCREMENT PRIMARY KEY,
    currency VARCHAR(255),
    tbp DECIMAL(10,2),
    cbp DECIMAL(10,2),
    tsp DECIMAL(10,2),
    csp DECIMAL(10,2),
    time TIME
);
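
After the spider is run with scrapy crawl exchange_rate, the stored rows can be spot-checked from the MySQL client. A quick sanity query, assuming the database and table defined above:

SELECT currency, tbp, cbp, tsp, csp, time
FROM exchange_rates
LIMIT 10;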

Run results:

(screenshot: run results)

Reflections:

This task gave me a deeper understanding of Scrapy and MySQL, especially how a Pipeline bridges the two.

posted @ 2023-11-01 22:45 xdajun