102102126 吴启严 Data Collection and Fusion Technology Practice, Assignment 3

Assignment Content

Code link: Gitee

Assignment ①:

Requirements:

Pick a website and crawl all of the images on that site, for example the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement the crawl in both single-threaded and multi-threaded modes.
Be sure to limit the crawl, e.g., cap the total number of pages (last 2 digits of your student ID) and the total number of downloaded images (last 3 digits of your student ID).

Output:

Print the downloaded URL information to the console, store the downloaded images in the images subfolder, and include screenshots.

The code is as follows:

  1. Define the spider:
    Edit the weather_images/weather_images/spiders/weather_spider.py file with the following spider definition:

    import scrapy
    from scrapy.exceptions import CloseSpider
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    
    class WeatherSpider(CrawlSpider):
        name = 'weather_spider'
        allowed_domains = ['weather.com.cn']
        start_urls = ['http://www.weather.com.cn']
        total_pages = 0   # number of pages crawled so far
        total_images = 0  # number of images downloaded so far
        max_pages = 26    # last 2 digits of the student ID
        max_images = 126  # last 3 digits of the student ID
    
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )
    
        def parse_item(self, response):
            if self.total_pages >= self.max_pages:
                raise CloseSpider('page limit reached')
            self.total_pages += 1
            for img_url in response.css('img::attr(src)').extract():
                if self.total_images >= self.max_images:
                    raise CloseSpider('image limit reached')
                self.total_images += 1
                full_url = response.urljoin(img_url)
                print(full_url)  # echo the download URL to the console
                # ImagesPipeline looks for a list field named 'image_urls'
                yield {'image_urls': [full_url]}
    
  2. Configure the Scrapy settings:
    Edit the weather_images/weather_images/settings.py file and add the following settings:

    ITEM_PIPELINES = {
        'scrapy.pipelines.images.ImagesPipeline': 1,
    }
    IMAGES_STORE = r'D:\桌面\数据采集与数据融合技术\full'  # directory where ImagesPipeline saves the downloaded images
    CONCURRENT_REQUESTS = 1  # single-threaded crawl
    # for the multi-threaded crawl, set CONCURRENT_REQUESTS to a higher value
    

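With these settings the crawl runs single-threaded; it is started from the project root with scrapy crawl weather_spider. Switching to the multi-threaded (concurrent) run only requires raising the concurrency settings. A minimal sketch, where the value 16 is an arbitrary example:

    CONCURRENT_REQUESTS = 16             # overall number of concurrent requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain cap, which matters here since only weather.com.cn is crawled
    DOWNLOAD_DELAY = 0                   # no artificial delay between requests
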
The downloaded URLs are as follows:

https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
http://pi.weather.com.cn/i//product/pic/l/sevp_nsmc_wxbl_fy4a_etcc_achn_lno_py_20231024234500000.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/202310240901372B0E8FDF7D8ED736C3BF947676D00ADA.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/2023102409575512407332009374DDA12F1788006B444F.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/2023102411140839FCD91101374DD69FD3ABD5A86259CA.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nmc_stfc_sfer_er24_achn_l88_p9_20231025010002400.jpg
http://i.weather.com.cn/images/cn/video/lssj/2023/10/24/20231024160107F79724290685BB5A07B8E240DD90185B_m.jpg
http://i.weather.com.cn/images/cn/index/2023/08/14/202308141514280EB4780D353FB4038A49F4980BFDA8F4.jpg
http://i.tq121.com.cn/i/weather2014/index_neweather/v.jpg
https://i.i8tq.com/video/index_v2.jpg
https://i.i8tq.com/adImg/tanzhonghe_pc1.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/17/202310171416368174046B224C8139BAD7CC113E3B54B2.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/20231016172856E95A6146BB0849E1C165777507CEAB0D.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/20231016163046E594C5EB7CD06C31B6452B01A097907E.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/2023101610255146894E5E20CD8F84ABB5C4EDA8F7223E.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/09/27/202309271054160CE1AF5F9EC5516A795F3528B5C1A284.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/09/25/202309251837219DDB4EB6B391A780A1E027DA3CA553A6.jpg
http://i.weather.com.cn/images/cn/sjztj/2020/07/20/20200720142523B5F07D41B4AC4336613DA93425B35B5E_xm.jpg
http://pic.weather.com.cn/images/cn/photo/2019/10/28/20191028144048D58023A73C43EC6EEB61610B0AB0AD74_xm.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/20/20231020170222F2252A2BD855BFBEFFC30AE770F716FB.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/20231024152112F91EE473F63AC3360E412346BC26C108.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/202310240901372B0E8FDF7D8ED736C3BF947676D00ADA.jpg
http://pic.weather.com.cn/images/cn/photo/2023/09/20/2023092011133671187DBEE4125031642DBE0404D7020D.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/16/20231016145553827BF524F4F576701FFDEC63F894DD29.jpg
http://i.weather.com.cn/images/cn/news/2021/05/14/20210514192548638D53A47159C5D97689A921C03B6546.jpg
http://i.weather.com.cn/images/cn/news/2021/03/26/20210326150416454FB344B92EC8BD897FA50DF6AD15E8.jpg
http://i.weather.com.cn/images/cn/science/2020/07/28/202007281001285C97A5D6CAD3BC4DD74F98B5EA5187BF.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nmc_stfc_sfer_er24_achn_l88_p9_20231025010002400.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nsmc_wxbl_fy4a_etcc_achn_lno_py_20231024234500000.jpg


Some of the resulting images are shown below:

(screenshot: downloaded images)

Reflections:

This task showed me how the multi-threaded approach differs from single-threaded crawling: in Scrapy the switch comes down to the CONCURRENT_REQUESTS setting, with the framework handling the concurrency itself.

Assignment ②:

Requirements:

Master the serialized output of Item and Pipeline data in Scrapy; crawl stock information using the Scrapy framework + XPath + MySQL database storage.
Candidate website: East Money: https://www.eastmoney.com/
Output: the data is stored in a MySQL database in the format below.
Column headers use English names, e.g., id for the serial number and bStockNo for the stock code.

The concrete code steps are as follows:

Building a stock-information crawler with the Scrapy framework, XPath, and MySQL involves several main steps. Note that developing such a crawler takes some time and requires a certain amount of programming knowledge. The basic steps are outlined below:

  1. Define the Item:

    import scrapy
    
    class StockItem(scrapy.Item):
        id = scrapy.Field()
        bStockNo = scrapy.Field()
        name = scrapy.Field()
        latest_price = scrapy.Field()
        change_percent = scrapy.Field()
        change_amount = scrapy.Field()
        volume = scrapy.Field()
        amplitude = scrapy.Field()
        high = scrapy.Field()
        low = scrapy.Field()
        opening = scrapy.Field()
        closing = scrapy.Field()
    
  2. Create the Spider:

    import scrapy
    from ..items import StockItem  # relative import from the project's items.py
    
    class StockSpider(scrapy.Spider):
        name = "stock"
        start_urls = ['https://www.eastmoney.com/']
    
        def parse(self, response):
            # extract the data with XPath selectors
            for stock in response.xpath('//your_xpath_selector'):
                item = StockItem()
                item['id'] = stock.xpath('your_xpath_selector').extract_first()
                item['bStockNo'] = stock.xpath('your_xpath_selector').extract_first()
                # ...repeat for the remaining fields
                yield item
    
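As a concrete illustration of how the placeholder selectors might be filled in, here is a sketch against a hypothetical static quote table (the table id table_wrapper-table and the column order are assumptions, not the real page structure; East Money renders much of its quote data with JavaScript, so in practice the selectors must be adapted to the actual page or to its underlying data interface):

    def parse(self, response):
        # hypothetical markup: one <tr> per stock, one <td> per column
        rows = response.xpath('//table[@id="table_wrapper-table"]/tbody/tr')
        for index, stock in enumerate(rows, start=1):
            item = StockItem()
            item['id'] = index
            item['bStockNo'] = stock.xpath('./td[2]/a/text()').get()
            item['name'] = stock.xpath('./td[3]/a/text()').get()
            item['latest_price'] = stock.xpath('./td[5]/text()').get()
            # ...the remaining fields follow the same td-index pattern
            yield item
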
  3. Configure the Pipeline to store data in MySQL:

    import mysql.connector
    
    class StockPipeline(object):
        def __init__(self):
            self.conn = mysql.connector.connect(
                host='your_host',
                user='your_user',
                passwd='your_password',
                db='your_database'
            )
            self.cursor = self.conn.cursor()
    
        def process_item(self, item, spider):
            sql = "INSERT INTO your_table_name (id, bStockNo, name, latest_price, change_percent, change_amount, volume, amplitude, high, low, opening, closing) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            values = (item['id'], item['bStockNo'], item['name'], item['latest_price'], item['change_percent'], item['change_amount'], item['volume'], item['amplitude'], item['high'], item['low'], item['opening'], item['closing'])
            self.cursor.execute(sql, values)
            self.conn.commit()
            return item
    
        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()
    

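For this pipeline to run, it also has to be enabled in settings.py. A minimal sketch, assuming the Scrapy project is named stock_spider (adjust the module path to the actual project name):

    ITEM_PIPELINES = {
        'stock_spider.pipelines.StockPipeline': 300,
    }

The target table must also exist before the first insert. One possible definition, with stocks as a hypothetical stand-in for your_table_name:

    CREATE TABLE stocks (
        id VARCHAR(16) PRIMARY KEY,
        bStockNo VARCHAR(16),
        name VARCHAR(64),
        latest_price DECIMAL(10,2),
        change_percent DECIMAL(10,2),
        change_amount DECIMAL(10,2),
        volume VARCHAR(32),
        amplitude DECIMAL(10,2),
        high DECIMAL(10,2),
        low DECIMAL(10,2),
        opening DECIMAL(10,2),
        closing DECIMAL(10,2)
    );
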
The results after running are as follows:

(screenshot: run results)

Reflections:

This task gave me a deeper feel for Scrapy, in particular how Items flow through a Pipeline into external storage.

Assignment ③:

Requirements:

Master the serialized output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the Scrapy framework + XPath + MySQL database storage.
Candidate website: Bank of China: https://www.boc.cn/sourcedb/whpj/

The concrete code steps are as follows:

  1. Create the Item:
    • Define an Item class in the project's items.py file to hold the exchange-rate data.
import scrapy

class ExchangeRateItem(scrapy.Item):
    currency = scrapy.Field()  # currency name
    tbp = scrapy.Field()       # spot exchange buying rate (现汇买入价)
    cbp = scrapy.Field()       # cash buying rate (现钞买入价)
    tsp = scrapy.Field()       # spot exchange selling rate (现汇卖出价)
    csp = scrapy.Field()       # cash selling rate (现钞卖出价)
    time = scrapy.Field()      # publication time
  2. Create the Spider:
    • Create a new Spider class to crawl the exchange-rate data.
import scrapy
from ..items import ExchangeRateItem  # relative import from the project's items.py

class ExchangeRateSpider(scrapy.Spider):
    name = 'exchange_rate'
    start_urls = ['https://www.boc.cn/sourcedb/whpj/']

    def parse(self, response):
        rows = response.xpath('//table[@class="publish"]/tr')[1:]  # [1:] skips the header row
        for row in rows:
            item = ExchangeRateItem()
            item['currency'] = row.xpath('td[1]/text()').get()
            item['tbp'] = row.xpath('td[2]/text()').get()
            item['cbp'] = row.xpath('td[3]/text()').get()
            item['tsp'] = row.xpath('td[4]/text()').get()
            item['csp'] = row.xpath('td[5]/text()').get()
            item['time'] = row.xpath('td[6]/text()').get()
            yield item
  3. Create the Pipeline:
    • Create a new Pipeline class to save the exchange-rate data to a MySQL database.
import mysql.connector

class MySQLPipeline(object):

    def open_spider(self, spider):
        self.conn = mysql.connector.connect(
            host='your_host',
            user='your_username',
            passwd='your_password',
            db='your_database'
        )
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()  # commit all buffered inserts once, at shutdown
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        self.cursor.execute("""
            INSERT INTO exchange_rates (currency, tbp, cbp, tsp, csp, time)
            VALUES (%s, %s, %s, %s, %s, %s)
        """, (
            item['currency'],
            item['tbp'],
            item['cbp'],
            item['tsp'],
            item['csp'],
            item['time']
        ))
        return item

Enable the newly created Pipeline in the settings.py file:

ITEM_PIPELINES = {
    'exchange_rate.pipelines.MySQLPipeline': 300,
}
  4. Create the MySQL database and table:
    • Create a new database and table in MySQL to hold the exchange-rate data.
CREATE DATABASE your_database;
USE your_database;
CREATE TABLE exchange_rates (
    id INT AUTO_INCREMENT PRIMARY KEY,
    currency VARCHAR(255),
    tbp DECIMAL(10,2),
    cbp DECIMAL(10,2),
    tsp DECIMAL(10,2),
    csp DECIMAL(10,2),
    time TIME
);
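
After the spider is run with scrapy crawl exchange_rate, the stored rows can be spot-checked from the MySQL client. A quick sanity query, assuming the database and table defined above:

SELECT currency, tbp, cbp, tsp, csp, time
FROM exchange_rates
LIMIT 10;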

Run results:

(screenshot: run results)

Reflections:

This task gave me a deeper understanding of Scrapy and MySQL, especially how a Pipeline bridges the two.

posted @ 2023-11-01 22:45 xdajun