Data Collection and Fusion: Assignment 3

The Third Assignment

Task ①:

Requirement: pick a website and crawl all of its images, for example the China Weather Network (http://www.weather.com.cn). Use the Scrapy framework to implement both single-threaded and multi-threaded crawling.
- Be sure to limit the crawl, e.g. cap the total number of pages (last 2 digits of your student ID) and the total number of downloaded images (last 3 digits of your student ID).

Output: print the downloaded URLs to the console, save the downloaded images in the images subfolder, and provide a screenshot.

The complete code is as follows:

weather_spider.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class WeatherSpider(scrapy.Spider):
    name = 'weather_spider'
    allowed_domains = ['weather.com.cn']
    start_urls = ['http://www.weather.com.cn/']

    def parse(self, response):
        # Collect every <img src> on the page and emit it as an image item
        for image_url in response.css('img::attr(src)').extract():
            yield {'image_urls': [response.urljoin(image_url)]}

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Issue a download request for each collected image URL
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

# Records every .jpg image URL in a plain-text file
class JpgLinksPipeline:
    def open_spider(self, spider):
        self.file = open('jpg_links.txt', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        image_urls = item.get('image_urls', [])
        jpg_urls = [url for url in image_urls if url.endswith('.jpg')]
        for jpg_url in jpg_urls:
            self.file.write(jpg_url + '\n')
        return item  # Don't forget to return the item!

pipelines.py
# Placeholder pipeline from the project template; the pipelines actually used are defined in weather_spider.py
class WeatherImagesPipeline:
    def process_item(self, item, spider):
        return item
settings.py
BOT_NAME = "weather_images"

SPIDER_MODULES = ["weather_images.spiders"]
NEWSPIDER_MODULE = "weather_images.spiders"
ROBOTSTXT_OBEY = True
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
IMAGES_STORE = 'images'
ITEM_PIPELINES = {
    'weather_images.spiders.weather_spider.JpgLinksPipeline': 100,
    'weather_images.spiders.weather_spider.MyImagesPipeline': 200,
}
The run results are as follows:
jpg_links.txt
https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
https://i.i8tq.com/weather2020/search/rbAd.jpg
http://pi.weather.com.cn/i//product/pic/l/sevp_nsmc_wxbl_fy4a_etcc_achn_lno_py_20231024234500000.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/202310240901372B0E8FDF7D8ED736C3BF947676D00ADA.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/2023102409575512407332009374DDA12F1788006B444F.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/2023102411140839FCD91101374DD69FD3ABD5A86259CA.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nmc_stfc_sfer_er24_achn_l88_p9_20231025010002400.jpg
http://i.weather.com.cn/images/cn/video/lssj/2023/10/24/20231024160107F79724290685BB5A07B8E240DD90185B_m.jpg
http://i.weather.com.cn/images/cn/index/2023/08/14/202308141514280EB4780D353FB4038A49F4980BFDA8F4.jpg
http://i.tq121.com.cn/i/weather2014/index_neweather/v.jpg
https://i.i8tq.com/video/index_v2.jpg
https://i.i8tq.com/adImg/tanzhonghe_pc1.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/17/202310171416368174046B224C8139BAD7CC113E3B54B2.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/20231016172856E95A6146BB0849E1C165777507CEAB0D.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/20231016163046E594C5EB7CD06C31B6452B01A097907E.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/10/16/2023101610255146894E5E20CD8F84ABB5C4EDA8F7223E.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/09/27/202309271054160CE1AF5F9EC5516A795F3528B5C1A284.jpg
http://i.weather.com.cn/images/cn/life/shrd/2023/09/25/202309251837219DDB4EB6B391A780A1E027DA3CA553A6.jpg
http://i.weather.com.cn/images/cn/sjztj/2020/07/20/20200720142523B5F07D41B4AC4336613DA93425B35B5E_xm.jpg
http://pic.weather.com.cn/images/cn/photo/2019/10/28/20191028144048D58023A73C43EC6EEB61610B0AB0AD74_xm.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/20/20231020170222F2252A2BD855BFBEFFC30AE770F716FB.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/20231024152112F91EE473F63AC3360E412346BC26C108.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/24/202310240901372B0E8FDF7D8ED736C3BF947676D00ADA.jpg
http://pic.weather.com.cn/images/cn/photo/2023/09/20/2023092011133671187DBEE4125031642DBE0404D7020D.jpg
http://pic.weather.com.cn/images/cn/photo/2023/10/16/20231016145553827BF524F4F576701FFDEC63F894DD29.jpg
http://i.weather.com.cn/images/cn/news/2021/05/14/20210514192548638D53A47159C5D97689A921C03B6546.jpg
http://i.weather.com.cn/images/cn/news/2021/03/26/20210326150416454FB344B92EC8BD897FA50DF6AD15E8.jpg
http://i.weather.com.cn/images/cn/science/2020/07/28/202007281001285C97A5D6CAD3BC4DD74F98B5EA5187BF.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nmc_stfc_sfer_er24_achn_l88_p9_20231025010002400.jpg
http://pi.weather.com.cn/i//product/pic/m/sevp_nsmc_wxbl_fy4a_etcc_achn_lno_py_20231024234500000.jpg

Screenshot of the downloaded images folder:
Reflections
First, I created a new Scrapy project and a new spider from the command line. In the spider's parse method I used Scrapy's CSS selectors to extract the URLs of all images on the page. I defined two pipeline classes: MyImagesPipeline handles and downloads the images, and JpgLinksPipeline writes the links of the .jpg images to a text file. In settings.py, both pipelines are registered via ITEM_PIPELINES and the image storage path is set via IMAGES_STORE. Finally, the crawl is run with the scrapy crawl weather_spider command. MyImagesPipeline makes sure all images are downloaded into the specified images subdirectory, while JpgLinksPipeline saves all .jpg links into jpg_links.txt. To switch between single-threaded and multi-threaded crawling, set CONCURRENT_REQUESTS to 1 or to a larger value n in settings.py, as sketched below.
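As a concrete illustration, here is a minimal settings.py sketch for the two modes. Strictly speaking, Scrapy is asynchronous rather than multi-threaded; CONCURRENT_REQUESTS simply controls how many requests are in flight at once. The CLOSESPIDER_* caps below are placeholder values for the page/image limits required by the assignment and must be replaced with your own student-ID digits:

# settings.py (sketch only; the caps are placeholder values)
CONCURRENT_REQUESTS = 1        # "single-threaded": one request in flight at a time
# CONCURRENT_REQUESTS = 16     # "multi-threaded": allow 16 concurrent requests

# The built-in CloseSpider extension stops the crawl once a limit is reached
CLOSESPIDER_PAGECOUNT = 23     # total pages crawled (last 2 digits of the student ID)
CLOSESPIDER_ITEMCOUNT = 123    # total items scraped, one image URL per item (last 3 digits)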

Gitee folder link

Task ②

Requirement: become proficient with serialized output of Item and Pipeline data in Scrapy; use the Scrapy framework + XPath + MySQL storage to crawl stock information.
Candidate site: Eastmoney: https://www.eastmoney.com/
Output: stored in a MySQL database (storage and output format as specified in the assignment).
The complete code is as follows:

easymoney_spider.py
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from stock_scraper.items import StockItem

class EastmoneySpider(scrapy.Spider):
    name = 'eastmoney'
    allowed_domains = ['eastmoney.com']
    start_urls = ['http://quote.eastmoney.com/center/gridlist.html#sz_a_board']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Launch an Edge browser and open the start page before crawling begins
        self.driver = webdriver.Edge()
        self.driver.get(self.start_urls[0])
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self, response):
        all_data = pd.DataFrame()
        num_pages_to_scrape = 5

        for _ in range(num_pages_to_scrape):
            try:
                # Wait for the quote table to be present on the page
                table = self.wait.until(
                    EC.presence_of_element_located((By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[5]/div/table'))
                )
                table_html = table.get_attribute('outerHTML')

                # Parse the table HTML with pandas, keeping the stock-code column as strings
                df = pd.read_html(table_html, header=0, converters={'代码': str})[0]

                # Ensure each stock code is 6 digits (pad with leading zeros) and
                # prepend a single quote to each code
                df['代码'] = df['代码'].apply(lambda x: "'" + str(x).zfill(6))

                # Append this page's rows to the combined DataFrame
                all_data = pd.concat([all_data, df], ignore_index=True)

                # Click the "下一页" (next page) button
                next_button = self.wait.until(
                    EC.element_to_be_clickable((By.XPATH, '//a[text()="下一页"]'))
                )
                next_button.click()

            except Exception as e:
                self.logger.error(f"An error occurred: {e}")
                break

        # Close the Selenium driver
        self.driver.quit()

        # Replace NaN values before the rows reach the database pipeline
        all_data = all_data.fillna({
            '相关链接': '',  # replace NaN with an empty string
            '最新价': 0,     # replace NaN with 0
        })

        # Convert the DataFrame rows into Scrapy items and yield them
        for index, row in all_data.iterrows():
            item = StockItem()
            item['serial_number'] = row['序号']
            item['code'] = row['代码']
            item['name'] = row['名称']
            item['related_links'] = row['相关链接']
            item['latest_price'] = row['最新价']
            item['change_percentage'] = row['涨跌幅']
            item['change_amount'] = row['涨跌额']
            item['trade_volume'] = row['成交量(手)']
            item['trade_value'] = row['成交额']
            item['amplitude'] = row['振幅']
            item['highest_price'] = row['最高']
            item['lowest_price'] = row['最低']
            item['opening_price'] = row['今开']
            item['closing_price'] = row['昨收']
            item['volume_ratio'] = row['量比']
            item['turnover_rate'] = row['换手率']
            item['pe_ratio'] = row['市盈率(动态)']
            item['pb_ratio'] = row['市净率']
            item['add_to_watchlist'] = row['加自选']

            yield item
items.py
import scrapy

class StockItem(scrapy.Item):
    serial_number = scrapy.Field()  # 序号
    code = scrapy.Field()  # 代码
    name = scrapy.Field()  # 名称
    related_links = scrapy.Field()  # 相关链接
    latest_price = scrapy.Field()  # 最新价
    change_percentage = scrapy.Field()  # 涨跌幅
    change_amount = scrapy.Field()  # 涨跌额
    trade_volume = scrapy.Field()  # 成交量(手)
    trade_value = scrapy.Field()  # 成交额
    amplitude = scrapy.Field()  # 振幅
    highest_price = scrapy.Field()  # 最高
    lowest_price = scrapy.Field()  # 最低
    opening_price = scrapy.Field()  # 今开
    closing_price = scrapy.Field()  # 昨收
    volume_ratio = scrapy.Field()  # 量比
    turnover_rate = scrapy.Field()  # 换手率
    pe_ratio = scrapy.Field()  # 市盈率(动态)
    pb_ratio = scrapy.Field()  # 市净率
    add_to_watchlist = scrapy.Field()  # 加自选
pipelines.py
import mysql.connector
from stock_scraper.items import StockItem

class MySQLPipeline:
    def open_spider(self, spider):
        self.db = mysql.connector.connect(
            host='localhost',
            user='root',
            password='xck030312',
            database='stock'
        )
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        if isinstance(item, StockItem):
            sql = (
                "INSERT INTO 1stocks "
                "(serial_number, code, name,  latest_price, change_percentage, change_amount, trade_volume, trade_value, "
                "amplitude, highest_price, lowest_price, opening_price, closing_price, volume_ratio, turnover_rate, pe_ratio, pb_ratio)"
                "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            )
            values = (
                item['serial_number'], item['code'], item['name'],  item['latest_price'],
                item['change_percentage'], item['change_amount'], item['trade_volume'], item['trade_value'],
                item['amplitude'], item['highest_price'], item['lowest_price'], item['opening_price'],
                item['closing_price'], item['volume_ratio'], item['turnover_rate'], item['pe_ratio'],
                item['pb_ratio']
            )
            self.cursor.execute(sql, values)
            self.db.commit()

        return item
settings.py
BOT_NAME = "stock_scraper"
SPIDER_MODULES = ["stock_scraper.spiders"]
NEWSPIDER_MODULE = "stock_scraper.spiders"
ITEM_PIPELINES = {'stock_scraper.pipelines.MySQLPipeline': 1}
ROBOTSTXT_OBEY = True
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

Partial results after running:

Reflections
This task combines Scrapy, Selenium, XPath, and MySQL to crawl stock information from Eastmoney and store it. The project consists of a Scrapy spider, an items module that defines the data structure, and a pipelines module that processes and stores the data. In the spider, Selenium and XPath locate and fetch the quote table on the page, including the stock code, name, latest price, and other fields. Pandas is used to process and clean the data read from the web table; cleaning includes dropping unneeded columns and making sure stock codes are formatted correctly. A MySQLPipeline class saves the cleaned data into a MySQL database. The MySQL table is designed to match the fields on the page so that every field maps correctly onto a column of the table; a possible schema is sketched below.
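Since the table itself was created by hand outside the project, the following one-off script is only a sketch of a schema that would match the INSERT statement in MySQLPipeline; the column types (INT/VARCHAR) are assumptions, not the schema actually used:

# create_stock_table.py: hypothetical helper; column types are assumptions
import mysql.connector

db = mysql.connector.connect(host='localhost', user='root',
                             password='xck030312', database='stock')
cursor = db.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS `1stocks` (
        serial_number     INT,
        code              VARCHAR(16),
        name              VARCHAR(32),
        latest_price      VARCHAR(16),
        change_percentage VARCHAR(16),
        change_amount     VARCHAR(16),
        trade_volume      VARCHAR(32),
        trade_value       VARCHAR(32),
        amplitude         VARCHAR(16),
        highest_price     VARCHAR(16),
        lowest_price      VARCHAR(16),
        opening_price     VARCHAR(16),
        closing_price     VARCHAR(16),
        volume_ratio      VARCHAR(16),
        turnover_rate     VARCHAR(16),
        pe_ratio          VARCHAR(16),
        pb_ratio          VARCHAR(16)
    )
""")
db.commit()
db.close()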

Gitee folder link

Task ③:

Requirement: become proficient with serialized output of Item and Pipeline data in Scrapy; use the Scrapy framework + XPath + MySQL storage to crawl data from a foreign-exchange website.
Candidate site: Bank of China: https://www.boc.cn/sourcedb/whpj/
The complete code is as follows:

forex_spider.py
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from forex.items import ForexItem
from scrapy.selector import Selector

class ForexSpider(scrapy.Spider):
    name = "forex"
    start_urls = ['https://www.boc.cn/sourcedb/whpj/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Launch an Edge browser that Selenium will drive
        self.driver = webdriver.Edge()

    def parse(self, response):
        self.driver.get(response.url)
        for page in range(10):
            sel = Selector(text=self.driver.page_source)
            rows = sel.xpath('/html/body/div/div[5]/div[1]/div[2]/table/tbody/tr')[1:]  # Skip header row
            for row in rows:
                item = ForexItem()
                item['currency'] = row.xpath('td[1]/text()').get()
                item['tbp'] = row.xpath('td[2]/text()').get()
                item['cbp'] = row.xpath('td[3]/text()').get()
                item['tsp'] = row.xpath('td[4]/text()').get()
                item['csp'] = row.xpath('td[5]/text()').get()
                item['bank_conversion_price'] = row.xpath('td[6]/text()').get()
                item['release_date'] = row.xpath('td[7]/text()').get()
                item['release_time'] = row.xpath('td[8]/text()').get()
                yield item

            next_button = self.driver.find_element(By.XPATH, '/html/body/div/div[5]/div[2]/div/ol/li[7]')
            next_button.click()

    def close(self, reason):
        self.driver.quit()
items.py
import scrapy

class ForexItem(scrapy.Item):
    currency = scrapy.Field()
    tbp = scrapy.Field()
    cbp = scrapy.Field()
    tsp = scrapy.Field()
    csp = scrapy.Field()
    bank_conversion_price = scrapy.Field()
    release_date = scrapy.Field()
    release_time = scrapy.Field()
pipelines.py
import mysql.connector

class ForexPipeline:

    def open_spider(self, spider):
        self.connection = mysql.connector.connect(
            host='localhost',
            user='root',
            password='xck030312',
            database='forex_data'
        )
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.cursor.execute("""
            INSERT INTO exchange_rates (currency, tbp, cbp, tsp, csp, bank_conversion_price, release_date, release_time)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
        """, (
            item['currency'],
            item['tbp'],
            item['cbp'],
            item['tsp'],
            item['csp'],
            item['bank_conversion_price'],
            item['release_date'],
            item['release_time']
        ))
        self.connection.commit()
        return item
settings.py
BOT_NAME = "forex"
SPIDER_MODULES = ["forex.spiders"]
NEWSPIDER_MODULE = "forex.spiders"
ROBOTSTXT_OBEY = True
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
ITEM_PIPELINES = {
   'forex.pipelines.ForexPipeline': 300,
}

Database screenshot of the results:

Reflections
The goal of this task is to crawl data from a foreign-exchange website and store it, using the Scrapy and Selenium frameworks together with XPath and MySQL. First, a Scrapy project and an item class defining the fields to crawl are created. Then a spider loads the page with Selenium and the Edge browser, and locates and extracts the table data via XPath. To walk through 10 pages of the site, the loop uses Selenium to click the "下一页" (next page) button. The extracted data is handled by a Scrapy pipeline, which defines how it is stored in the MySQL database. Finally, running the Scrapy spider walks through all the pages, extracts the data, and saves it to MySQL. One thing the loop does not do is wait for the table to refresh after the click, so the next iteration can occasionally re-read the previous page; a more defensive variant is sketched below.
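A minimal sketch of the pagination step with an explicit wait, assuming the same XPath locators as in forex_spider.py and assuming the click replaces the table element in the DOM (the old element going stale is used as the signal that the page has refreshed). This is an illustration, not the code that produced the screenshot:

# Hypothetical refinement of the pagination loop in forex_spider.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Edge()
driver.get('https://www.boc.cn/sourcedb/whpj/')
wait = WebDriverWait(driver, 10)

for page in range(10):
    # Locate the rate table on the current page (same XPath as the spider)
    table = wait.until(EC.presence_of_element_located(
        (By.XPATH, '/html/body/div/div[5]/div[1]/div[2]/table')))

    # ... parse the rows here, exactly as in ForexSpider.parse ...

    # Click "下一页" and wait for the old table element to go stale,
    # which signals that the page content has been replaced
    next_button = driver.find_element(
        By.XPATH, '/html/body/div/div[5]/div[2]/div/ol/li[7]')
    next_button.click()
    wait.until(EC.staleness_of(table))

driver.quit()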

Gitee folder link

posted @ 2023-10-25 09:38  拾霜