Assignment ①

1. Weather website crawling experiment

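Requirements
Choose a website and crawl all of its images; here the target is the China Weather site (http://www.weather.com.cn). Implement the crawl in both single-threaded and multi-threaded form.
Output: save the downloaded images in an images subfolder.
Core code
To collect every image link, I parse the HTML with BeautifulSoup and find all <img> tags on the page, then read the src attribute of each one. Some src values are relative paths, so I use urljoin to turn them into full URLs.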

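import os
import threading
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

website = "http://www.weather.com.cn"
headers = {'User-Agent': 'Mozilla/5.0'}

# Request the page and get its HTML source
response = requests.get(website, headers=headers, timeout=10)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'lxml')  # parse the HTML with the lxml parser
img_tags = soup.find_all('img')              # collect all <img> tags
img_urls = []

# Read the src attribute of each <img> and build the full image URL
for img in img_tags:
    src = img.get('src')
    if not src:  # skip <img> tags that have no src attribute
        continue
    img_url = urljoin(website, src)  # resolve relative paths into absolute URLs
    img_urls.append(img_url)

# Deduplicate with a set, then convert back to a list
img_urls = list(set(img_urls))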
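
The URL handling above can be tried without any network access. The following is a minimal sketch using only the standard library; the helper name resolve_srcs and the sample src values are made up for illustration. It shows how urljoin treats root-relative, protocol-relative and already-absolute src values, and it uses dict.fromkeys instead of set so that the first-seen order is kept.

```python
from urllib.parse import urljoin

BASE = "http://www.weather.com.cn"


def resolve_srcs(base, srcs):
    """Resolve raw src values against base, dropping blanks and duplicates."""
    resolved = []
    for src in srcs:
        if not src:  # missing or empty src attribute
            continue
        resolved.append(urljoin(base, src.strip()))
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(resolved))


# Illustrative src values, not taken from the real page
sample_srcs = [
    "/images/logo.png",           # root-relative
    "//i.weather.com.cn/a.jpg",   # protocol-relative
    "https://example.com/b.gif",  # already absolute
    "/images/logo.png",           # duplicate
    None,                         # <img> without src
]
urls = resolve_srcs(BASE, sample_srcs)
for u in urls:
    print(u)
```

Keeping the order is optional for downloading, but it makes the output easier to compare between runs.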

I implemented the download step in two ways, single-threaded and multi-threaded, to compare their speed.
The single-threaded version is as follows:

os.makedirs('images', exist_ok=True)  # make sure the output folder exists

for img_url in img_urls:
    # Take the file name from the URL path
    filename = os.path.basename(urlparse(img_url).path)
    img_path = os.path.join('images', filename)
    # Request the image data
    r = requests.get(img_url, headers=headers, timeout=10)
    # Write the bytes to a local file
    with open(img_path, 'wb') as f:
        f.write(r.content)
    print(f"Downloaded: {img_url}")
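
Taking the file name straight from the URL path has two edge cases: a URL whose path ends in "/" gives an empty name, and two images with the same base name overwrite each other. The sketch below is one possible way to handle both. The helpers local_filename and unique_paths and the fallback extension .jpg are assumptions for illustration, not part of the original program.

```python
import os
from urllib.parse import urlparse


def local_filename(img_url, index):
    """Derive a file name from the URL path; fall back to an indexed name."""
    name = os.path.basename(urlparse(img_url).path)
    if not name:
        name = f"image_{index}.jpg"  # assumed default extension
    return name


def unique_paths(img_urls, folder="images"):
    """Map each URL to a path in folder, renaming repeated base names."""
    seen = set()
    paths = []
    for i, url in enumerate(img_urls):
        name = local_filename(url, i)
        if name in seen:  # same base name seen before: add the index
            stem, ext = os.path.splitext(name)
            name = f"{stem}_{i}{ext}"
        seen.add(name)
        paths.append(os.path.join(folder, name))
    return paths


demo = unique_paths([
    "http://a.example/x/logo.png",
    "http://b.example/y/logo.png",
    "http://c.example/",
])
print(demo)
```

The index suffix is a simple tie-breaker; a hash of the URL would work as well if stable names across runs matter.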

The single-threaded version simply requests the images one at a time, in order.
The multi-threaded version is as follows:

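os.makedirs('images', exist_ok=True)  # make sure the output folder exists
threads = []

# Create one download thread per image
for img_url in img_urls:
    # The default argument u=img_url binds the current URL, so later loop
    # iterations cannot change the value seen inside the function
    def task(u=img_url):
        filename = os.path.basename(urlparse(u).path)
        img_path = os.path.join('images', filename)
        r = requests.get(u, headers=headers, timeout=10)
        with open(img_path, 'wb') as f:
            f.write(r.content)
        print(f"Downloaded: {u}")

    t = threading.Thread(target=task)
    threads.append(t)
    t.start()  # start the thread

# Wait for all threads to finish
for t in threads:
    t.join()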

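In the multi-threaded version I define a task function that downloads one image and saves it locally, and I create one thread per image. Instead of waiting for one image to finish before starting the next, several downloads are in flight at once. Multi-threading is faster here because downloading is dominated by waiting on the network: while one thread waits for a response, the others can send their own requests, so the waiting time overlaps.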
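
The overlap of waiting time can be seen without touching the network. The sketch below swaps in a different technique from the one used above: a bounded thread pool (concurrent.futures.ThreadPoolExecutor) instead of one thread per image, and a fake_download function that only sleeps to stand in for requests.get. The URLs, the 0.1 s delay and the pool size of 4 are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for a real download: sleeps to simulate network latency."""
    time.sleep(0.1)
    return url, len(url)


urls = [f"http://example.com/img_{i}.png" for i in range(8)]

# Sequential: each call waits for the previous one to finish
start = time.perf_counter()
serial = [fake_download(u) for u in urls]
serial_time = time.perf_counter() - start

# Pooled: up to 4 calls wait at the same time
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    pooled = list(pool.map(fake_download, urls))  # map keeps input order
pooled_time = time.perf_counter() - start

print(f"sequential: {serial_time:.2f}s, pooled: {pooled_time:.2f}s")
```

A pool caps the number of simultaneous requests, which is gentler on the target site than starting one thread for every image on a page with many images.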
Results

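2. Reflections
Running both versions made the speed difference between them obvious. On a small task like this one the gap may be only ten-odd seconds, but on a large project it would likely be much bigger, and I now have a clearer picture of how multi-threading is implemented. The whole program starts by fetching the page source with requests, then uses BeautifulSoup to extract all <img> tags, then uses urljoin to turn relative paths into full URLs, and finally saves each image to a local folder under the file name taken from its URL.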

Assignment ②

1. Targeted stock information crawler experiment

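Requirements
Become proficient with serializing and persisting data through Scrapy's Item and Pipeline.
Learn the Scrapy + dynamic API analysis + SQLite storage workflow and use it to crawl stock information.
Crawl the stock list from Eastmoney (东方财富网) and store it in a local stocks.db database.
Output: the database storage format meets the experiment requirements, with English column names (e.g. bStockNo).
Core code
The spider builds requests for pages 1, 2 and 3 in turn. The query parameters that look arbitrary (fs, fid, ut, ...) are defined by the Eastmoney API and can be copied directly from the browser's Network panel. The parse_api callback then parses the JSON, wraps each stock record in an Item, and yields it to Scrapy's Item Pipeline, which later writes it to the database.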

import json

import scrapy
# StockdemoItem is the Item class defined in the project's items.py

class EastmoneyStockSpider(scrapy.Spider):
    # Spider name (run with: scrapy crawl eastmoney_stock)
    name = "eastmoney_stock"
    # Domains the spider is allowed to visit
    allowed_domains = ["eastmoney.com", "push2.eastmoney.com", "48.push2.eastmoney.com"]

    def start_requests(self):
        max_page = 3          # number of pages to crawl
        page_size = 50        # records per page
        for pn in range(1, max_page + 1):
            # Eastmoney stock list API (paginated)
            url = (
                "http://48.push2.eastmoney.com/api/qt/clist/get"
                "?pn={pn}&pz={pz}&po=1&np=1"
                "&ut=bd1d9ddb04089700cf9c27f6f7426281"
                "&fltt=2&invt=2&wbp2u=|0|0|0|web"
                "&fid=f3"
                "&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048"
                "&fields=f2,f3,f4,f5,f6,f7,"
                "f12,f14,f15,f16,f17,f18"
            ).format(pn=pn, pz=page_size)

            print("Requesting page", pn, "API:", url)
            # Send the request; parse_api handles the response and receives the page number pn
            yield scrapy.Request(url=url, callback=self.parse_api, meta={'pn': pn})

    def parse_api(self, response):
        pn = response.meta.get("pn")          # current page number
        text = response.text.strip()

        # Only parse responses that look like a JSON object; otherwise skip the page
        if not text.startswith("{"):
            self.logger.warning("Page %s did not return a JSON object", pn)
            return
        data = json.loads(text)

        # "diff" is the field holding the stock list; "data" itself may be null
        diff_list = (data.get("data") or {}).get("diff", [])
        self.logger.info("Page %s returned %s stock records", pn, len(diff_list))

        # Wrap each stock record in an Item
        for record in diff_list:
            item = StockdemoItem()
            item["stock_code"]   = record.get("f12", "")     # stock code
            item["stock_name"]   = record.get("f14", "")     # stock name
            item["latest_price"] = record.get("f2", "")      # latest price
            item["change_pct"]   = record.get("f3", "")      # change (%)
            item["change_amt"]   = record.get("f4", "")      # change (amount)
            item["volume"]       = record.get("f5", "")      # volume
            item["turnover"]     = record.get("f6", "")      # turnover
            item["amplitude"]    = record.get("f7", "")      # amplitude
            item["high_price"]   = record.get("f15", "")     # high
            item["low_price"]    = record.get("f16", "")     # low
            item["open_price"]   = record.get("f17", "")     # open
            item["pre_close"]    = record.get("f18", "")     # previous close

            # Hand the item to the pipelines for further processing
            yield item
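
The JSON handling in parse_api can be factored into a small function and checked offline. The sketch below is a standalone version under a few assumptions that are not confirmed by the original: that the response might arrive wrapped in a JSONP callback, that "data" might be null, and that "diff" might be a dict rather than a list. The function name extract_records and the sample payloads are made up for illustration.

```python
import json
import re


def extract_records(text):
    """Return the list of stock records from an API response body."""
    text = text.strip()
    # Assumption: strip a JSONP wrapper such as callback({...}); if present
    match = re.match(r"^[\w$.]+\((.*)\)\s*;?$", text, re.S)
    if match:
        text = match.group(1)
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return []  # not JSON at all, e.g. an HTML error page
    if not isinstance(payload, dict):
        return []
    data = payload.get("data") or {}  # "data" may be null
    diff = data.get("diff") or []
    if isinstance(diff, dict):        # defensive: accept a dict of records
        diff = list(diff.values())
    return diff


# Made-up sample payloads
plain = '{"data": {"diff": [{"f12": "600000", "f14": "Sample A"}]}}'
jsonp = 'cb123({"data": {"diff": [{"f12": "000001"}]}});'
print(extract_records(plain))
print(extract_records(jsonp))
print(extract_records('{"data": null}'))
```

Returning an empty list in every failure case lets the calling loop run zero times instead of raising inside the spider callback.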

Next is the database code, which writes the data into the database:

import sqlite3

class SqlitePipeline(object):
    def open_spider(self, spider):
        """When the spider starts: connect to the database and create the table."""
        self.conn = sqlite3.connect("stocks.db")
        self.cursor = self.conn.cursor()

        create_table_sql = """
        CREATE TABLE IF NOT EXISTS stock_info (
            id INTEGER PRIMARY KEY AUTOINCREMENT,   -- row id
            stock_code   TEXT,                      -- stock code
            stock_name   TEXT,                      -- stock name
            latest_price TEXT,                      -- latest price
            change_pct   TEXT,                      -- change (%)
            change_amt   TEXT,                      -- change (amount)
            volume       TEXT,                      -- volume
            turnover     TEXT,                      -- turnover
            amplitude    TEXT,                      -- amplitude
            high_price   TEXT,                      -- high
            low_price    TEXT,                      -- low
            open_price   TEXT,                      -- open
            pre_close    TEXT                       -- previous close
        );
        """
        self.cursor.execute(create_table_sql)
        self.conn.commit()

    def close_spider(self, spider):
        """When the spider finishes: close the connection."""
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        """Insert one database row for each item received."""
        insert_sql = """
        INSERT INTO stock_info(
            stock_code, stock_name, latest_price,
            change_pct, change_amt, volume, turnover,
            amplitude, high_price, low_price,
            open_price, pre_close
        ) VALUES (
            ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?
        )
        """
        data = (
            item.get("stock_code"),
            item.get("stock_name"),
            item.get("latest_price"),
            item.get("change_pct"),
            item.get("change_amt"),
            item.get("volume"),
            item.get("turnover"),
            item.get("amplitude"),
            item.get("high_price"),
            item.get("low_price"),
            item.get("open_price"),
            item.get("pre_close"),
        )
        print(">>> SqlitePipeline inserted one record:", data)  # for debugging
        self.cursor.execute(insert_sql, data)
        self.conn.commit()
        return item
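
The pipeline above appends a new row on every run, so crawling twice stores each stock twice, and it commits once per item. The sketch below shows one possible alternative, not the method used in this experiment: making stock_code the primary key, using INSERT OR REPLACE, and writing a batch inside a single transaction. It uses an in-memory database, a reduced column set and made-up sample rows; the helper name save_batch is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS stock_info (
        stock_code   TEXT PRIMARY KEY,
        stock_name   TEXT,
        latest_price REAL
    )
    """
)


def save_batch(conn, rows):
    """Insert or overwrite rows in one transaction."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany(
            "INSERT OR REPLACE INTO stock_info "
            "(stock_code, stock_name, latest_price) VALUES (?, ?, ?)",
            rows,
        )


# Made-up sample rows
save_batch(conn, [("600000", "Sample A", 10.5), ("000001", "Sample B", 12.0)])
save_batch(conn, [("600000", "Sample A", 10.8)])  # same code: row is replaced

rows = conn.execute(
    "SELECT stock_code, latest_price FROM stock_info ORDER BY stock_code"
).fetchall()
print(rows)
```

Whether to overwrite or to keep a history of prices depends on the goal; for a history, a timestamp column alongside the append-only insert would be the more fitting design.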

Results

The exported CSV file is shown in the figure.
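
The export step itself is not shown in the post. One way it could be done with the standard library is sketched below, with the assumptions stated plainly: the helper export_table is hypothetical, the demo uses an in-memory table with made-up rows instead of stocks.db, and the output goes to a string buffer rather than a file.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write every row of table to out as CSV, with a header line."""
    # The table name is interpolated directly, so it must be trusted input
    cur = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)


# Demo with an in-memory table and made-up rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_info (stock_code TEXT, stock_name TEXT)")
conn.execute("INSERT INTO stock_info VALUES ('600000', 'Sample A')")

buffer = io.StringIO()
export_table(conn, "stock_info", buffer)
print(buffer.getvalue())
```

To write a real file, open it with open(path, "w", newline="", encoding="utf-8-sig") and pass it as out; the utf-8-sig encoding helps spreadsheet programs display Chinese stock names correctly.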
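2. Reflections
This task gave me a better understanding of how to create a Scrapy project and how to connect to a database and write to it. I now understand the full path an item takes from crawling to storage, and writing my own pipeline to create the table and insert rows deepened my understanding of Scrapy's data persistence mechanism.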

Assignment ③

1. Crawling a foreign exchange data website

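Requirements
Become proficient with serializing data through Scrapy's Item and Pipeline.
Use the Scrapy + XPath + database storage workflow (SQLite in this experiment) to crawl the Bank of China foreign exchange rate page.
Candidate site: Bank of China foreign exchange rates (https://www.boc.cn/sourcedb/whpj/)
Output: store the crawled data in a database with the columns Currency (currency name), TBP (spot exchange buying price), CBP (cash buying price), TSP (spot exchange selling price), CSP (cash selling price), Time (publish time).
Core code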
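First I locate the table that holds the data: an XPath expression finds the first <table> in the page's main content area. For each row, the text of every <td> is extracted and filled into an item, which is then handed to the pipeline for storage.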

import scrapy
# WhpjItem is the Item class defined in the project's items.py

class BocRateSpider(scrapy.Spider):
    name = "boc_rate"
    allowed_domains = ["boc.cn"]
    start_urls = ["https://www.boc.cn/sourcedb/whpj/"]

    def parse(self, response):

        # Locate the first table in the main content area
        table = response.xpath('//div[contains(@class,"BOC_main")]//table[1]')

        # Skip the header row; every row from the second one on is data
        rows = table.xpath('.//tr[position()>1]')
        self.logger.info("Parsed %s exchange rate rows", len(rows))

        for tr in rows:
            # Read each cell separately (XPath), so an empty cell stays an
            # empty string and does not shift the columns after it
            tds = [td.xpath('string(.)').get(default='').strip()
                   for td in tr.xpath('./td')]
            if len(tds) < 8:
                continue

            item = WhpjItem()
            item["currency"] = tds[0]
            item["tbp"]      = tds[1]
            item["cbp"]      = tds[2]
            item["tsp"]      = tds[3]
            item["csp"]      = tds[4]
            item["time"]     = tds[7]

            print(">>> item:", item["currency"], item["tbp"], item["cbp"],
                  item["tsp"], item["csp"], item["time"])

            yield item
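
When cell text is mapped to fields by position, it matters whether empty cells are kept. The sketch below illustrates the difference on a tiny hand-written table. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the currency name and prices are made up.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: the third cell (CBP) of the data row is empty
html = (
    "<table>"
    "<tr><th>Currency</th><th>TBP</th><th>CBP</th><th>TSP</th></tr>"
    "<tr><td>AAA</td><td>1.10</td><td></td><td>1.30</td></tr>"
    "</table>"
)
row = ET.fromstring(html).findall("tr")[1]  # second row holds the data

# Flattening all text nodes and dropping blanks loses the empty cell,
# so every value after it moves one position to the left
flat = [t.strip() for t in row.itertext() if t.strip()]

# Reading one string per <td> keeps the empty cell in place
cells = ["".join(td.itertext()).strip() for td in row.findall("td")]

print(flat)
print(cells)
```

In the flattened list the selling price sits at index 2, where the cash buying price is expected, which is the kind of silent misalignment that per-cell extraction avoids.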

Database storage code

import sqlite3

class SqlitePipeline(object):
    def open_spider(self, spider):
        """When the spider starts: connect to (or create) the database and create the table."""
        self.conn = sqlite3.connect("whpj.db")
        self.cursor = self.conn.cursor()

        create_sql = """
        CREATE TABLE IF NOT EXISTS fx_rate (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            currency TEXT,
            tbp      TEXT,
            cbp      TEXT,
            tsp      TEXT,
            csp      TEXT,
            time     TEXT
        );
        """
        self.cursor.execute(create_sql)
        self.conn.commit()

    def close_spider(self, spider):
        """When the spider finishes: close the connection."""
        self.cursor.close()
        self.conn.close()

    def process_item(self, item, spider):
        """Write each incoming item to the database."""
        insert_sql = """
        INSERT INTO fx_rate(currency, tbp, cbp, tsp, csp, time)
        VALUES (?, ?, ?, ?, ?, ?)
        """
        data = (
            item.get("currency"),
            item.get("tbp"),
            item.get("cbp"),
            item.get("tsp"),
            item.get("csp"),
            item.get("time"),
        )
        print(">>> Inserted record:", data)  # debug output
        self.cursor.execute(insert_sql, data)
        self.conn.commit()
        return item
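
The pipeline stores every price as TEXT, which is enough for this experiment. If the rates were to be sorted or compared numerically later, the values could be converted before insertion. The sketch below is one possible helper for that; the name to_float and the choice to map blank or non-numeric cells to None (stored as NULL by sqlite3) are assumptions for illustration.

```python
def to_float(text):
    """Convert a scraped price string to float, or None if it is not a number."""
    if text is None:
        return None
    text = text.strip().replace(",", "")  # drop thousands separators
    if not text:
        return None  # empty cell
    try:
        return float(text)
    except ValueError:
        return None  # placeholder text such as "--"


print(to_float(" 716.5 "))
print(to_float(""))
print(to_float("1,234.5"))
```

With this helper, the price columns in the CREATE TABLE statement would be declared REAL instead of TEXT.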

Results

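2. Reflections
I took away two main things. First, I became more attentive to page structure and XPath. At first I only knew the table was somewhere on the page; after inspecting the elements in the browser a few more times I confirmed that the first table under BOC_main was the data I wanted, and then split it by row and by column. Second, I got a better feel for Scrapy's parsing flow: when the response comes in, first locate the table, then loop over the tr rows to read the td text, strip the whitespace, map the values to Item fields in order, and finally hand the item to the pipeline. I used to think of Scrapy as somewhat "heavy", but after actually finishing a small feature with it, it feels quite handy for structured data such as tables and lists. Once the XPath expressions and the field mapping are sorted out up front, later extension and maintenance should be fairly easy.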

posted on 2025-11-18 14:49 林焜