Assignment 2

Task ①

Requirements
Crawl the 7-day weather forecast for a given set of cities from the China Weather Network (http://www.weather.com.cn) and save it to a database.

Approach: as in the previous assignment, inspect the page source; the daily forecast data sits in the <li> elements of the 7-day list, so the page is fetched with requests, parsed with bs4, and the rows are then written into an SQLite database with sqlite3.
image

Core code

import requests
import sqlite3
from bs4 import BeautifulSoup

# ========== 1. Request the page ==========
city_name = "福州"
url = "https://www.weather.com.cn/weather/101230101.shtml"
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers)
resp.encoding = "utf-8"

soup = BeautifulSoup(resp.text, "html.parser")

# ========== 2. Parse the weather data (first 7 days) ==========
ul = soup.find("div", id="7d").find("ul")
data_list = []

for li in ul.find_all("li")[:7]:
    date = li.find("h1").text.strip()
    wea = li.find("p", class_="wea").text.strip()
    temp = li.find("p", class_="tem").text.strip()

    data_list.append([city_name, date, wea, temp])
# ========== 3. Save to the SQLite database ==========
conn = sqlite3.connect("weather.db")
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE IF NOT EXISTS weather (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        city TEXT,
        date TEXT,
        weather TEXT,
        temperature TEXT
    )
""")

cursor.executemany(
    "INSERT INTO weather (city, date, weather, temperature) VALUES (?,?,?,?)",
    data_list
)

conn.commit()
conn.close()

Results
Data file weather.db:

image
Terminal output:

image

Takeaways:
(1) Learned how to save crawled data into a .db file.
(2) A headers field (User-Agent) must be included in the request to avoid being blocked by anti-crawling checks.
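
Not part of the original script: a quick way to verify the .db file is to read a few rows back with sqlite3. A minimal sketch, using the table and column names from the CREATE TABLE statement above:

```python
import sqlite3

# Read back a few rows from weather.db to confirm the insert worked.
conn = sqlite3.connect("weather.db")
for row in conn.execute("SELECT city, date, weather, temperature FROM weather LIMIT 7"):
    print(row)
conn.close()
```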

Task ②

Requirements
Use requests together with JSON parsing to crawl stock information from a target API and store it in a database.

Approach: (1) Open the F12 developer tools to find the JS/API request that actually carries the stock data, and inspect its request URL and payload.
image
image
image
(2) Request that API URL with the requests library, passing the query-string parameters as a dict.

import json
import sqlite3

import requests
from bs4 import BeautifulSoup

# =========================================================
# Configuration
# =========================================================
url = "https://vip.stock.finance.sina.com.cn/quotes_service/api/json_v2.php/Market_Center.getHQNodeData"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
# Paging configuration
MAX_ITEMS = 60  # maximum number of records to fetch
PAGE_SIZE = 30  # records per page
all_data = []
current_page = 1

print(f"开始爬取数据,目标最大条目数: {MAX_ITEMS}")

# =========================================================
# Fetch data page by page
# =========================================================

while len(all_data) < MAX_ITEMS:
    params = {
        "page": str(current_page),
        "num": str(PAGE_SIZE),
        "sort": "changepercent",
        "asc": "0",
        "node": "hs_a",
        "symbol": "",
        "_s_r_a": "init"
    }

    # 1. Send the request
    resp = requests.get(url, params=params, headers=headers, timeout=10)
    resp.raise_for_status()  # raise on HTTP errors
    resp.encoding = "utf-8"

(3) Strip the response down to plain text, parse it as JSON, and cap the total number of fetched records at MAX_ITEMS (60).

    # 2. Extract the plain text of the response
    soup = BeautifulSoup(resp.text, "html.parser")
    text_only = soup.text.strip()

    # 3. Parse the JSON
    data = json.loads(text_only)

    if not data:
        print(f"Page {current_page} returned no data; stopping pagination.")
        break

    # 4. Limit how many records are added so the total never exceeds MAX_ITEMS
    items_needed = MAX_ITEMS - len(all_data)
    items_to_add = data[:items_needed]
    all_data.extend(items_to_add)

    print(f"Fetched page {current_page} ({len(items_to_add)} records), total so far: {len(all_data)} / {MAX_ITEMS}")

    if len(all_data) >= MAX_ITEMS:
        break

    current_page += 1

(4) Write the data into stock_data.db with sqlite3.

# =========================================================
# Write the data into the SQLite database
# =========================================================

DB_NAME = "stock_data.db"
conn = None
try:
    # Connect to the database (created automatically if missing)
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()

    # Create the table if it does not exist
    cursor.execute("""
    CREATE TABLE IF NOT EXISTS stock_info (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        symbol TEXT,
        name TEXT,
        trade REAL,
        changepercent REAL,
        pricechange REAL,
        volume INTEGER,
        amount REAL,
        high REAL,
        low REAL,
        open REAL,
        settlement REAL
    )
    """)

    # Prepare rows for insertion
    data_to_insert = []
    for item in all_data:
        data_to_insert.append((
            item.get("symbol", ""),
            item.get("name", ""),
            item.get("trade", ""),
            item.get("changepercent", ""),
            item.get("pricechange", ""),
            item.get("volume", ""),
            item.get("amount", ""),
            item.get("high", ""),
            item.get("low", ""),
            item.get("open", ""),
            item.get("settlement", ""),
        ))

    # Bulk-insert the rows
    cursor.executemany("""
        INSERT INTO stock_info (symbol, name, trade, changepercent, pricechange,
                                volume, amount, high, low, open, settlement)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, data_to_insert)

    conn.commit()

    print(f"\nData successfully written to the SQLite database: {DB_NAME}")
except sqlite3.Error as e:
    # Completing the try block so the excerpt runs; the original handler is not shown in the post.
    print(f"Database error: {e}")
finally:
    if conn:
        conn.close()

**Results**:

image

image
Takeaways
(1) Crawling the JSON data behind the page's JS/API request is much more efficient than scraping the rendered HTML (no regular expressions needed at all). The key steps are locating the API and its data file, then adjusting the request parameters as needed.

Task ③

Requirements
Crawl the information of all universities on the 2021 Shanghai Ranking main list (https://www.shanghairanking.cn/rankings/bcur/2021), store it in a database, and add a GIF recording of the browser F12 debugging/analysis process to the blog post.

Approach: (1) Locate the JS data endpoint behind the page's API: https://www.shanghairanking.cn/_nuxt/static/1762223212/rankings/bcur/202111/payload.js

(GIF: screen recording of the F12 analysis of the 2021 Shanghai Ranking page in Microsoft Edge)

(2) After fetching payload.js, many field values turn out to be single-letter codes (eq, dj, etc.): the data is wrapped in a __NUXT_JSONP__ call whose formal parameters are only bound to their real values at the end of the file. A dictionary is therefore built that maps each parameter name back to its actual value (e.g. i: 双一流, j: 211) and saved as mapping.csv; a sketch of how such a mapping can be generated follows the sample data below. Sample data:

image
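
The core code further down only loads mapping.csv; generating it is not shown in the excerpt. Below is a minimal sketch of how such a parameter-to-value mapping could be built, assuming payload.js keeps the usual __NUXT_JSONP__ shape `(function(a, b, ...){...}(v1, v2, ...))`. The function name build_param_mapping and the naive comma split are illustrative simplifications, not the author's actual script.

```python
import csv
import re

def build_param_mapping(js_path="payload.js", csv_path="mapping.csv"):
    """Hypothetical helper: map the formal parameters of the __NUXT_JSONP__
    wrapper function to the argument values passed at the end of payload.js."""
    with open(js_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Formal parameter names: taken from the first "function(a, b, c, ...)".
    params = [p.strip() for p in
              re.search(r"function\s*\(([^)]*)\)", text).group(1).split(",")]

    # Actual arguments: everything after the last "}(" up to the closing ")));".
    tail = text[text.rfind("}(") + 2:]
    args_str = tail.rsplit(")", 3)[0]
    # Naive comma split: assumes simple scalar arguments (numbers or short
    # quoted strings without embedded commas), which holds for these codes.
    args = [a.strip().strip('"') for a in args_str.split(",")]

    rows = list(zip(params, args))
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(rows)
    return dict(rows)
```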
(3) Locate the univData array (it holds each university's rank, score and other fields), use regular expressions to extract the four keys univNameCn, province, univCategory and score, and put them into the table. For some universities the score field is still a formal-parameter code, so it is replaced with its real value via mapping.csv.
Core code

import os
import re
import sqlite3
import unicodedata
from sqlite3 import Error

# (PROVINCE_CODE_TO_NAME, CATEGORY_CODE_TO_TYPE and the download helper are
#  assumed to be defined elsewhere in the full script; they are omitted from this excerpt.)

# ===================== Main program =====================
def run_ranking_process():
    source_js_url = "https://www.shanghairanking.cn/_nuxt/static/1762223212/rankings/bcur/202111/payload.js"
    local_js_file = "payload.js"
    main_db_file = "university_paiming.db"

    # The original main() did not call download_payload_js; kept consistent here.
    # fetch_js_file(source_js_url, local_js_file)

    # Load the score mapping table
    score_lookup_map = load_score_config("mapping.csv")

    # Parse the data
    final_rankings_data = process_university_data(local_js_file, score_lookup_map)

    # Write to the database
    main_db_conn = establish_database_connection(main_db_file)
    if main_db_conn:
        create_university_table(main_db_conn)
        populate_university_data(main_db_conn, final_rankings_data)

        # ------- Print the first 20 rows -------
        print("\nData preview (first 20 rows):")
        display_headers = ["Rank", "University", "Province", "Category", "Score"]
        column_widths = [6, 20, 10, 10, 8]

        # Print the header row
        for header_text, width_val in zip(display_headers, column_widths):
            print(format_padded_string(header_text, width_val), end="")
        print()
        print("-" * sum(column_widths))

        # Print the data rows
        for rank_entry in final_rankings_data[:20]:
            formatted_row = [
                format_padded_string(rank_entry["rank"], column_widths[0]),
                format_padded_string(rank_entry["name_cn"], column_widths[1]),
                format_padded_string(rank_entry["province"], column_widths[2]),
                format_padded_string(rank_entry["category"], column_widths[3]),
                format_padded_string(rank_entry["score"], column_widths[4])
            ]
            print("".join(formatted_row))

        print(f"\n共解析 {len(final_rankings_data)} 所大学数据,已保存至数据库。")
        main_db_conn.close()
# ===================== Load the mapping file =====================
def load_score_config(config_filepath="mapping.csv"):
    score_mappings = {}
    if not os.path.exists(config_filepath):
        print("Mapping file not found; score mapping will be skipped.")
        return score_mappings
    with open(config_filepath, "r", encoding="utf-8") as f:
        for line_content in f:
            line_content = line_content.strip()
            if not line_content or line_content.startswith("#"):
                continue
            # Accept tab, space, or comma as the field separator
            parts = re.split(r'[\t, ]+', line_content)
            if len(parts) >= 2:
                key, val = parts[0], parts[1]
                score_mappings[key.strip()] = val.strip()

    print(f"成功加载 {len(score_mappings)} 条得分映射。")
    return score_mappings


# ===================== Parsing helper =====================
def _extract_item_field(university_item_str, field_regex, fallback_value=""):
    """
    Extract a field's raw value from one university's record string.
    :param university_item_str: the record string for a single university.
    :param field_regex: regular expression whose group(1) captures the value.
    :param fallback_value: default returned when the match fails.
    :return: the extracted raw string, or the default value.
    """
    match_result = re.search(field_regex, university_item_str)
    if match_result:
        # Group 1 is usually the captured value, strip quotes and whitespace.
        return match_result.group(1).strip('"').strip()
    return fallback_value


# ===================== Data parsing =====================
def process_university_data(js_data_filepath="payload.js", score_conversion_map=None):
    """Parse the local file and extract only: name, province, category, total score (with mapping support)"""

    with open(js_data_filepath, "r", encoding="utf-8") as f:
        js_content = f.read()

    data_block_match = re.search(r'univData:\s*\[(.*?)\],\s*indList:', js_content, re.S)
    if not data_block_match:
        print("University data block not found.")
        return []

    university_records = re.findall(r'\{[^}]*univNameCn:"[^"]+"[^}]*\}', data_block_match.group(1), re.S)
    parsed_university_rankings = []

    for record_index, record_string in enumerate(university_records, 1):
        chinese_name = _extract_item_field(record_string, r'univNameCn:"(.*?)"', f"未知大学_{record_index}")

        # Ranking
        raw_rank_value = _extract_item_field(record_string, r'ranking:([^,]+)', str(record_index))
        ranking_position = int(raw_rank_value) if raw_rank_value.isdigit() else record_index

        # Province
        prov_code = _extract_item_field(record_string, r'province:([^,]+)', "")
        province_name = PROVINCE_CODE_TO_NAME.get(prov_code, "未知省份")

        # Category
        cat_code = _extract_item_field(record_string, r'univCategory:([^,]+)', "")
        category_type = CATEGORY_CODE_TO_TYPE.get(cat_code, "未知类型")

        # Total score
        raw_score_value = _extract_item_field(record_string, r'score:([^,}]+)', None)  # None distinguishes "not found" from an empty string
        final_score = None
        if raw_score_value is not None:
            # Apply the mapping replacement
            if score_conversion_map and raw_score_value in score_conversion_map:
                raw_score_value = score_conversion_map[raw_score_value]
            # Try to convert to float
            try:
                final_score = float(raw_score_value)
            except (ValueError, TypeError):  # also catch TypeError for robustness
                final_score = None

        parsed_university_rankings.append({
            "rank": ranking_position,
            "name_cn": chinese_name,
            "province": province_name,
            "category": category_type,
            "score": final_score
        })

    print(f"成功解析 {len(parsed_university_rankings)} 条排名数据")
    return parsed_university_rankings


# ===================== SQLite database operations =====================
def establish_database_connection(database_filepath):
    """Create the database connection"""
    try:
        db_connection = sqlite3.connect(database_filepath)
        return db_connection
    except Error as e:
        print(f"数据库连接失败: {e}")
        return None


def create_university_table(db_connection):
    """Create the university_rankings table (rank, name, province, category, score)"""
    try:
        db_cursor = db_connection.cursor()
        db_cursor.execute("""
            CREATE TABLE IF NOT EXISTS university_rankings (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                rank INTEGER,
                name_cn TEXT,
                province TEXT,
                category TEXT,
                score REAL
            )
        """)
        db_connection.commit()
        print("University rankings table created successfully")
    except Error as e:
        print(f"Failed to create the table: {e}")


def populate_university_data(db_connection, university_rank_list):
    """Insert the ranking data"""
    if not university_rank_list:
        print("No data to insert")
        return

    try:
        db_cursor = db_connection.cursor()
        db_cursor.execute("DELETE FROM university_rankings")  # clear old data

        insert_sql_query = "INSERT INTO university_rankings (rank, name_cn, province, category, score) VALUES (?, ?, ?, ?, ?)"
        data_to_insert = [(r["rank"], r["name_cn"], r["province"], r["category"], r["score"]) for r in university_rank_list]

        db_cursor.executemany(insert_sql_query, data_to_insert)
        db_connection.commit()
        print(f"成功插入 {db_cursor.rowcount} 条数据")

    except Error as e:
        print(f"插入数据失败: {e}")
        db_connection.rollback()

def calculate_char_display_width(input_string):
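    # Full-width (East Asian wide) characters take two display columns, all others take one.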
    display_width = 0
    for char_unit in str(input_string):
        if unicodedata.east_asian_width(char_unit) in ('F', 'W'):
            display_width += 2
        else:
            display_width += 1
    return display_width


def format_padded_string(input_string, desired_width):
    input_string = str(input_string)
    padding_spaces = desired_width - calculate_char_display_width(input_string)
    return input_string + " " * max(padding_spaces, 0)
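
# Entry point (not shown in the original excerpt; added so the script can run end to end)
if __name__ == "__main__":
    run_ranking_process()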

**Results**:

image

image
Takeaways
(1) The JSON data on this page is quite complex to parse, and many formal-parameter values are not human-readable (they look like garbled codes), so in the end only the needed fields (the univ entries in the data list, etc.) were extracted selectively.
(2) The URL of the site's payload.js seems to change over time; after comparing with classmates I found my URL differed from theirs, which also cost quite a bit of time. A hedged sketch of how the current URL might be discovered automatically follows below.
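
This is not part of the submitted code: a sketch of how the current payload.js URL might be located programmatically, assuming the ranking page's HTML still references the timestamped /_nuxt/static/ path (the regex and the /rankings/bcur/202111/ suffix are taken from the URL used above and may need adjusting).

```python
import re
import requests

PAGE_URL = "https://www.shanghairanking.cn/rankings/bcur/2021"

# Fetch the ranking page and look for a /_nuxt/static/<build-id>/ reference.
html = requests.get(PAGE_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text
m = re.search(r'/_nuxt/static/(\d+)/', html)
if m:
    payload_url = (f"https://www.shanghairanking.cn/_nuxt/static/{m.group(1)}"
                   "/rankings/bcur/202111/payload.js")
    print("Current payload.js URL:", payload_url)
else:
    print("Could not find the static build id in the page HTML.")
```
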
Gitee link: https://gitee.com/liu-yi-huang/data_project/tree/master/作业2
