20242401 2024-2025-2 "Python Programming" Lab 4 Report

Course: Python Programming
Class: 2424
Name: 韦秉辰
Student ID: 20242401
Instructor: 王志强
Lab date: June 1, 2025
Required/Elective: Public elective

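I. Background

As a regular NetEase Cloud Music listener, I often discover popular songs and emerging artists on the platform. Its trending chart reflects the most popular tracks at the moment, and the song rankings, artist performance, and popularity changes it contains are worth analyzing. However, this data is scattered across web pages. Collecting and analyzing it by hand is inefficient and does not show trends in the music market at a glance.

I therefore decided to build a data analysis system for the NetEase Cloud Music trending chart that fetches the Top 100 songs programmatically and visualizes them along several dimensions. Besides helping me understand current trends in the music market, the project is also practice in Python programming and data analysis.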

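II. Objectives

Crawl the NetEase Cloud Music trending chart Top 100: write a Python crawler that fetches the Top 100 songs from the official NetEase Cloud Music website, including song title, artist, rank, duration, and album. The popularity (热度) field is not crawled; the code fills it with simulated random values.
Store and manage the data: save the crawled data as a structured CSV file and support importing and exporting it.
Analyze and visualize the data along several dimensions:
- Bar chart of popularity for the Top 20 songs
- Pie chart of the share held by the Top 10 artists
- Histogram of the song rank distribution
- Bar chart of the number of charting songs per artist
- Histogram of the song duration distribution
- Scatter plot of song popularity against rank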

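III. Lab Environment

Programming language: Python 3.8
Main libraries:
- requests: sends HTTP requests to fetch page content
- BeautifulSoup: parses the HTML page structure
- pandas: data processing and analysis
- matplotlib: data visualization
- numpy: numerical computing support
Development tool: PyCharm
Operating system: Windows 11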

IV. Code Walkthrough

1. Initialization and utility module

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import numpy as np
import time
import random
from collections import Counter
import re

# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号

# 伪装浏览器头部
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Referer': 'https://music.163.com/',
    'Host': 'music.163.com'
}

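What it does:
- Imports every library the lab needs, covering network requests, HTML parsing, data analysis, and visualization
- Sets the Matplotlib font to SimHei so that Chinese labels render correctly, and disables the Unicode minus sign so that negative numbers display properly
- Defines request headers that mimic a browser, to reduce the chance of being blocked by the site's anti-crawling mechanism

2. Data crawling module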

def get_top100():
    """获取网易云音乐热搜榜Top100"""
    url = 'https://music.163.com/discover/toplist'
    print(f"正在爬取网易云音乐热搜榜: {url}")

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # 解析HTML
        soup = BeautifulSoup(response.text, 'html.parser')

        # 找到包含歌曲信息的textarea
        song_data = soup.find('textarea', {'id': 'song-list-pre-data'})
        if not song_data:
            print("未找到歌曲数据，请检查网页结构是否变化")
            return []

        # 解析JSON数据
        songs = json.loads(song_data.text)

        # 提取歌曲信息
        top100 = []
        for i, song in enumerate(songs[:100], 1):
            song_info = {
                '排名': i,
                '歌曲ID': song['id'],
                '歌曲名称': song['name'],
                '歌手': ', '.join(artist['name'] for artist in song['artists']),
                '专辑': song['album']['name'],
                '时长': f"{song['duration'] // 60000}:{str(song['duration'] % 60000 // 1000).zfill(2)}",
                '热度': random.randint(800000, 1500000)  # 模拟热度数据
            }
            top100.append(song_info)
            print(f"已爬取: #{i} {song_info['歌曲名称']} - {song_info['歌手']}")

        print(f"成功爬取{len(top100)}首歌曲")
        return top100

    except Exception as e:
        print(f"爬取失败: {e}")
        return []

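What it does:
- Sends an HTTP request for the NetEase Cloud Music trending chart page
- Parses the HTML with BeautifulSoup and locates the textarea element (id song-list-pre-data) that holds the song data as JSON
- Parses the JSON and extracts each song's rank, ID, title, artists, album, and duration; the 热度 (popularity) value is simulated with random.randint rather than read from the page
- Catches exceptions and returns an empty list on failure, so a failed request or parse does not crash the program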
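
The record-building step can be tried offline, without a network request. The following is a minimal sketch, assuming each song entry carries the same keys the crawler reads (`id`, `name`, `artists`, `album`, and `duration` in milliseconds). The helper names `format_duration` and `to_record` and the sample entry are hypothetical and not part of the original script, and the simulated 热度 field is left out.

```python
import json


def format_duration(ms):
    """Convert a duration in milliseconds to an 'm:ss' string."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


def to_record(rank, song):
    """Build one chart row from a parsed song entry (hypothetical helper)."""
    return {
        '排名': rank,
        '歌曲ID': song['id'],
        '歌曲名称': song['name'],
        '歌手': ', '.join(artist['name'] for artist in song['artists']),
        '专辑': song['album']['name'],
        '时长': format_duration(song['duration']),
    }


# Made-up sample in the shape the crawler expects; not real chart data.
SAMPLE_JSON = '''
[{"id": 1, "name": "Song A",
  "artists": [{"name": "Artist X"}, {"name": "Artist Y"}],
  "album": {"name": "Album Z"}, "duration": 225000}]
'''

records = [to_record(i, s) for i, s in enumerate(json.loads(SAMPLE_JSON), 1)]
print(records[0]['时长'], records[0]['歌手'])
```

Using `divmod` on whole seconds gives the same result as the two integer divisions in the original f-string, and the `:02d` format replaces the `str(...).zfill(2)` call.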

3. Data storage module

def save_to_csv(songs, filename='netease_top100.csv'):
    """保存数据到CSV文件"""
    df = pd.DataFrame(songs)
    df.to_csv(filename, index=False, encoding='utf-8-sig')
    print(f"数据已保存到 {filename}")
    return df

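What it does:
- Converts the list of song records into a pandas DataFrame
- Saves the data as a CSV file encoded as utf-8-sig, so that it can be reused in later analysis
- Returns the DataFrame for the other functions to use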
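
The `utf-8-sig` encoding writes a byte order mark (BOM) at the start of the file, which helps spreadsheet programs such as Excel recognize the file as UTF-8 so that Chinese headers display correctly. The sketch below illustrates this with the standard-library `csv` module instead of pandas, purely so it runs without extra dependencies; the function names and the sample row are hypothetical.

```python
import csv
import os
import tempfile

FIELDS = ['排名', '歌曲名称', '歌手']


def write_rows(path, rows):
    """Write rows to a CSV file with a UTF-8 BOM."""
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def read_rows(path):
    """Read the rows back; utf-8-sig strips the BOM while decoding."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.DictReader(f))


def demo():
    """Round-trip one made-up row and report whether the file starts with a BOM."""
    rows = [{'排名': '1', '歌曲名称': '示例歌曲', '歌手': '示例歌手'}]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.csv')
        write_rows(path, rows)
        with open(path, 'rb') as f:
            has_bom = f.read(3) == b'\xef\xbb\xbf'
        return has_bom, read_rows(path)


print(demo())
```

Reading the file back with the same encoding matters: decoding it as plain `utf-8` would leave the BOM attached to the first column name.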

4. Data analysis and visualization module

def analyze_data(df):
    """分析数据并生成可视化图表"""
    print("\n开始数据分析...")

    # 创建画布
    plt.figure(figsize=(15, 15))

    # 1. 前20名歌曲热度条形图
    plt.subplot(3, 2, 1)
    top20 = df.head(20)
    colors = plt.cm.viridis(np.linspace(0.2, 0.8, 20))
    plt.barh(top20['歌曲名称'] + ' - ' + top20['歌手'], top20['热度'], color=colors)
    plt.xlabel('热度指数')
    plt.title('网易云热搜榜Top20歌曲热度')
    plt.gca().invert_yaxis()  # 反转Y轴使排名第一的在顶部
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # 2. 歌手出现次数饼图
    plt.subplot(3, 2, 2)
    # 提取所有歌手（处理合作歌曲）
    all_artists = []
    for artists in df['歌手']:
        # 分割多个歌手
        for artist in re.split(r', |、|&', artists):
            all_artists.append(artist.strip())

    # 统计歌手出现次数
    artist_counts = Counter(all_artists)
    top_artists = artist_counts.most_common(10)

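    # 准备饼图数据
    labels = [artist[0] for artist in top_artists]
    sizes = [artist[1] for artist in top_artists]

    # 绘制饼图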
    explode = [0.1 if i == 0 else 0 for i in range(len(labels))]  # 突出显示第一名
    plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
            shadow=True, startangle=90)
    plt.axis('equal')
    plt.title('热门歌手Top10占比')

    # 3. 排名分布图
    plt.subplot(3, 2, 3)
    ranks = df['排名']
    plt.hist(ranks, bins=20, color='skyblue', edgecolor='black')
    plt.xlabel('排名区间')
    plt.ylabel('歌曲数量')
    plt.title('歌曲排名分布')
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # 4. 歌手歌曲数量条形图
    plt.subplot(3, 2, 4)
    artist_df = pd.DataFrame(top_artists, columns=['歌手', '歌曲数量'])
    artist_df = artist_df.sort_values(by='歌曲数量', ascending=True)
    plt.barh(artist_df['歌手'], artist_df['歌曲数量'], color='salmon')
    plt.xlabel('歌曲数量')
    plt.title('歌手上榜歌曲数量Top10')
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # 5. 歌曲时长分布
    plt.subplot(3, 2, 5)

    # 转换时长格式为秒
    def time_to_seconds(time_str):
        minutes, seconds = map(int, time_str.split(':'))
        return minutes * 60 + seconds

    df['时长秒数'] = df['时长'].apply(time_to_seconds)
    plt.hist(df['时长秒数'], bins=15, color='lightgreen', edgecolor='black')
    plt.xlabel('时长(秒)')
    plt.ylabel('歌曲数量')
    plt.title('歌曲时长分布')
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # 6. 热度与排名关系
    plt.subplot(3, 2, 6)
    plt.scatter(df['排名'], df['热度'], c=df['排名'], cmap='viridis', alpha=0.7)
    plt.colorbar(label='排名')
    plt.xlabel('排名')
    plt.ylabel('热度')
    plt.title('歌曲热度与排名关系')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.gca().invert_xaxis()  # 反转X轴使排名1在右侧

    # 调整布局
    plt.tight_layout()

    # 保存图表
    plt.savefig('netease_top100_analysis.png', dpi=300)
    print("分析图表已保存为 netease_top100_analysis.png")

    # 显示图表
    plt.show()

    return artist_counts

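What it does:
- Analyzes the song data along several dimensions and draws the results on a single 3x2 figure
- Produces six charts: Top 20 popularity bar chart, Top 10 artist share pie chart, rank distribution histogram, songs-per-artist bar chart, duration distribution histogram, and popularity-versus-rank scatter plot
- Preprocesses the data first, for example splitting multi-artist fields into individual names and converting "m:ss" durations to seconds
- Saves the figure as netease_top100_analysis.png and then displays it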
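
The artist-counting step can be isolated from the plotting code. This is a minimal sketch that uses the same separator pattern as the script (`', '`, `、`, and `&`); the function name `count_artists` and the sample strings are hypothetical.

```python
import re
from collections import Counter

# Same separators as the original re.split call: ", ", "、" and "&".
ARTIST_SEPARATORS = re.compile(r', |、|&')


def count_artists(artist_fields):
    """Count how many chart entries each individual artist appears in."""
    counts = Counter()
    for field in artist_fields:
        for name in ARTIST_SEPARATORS.split(field):
            name = name.strip()
            if name:  # skip empty pieces left by stray separators
                counts[name] += 1
    return counts


counts = count_artists(['A, B', 'A', 'B & C'])
print(counts.most_common(2))
```

One caveat of this approach: an artist or band whose name itself contains `&` is split into two names, so the counts are only approximate for such entries.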

5. Summary module

def generate_summary(df, artist_counts):
    """生成数据分析摘要"""
    print("\n数据分析摘要:")

    # 1. 榜单总体信息
    print(f"1. 榜单包含 {len(df)} 首歌曲")

    # 2. 最热门歌曲
    top_song = df.iloc[0]
    print(f"2. 最热门歌曲: #{top_song['排名']} {top_song['歌曲名称']} - {top_song['歌手']} (热度: {top_song['热度']})")

    # 3. 上榜最多的歌手
    top_artist = artist_counts.most_common(1)[0]
    print(f"3. 上榜次数最多的歌手: {top_artist[0]} (共 {top_artist[1]} 首歌曲)")

    # 4. 歌曲时长分析
    avg_duration = df['时长秒数'].mean()
    min_duration = df['时长秒数'].min()
    max_duration = df['时长秒数'].max()
    print(f"4. 歌曲平均时长: {int(avg_duration // 60)}分{int(avg_duration % 60)}秒")
    print(
        f"   最短时长: {int(min_duration // 60)}分{int(min_duration % 60)}秒, 最长时长: {int(max_duration // 60)}分{int(max_duration % 60)}秒")

    # 5. 热度分布
    avg_heat = df['热度'].mean()
    max_heat = df['热度'].max()
    min_heat = df['热度'].min()
    print(f"5. 平均热度: {avg_heat:,.0f}, 最高热度: {max_heat:,.0f}, 最低热度: {min_heat:,.0f}")

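What it does:
- Extracts the key figures from the analysis results
- Prints a readable summary: chart size, top song, most frequent artist, duration statistics, and popularity statistics. The "top song" is the first row (rank 1), which is not necessarily the row with the highest simulated 热度 value
- Uses formatted output (thousands separators, minutes and seconds) to make the results easier to read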
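
Two of the summary calculations are easy to check on their own. The sketch below assumes the rows are plain dicts with the script's `热度` and `歌曲名称` keys; the helper names and the sample values are hypothetical. It also shows how the song with the highest popularity value could be selected explicitly, instead of taking the first row.

```python
def split_minutes(seconds):
    """Split a duration in seconds into whole minutes and remaining seconds."""
    return divmod(int(seconds), 60)


def hottest_song(songs):
    """Return the row with the largest 热度 value."""
    return max(songs, key=lambda song: song['热度'])


# Made-up rows for illustration only.
sample = [
    {'排名': 1, '歌曲名称': 'Song A', '热度': 900000},
    {'排名': 2, '歌曲名称': 'Song B', '热度': 1200000},
]

minutes, secs = split_minutes(225.7)
print(f"{minutes}分{secs}秒", hottest_song(sample)['歌曲名称'])
```

With pandas, the equivalent selection would be `df.loc[df['热度'].idxmax()]`.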

6. Main function module

def main():
    # 获取数据
    top100_songs = get_top100()

    if not top100_songs:
        print("未获取到数据，程序退出")
        return

    # 保存到CSV
    df = save_to_csv(top100_songs)

    # 数据分析与可视化
    artist_counts = analyze_data(df)

    # 生成分析摘要
    generate_summary(df, artist_counts)


if __name__ == "__main__":
    main()

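What it does:
- Coordinates the modules, calling the crawl, storage, analysis, and summary functions in order
- Includes a simple check so that the program exits cleanly when no data could be fetched

7. Complete code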

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import numpy as np
import time
import random
from collections import Counter
import re

# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号

# 伪装浏览器头部
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Referer': 'https://music.163.com/',
    'Host': 'music.163.com'
}


def get_top100():
    """获取网易云音乐热搜榜Top100"""
    url = 'https://music.163.com/discover/toplist'
    print(f"正在爬取网易云音乐热搜榜: {url}")

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # 解析HTML
        soup = BeautifulSoup(response.text, 'html.parser')

        # 找到包含歌曲信息的textarea
        song_data = soup.find('textarea', {'id': 'song-list-pre-data'})
        if not song_data:
            print("未找到歌曲数据，请检查网页结构是否变化")
            return []

        # 解析JSON数据
        songs = json.loads(song_data.text)

        # 提取歌曲信息
        top100 = []
        for i, song in enumerate(songs[:100], 1):
            song_info = {
                '排名': i,
                '歌曲ID': song['id'],
                '歌曲名称': song['name'],
                '歌手': ', '.join(artist['name'] for artist in song['artists']),
                '专辑': song['album']['name'],
                '时长': f"{song['duration'] // 60000}:{str(song['duration'] % 60000 // 1000).zfill(2)}",
                '热度': random.randint(800000, 1500000)  # 模拟热度数据
            }
            top100.append(song_info)
            print(f"已爬取: #{i} {song_info['歌曲名称']} - {song_info['歌手']}")

        print(f"成功爬取{len(top100)}首歌曲")
        return top100

    except Exception as e:
        print(f"爬取失败: {e}")
        return []


def save_to_csv(songs, filename='netease_top100.csv'):
    """保存数据到CSV文件"""
    df = pd.DataFrame(songs)
    df.to_csv(filename, index=False, encoding='utf-8-sig')
    print(f"数据已保存到 {filename}")
    return df


def analyze_data(df):
    """分析数据并生成可视化图表"""
    print("\n开始数据分析...")

    # 创建画布
    plt.figure(figsize=(15, 15))

    # 1. 前20名歌曲热度条形图
    plt.subplot(3, 2, 1)
    top20 = df.head(20)
    colors = plt.cm.viridis(np.linspace(0.2, 0.8, 20))
    plt.barh(top20['歌曲名称'] + ' - ' + top20['歌手'], top20['热度'], color=colors)
    plt.xlabel('热度指数')
    plt.title('网易云热搜榜Top20歌曲热度')
    plt.gca().invert_yaxis()  # 反转Y轴使排名第一的在顶部
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # 2. 歌手出现次数饼图
    plt.subplot(3, 2, 2)
    # 提取所有歌手（处理合作歌曲）
    all_artists = []
    for artists in df['歌手']:
        # 分割多个歌手
        for artist in re.split(r', |、|&', artists):
            all_artists.append(artist.strip())

    # 统计歌手出现次数
    artist_counts = Counter(all_artists)
    top_artists = artist_counts.most_common(10)

    # 准备饼图数据
    labels = [artist[0] for artist in top_artists]
    sizes = [artist[1] for artist in top_artists]

    # 绘制饼图
    explode = [0.1 if i == 0 else 0 for i in range(len(labels))]  # 突出显示第一名
    plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
            shadow=True, startangle=90)
    plt.axis('equal')
    plt.title('热门歌手Top10占比')

    # 3. 排名分布图
    plt.subplot(3, 2, 3)
    ranks = df['排名']
    plt.hist(ranks, bins=20, color='skyblue', edgecolor='black')
    plt.xlabel('排名区间')
    plt.ylabel('歌曲数量')
    plt.title('歌曲排名分布')
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # 4. 歌手歌曲数量条形图
    plt.subplot(3, 2, 4)
    artist_df = pd.DataFrame(top_artists, columns=['歌手', '歌曲数量'])
    artist_df = artist_df.sort_values(by='歌曲数量', ascending=True)
    plt.barh(artist_df['歌手'], artist_df['歌曲数量'], color='salmon')
    plt.xlabel('歌曲数量')
    plt.title('歌手上榜歌曲数量Top10')
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # 5. 歌曲时长分布
    plt.subplot(3, 2, 5)

    # 转换时长格式为秒
    def time_to_seconds(time_str):
        minutes, seconds = map(int, time_str.split(':'))
        return minutes * 60 + seconds

    df['时长秒数'] = df['时长'].apply(time_to_seconds)
    plt.hist(df['时长秒数'], bins=15, color='lightgreen', edgecolor='black')
    plt.xlabel('时长(秒)')
    plt.ylabel('歌曲数量')
    plt.title('歌曲时长分布')
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # 6. 热度与排名关系
    plt.subplot(3, 2, 6)
    plt.scatter(df['排名'], df['热度'], c=df['排名'], cmap='viridis', alpha=0.7)
    plt.colorbar(label='排名')
    plt.xlabel('排名')
    plt.ylabel('热度')
    plt.title('歌曲热度与排名关系')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.gca().invert_xaxis()  # 反转X轴使排名1在右侧

    # 调整布局
    plt.tight_layout()

    # 保存图表
    plt.savefig('netease_top100_analysis.png', dpi=300)
    print("分析图表已保存为 netease_top100_analysis.png")

    # 显示图表
    plt.show()

    return artist_counts


def generate_summary(df, artist_counts):
    """生成数据分析摘要"""
    print("\n数据分析摘要:")

    # 1. 榜单总体信息
    print(f"1. 榜单包含 {len(df)} 首歌曲")

    # 2. 最热门歌曲
    top_song = df.iloc[0]
    print(f"2. 最热门歌曲: #{top_song['排名']} {top_song['歌曲名称']} - {top_song['歌手']} (热度: {top_song['热度']})")

    # 3. 上榜最多的歌手
    top_artist = artist_counts.most_common(1)[0]
    print(f"3. 上榜次数最多的歌手: {top_artist[0]} (共 {top_artist[1]} 首歌曲)")

    # 4. 歌曲时长分析
    avg_duration = df['时长秒数'].mean()
    min_duration = df['时长秒数'].min()
    max_duration = df['时长秒数'].max()
    print(f"4. 歌曲平均时长: {int(avg_duration // 60)}分{int(avg_duration % 60)}秒")
    print(
        f"   最短时长: {int(min_duration // 60)}分{int(min_duration % 60)}秒, 最长时长: {int(max_duration // 60)}分{int(max_duration % 60)}秒")

    # 5. 热度分布
    avg_heat = df['热度'].mean()
    max_heat = df['热度'].max()
    min_heat = df['热度'].min()
    print(f"5. 平均热度: {avg_heat:,.0f}, 最高热度: {max_heat:,.0f}, 最低热度: {min_heat:,.0f}")


def main():
    # 获取数据
    top100_songs = get_top100()

    if not top100_songs:
        print("未获取到数据，程序退出")
        return

    # 保存到CSV
    df = save_to_csv(top100_songs)

    # 数据分析与可视化
    artist_counts = analyze_data(df)

    # 生成分析摘要
    generate_summary(df, artist_counts)


if __name__ == "__main__":
    main()

V. Demo Run

Is this really the limit?

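VI. Improvements

The improved code turns the single-purpose script into an interactive command-line system. The main changes are:
- Command-line interface: users drive the system by typing commands.
- Modular design: crawling, storage, analysis, and visualization live in separate functions, which makes the code easier to maintain and extend.
- Global data management: the module-level variables top100_songs, df, and artist_counts share data between commands, so nothing has to be fetched or computed twice.
- Search: songs can be looked up by a keyword in the song title or artist name.
- Data persistence: data is saved to and loaded from CSV files, so it can be used offline and kept long-term.
- One-command visualization: analyze generates all six charts at once.
- Help and screen clearing: help lists the available commands, and clear keeps the interface tidy.

These changes move the system from a technical demonstration toward a practical analysis tool that music fans, market analysts, and other users can operate without editing code, in keeping with a user-centered design approach.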
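
As an illustration of the command-line design, the sketch below maps command names to handler functions through a dictionary. This is an alternative to the `if`/`elif` chain used in the improved code, not the way the original script is written; `make_dispatcher` and the two sample handlers are hypothetical.

```python
def make_dispatcher(handlers):
    """Return a function that routes an input line to the matching handler."""
    def dispatch(line):
        parts = line.strip().split()
        if not parts:
            return None  # empty input: nothing to do
        # Lowercase only the command name so arguments keep their case.
        cmd, args = parts[0].lower(), parts[1:]
        handler = handlers.get(cmd)
        if handler is None:
            return f"unknown command: {cmd}"
        return handler(args)
    return dispatch


# Hypothetical handlers that only echo what they would act on.
handlers = {
    'top': lambda args: int(args[0]) if args and args[0].isdigit() else 10,
    'load': lambda args: args[0] if args else 'netease_top100.csv',
}

dispatch = make_dispatcher(handlers)
print(dispatch('TOP 5'), dispatch('load MyData.csv'))
```

Adding a command then means adding one dictionary entry, and the argument parsing stays in a single place.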

1. Improved source code

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import numpy as np
import time
import random
from collections import Counter
import re
import os

# 设置中文字体
plt.rcParams['font.sans-serif'] = ['SimHei']  # 用来正常显示中文标签
plt.rcParams['axes.unicode_minus'] = False  # 用来正常显示负号

# 伪装浏览器头部
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Referer': 'https://music.163.com/',
    'Host': 'music.163.com'
}

# 全局变量
top100_songs = []
df = pd.DataFrame()
artist_counts = Counter()


def get_top100():
    """获取网易云音乐热搜榜Top100"""
    global top100_songs
    url = 'https://music.163.com/discover/toplist'
    print(f"正在爬取网易云音乐热搜榜: {url}")

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # 解析HTML
        soup = BeautifulSoup(response.text, 'html.parser')

        # 找到包含歌曲信息的textarea
        song_data = soup.find('textarea', {'id': 'song-list-pre-data'})
        if not song_data:
            print("未找到歌曲数据，请检查网页结构是否变化")
            return []

        # 解析JSON数据
        songs = json.loads(song_data.text)

        # 提取歌曲信息
        top100 = []
        for i, song in enumerate(songs[:100], 1):
            song_info = {
                '排名': i,
                '歌曲ID': song['id'],
                '歌曲名称': song['name'],
                '歌手': ', '.join(artist['name'] for artist in song['artists']),
                '专辑': song['album']['name'],
                '时长': f"{song['duration'] // 60000}:{str(song['duration'] % 60000 // 1000).zfill(2)}",
                '热度': random.randint(800000, 1500000)  # 模拟热度数据
            }
            top100.append(song_info)
            print(f"已爬取: #{i} {song_info['歌曲名称']} - {song_info['歌手']}")

        print(f"成功爬取{len(top100)}首歌曲")
        top100_songs = top100
        return top100

    except Exception as e:
        print(f"爬取失败: {e}")
        return []


def save_to_csv(songs=None, filename=None):
    """保存数据到CSV文件"""
    global df

    if not songs:
        songs = top100_songs
        if not songs:
            print("没有数据可保存，请先爬取数据")
            return

    if not filename:
        filename = 'netease_top100.csv'

    df = pd.DataFrame(songs)
    df.to_csv(filename, index=False, encoding='utf-8-sig')
    print(f"数据已保存到 {filename}")
    return df


def load_from_csv(filename='netease_top100.csv'):
    """从CSV文件加载数据"""
    global df, top100_songs

    try:
        if not os.path.exists(filename):
            print(f"文件 {filename} 不存在")
            return False

        df = pd.read_csv(filename, encoding='utf-8-sig')
        top100_songs = df.to_dict('records')
        print(f"成功从 {filename} 加载 {len(top100_songs)} 条记录")
        return True
    except Exception as e:
        print(f"加载数据失败: {e}")
        return False


def analyze_data():
    """分析数据并生成可视化图表"""
    global df, artist_counts

    if df.empty:
        print("没有数据可分析，请先爬取或加载数据")
        return

    print("\n开始数据分析...")

    # 创建画布
    plt.figure(figsize=(15, 15))

    # 1. 前20名歌曲热度条形图
    plt.subplot(3, 2, 1)
    top20 = df.head(20)
    colors = plt.cm.viridis(np.linspace(0.2, 0.8, 20))
    plt.barh(top20['歌曲名称'] + ' - ' + top20['歌手'], top20['热度'], color=colors)
    plt.xlabel('热度指数')
    plt.title('网易云热搜榜Top20歌曲热度')
    plt.gca().invert_yaxis()  # 反转Y轴使排名第一的在顶部
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # 2. 歌手出现次数饼图
    plt.subplot(3, 2, 2)
    # 提取所有歌手（处理合作歌曲）
    all_artists = []
    for artists in df['歌手']:
        # 分割多个歌手
        for artist in re.split(r', |、|&', artists):
            all_artists.append(artist.strip())

    # 统计歌手出现次数
    artist_counts = Counter(all_artists)
    top_artists = artist_counts.most_common(10)

    # 准备饼图数据
    labels = [artist[0] for artist in top_artists]
    sizes = [artist[1] for artist in top_artists]

    # 绘制饼图
    explode = [0.1 if i == 0 else 0 for i in range(len(labels))]  # 突出显示第一名
    plt.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
            shadow=True, startangle=90)
    plt.axis('equal')
    plt.title('热门歌手Top10占比')

    # 3. 排名分布图
    plt.subplot(3, 2, 3)
    ranks = df['排名']
    plt.hist(ranks, bins=20, color='skyblue', edgecolor='black')
    plt.xlabel('排名区间')
    plt.ylabel('歌曲数量')
    plt.title('歌曲排名分布')
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # 4. 歌手歌曲数量条形图
    plt.subplot(3, 2, 4)
    artist_df = pd.DataFrame(top_artists, columns=['歌手', '歌曲数量'])
    artist_df = artist_df.sort_values(by='歌曲数量', ascending=True)
    plt.barh(artist_df['歌手'], artist_df['歌曲数量'], color='salmon')
    plt.xlabel('歌曲数量')
    plt.title('歌手上榜歌曲数量Top10')
    plt.grid(axis='x', linestyle='--', alpha=0.7)

    # 5. 歌曲时长分布
    plt.subplot(3, 2, 5)

    # 转换时长格式为秒
    def time_to_seconds(time_str):
        minutes, seconds = map(int, time_str.split(':'))
        return minutes * 60 + seconds

    df['时长秒数'] = df['时长'].apply(time_to_seconds)
    plt.hist(df['时长秒数'], bins=15, color='lightgreen', edgecolor='black')
    plt.xlabel('时长(秒)')
    plt.ylabel('歌曲数量')
    plt.title('歌曲时长分布')
    plt.grid(axis='y', linestyle='--', alpha=0.7)

    # 6. 热度与排名关系
    plt.subplot(3, 2, 6)
    plt.scatter(df['排名'], df['热度'], c=df['排名'], cmap='viridis', alpha=0.7)
    plt.colorbar(label='排名')
    plt.xlabel('排名')
    plt.ylabel('热度')
    plt.title('歌曲热度与排名关系')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.gca().invert_xaxis()  # 反转X轴使排名1在右侧

    # 调整布局
    plt.tight_layout()

    # 保存图表
    plt.savefig('netease_top100_analysis.png', dpi=300)
    print("分析图表已保存为 netease_top100_analysis.png")

    # 显示图表
    plt.show()

    return artist_counts


def generate_summary():
    """生成数据分析摘要"""
    global df, artist_counts

    if df.empty:
        print("没有数据可分析，请先爬取或加载数据")
        return

    print("\n数据分析摘要:")

    # 1. 榜单总体信息
    print(f"1. 榜单包含 {len(df)} 首歌曲")

    # 2. 最热门歌曲
    top_song = df.iloc[0]
    print(f"2. 最热门歌曲: #{top_song['排名']} {top_song['歌曲名称']} - {top_song['歌手']} (热度: {top_song['热度']})")

    # 3. 上榜最多的歌手
    if artist_counts:
        top_artist = artist_counts.most_common(1)[0]
        print(f"3. 上榜次数最多的歌手: {top_artist[0]} (共 {top_artist[1]} 首歌曲)")
    else:
        print("3. 未计算歌手数据，请先进行分析")

    # 4. 歌曲时长分析
    if '时长秒数' in df.columns:
        avg_duration = df['时长秒数'].mean()
        min_duration = df['时长秒数'].min()
        max_duration = df['时长秒数'].max()
        print(f"4. 歌曲平均时长: {int(avg_duration // 60)}分{int(avg_duration % 60)}秒")
        print(
            f"   最短时长: {int(min_duration // 60)}分{int(min_duration % 60)}秒, 最长时长: {int(max_duration // 60)}分{int(max_duration % 60)}秒")
    else:
        print("4. 未计算时长数据，请先进行分析")

    # 5. 热度分布
    avg_heat = df['热度'].mean()
    max_heat = df['热度'].max()
    min_heat = df['热度'].min()
    print(f"5. 平均热度: {avg_heat:,.0f}, 最高热度: {max_heat:,.0f}, 最低热度: {min_heat:,.0f}")


def display_top_songs(count=10):
    """显示排行榜前N首歌曲"""
    global top100_songs

    if not top100_songs:
        print("没有数据可显示，请先爬取或加载数据")
        return

    print(f"\n网易云音乐热搜榜 Top {count}:")
    print("-" * 70)
    print(f"{'排名':<5}{'歌曲名称':<30}{'歌手':<20}{'热度':<15}")
    print("-" * 70)

    for song in top100_songs[:count]:
        print(f"{song['排名']:<5}{song['歌曲名称']:<30}{song['歌手']:<20}{song['热度']:<15,}")

    print("-" * 70)


def search_song(keyword):
    """搜索歌曲"""
    global top100_songs

    if not top100_songs:
        print("没有数据可搜索，请先爬取或加载数据")
        return

    print(f"\n搜索 '{keyword}' 的结果:")
    print("-" * 70)
    print(f"{'排名':<5}{'歌曲名称':<30}{'歌手':<20}{'热度':<15}")
    print("-" * 70)

    found = False
    for song in top100_songs:
        if keyword.lower() in song['歌曲名称'].lower() or keyword.lower() in song['歌手'].lower():
            print(f"{song['排名']:<5}{song['歌曲名称']:<30}{song['歌手']:<20}{song['热度']:<15,}")
            found = True

    if not found:
        print(f"未找到包含 '{keyword}' 的歌曲")

    print("-" * 70)


def display_help():
    """显示帮助信息"""
    print("\n网易云音乐热搜榜数据分析系统 - 命令帮助")
    print("=" * 60)
    print("crawl        - 爬取最新热搜榜Top100")
    print("save [文件名] - 保存数据到CSV文件（默认：netease_top100.csv）")
    print("load [文件名] - 从CSV文件加载数据（默认：netease_top100.csv）")
    print("top [数量]    - 显示前N首歌曲（默认：10）")
    print("search [关键词] - 搜索歌曲或歌手")
    print("analyze      - 分析数据并生成可视化图表")
    print("summary      - 显示数据分析摘要")
    print("clear        - 清除屏幕")
    print("help         - 显示帮助信息")
    print("exit         - 退出程序")
    print("=" * 60)


def main():
    """主函数，提供用户交互界面"""
    print("=" * 60)
    print("网易云音乐热搜榜数据分析系统")
    print("=" * 60)
    display_help()

    while True:
        # 只将命令名转为小写，保留参数（文件名、关键词）的原始大小写
        parts = input("\n请输入命令: ").strip().split()

        if not parts:
            continue

        cmd = parts[0].lower()
        args = parts[1:]

        if cmd == 'crawl':
            print("开始爬取网易云音乐热搜榜...")
            get_top100()
            display_top_songs(10)

        elif cmd == 'save':
            filename = args[0] if args else 'netease_top100.csv'
            save_to_csv(filename=filename)

        elif cmd == 'load':
            filename = args[0] if args else 'netease_top100.csv'
            if load_from_csv(filename):
                display_top_songs(10)

        elif cmd == 'top':
            count = int(args[0]) if args and args[0].isdigit() else 10
            display_top_songs(count)

        elif cmd == 'search':
            if not args:
                print("请提供搜索关键词")
                continue
            keyword = ' '.join(args)
            search_song(keyword)

        elif cmd == 'analyze':
            analyze_data()

        elif cmd == 'summary':
            generate_summary()

        elif cmd == 'clear':
            # 跨平台清屏
            os.system('cls' if os.name == 'nt' else 'clear')
            print("=" * 60)
            print("网易云音乐热搜榜数据分析系统")
            print("=" * 60)

        elif cmd == 'help':
            display_help()

        elif cmd == 'exit':
            print("感谢使用，再见！")
            break

        else:
            print(f"未知命令: {cmd}，输入 'help' 查看帮助")


if __name__ == "__main__":
    main()

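VII. Features of the Improved Code

The interactive NetEase Cloud Music trending chart analysis system provides the following features:

1. Data acquisition and storage

crawl: crawl the latest trending chart Top 100 from the NetEase Cloud Music website
save [filename]: save the data to a CSV file (default: netease_top100.csv)
load [filename]: load data from a CSV file (default: netease_top100.csv)

2. Data viewing and search

top [count]: show the first N songs on the chart (default: 10)
search [keyword]: search for songs whose title or artist contains the keyword (case-insensitive)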
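
The matching rule behind `search` can be written as a function that returns the hits instead of printing them, which makes it easier to test. This is a minimal sketch, assuming rows are dicts with the script's `歌曲名称` and `歌手` keys; the function name `search_songs` and the sample rows are hypothetical.

```python
def search_songs(songs, keyword):
    """Return rows whose title or artist contains keyword, ignoring case."""
    needle = keyword.casefold()
    return [
        song for song in songs
        # str() guards against titles that a CSV reader parsed as numbers.
        if needle in str(song['歌曲名称']).casefold()
        or needle in str(song['歌手']).casefold()
    ]


# Made-up rows for illustration only.
sample = [
    {'排名': 1, '歌曲名称': 'Blue Sky', '歌手': 'Artist X'},
    {'排名': 2, '歌曲名称': 'Night Rain', '歌手': 'Blue Band'},
    {'排名': 3, '歌曲名称': 1999, '歌手': 'Artist Y'},
]

print([song['排名'] for song in search_songs(sample, 'BLUE')])
```

A caller such as the command loop can then decide how to format the results or what to show when the list is empty.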

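3. Data analysis and visualization

analyze: analyze the data and generate six charts:
- Top 20 song popularity bar chart
- Top 10 artist share pie chart
- Song rank distribution histogram
- Songs-per-artist bar chart
- Song duration distribution histogram
- Song popularity versus rank scatter plot
summary: show the data analysis summary

4. System commands

clear: clear the screen
help: show the help text
exit: quit the program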

VIII. Running the Improved Code

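IX. Problems Encountered and Solutions

1. The site's anti-crawling mechanism blocked the request, so no data was returned

Solution: send browser-like request headers (User-Agent, Referer, Host) and add a delay between requests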
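
The listings in this report send a single request and never call `time.sleep`, so the delay is not visible in the code itself. The following is a minimal sketch of how a delay with retries might be added. The function `fetch_with_retry`, its parameters, and the backoff values are assumptions for illustration; `fetch` stands for any callable that takes a URL and returns the response, such as a small wrapper around `requests.get`.

```python
import random
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each failure before retrying."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < retries - 1:
                # Exponential backoff plus a little random jitter.
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise last_error


# Example with a stand-in fetch function that fails twice, then succeeds.
attempts = {'count': 0}


def flaky_fetch(url):
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError("blocked")
    return "ok"


delays = []
result = fetch_with_retry(flaky_fetch, 'https://example.invalid', sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter lets the example record the delays instead of actually waiting. In real use the default `time.sleep` applies.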

2. Chinese characters were garbled in the charts

Solution: set Matplotlib's font parameters to a font that supports Chinese (SimHei in this lab)

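3. A change in the page structure caused data parsing to fail

Solution: inspect the page again to find the new structure, and add error handling so that a failed parse is reported instead of crashing the program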
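
One way to make the parsing step tolerate structure changes is to validate the JSON before using it. This is a minimal sketch, assuming the payload is expected to be a JSON list of song objects with the keys the crawler reads; the function name `parse_song_list` and the `REQUIRED_KEYS` tuple are hypothetical.

```python
import json

# Keys the crawler reads from each song entry.
REQUIRED_KEYS = ('id', 'name', 'artists', 'album', 'duration')


def parse_song_list(raw_text):
    """Return the well-formed song entries, or [] if raw_text is not a JSON list."""
    try:
        data = json.loads(raw_text)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return []
    if not isinstance(data, list):
        return []
    return [
        entry for entry in data
        if isinstance(entry, dict) and all(key in entry for key in REQUIRED_KEYS)
    ]


good = '{"id": 1, "name": "Song A", "artists": [], "album": {}, "duration": 1000}'
bad = '{"id": 2, "name": "Song B"}'
print(len(parse_song_list(f'[{good}, {bad}]')), parse_song_list('<html>'))
```

Entries that lack a required key are skipped rather than raising `KeyError`, so one malformed song does not discard the whole chart.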

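X. Course Reflections

This Python course taught me a lot. In class we learned the basics, and working through the labs outside class taught me much more. For the final lab I looked up a lot of material and realized how many different things Python can do. I originally wanted to write an automated SQL injection tool. I studied the sqlmap source code in depth and consulted several senior students who focus on CTF web challenges, only to find that the script I had in mind, one that automatically detects filtered characters and bypasses the filter, was too hard to implement and could not be consolidated into a single script, so I shelved the idea for now. By coincidence, that script has become part of an undergraduate innovation project I am taking part in, and I hope to complete this integrated tool as the network attack and defense platform is built. In the end I wanted a lab topic related to my major, Cyberspace Security, so I chose a web crawler. I learned about anti-crawling protocols and crawler implementation from scratch, and even looked into whether web crawling is legal. Along the way I consulted a great deal of material and had long conversations with various AI assistants before arriving at these two scripts.

I think this course teaches not only knowledge but also a spirit of exploration. Python has countless libraries, and one class a week is far too little time to learn them all. But every lab and every basic concept gave me exposure to using libraries and let me read simple code and follow its logic, which is essential for writing challenge-solving scripts and small programs of my own. I have also used AI to generate a few simple scripts, such as QR code stitching and simple cipher algorithms. After this course, I intend to learn to write small scripts myself and put what I have learned to use.

Finally, special thanks to our instructor, 王志强. I have heard that he often stays up all night writing code for work and knows cyberspace security inside out, which I greatly admire. I wish him success in his career, fewer late nights, and good health!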


posted @ 2025-06-10 19:09 我不是韦神