2025/05/28 Log: Scraping and Analyzing Douban Book Comments

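Scraping and analyzing online comments is a common research method in data analysis and natural language processing. This post shows how to use Python to scrape Douban book comments, generate a word cloud, and analyze the comments. It uses the book 《都挺好》 as the example and walks through the full scraping and analysis flow.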

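I. Background and Goals

Douban is a well-known Chinese social platform where users post reviews of books, films, and music. These comments reflect users' real impressions and carry rich semantic information. By scraping and analyzing them, we can learn what readers think of a book, surface the hot topics in the comments, and even run sentiment analysis.

The goals of this post are:

Scrape Douban book comment data.
Parse the comment content and extract the useful fields.
Generate a word cloud to visualize the high-frequency words in the comments.
Display the comments in several views, including sorted by popularity and by time.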

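II. Preparation

1. Environment setup

Make sure the following Python libraries are installed:

requests: sends HTTP requests.
BeautifulSoup: parses HTML pages (installed as beautifulsoup4, imported from bs4).
jieba: Chinese word segmentation.
wordcloud: generates word clouds.
matplotlib: plotting.

They can be installed with: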

pip install requests beautifulsoup4 jieba wordcloud matplotlib
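
Before running the scraper, it can help to confirm that the packages are importable. The sketch below is an optional check, not part of the original post; the function name missing_modules is made up for illustration. It relies only on the standard library.

```python
import importlib.util

# Import names can differ from pip package names (beautifulsoup4 -> bs4).
REQUIRED_MODULES = ["requests", "bs4", "jieba", "wordcloud", "matplotlib"]

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```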

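2. Data preparation

Create a stopwords.txt file that holds stop words: common words such as "的" and "了" that carry no real meaning.
Create a custom_dict.txt file as a custom segmentation dictionary, so that specific terms such as book titles and character names are segmented correctly.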
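
The post does not show how these two files are read, so the following is a minimal sketch under stated assumptions: stopwords.txt holds one word per line in UTF-8, and custom_dict.txt follows the format that jieba.load_userdict accepts (one word per line, optionally followed by a frequency and a part-of-speech tag). The function names load_stopwords and load_custom_dict are hypothetical.

```python
from pathlib import Path

def load_stopwords(path="stopwords.txt"):
    """Read one stop word per line; return an empty set if the file is missing."""
    file = Path(path)
    if not file.exists():
        return set()
    lines = file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}

def load_custom_dict(path="custom_dict.txt"):
    """Register extra words with jieba, if the dictionary file exists."""
    import jieba  # imported lazily so load_stopwords works without jieba
    if Path(path).exists():
        jieba.load_userdict(path)
```

Returning a set keeps the later "is this word a stop word" check cheap, and treating a missing file as an empty set lets the rest of the pipeline run before the file has been written.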

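III. Implementation

1. Scraping the comment data

We use the requests library to send an HTTP request for a Douban book comment page, then parse the returned HTML with BeautifulSoup and extract the comment fields. Each page holds 20 comments, so page n starts at offset (n - 1) * 20.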

import requests
from bs4 import BeautifulSoup

# Douban tends to reject requests that lack a browser-like User-Agent.
# This header value is only an example; add cookies here if needed.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch_page_comments(book_id, sort_param, page):
    """Fetch one page of comments (20 per page) and return a list of dicts."""
    start = (page - 1) * 20
    url = f"https://book.douban.com/subject/{book_id}/comments/?sort={sort_param}&start={start}&status=P"
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        response.encoding = 'utf-8'
        # "异常流量" ("abnormal traffic") is the text checked for on a blocked response
        if "异常流量" in response.text or "security_check" in response.url:
            raise RuntimeError("Anti-scraping check triggered; change IP or refresh cookies")
        soup = BeautifulSoup(response.text, 'html.parser')
        return parse_comments(soup)
    except Exception as e:
        print(f"Request failed: {str(e)[:50]}")
        return []

def parse_comments(soup):
    """Extract username, content, time, rating, and vote count from each comment."""
    comments = []
    for item in soup.find_all('li', class_='comment-item'):
        try:
            user_tag = item.find('a', href=lambda x: x and '/people/' in x)
            username = user_tag.text.strip() if user_tag else "Anonymous"
            content_tag = item.find('span', class_='short')
            content = content_tag.text.strip() if content_tag else ""
            # Prefer the title attribute; fall back to the visible text.
            # A missing time tag yields an empty string instead of an error.
            time_tag = item.find('span', class_='comment-time')
            comment_time = ""
            if time_tag:
                comment_time = time_tag.get('title', '').strip() or time_tag.text.strip()
            rating = parse_rating(item)
            vote_tag = item.find('span', class_='vote-count')
            vote_text = vote_tag.text.strip() if vote_tag else ""
            vote_count = int(vote_text) if vote_text.isdigit() else 0
            comments.append({
                'username': username,
                'content': content,
                'time': comment_time,
                'rating': rating,
                'vote_count': vote_count
            })
        except Exception as e:
            print(f"Parse error: {str(e)[:50]}")
            continue
    return comments
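
The functions above fetch a single page. A multi-page crawl is not shown in the post, so here is one possible sketch, assuming that an empty result means either the end of the list or a blocked request. The name crawl_comments and the delay range are assumptions chosen for illustration; the page fetcher is passed in as a parameter so the loop can be tried without touching the network.

```python
import random
import time

def crawl_comments(fetch_page, book_id, sort_param, max_pages=3, delay=(2.0, 4.0)):
    """Collect comments page by page, pausing between requests.

    fetch_page is any callable with the signature (book_id, sort_param, page)
    that returns a list of comment dicts, such as fetch_page_comments.
    """
    collected = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(book_id, sort_param, page)
        if not batch:
            break  # an empty page means the end of the list or a blocked request
        collected.extend(batch)
        if page < max_pages:
            time.sleep(random.uniform(*delay))  # pause before the next request
    return collected
```

It would be called as crawl_comments(fetch_page_comments, book_id, sort_param), with the same book_id and sort_param values that fetch_page_comments expects. The random pause is a courtesy to the server and lowers the chance of triggering the anti-scraping check.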

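2. Parsing the comments

The parse_rating function extracts the user's star rating. The comment time, username, and other fields are extracted and cleaned in parse_comments above.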

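def parse_rating(item):
    """Return the star rating (1-5) for a comment, or None if it has no rating."""
    # Stars are encoded in a class such as "allstar40" (4 stars). Search the
    # whole comment, since the span carrying that class may itself be the
    # "rating" span rather than a child of it.
    stars = item.find('span', class_=lambda c: c and c.startswith('allstar'))
    if stars:
        # The allstar class is not guaranteed to be first in the class list
        for name in stars.get('class', []):
            digits = name.replace('allstar', '', 1)
            if name.startswith('allstar') and digits.isdigit():
                return int(digits) // 10
    # Fallback: a title attribute containing "星" whose first character is the digit
    title_rating = item.find('span', title=lambda t: t and '星' in t)
    if title_rating and title_rating['title'][0].isdigit():
        return int(title_rating['title'][0])
    return None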
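
The class-to-stars conversion can be checked in isolation, without any HTML. The helper below is a hypothetical, standard-library-only restatement of that one step; the class lists in the usage note are illustrative and were not captured from a live page.

```python
import re

ALLSTAR_PATTERN = re.compile(r"^allstar(\d+)$")

def stars_from_classes(class_names):
    """Return the star count encoded in a class list, or None if absent."""
    for name in class_names:
        match = ALLSTAR_PATTERN.match(name)
        if match:
            return int(match.group(1)) // 10  # "allstar40" -> 4
    return None
```

For example, stars_from_classes(["user-stars", "allstar40", "rating"]) gives 4, while a list with no allstar class gives None. Scanning every class name avoids depending on the order of classes in the attribute.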

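3. Generating the word cloud

We segment the Chinese text with jieba, count word frequencies, and pass the counts to wordcloud to draw the word cloud.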

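import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter

def generate_wordcloud(text, title):
    """Build a word cloud from word frequencies and display it."""
    # process_text is not defined in this post; it is expected to return
    # a list of words after segmentation and stop-word filtering.
    words = process_text(text)
    word_counts = Counter(words)
    if not word_counts:
        # generate_from_frequencies raises an error on empty input
        print("No words left to plot")
        return
    # msyh.ttc is Microsoft YaHei on Windows; elsewhere, point to any font with CJK glyphs
    wc = WordCloud(font_path='msyh.ttc', width=1600, height=1200, background_color='white').generate_from_frequencies(word_counts)
    plt.figure(figsize=(20, 16))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Comment analysis of 《{title}》', fontsize=24)
    plt.show()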
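
generate_wordcloud calls process_text, which the post does not include. The sketch below is one plausible shape for it, not the author's version: it segments the text, then drops stop words and single-character tokens. The stopwords and tokenize parameters are assumptions added so the filtering can be exercised without jieba; by default it falls back to jieba.lcut.

```python
def process_text(text, stopwords=frozenset(), tokenize=None):
    """Split text into words and drop stop words and single-character tokens."""
    if tokenize is None:
        import jieba  # imported lazily; jieba.lcut returns a list of tokens
        tokenize = jieba.lcut
    words = []
    for token in tokenize(text):
        token = token.strip()
        # Single characters are mostly punctuation or particles
        if len(token) < 2 or token in stopwords:
            continue
        words.append(token)
    return words
```

Dropping single-character tokens is a common heuristic for Chinese word clouds, but it also discards meaningful one-character words, so the threshold is worth revisiting for a given corpus.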

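4. Displaying the data

We show the comments ordered by popularity and by time. print_comments prints the first 10 comments of whatever list it receives, so the ordering must be decided before the call, for example through sort_param when fetching.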

def print_comments(comments, title):
    """Print the first 10 comments under a section title."""
    print(f"\n{'-' * 20} {title} {'-' * 20}")
    for i, item in enumerate(comments[:10], 1):
        username = item['username'] if item['username'] else "Anonymous"
        rating = f"{'★' * item['rating']} ({item['rating']} stars)" if item['rating'] else "no rating"
        time_str = f"Time: {item['time']}" if item['time'] else "[time not recorded]"
        # Add an ellipsis only when the content was actually truncated
        content = item['content']
        preview = content[:60] + ("..." if len(content) > 60 else "")
        print(f"{i}. {time_str} | {username} {rating}")
        print(f"   👍 Useful votes: {item['vote_count']}")
        print(f"   📝 Content: {preview}\n")
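
When the comments have already been collected, they can also be reordered locally instead of being fetched again with a different sort. This is a minimal sketch with hypothetical helper names; sorting by time assumes the time strings share a fixed format such as YYYY-MM-DD HH:MM:SS, so that string order matches chronological order.

```python
def sort_by_votes(comments):
    """Most useful first: highest vote_count at the top."""
    return sorted(comments, key=lambda c: c.get("vote_count", 0), reverse=True)

def sort_by_time(comments):
    """Newest first; comments without a time string sort last."""
    return sorted(comments, key=lambda c: c.get("time") or "", reverse=True)
```

The results can be passed straight to print_comments, for instance print_comments(sort_by_votes(comments), "Most useful"). Both helpers return new lists and leave the original order untouched.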

Posted 2025-05-29 23:42 by Moonbeamsc
