Python web scraping notes - 2025/3/7

BeautifulSoup (`soup`) common methods cheat sheet

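| Method / attribute | Purpose | Example | Returns |
| --- | --- | --- | --- |
| **Finding elements** | | | |
| `find(tag, attrs)` | Find the first matching tag | `soup.find('div', class_='content')` | A single `Tag`, or `None` if nothing matches |
| `find_all(tag, attrs)` | Find all matching tags | `soup.find_all('a', href=True)` | List of `Tag` objects (empty if nothing matches) |
| `select(css)` | Find all elements matching a CSS selector | `soup.select('div.item > a.link')` | List of `Tag` objects |
| `select_one(css)` | Find the first element matching a CSS selector | `soup.select_one('title')` | A single `Tag`, or `None` if nothing matches |
| **Extracting data** | | | |
| `.text` / `.get_text()` | Extract all text inside the tag, including text in child tags | `tag.text` or `tag.get_text(strip=True)` | String |
| `.string` | Extract the tag's text when the tag holds a single piece of text and nothing else | `tag.string` | String, or `None` when the tag has several children |
| `tag['attr']` | Get an attribute value | `tag['href']` | String (a list for multi-valued attributes such as `class`); raises `KeyError` if the attribute is missing |
| `.get('attr', default)` | Get an attribute value safely (no `KeyError`) | `tag.get('class', [])` | The attribute value, or the default |
| **Traversal and navigation** | | | |
| `.parent` | Get the parent tag | `tag.parent` | `Tag` or `BeautifulSoup` object |
| `.children` | Iterate over direct child nodes (tags and text) | `for child in tag.children:` | Iterator |
| `.descendants` | Iterate over all descendant nodes (tags and text) | `for descendant in tag.descendants:` | Generator |
| `.next_sibling` | Get the next sibling node (often a whitespace string) | `tag.next_sibling` | `Tag`, `NavigableString`, or `None` |
| `.previous_sibling` | Get the previous sibling node (often a whitespace string) | `tag.previous_sibling` | `Tag`, `NavigableString`, or `None` |
| **Iteration and filtering** | | | |
| `.strings` | Yield every text fragment inside the tag, whitespace included | `for text in tag.strings:` | Generator |
| `.stripped_strings` | Yield text fragments with surrounding whitespace removed, skipping whitespace-only ones | `for text in tag.stripped_strings:` | Generator |
| `find_next(tag)` | Find the first matching tag later in the document | `tag.find_next('p')` | `Tag`, or `None` if nothing matches |
| `find_all_next(tag)` | Find all matching tags later in the document | `tag.find_all_next('a')` | List of `Tag` objects |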
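
To make the lookup and extraction rows of the table concrete, here is a minimal sketch run against a small, made-up HTML fragment. The markup and variable names are assumptions chosen for illustration; `html.parser` is used so the snippet needs nothing beyond `bs4` itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, written only for this illustration.
html_doc = """
<html><head><title>Demo page</title></head>
<body>
<div class="content">
  <p class="intro">Hello, <b>world</b>!</p>
  <a class="link" href="/first">First</a>
  <a class="link" href="/second">Second</a>
  <a class="link">No href</a>
</div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Finding elements
first_div = soup.find("div", class_="content")    # first match, or None
links_with_href = soup.find_all("a", href=True)   # only <a> tags that have an href
css_links = soup.select("div.content > a.link")   # every <a class="link"> child
title_tag = soup.select_one("title")              # first match, or None

# Extracting data
intro = soup.find("p", class_="intro")
full_text = intro.text                         # keeps the original spacing
stripped_text = intro.get_text(strip=True)     # strips each fragment, then joins them
only_string = intro.string                     # None: <p> has more than one child
title_text = title_tag.string                  # works: <title> holds a single string

hrefs = [a["href"] for a in links_with_href]         # safe here: href is guaranteed
safe_hrefs = [a.get("href", "") for a in css_links]  # default for the link without href
div_classes = first_div["class"]                     # multi-valued attribute: a list

print(full_text, "|", stripped_text, "|", only_string, "|", title_text)
print(hrefs, safe_hrefs, div_classes)
```

Note that `get_text(strip=True)` joins the stripped fragments with no separator, so the example yields `Hello,world!`; passing `separator=' '` as well keeps the fragments apart.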

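Key notes

Parameters:
• tag: tag name (e.g. 'div', 'a').
• attrs: dictionary of attributes (e.g. {'class': 'title'}).
• css: CSS selector (e.g. 'div.content > a.link').
Common scenarios:
• Quick lookup: prefer select (CSS selectors) over chained find calls; the code is more concise.
• Text extraction: use .text to keep the original whitespace, and .get_text(strip=True) to strip it. With strip=True the fragments are joined without a separator, so pass separator=' ' as well if words should not run together.
• Robustness: use .get() so that a missing attribute returns a default instead of raising KeyError.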
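
The notes above can be tried out in a few lines. This is a minimal sketch on a made-up fragment (the markup and variable names are assumptions): it compares the `attrs` dictionary with the `class_` keyword, a chained `find` with a single CSS selector, and `.get()` with a default for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, written only for this comparison.
html_doc = (
    '<div class="item"><div class="hd">'
    '<a href="/m/1"><span class="title">Example</span></a>'
    '</div></div>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# attrs dictionary and class_ keyword are two spellings of the same filter.
by_dict = soup.find("span", attrs={"class": "title"})
by_keyword = soup.find("span", class_="title")

# Chained find calls versus one CSS selector for the same element.
chained = soup.find("div", class_="item").find("div", class_="hd").find("a")
selected = soup.select_one("div.item > div.hd > a")

# A selector that matches nothing gives None, so check before using the result.
missing = soup.select_one("div.item > a.missing")
missing_text = missing.get_text() if missing is not None else ""

# .get() returns the default instead of raising KeyError.
href = selected.get("href")
rel = selected.get("rel", [])

print(by_dict == by_keyword, chained == selected, missing, href, rel)
```

One caveat with chained `find`: if any link in the chain returns `None`, the next call raises `AttributeError`, whereas the selector version simply returns `None`.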

Choosing a parser:

soup = BeautifulSoup(html_doc, 'lxml')        # fast (recommended); requires `pip install lxml`
soup = BeautifulSoup(html_doc, 'html.parser') # built into Python; no third-party install needed
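
If a script should run whether or not `lxml` is installed, one option is to fall back to the built-in parser. The sketch below assumes that behaviour is wanted; `make_soup` is a hypothetical helper name, while `FeatureNotFound` is the exception `bs4` raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html_doc):
    """Parse with lxml when available, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(html_doc, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python.
        return BeautifulSoup(html_doc, "html.parser")


soup = make_soup("<p>Unclosed paragraph")
paragraph_text = soup.find("p").get_text()
print(paragraph_text)
```

The two parsers can build slightly different trees from malformed HTML (for example, `lxml` wraps fragments in `<html>` and `<body>`), so selectors that depend on the exact nesting are worth testing with the parser actually in use.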

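Dynamic content: if the page is rendered by JavaScript, BeautifulSoup alone will not see that content, because it only parses the HTML it is given. Pair it with a tool that executes the page's scripts, such as Selenium or requests-html.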

Use this table as a quick reference for what each method does, and combine the methods to suit the task at hand.
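
The navigation and iteration entries of the table are easiest to understand on a tiny tree. The following is a sketch on a made-up fragment (markup and variable names are assumptions). It also uses `find_next_sibling`, a real BeautifulSoup method that is not listed in the table, to skip over whitespace nodes.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical fragment; the newlines between <li> tags become text nodes.
html_doc = "<ul><li>one</li>\n<li>two</li>\n<li>three <b>bold</b></li></ul><p>after</p>"
soup = BeautifulSoup(html_doc, "html.parser")

ul = soup.find("ul")
first_li = ul.find("li")

parent_name = first_li.parent.name                 # the enclosing <ul>
raw_sibling = first_li.next_sibling                # the "\n" text node, not the next <li>
next_li = first_li.find_next_sibling("li")         # skips text nodes, returns a Tag

all_children = list(ul.children)                   # tags and text nodes together
child_tags = [c for c in ul.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in ul.descendants if isinstance(d, Tag)]

raw_strings = list(ul.strings)                     # includes whitespace-only fragments
clean_strings = list(ul.stripped_strings)          # stripped, whitespace-only dropped

following_p = first_li.find_next("p")              # first <p> later in the document
following_lis = first_li.find_all_next("li")       # every later <li>

print(parent_name, repr(raw_sibling), next_li.get_text())
print(len(all_children), len(child_tags), descendant_tags)
print(raw_strings, clean_strings)
```

The gap between `len(all_children)` and `len(child_tags)` is the usual surprise: whitespace between tags counts as a child node, which is why `.next_sibling` often returns a string rather than a tag.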

Scraping the Douban Top 250 movies and saving them to Excel

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook  # openpyxl writes .xlsx files
import random
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36',
    'Referer': 'https://movie.douban.com/',
    'Host': 'movie.douban.com'
}


# proxies = {
#     'http': 'http://219.140.143.75:7893',
#     'https': 'https://219.140.143.75:7893'
# }

def request_douban(url):
    """Fetch one page and return its HTML, or None on any failure."""
    try:
        response = requests.get(url=url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        else:
            print(f"Request failed, status code: {response.status_code}")
            return None
    except requests.RequestException as e:
        print(f"Request error: {e}")
        return None


# Create the Excel workbook with openpyxl
book = Workbook()
sheet = book.active
sheet.title = 'Douban Top 250'
sheet.append(['Title', 'Image', 'Rank', 'Rating', 'Credits', 'Quote'])


def save_to_excel(soup):
    """Parse every movie entry on one page and append it to the sheet."""
    grid_view = soup.find(class_='grid_view')
    if not grid_view:
        print("Movie list not found; the page structure may have changed!")
        return
    movie_list = grid_view.find_all('li')

    for item in movie_list:
        try:
            item_name = item.find(class_='title').text.strip()
            item_img = item.find('img').get('src')
            item_index = item.find(class_='pic').find('em').text.strip()  # rank is the <em> inside .pic
            item_score = item.find(class_='rating_num').text.strip()
            # Collapse the runs of spaces and newlines inside the first <p>
            item_author = ' '.join(item.find('p').text.split())
            # Look the quote up once; not every entry has one
            quote = item.find(class_='quote')
            item_intr = quote.text.strip() if quote else 'No quote'

            print(f'Scraped movie: {item_index} | {item_name} | {item_score} | {item_intr}')

            sheet.append([item_name, item_img, item_index, item_score, item_author, item_intr])
        except Exception as e:
            # A missing element makes find() return None; skip that entry
            print(f"Error while parsing a movie entry: {e}")


def main(page):
    url = f'https://movie.douban.com/top250?start={page * 25}&filter='
    html = request_douban(url)
    if html:
        soup = BeautifulSoup(html, 'lxml')
        save_to_excel(soup)


if __name__ == '__main__':
    for i in range(10):
        main(i)
        time.sleep(random.uniform(5, 10))  # random delay between pages

    book.save('douban_top250_movies.xlsx')  # save as .xlsx
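
Because the script only reveals selector mistakes after a live request, it can help to check the parsing logic offline first. The sketch below applies the same lookups as `save_to_excel` to a hand-written fragment. The markup is an assumption that merely mirrors the class names the script searches for and is not copied from the live site; `parse_item` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption): it only mirrors the class names
# used by the script and does not reproduce the real page.
SAMPLE_HTML = """
<ol class="grid_view">
  <li>
    <div class="pic"><em>1</em><img src="https://example.com/poster1.jpg"></div>
    <span class="title">Example Movie</span>
    <p>Director: Jane Doe
       1994 / Drama</p>
    <span class="rating_num">9.7</span>
    <p class="quote"><span>An example quote.</span></p>
  </li>
  <li>
    <div class="pic"><em>2</em><img src="https://example.com/poster2.jpg"></div>
    <span class="title">Another Movie</span>
    <p>Director: John Roe</p>
    <span class="rating_num">9.5</span>
  </li>
</ol>
"""


def parse_item(item):
    """Return one movie entry as a dict, using the same lookups as the script."""
    quote = item.find(class_="quote")
    return {
        "name": item.find(class_="title").get_text(strip=True),
        "img": item.find("img").get("src"),
        "rank": item.find(class_="pic").find("em").get_text(strip=True),
        "score": item.find(class_="rating_num").get_text(strip=True),
        # Collapse newlines and repeated spaces into single spaces.
        "info": " ".join(item.find("p").get_text().split()),
        "quote": quote.get_text(strip=True) if quote else "No quote",
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [parse_item(li) for li in soup.find(class_="grid_view").find_all("li")]

for row in rows:
    print(row)
```

Returning a dict instead of appending to the sheet directly keeps parsing separate from storage, so the same function could feed `sheet.append(list(row.values()))` or any other output.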

posted @ 2025-03-07 22:06 XYu1230
