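Web Scraping - BeautifulSoup

Web scraping with Python means writing Python programs that automatically extract information from the internet.

A scraper typically sends an HTTP request to fetch a page, parses the page and extracts data, and then stores the data.

Python's rich ecosystem, and in particular its strong library support, makes it a popular language for writing scrapers.

In general, the workflow breaks down into the following steps:

Send an HTTP request: the scraper fetches the HTML page from the target site over HTTP; requests is a commonly used library for this.
Parse the HTML: once the page is fetched, the scraper parses it so data can be extracted; commonly used tools include BeautifulSoup and lxml, as well as Scrapy (a full crawling framework with its own selectors).
Extract the data: locate HTML elements (by tag, attribute, class name, and so on) and pull out the data you need.
Store the data: save the extracted data to a database, a CSV file, a JSON file, or a similar format for later use or analysis.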
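
The steps above can be sketched end to end as follows. This is a minimal illustration, not a complete crawler: the HTML is a hard-coded stand-in for what requests would return, and the names SAMPLE_HTML, extract_books and to_csv are made up for this example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for step 1: in a real crawler this string would come from
# requests.get(url).text; it is hard-coded here so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <ul id="books">
    <li class="book"><a href="/b/1">Python Basics</a><span class="price">10</span></li>
    <li class="book"><a href="/b/2">Web Scraping</a><span class="price">15</span></li>
  </ul>
</body></html>
"""


def extract_books(html):
    """Steps 2 and 3: parse the HTML and build one dict per book."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("li", class_="book"):
        link = item.find("a")
        price = item.find("span", class_="price")
        rows.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "price": int(price.get_text(strip=True)),
        })
    return rows


def to_csv(rows):
    """Step 4: serialize the rows as CSV text (kept in memory here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


books = extract_books(SAMPLE_HTML)
csv_text = to_csv(books)
print(csv_text)
```

In a real project the CSV text would be written to a file or the rows inserted into a database instead of being printed.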

This chapter focuses on BeautifulSoup, a Python library for parsing HTML and XML documents. It extracts data from web pages and is widely used for web scraping and data mining.

BeautifulSoup

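BeautifulSoup is a Python library for extracting data from web pages, and it is particularly well suited to parsing HTML and XML files.

It exposes a simple API for extracting and manipulating page content, which makes it a good fit for scraping and data-extraction tasks.

BeautifulSoup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree.
BeautifulSoup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
BeautifulSoup sits on top of parsers such as lxml and html5lib, so you can choose between different parsing strategies or trade flexibility for speed.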

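Installing BeautifulSoup

To use BeautifulSoup, install the beautifulsoup4 package and pick a parser: either lxml (a third-party package) or html.parser (built into Python, nothing to install).

The dependencies can be installed with pip:

pip install beautifulsoup4
pip install lxml  # lxml is the recommended parser (it is faster)

If lxml is not installed, use Python's built-in html.parser as the parser instead.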
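
One way to handle both cases is to try lxml first and fall back to html.parser. The sketch below assumes nothing beyond bs4 itself; make_soup is a hypothetical helper name, and FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>parser</b></p>")
print(soup.p.get_text())
```

Keep in mind that different parsers can build slightly different trees from invalid HTML, so it is usually best to settle on one parser per project.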

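Basic usage

BeautifulSoup parses HTML or XML data and provides methods for navigating, searching, and modifying the parse tree.

Common operations include finding tags, reading tag attributes, and extracting text.

To use it, import BeautifulSoup and load the HTML page into a BeautifulSoup object.

Usually you first fetch the page with an HTTP library such as requests: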

from bs4 import BeautifulSoup
import requests

# Fetch the page with requests
url = "https://cn.bing.com/"  # the Bing search home page
response = requests.get(url)

# Set the encoding explicitly to avoid garbled Chinese text
response.encoding = "utf-8"
# Check that the request succeeded
if response.status_code == 200:
    # Parse the page with BeautifulSoup
    soup1 = BeautifulSoup(response.text, "lxml")  # using the lxml parser
    soup2 = BeautifulSoup(response.text, "html.parser")  # using the built-in html.parser
    # Get the page title from the <title> tag
    # Option 1: find()
    title = soup1.find("title")
    if title:
        print(title.text)  # 搜索 - Microsoft 必应
        print(title.get_text())  # 搜索 - Microsoft 必应
    else:
        print("No <title> tag found")
    # Option 2: attribute access
    if soup2.title:
        print(soup2.title.string)  # 搜索 - Microsoft 必应
    else:
        print("No <title> tag found")
else:
    print("Request failed:", response.status_code)

Finding tags

BeautifulSoup offers several ways to find tags in a page; the most common are find() and find_all().

find() returns the first matching tag (or None if nothing matches)
find_all() returns all matching tags

from bs4 import BeautifulSoup
import requests
url = "https://www.baidu.com/"
response = requests.get(url)
# Fix the encoding: the detected encoding is usually closer to the real one,
# but detection takes time (which can hurt performance on large responses).
# Option 1 (requires "import chardet")
# response.encoding=chardet.detect(response.content)["encoding"]
# Option 2
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")
# find() returns the first matching tag
find = soup.find("a")
# find_all() returns all matching tags
find_all = soup.find_all("a")
# Print the <a> tag
print(find)  # <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
# Print the href attribute of the <a> tag
print(find.get("href"))  # http://news.baidu.com
print("*************")
for item in find_all:
    print(item)  # the whole <a> tag: <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
    print(item.string)  # the text inside the <a> tag: 新闻
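
The difference between the two return types can be checked offline. The following sketch uses a small hard-coded HTML string (an assumption made for illustration, so no network request is needed) and also shows what each method returns when nothing matches.

```python
from bs4 import BeautifulSoup

html = '<div><a href="/a">A</a><a href="/b">B</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")          # a single Tag
all_links = soup.find_all("a")       # a list-like ResultSet of Tags
missing = soup.find("table")         # None when nothing matches
no_tables = soup.find_all("table")   # empty when nothing matches

hrefs = [a.get("href") for a in all_links]
print(first_link["href"], hrefs, missing, len(no_tables))
```

Because find() can return None, it is worth checking the result before accessing attributes on it.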

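Combining with regular expressions to match specific content

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.baidu.com/"

html = requests.get(url)
html.encoding = html.apparent_encoding
# Use html.text so that the encoding set above is actually applied
soup = BeautifulSoup(html.text, "html.parser")
# Find every tag whose href attribute starts with "http://"
find_all = soup.find_all(href=re.compile(r"^http://.*"))
print(find_all)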

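Finding several tag names at once

# Match <a> or <p> tags that have class "mnav" and name="tj_trnews"
find_all = soup.find_all(["a", "p"], class_="mnav", attrs={"name": "tj_trnews"})
# find_all = soup.find_all(["a", "p"])
for item in find_all:
    print(item)

Finding tags by their text content

# Match <a> tags whose text contains "百度"; limit=2 stops after two results
find_all = soup.find_all("a", string=re.compile(r"百度"), limit=2)
for s in find_all:
    print(s)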

'''
<a href="http://home.baidu.com">关于百度</a>
<a href="http://www.baidu.com/duty/">使用百度前必读</a>
'''
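
The same filters can be tried without a network connection. This is a minimal sketch on made-up HTML (the example.com URLs and link texts are assumptions for illustration): a compiled pattern passed as href filters on the attribute value, while string filters on the tag's text.

```python
import re

from bs4 import BeautifulSoup

html = """
<p><a href="http://example.com/about">About us</a>
<a href="https://example.com/help">Help center</a>
<a href="/local">About this page</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute filter: href must start with "http://" (so https:// is excluded)
http_links = soup.find_all("a", href=re.compile(r"^http://"))

# Text filter: the tag's text must contain "About"; stop after one match
about_links = soup.find_all("a", string=re.compile(r"About"), limit=1)

print([a["href"] for a in http_links])
print([a.get_text() for a in about_links])
```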

Getting the text of a tag

The get_text() method extracts the text content of a tag:

from bs4 import BeautifulSoup
import requests
url = "https://www.baidu.com/"
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "lxml")
# find() returns the first matching tag
find = soup.find("a")
# Extract the text of the tag
print(find.get_text())  # 新闻
print(find.getText())  # 新闻 (getText is an older alias of get_text)
# Get all the text in the document
print(soup.get_text())# 百度一下，你就知道                     新闻 hao123 地图 视频 贴吧  登录   更多产品       关于百度 About Baidu  ©2017 Baidu 使用百度前必读  意见反馈 京ICP证030173号     
print("*************")

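Finding parent and child tags

The parent and children attributes give access to a tag's parent and its direct children (here, find is the first <a> tag from the previous example):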

parent = find.parent
print(parent)#<div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <noscript> <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div>
print(parent.get_text())#新闻 hao123 地图 视频 贴吧  登录   更多产品
print("-------------------")
div = soup.find("div")
print(div)
children = div.children
for child in children:
    print(child)

'''
<div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div> </div> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <noscript> <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div> </div> </div> <div id="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/">使用百度前必读</a>  <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a> 京ICP证030173号  <img src="//www.baidu.com/img/gs.gif"/> </p> </div> </div> </div>
 
<div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div> </div> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a> <noscript> <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div> </div> </div>
 
<div id="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/">使用百度前必读</a>  <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a> 京ICP证030173号  <img src="//www.baidu.com/img/gs.gif"/> </p> </div> </div>
'''
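
One detail that is easy to miss in output like this is that children yields text nodes (including whitespace between tags) as well as tags. The sketch below, on a small made-up HTML snippet, shows one way to keep only the tags and how to walk upwards with parents.

```python
from bs4 import BeautifulSoup, Tag

html = """<div id="nav">
  <a href="/news">News</a>
  <a href="/map">Map</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
nav = link.parent                      # the enclosing <div id="nav">

all_children = list(nav.children)      # tags plus whitespace text nodes
tag_children = [c for c in nav.children if isinstance(c, Tag)]
ancestor_names = [p.name for p in link.parents]

print(len(all_children), len(tag_children), ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is "[document]".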

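Finding tags with specific attributes

You can find tags with specific attributes by passing those attributes as arguments.

For example, to find all div tags whose class is example-class:

# Find all <div> tags with class="example-class"
divs_with_class = soup.find_all('div', class_='example-class')

# Find the <p> tag with id="unique-id"
unique_paragraph = soup.find('p', id='unique-id')

Getting the search button, whose id is su:

idInput = soup.find("input", id="su")
print(idInput)#<input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/>
# Read the value attribute of the input
print(idInput["value"])#百度一下
soup_find_all = soup.find_all("a", class_="mnav")
# A string with several classes only matches when the class attribute equals it
# exactly (same classes, same order); to require several classes in any order,
# use a CSS selector such as soup.select("a.mnav.c-font-normal.c-color-t")
# soup_find_all = soup.find_all("a", class_="mnav c-font-normal c-color-t")
for item in soup_find_all:
    print(item)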
'''
<input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图</a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频</a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧</a>
'''
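
How class_ behaves with more than one class can be confirmed on a small example. The HTML below is made up for illustration; it compares matching a single class, matching the exact class string, and using a CSS selector to require two classes regardless of order.

```python
from bs4 import BeautifulSoup

html = """
<a class="mnav" href="/1">one</a>
<a class="mnav c-font-normal" href="/2">two</a>
<a class="c-font-normal mnav" href="/3">three</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches every tag that has "mnav" among its classes
by_one_class = soup.find_all("a", class_="mnav")
# Matches only when the class attribute is exactly this string
by_exact_string = soup.find_all("a", class_="mnav c-font-normal")
# Matches tags that carry both classes, in any order
by_css = soup.select("a.mnav.c-font-normal")

print([a["href"] for a in by_one_class])
print([a["href"] for a in by_exact_string])
print([a["href"] for a in by_css])
```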

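Advanced usage

CSS selectors

BeautifulSoup also supports finding tags with CSS selectors.

This is an alternative to find_all() that reaches the same goal in a different way.

In CSS, a tag name is written bare, a class name is prefixed with ., and an id is prefixed with #.

The same notation can be used to filter elements here through soup.select(), which returns a list-like ResultSet.

Background: this is how classes are written in CSS:

.center {text-align:center;}
p.center {text-align:center;}

The select() method accepts jQuery-like selector syntax for finding tags: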

import requests
from bs4 import BeautifulSoup

url = "https://www.baidu.com/"
'''
Dealing with anti-scraping measures
1. The request is intercepted and the real page is not returned
Baidu inspects the request headers; a request without a User-Agent may be
redirected to a verification page or receive a different HTML structure.
Fix: send browser-like request headers to imitate a browser visit
'''
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
html = requests.get(url, headers=headers)
html.encoding = html.apparent_encoding
# Use html.text so that the encoding set above is actually applied
soup = BeautifulSoup(html.text, "html.parser")
separator = "*" * 100

# Select by tag name
tagselect = soup.select('a', limit=3)
print("By tag:", tagselect)
print(separator)
# Select by class
classselect = soup.select('.text-color')
print("By class:", classselect)
print(separator)
# Select by id
idselect = soup.select('#bottom_space')
print("By id:", idselect)
print(separator)
# Select by tag + class
classselect = soup.select('a.text-color')
print("By tag + class:", classselect)
print(separator)
# Select by tag + id
idselect = soup.select('div#bottom_space')
print("By tag + id:", idselect)
print(separator)
# Select by tag + attribute
attrselect = soup.select("a[href='http://news.baidu.com']")
print("By tag + attribute:", attrselect)#[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻</a>]
print(separator)
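
Since the live page can change at any time, the same selector forms are easier to verify on a fixed snippet. The sketch below uses made-up HTML that borrows the class and id names from the example above, and adds select_one(), which returns the first match or None.

```python
from bs4 import BeautifulSoup

html = """
<div id="bottom_space">
  <a class="text-color" href="http://news.example.com">News</a>
  <a href="http://map.example.com">Map</a>
  <p class="text-color">Footer</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")
by_class = soup.select(".text-color")
by_id = soup.select_one("#bottom_space")
by_tag_and_class = soup.select("a.text-color")
by_attr = soup.select('a[href="http://map.example.com"]')
nothing = soup.select_one("table")   # None when nothing matches

print(len(by_tag), [t.name for t in by_class], by_id.name)
```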

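More selectors are listed in the table below:

Selector	Example	Description	CSS version
.class	.intro	Selects all elements with class="intro"	1
#id	#firstname	Selects the element with id="firstname"	1
*	*	Selects all elements	2
element	p	Selects all <p> elements	1
element,element	div,p	Selects all <div> elements and all <p> elements	1
element.class	p.hometown	Selects all <p> elements with class="hometown"	1
element element	div p	Selects all <p> elements inside <div> elements	1
element>element	div>p	Selects all <p> elements whose parent is a <div> element	2
element+element	div+p	Selects the <p> element placed immediately after each <div> element	2
[attribute]	[target]	Selects all elements with a target attribute	2
[attribute=value]	[target=_blank]	Selects all elements with target="_blank"	2
[attribute~=value]	[title~=flower]	Selects all elements whose title attribute contains the word "flower"	2
[attribute|=value]	[lang|=en]	Selects all elements whose lang attribute equals "en" or starts with "en-"	2
:link	a:link	Selects all unvisited links	1
:visited	a:visited	Selects all visited links	1
:active	a:active	Selects the active link	1
:hover	a:hover	Selects a link while the mouse is over it	1
:focus	input:focus	Selects the input element that has focus	2
:first-letter	p:first-letter	Selects the first letter of every <p> element	1
:first-line	p:first-line	Selects the first line of every <p> element	1
:first-child	p:first-child	Selects every <p> element that is the first child of its parent	2
:before	p:before	Inserts content before every <p> element	2
:after	p:after	Inserts content after every <p> element	2
:lang(language)	p:lang(it)	Selects every <p> element whose lang attribute value starts with "it"	2
element1~element2	p~ul	Selects every <ul> element that follows a <p> element with the same parent	3
[attribute^=value]	a[src^="https"]	Selects every element whose src attribute value starts with "https"	3
[attribute$=value]	a[src$=".pdf"]	Selects every element whose src attribute value ends with ".pdf"	3
[attribute*=value]	a[src*="runoob"]	Selects every element whose src attribute value contains the substring "runoob"	3
:first-of-type	p:first-of-type	Selects every <p> element that is the first <p> of its parent	3
:last-of-type	p:last-of-type	Selects every <p> element that is the last <p> of its parent	3
:only-of-type	p:only-of-type	Selects every <p> element that is the only <p> of its parent	3
:only-child	p:only-child	Selects every <p> element that is the only child of its parent	3
:nth-child(n)	p:nth-child(2)	Selects every <p> element that is the second child of its parent	3
:nth-last-child(n)	p:nth-last-child(2)	Selects every <p> element that is the second child of its parent, counting from the last child	3
:nth-of-type(n)	p:nth-of-type(2)	Selects every <p> element that is the second <p> of its parent	3
:nth-last-of-type(n)	p:nth-last-of-type(2)	Selects every <p> element that is the second <p> of its parent, counting from the last child	3
:last-child	p:last-child	Selects every <p> element that is the last child of its parent	3
:root	:root	Selects the root element of the document	3
:empty	p:empty	Selects every <p> element that has no children (text nodes included)	3
:target	#news:target	Selects the currently active #news element (the clicked URL contains that anchor name)	3
:enabled	input:enabled	Selects every enabled input element	3
:disabled	input:disabled	Selects every disabled input element	3
:checked	input:checked	Selects every checked input element	3
:not(selector)	:not(p)	Selects every element that is not a <p> element	3
::selection	::selection	Matches the part of an element that the user has selected or highlighted	3
:out-of-range	:out-of-range	Matches input elements whose value is outside the specified range	3
:in-range	:in-range	Matches input elements whose value is inside the specified range	3
:read-write	:read-write	Matches elements that are readable and writable	3
:read-only	:read-only	Matches elements with the "readonly" attribute set	3
:optional	:optional	Matches optional input elements	3
:required	:required	Matches elements with the "required" attribute set	3
:valid	:valid	Matches elements whose input value is valid	3
:invalid	:invalid	Matches elements whose input value is invalid	3
:has	:has	Selects an element based on its descendants	4
:is	:is	Takes any number of selectors as arguments and matches the union of the elements they match	4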
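
A few of the structural and attribute selectors from the table can be exercised with select(). The sketch below uses made-up HTML. Note that the table describes CSS in general: selectors that depend on browser state or rendering (for example :hover, :visited or ::selection) are unlikely to be useful on a statically parsed document, so the example sticks to structural ones.

```python
from bs4 import BeautifulSoup

html = """
<div id="box">
  <p>first</p>
  <p class="note">second</p>
  <span>not a paragraph</span>
  <a href="https://example.com/a.pdf">pdf</a>
  <a href="http://example.com/b.html">page</a>
</div>
<p>outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

direct_children = soup.select("div > p")           # child combinator
second_p = soup.select("div p:nth-of-type(2)")     # second <p> of its parent
https_links = soup.select('a[href^="https"]')      # attribute starts with
pdf_links = soup.select('a[href$=".pdf"]')         # attribute ends with
not_note = soup.select("div > p:not(.note)")       # negation
either = soup.select("span, a")                    # selector list

print([p.get_text() for p in direct_children])
print([t.name for t in either])
```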

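Handling nested tags

BeautifulSoup copes with deeply nested HTML. find_all() searches all descendants recursively by default, so nested structures can be handled with the same calls:

# Find <div> tags with class "nested", at any depth
nested_divs = soup.find_all('div', class_='nested')
for div in nested_divs:
    print(div.get_text())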
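
When only the direct children should be searched, find_all() accepts recursive=False. The following sketch, on made-up HTML, contrasts the default recursive search with the non-recursive one.

```python
from bs4 import BeautifulSoup

html = """
<div class="nested" id="outer">
  <p>outer text</p>
  <div class="nested" id="inner">
    <p>inner text</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", id="outer")

all_paragraphs = outer.find_all("p")                      # every level
direct_paragraphs = outer.find_all("p", recursive=False)  # direct children only
nested_ids = [d["id"] for d in soup.find_all("div", class_="nested")]

print(len(all_paragraphs), len(direct_paragraphs), nested_ids)
```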

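Modifying page content

BeautifulSoup lets you modify the HTML.

You can change a tag's attributes or text, or delete the tag:

# Change the href attribute of the first <a> tag
first_link = soup.find('a')
first_link['href'] = 'http://new-url.com'

# Change the text of the first <p> tag
first_paragraph = soup.find('p')
first_paragraph.string = 'Updated content'

# Delete a tag (this removes the tag and everything inside it)
first_paragraph.decompose()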
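
The edits can be seen more clearly on a small self-contained document. This sketch uses made-up HTML and, in addition to the operations above, creates a new tag with new_tag() and appends it.

```python
from bs4 import BeautifulSoup

html = '<div><a href="http://old.example.com">link</a><p>old text</p><span>remove me</span></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")
first_link["href"] = "http://new.example.com"   # change an attribute

soup.find("p").string = "Updated content"       # replace the text

soup.find("span").decompose()                   # delete a tag

new_tag = soup.new_tag("b")                     # build and append a new tag
new_tag.string = "added"
soup.div.append(new_tag)

result = str(soup)
print(result)
```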

Converting to a string

A parsed BeautifulSoup object can be converted back to an HTML string:

# Convert to a string
html_str = str(soup)

print(soup)
print(str(soup))
'''
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下，你就知道</title></head> <body link="#0000cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div> </div> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻+++++++</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123+++++++</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图+++++++</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频+++++++</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧+++++++</a> <noscript> <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div> </div> </div> <div id="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/">使用百度前必读</a>  <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a> 京ICP证030173号  <img src="//www.baidu.com/img/gs.gif"/> </p> </div> </div> </div> </body> </html>

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta content="text/html;charset=utf-8" http-equiv="content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="always" name="referrer"/><link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/><title>百度一下，你就知道</title></head> <body link="#0000cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id="lg"> <img height="129" hidefocus="true" src="//www.baidu.com/img/bd_logo1.png" width="270"/> </div> <form action="//www.baidu.com/s" class="fm" id="form" name="f"> <input name="bdorz_come" type="hidden" value="1"/> <input name="ie" type="hidden" value="utf-8"/> <input name="f" type="hidden" value="8"/> <input name="rsv_bp" type="hidden" value="1"/> <input name="rsv_idx" type="hidden" value="1"/> <input name="tn" type="hidden" value="baidu"/><span class="bg s_ipt_wr"><input autocomplete="off" autofocus="autofocus" class="s_ipt" id="kw" maxlength="255" name="wd" value=""/></span><span class="bg s_btn_wr"><input autofocus="" class="bg s_btn" id="su" type="submit" value="百度一下"/></span> </form> </div> </div> <div id="u1"> <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新闻+++++++</a> <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123+++++++</a> <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地图+++++++</a> <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">视频+++++++</a> <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">贴吧+++++++</a> <noscript> <a class="lb" href="http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1" name="tj_login">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');
                </script> <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多产品</a> </div> </div> </div> <div id="ftCon"> <div id="ftConw"> <p id="lh"> <a href="http://home.baidu.com">关于百度</a> <a href="http://ir.baidu.com">About Baidu</a> </p> <p id="cp">©2017 Baidu <a href="http://www.baidu.com/duty/">使用百度前必读</a>  <a class="cp-feedback" href="http://jianyi.baidu.com/">意见反馈</a> 京ICP证030173号  <img src="//www.baidu.com/img/gs.gif"/> </p> </div> </div> </div> </body> </html>

'''

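BeautifulSoup attributes and methods

The following are commonly used BeautifulSoup attributes and methods:

Method/attribute	Description	Example
`BeautifulSoup()`	Parses an HTML or XML document and returns a BeautifulSoup object.	`soup = BeautifulSoup(html_doc, 'html.parser')`
`.prettify()`	Returns the document as an indented, human-readable string.	`print(soup.prettify())`
`.find()`	Returns the first matching tag, or None if nothing matches.	`tag = soup.find('a')`
`.find_all()`	Returns all matching tags as a list-like ResultSet.	`tags = soup.find_all('a')`
`.find_all_next()`	Returns all matching elements that come after the current tag.	`tags = soup.find('div').find_all_next('p')`
`.find_all_previous()`	Returns all matching elements that come before the current tag.	`tags = soup.find('div').find_all_previous('p')`
`.find_parent()`	Returns the closest matching ancestor of the current tag (the parent when no filter is given).	`parent = tag.find_parent()`
`.find_parents()`	Returns all matching ancestors of the current tag.	`parents = tag.find_parents()`
`.find_next_sibling()`	Returns the next matching sibling tag.	`next_sibling = tag.find_next_sibling()`
`.find_previous_sibling()`	Returns the previous matching sibling tag.	`prev_sibling = tag.find_previous_sibling()`
`.parent`	Returns the parent of the current tag.	`parent = tag.parent`
`.next_sibling`	Returns the next sibling node, which may be a whitespace text node rather than a tag.	`next_sibling = tag.next_sibling`
`.previous_sibling`	Returns the previous sibling node, which may be a whitespace text node rather than a tag.	`prev_sibling = tag.previous_sibling`
`.get_text()`	Returns the text of the tag and all of its descendants, without the markup.	`text = tag.get_text()`
`.attrs`	Returns all attributes of the tag as a dictionary.	`href = tag.attrs['href']`
`.string`	Returns the tag's text when it contains a single text node; otherwise None.	`string_content = tag.string`
`.name`	Returns the name of the tag.	`tag_name = tag.name`
`.contents`	Returns the direct children of the tag as a list (including whitespace and newline text nodes).	`children = tag.contents`
`.descendants`	Returns all descendants of the tag as a generator.	`for child in tag.descendants: print(child)`
`.previous_element`	Returns the node parsed immediately before the current tag (this may be a text node).	`prev_elem = tag.previous_element`
`.next_element`	Returns the node parsed immediately after the current tag's start (this may be a text node).	`next_elem = tag.next_element`
`.decompose()`	Removes the current tag and its contents from the tree.	`tag.decompose()`
`.unwrap()`	Removes the tag itself and keeps its children in place.	`tag.unwrap()`
`.insert()`	Inserts a tag or text at the given position among the tag's children.	`tag.insert(0, new_tag)`
`.insert_before()`	Inserts a new tag or text before the current tag.	`tag.insert_before(new_tag)`
`.insert_after()`	Inserts a new tag or text after the current tag.	`tag.insert_after(new_tag)`
`.extract()`	Removes the tag from the tree and returns it.	`extracted_tag = tag.extract()`
`.replace_with()`	Replaces the current tag and its contents.	`tag.replace_with(new_tag)`
`.has_attr()`	Checks whether the tag has the given attribute.	`if tag.has_attr('href'):`
`.get()`	Returns the value of the given attribute, or None if it is missing.	`href = tag.get('href')`
`.clear()`	Removes all contents of the tag.	`tag.clear()`
`.encode()`	Renders the tag as a byte string (UTF-8 by default).	`encoded = tag.encode()`
`.is_empty_element`	Checks whether the tag is an empty element (for example `<br>`, `<img>`).	`if tag.is_empty_element:`
`.parents`	Iterates over all ancestors of the current tag; this is one way to check ancestor/descendant relationships, since Tag does not appear to provide `is_ancestor_of()` or `is_descendant_of()` methods.	`if any(p is another_tag for p in tag.parents):`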
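
The difference between the plain navigation attributes and the find_* methods shows up most clearly with siblings. In the sketch below (made-up HTML), next_sibling returns the whitespace between two tags, while find_next_sibling() skips straight to the next matching tag.

```python
from bs4 import BeautifulSoup

html = """<ul>
  <li id="a">one</li>
  <li id="b">two</li>
  <li id="c">three</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li", id="a")

raw_next = first.next_sibling              # whitespace text between the <li> tags
tag_next = first.find_next_sibling("li")   # skips text nodes: <li id="b">
later_ids = [li["id"] for li in first.find_next_siblings("li")]
parent = first.find_parent("ul")           # closest <ul> ancestor

print(repr(raw_next), tag_next["id"], later_ids, parent.name)
```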

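`.string` and `.get_text()`

Both .string and .get_text() extract text from a tag, but their purpose and behavior differ significantly. A detailed comparison follows:

1. How `.string` works

Definition: .string is an attribute of a Tag object that returns the text contained directly in the tag.
Return value:
- If the tag contains exactly one NavigableString child (no nested tags), that string is returned.
- If the tag contains several child nodes (nested tags or other content), None is returned.
Use it when: the tag is simple and not nested, and you want its text quickly.

Example:

from bs4 import BeautifulSoup

html = '<div>Hello, World!</div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.div.string)  # Output: Hello, World!

html_nested = '<div>Hello <b>World</b></div>'
soup_nested = BeautifulSoup(html_nested, 'html.parser')
print(soup_nested.div.string)  # Output: None (the div has more than one child)

2. How `.get_text()` works

Definition: .get_text() is a method of a Tag object that recursively collects the text of the tag and all of its descendants.

Parameters:
- separator: the string placed between the pieces of text (default: empty string).
- strip: whether to strip whitespace from both ends of each piece of text (default: False).

Return value: one combined string containing the text of all descendants.

Use it when: you need all the text from a complex, nested structure.

Example:

html = '<div>Hello <b>World</b>!</div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.div.get_text())       # Output: Hello World!
print(soup.div.get_text(' ', strip=True))  # Output: Hello World ! (each piece is stripped, then joined with a space)

3. Recommendations

Prefer .get_text():
In most scraping scenarios the page structure is complex, and .get_text() is the more reliable choice. For example:

# Extract all text, one piece per line
text = soup.get_text('\n', strip=True)

Use .string with care:
Use it only when you know the tag has a simple structure, for example:

# Extract the text of the title tag
title = soup.title.string if soup.title else ''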
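
When the individual pieces of text are needed rather than one joined string, the stripped_strings generator is a related option. The sketch below uses made-up HTML and puts it side by side with .string and get_text().

```python
from bs4 import BeautifulSoup

html = "<div> Hello <b>World</b> ! <i></i></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

direct = div.string                       # None: the div has several children
bold = div.b.string                       # "World": exactly one text child
pieces = list(div.stripped_strings)       # each text node, whitespace-trimmed
joined = div.get_text(" ", strip=True)    # the same pieces joined by a space

print(direct, bold, pieces, joined)
```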

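Other attributes

HTML attributes such as style, id, and class are read with dictionary-style access on the tag, not as Python attributes:

Method/attribute	Description	Example
`tag['style']`	Reads the tag's inline style.	`style = tag['style']`
`tag['id']`	Reads the tag's `id` attribute.	`tag_id = tag['id']`
`tag['class']`	Reads the tag's `class` attribute; the value is a list because a tag can have several classes.	`class_names = tag['class']`
`.string`	Returns the tag's text when it contains a single text node; otherwise None.	`content = tag.string`
`.parent`	Returns the parent element of the tag.	`parent = tag.parent`

Other

Method/attribute	Description	Example
`find_all(name, class_=...)`	Finds tags by tag name, optionally filtered by class.	`tags = soup.find_all('div', class_='container')`
`find_all(id=...)`	Finds tags with the given `id`.	`tags = soup.find_all(id='main')`
`find_all(attrs=...)`	Finds tags that have the given attributes.	`tags = soup.find_all(attrs={"href": "http://example.com"})`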

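Writing page content to a file

import requests
from bs4 import BeautifulSoup

url = "http://www.baidu.com/"
html = requests.get(url)
html.encoding = html.apparent_encoding
soup = BeautifulSoup(html.text, "lxml")
prettify = soup.prettify()
with open("../Test/index-prettify.html", "w+", encoding="utf-8") as f:
    # prettify() adds newlines and indentation, which changes how
    # whitespace-sensitive markup renders; this is the likely reason the
    # layout looked broken when the prettified file was previewed in a browser
    f.write(prettify)
with open("../Test/index.html", "w+", encoding="utf-8") as f:
    f.write(str(soup))  # write the document converted with str()
print("Write complete")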

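Anti-scraping measures

1. The request is intercepted and the real page is not returned

For example, Baidu inspects the request headers; a request without a User-Agent may be redirected to a verification page or receive a different HTML structure.

Fix: send browser-like request headers, in particular a User-Agent that imitates a browser.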

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
html = requests.get(url,headers=headers)
html.encoding=html.apparent_encoding
soup = BeautifulSoup(html.content, "html.parser")
with open("../Test/有请求头返回的html.html","w+",encoding="utf-8") as f:
    f.write(str(soup))
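
When several pages are fetched, the headers can be set once on a requests.Session. The following is a minimal sketch rather than a complete solution: build_session and fetch_soup are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the actual network call is left commented out so the block runs offline.

```python
import requests
from bs4 import BeautifulSoup

DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
}


def build_session():
    """Create a session that sends the browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(DEFAULT_HEADERS)
    return session


def fetch_soup(session, url, timeout=10):
    """Fetch a page and parse it; raises requests.HTTPError on 4xx/5xx."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    response.encoding = response.apparent_encoding
    return BeautifulSoup(response.text, "html.parser")


session = build_session()
# soup = fetch_soup(session, "https://www.baidu.com/")  # needs network access
```

A User-Agent alone does not get past every anti-scraping check; it only addresses the header-based case described above.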

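How to find the request headers: open the browser's developer tools (F12), switch to the Network tab, reload the page, select a request, and copy the User-Agent value from its request headers.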

Parsing a local file

# Open the file in a with block so the handle is closed after parsing
with open("../Test/index-prettify.html", "r", encoding="utf-8") as f:
    myHtml = BeautifulSoup(f, "lxml")
print(myHtml.prettify())
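
The write-then-read round trip can be tried without touching the project directory. This sketch assumes a temporary directory and a made-up one-line document; it writes the file, parses it back from an open file handle, and reads the title.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "index.html"
    path.write_text(
        "<html><head><title>Local page</title></head></html>",
        encoding="utf-8",
    )

    # BeautifulSoup accepts an open file handle as well as a string
    with path.open("r", encoding="utf-8") as f:
        local_soup = BeautifulSoup(f, "html.parser")

local_title = local_soup.title.string
print(local_title)
```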

posted @ 2025-03-24 11:09 指尖下的世界
