BeautifulSoup
Introduction to Beautiful Soup
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
BeautifulSoup(html, features="html.parser")  # or features="lxml" — lxml is faster but must be installed separately
Installing Beautiful Soup
$ pip install beautifulsoup4
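A minimal sketch to confirm the install works and to show the parser choice mentioned above (the HTML snippet here is an invented example):

```python
from bs4 import BeautifulSoup

html = "<div id='a'>hello</div>"

# The built-in parser needs no extra installation
soup = BeautifulSoup(html, "html.parser")
print(soup.div.text)  # hello

# lxml is faster but requires: pip install lxml
# soup = BeautifulSoup(html, "lxml")
```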
Usage reference:
find()       - find the first matching tag
find_all()   - find all matching tags
get("attr")  - get the value of a tag attribute
attrs={"id": "abc", "class": "bcd"} - filter tags by attributes
text         - get the tag's text
string       - get the text; can also be assigned to replace it

BeautifulSoup(html, features="html.parser")  # or features="lxml" — lxml is faster but must be installed separately

soup = BeautifulSoup("HTML string", "html.parser")
tag = soup.find(name="div", attrs={})
print(tag.text)
print(tag.attrs)
print(tag.get("attr"))
tags = soup.find_all(name="div", attrs={"id": "btn"})
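The calls above can be tried end to end on a small inline document; the HTML snippet and attribute values here are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div id="abc" class="bcd">
  <a href="https://example.com">link text</a>
  <a href="https://example.org"></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag
tag = soup.find(name="div", attrs={"id": "abc"})
print(tag.text)          # all text nested inside the div
print(tag.attrs)         # {'id': 'abc', 'class': ['bcd']}
print(tag.get("class"))  # ['bcd'] — class is a multi-valued attribute

# find_all() returns every matching tag
links = soup.find_all(name="a")
print(len(links))        # 2

# .string can be assigned to replace a tag's content
links[0].string = "new text"
print(links[0].text)     # new text
```

Note that `class` comes back as a list, because HTML allows a tag to carry several classes.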
More references: https://www.cnblogs.com/wupeiqi/articles/6283017.html
Example
Below is a very simple crawler that scrapes entry links from the Baidu Baike home page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import ssl

# Prevent: urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
ssl._create_default_https_context = ssl._create_unverified_context

resp = urlopen('https://baike.baidu.com/').read().decode('utf-8')
soup = BeautifulSoup(resp, 'html.parser')

# Match both http and https entry links (the original pattern matched only http),
# and escape the dots so they match literally
listUrls = soup.find_all('a', href=re.compile(r'^https?://baike\.baidu\.com'))
for url in listUrls:
    if len(url.get_text().strip()) != 0:  # skip links with empty text
        print(url.get_text().strip(), '>>>>>>>>>>>>>>>', url['href'])