BeautifulSoup
Introduction to Beautiful Soup
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
BeautifulSoup(html, features="html.parser")  # or features="lxml" — lxml is faster but must be installed separately
Installing Beautiful Soup
$ pip install beautifulsoup4
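A minimal sketch to confirm the install works and to show the parser choice mentioned above (the HTML snippet here is an invented example):

```python
from bs4 import BeautifulSoup

html = "<div id='a'>hello</div>"

# The built-in parser needs no extra installation
soup = BeautifulSoup(html, "html.parser")
print(soup.div.text)  # hello

# lxml is faster but requires: pip install lxml
# soup = BeautifulSoup(html, "lxml")
```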
Usage reference:
find()       - find the first matching tag
find_all()   - find all matching tags
get("attr")  - get the value of a tag attribute
attrs={"id": "abc", "class": "bcd"} - filter tags by attributes
text         - get the tag's text
string       - get the text; can also be assigned to replace it

BeautifulSoup(html, features="html.parser")  # or features="lxml" — lxml is faster but must be installed separately

soup = BeautifulSoup("HTML string", "html.parser")
tag = soup.find(name="div", attrs={})
print(tag.text)
print(tag.attrs)
print(tag.get("attr"))
tags = soup.find_all(name="div", attrs={"id": "btn"})
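The calls above can be tried end to end on a small inline document; the HTML snippet and attribute values here are invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div id="abc" class="bcd">
  <a href="https://example.com">link text</a>
  <a href="https://example.org"></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag
tag = soup.find(name="div", attrs={"id": "abc"})
print(tag.text)          # all text nested inside the div
print(tag.attrs)         # {'id': 'abc', 'class': ['bcd']}
print(tag.get("class"))  # ['bcd'] — class is a multi-valued attribute

# find_all() returns every matching tag
links = soup.find_all(name="a")
print(len(links))        # 2

# .string can be assigned to replace a tag's content
links[0].string = "new text"
print(links[0].text)     # new text
```

Note that `class` comes back as a list, because HTML allows a tag to carry several classes.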
More references: https://www.cnblogs.com/wupeiqi/articles/6283017.html
Example
Below is a very simple crawler that scrapes entry links from the Baidu Baike home page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import ssl

# Prevent: urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
ssl._create_default_https_context = ssl._create_unverified_context

resp = urlopen('https://baike.baidu.com/').read().decode('utf-8')
soup = BeautifulSoup(resp, 'html.parser')

# Match both http and https entry links (the original pattern matched only http),
# and escape the dots so they match literally
listUrls = soup.find_all('a', href=re.compile(r'^https?://baike\.baidu\.com'))
for url in listUrls:
    if len(url.get_text().strip()) != 0:  # skip links with empty text
        print(url.get_text().strip(), '>>>>>>>>>>>>>>>', url['href'])