BeautifulSoup

Introduction to Beautiful Soup

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

   BeautifulSoup(html, features="html.parser")  # or features="lxml"; lxml is faster but requires installation

 

Install Beautiful Soup

$ pip install beautifulsoup4

 

Usage reference:

    find() - find the first matching tag
    find_all() - find all matching tags
    get("attr") - get a tag's attribute value
    attrs={"id": "abc", "class": "bcd"} - filter matches by attributes
    text - get the text content
    string - get the text content; can also be assigned to
    
    BeautifulSoup(html, features="html.parser")    # or features="lxml"; lxml is faster but requires installation
    
    soup = BeautifulSoup("HTML-formatted string", "html.parser")
    tag = soup.find(name="div", attrs={})
    print(tag.text)
    print(tag.attrs)
    print(tag.get("attr"))
    tags = soup.find_all(name="div", attrs={"id": "btn"})
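A minimal, self-contained sketch tying these calls together; the HTML snippet and its id/class values are made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative HTML fragment (not from any real page).
html = """
<div id="btn" class="bcd">hello <b>world</b></div>
<div id="other">second</div>
"""

soup = BeautifulSoup(html, "html.parser")

tag = soup.find(name="div", attrs={"id": "btn"})
print(tag.text)       # text of the tag and its children: hello world
print(tag.attrs)      # all attributes as a dict: {'id': 'btn', 'class': ['bcd']}
print(tag.get("id"))  # a single attribute: btn

tags = soup.find_all(name="div")
print(len(tags))      # 2

# Unlike text, string can be assigned to, replacing the tag's contents.
other = soup.find("div", attrs={"id": "other"})
other.string = "replaced"
print(other.text)     # replaced
```

Note that multi-valued attributes such as class come back as a list in `attrs`, which is why `class` prints as `['bcd']`.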

More references: https://www.cnblogs.com/wupeiqi/articles/6283017.html

 

Example

 Below is a very simple crawler (it scrapes the entry links from the Baidu Baike homepage):

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import ssl

# Prevent: urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
ssl._create_default_https_context = ssl._create_unverified_context

resp = urlopen('https://baike.baidu.com/').read().decode('utf-8')
soup = BeautifulSoup(resp, 'html.parser')
listUrls = soup.find_all('a', href=re.compile(r'^http://baike.baidu.com'))
for url in listUrls:
    if len(url.get_text().strip()) != 0:  # filter out empty entries or duplicate URLs
        print(url.get_text().strip(), '>>>>>>>>>>>>>>>', url['href'])
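The comment in the loop mentions duplicate URLs, but the check above only skips empty link texts. A hedged sketch of also deduplicating hrefs with a set, run here on a small made-up HTML fragment rather than the live page:

```python
import re
from bs4 import BeautifulSoup

# Hard-coded fragment standing in for the downloaded page (illustrative only).
html = """
<a href="http://baike.baidu.com/item/a">Entry A</a>
<a href="http://baike.baidu.com/item/a">Entry A</a>
<a href="http://baike.baidu.com/item/b"> </a>
<a href="http://baike.baidu.com/item/c">Entry C</a>
"""

soup = BeautifulSoup(html, "html.parser")
seen = set()
results = []
for url in soup.find_all('a', href=re.compile(r'^http://baike\.baidu\.com')):
    text = url.get_text().strip()
    if text and url['href'] not in seen:  # skip empty texts and repeated hrefs
        seen.add(url['href'])
        results.append((text, url['href']))

for text, href in results:
    print(text, '>>>>>>>>>>>>>>>', href)
```

With the fragment above, the duplicate "Entry A" link and the empty-text link are both dropped, leaving one line each for Entry A and Entry C.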

  

 

posted @ 2017-01-05 23:06  Vincen_shen