Programming language: Python 3.5

Environment: Windows 7

Python libraries: urllib, bs4

1. Try fetching page content

# Import urlopen
from urllib.request import urlopen

# Use urlopen to fetch http://pythonscraping.com/pages/page1.html
html = urlopen("http://pythonscraping.com/pages/page1.html")
# Read and print the page's HTML source
print(html.read())

Pretty simple, isn't it? Three lines of code and it's done.
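
One thing worth noting: html.read() returns bytes, not a string. If you want readable text, decode it yourself. A minimal sketch, assuming the page is UTF-8 encoded:

from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
# read() returns bytes; decode to get a regular str (UTF-8 assumed here)
print(html.read().decode("utf-8"))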


2. Try parsing HTML with bs4

# Import urlopen and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Use urlopen to fetch http://pythonscraping.com/pages/page1.html
html = urlopen("http://pythonscraping.com/pages/page1.html")
# Convert the HTML source into a BeautifulSoup object
bsObj = BeautifulSoup(html.read(),"lxml")
# Four equivalent ways to get the h1 tag from the HTML source
print(bsObj.h1)
print(bsObj.html.body.h1)
print(bsObj.body.h1)
print(bsObj.html.h1)
With the powerful bs4, parsing HTML is no longer tedious.
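
A small note on the second argument: it selects the parser. "lxml" is fast but has to be installed separately; if it is not available, the standard library parser "html.parser" works as a drop-in replacement. A minimal sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
# "html.parser" ships with Python, so no extra package is needed
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)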


3. Try exception handling

# Import urlopen, the HTTPError exception, and BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
else:
    print("visit OK.")
    bsObj = BeautifulSoup(html.read(),"lxml")
    try:
        bsContent = bsObj.h1
    except AttributeError as e:
        print(e)
    else:
        if bsContent is None:
            print("bsContent is None")
        else:
            print(bsContent)

Not much to explain here. Exception handling is a Python fundamental: it lets you catch problems early and fix them early, and it also makes troubleshooting easier.
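
Besides HTTPError (the server was reached but returned an error status), urlopen can also raise URLError when the server cannot be reached at all, for example when the hostname does not resolve. A minimal sketch that handles both; the unreachable URL below is just a made-up example:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # hypothetical address that is unlikely to resolve
    html = urlopen("http://www.example-host-that-does-not-exist.invalid")
except HTTPError as e:
    # the server responded, but with an error status such as 404 or 500
    print(e)
except URLError as e:
    # the server could not be reached at all
    print("The server could not be found:", e.reason)
else:
    print("visit OK.")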


4. Try finding tags with a specific attribute value

# Import urlopen and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# Find all span tags whose class attribute equals green
nameList = bsObj.findAll("span",{"class":"green"})
for name in nameList:
    # Use get_text() to extract the text inside the tag
    print(name.get_text())
The key here is the findAll method: span is the tag name, class is an attribute of the span tag, and green is that attribute's value.
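
findAll also accepts keyword arguments for attribute filters; since class is a reserved word in Python, BeautifulSoup spells it class_. The following sketch should return the same list as the dictionary form above:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# class_ is the keyword-argument form of the {"class":"green"} filter
nameList = bsObj.findAll("span", class_="green")
for name in nameList:
    print(name.get_text())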


5. Try finding child nodes

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# Iterate over all direct children of the table whose id attribute is giftList
for child in bsObj.find("table", {"id":"giftList"}).children:
    print(child)
find is used almost the same way as findAll. children yields only direct (first-level) children; if you need to traverse every nested node, replace children with descendants, as in the sketch below.
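
A minimal sketch of the descendants version, which walks every nested tag under the table rather than only its direct children:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# descendants recursively visits every node under the table
for child in bsObj.find("table", {"id":"giftList"}).descendants:
    print(child)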


That's all for today.