Programming language: Python 3.5

Environment: Windows 7

Python libraries: urllib, bs4

1. Try fetching page content

# Import urlopen
from urllib.request import urlopen

# Use urlopen to fetch http://pythonscraping.com/pages/page1.html
html = urlopen("http://pythonscraping.com/pages/page1.html")
# Read and print the page's HTML source
print(html.read())

Pretty simple, isn't it? Three lines of code and it's done.
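
One thing worth noting: html.read() returns bytes, not a string. If you want readable text, decode it yourself. A minimal sketch, assuming the page is UTF-8 encoded:

from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
# read() returns bytes; decode to get a regular str (UTF-8 assumed here)
print(html.read().decode("utf-8"))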


2. Try parsing HTML with bs4

# Import urlopen and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Use urlopen to fetch http://pythonscraping.com/pages/page1.html
html = urlopen("http://pythonscraping.com/pages/page1.html")
# Convert the HTML source into a BeautifulSoup object
bsObj = BeautifulSoup(html.read(),"lxml")
# Four equivalent ways to get the h1 tag from the HTML source
print(bsObj.h1)
print(bsObj.html.body.h1)
print(bsObj.body.h1)
print(bsObj.html.h1)
With the powerful bs4, parsing HTML is no longer tedious.
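
A small note on the second argument: it selects the parser. "lxml" is fast but has to be installed separately; if it is not available, the standard library parser "html.parser" works as a drop-in replacement. A minimal sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
# "html.parser" ships with Python, so no extra package is needed
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)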


3. Try exception handling

# Import urlopen, the HTTPError exception, and BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
else:
    print("visit OK.")
    bsObj = BeautifulSoup(html.read(),"lxml")
    try:
        bsContent = bsObj.h1
    except AttributeError as e:
        print(e)
    else:
        if bsContent is None:
            print("bsContent is None")
        else:
            print(bsContent)

Not much to explain here. Exception handling is a Python fundamental: it lets you catch problems early and fix them early, and it also makes troubleshooting easier.
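
Besides HTTPError (the server was reached but returned an error status), urlopen can also raise URLError when the server cannot be reached at all, for example when the hostname does not resolve. A minimal sketch that handles both; the unreachable URL below is just a made-up example:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # hypothetical address that is unlikely to resolve
    html = urlopen("http://www.example-host-that-does-not-exist.invalid")
except HTTPError as e:
    # the server responded, but with an error status such as 404 or 500
    print(e)
except URLError as e:
    # the server could not be reached at all
    print("The server could not be found:", e.reason)
else:
    print("visit OK.")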


4. Try finding tags with a specific attribute value

# Import urlopen and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# Find all span tags whose class attribute equals green
nameList = bsObj.findAll("span",{"class":"green"})
for name in nameList:
    # Use get_text() to extract the text inside the tag
    print(name.get_text())
The key here is the findAll method: span is the tag name, class is an attribute of the span tag, and green is that attribute's value.
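
findAll also accepts keyword arguments for attribute filters; since class is a reserved word in Python, BeautifulSoup spells it class_. The following sketch should return the same list as the dictionary form above:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# class_ is the keyword-argument form of the {"class":"green"} filter
nameList = bsObj.findAll("span", class_="green")
for name in nameList:
    print(name.get_text())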


5. Try finding child nodes

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# Iterate over all direct children of the table whose id attribute is giftList
for child in bsObj.find("table", {"id":"giftList"}).children:
    print(child)
find is used almost the same way as findAll. children yields only direct (first-level) children; if you need to traverse every nested node, replace children with descendants, as in the sketch below.
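
A minimal sketch of the descendants version, which walks every nested tag under the table rather than only its direct children:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html.read(),"lxml")
# descendants recursively visits every node under the table
for child in bsObj.find("table", {"id":"giftList"}).descendants:
    print(child)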


That's all for today.