Programming language: Python 3.5
Environment: Windows 7
Python libraries: urllib, bs4
1. Try fetching a web page
# Import urlopen
from urllib.request import urlopen

# Open http://pythonscraping.com/pages/page1.html
html = urlopen("http://pythonscraping.com/pages/page1.html")

# Read and print the page's HTML source
print(html.read())
Pretty simple, right? Three lines of code and it's done.
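One detail worth noting: in Python 3, html.read() returns bytes, so the printed output looks like b'<html>...'. Here is a minimal sketch of decoding it into a normal string (assuming the page is UTF-8, which you would want to confirm from the response headers):

from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
# read() returns bytes in Python 3; decode to get a str
# (assuming UTF-8 here; check the Content-Type header if unsure)
print(html.read().decode("utf-8"))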
2. Try parsing HTML with bs4
# Import urlopen and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open http://pythonscraping.com/pages/page1.html
html = urlopen("http://pythonscraping.com/pages/page1.html")

# Parse the HTML source into a BeautifulSoup object
bsObj = BeautifulSoup(html.read(), "lxml")

# Four ways to reach the h1 tag in the HTML source
print(bsObj.h1)
print(bsObj.html.body.h1)
print(bsObj.body.h1)
print(bsObj.html.h1)

Mighty bs4: with it, parsing HTML is no longer a chore.
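All four print calls above resolve to the same Tag object, the first h1 in the document. As a small sketch of what you can do with that object once you have it (these are standard BeautifulSoup attributes, not something specific to this page):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "lxml")

tag = bsObj.h1
print(tag.name)        # the tag's name, here "h1"
print(tag.attrs)       # its attributes as a dict (may be empty)
print(tag.get_text())  # the text contained in the tag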
3. Try exception handling
# Import urlopen, the HTTPError exception, and BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
else:
    print("visit is ok.")
    bsObj = BeautifulSoup(html.read(), "lxml")
    try:
        bsContent = bsObj.h1
    except AttributeError as e:
        print(e)
    else:
        if bsContent is None:
            print("bsContent is None")
        else:
            print(bsContent)
There isn't much to explain here. Exception handling is a Python fundamental: it lets you catch problems early and fix them early, and it also makes troubleshooting easier.
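If you want this pattern to be reusable, one option is to wrap it in a helper function. The getTitle below is a hypothetical name, not part of the original code; the idea is simply to return None on any failure so the caller only has one check to make:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

# Hypothetical helper: fetch a page and return its first h1, or None on failure
def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        print(e)
        return None
    bsObj = BeautifulSoup(html.read(), "lxml")
    return bsObj.h1  # bs4 returns None if the page has no h1

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("title could not be found")
else:
    print(title)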
4. Try finding tags of a given type with a given attribute value
# Import urlopen and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html.read(), "lxml")

# Find every span tag whose class attribute is "green"
nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    # Use get_text() to extract the text inside the tag
    print(name.get_text())

The key is to understand the findAll method: span is the tag name, class is an attribute of span, and green is the value of class.
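findAll also accepts other argument forms than the one shown. As a quick sketch (these forms are part of the standard BeautifulSoup API, just not used in the example above), you can pass a list of tag names, or use the class_ keyword as a shortcut for the attribute dict:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html.read(), "lxml")

# A list of tag names matches any of them
headings = bsObj.findAll(["h1", "h2", "h3"])
# class_ (note the trailing underscore) is a shortcut for {"class": "green"}
greens = bsObj.findAll("span", class_="green")

print(len(headings), len(greens))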
5. Try finding child nodes
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html.read(), "lxml")

# Iterate over all direct children of the table whose id is "giftList"
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)

find takes almost the same arguments as findAll. children yields only first-level children; if you need to traverse every level recursively, change children to descendants.
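To make the difference between children and descendants concrete, here is a minimal sketch that counts what each iterator yields for the same table (children stops at the first level, descendants recurses into every nested tag and text node):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html.read(), "lxml")

table = bsObj.find("table", {"id": "giftList"})
# Direct children only, e.g. the table's rows
print(len(list(table.children)))
# Every descendant: rows, cells, images, and text nodes
print(len(list(table.descendants)))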
That's all for today.