Python Web Scraping with BeautifulSoup

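The two most commonly used BeautifulSoup methods are findAll and find. It is worth being clear about how they relate and how each is used: findAll returns a list of every matching tag, while find returns only the first match (or None if nothing matches).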
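The relationship between the two methods can be sketched on a small inline document. This is only an illustration: the HTML string is made up, the built-in html.parser is assumed, and find_all is used as the current spelling of findAll (both names refer to the same method in bs4).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration, not taken from a real page.
html = """
<div class="pl2"><a href="/book/1" title="First">First book</a></div>
<div class="pl2"><a href="/book/2" title="Second">Second book</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

all_divs = soup.find_all("div", {"class": "pl2"})  # list of every match
first_div = soup.find("div", {"class": "pl2"})     # first match only

print(len(all_divs))             # number of matching divs
print(first_div is all_divs[0])  # find gives the first element of find_all
print(soup.find("span"))         # no match: find returns None
print(soup.find_all("span"))     # no match: find_all returns an empty list
```

Because find returns None when nothing matches, it is safer to check the result before chaining further calls on it.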
There is also the .children attribute, which iterates over a tag's direct children:

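from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# Iterate over the direct children of the table with id="giftList"
for child in bsObj.find("table", {"id": "giftList"}).children:
    print(child)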
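One thing worth knowing about .children is that it yields text nodes (including the whitespace between tags) as well as tags. A minimal offline sketch, assuming a made-up table that only imitates the giftList structure:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table for illustration; only the id mirrors the example page.
html = """<table id="giftList">
<tr><th>Item</th></tr>
<tr><td>Gift 1</td></tr>
<tr><td>Gift 2</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

children = list(table.children)                     # tags and whitespace strings
rows = [c for c in children if isinstance(c, Tag)]  # keep only real tags

print(len(children), len(rows))
for row in rows:
    print(row.get_text())
```

Filtering with isinstance(c, Tag) is one simple way to skip the newline strings when only the rows are of interest.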

Sibling tags: .next_siblings (an attribute, not a method call)

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

# Start from the table's first row and walk over every sibling after it
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
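Since .next_siblings only yields what comes after the starting tag, beginning at the first row is a convenient way to skip a header row. A small sketch under the same assumptions as before (invented table contents, html.parser):

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table for illustration; the first row plays the header role.
html = """<table id="giftList">
<tr><th>Item</th></tr>
<tr><td>Gift 1</td></tr>
<tr><td>Gift 2</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

header = soup.find("table", {"id": "giftList"}).tr  # first <tr> in the table
# Siblings after the header; whitespace strings are filtered out.
data_rows = [s for s in header.next_siblings if isinstance(s, Tag)]
texts = [row.get_text() for row in data_rows]

print(texts)
```

The header row itself does not appear in the result, because a tag is never its own sibling.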



Here the methods above are used to find each div with class="pl2" and read the title attribute of the a tag inside it.

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://book.douban.com/top250?start=0")
bsObj = BeautifulSoup(html, "html.parser")

# Each book entry sits in a <div class="pl2">; read the title of its first <a>
for link in bsObj.findAll("div", attrs={"class": "pl2"}):
    name = link.find("a")
    print(name.get('title'))

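If the loop is changed to

for link in bsObj.findAll("div", attrs={"class": "pl2"}):
    name = link.findAll("a")
    print(name[0].get('title'))

the result is the same, because find returns the first element of the list that findAll returns.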

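You can also read the text inside the a tag with name.text, and read any attribute with .get('href'), name['href'], or the name.attrs dictionary. Note that BeautifulSoup has no .val accessor: name.val looks for a child tag named val and returns None.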
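Reading the text and attributes of a tag can be sketched as follows. The HTML snippet, URL, and title are invented for illustration and only imitate the div class="pl2" structure:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration, not real page content.
html = (
    '<div class="pl2">'
    '<a href="https://example.com/book/1" title="Example Title">'
    '  Example Title  '
    '</a></div>'
)
soup = BeautifulSoup(html, "html.parser")
name = soup.find("div", {"class": "pl2"}).find("a")

link_text = name.text.strip()  # text inside the tag, whitespace trimmed
href = name.get("href")        # attribute value, or None if absent
title = name["title"]          # same lookup, but raises KeyError if absent
missing = name.get("id")       # the tag has no id attribute, so this is None

print(link_text, href, title, missing)
print(name.attrs)              # all attributes as a dictionary
```

Using .get is the more forgiving choice when an attribute may be missing on some tags.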



posted @ 2016-11-07 19:50  进击的大乐