Spider_Basics Summary 4_bs.find_all() with Regular Expressions and Lambda Expressions

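# BeautifulSoup's find() and find_all() methods are often combined with regular expressions and lambda expressions:

# 1-bs.find_all() with regular expressions:
# The syntax is shown in the example below:

# Find all images whose src matches the pattern: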
import requests
from bs4 import BeautifulSoup
import re

html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})  # raw string avoids invalid-escape warnings; matches e.g. '../img/gifts/imgx.jpg'
for image in images: 
    print(image['src'])
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
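
The same regex filter can be tried without network access. The sketch below is only an illustration: the inline HTML and its file names are made up for the example, not taken from the page above. When an attribute value is a compiled pattern, Beautiful Soup filters with the pattern's search() method, so the pattern may match anywhere in the attribute value.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page
html = ('<img src="../img/gifts/logo.jpg">'
        '<img src="../img/gifts/img1.jpg">'
        '<img src="../img/gifts/img2.jpg">'
        '<img src="../img/gifts/img3.png">')
bs = BeautifulSoup(html, 'html.parser')

# Same pattern as above, written as a raw string
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# Only src values containing '../img/gifts/img' followed later by '.jpg' are kept
matches = [img['src'] for img in bs.find_all('img', {'src': pattern})]
print(matches)
```

Here logo.jpg is skipped because its name does not start with img, and img3.png is skipped because it does not end in .jpg.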
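# 2-tag.attrs gets a tag's attributes and returns them as a dict:

When searching a page, sometimes you do not need a tag's content, only its attributes. For example:
the URL an a tag points to is held in its href attribute;
the image file of an img tag is held in its src attribute...

For a tag object, tag.attrs returns a dict containing all of that tag's attributes; index the dict to get a single value, e.g. tag.attrs['src']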
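
A minimal sketch of reading attributes, assuming a small made-up HTML snippet (the tags and attribute values are hypothetical, chosen only to show the calls):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only for illustration
html = ('<a href="/gifts" class="nav link">Gifts</a>'
        '<img src="../img/gifts/img1.jpg" alt="Vegetable Basket">')
bs = BeautifulSoup(html, 'html.parser')

img = bs.find('img')
print(img.attrs)         # dict holding every attribute of the tag
print(img.attrs['src'])  # index the dict to get one attribute
print(img['src'])        # shorthand for the same lookup

link = bs.find('a')
print(link.get('href'))   # .get() returns None instead of raising KeyError
print(link.get('title'))  # this tag has no title attribute
print(link['class'])      # multi-valued attributes such as class come back as a list
```

Using tag.get('name') is the safer choice when an attribute may be missing, since tag['name'] raises KeyError in that case.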
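# 3-bs.find_all() with lambda expressions:

bs.find_all(lambda tag: len(tag.attrs) == 2)   # all tags that have exactly two attributes
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')  # returns the matching Tag objects
bs.find_all('', text='Or maybe he\'s only resting?')  # returns NavigableString objects (text is the older name of the string argument)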
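
The difference between the three calls is easier to see on a small document. The sketch below assumes a made-up HTML snippet (its ids, classes and second quote are hypothetical) and uses string=, the name this argument has had since Beautiful Soup 4.4.0, in place of text=.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical inline HTML, used only for illustration
html = """
<div id="wrapper">
  <p class="quote" id="q1">Or maybe he's only resting?</p>
  <p class="quote">He's pining for the fjords.</p>
  <span title="note">Or maybe he's only resting?</span>
</div>
"""
bs = BeautifulSoup(html, 'html.parser')
target = "Or maybe he's only resting?"

# The lambda receives each Tag and must return True or False
two_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)

# Filtering on get_text() returns the Tag objects themselves
by_text = bs.find_all(lambda tag: tag.get_text() == target)

# Filtering with string= returns the text nodes, not the tags
strings = bs.find_all(string=target)

print([t.name for t in two_attrs])
print([t.name for t in by_text])
print([type(s).__name__ for s in strings])
print([s.parent.name for s in strings])  # .parent leads back to the enclosing tag
```

The outer div is not returned by the get_text() filter because its text also contains the other paragraph and the surrounding whitespace.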
# 4-Summary:
import requests
from bs4 import BeautifulSoup
import re

html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')


1) bs.find() and bs.find_all() look up tags by name and attributes:
# find_all(tag/tag_list, attributes_dict, recursive, text, limit, keywords)
#     find(tag/tag_list, attributes_dict, recursive, text, keywords)
or:
# find_all(lambda expression)
#     find(lambda expression)
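
A short sketch of how the main parameters in the signatures above behave, assuming a made-up HTML snippet (the tags and the id are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only for illustration
html = ('<div><h1>Title</h1><p>one</p>'
        '<section><p>two</p><h2>Sub</h2></section>'
        '<p id="last">three</p></div>')
bs = BeautifulSoup(html, 'html.parser')
div = bs.find('div')

# A list of tag names matches any of them, in document order
headings = [t.name for t in bs.find_all(['h1', 'h2'])]

# limit stops the search after that many results
first_two = [p.get_text() for p in div.find_all('p', limit=2)]

# recursive=False looks only at direct children, so the nested <p> is skipped
direct = [p.get_text() for p in div.find_all('p', recursive=False)]

# A keyword argument filters on the attribute with that name
last = bs.find(id='last').get_text()

# find() returns the first match only (or None when nothing matches)
first_p = bs.find('p').get_text()
missing = bs.find('table')

print(headings, first_two, direct, last, first_p, missing)
```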

2) bs.find(), bs.find_all() and navigating the tree (CSS):
# tag.children  tag.descendants   tag.next_siblings   tag.previous_siblings  tag.parent
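
The navigation attributes listed above can be sketched on a small table. The HTML below is made up for the example and written without whitespace between tags, so that no whitespace text nodes appear among the children; on a real page those text nodes are present, which is why the code filters on Tag.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical inline HTML, used only for illustration
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
        '<tr id="gift2"><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# children: direct children only
children = [c for c in table.children if isinstance(c, Tag)]

# descendants: every nested node, here limited to tags
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

# next_siblings: the nodes after a tag at the same level (the tag itself is excluded)
header = table.tr
following = [s.get('id') for s in header.next_siblings if isinstance(s, Tag)]

# previous_siblings: walks backwards from the tag
gift2 = bs.find('tr', {'id': 'gift2'})
preceding = [s.get('id') for s in gift2.previous_siblings if isinstance(s, Tag)]

# parent: the enclosing tag
parent_id = gift2.parent.get('id')

print(len(children), descendants, following, preceding, parent_id)
```

The header row has no id, so it shows up as None in the previous_siblings result.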

3) The attributes_dict parameter of bs.find() and bs.find_all() with regular expressions:
# images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})

4) bs.find() and bs.find_all() with lambda expressions:
# bs.find_all(lambda tag: len(tag.attrs) == 2)   # all tags that have exactly two attributes
# bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')
# bs.find_all('', text='Or maybe he\'s only resting?')  # returns NavigableString objects
posted @ 2020-06-23 16:20  collin_pxy