HTML Page Parsing

I. Regular Expressions

Parsing a page with regular expressions requires importing the re module.

import re

1. Write the regular expression first

obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)'
                     r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp;/&nbsp;'
                     r'(?P<place>.*?)&nbsp;.*?</p>', re.S)

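(?P<group_name>pattern) defines a named group; the text it captures can later be read by that name.

With the re.S flag (also spelled re.DOTALL), "." matches the newline character "\n" as well, so the string is treated as a single block and a pattern can match across line breaks.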
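To make the effect of re.S concrete, here is a minimal sketch. The sample string and the group name `body` are made up for illustration and are not part of the original page.

```python
import re

# Hypothetical sample text containing a line break inside the tag.
text = "<p>first line\nsecond line</p>"

# Without re.S, "." stops at the newline, so the pattern cannot span both lines.
without_flag = re.search(r"<p>(?P<body>.*?)</p>", text)

# With re.S, "." also matches "\n", so the whole paragraph is captured.
with_flag = re.search(r"<p>(?P<body>.*?)</p>", text, re.S)

print(without_flag)              # no match object is returned here
print(with_flag.group("body"))   # the captured text, including the newline
```

This is why a pattern meant for multi-line HTML usually needs re.S: tags and their content rarely sit on one line.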

2. Parse the returned HTML page

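Here resp is assumed to be the response object returned by requests. The four calls below are alternatives, not consecutive steps; pick the one that fits. The loop in step 3 expects the iterator returned by finditer.

result = obj.finditer(resp.text)    # returns an iterator of match objects

result = obj.match(resp.text)    # matches only at the start of the string; returns a match object or None

result = obj.search(resp.text)    # returns the first match anywhere in the string, or None

result = obj.findall(resp.text)    # returns a list; with several groups, a list of tuples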
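The difference between the four methods is easier to see side by side. The following is a small sketch on a made-up string; `html`, `item`, and the result variable names are illustrative only.

```python
import re

html = "<li>apple</li><li>banana</li>"   # hypothetical sample input
item = re.compile(r"<li>(?P<name>.*?)</li>")

# finditer: lazily yields one match object per occurrence.
names_iter = [m.group("name") for m in item.finditer(html)]

# match: succeeds only if the pattern matches at position 0.
first_at_start = item.match(html)
no_match_at_start = item.match(" " + html)   # leading space, so this is None

# search: scans forward and returns the first match found anywhere.
first_anywhere = item.search(" " + html)

# findall: returns plain strings (or tuples when there are several groups),
# not match objects, so .group() is not available on its items.
all_names = item.findall(html)

print(names_iter, all_names)
```

finditer is usually the most convenient choice when named groups are involved, because each item is a full match object.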

3. Get the group values (iterating over the finditer result)

for it in result:
    print(it.group('name'))
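Putting the three steps together, the sketch below runs the pattern from step 1 against a hand-written HTML fragment instead of a live response. The fragment, the film names, and the `rows` variable are invented for illustration; the real page may be structured differently.

```python
import re

obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)'
                 r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp;/&nbsp;'
                 r'(?P<place>.*?)&nbsp;.*?</p>', re.S)

# Hypothetical fragment shaped to fit the pattern above.
sample_html = """
<li>
  <span class="title">Example Film</span>
  <p class="">
    Director: Someone<br>
    1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Drama
  </p>
</li>
<li>
  <span class="title">Another Film</span>
  <p class="">
    Director: Someone Else<br>
    2001&nbsp;/&nbsp;Japan&nbsp;/&nbsp;Animation
  </p>
</li>
"""

rows = []
for it in obj.finditer(sample_html):
    # groupdict() returns every named group as a dict;
    # strip() removes the newline and indentation captured after <br>.
    row = {key: value.strip() for key, value in it.groupdict().items()}
    rows.append(row)

print(rows)
```

Because re.S lets ".*?" run across line breaks, captured values often carry leading whitespace, which is why the values are stripped here.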

II. BeautifulSoup

import requests
from bs4 import BeautifulSoup    # import BeautifulSoup

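find(tag, attribute=value) returns the first matching tag, or None if nothing matches.

find_all(tag, attribute=value) returns a list of all matching tags.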

# Parse the page with bs4
page = BeautifulSoup(resp.text, 'html.parser')
pageList = page.find_all('a', style="display: block;", target="_blank")
for i in pageList:
    print(i.get('href'))    # get the value of a tag attribute; use .text for the tag's text content
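The same calls can be tried without a network request. This is a minimal sketch on an inline HTML string; the markup and the variable names `first_link`, `hrefs`, and `texts` are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links carry the attributes we filter on, one does not.
sample_html = """
<div>
  <a href="/page/1" style="display: block;" target="_blank">First</a>
  <a href="/page/2" style="display: block;" target="_blank">Second</a>
  <a href="/other">Ignored</a>
</div>
"""

page = BeautifulSoup(sample_html, "html.parser")

# find returns only the first matching tag.
first_link = page.find("a")

# find_all returns every tag whose attributes equal the given values.
links = page.find_all("a", style="display: block;", target="_blank")

hrefs = [a.get("href") for a in links]   # attribute values
texts = [a.text for a in links]          # text between the opening and closing tag

print(first_link.get("href"), hrefs, texts)
```

Note that the attribute filter compares the full attribute string, so `style="display: block;"` must match the page's value exactly, including the space and the semicolon.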

III. XPath

import requests
from lxml import etree    # import etree

html = etree.HTML(resp.text)

divs = html.xpath('//*[@id="utopia_widget_6"]/div/div[1]/div')

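for div in divs:
    # a relative XPath (starting with ./) is evaluated from the current div;
    # xpath() returns a list, so [0] takes the first text node
    price = div.xpath('./div[4]/span/text()')[0]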
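Since xpath() always returns a list, indexing with [0] raises IndexError when a node is missing. The sketch below shows one way to guard against that on a made-up fragment; the id `listing`, the markup, and the `prices` list are illustrative and do not come from the original page.

```python
from lxml import etree

# Hypothetical fragment: the third row has no price span.
sample_html = """
<div id="listing">
  <div><span>Item A</span><span>10</span></div>
  <div><span>Item B</span><span>25</span></div>
  <div><span>Item C</span></div>
</div>
"""

html = etree.HTML(sample_html)

# Absolute query: every direct <div> child of the element with id="listing".
divs = html.xpath('//*[@id="listing"]/div')

prices = []
for div in divs:
    # Relative query from the current row; the result is a (possibly empty) list.
    found = div.xpath('./span[2]/text()')
    prices.append(found[0] if found else None)

print(prices)
```

Checking for an empty list before indexing keeps a single incomplete row from stopping the whole loop.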

posted @ 2022-05-29 22:37 昂昂呀
