10.12 HTMLParser

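Suppose we are building a search engine. The first step is to crawl the target site and download its pages; the second is to parse each HTML page and work out whether it contains news, images, or video.

Assuming the first step is done, how do we parse the HTML in the second step?

HTML looks a lot like XML, but its syntax is much less strict (tags may be left unclosed, attribute values may be unquoted), so standard XML parsers such as DOM or SAX cannot reliably parse HTML.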
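To make the point concrete, here is a small sketch that hands an ordinary HTML fragment to a strict XML parser. The fragment and the helper name parses_as_xml are made up for illustration; xml.etree.ElementTree is used here simply as a convenient strict XML parser from the standard library.

```python
import xml.etree.ElementTree as ET

# Ordinary HTML: <br> is never closed, and &nbsp; is not a predefined XML entity.
html_fragment = '<p>Some text<br>more&nbsp;text</p>'

def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml(html_fragment))                      # False
print(parses_as_xml('<p>Some text<br/>more text</p>'))   # True
```

Only the second string, which closes every tag and avoids HTML-only entities, is well-formed XML.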

Fortunately, Python provides HTMLParser, which makes parsing HTML very convenient and takes only a few lines of code:

Module and class

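from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Opening tag, e.g. <p>
    def handle_starttag(self, tag, attrs):
        print('<%s>' % tag)

    # Closing tag, e.g. </p>
    def handle_endtag(self, tag):
        print('</%s>' % tag)

    # Self-closing tag, e.g. <br/>
    def handle_startendtag(self, tag, attrs):
        print('<%s/>' % tag)

    # Text between tags
    def handle_data(self, data):
        print(data)

    # Comment, e.g. <!-- ... -->
    def handle_comment(self, data):
        print('<!--', data, '-->')

    # Named character reference, e.g. &nbsp; (name is 'nbsp')
    def handle_entityref(self, name):
        print('&%s;' % name)

    # Numeric character reference, e.g. &#1234; (name is '1234')
    def handle_charref(self, name):
        print('&#%s;' % name)

# convert_charrefs=False is needed for handle_entityref and handle_charref to be
# called; with the default (True) references are decoded and sent to handle_data.
parser = MyHTMLParser(convert_charrefs=False)
parser.feed('''<html>
<head></head>
<body>
<!-- test html parser -->
    <p>Some <a href="#">html</a> HTML&nbsp;tutorial...<br>END</p>
</body></html>''')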

feed() can be called multiple times: there is no need to pass in the whole HTML string at once; it can be fed in piece by piece.
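As a rough illustration of feeding in pieces, the sketch below splits a small document at arbitrary points, even in the middle of a tag. The class name TagCollector and the chunk boundaries are invented for this example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record opening tags and text as they are reported (hypothetical helper)."""
    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

collector = TagCollector()
# The chunks cut through both text and the <a ...> tag.
for chunk in ['<p>Hel', 'lo <a hr', 'ef="#">wor', 'ld</a></p>']:
    collector.feed(chunk)
collector.close()  # flush anything still buffered

print(collector.tags)
print(''.join(collector.text))
```

An incomplete tag is buffered until the rest arrives, while a run of text may be delivered over several handle_data calls, so it seems safest to collect the pieces and join them afterwards rather than assume one call per text node.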

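There are two kinds of special characters (character references): named ones such as &nbsp; and numeric ones such as &#1234;. The parser recognizes both. In the code above they are reported through the handle_entityref and handle_charref methods. These methods are only called when the parser is created with convert_charrefs=False; with the default convert_charrefs=True (Python 3.5 and later), references are decoded and passed to handle_data instead.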
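The two handlers can also turn the references back into real characters. The following is a minimal sketch, assuming the parser is created with convert_charrefs=False; the class name EntityDecoder and the input string are made up, and name2codepoint comes from the standard html.entities module.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint

class EntityDecoder(HTMLParser):
    """Rebuild the text with character references decoded (hypothetical helper)."""
    def __init__(self):
        # Without convert_charrefs=False the two reference handlers are never called.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # name is the part between '&' and ';', e.g. 'nbsp'
        codepoint = name2codepoint.get(name)
        # Leave unknown names untouched instead of guessing.
        self.parts.append(chr(codepoint) if codepoint is not None else '&%s;' % name)

    def handle_charref(self, name):
        # name is e.g. '65' (decimal) or 'x42' (hexadecimal)
        if name.lower().startswith('x'):
            self.parts.append(chr(int(name[1:], 16)))
        else:
            self.parts.append(chr(int(name)))

decoder = EntityDecoder()
decoder.feed('<p>A&amp;B&nbsp;&#65;&#x42;</p>')
decoder.close()
print(repr(''.join(decoder.parts)))  # 'A&B\xa0AB'
```

With the default convert_charrefs=True none of this is needed, since handle_data already receives the decoded text.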

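Example 2:

For the page https://www.python.org/events/python-events/, view the source in a browser and copy it, then parse the HTML and print the date, name, and location of each conference listed on the Python website.

Solution:

1. First, open the browser's developer tools, look at the page source, and find the block of tags that holds the date, name, and location.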

<li>
        <h3 class="event-title"><a href="/events/python-events/883/">PyConDE &amp; PyData Berlin 2020 (cancelled)</a></h3>
    <p>
        <time datetime="2020-10-14T00:00:00+00:00">14 Oct. – 16 Oct. <span class="say-no-more"> 2020</span></time>
        <span class="event-location">Berlin, Germany</span>
    </p>
</li>

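The other conferences use the same format and differ only in content, so the fields can be extracted with the re module and regular expressions.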
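Because every event uses the same tags, another option is to let an HTMLParser subclass keep track of which element it is inside and collect the text from there. This is only a sketch under assumptions: the class name EventParser and its attributes are hypothetical, and it has been checked only against the single block shown above, not the live page.

```python
from html.parser import HTMLParser

class EventParser(HTMLParser):
    """Collect name, start date, and location per event (hypothetical helper)."""
    def __init__(self):
        super().__init__()      # default convert_charrefs=True: &amp; arrives as &
        self.events = []
        self._field = None      # field currently being filled, if any
        self._current = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get('class')
        if tag == 'h3' and cls == 'event-title':
            self._current = {'name': '', 'date': '', 'location': ''}
            self._field = 'name'
        elif tag == 'time' and self._current:
            self._current['date'] = attrs.get('datetime', '')
        elif tag == 'span' and cls == 'event-location' and self._current:
            self._field = 'location'

    def handle_data(self, data):
        if self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag == 'h3' and self._field == 'name':
            self._field = None
        elif tag == 'span' and self._field == 'location':
            # The location is the last field of an event, so store the record.
            self._field = None
            self.events.append(self._current)
            self._current = {}

sample = '''<li>
        <h3 class="event-title"><a href="/events/python-events/883/">PyConDE &amp; PyData Berlin 2020 (cancelled)</a></h3>
    <p>
        <time datetime="2020-10-14T00:00:00+00:00">14 Oct. – 16 Oct. <span class="say-no-more"> 2020</span></time>
        <span class="event-location">Berlin, Germany</span>
    </p>
</li>'''

event_parser = EventParser()
event_parser.feed(sample)
event_parser.close()
for event in event_parser.events:
    print(event['date'], '|', event['name'], '|', event['location'])
```

The date is read from the datetime attribute rather than from the visible text, which avoids stitching the year back together from the nested span.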

2. Build the regular expressions

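import re

# Non-greedy, so the match stops at the first closing quote
re_date = re.compile(r'time datetime="(.+?)"')
# City and country; \w+ alone would miss names that contain spaces
re_location = re.compile(r'<span class="event-location">([^,<]+), ([^<]+)</span>')
# Link text of the event entry
re_name = re.compile(r'/events/.+?/">(.+?)<')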
posted @ 2020-10-12 19:56 ShineLe
