Extracting the body text of a web page while filtering out script, style, and meta

from lxml import etree

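text = ''  # the page's HTML source goes here
html = etree.HTML(text, parser=etree.HTMLParser(encoding="utf-8"))
# self:: tests the element itself; a bare name such as "script" would test its children instead
text = '\n'.join(html.xpath('//*[not(self::script or self::style or self::meta)]/text()'))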
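For reference, here is a minimal sketch of the same idea wrapped in a reusable function. The function name `extract_text` and the sample markup are made up for illustration and are not from the original post. It assumes the input is a Python `str`, and it uses the `self::` axis so that the `script`, `style`, and `meta` elements themselves are skipped.

```python
from lxml import etree


def extract_text(markup: str) -> str:
    """Return the text of an HTML string, skipping script/style/meta content."""
    # Guard against empty input, which cannot be parsed into a tree.
    if not markup.strip():
        return ""
    # Encode to bytes so the parser's encoding setting applies to the input.
    parser = etree.HTMLParser(encoding="utf-8")
    root = etree.HTML(markup.encode("utf-8"), parser=parser)
    if root is None:
        return ""
    # Select text nodes whose parent element is not script, style, or meta.
    nodes = root.xpath(
        "//*[not(self::script or self::style or self::meta)]/text()"
    )
    # Drop whitespace-only nodes and trim the rest.
    return "\n".join(t.strip() for t in nodes if t.strip())


# Hypothetical sample page, used only to exercise the function.
sample = """<html><head><meta charset="utf-8"><title>Demo</title>
<style>p { color: red; }</style></head>
<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>"""

print(extract_text(sample))
```

Note that the `title` text is still included, since only `script`, `style`, and `meta` are filtered. Add `self::title` to the predicate if that is not wanted.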

posted @ 2023-01-29 17:19 by 二二二狗子
