Extract the body text of a web page, filtering out script, style, and meta

from lxml import etree

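text = ''  # placeholder: put the page's HTML source here
html = etree.HTML(text, parser=etree.HTMLParser(encoding="utf-8"))
# self:: tests the element itself; without it the predicate only checks for script/style/meta children
text = '\n'.join(html.xpath('//*[not(self::script or self::style or self::meta)]/text()'))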
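The same idea can be wrapped in a small function, shown here as a minimal sketch rather than a definitive implementation. The names `extract_text`, `page_source`, and `sample` are hypothetical and not part of the original snippet. Stripping whitespace and dropping empty lines is an extra assumption added for readability.

```python
from lxml import etree


def extract_text(page_source: str) -> str:
    """Return the visible text of an HTML page, skipping script/style/meta."""
    # Guard against empty input, which has no document to parse.
    if not page_source.strip():
        return ''
    root = etree.HTML(page_source)
    if root is None:
        return ''
    # self:: makes the predicate test the element itself, so the text
    # nodes that belong to <script> and <style> are never selected.
    nodes = root.xpath(
        '//*[not(self::script or self::style or self::meta)]/text()'
    )
    # Assumption: trim each text node and discard whitespace-only ones.
    lines = [node.strip() for node in nodes]
    return '\n'.join(line for line in lines if line)


# Hypothetical sample page used only for illustration.
sample = (
    '<html><head><meta name="description" content="demo">'
    '<style>p { color: red; }</style></head>'
    '<body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>'
)
result = extract_text(sample)
print(result)
```

With this sample, the text of the two paragraphs is kept, while the contents of the script and style elements are left out.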

posted @ 2023-01-29 17:19  二二二狗子