A Simple Text-Scraping Skeleton

Environment: Python 3.7
Install the required libraries (requests and lxml) beforehand.

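# -*- coding: utf-8 -*-

import re

import requests
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0"
}
# Target URL
url = 'http://www.stats.gov.cn/tjsj/sjjd/202010/t20201020_1795025.html'
# Decode the raw bytes as UTF-8 explicitly to avoid garbled characters in the page source
res = requests.get(url, headers=headers).content.decode('utf-8')
s = etree.HTML(res)  # parsed tree, available for XPath queries; not used below
# Capture the text inside every <span> element (non-greedy match)
t = re.findall(r'<span.*?>(.*?)</span>', res)
text = ""
for i in t:
    # Skip spans that contain only a single space
    if i == " ":
        continue
    text += i

print(text)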
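
To see what the span-matching step does without hitting the network, here is a minimal offline sketch. The sample markup and the helper name extract_span_text are made up for illustration and are not part of the original script. It also assumes the spans are not nested, since a regular expression of this kind stops at the first closing tag it finds.

```python
import re

# Hypothetical sample markup standing in for the downloaded page source.
SAMPLE_HTML = (
    '<p><span style="font-size: 12pt">First sentence.</span>'
    '<span> </span>'
    '<span class="x">Second sentence.</span></p>'
)

# Non-greedy groups stop at the nearest closing tag; re.S lets "." cross newlines.
SPAN_PATTERN = re.compile(r'<span.*?>(.*?)</span>', re.S)


def extract_span_text(html):
    """Join the text of every <span>, skipping whitespace-only pieces."""
    pieces = SPAN_PATTERN.findall(html)
    return "".join(piece for piece in pieces if piece.strip())


print(extract_span_text(SAMPLE_HTML))
```

Using piece.strip() instead of comparing against a single space also drops spans that hold only tabs or newlines, which may or may not be what a given page needs.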

Output:

Posted @ 2020-10-31 10:48 by 水云栖
