简单爬取文字框架
环境python 3.7
提前装好相应的库
-- coding: utf-8 --
import requests
from lxml import etree
import re
headers = {
"User-Agent": "Mozilla/5.0"
}
目标网址
url='http://www.stats.gov.cn/tjsj/sjjd/202010/t20201020_1795025.html'
解决爬取的网页源代码乱码问题
res=requests.get(url,headers=headers).content.decode('utf-8')
s=etree.HTML(res)
t=re.findall(r'<span.?>(.?)',res)
text=""
for i in t:
if i==" ":
continue
text+=i
print(text)
运行结果:


浙公网安备 33010602011771号