Fixing garbled text when fetched pages mix utf-8, gb2312, and gbk encodings

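#coding:utf-8
"""
When writing a crawler, Chinese web pages may be encoded as utf-8, gb2312, gbk, and so on.
The chardet library is the most convenient way to find out which encoding a page uses.
chardet download: https://pypi.python.org/pypi/chardet/
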
To install chardet, unpack the archive, then run this in the unpacked directory:
python setup.py install
"""


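import chardet
import urllib2

# Fetch the page's raw HTML bytes
line = "http://www.***.com"
html_1 = urllib2.urlopen(line, timeout=30).read()

# chardet.detect returns a dict; its 'encoding' value may be None if detection fails
mychar = chardet.detect(html_1)

# Lower-case the name so 'UTF-8', 'GBK' and 'GB2312' all match
bianma = (mychar['encoding'] or '').lower()
#print bianma
if bianma == 'utf-8':
    # Already UTF-8, so use the bytes unchanged
    html = html_1
elif bianma == 'gbk':
    html = html_1.decode('gbk', 'ignore').encode('utf-8')
elif bianma == 'gb2312':
    html = html_1.decode('gb2312', 'ignore').encode('utf-8')
else:
    # Other or undetected encoding: keep the raw bytes so html is always defined
    html = html_1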
With the handling above, the html is converted to UTF-8 and should no longer come out garbled.
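
The same idea can be wrapped in a small helper. This is only a sketch, not part of the original script: the name to_utf8 and its fallback parameter are made up for illustration, and it takes the detected encoding name as an argument so it needs nothing beyond the standard library. It assumes that decoding both gb2312 and gbk pages with the gb18030 codec is acceptable, since GB18030 is a superset of GBK, which in turn is a superset of GB2312.

```python
import codecs


def to_utf8(raw, detected, fallback="utf-8"):
    """Re-encode raw page bytes as UTF-8, given a detected encoding name.

    `detected` is a name such as 'GB2312' or 'utf-8'; it may be None
    when detection failed, in which case `fallback` is used.
    """
    name = (detected or fallback).lower()
    # Assumption: treat gb2312/gbk as gb18030, a superset of both,
    # so pages labelled gb2312 that contain GBK-only characters still decode.
    if name in ("gb2312", "gbk"):
        name = "gb18030"
    try:
        codecs.lookup(name)
    except LookupError:
        # Unknown codec name: fall back instead of raising
        name = fallback
    # 'ignore' drops undecodable bytes, as in the script above
    return raw.decode(name, "ignore").encode("utf-8")
```

With the variables from the script above, it could be called as html = to_utf8(html_1, chardet.detect(html_1)['encoding']).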

A bit of borrowed knowledge.

posted @ 2020-01-03 15:05 chaiyinlei
