Fixing garbled Chinese text returned by a Python crawler

The ISO-8859-1 problem

Straight to the code:

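import requests

new_url = "http://www.anquan.us/static/drops/papers-17213.html"

# Keep the Response object so both the raw bytes and the encoding metadata stay available.
res = requests.get(url=new_url)

# Simplest fix: decode the raw bytes explicitly when the page is known to be UTF-8.
print(res.content.decode('utf-8'))

# More general fix: distrust the encoding only when requests fell back to ISO-8859-1.
if res.encoding == 'ISO-8859-1':
    # Look for a charset declared inside the HTML itself (for example in a <meta> tag).
    encodings = requests.utils.get_encodings_from_content(res.text)
    if encodings:
        encoding = encodings[0]
    else:
        # No declaration in the page: let requests guess from the bytes.
        encoding = res.apparent_encoding
else:
    # res.encoding is None when there is no Content-Type header at all.
    encoding = res.encoding or res.apparent_encoding
encode_content = res.content.decode(encoding, 'replace')

#print(encode_content)
#print(res.encoding)
#print(res.apparent_encoding)
#print(requests.utils.get_encodings_from_content(res.text))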
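The lookup for an encoding declared inside the page can be illustrated without any network access. The sketch below is only an approximation of that idea, not the actual implementation in requests: `sniff_meta_charset` and `META_CHARSET` are hypothetical names, and the regular expression covers only the common `<meta ... charset=...>` forms.

```python
import re

# Hypothetical pattern: matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(html_bytes):
    """Return the charset named in the first matching <meta> tag, or None."""
    match = META_CHARSET.search(html_bytes)
    return match.group(1).decode("ascii") if match else None


page = b'<html><head><meta charset="utf-8"></head><body></body></html>'
print(sniff_meta_charset(page))
```

Working on the raw bytes avoids the chicken-and-egg problem of having to decode the page before knowing how it is encoded.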

Why this happens:

Consider code like this:

……

texts = bs.find('div', class_='content_element').p.text.strip()

print(texts)

……

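When the matched content contains Chinese, the printed text is garbled. My first guess was that BeautifulSoup decodes the page as gbk by default, but the check below points to the encoding that requests chose for the response.

The following code shows this: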

……

r=requests.get(link,headers=headers)

print(r.encoding)

……

This prints ISO-8859-1 as the encoding.

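So the cause is that the server's response header does not declare the page encoding. The headers defined in my code are request headers and cannot supply it. A non-standard response header omits the charset, for example Content-Type: text/html

A standard response header includes it, for example Content-Type: text/html; charset=utf-8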

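Chapter 16 (Internationalization) of HTTP: The Definitive Guide notes that if the Content-Type field of an HTTP response does not specify a charset, the page is assumed to be encoded as ISO-8859-1. requests applies this default to text content types. That is fine for English pages, but Chinese pages come out garbled.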
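The fallback rule can be sketched with the standard library alone. This is an illustration under assumptions rather than the code requests actually runs: `charset_from_content_type` is a hypothetical helper, and it applies the ISO-8859-1 default to any header value instead of only to text types.

```python
from email.message import Message


def charset_from_content_type(content_type, default="ISO-8859-1"):
    """Return the declared charset (lowercased), or the default if none is declared."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header has no charset parameter.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html"))
print(charset_from_content_type("text/html; charset=UTF-8"))
```

The first call falls back to the default because nothing is declared, which is the situation that produces the garbled Chinese described above.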

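The steps above show how to check the current encoding. Next, how to fix the problem.

The IDE displays text as UTF-8, and the page itself is encoded in UTF-8, so the bytes must be decoded as UTF-8 for the Chinese to display correctly.

Applied to the earlier code: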

texts = bs.find('div', class_='content_element').p.text.strip().encode('iso-8859-1').decode('utf-8')

print(texts)
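Why the encode('iso-8859-1').decode('utf-8') round trip works can be seen on a plain string, with no crawler involved. The sample text below is an assumption chosen for illustration. The repair is lossless because ISO-8859-1 maps every byte value 0 to 255 to exactly one character, so re-encoding recovers the original bytes.

```python
original = "中文乱码"

# Bytes as a UTF-8 server would send them.
raw = original.encode("utf-8")

# What you get when those bytes are wrongly decoded as ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# Undo the wrong decoding, then decode with the correct codec.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

This only helps when the text was mis-decoded as ISO-8859-1. If it was decoded with 'replace' or with a codec that drops bytes, the original bytes are gone and the raw response content has to be decoded again instead.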

Alternatively, set the encoding to 'utf-8' on the response before reading its text:

……

r=requests.get(link,headers=headers)

r.encoding = 'utf-8'

print(r.encoding)

……

References:

https://www.cnblogs.com/ccsx/p/8572735.html

https://www.cnblogs.com/surecheun/p/9694052.html

https://www.cnblogs.com/bw13/p/6549248.html

posted @ 2020-04-03 21:02 7411
