Encoding limitations in Python crawlers

Source code
# -*- coding: utf-8 -*-

import requests

keyword = {'wd': '中国'}
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3704.400 QQBrowser/10.4.3587.400'}
response = requests.get('https://www.baidu.com/s', params=keyword, headers=header)

# Option 1: print to the console
# print(response.content.decode('utf-8'))

# Option 2: write to an .html file
# with open('百度.html', 'w', encoding='utf-8') as f:
#     f.write(response.content.decode('utf-8'))

# Option 3: write to a .txt file
# with open('百度.txt', 'w', encoding='utf-8') as f:
#     f.write(response.content.decode('utf-8'))


Console output

Traceback (most recent call last):
File "C:/Users/20281/Desktop/代码文件/爬虫/requests库的使用/requests库简单的使用.py", line 7, in <module>
print(response.content.decode('utf-8'))
UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 158783: illegal multibyte sequence

Process finished with exit code 1

 

In plain terms: the GBK codec has no encoding for the character '\xbb', which appears at position 158783 of the page, so writing it out would produce an illegal multibyte sequence for GBK.
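The failure can be reproduced without the crawler at all. The character '\xbb' ('»') has a UTF-8 representation but no mapping in GBK, so encoding it with the GBK codec raises exactly the error in the traceback above (a minimal sketch, independent of the original script):

```python
ch = '\xbb'  # '»', the character flagged at position 158783

# UTF-8 can represent it as a two-byte sequence...
print(ch.encode('utf-8'))   # b'\xc2\xbb'

# ...but the GBK codec has no mapping for it, so encoding fails.
# This is what print() does implicitly when the console encoding is GBK.
try:
    ch.encode('gbk')
except UnicodeEncodeError as e:
    print(e)
```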

Why does only option 2 run successfully here? Option 1 displays on the console and option 3 displays in a plain-text file; each destination applies its own encoding rules, with no unified rule between them. Option 2's manual decoding matches the encoding the HTML format itself declares, so only option 2 succeeds.
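If console output is still wanted despite a GBK console, one common workaround (a sketch, not part of the original post) is to encode with errors='replace', so characters GBK cannot represent become '?' instead of raising:

```python
# Stand-in for response.content.decode('utf-8'); the '»' is not in GBK
text = '中文 \xbb more'

# Round-trip through GBK: unmappable characters are replaced with '?'
safe = text.encode('gbk', errors='replace').decode('gbk')
print(safe)   # 中文 ? more
```

errors='ignore' would instead drop the offending characters silently; 'replace' at least leaves a visible marker of what was lost.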


posted @ 2019-07-22 20:23  wss9806