Chapter 4, Section 4.4: Handling Encodings

1. Open the files with the default encoding:

(.venv) (base) metal@metal-Lenovo-Product:~/project/PAutomationCookbook/ch04$ echo $LANG
zh_CN.UTF-8
(.venv) (base) metal@metal-Lenovo-Product:~/project/PAutomationCookbook/ch04$ cat example_iso.txt
20�(.venv) (base) metal@metal-Lenovo-Product:~/project/PAutomationCookbook/ch04$ cat example_utf8.txt
20£
(.venv) (base) metal@metal-Lenovo-Product:~/project/PAutomationCookbook/ch04$ python
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('example_utf8.txt') as file:
...     print(file.read())
...
20£

# The UTF-8 encoded file opens without problems.

>>> with open('example_iso.txt') as file:
...     print(file.read())
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/metal/anaconda3/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 2: invalid start byte

# Opening the ISO-8859-1 encoded file with the default encoding (UTF-8 here, per $LANG) raises an error.
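
The same failure can be reproduced without any files. The following is a minimal sketch that uses only built-in str and bytes methods; the variable names are illustrative. It shows that the two encodings store the pound sign differently, and that the single byte 0xa3 produced by ISO-8859-1 is not valid UTF-8.

```python
text = "20£"

# The pound sign takes one byte in ISO-8859-1 but two bytes in UTF-8.
iso_bytes = text.encode("iso-8859-1")
utf8_bytes = text.encode("utf-8")
print(iso_bytes)   # b'20\xa3'
print(utf8_bytes)  # b'20\xc2\xa3'

# Decoding the ISO-8859-1 bytes as UTF-8 fails on the byte 0xa3 at position 2.
try:
    iso_bytes.decode("utf-8")
except UnicodeDecodeError as error:
    failure = error  # keep a reference; "error" is cleared after the except block
    print(failure.start, failure.reason)  # 2 invalid start byte
```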

2. Open the example_iso.txt file with the correct encoding:

>>> with open('example_iso.txt', encoding='iso-8859-1') as file:
...     print(file.read())
...
20£

3. Open the UTF-8 file and save its content to a file encoded in ISO-8859-1, then read that file back with the correct encoding:

>>> with open('example_utf8.txt') as file:
...     content = file.read()
...
>>> with open('example_output_iso.txt', 'w', encoding='iso-8859-1') as file:
...     file.write(content)
...
>>>
>>>
(.venv) (base) metal@metal-Lenovo-Product:~/project/PAutomationCookbook/ch04$ cat example_output_iso.txt
20�
(.venv) (base) metal@metal-Lenovo-Product:~/project/PAutomationCookbook/ch04$ python
Python 3.7.6 (default, Jan 8 2020, 19:59:22)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('example_output_iso.txt', encoding='iso-8859-1') as file:
...     print(file.read())
...
20£
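
The conversion in this step can be wrapped in a small function for reuse in a script. This is only a sketch: convert_encoding and the demo file names are hypothetical and not part of the original recipe, and the demonstration writes to a temporary directory so that no existing file is touched.

```python
import os
import tempfile


def convert_encoding(source, target, source_encoding, target_encoding):
    """Rewrite a text file in a different encoding (hypothetical helper)."""
    with open(source, encoding=source_encoding) as file:
        content = file.read()
    # Raises UnicodeEncodeError if a character has no representation
    # in the target encoding.
    with open(target, "w", encoding=target_encoding) as file:
        file.write(content)


# Demonstration on throwaway files.
workdir = tempfile.mkdtemp()
utf8_path = os.path.join(workdir, "demo_utf8.txt")
iso_path = os.path.join(workdir, "demo_iso.txt")

with open(utf8_path, "w", encoding="utf-8") as file:
    file.write("20£")

convert_encoding(utf8_path, iso_path, "utf-8", "iso-8859-1")

# Read the result in binary mode to see the bytes actually stored.
with open(iso_path, "rb") as file:
    raw = file.read()
print(raw)  # b'20\xa3'
```

Both encodings are passed explicitly, so the result does not depend on the locale of the machine that runs the script.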

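4. Use the Beautiful Soup module to detect the encoding of a particular file, because we often do not know which encoding a file uses. Open the file in binary mode with 'rb', then detect the encoding with Beautiful Soup's UnicodeDammit class, as follows: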

>>> from bs4 import UnicodeDammit
>>>
>>> with open('example_iso.txt', 'rb') as file:
...     content = file.read()
...
>>> suggestion = UnicodeDammit(content)
>>> suggestion.original_encoding
'windows-1252'
>>>
>>> with open('example_output_iso.txt', 'rb') as file:
...     content = file.read()
...
>>> suggestion = UnicodeDammit(content)
>>> suggestion.original_encoding
'windows-1252'

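# For the ISO-8859-1 file the detected encoding is windows-1252, which differs from the one used when writing the file. The file can still be opened with this encoding, as shown at the end of this session, because both encodings map the byte 0xa3 to £.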
>>> suggestion.unicode_markup
'20£\n'
>>> with open('example_utf8.txt', 'rb') as file:
...     content = file.read()
...
>>> suggestion = UnicodeDammit(content)
>>> suggestion.original_encoding
'utf-8'
>>>
>>>
>>> with open('example_output_iso.txt', encoding='windows-1252') as file:
...     print(file.read())
...
20£
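
Reading the file as windows-1252 works here because that encoding and ISO-8859-1 agree on the byte used for the pound sign. The sketch below uses only the codecs built into Python and illustrative variable names. It also shows that the two encodings are not interchangeable in general: they differ for bytes in the range 0x80 to 0x9f.

```python
pound = b"\xa3"

# Both encodings map the byte 0xa3 to the pound sign.
same_pound = pound.decode("iso-8859-1") == pound.decode("windows-1252") == "£"
print(same_pound)  # True

# They disagree on the byte 0x80: windows-1252 maps it to the euro sign,
# while ISO-8859-1 maps it to the control character U+0080.
byte_80 = b"\x80"
as_cp1252 = byte_80.decode("windows-1252")
as_latin1 = byte_80.decode("iso-8859-1")
print(as_cp1252 == "\u20ac")  # True
print(as_latin1 == "\x80")    # True
```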

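It is advisable to use the suggestion object only once, to find out the encoding, and then open the files with the correct encoding in the automated task.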
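
One way to follow this advice is to keep detection and reading in separate functions. The sketch below is an assumption about how that could look, not code from the original recipe: detect_encoding and read_all are hypothetical names, and detect_encoding requires the bs4 package to be installed.

```python
from pathlib import Path


def detect_encoding(path):
    """Guess a file's encoding from its raw bytes (hypothetical helper)."""
    # Imported inside the function so the rest of the sketch runs without bs4.
    from bs4 import UnicodeDammit

    raw = Path(path).read_bytes()
    # original_encoding may be None if no encoding could be determined.
    return UnicodeDammit(raw).original_encoding


def read_all(paths, encoding):
    """Read every file with one encoding that is already known."""
    return [Path(path).read_text(encoding=encoding) for path in paths]
```

A possible workflow is to call detect_encoding('example_iso.txt') once in an interactive session, note the result, and then pass that value as a fixed string to read_all in the automated task, so the task does not depend on a guess at run time.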

posted @ 2022-04-13 08:49  轻舞飞洋