Strings and Unicode

#coding=utf-8
"""
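In Python 3, the text string type (which stores Unicode data) is named str, and the byte string type is named bytes.
In Python 2, the text string type is named unicode, while the byte string type is named str.
This means that str is a text string in Python 3 but a byte string in Python 2.

Unlike Python 3, Python 2 attempts implicit conversion between text strings and byte strings.
When the interpreter encounters an operation that mixes the two kinds of string, it first decodes
the byte string into a text string and then performs the operation on text strings.
The implicit conversion uses the default encoding ('ascii' in Python 2), which can be inspected as follows: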
import sys
print(sys.getdefaultencoding())

"""
test_str = u'\u03b1 is for alpha'

print (test_str.encode('utf-8'))
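
# A minimal sketch of the str/bytes split described above, assuming Python 3.
# The sample value and the names text, data, round_trip and mixed_ok are illustrative only.

```python
# Hypothetical sample value, chosen only for illustration.
text = '\u03b1 is for alpha'          # str: a sequence of Unicode code points
data = text.encode('utf-8')           # bytes: the UTF-8 encoded form

print(type(text), type(data))

# Encoding and decoding with the same codec round-trips the text.
round_trip = data.decode('utf-8')

# Python 3 refuses to mix the two types instead of converting implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```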

# print (test_str.encode('utf-8').encode('utf-8'))

# Under Python 2, the commented-out double encode() fails during implicit conversion with this error:
"""
Traceback (most recent call last):
  File "D:/code/test/字符串与unicode.py", line 18, in <module>
    print test_str.encode('utf-8').encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
"""

# To the Python 2 interpreter, that double encode() is equivalent to:
# print (test_str.encode('utf-8').decode('ascii').encode('utf-8'))
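
# The hidden decode step can be reproduced explicitly. This is only a sketch, assuming Python 3,
# where the ASCII decode has to be written out by hand; encoded and error_message are illustrative names.

```python
encoded = u'\u03b1 is for alpha'.encode('utf-8')   # bytes starting with 0xce

# Python 2 would implicitly run this ASCII decode before the second encode().
try:
    encoded.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```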

"""
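If you are using Python 2.6 or later, you can use from __future__ import unicode_literals.
This is an import rather than a method call: once it is in effect, string literals without
a prefix in that module become unicode.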



"""

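# Reading files *****************************************************

# Get the default encoding used when reading files

import locale
print(locale.getpreferredencoding())  # depends on the underlying system; files, however, are not always saved and opened on the same operating system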
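
# Because the preferred encoding varies between systems, passing encoding explicitly is the safer habit.
# A hedged sketch, assuming Python 3; the sample text, the temporary path and the names sample,
# path and read_back are made up for the demonstration.

```python
import locale
import os
import tempfile

# Hypothetical sample text and a throwaway file, used only for this sketch.
sample = 'Γεια σου.'
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

# Write and read with the same explicit encoding, independent of the locale.
with open(path, 'w', encoding='utf-8') as f:
    f.write(sample)
with open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

# The locale's preferred encoding is typically what open() uses when none is given.
print(locale.getpreferredencoding(False))
print(read_back == sample)
```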

# Read the file as text

with open('test.txt','r',encoding='utf-8') as f:
    text = f.read()

print(type(text))


# Read the file as bytes

with open('test.txt','rb') as f:
    text = f.read()
print(type(text))

print(text.decode('ascii',errors='replace'))
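
# To see what the error handlers do without depending on test.txt, here is a small sketch,
# assuming Python 3. The Greek sample string and the names greek, replaced and ignored are illustrative.

```python
greek = 'Γεια σου.'.encode('utf-8')   # every Greek letter takes two bytes

replaced = greek.decode('ascii', errors='replace')   # one U+FFFD per bad byte
ignored = greek.decode('ascii', errors='ignore')     # bad bytes are dropped

print(len(greek), len(replaced), repr(ignored))
```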

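# Because Python 2 always returns byte strings, its built-in open() has no encoding parameter; passing one raises an exception.
# If the code must also run on Python 2, the best approach is to read the file in binary mode (using b) and decode it yourself when you need text.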
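
# The binary-mode approach can be sketched as follows. read_text is a hypothetical helper,
# not part of any library, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile

def read_text(path, encoding='utf-8'):
    """Read a file as bytes and decode it explicitly (hypothetical helper)."""
    with open(path, 'rb') as f:
        raw = f.read()
    return raw.decode(encoding)

# Throwaway file for the demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'wb') as f:
    f.write(u'\u0393\u03b5\u03b9\u03b1 \u03c3\u03bf\u03c5.'.encode('utf-8'))

content = read_text(demo_path)
print(len(content))
```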


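# Strict codecs ********************************************
# UTF-8 is a strict codec: it does not simply accept an arbitrary byte stream and decode it; it can usually detect byte sequences that are not valid UTF-8 and raise an error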
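
# A short sketch of that strictness, assuming Python 3. latin-1 is used here only as a contrast,
# since it accepts every byte value; invalid, utf8_error and as_latin1 are illustrative names.

```python
invalid = b'\xff\xfe abc'   # 0xff can never appear in valid UTF-8

try:
    invalid.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# latin-1 maps every byte to a code point, so the same bytes decode without error.
as_latin1 = invalid.decode('latin-1')

print(utf8_error is not None, len(as_latin1))
```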


# Decoding the Greek text as ASCII with the replace error handler gives the following result:

"""                               
Testing
context
managers,
�������� ������.

"""
"""
Original content:
Testing
context
managers,
Γεια σου.
"""

# Register a custom error handler
import codecs

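def replace_with_underscore(err):
    # err is the UnicodeError raised by the codec; err.start and err.end delimit the bad bytes
    length = err.end - err.start
    # return the replacement text and the position at which decoding should resume
    return ('_'*length,err.end)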

codecs.register_error('replace_with_underscore',replace_with_underscore)

with open('test.txt','r',encoding='utf-8') as f:
    text = f.read()

print(text)

print(text.encode('utf-8').decode('ascii','replace_with_underscore'))
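
# A self-contained variant that does not need test.txt, offered as a sketch assuming Python 3.
# The handler name underscore_per_byte and the sample string are hypothetical; the type check
# simply makes the handler refuse errors it was not written for.

```python
import codecs

def underscore_per_byte(err):
    """Hypothetical handler: one underscore per undecodable byte."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return ('_' * (err.end - err.start), err.end)

codecs.register_error('underscore_per_byte', underscore_per_byte)

masked = 'Γεια σου.'.encode('utf-8').decode('ascii', 'underscore_per_byte')
print(masked)
```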

# Summary: decode byte strings as soon as you receive them; likewise, when producing output,
# encode text strings into byte strings as late as possible
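
# The "decode early, encode late" rule can be sketched like this. reverse_text is a hypothetical
# helper; reversing is used only because it visibly breaks when applied to raw multi-byte data.

```python
def reverse_text(raw, encoding='utf-8'):
    """Hypothetical helper: bytes in, bytes out, text in between."""
    text = raw.decode(encoding)          # decode as early as possible
    result = text[::-1]                  # all processing happens on text
    return result.encode(encoding)       # encode as late as possible

source = u'\u03b1\u03b2\u03b3'.encode('utf-8')
out = reverse_text(source)

# Reversing the raw bytes instead would split the two-byte sequences.
try:
    source[::-1].decode('utf-8')
    bytes_reversal_valid = True
except UnicodeDecodeError:
    bytes_reversal_valid = False

print(out == u'\u03b3\u03b2\u03b1'.encode('utf-8'), bytes_reversal_valid)
```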
posted @ 2020-08-06 10:52 ~相忘于江湖