[Web Development with Python] (4) -- Python Basics: Chinese Characters

The previous post mentioned the length of Chinese characters. This post runs a few tests on that point, using the following code:

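#!/usr/bin/python
#-*- coding: utf-8 -*-
# Python 2 script; the source file is saved as UTF-8.
s = "中国"      # byte string (str)
ss = u"中国"    # unicode string

print s, type(s), len(s)
print ss, type(ss), len(ss)
print '-' * 40
# repr() shows the underlying bytes / code points
print repr(s)
print repr(ss)
print '-' * 40
# UTF-8 bytes -> unicode
s1 = s.decode('utf-8')
print s1,len(s1),type(s1)
print '-' * 40
# UTF-8 bytes -> unicode -> GBK bytes
s2 = s.decode('utf-8').encode('gbk')
print s2
print type(s2)
print len(s2)
print '-' * 40
# unicode -> GBK bytes directly
s3 = ss.encode('gbk')
print s3
print type(s3)
print len(s3)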

Running it produces the following output:

中国 <type 'str'> 6
中国 <type 'unicode'> 2
----------------------------------------
'\xe4\xb8\xad\xe5\x9b\xbd'
u'\u4e2d\u56fd'
----------------------------------------
中国 2 <type 'unicode'>
----------------------------------------
�й
<type 'str'>
4
----------------------------------------
�й
<type 'str'>
4
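
The two garbled lines are most likely not a bug in the script: s2 and s3 hold GBK bytes, and a terminal that expects UTF-8 cannot render them. Below is a minimal sketch of how to inspect such bytes instead of printing them raw. The names text, gbk_bytes, roundtrip and shown are illustrative, and the code is written to run under both Python 2 and Python 3, which is an assumption beyond the original Python 2 script.

```python
# -*- coding: utf-8 -*-
text = u"中国"                       # text string, 2 characters
gbk_bytes = text.encode("gbk")       # 4 bytes here
utf8_bytes = text.encode("utf-8")    # 6 bytes here

# repr() shows the raw bytes without depending on the terminal encoding.
print(repr(gbk_bytes))
print(len(gbk_bytes))
print(len(utf8_bytes))

# Decoding with the matching codec recovers the original text.
roundtrip = gbk_bytes.decode("gbk")

# Decoding GBK bytes as UTF-8 with replacement approximates what a
# UTF-8 terminal displays: replacement characters plus a stray letter.
shown = gbk_bytes.decode("utf-8", "replace")
print(repr(shown))
```

Printing repr() of a byte string is usually the safer way to debug encoding problems, since the result does not change with the terminal settings.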

Addendum:

To check Python's default encoding:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
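
The 'ascii' default matters because Python 2 uses it whenever a str and a unicode object are mixed implicitly, for example s + ss, and ASCII cannot decode the bytes of Chinese text. The sketch below reproduces that failure with an explicit decode so that it also runs under Python 3, where the default is 'utf-8'. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
import sys

print(sys.getdefaultencoding())

data = u"中国".encode("utf-8")   # UTF-8 bytes, like s in the script

# What Python 2 attempts implicitly when str and unicode are mixed.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)

# Decoding explicitly with the right codec avoids relying on the default.
text = data.decode("utf-8")
print(len(text))
```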

Because the top of the file declares #-*- coding: utf-8 -*-, the byte string s is already encoded in UTF-8.

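In UTF-8, an English letter takes 1 byte and a common Chinese character such as 中 takes 3 bytes;

in a unicode object, each of these Chinese characters counts as 1 character (one code point, which takes 2 bytes when encoded as UTF-16);

in GBK, a Chinese character takes 2 bytes. (Thanks to keakon for the correction.)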
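
These byte counts can be checked directly. Here is a small sketch, assuming a hypothetical helper byte_len that is not part of the original post. The codec utf-16-le stands in for the "double-byte" form of a unicode character and is chosen so that no byte order mark is counted.

```python
# -*- coding: utf-8 -*-
def byte_len(text, encoding):
    """Return how many bytes `text` occupies in the given encoding."""
    return len(text.encode(encoding))

# One ASCII letter and one common Chinese character.
for ch in (u"a", u"中"):
    for enc in ("utf-8", "gbk", "utf-16-le"):
        print("%r %s %d" % (ch, enc, byte_len(ch, enc)))

# len() on the text itself counts characters, not bytes.
print(len(u"中国"))
```

The same text therefore has one length in characters and a different length in bytes for each encoding, which is why len(s) and len(ss) disagree in the test above.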

posted @ 2009-03-15 13:55 pangzi
