Python character encoding conversion

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# If text still comes out garbled even though the file header declares UTF-8, set the default encoding as shown below

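# Fix the encoding problem (Python 2 only: Python 3 has no built-in reload() and no sys.setdefaultencoding())
import sys
reload(sys)  # restores sys.setdefaultencoding, which is removed from sys at interpreter startup
sys.setdefaultencoding('utf8')  # set the default encoding
sys.getdefaultencoding()  # get the default encoding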


-------------------------------------------------------

sys.path.append('../')  # add the parent directory to the module search path

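# coding=utf-8

s = '中文'
if isinstance(s, unicode):
    # s is a unicode object such as u'中文': encode it directly
    result = s.encode('gb2312')
else:
    # s is a UTF-8 byte string such as '中文': decode to unicode first, then encode
    result = s.decode('utf8').encode('gb2312')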

Python represents strings internally as Unicode, so when you convert between encodings, Unicode naturally serves as the intermediate form.

For example:

To convert GBK to UTF-8:

gbk --> unicode --> utf-8

This breaks down into two steps:

1. gbk --> unicode

Python syntax: byte_string.decode('gbk')

2. unicode --> utf-8

Python syntax: byte_string.decode('gbk').encode('utf-8')
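
The two steps above can be sketched as follows. This is a minimal illustration, not the original author's code: the variable names are made up for the example, and the GBK input is simulated by encoding a Unicode literal. It is written so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: all variable names here are illustrative assumptions.

original = u'中文'                    # text as a Unicode string
gbk_bytes = original.encode('gbk')    # simulate input arriving as GBK bytes: b'\xd6\xd0\xce\xc4'

# Step 1: gbk --> unicode
as_unicode = gbk_bytes.decode('gbk')

# Step 2: unicode --> utf-8
utf8_bytes = as_unicode.encode('utf-8')   # b'\xe4\xb8\xad\xe6\x96\x87'

# The same conversion written as a single chained expression
chained = gbk_bytes.decode('gbk').encode('utf-8')
```

The intermediate value as_unicode is the only form that is independent of any byte encoding, which is why every conversion passes through it.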

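A string that is already a unicode object can be encoded directly, but it should not be decoded again: in Python 2, calling decode on it first triggers an implicit ASCII encode, which fails for non-ASCII characters. The code therefore has to check the type first. The built-in function isinstance() (from Python 2's __builtin__ module) can test a value against any Python type, so it can also tell whether a value is unicode: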

import chardet  # chardet.detect guesses the encoding of a byte string and reports a confidence value

if isinstance(yourchar, unicode):
    communicate = yourchar.encode('utf-8')  # already unicode: encode straight to UTF-8
else:
    # This branch has not been tested. If it fails, use the known encoding directly:
    # communicate = yourchar.decode('your current encoding', errors='ignore').encode('utf-8')
    type_decode = chardet.detect(yourchar)['encoding']
    communicate = yourchar.decode(type_decode, errors='ignore').encode('utf-8')
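
The same type check can be wrapped in a small helper that avoids guessing the encoding. This is only a sketch under stated assumptions: to_utf8 and source_encoding are hypothetical names, the caller is assumed to know the source encoding, and chardet is deliberately left out so the example depends on nothing beyond the standard library. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: `to_utf8` and `source_encoding` are hypothetical names for illustration.

text_type = type(u'')  # `unicode` on Python 2, `str` on Python 3


def to_utf8(value, source_encoding='gbk'):
    """Return `value` as UTF-8 encoded bytes.

    Unicode text is encoded directly; byte strings are first decoded
    from `source_encoding` (assumed to be known by the caller).
    """
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value.decode(source_encoding).encode('utf-8')


from_text = to_utf8(u'中文')                       # input is already Unicode
from_gbk = to_utf8(b'\xd6\xd0\xce\xc4', 'gbk')     # input is GBK bytes for the same text
```

Passing the encoding explicitly trades convenience for predictability: detection libraries return a best guess, which may be wrong for short inputs.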

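errors:

The errors argument controls what happens when a character cannot be converted, for example when the target is ASCII, which covers only 128 code points. For fault tolerance there are three levels:

errors='strict' # the default and the strictest: any character that cannot be converted raises an exception

errors='replace' # substitutes U+FFFD, 'REPLACEMENT CHARACTER', when decoding (or '?' when encoding)

errors='ignore' # silently drops the offending character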
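
The three error levels can be compared side by side. This is a minimal sketch with made-up variable names; the input deliberately contains the byte 0xFF, which is never valid in UTF-8. It should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: variable names are illustrative assumptions.

bad = b'ab\xffcd'  # 0xFF cannot appear in valid UTF-8

# strict (the default): the invalid byte raises an exception
try:
    bad.decode('utf-8')
    strict_result = 'no error'
except UnicodeDecodeError:
    strict_result = 'UnicodeDecodeError'

# replace: the invalid byte becomes U+FFFD when decoding
replaced = bad.decode('utf-8', 'replace')   # u'ab\ufffdcd'

# ignore: the invalid byte is dropped
ignored = bad.decode('utf-8', 'ignore')     # u'abcd'

# When encoding, replace substitutes '?' for each unencodable character
encoded = u'中文'.encode('ascii', 'replace')  # b'??'
```

Both replace and ignore lose information, so strict is usually the safer choice while debugging an encoding problem.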

Posted on 2016-11-22 20:33 by 折翼的飞鸟
