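Matching Chinese content in web pages with Python regular expressions
To match the Chinese text in fetched web page content, the general approach is:
1. Read the content-type field of the HTTP header to find the character encoding of the page;
2. Decode the page content to unicode using that encoding;
3. Match with the regular expression [\u4e00-\u9fff] (the CJK Unified Ideographs block);
4. Encode the matched characters as utf-8.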
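The four steps can be tried without any network access. The following is a minimal Python 3 sketch, not the original demo: the helper names `charset_from_content_type` and `extract_chinese`, the UTF-8 fallback, and the sample GBK bytes are all assumptions made for illustration, and the range U+4E00 to U+9FFF covers only the basic CJK Unified Ideographs block.

```python
import re

# Assumed range: the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CHINESE_RUN = re.compile(r'[\u4e00-\u9fff]+')


def charset_from_content_type(content_type, default='utf-8'):
    """Step 1: return the charset named in a Content-Type value, or `default`."""
    for param in content_type.split(';')[1:]:
        name, _, value = param.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'')
    return default


def extract_chinese(body, content_type):
    """Steps 2-4: decode the raw bytes, match Chinese runs, encode each as UTF-8."""
    text = body.decode(charset_from_content_type(content_type), errors='replace')
    return [run.encode('utf-8') for run in CHINESE_RUN.findall(text)]


# Hypothetical sample: a GBK-encoded fragment standing in for a fetched page.
body = '<p>交通运输 Top Free 应用</p>'.encode('gbk')
for run in extract_chinese(body, 'text/html; charset=GBK'):
    print(run.decode('utf-8'))
```

Falling back to a default charset matters because many servers send a Content-Type header with no charset parameter at all.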
Demo:
#coding=utf-8

import re
import urllib2

if __name__ == '__main__':
    try:
        url = 'https://play.google.com/store/apps/category/TRANSPORTATION/collection/topselling_free?start=48&num=24'
        req = urllib2.Request(url)
        res = urllib2.urlopen(req)
        # get the content encoding from the content-type header
        encoding = res.headers['content-type'].split('charset=')[-1]
        # get http content
        data = res.read()
        # decode the raw bytes to unicode
        data = unicode(data, encoding)
        res.close()
        # match runs of Chinese characters with regex
        matches = re.findall(ur'[\u4e00-\u9fff]+', data)
        for item in matches:
            # encode with utf-8
            item = item.encode('utf-8')
            print item
    except Exception, e:
        print e
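The demo targets Python 2 (`urllib2`, `unicode`, `print` statements). For Python 3, a rough equivalent might look like the sketch below. The function name `fetch_chinese` and the UTF-8 fallback are assumptions, not part of the original; `get_content_charset()` is the standard-library method on the response headers that reads the charset parameter of Content-Type.

```python
import re
import urllib.request

# Assumed range: the CJK Unified Ideographs block (U+4E00 to U+9FFF).
CHINESE_RUN = re.compile(r'[\u4e00-\u9fff]+')


def fetch_chinese(url, default_charset='utf-8'):
    """Fetch `url` and return the runs of Chinese characters found in it."""
    with urllib.request.urlopen(url) as res:
        # get_content_charset() returns None when the server does not
        # declare a charset, so fall back to an assumed default.
        encoding = res.headers.get_content_charset() or default_charset
        data = res.read().decode(encoding, errors='replace')
    # In Python 3 the matches are already str, so no re-encoding is needed
    # before printing them.
    return CHINESE_RUN.findall(data)
```

Calling `fetch_chinese(url)` with the URL from the demo and printing each item would mirror the original loop; network failures surface as `urllib.error.URLError`, which is a subclass of `OSError`.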