Repost: Unicode Code Point Ranges

Unicode code point ranges (Unicode blocks)

0x0000-0x001F: Control characters (C0 controls, part of Basic Latin)
0x0020-0x007F: Basic Latin
0x0080-0x00FF: Latin-1 Supplement
0x0100-0x017F: Latin Extended-A
0x0180-0x024F: Latin Extended-B
0x0250-0x02AF: IPA Extensions
0x02B0-0x02FF: Spacing Modifier Letters
0x0300-0x036F: Combining Diacritical Marks
0x0370-0x03FF: Greek and Coptic
0x0400-0x04FF: Cyrillic
0x0500-0x052F: Cyrillic Supplement
0x0530-0x058F: Armenian
0x0590-0x05FF: Hebrew
0x0600-0x06FF: Arabic
0x0700-0x074F: Syriac
0x0750-0x077F: Arabic Supplement
0x0780-0x07BF: Thaana
0x07C0-0x07FF: NKo
0x0800-0x083F: Samaritan
0x0840-0x085F: Mandaic
0x0860-0x086F: Syriac Supplement
0x08A0-0x08FF: Arabic Extended-A
0x0900-0x097F: Devanagari
0x0980-0x09FF: Bengali
0x0A00-0x0A7F: Gurmukhi
0x0A80-0x0AFF: Gujarati
0x0B00-0x0B7F: Oriya
0x0B80-0x0BFF: Tamil
0x0C00-0x0C7F: Telugu
0x0C80-0x0CFF: Kannada
0x0D00-0x0D7F: Malayalam
0x0D80-0x0DFF: Sinhala
0x0E00-0x0E7F: Thai
0x0E80-0x0EFF: Lao
0x0F00-0x0FFF: Tibetan
0x1000-0x109F: Myanmar
0x10A0-0x10FF: Georgian
0x1100-0x11FF: Hangul Jamo
0x1200-0x137F: Ethiopic
0x1380-0x139F: Ethiopic Supplement
0x13A0-0x13FF: Cherokee
0x1400-0x167F: Unified Canadian Aboriginal Syllabics
0x1680-0x169F: Ogham
0x16A0-0x16FF: Runic
0x1700-0x171F: Tagalog
0x1720-0x173F: Hanunoo
0x1740-0x175F: Buhid
0x1760-0x177F: Tagbanwa
0x1780-0x17FF: Khmer
0x1800-0x18AF: Mongolian
0x18B0-0x18FF: Unified Canadian Aboriginal Syllabics Extended
0x1900-0x194F: Limbu
0x1950-0x197F: Tai Le
0x1980-0x19DF: New Tai Lue
0x19E0-0x19FF: Khmer Symbols
0x1A00-0x1A1F: Buginese
0x1A20-0x1AAF: Tai Tham
0x1AB0-0x1AFF: Combining Diacritical Marks Extended
0x1B00-0x1B7F: Balinese
0x1B80-0x1BBF: Sundanese
0x1BC0-0x1BFF: Batak
0x1C00-0x1C4F: Lepcha
0x1C50-0x1C7F: Ol Chiki
0x1C80-0x1C8F: Cyrillic Extended-C
0x1CC0-0x1CCF: Sundanese Supplement
0x1CD0-0x1CFF: Vedic Extensions
0x1D00-0x1D7F: Phonetic Extensions
0x1D80-0x1DBF: Phonetic Extensions Supplement
0x1DC0-0x1DFF: Combining Diacritical Marks Supplement
0x1E00-0x1EFF: Latin Extended Additional
0x1F00-0x1FFF: Greek Extended
0x2000-0x206F: General Punctuation
0x2070-0x209F: Superscripts and Subscripts
0x20A0-0x20CF: Currency Symbols
0x20D0-0x20FF: Combining Diacritical Marks for Symbols
0x2100-0x214F: Letterlike Symbols
0x2150-0x218F: Number Forms
0x2190-0x21FF: Arrows
0x2200-0x22FF: Mathematical Operators
0x2300-0x23FF: Miscellaneous Technical
0x2400-0x243F: Control Pictures
0x2440-0x245F: Optical Character Recognition
0x2460-0x24FF: Enclosed Alphanumerics
0x2500-0x257F: Box Drawing
0x2580-0x259F: Block Elements
0x25A0-0x25FF: Geometric Shapes
0x2600-0x26FF: Miscellaneous Symbols
0x2700-0x27BF: Dingbats
0x27C0-0x27EF: Miscellaneous Mathematical Symbols-A
0x27F0-0x27FF: Supplemental Arrows-A
0x2800-0x28FF: Braille Patterns
0x2900-0x297F: Supplemental Arrows-B
0x2980-0x29FF: Miscellaneous Mathematical Symbols-B
0x2A00-0x2AFF: Supplemental Mathematical Operators
0x2B00-0x2BFF: Miscellaneous Symbols and Arrows
0x2C00-0x2C5F: Glagolitic
0x2C60-0x2C7F: Latin Extended-C
0x2C80-0x2CFF: Coptic
0x2D00-0x2D2F: Georgian Supplement
0x2D30-0x2D7F: Tifinagh
0x2D80-0x2DDF: Ethiopic Extended
0x2DE0-0x2DFF: Cyrillic Extended-A
0x2E00-0x2E7F: Supplemental Punctuation
0x2E80-0x2EFF: CJK Radicals Supplement
0x2F00-0x2FDF: Kangxi Radicals
0x2FF0-0x2FFF: Ideographic Description Characters
0x3000-0x303F: CJK Symbols and Punctuation
0x3040-0x309F: Hiragana
0x30A0-0x30FF: Katakana
0x3100-0x312F: Bopomofo
0x3130-0x318F: Hangul Compatibility Jamo
0x3190-0x319F: Kanbun
0x31A0-0x31BF: Bopomofo Extended
0x31C0-0x31EF: CJK Strokes
0x31F0-0x31FF: Katakana Phonetic Extensions
0x3200-0x32FF: Enclosed CJK Letters and Months
0x3300-0x33FF: CJK Compatibility
0x3400-0x4DBF: CJK Unified Ideographs Extension A
0x4DC0-0x4DFF: Yijing Hexagram Symbols
0x4E00-0x9FFF: CJK Unified Ideographs
0xA000-0xA48F: Yi Syllables
0xA490-0xA4CF: Yi Radicals
0xA4D0-0xA4FF: Lisu
0xA500-0xA63F: Vai
0xA640-0xA69F: Cyrillic Extended-B
0xA6A0-0xA6FF: Bamum
0xA700-0xA71F: Modifier Tone Letters
0xA720-0xA7FF: Latin Extended-D
0xA800-0xA82F: Syloti Nagri
0xA830-0xA83F: Common Indic Number Forms
0xA840-0xA87F: Phags-pa
0xA880-0xA8DF: Saurashtra
0xA8E0-0xA8FF: Devanagari Extended
0xA900-0xA92F: Kayah Li
0xA930-0xA95F: Rejang
0xA960-0xA97F: Hangul Jamo Extended-A
0xA980-0xA9DF: Javanese
0xA9E0-0xA9FF: Myanmar Extended-B
0xAA00-0xAA5F: Cham
0xAA60-0xAA7F: Myanmar Extended-A
0xAA80-0xAADF: Tai Viet
0xAAE0-0xAAFF: Meetei Mayek Extensions
0xAB00-0xAB2F: Ethiopic Extended-A
0xAB30-0xAB6F: Latin Extended-E
0xAB70-0xABBF: Cherokee Supplement
0xABC0-0xABFF: Meetei Mayek
0xAC00-0xD7AF: Hangul Syllables
0xD7B0-0xD7FF: Hangul Jamo Extended-B
0xD800-0xDB7F: High Surrogates
0xDB80-0xDBFF: High Private Use Surrogates
0xDC00-0xDFFF: Low Surrogates
0xE000-0xF8FF: Private Use Area
0xF900-0xFAFF: CJK Compatibility Ideographs
0xFB00-0xFB4F: Alphabetic Presentation Forms (Latin ligatures, Armenian ligatures, Hebrew presentation forms)
0xFB50-0xFDFF: Arabic Presentation Forms-A
0xFE00-0xFE0F: Variation Selectors
0xFE10-0xFE1F: Vertical Forms
0xFE20-0xFE2F: Combining Half Marks
0xFE30-0xFE4F: CJK Compatibility Forms
0xFE50-0xFE6F: Small Form Variants
0xFE70-0xFEFF: Arabic Presentation Forms-B
0xFF00-0xFFEF: Halfwidth and Fullwidth Forms
0xFFF0-0xFFFF: Specials
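
A table like the one above can be turned into a lookup that maps a character to its block. The following is only a minimal sketch: `BLOCKS` holds a small hand-picked excerpt of the list, and the names `BLOCKS`, `STARTS`, and `block_of` are made up for this illustration.

```python
from bisect import bisect_right

# Hypothetical excerpt of the block list: (first, last, name), sorted by first.
# Only a few rows are included here; extend the table as needed.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xAC00, 0xD7AF, "Hangul Syllables"),
]
STARTS = [first for first, _, _ in BLOCKS]


def block_of(ch):
    """Return the block name for one character, or None if it is not in the excerpt."""
    cp = ord(ch)
    # Find the last block whose start is <= cp, then check its upper bound.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        first, last, name = BLOCKS[i]
        if first <= cp <= last:
            return name
    return None


for ch in ["a", "\u00c6", "\u03c1", "\u0644", "\u3053", "\u30c6", "\u6211", "\ub098", "\u20ac"]:
    print(f"U+{ord(ch):04X} {block_of(ch)}")
```

Because the excerpt has gaps, a character such as the euro sign (U+20AC) yields `None` rather than a wrong block name.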

 

# coding=utf-8

import re

s = '''
0123456789
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
こんにちは テキストコンテンツはコピーをサポートしいやだよ
나는 북경 천안문을 사랑한다
我爱北京天安门
我愛北京天安門
ÆÁÂÂÀÅÃÄÇÐÉÊÈËÍÎÌÏÑÓÔÒØÕÖÞÚÛÙÜÝ
áâæàåãäçéêèðëíîìïñóôòøõößþúûùüýÿ
Αρχαίαελληνικήγλώσσα
اللغة العربية
لغة عربية
º ¹ ² ³ ⁴⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁼ ⁽ ⁾ ⁿ  ½⁰⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ˙ᵃ ᵇ ᶜ ᵈ ᵉ ᵍ ʰ ⁱ ʲ ᵏ ˡ ᵐ ⁿ ᵒ ᵖ ᵒ ʳ ˢ ᵗ ᵘ ᵛ ʷ ˣ ʸ ᙆ ᴬ ᴮ ᒼ ᴰ ᴱ ᴳ ᴴ ᴵ ᴶ ᴷ ᴸ ᴹ ᴺ ᴼ ᴾ ᴼ̴ ᴿ ˢ ᵀ ᵁ ᵂ ˣ ᵞ ᙆ⁵⁵ᵍᵍ  ⁵ᴳ ₅ᵍ ₅ᵩ   ₅ᵩ⁽ⁿ⁻⁶⁾ᵃ ᵇ ᶜ ᵈ ᵉ ᵍ ʰ ⁱ ʲ ᵏ ˡ ᵐ ⁿ ᵒ ᵖ ᵒ ʳ ˢ ᵗ ᵘ ᵛ ʷ ˣ ʸ ᙆ ᴬ ᴮ ᒼ ᴰ ᴱ ᴳ ᴴ ᴵ ᴶ ᴷ ᴸ ᴹ ᴺ ᴼ ᴾ ᴼ̴ ᴿ ˢ ᵀ ᵁ ᵂ ˣ ᵞ ᙆ ꝰ ˀ ˁ ˤ ꟸ ꭜ ʱ ꭝ ꭞ ʴ ʵ ʶ ꭟ ˠ ꟹ ᴭ ᴯ ᴲ ᴻ ᴽ ᵄ ᵅ ᵆ ᵊ ᵋ ᵌ ᵑ ᵓ ᵚ ᵝ ᵞ ᵟ ᵠ ᵡ ᵎ ᵔ ᵕ ᵙ ᵜ ᶛ ᶜ ᶝ ᶞ ᶟ ᶡ ᶣ ᶤ ᶥ ᶦ ᶧ ᶨ ᶩ ᶪ ᶫ ᶬ ᶭ ᶮ ᶯ ᶰ ᶱ ᶲ ᶳ ᶴ ᶵ ᶶ ᶷ ᶸ ᶹ ᶺ ᶼ ᶽ ᶾ ᶿ ꚜ ꚝ ჼ ᒃ ᕻ ᑦ ᒄ ᕪ ᑋ ᑊ ᔿ ᐢ ᣕ ᐤ ᣖ ᣴ ᣗ ᔆ ᙚ ᐡ ᘁ ᐜ ᕽ ᙆ ᙇ ᒼ ᣳ ᒢ ᒻ ᔿ ᐤ ᣖ ᣵ ᙚ ᐪ ᓑ ᘁ ᐜ ᕽ ᙆ ᙇ ⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁼ ˂ ˃ ⁽ ⁾ ˙ * º
₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₋ ₌ ₍ ₎₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎ₐₑₒₓ
¥ €£¤
.?-*+[](){}<>@#=_~%&:。?:【】《》()
'''.strip()


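# UTF-8 encodes each character in 1 to 4 bytes.
# A US-ASCII character (U+0000 to U+007F) needs only 1 byte.
arr = re.findall(r'[0-9a-zA-Z]+', s)
print('ASCII letters and digits:')
print('\n'.join(arr))
print()


# Japanese kana (the original note quoted \u0800-\u4e00; the narrower \u3040-\u31FF is used here).
# This range also spans Bopomofo, Hangul Compatibility Jamo, Kanbun, and CJK Strokes; kanji are not matched.
arr = re.findall('[\u3040-\u31FF]+', s)
print('Japanese:')
print('\n'.join(arr))
print()

# Korean (Hangul Syllables)
arr = re.findall('[\uAC00-\uD7AF]+', s)
print('Korean:')
print('\n'.join(arr))
print()

# Chinese (a commonly quoted alternative is \u4E00-\u9FA5 plus \uf900-\ufa2d)
arr = re.findall('[\u4E00-\u9FD5]+', s)
print('Chinese:')
print('\n'.join(arr))
print()

# Superscripts and subscripts
arr = re.findall('[\u2070-\u209F]+', s)
print('Superscripts and subscripts:')
print(''.join(arr))
print()

# Common symbols (raw string so the backslash escapes reach the regex engine unchanged)
arr = re.findall(r'[\.\?\-\*\+\[\]\(\)\{\}<>@#=_~%&:。?:【】《》()]+', s)
print('Common symbols:')
print('\n'.join(arr))
print()


# \w matches (for str patterns, \w is Unicode-aware by default)
arr = re.findall(r'[\w]+', s)
print(r'\w matches:')
print(''.join(arr))
print()

# Control characters
arr = re.findall('[\u0000-\u001F]+', s)
print('Control characters:')
print(arr)
print()


# Characters in U+0080 to U+07FF (Latin letters with diacritics, Greek, Cyrillic, Hebrew, Arabic, and others) use 2 bytes.
# The rest of the Basic Multilingual Plane (U+0800 to U+FFFF), which covers most CJK and Southeast Asian characters in common use, uses 3 bytes.
# Characters outside the BMP (U+10000 to U+10FFFF), which are used far less often, use 4 bytes.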
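
The byte counts mentioned in the comments can be checked directly by encoding single characters. This is a small sketch rather than part of the original script; `utf8_len`, `expected_len`, and the sample characters are chosen here purely for illustration.

```python
def utf8_len(ch):
    """Number of bytes Python's UTF-8 encoder produces for one character."""
    return len(ch.encode("utf-8"))


def expected_len(cp):
    """Byte count predicted from the code point range alone."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


# Illustrative samples: Latin A, e with acute, Arabic lam, a CJK ideograph,
# a Hangul syllable, the euro sign, and an emoji outside the BMP.
# Surrogate code points (U+D800 to U+DFFF) are left out: they cannot be encoded on their own.
samples = ["A", "\u00e9", "\u0644", "\u6211", "\ub098", "\u20ac", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), predicted {expected_len(ord(ch))}")
```

The Arabic letter takes 2 bytes while the CJK ideograph and the Hangul syllable take 3, which is why the 2-byte and 3-byte ranges are worth keeping apart.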

  

Output of the regular-expression script:

ASCII letters and digits:
0123456789
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ

Japanese:
こんにちは
テキストコンテンツはコピーをサポートしいやだよ

Korean:
나는
북경
천안문을
사랑한다

Chinese:
我爱北京天安门
我愛北京天安門

Superscripts and subscripts:
⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿ⁰⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿⁱⁿ⁵⁵⁵₅₅₅⁽ⁿ⁻⁶⁾ⁱⁿ⁰⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎₀₁₂₃₄₅₆₇₈₉₊₋₌₍₎ₐₑₒₓ

Common symbols:
*
.?-*+[](){}<>@#=_~%&:。?:【】《》()

\w matches:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZこんにちはテキストコンテンツはコピーをサポートしいやだよ나는북경천안문을사랑한다我爱北京天安门我愛北京天安門ÆÁÂÂÀÅÃÄÇÐÉÊÈËÍÎÌÏÑÓÔÒØÕÖÞÚÛÙÜÝáâæàåãäçéêèðëíîìïñóôòøõößþúûùüýÿΑρχαίαελληνικήγλώσσαاللغةالعربيةلغةعربيةº¹²³⁴⁵⁶⁷⁸⁹ⁿ½⁰⁴⁵⁶⁷⁸⁹ⁿᵃᵇᶜᵈᵉᵍʰⁱʲᵏˡᵐⁿᵒᵖᵒʳˢᵗᵘᵛʷˣʸᙆᴬᴮᒼᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴼᴿˢᵀᵁᵂˣᵞᙆ⁵⁵ᵍᵍ⁵ᴳ₅ᵍ₅ᵩ₅ᵩⁿ⁶ᵃᵇᶜᵈᵉᵍʰⁱʲᵏˡᵐⁿᵒᵖᵒʳˢᵗᵘᵛʷˣʸᙆᴬᴮᒼᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴼᴿˢᵀᵁᵂˣᵞᙆꝰˀˁˤꟸꭜʱꭝꭞʴʵʶꭟˠꟹᴭᴯᴲᴻᴽᵄᵅᵆᵊᵋᵌᵑᵓᵚᵝᵞᵟᵠᵡᵎᵔᵕᵙᵜᶛᶜᶝᶞᶟᶡᶣᶤᶥᶦᶧᶨᶩᶪᶫᶬᶭᶮᶯᶰᶱᶲᶳᶴᶵᶶᶷᶸᶹᶺᶼᶽᶾᶿꚜꚝჼᒃᕻᑦᒄᕪᑋᑊᔿᐢᣕᐤᣖᣴᣗᔆᙚᐡᘁᐜᕽᙆᙇᒼᣳᒢᒻᔿᐤᣖᣵᙚᐪᓑᘁᐜᕽᙆᙇ⁰¹²³⁴⁵⁶⁷⁸⁹º₀₁₂₃₄₅₆₇₈₉₀₁₂₃₄₅₆₇₈₉ₐₑₒₓ_

Control characters:
['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']
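
The `\w` result is much broader than the ASCII one because, for `str` patterns, Python's `re` module treats `\w` as Unicode-aware by default. Passing `re.ASCII` restricts it to `[a-zA-Z0-9_]`. The short sample text below is an assumption made for this sketch, not the full test string from the script.

```python
import re

# Hypothetical short sample mixing ASCII, Chinese, kana, superscripts and subscripts.
text = "abc_123 我爱 こんにちは ⁴⁵ ₀₁"

# Default behaviour: \w matches letters and digits from any script, plus underscore.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

When only ASCII identifiers are wanted, `re.ASCII` (or an explicit character class) avoids picking up CJK text and superscript digits.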

  

Sources:

https://blog.csdn.net/sdibt513/article/details/89641187

https://blog.csdn.net/ztf312/article/details/76670851?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-1-76670851-blog-89641187.235^v38^pc_relevant_anti_t3&spm=1001.2101.3001.4242.2&utm_relevant_index=4

https://blog.csdn.net/tfstone/article/details/87877462?spm=1001.2101.3001.6650.7&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-7-87877462-blog-89641187.235%5Ev38%5Epc_relevant_anti_t3&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromBaidu%7ERate-7-87877462-blog-89641187.235%5Ev38%5Epc_relevant_anti_t3&utm_relevant_index=12

https://www.ssec.wisc.edu/~tomw/java/unicode.html#x0100

posted @ 2023-06-07 14:05  河北大学-徐小波