This article is an excerpt from XML Schema Part 2: Datatypes Second Edition, posted here for reference.

F.1.1 Character Class Escapes

[Definition:]   A character class escape is a short sequence of characters that identifies a predefined character class. The valid character class escapes are the ·single character escape·s, the ·multi-character escape·s, and the ·category escape·s (including the ·block escape·s).


Character Class Escape
[23]    charClassEsc    ::=    ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

[Definition:]   A single character escape identifies a set containing only one character -- usually because that character is difficult or impossible to write directly into a ·regular expression·.

Single Character Escape
[24]    SingleCharEsc    ::=    '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

The valid ·single character escape·s are listed below. Each one identifies the set of characters C(R) containing the character shown next to it:
\n the newline character (#xA)
\r the return character (#xD)
\t the tab character (#x9)
\\ \
\| |
\. .
\- -
\^ ^
\? ?
\* *
\+ +
\{ {
\} }
\( (
\) )
\[ [
\] ]
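
The table above can be read as a simple lookup. The following is a minimal sketch, not part of the specification: the function name resolve_single_char_escape and the two helper tables are hypothetical names chosen for illustration, and the escape is assumed to arrive as a two-character string.

```python
# Illustrative lookup for production [24]; these names are not defined by the specification.
_CONTROL_ESCAPES = {"n": "\n", "r": "\r", "t": "\t"}
# The fourteen characters that an escape maps to themselves: \ | . - ^ ? * + { } ( ) [ ]
_LITERAL_ESCAPES = set("\\|.-^?*+{}()[]")


def resolve_single_char_escape(escape: str) -> str:
    """Return the one character identified by a single character escape."""
    if len(escape) != 2 or escape[0] != "\\":
        raise ValueError(f"not a single character escape: {escape!r}")
    symbol = escape[1]
    if symbol in _CONTROL_ESCAPES:
        # \n, \r and \t stand for newline, return and tab.
        return _CONTROL_ESCAPES[symbol]
    if symbol in _LITERAL_ESCAPES:
        # Every other valid escape stands for the escaped character itself.
        return symbol
    raise ValueError(f"not a single character escape: {escape!r}")
```

Anything outside the seventeen listed escapes, such as \d, is rejected here because it belongs to a different production.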

[Definition:]   [Unicode Database] specifies a number of possible values for the "General Category" property and provides mappings from code points to specific character properties. The set containing all characters that have property X can be identified with a category escape \p{X}. The complement of this set is specified with the category escape \P{X}. ([\P{X}] = [^\p{X}]).

Category Escape
[25]    catEsc    ::=    '\p{' charProp '}'
[26]    complEsc    ::=    '\P{' charProp '}'
[27]    charProp    ::=    IsCategory | IsBlock
Note:  [Unicode Database] is subject to future revision. For example, the mapping from code points to character properties might be updated. All ·minimally conforming· processors ·must· support the character properties defined in the version of [Unicode Database] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the character properties defined in any future version.
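
As a rough illustration of productions [25] and [26], the sketch below splits a category escape into a complement flag and the charProp text. The name parse_category_escape is hypothetical, and the property text is returned without being checked against IsCategory or IsBlock.

```python
import re

# '\p{' charProp '}' or '\P{' charProp '}'; charProp is captured but not validated here.
_CAT_ESC = re.compile(r"\\([pP])\{([^}]+)\}")


def parse_category_escape(escape: str) -> tuple[bool, str]:
    """Return (is_complement, char_prop) for a \\p{...} or \\P{...} escape."""
    match = _CAT_ESC.fullmatch(escape)
    if match is None:
        raise ValueError(f"not a category escape: {escape!r}")
    # An upper-case P selects the complement of the named set.
    return match.group(1) == "P", match.group(2)
```

For instance, parse_category_escape(r"\P{Lu}") yields a true complement flag together with the property name Lu.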

The following table specifies the recognized values of the "General Category" property.

Category      Property   Meaning
Letters       L          All Letters
              Lu         uppercase
              Ll         lowercase
              Lt         titlecase
              Lm         modifier
              Lo         other

Marks         M          All Marks
              Mn         nonspacing
              Mc         spacing combining
              Me         enclosing

Numbers       N          All Numbers
              Nd         decimal digit
              Nl         letter
              No         other

Punctuation   P          All Punctuation
              Pc         connector
              Pd         dash
              Ps         open
              Pe         close
              Pi         initial quote (may behave like Ps or Pe depending on usage)
              Pf         final quote (may behave like Ps or Pe depending on usage)
              Po         other

Separators    Z          All Separators
              Zs         space
              Zl         line
              Zp         paragraph

Symbols       S          All Symbols
              Sm         math
              Sc         currency
              Sk         modifier
              So         other

Other         C          All Others
              Cc         control
              Cf         format
              Co         private use
              Cn         not assigned

Categories
[28]    IsCategory    ::=    Letters | Marks | Numbers | Punctuation | Separators | Symbols | Others
[29]    Letters    ::=    'L' [ultmo]?
[30]    Marks    ::=    'M' [nce]?
[31]    Numbers    ::=    'N' [dlo]?
[32]    Punctuation    ::=    'P' [cdseifo]?
[33]    Separators    ::=    'Z' [slp]?
[34]    Symbols    ::=    'S' [mcko]?
[35]    Others    ::=    'C' [cfon]?
Note:  The properties mentioned above exclude the Cs property. The Cs property identifies "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.
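
A category test along these lines can be sketched with Python's standard unicodedata module. This is an approximation under stated assumptions: in_category is a hypothetical name, and unicodedata reflects whichever Unicode version ships with the Python interpreter, which may differ from the version the specification refers to.

```python
import re
import unicodedata

# Productions [28]-[35]: a major class letter optionally followed by one subclass letter.
_IS_CATEGORY = re.compile(
    r"L[ultmo]?|M[nce]?|N[dlo]?|P[cdseifo]?|Z[slp]?|S[mcko]?|C[cfon]?"
)


def in_category(ch: str, prop: str) -> bool:
    """Return True if the single character ch is in the set identified by \\p{prop}."""
    if not _IS_CATEGORY.fullmatch(prop):
        raise ValueError(f"not a recognized General Category value: {prop!r}")
    category = unicodedata.category(ch)
    if category == "Cs":
        # Surrogates are excluded from the properties described above.
        return False
    # A one-letter property such as "L" covers every subclass that starts with it.
    return category.startswith(prop)
```

The complement escape \P{X} then corresponds to `not in_category(ch, X)`.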

[Definition:]   [Unicode Database] groups code points into a number of blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul Jamo, CJK Compatibility, etc. The set containing all characters that have block name X (with all white space stripped out) can be identified with a block escape \p{IsX}. The complement of this set is specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).

Block Escape
[36]    IsBlock    ::=    'Is' [a-zA-Z0-9#x2D]+

The following table specifies the recognized block names (for more information, see the "Blocks.txt" file in [Unicode Database]).

Start Code End Code Block Name   Start Code End Code Block Name
#x0000 #x007F BasicLatin   #x0080 #x00FF Latin-1Supplement
#x0100 #x017F LatinExtended-A   #x0180 #x024F LatinExtended-B
#x0250 #x02AF IPAExtensions   #x02B0 #x02FF SpacingModifierLetters
#x0300 #x036F CombiningDiacriticalMarks   #x0370 #x03FF Greek
#x0400 #x04FF Cyrillic   #x0530 #x058F Armenian
#x0590 #x05FF Hebrew   #x0600 #x06FF Arabic
#x0700 #x074F Syriac   #x0780 #x07BF Thaana
#x0900 #x097F Devanagari   #x0980 #x09FF Bengali
#x0A00 #x0A7F Gurmukhi   #x0A80 #x0AFF Gujarati
#x0B00 #x0B7F Oriya   #x0B80 #x0BFF Tamil
#x0C00 #x0C7F Telugu   #x0C80 #x0CFF Kannada
#x0D00 #x0D7F Malayalam   #x0D80 #x0DFF Sinhala
#x0E00 #x0E7F Thai   #x0E80 #x0EFF Lao
#x0F00 #x0FFF Tibetan   #x1000 #x109F Myanmar
#x10A0 #x10FF Georgian   #x1100 #x11FF HangulJamo
#x1200 #x137F Ethiopic   #x13A0 #x13FF Cherokee
#x1400 #x167F UnifiedCanadianAboriginalSyllabics   #x1680 #x169F Ogham
#x16A0 #x16FF Runic   #x1780 #x17FF Khmer
#x1800 #x18AF Mongolian   #x1E00 #x1EFF LatinExtendedAdditional
#x1F00 #x1FFF GreekExtended   #x2000 #x206F GeneralPunctuation
#x2070 #x209F SuperscriptsandSubscripts   #x20A0 #x20CF CurrencySymbols
#x20D0 #x20FF CombiningMarksforSymbols   #x2100 #x214F LetterlikeSymbols
#x2150 #x218F NumberForms   #x2190 #x21FF Arrows
#x2200 #x22FF MathematicalOperators   #x2300 #x23FF MiscellaneousTechnical
#x2400 #x243F ControlPictures   #x2440 #x245F OpticalCharacterRecognition
#x2460 #x24FF EnclosedAlphanumerics   #x2500 #x257F BoxDrawing
#x2580 #x259F BlockElements   #x25A0 #x25FF GeometricShapes
#x2600 #x26FF MiscellaneousSymbols   #x2700 #x27BF Dingbats
#x2800 #x28FF BraillePatterns   #x2E80 #x2EFF CJKRadicalsSupplement
#x2F00 #x2FDF KangxiRadicals   #x2FF0 #x2FFF IdeographicDescriptionCharacters
#x3000 #x303F CJKSymbolsandPunctuation   #x3040 #x309F Hiragana
#x30A0 #x30FF Katakana   #x3100 #x312F Bopomofo
#x3130 #x318F HangulCompatibilityJamo   #x3190 #x319F Kanbun
#x31A0 #x31BF BopomofoExtended   #x3200 #x32FF EnclosedCJKLettersandMonths
#x3300 #x33FF CJKCompatibility   #x3400 #x4DB5 CJKUnifiedIdeographsExtensionA
#x4E00 #x9FFF CJKUnifiedIdeographs   #xA000 #xA48F YiSyllables
#xA490 #xA4CF YiRadicals   #xAC00 #xD7A3 HangulSyllables
#xE000 #xF8FF PrivateUse
#xF900 #xFAFF CJKCompatibilityIdeographs   #xFB00 #xFB4F AlphabeticPresentationForms
#xFB50 #xFDFF ArabicPresentationForms-A   #xFE20 #xFE2F CombiningHalfMarks
#xFE30 #xFE4F CJKCompatibilityForms   #xFE50 #xFE6F SmallFormVariants
#xFE70 #xFEFE ArabicPresentationForms-B   #xFEFF #xFEFF Specials
#xFF00 #xFFEF HalfwidthandFullwidthForms   #xFFF0 #xFFFD Specials
Note:  The blocks mentioned above exclude the HighSurrogates, LowSurrogates and HighPrivateUseSurrogates blocks. These blocks identify "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.
Note:  [Unicode Database] is subject to future revision. For example, the grouping of code points into blocks might be updated. All ·minimally conforming· processors ·must· support the blocks defined in the version of [Unicode Database] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the blocks defined in any future version of the Unicode Standard.

For example, the ·block escape· for identifying the ASCII characters is \p{IsBasicLatin}.
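
A block escape amounts to a range check on the code point. The sketch below assumes a deliberately small subset of the table above (six rows copied from it). The names BLOCKS and in_block are hypothetical, and a fuller version would carry every row.

```python
import re

# A few rows copied from the block table; a complete version would list all of them.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
}

# Production [36]: 'Is' followed by one or more letters, digits or hyphens.
_IS_BLOCK = re.compile(r"Is([a-zA-Z0-9\-]+)")


def in_block(ch: str, char_prop: str) -> bool:
    """Return True if the single character ch is in the set identified by \\p{char_prop}."""
    match = _IS_BLOCK.fullmatch(char_prop)
    if match is None:
        raise ValueError(f"not a block property: {char_prop!r}")
    name = match.group(1)
    if name not in BLOCKS:
        raise KeyError(f"block not in this illustrative subset: {name}")
    start, end = BLOCKS[name]
    # Both the start and end code points are part of the block.
    return start <= ord(ch) <= end
```

With this subset, in_block("A", "IsBasicLatin") is true, while the same call for "\u00e9" is false because that character lies in Latin-1Supplement.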

[Definition:]   A multi-character escape provides a simple way to identify a commonly used set of characters:

Multi-Character Escape
[37]    MultiCharEsc    ::=    '\' [sSiIcCdDwW]
[37a]    WildcardEsc    ::=    '.'

Character sequence Equivalent ·character class·
. [^\n\r]
\s [#x20\t\n\r]
\S [^\s]
\i the set of initial name characters, those ·match·ed by Letter | '_' | ':'
\I [^\i]
\c the set of name characters, those ·match·ed by NameChar
\C [^\c]
\d \p{Nd}
\D [^\d]
\w [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)
\W [^\w]
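
Several rows of this table can be expressed as per-character predicates. The following sketch covers only ., \s, \d and \w; it omits \i and \c, which depend on the XML name productions. The function names are hypothetical, and unicodedata follows the Unicode version bundled with the Python interpreter rather than the one the specification refers to.

```python
import unicodedata


def matches_dot(ch: str) -> bool:
    """'.' is equivalent to [^\\n\\r]."""
    return ch not in "\n\r"


def matches_s(ch: str) -> bool:
    """\\s is equivalent to [#x20\\t\\n\\r]: space, tab, newline and return only."""
    return ch in " \t\n\r"


def matches_d(ch: str) -> bool:
    """\\d is equivalent to \\p{Nd}, so it covers decimal digits of every script."""
    return unicodedata.category(ch) == "Nd"


def matches_w(ch: str) -> bool:
    """\\w is every character except punctuation (P), separators (Z) and others (C)."""
    return unicodedata.category(ch)[0] not in "PZC"
```

Note that these definitions differ from the same escapes in Python's own re module: here \w includes symbols such as "+" but excludes "_" (category Pc), and \s does not match the no-break space. The negated forms \S, \D and \W are simply the logical negation of the corresponding predicate.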

Note:  The ·regular expression· language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [Unicode Regular Expression Guidelines]. It is hoped that future versions of this specification will provide support for "Level 2" features.
Posted on 2005-04-22 01:29 by Laser.NET