This article was digested from XML Schema Part 2: Datatypes Second Edition and posted here for the purpose of reference.
[Definition:] A character class escape is a short sequence of characters that identifies a predefined character class. The valid character class escapes are the ·single character escape·s, the ·multi-character escape·s, and the ·category escape·s (including the ·block escape·s).
[Definition:] A single character escape identifies a set containing only one character, usually because that character is difficult or impossible to write directly into a ·regular expression·.
Single Character Escape
[24]   SingleCharEsc   ::=   '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
| The valid ·single character escape·s are: | Identifying the set of characters C(R) containing: |
| \n | the newline character (#xA) |
| \r | the return character (#xD) |
| \t | the tab character (#x9) |
| \\ | \ |
| \| | | (the vertical bar character) |
| \. | . |
| \- | - |
| \^ | ^ |
| \? | ? |
| \* | * |
| \+ | + |
| \{ | { |
| \} | } |
| \( | ( |
| \) | ) |
| \[ | [ |
| \] | ] |
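The escapes listed above can be read as a small lookup table. The following is a minimal Python sketch of that reading, not part of the specification; the names `SINGLE_CHAR_ESCAPES` and `single_char_escape` are hypothetical and chosen only for illustration.

```python
# Hypothetical names; the 17 entries mirror production [24].
SINGLE_CHAR_ESCAPES = {
    "n": "\n",  # the newline character (#xA)
    "r": "\r",  # the return character (#xD)
    "t": "\t",  # the tab character (#x9)
}
# Every other escapable character simply stands for itself.
for _ch in "\\|.-^?*+{}()[]":
    SINGLE_CHAR_ESCAPES[_ch] = _ch


def single_char_escape(escape: str) -> str:
    """Return the one character identified by an escape such as '\\n' or '\\['."""
    if len(escape) != 2 or escape[0] != "\\" or escape[1] not in SINGLE_CHAR_ESCAPES:
        raise ValueError(f"not a single character escape: {escape!r}")
    return SINGLE_CHAR_ESCAPES[escape[1]]
```

Under these assumptions, `single_char_escape("\\-")` yields the hyphen, while an unlisted sequence such as `\q` is rejected rather than silently accepted.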
[Definition:] [Unicode Database] specifies a number of possible values for the "General Category" property and provides mappings from code points to specific character properties. The set containing all characters that have property X can be identified with a category escape \p{X}. The complement of this set is specified with the category escape \P{X} (that is, [\P{X}] = [^\p{X}]).
Note: [Unicode Database] is subject to future revision. For example, the mapping from code points to character properties might be updated. All ·minimally conforming· processors ·must· support the character properties defined in the version of [Unicode Database] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the character properties defined in any future version.
The following table specifies the recognized values of the "General Category" property.
| Category | Property | Meaning |
| Letters | L | All Letters |
| | Lu | uppercase |
| | Ll | lowercase |
| | Lt | titlecase |
| | Lm | modifier |
| | Lo | other |
| Marks | M | All Marks |
| | Mn | nonspacing |
| | Mc | spacing combining |
| | Me | enclosing |
| Numbers | N | All Numbers |
| | Nd | decimal digit |
| | Nl | letter |
| | No | other |
| Punctuation | P | All Punctuation |
| | Pc | connector |
| | Pd | dash |
| | Ps | open |
| | Pe | close |
| | Pi | initial quote (may behave like Ps or Pe depending on usage) |
| | Pf | final quote (may behave like Ps or Pe depending on usage) |
| | Po | other |
| Separators | Z | All Separators |
| | Zs | space |
| | Zl | line |
| | Zp | paragraph |
| Symbols | S | All Symbols |
| | Sm | math |
| | Sc | currency |
| | Sk | modifier |
| | So | other |
| Other | C | All Others |
| | Cc | control |
| | Cf | format |
| | Co | private use |
| | Cn | not assigned |
Note: The properties mentioned above exclude the Cs property. The Cs property identifies "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.
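As an illustration of how a category escape selects characters, here is a hedged Python sketch built on the standard `unicodedata` module. The helper names `in_category` and `not_in_category` are hypothetical. The sketch also assumes the Unicode version bundled with the Python interpreter, which may differ from the version a conforming processor is required to support.

```python
import unicodedata


def in_category(ch: str, prop: str) -> bool:
    """True if the single character ch belongs to \\p{prop}.

    prop is a one- or two-letter General Category value such as 'L' or 'Lu'.
    """
    # unicodedata.category returns a two-letter value such as 'Lu'; a
    # one-letter property like 'L' covers every value that starts with it.
    return unicodedata.category(ch).startswith(prop)


def not_in_category(ch: str, prop: str) -> bool:
    """True if ch belongs to \\P{prop}, the complement of \\p{prop}."""
    return not in_category(ch, prop)
```

For instance, `in_category("A", "Lu")` and `in_category("A", "L")` both hold, and `not_in_category` returns the opposite answer for every character, which is the identity [\P{X}] = [^\p{X}] stated in the definition.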
[Definition:] [Unicode Database] groups code points into a number of blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul Jamo, CJK Compatibility, etc. The set containing all characters that have block name X (with all white space stripped out) can be identified with a block escape \p{IsX}. The complement of this set is specified with the block escape \P{IsX} (that is, [\P{IsX}] = [^\p{IsX}]).
Block Escape
[36]   IsBlock   ::=   'Is' [a-zA-Z0-9#x2D]+
The following table specifies the recognized block names (for more information, see the "Blocks.txt" file in [Unicode Database]).
| Start Code | End Code | Block Name |
| #x0000 | #x007F | BasicLatin |
| #x0080 | #x00FF | Latin-1Supplement |
| #x0100 | #x017F | LatinExtended-A |
| #x0180 | #x024F | LatinExtended-B |
| #x0250 | #x02AF | IPAExtensions |
| #x02B0 | #x02FF | SpacingModifierLetters |
| #x0300 | #x036F | CombiningDiacriticalMarks |
| #x0370 | #x03FF | Greek |
| #x0400 | #x04FF | Cyrillic |
| #x0530 | #x058F | Armenian |
| #x0590 | #x05FF | Hebrew |
| #x0600 | #x06FF | Arabic |
| #x0700 | #x074F | Syriac |
| #x0780 | #x07BF | Thaana |
| #x0900 | #x097F | Devanagari |
| #x0980 | #x09FF | Bengali |
| #x0A00 | #x0A7F | Gurmukhi |
| #x0A80 | #x0AFF | Gujarati |
| #x0B00 | #x0B7F | Oriya |
| #x0B80 | #x0BFF | Tamil |
| #x0C00 | #x0C7F | Telugu |
| #x0C80 | #x0CFF | Kannada |
| #x0D00 | #x0D7F | Malayalam |
| #x0D80 | #x0DFF | Sinhala |
| #x0E00 | #x0E7F | Thai |
| #x0E80 | #x0EFF | Lao |
| #x0F00 | #x0FFF | Tibetan |
| #x1000 | #x109F | Myanmar |
| #x10A0 | #x10FF | Georgian |
| #x1100 | #x11FF | HangulJamo |
| #x1200 | #x137F | Ethiopic |
| #x13A0 | #x13FF | Cherokee |
| #x1400 | #x167F | UnifiedCanadianAboriginalSyllabics |
| #x1680 | #x169F | Ogham |
| #x16A0 | #x16FF | Runic |
| #x1780 | #x17FF | Khmer |
| #x1800 | #x18AF | Mongolian |
| #x1E00 | #x1EFF | LatinExtendedAdditional |
| #x1F00 | #x1FFF | GreekExtended |
| #x2000 | #x206F | GeneralPunctuation |
| #x2070 | #x209F | SuperscriptsandSubscripts |
| #x20A0 | #x20CF | CurrencySymbols |
| #x20D0 | #x20FF | CombiningMarksforSymbols |
| #x2100 | #x214F | LetterlikeSymbols |
| #x2150 | #x218F | NumberForms |
| #x2190 | #x21FF | Arrows |
| #x2200 | #x22FF | MathematicalOperators |
| #x2300 | #x23FF | MiscellaneousTechnical |
| #x2400 | #x243F | ControlPictures |
| #x2440 | #x245F | OpticalCharacterRecognition |
| #x2460 | #x24FF | EnclosedAlphanumerics |
| #x2500 | #x257F | BoxDrawing |
| #x2580 | #x259F | BlockElements |
| #x25A0 | #x25FF | GeometricShapes |
| #x2600 | #x26FF | MiscellaneousSymbols |
| #x2700 | #x27BF | Dingbats |
| #x2800 | #x28FF | BraillePatterns |
| #x2E80 | #x2EFF | CJKRadicalsSupplement |
| #x2F00 | #x2FDF | KangxiRadicals |
| #x2FF0 | #x2FFF | IdeographicDescriptionCharacters |
| #x3000 | #x303F | CJKSymbolsandPunctuation |
| #x3040 | #x309F | Hiragana |
| #x30A0 | #x30FF | Katakana |
| #x3100 | #x312F | Bopomofo |
| #x3130 | #x318F | HangulCompatibilityJamo |
| #x3190 | #x319F | Kanbun |
| #x31A0 | #x31BF | BopomofoExtended |
| #x3200 | #x32FF | EnclosedCJKLettersandMonths |
| #x3300 | #x33FF | CJKCompatibility |
| #x3400 | #x4DB5 | CJKUnifiedIdeographsExtensionA |
| #x4E00 | #x9FFF | CJKUnifiedIdeographs |
| #xA000 | #xA48F | YiSyllables |
| #xA490 | #xA4CF | YiRadicals |
| #xAC00 | #xD7A3 | HangulSyllables |
| #xE000 | #xF8FF | PrivateUse |
| #xF900 | #xFAFF | CJKCompatibilityIdeographs |
| #xFB00 | #xFB4F | AlphabeticPresentationForms |
| #xFB50 | #xFDFF | ArabicPresentationForms-A |
| #xFE20 | #xFE2F | CombiningHalfMarks |
| #xFE30 | #xFE4F | CJKCompatibilityForms |
| #xFE50 | #xFE6F | SmallFormVariants |
| #xFE70 | #xFEFE | ArabicPresentationForms-B |
| #xFEFF | #xFEFF | Specials |
| #xFF00 | #xFFEF | HalfwidthandFullwidthForms |
| #xFFF0 | #xFFFD | Specials |
Note: The blocks mentioned above exclude the HighSurrogates, LowSurrogates and HighPrivateUseSurrogates blocks. These blocks identify "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.
Note: [Unicode Database] is subject to future revision. For example, the grouping of code points into blocks might be updated. All ·minimally conforming· processors ·must· support the blocks defined in the version of [Unicode Database] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the blocks defined in any future version of the Unicode Standard.
For example, the ·block escape· for identifying the ASCII characters is \p{IsBasicLatin}.
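A block escape amounts to a range check on the code point. The Python sketch below illustrates this under stated assumptions: `BLOCKS` holds only a handful of rows copied from the block table above (a real implementation would list all of them), and the helper names `block_escape_name` and `in_block` are hypothetical.

```python
# A few rows copied from the block table; deliberately incomplete.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "HangulJamo": (0x1100, 0x11FF),
}


def block_escape_name(unicode_block_name: str) -> str:
    """Strip all white space from a block name and prefix it with 'Is'."""
    return "Is" + "".join(unicode_block_name.split())


def in_block(ch: str, is_name: str) -> bool:
    """True if the single character ch belongs to \\p{is_name}, e.g. 'IsBasicLatin'."""
    if not is_name.startswith("Is") or is_name[2:] not in BLOCKS:
        raise ValueError(f"unrecognized block escape name: {is_name!r}")
    start, end = BLOCKS[is_name[2:]]
    return start <= ord(ch) <= end
```

With this table, `block_escape_name("Basic Latin")` gives `IsBasicLatin`, and `in_block("A", "IsBasicLatin")` holds while `in_block("é", "IsBasicLatin")` does not.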
[Definition:] A multi-character escape provides a simple way to identify a commonly used set of characters:
Multi-Character Escape
[37]    MultiCharEsc   ::=   '\' [sSiIcCdDwW]
[37a]   WildcardEsc    ::=   '.'
| Character sequence | Equivalent ·character class· |
| . | [^\n\r] |
| \s | [#x20\t\n\r] |
| \S | [^\s] |
| \i | the set of initial name characters, those ·match·ed by Letter | '_' | ':' |
| \I | [^\i] |
| \c | the set of name characters, those ·match·ed by NameChar |
| \C | [^\c] |
| \d | \p{Nd} |
| \D | [^\d] |
| \w | [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters) |
| \W | [^\w] |
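Several of these equivalences can be checked character by character. The following Python sketch is an illustration only: the function names are hypothetical, each function expects a single character, and the category data comes from the Unicode version bundled with the interpreter. The \i and \c escapes are left out because they depend on the XML Letter and NameChar productions, which are not reproduced here.

```python
import unicodedata


def matches_dot(ch: str) -> bool:
    """'.' is [^\\n\\r]: any character except newline and return."""
    return ch not in "\n\r"


def matches_s(ch: str) -> bool:
    """\\s is [#x20\\t\\n\\r]: exactly these four characters."""
    return ch in " \t\n\r"


def matches_d(ch: str) -> bool:
    """\\d is \\p{Nd}: any decimal digit, not only ASCII 0-9."""
    return unicodedata.category(ch) == "Nd"


def matches_w(ch: str) -> bool:
    """\\w is every character outside the P, Z and C general categories."""
    return unicodedata.category(ch)[0] not in "PZC"
```

Two consequences of the definitions are worth noting. Since the underscore has category Pc, `matches_w("_")` is false, unlike \w in many other regular expression dialects, whereas `matches_w("+")` is true because "+" is a math symbol (Sm). The complements \S, \D and \W are simply the negations of the corresponding functions.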
Note: The ·regular expression· language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [Unicode Regular Expression Guidelines]. It is hoped that future versions of this specification will provide support for "Level 2" features.