[Quoted from W3C] Character Escapes - Laser.NET

This article is excerpted from XML Schema Part 2: Datatypes Second Edition and is posted here for reference.

F.1.1 Character Class Escapes
[Definition:] A character class escape is a short sequence of characters that identifies a predefined character class. The valid character class escapes are the ·single character escape·s, the ·multi-character escape·s, and the ·category escape·s (including the ·block escape·s).

Character Class Escape

[23] charClassEsc ::= ( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

[Definition:] A single character escape identifies a set containing only one character -- usually because that character is difficult or impossible to write directly into a ·regular expression·.

Single Character Escape

[24] SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

The valid ·single character escape·s are:	Identifying the set of characters C(R) containing:
`\n`	the newline character (#xA)
`\r`	the return character (#xD)
`\t`	the tab character (#x9)
`\\`	\
`\|`	|
`\.`	.
`\-`	-
`\^`	^
`\?`	?
`\*`	*
`\+`	+
`\{`	{
`\}`	}
`\(`	(
`\)`	)
`\[`	[
`\]`	]
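
The table above can be turned into a small helper that escapes literal text for use inside a pattern. This is only a sketch: the name `escape_literal` and the constant `XSD_METACHARACTERS` are made up for illustration, and the character set is taken from production [24]. Rewriting newline, return, and tab as `\n`, `\r`, and `\t` is a readability choice in this sketch, not something the table requires.

```python
# Characters that production [24] allows after a backslash and that stand
# for themselves (n, r, t are handled separately below).
XSD_METACHARACTERS = set("\\|.?*+(){}-[]^")

# Control characters that have a dedicated single character escape.
CONTROL_ESCAPES = {"\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_literal(text: str) -> str:
    """Return text with every character from the table backslash-escaped."""
    out = []
    for ch in text:
        if ch in CONTROL_ESCAPES:
            out.append(CONTROL_ESCAPES[ch])
        elif ch in XSD_METACHARACTERS:
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)
```

For instance, `escape_literal("[a-z]")` yields a pattern that matches the five literal characters `[a-z]` rather than a lowercase letter.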

[Definition:] [Unicode Database] specifies a number of possible values for the "General Category" property and provides mappings from code points to specific character properties. The set containing all characters that have property X can be identified with a category escape \p{X}. The complement of this set is specified with the category escape \P{X}. ([\P{X}] = [^\p{X}]).

Category Escape

[25]	`catEsc`	::=	`'\p{' charProp '}'`
[26]	`complEsc`	::=	`'\P{' charProp '}'`
[27]	`charProp`	::=	`IsCategory | IsBlock`
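
Python's built-in `re` module does not accept `\p{X}`, so the following sketch approximates the membership test for a single character with the standard `unicodedata` module. The function name `matches_category_escape` is hypothetical. Two assumptions to keep in mind: `unicodedata` reflects the Unicode version bundled with the Python interpreter, not necessarily the one current when the specification was published, and it reports the Cs (surrogate) category, which this specification excludes.

```python
import unicodedata


def matches_category_escape(ch: str, prop: str, complement: bool = False) -> bool:
    """Approximate \\p{prop}, or \\P{prop} when complement is True, for one character."""
    category = unicodedata.category(ch)  # two-letter code such as "Lu" or "Nd"
    # A one-letter property such as "L" covers every subcategory that
    # starts with that letter; a two-letter property must match exactly.
    inside = category.startswith(prop)
    # The complement escape simply inverts the membership test.
    return inside != complement
```

With this sketch, `matches_category_escape("A", "Lu")` and `matches_category_escape("a", "L")` are both true, while `matches_category_escape("a", "Lu")` is false.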

Note: [Unicode Database] is subject to future revision. For example, the mapping from code points to character properties might be updated. All ·minimally conforming· processors ·must· support the character properties defined in the version of [Unicode Database] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the character properties defined in any future version.

The following table specifies the recognized values of the "General Category" property.

Category	Property	Meaning
Letters	L	All Letters
	Lu	uppercase
	Ll	lowercase
	Lt	titlecase
	Lm	modifier
	Lo	other

Marks	M	All Marks
	Mn	nonspacing
	Mc	spacing combining
	Me	enclosing

Numbers	N	All Numbers
	Nd	decimal digit
	Nl	letter
	No	other

Punctuation	P	All Punctuation
	Pc	connector
	Pd	dash
	Ps	open
	Pe	close
	Pi	initial quote (may behave like Ps or Pe depending on usage)
	Pf	final quote (may behave like Ps or Pe depending on usage)
	Po	other

Separators	Z	All Separators
	Zs	space
	Zl	line
	Zp	paragraph

Symbols	S	All Symbols
	Sm	math
	Sc	currency
	Sk	modifier
	So	other

Other	C	All Others
	Cc	control
	Cf	format
	Co	private use
	Cn	not assigned

Categories

[28]	`IsCategory`	::=	`Letters | Marks | Numbers | Punctuation | Separators | Symbols | Others`
[29]	`Letters`	::=	`'L' [ultmo]?`
[30]	`Marks`	::=	`'M' [nce]?`
[31]	`Numbers`	::=	`'N' [dlo]?`
[32]	`Punctuation`	::=	`'P' [cdseifo]?`
[33]	`Separators`	::=	`'Z' [slp]?`
[34]	`Symbols`	::=	`'S' [mcko]?`
[35]	`Others`	::=	`'C' [cfon]?`
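
Productions [28] to [35] can be transcribed almost directly into a Python regular expression that checks whether a string is a recognized category property name. This is a minimal sketch; `IS_CATEGORY` and `is_category_name` are illustrative names, not part of any library.

```python
import re

# One alternative per production [29]-[35]: a category letter followed by
# an optional subcategory letter.
IS_CATEGORY = re.compile(
    r"L[ultmo]?|M[nce]?|N[dlo]?|P[cdseifo]?|Z[slp]?|S[mcko]?|C[cfon]?"
)


def is_category_name(name: str) -> bool:
    """Return True if name is derivable from production [28] IsCategory."""
    return IS_CATEGORY.fullmatch(name) is not None
```

Note that `is_category_name("Cs")` is false, because production [35] lists only `c`, `f`, `o`, and `n` after `C`.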

Note: The properties mentioned above exclude the Cs property. The Cs property identifies "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.

[Definition:] [Unicode Database] groups code points into a number of blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul Jamo, CJK Compatibility, etc. The set containing all characters that have block name X (with all white space stripped out) can be identified with a block escape \p{IsX}. The complement of this set is specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).

Block Escape

[36] IsBlock ::= 'Is' [a-zA-Z0-9#x2D]+

The following table specifies the recognized block names (for more information, see the "Blocks.txt" file in [Unicode Database]).

Start Code	End Code	Block Name	Start Code	End Code	Block Name
#x0000	#x007F	BasicLatin	#x0080	#x00FF	Latin-1Supplement
#x0100	#x017F	LatinExtended-A	#x0180	#x024F	LatinExtended-B
#x0250	#x02AF	IPAExtensions	#x02B0	#x02FF	SpacingModifierLetters
#x0300	#x036F	CombiningDiacriticalMarks	#x0370	#x03FF	Greek
#x0400	#x04FF	Cyrillic	#x0530	#x058F	Armenian
#x0590	#x05FF	Hebrew	#x0600	#x06FF	Arabic
#x0700	#x074F	Syriac	#x0780	#x07BF	Thaana
#x0900	#x097F	Devanagari	#x0980	#x09FF	Bengali
#x0A00	#x0A7F	Gurmukhi	#x0A80	#x0AFF	Gujarati
#x0B00	#x0B7F	Oriya	#x0B80	#x0BFF	Tamil
#x0C00	#x0C7F	Telugu	#x0C80	#x0CFF	Kannada
#x0D00	#x0D7F	Malayalam	#x0D80	#x0DFF	Sinhala
#x0E00	#x0E7F	Thai	#x0E80	#x0EFF	Lao
#x0F00	#x0FFF	Tibetan	#x1000	#x109F	Myanmar
#x10A0	#x10FF	Georgian	#x1100	#x11FF	HangulJamo
#x1200	#x137F	Ethiopic	#x13A0	#x13FF	Cherokee
#x1400	#x167F	UnifiedCanadianAboriginalSyllabics	#x1680	#x169F	Ogham
#x16A0	#x16FF	Runic	#x1780	#x17FF	Khmer
#x1800	#x18AF	Mongolian	#x1E00	#x1EFF	LatinExtendedAdditional
#x1F00	#x1FFF	GreekExtended	#x2000	#x206F	GeneralPunctuation
#x2070	#x209F	SuperscriptsandSubscripts	#x20A0	#x20CF	CurrencySymbols
#x20D0	#x20FF	CombiningMarksforSymbols	#x2100	#x214F	LetterlikeSymbols
#x2150	#x218F	NumberForms	#x2190	#x21FF	Arrows
#x2200	#x22FF	MathematicalOperators	#x2300	#x23FF	MiscellaneousTechnical
#x2400	#x243F	ControlPictures	#x2440	#x245F	OpticalCharacterRecognition
#x2460	#x24FF	EnclosedAlphanumerics	#x2500	#x257F	BoxDrawing
#x2580	#x259F	BlockElements	#x25A0	#x25FF	GeometricShapes
#x2600	#x26FF	MiscellaneousSymbols	#x2700	#x27BF	Dingbats
#x2800	#x28FF	BraillePatterns	#x2E80	#x2EFF	CJKRadicalsSupplement
#x2F00	#x2FDF	KangxiRadicals	#x2FF0	#x2FFF	IdeographicDescriptionCharacters
#x3000	#x303F	CJKSymbolsandPunctuation	#x3040	#x309F	Hiragana
#x30A0	#x30FF	Katakana	#x3100	#x312F	Bopomofo
#x3130	#x318F	HangulCompatibilityJamo	#x3190	#x319F	Kanbun
#x31A0	#x31BF	BopomofoExtended	#x3200	#x32FF	EnclosedCJKLettersandMonths
#x3300	#x33FF	CJKCompatibility	#x3400	#x4DB5	CJKUnifiedIdeographsExtensionA
#x4E00	#x9FFF	CJKUnifiedIdeographs	#xA000	#xA48F	YiSyllables
#xA490	#xA4CF	YiRadicals	#xAC00	#xD7A3	HangulSyllables
			#xE000	#xF8FF	PrivateUse
#xF900	#xFAFF	CJKCompatibilityIdeographs	#xFB00	#xFB4F	AlphabeticPresentationForms
#xFB50	#xFDFF	ArabicPresentationForms-A	#xFE20	#xFE2F	CombiningHalfMarks
#xFE30	#xFE4F	CJKCompatibilityForms	#xFE50	#xFE6F	SmallFormVariants
#xFE70	#xFEFE	ArabicPresentationForms-B	#xFEFF	#xFEFF	Specials
#xFF00	#xFFEF	HalfwidthandFullwidthForms	#xFFF0	#xFFFD	Specials

Note: The blocks mentioned above exclude the HighSurrogates, LowSurrogates and HighPrivateUseSurrogates blocks. These blocks identify "surrogate" characters, which do not occur at the level of the "character abstraction" that XML instance documents operate on.

Note: [Unicode Database] is subject to future revision. For example, the grouping of code points into blocks might be updated. All ·minimally conforming· processors ·must· support the blocks defined in the version of [Unicode Database] that is current at the time this specification became a W3C Recommendation. However, implementors are encouraged to support the blocks defined in any future version of the Unicode Standard.

For example, the ·block escape· for identifying the ASCII characters is \p{IsBasicLatin}.
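
As a rough illustration of how a block escape is formed and what it matches, the sketch below strips white space from a block name to build the escape and tests membership by code point range. The names `BLOCKS`, `block_escape`, and `in_block` are hypothetical, and `BLOCKS` holds only a handful of rows copied from the table above, not the full list.

```python
# A small excerpt of the block table: name (white space removed) -> range.
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Hiragana": (0x3040, 0x309F),
    "CJKUnifiedIdeographs": (0x4E00, 0x9FFF),
}


def normalize_block_name(block_name: str) -> str:
    """Remove all white space, e.g. 'Basic Latin' -> 'BasicLatin'."""
    return "".join(block_name.split())


def block_escape(block_name: str, complement: bool = False) -> str:
    """Build the escape text \\p{IsX}, or \\P{IsX} for the complement."""
    prefix = "\\P" if complement else "\\p"
    return prefix + "{Is" + normalize_block_name(block_name) + "}"


def in_block(ch: str, block_name: str) -> bool:
    """Return True if the code point of ch lies inside the named block."""
    start, end = BLOCKS[normalize_block_name(block_name)]
    return start <= ord(ch) <= end
```

Here `block_escape("Basic Latin")` produces the escape from the example above, and `in_block("A", "Basic Latin")` is true because `A` is #x41.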

[Definition:] A multi-character escape provides a simple way to identify a commonly used set of characters:

Multi-Character Escape

[37]	`MultiCharEsc`	::=	`'\' [sSiIcCdDwW]`
[37a]	`WildcardEsc`	::=	`'.'`

Character sequence	Equivalent ·character class·
.	[^\n\r]
\s	[#x20\t\n\r]
\S	[^\s]
\i	the set of initial name characters, those ·match·ed by Letter | '_' | ':'
\I	[^\i]
\c	the set of name characters, those ·match·ed by NameChar
\C	[^\c]
\d	\p{Nd}
\D	[^\d]
\w	[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] (all characters except the set of "punctuation", "separator" and "other" characters)
\W	[^\w]
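
Several of these escapes look like their counterparts in Python's `re` module but are defined differently, so a port should not assume they are interchangeable. The sketch below spells out four rows of the table explicitly. All names are illustrative, and the category lookups rely on the Unicode version bundled with the Python interpreter, which is an assumption rather than something the specification guarantees.

```python
import re
import unicodedata

# "." excludes both newline and return (Python's default "." excludes only \n).
XSD_DOT = re.compile(r"[^\n\r]")

# \s is exactly space, tab, newline, return (Python's \s matches more).
XSD_SPACE = re.compile(r"[ \t\n\r]")


def is_xsd_digit(ch: str) -> bool:
    """\\d is defined as \\p{Nd}: any Unicode decimal digit."""
    return unicodedata.category(ch) == "Nd"


def is_xsd_word_char(ch: str) -> bool:
    """\\w is every character except punctuation (P), separators (Z), other (C)."""
    return unicodedata.category(ch)[0] not in "PZC"
```

One consequence of the table's definition: the underscore has category Pc, so `is_xsd_word_char("_")` is false, whereas Python's own `\w` does match an underscore.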

Note: The ·regular expression· language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [Unicode Regular Expression Guidelines]. It is hoped that future versions of this specification will provide support for "Level 2" features.

posted on 2005-04-22 01:29 Laser.NET