Python 6th Day

Regular Expressions

Metacharacters

. ^ $ * + ? { } [ ] \ | ( )

[] specifies a character class, a set of characters to match. Characters can be listed individually, or a range can be given with '-'. For example,

[abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

Inside [], metacharacters (except \) lose their special meaning. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it's stripped of its special nature.

A '^' as the first character inside [] complements the set. For example, [^5] will match any character except '5'. Anywhere else in the class, '^' is a literal caret: [5^] matches '5' or '^'.

Use \ (backslash) to escape metacharacters: if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
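
The character-class rules above can be checked with a short sketch. The sample strings are made up for illustration and are not from the original notes.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r'[abc]', 'cabbage')
ranged = re.findall(r'[a-c]', 'cabbage')

# '$' is an ordinary character inside a class.
dollar_hits = re.findall(r'[akm$]', 'make $5')   # ['m', 'a', 'k', '$']

# '^' negates the class only when it comes first.
non_five = re.findall(r'[^5]', '1525')           # ['1', '2']
five_or_caret = re.findall(r'[5^]', '5^6')       # ['5', '^']

# A backslash removes the special meaning of '[' and of '\' itself.
escaped_hits = re.findall(r'\[|\\', r'a[b\c')    # ['[', '\\']
```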

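Predefined sets of characters:

\d == [0-9]
\D == [^0-9]
\s == [ \t\n\r\f\v] # any whitespace character
\S == [^ \t\n\r\f\v] # any non-whitespace character
\w == [a-zA-Z0-9_]
\W == [^a-zA-Z0-9_]

These equivalences are exact for bytes patterns or with the re.ASCII flag; for str patterns in Python 3, \d, \s, and \w also match Unicode digits, whitespace, and word characters.

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.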
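
As a rough illustration of the predefined classes, the sketch below runs them against a made-up sample string. The last two lines assume Python 3, where str patterns are Unicode-aware unless re.ASCII is given.

```python
import re

text = 'Room 42, floor_3.\tOK'   # \t here is a real tab character

digits = re.findall(r'\d', text)      # ['4', '2', '3']
words = re.findall(r'\w+', text)      # ['Room', '42', 'floor_3', 'OK']

# A predefined class used inside []: split on whitespace, ',' or '.'.
pieces = re.split(r'[\s,.]+', text)   # ['Room', '42', 'floor_3', 'OK']

# Python 3 str patterns: \w matches non-ASCII letters unless re.ASCII is set.
unicode_word = re.fullmatch(r'\w', '\u00e9') is not None            # True
ascii_word = re.fullmatch(r'\w', '\u00e9', re.ASCII) is not None    # False
```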

'.' matches any character except a newline (with the re.DOTALL flag it matches newlines too).

'*' specifies that the previous character can be matched zero or more times. For example, ca*t matches ct, cat, and caaat.

'+' matches one or more times, so ca+t matches cat and caaat but not ct.

'?' matches either once or zero times. For example, home-?brew matches either homebrew or home-brew.
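
A minimal sketch of how '.', '*', '+', and '?' behave. It uses re.fullmatch (available in Python 3.4 and later) so that each test string must match in its entirety; the test strings are illustrative.

```python
import re

# '.' matches any single character except a newline.
dot_hits = re.findall(r'a.c', 'abc a-c a\nc')   # ['abc', 'a-c']

candidates = ('ct', 'cat', 'caaat')

# '*': zero or more of the preceding character.
star = [bool(re.fullmatch(r'ca*t', s)) for s in candidates]   # [True, True, True]

# '+': one or more of the preceding character.
plus = [bool(re.fullmatch(r'ca+t', s)) for s in candidates]   # [False, True, True]

# '?': zero or one of the preceding character.
optional = [bool(re.fullmatch(r'home-?brew', s))
            for s in ('homebrew', 'home-brew', 'home--brew')]  # [True, True, False]
```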

Compiling Regular Expressions

A regular expression is compiled into a pattern object, which provides methods for operations such as searching for matches or splitting strings.

>>> import re
>>> p = re.compile('ab*')
>>> p
re.compile('ab*')

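Use raw string notation (the r prefix) for patterns: Python does not process backslashes in a raw string, so they reach the regular expression engine unchanged.

Regular string	Raw string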
"ab*"	r"ab*"
"\\\\section"	r"\\section"
"\\w+\\s+\\1"	r"\w+\s+\1"

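The equivalences in the table can be confirmed directly, since each pair is the same string written two ways. The sketch below also compiles the \\section pattern against a made-up input to show it matching a literal backslash.

```python
import re

# Each regular string equals its raw-string counterpart.
same_section = ("\\\\section" == r"\\section")        # True
same_backref = ("\\w+\\s+\\1" == r"\w+\s+\1")         # True

# The pattern text is two backslashes plus 'section': 9 characters.
pattern_length = len(r"\\section")                     # 9

# As a regex, \\ matches one literal backslash.
section = re.compile(r"\\section")
found = section.search(r"see \section{Intro}")
matched_text = found.group()                           # '\\section' (8 characters)
```
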
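Performing Matches

match()    Checks whether the RE matches at the beginning of the string; returns a match object on success, otherwise None

search()    Scans through the string for the first location where the RE matches; returns a match object on success, or None if there is no match

findall()    Finds all non-overlapping substrings where the RE matches and returns them as a list

finditer()    Finds all non-overlapping matches and returns an iterator that yields match objects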
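
The difference between the four methods is easiest to see on one input. This is a small sketch with an illustrative pattern and string; neither comes from the original notes.

```python
import re

p = re.compile(r'[a-z]+')
text = '::: tempo 12 drummers'

# match() only succeeds at the very start of the string.
at_start = p.match(text)                  # None, the string starts with ':'

# search() finds the first match anywhere in the string.
first = p.search(text)
first_word = first.group()                # 'tempo'

# findall() returns every matching substring as a list of strings.
all_words = p.findall(text)               # ['tempo', 'drummers']

# finditer() yields match objects, so positions are available too.
spans = [m.span() for m in p.finditer(text)]   # [(4, 9), (13, 21)]
```

Because match() and search() return None on failure, real code would usually test the result before calling methods on it.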

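A match object provides these important methods:

group()    Returns the substring matched by the RE

start()    Returns the starting position of the match

end()    Returns the ending position of the match (the index just past the last matched character)

span()    Returns a tuple containing the (start, end) positions of the match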
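
A short sketch of these methods on a hypothetical input string. It also shows that slicing the original string with start() and end() gives back group().

```python
import re

text = 'order 1234 shipped'
m = re.search(r'\d+', text)

if m is not None:                 # search() returns None when nothing matches
    matched = m.group()           # '1234'
    begin = m.start()             # 6
    finish = m.end()              # 10
    both = m.span()               # (6, 10)
    sliced = text[begin:finish]   # '1234', same as m.group()
```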

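Grouping

Groups are marked with (). Groups are numbered starting with 0; group 0 is always present and is the whole match, so the match object methods all take group 0 as their default argument.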

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Groups can be nested; to find a group's number, count the opening parentheses from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

group() can be passed several group numbers at once, in which case it returns a tuple of the corresponding values.

>>> m.group(2,1,2)
('b', 'abc', 'b')

groups() returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.

>>> m.groups()
('abc', 'b')
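
As a further sketch, the same numbering rules apply to a slightly more realistic pattern. The date string and pattern below are illustrative assumptions, chosen so that one group nests two others.

```python
import re

# Group 1 wraps groups 2 and 3; group 4 follows it.
m = re.match(r'((\d+)-(\d+))-(\d+)', '2016-06-18')

whole = m.group(0)            # '2016-06-18'
year_month = m.group(1)       # '2016-06'
year = m.group(2)             # '2016'
month_day = m.group(3, 4)     # ('06', '18')
subgroups = m.groups()        # ('2016-06', '2016', '06', '18')

# start(), end() and span() accept a group number as well.
month_span = m.span(3)        # (5, 7)
```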

Modifying Strings

Splitting strings

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']

posted @ 2016-06-18 02:00 garyyang