Python 6th Day

Regular Expressions

Metacharacters

. ^ $ * + ? { } [ ] \ | ( )

[] specifies a character class, that is, a set of characters to match. Characters can be listed individually or given as a range. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your RE would be [a-z].

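Inside [], most metacharacters lose their special meaning. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$'; '$' is usually a metacharacter, but inside a character class it is stripped of its special nature. The exceptions are \, ], a leading ^, and a - between two characters.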

A '^' as the first character inside [] negates the class. For example, [^5] will match any character except '5'. Anywhere else in the class, '^' is just a literal caret.

Use \ (backslash) to escape metacharacters: if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
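The character class rules above can be checked with a short sketch. The sample strings and variable names are made up for illustration; only `re.findall` from the standard library is used.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r'[abc]', 'aXbYcZ')
ranged = re.findall(r'[a-c]', 'aXbYcZ')
print(listed, ranged)        # ['a', 'b', 'c'] ['a', 'b', 'c']

# '$' is an ordinary character inside a class.
dollar_hits = re.findall(r'[akm$]', 'make $5')
print(dollar_hits)           # ['m', 'a', 'k', '$']

# A leading '^' negates the class: everything except '5'.
not_five = re.findall(r'[^5]', '1525')
print(not_five)              # ['1', '2']

# Escaped metacharacters: a literal '[' or a literal backslash.
literals = re.findall(r'\[|\\', r'a[b\c')
print(literals)              # ['[', '\\']
```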

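Predefined sets of characters (the equivalences below describe ASCII matching; in Python 3, str patterns also match the corresponding Unicode characters unless the re.ASCII flag is set):

\d == [0-9]
\D == [^0-9]
\s == [ \t\n\r\f\v]    # any whitespace character
\S == [^ \t\n\r\f\v]   # any non-whitespace character
\w == [a-zA-Z0-9_]
\W == [^a-zA-Z0-9_]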

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, or ',' or '.'.
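A small sketch of the predefined sets in use, including one inside a character class. The sample text is invented for illustration.

```python
import re

text = 'Room 42, floor 3.\tOpen 9-5'

# \d matches a single digit; \w+ matches a run of word characters.
digits = re.findall(r'\d', text)
words = re.findall(r'\w+', text)
print(digits)   # ['4', '2', '3', '9', '5']
print(words)    # ['Room', '42', 'floor', '3', 'Open', '9', '5']

# A predefined set combined with literal characters inside [].
separators = re.findall(r'[\s,.]', 'a, b.c')
print(separators)   # [',', ' ', '.']
```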

'.' matches any character except a newline (with the re.DOTALL flag it matches a newline as well).

'*' specifies that the previous character can be matched zero or more times.

'+' matches the previous character one or more times.

'?' matches the previous character either once or zero times. For example, home-?brew matches either homebrew or home-brew.
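The difference between these repetition metacharacters is easiest to see side by side. The following is a minimal sketch; the patterns `ca*t` and `ca+t` and the sample strings are chosen only for illustration, and `re.fullmatch` requires Python 3.4 or later.

```python
import re

samples = ('ct', 'cat', 'caaat')

# '*' allows zero or more 'a' characters, '+' requires at least one.
star = [bool(re.fullmatch(r'ca*t', s)) for s in samples]
plus = [bool(re.fullmatch(r'ca+t', s)) for s in samples]
print(star)   # [True, True, True]
print(plus)   # [False, True, True]

# '?' makes the hyphen optional, but does not allow two of them.
optional = [bool(re.fullmatch(r'home-?brew', s))
            for s in ('homebrew', 'home-brew', 'home--brew')]
print(optional)   # [True, True, False]

# '.' matches any character except a newline.
dot = re.findall(r'a.c', 'abc a\nc a-c')
print(dot)    # ['abc', 'a-c']
```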

Compiling Regular Expressions

A regular expression is compiled into a pattern object, which provides methods for matching and other operations.

>>> import re
>>> p = re.compile('ab*')
>>> p  
<_sre.SRE_Pattern object at 0x...>
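As a rough illustration, a compiled pattern can be reused for several calls, and it gives the same results as the equivalent module-level function. The sample string and variable names here are assumptions made for the example.

```python
import re

p = re.compile(r'ab*')

# The compiled pattern and the module-level function agree.
compiled = p.findall('a ab abbb b')
module_level = re.findall(r'ab*', 'a ab abbb b')
print(compiled)        # ['a', 'ab', 'abbb']
print(module_level)    # ['a', 'ab', 'abbb']

# Flags are passed once, at compile time.
p_ignore_case = re.compile(r'ab*', re.IGNORECASE)
mixed_case = p_ignore_case.findall('AB aB')
print(mixed_case)      # ['AB', 'aB']
```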

Use raw string notation (the r prefix) for patterns, so that backslashes are not processed by Python's string literal rules:

Regular String      Raw string
"ab*"               r"ab*"
"\\\\section"       r"\\section"
"\\w+\\s+\\1"       r"\w+\s+\1"

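To see that each pair of regular and raw literals spells the same pattern, here is a small sketch. The sample text being searched is invented for illustration.

```python
import re

# Each regular literal and its raw counterpart are the same string.
same_section = "\\\\section" == r"\\section"
same_backref = "\\w+\\s+\\1" == r"\w+\s+\1"
print(same_section, same_backref)   # True True

# The pattern \\section matches a literal backslash followed by 'section'.
m = re.search(r"\\section", r"see \section here")
found = m.group()
print(found)                        # \section
```

The raw form is usually easier to read because every backslash in the literal is a backslash in the pattern.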
Performing Matches

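match()  Matches only at the beginning of the string; returns a match object on success, otherwise None

search()  Scans through the string for the first location where the RE matches; returns a match object, or None if nothing matches

findall()  Finds all non-overlapping matches in the string and returns them as a list

finditer()  Finds all non-overlapping matches in the string and returns an iterator that yields match objects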
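The four methods can be compared on one input. This is a minimal sketch; the pattern `\d+` and the sample text are chosen only for illustration.

```python
import re

p = re.compile(r'\d+')
text = 'tel: 555 1234'

# match() only looks at the start of the string, which is 't' here.
at_start = p.match(text)
print(at_start)                  # None

# search() finds the first match anywhere in the string.
first = p.search(text)
print(first.group())             # 555

# findall() returns every match as a string.
all_matches = p.findall(text)
print(all_matches)               # ['555', '1234']

# finditer() yields match objects, so positions are available too.
with_spans = [(m.group(), m.span()) for m in p.finditer(text)]
print(with_spans)                # [('555', (5, 8)), ('1234', (9, 13))]
```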

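A match object has the following important methods:

group()    Returns the matched substring as a string

start()    Returns the starting position of the match

end()    Returns the ending position of the match

span()    Returns a tuple containing the (start, end) positions of the match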
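A short sketch of these methods on a single match. Because a failed match returns None, the result is checked before use; the sample text and variable names are made up for the example.

```python
import re

text = '123 tempo 456'
m = re.search(r'[a-z]+', text)

if m:
    matched = m.group()
    positions = (m.start(), m.end())
    print(matched)                       # tempo
    print(positions)                     # (4, 9)
    print(m.span())                      # (4, 9)
    # start() and end() index directly into the original string.
    print(text[m.start():m.end()])       # tempo
else:
    print('no match')
```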

Grouping

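Use () to create groups. Capturing groups are numbered from left to right starting at 1; group 0 is always the whole match, so the match object methods group(), start(), end(), and span() all take group 0 as their default argument.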

>>> p = re.compile('(a)b')
>>> m = p.match('ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'

Groups can be nested; to find a group's number, count the opening parentheses from left to right.

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

group() can take several group numbers at once, in which case it returns a tuple:

>>> m.group(2,1,2)
('b', 'abc', 'b')

groups() returns a tuple containing all the subgroups, from group 1 up:

>>> m.groups()
('abc', 'b')
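As a slightly more practical sketch, groups can pull the parts out of a date-like string. The pattern and the sample input are assumptions made for illustration, not part of the original notes.

```python
import re

m = re.match(r'(\d+)-(\d+)-(\d+)', '2016-06-18 02:00')

whole = m.group(0)         # group 0 is the entire match
year = m.group(1)
parts = m.groups()         # all subgroups, starting from group 1
day_year = m.group(3, 1)   # several groups at once, as a tuple
month_span = m.span(2)     # span() also accepts a group number

print(whole)        # 2016-06-18
print(year)         # 2016
print(parts)        # ('2016', '06', '18')
print(day_year)     # ('18', '2016')
print(month_span)   # (5, 7)
```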

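Modifying Strings

Splitting Strings

split() splits the string wherever the pattern matches and returns the pieces as a list. An optional second argument (maxsplit) limits the number of splits; the remainder of the string is returned as the final element.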

>>> p = re.compile(r'\W+')
>>> p.split('This is a test, short and sweet, of split().')
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> p.split('This is a test, short and sweet, of split().', 3)
['This', 'is', 'a', 'test, short and sweet, of split().']
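The trailing empty string in the first result appears because the input ends with characters that match the separator. One possible way to handle that, sketched here with illustrative names and an invented 'key: value' input, is to filter out empty pieces; maxsplit can also be passed as a keyword.

```python
import re

p = re.compile(r'\W+')

parts = p.split('This is a test, short and sweet, of split().')
# Drop the empty string produced by the trailing '().'.
words = [w for w in parts if w]
print(words)
# ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split']

# maxsplit=1 splits only at the first separator.
head = p.split('key: value: more', maxsplit=1)
print(head)    # ['key', 'value: more']
```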

 

posted @ 2016-06-18 02:00  garyyang