Regular Expressions

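Motivation:
Searching, locating, and extracting content in text data is logically complex work when done by hand. Regular expressions were created to handle it.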

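Definition:
A regular expression is an advanced text-matching pattern. In essence it is a string made up of ordinary characters and special characters.
How it works:
Ordinary characters and characters with special meanings are combined into a string that describes a rule. The rule stands for a whole class of strings, and any text that fits the rule is matched.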

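Learning goals:
Master the regular expression metacharacters.
Read regular expressions and write simple rules.
Use the re module fluently to work with regular expressions.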

Ordinary characters

Each ordinary character matches itself.
re.findall('a','abc')   # ['a']

|  (or)

Matches the expression on either side of |.
re.findall('com|cn','www.baidu.com/www.jd.cn')   # ['com', 'cn']
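The short sketch below runs the alternation example and adds one case that shows the order in which alternatives are tried. The variable names are illustrative only.

```python
import re

url_text = 'www.baidu.com/www.jd.cn'

# Each side of | is a complete alternative: 'com' or 'cn'.
domains = re.findall('com|cn', url_text)
print(domains)  # ['com', 'cn']

# Alternatives are tried from left to right; the first one that matches wins,
# so 'abc' is never reached here.
first_wins = re.findall('ab|abc', 'abc')
print(first_wins)  # ['ab']
```

When one alternative is a prefix of another, putting the longer one first is usually what is wanted.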

.  matches any single character except the newline \n

re.findall('h.llo','hello,hnllo,hollo')   # ['hello', 'hnllo', 'hollo']
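To see the newline exception in practice, the sketch below uses a made-up test string that contains a newline. It also shows the re.DOTALL flag, which lets the dot match a newline as well.

```python
import re

# Sample text (an assumption for illustration): the second item has a newline
# where the other items have an ordinary character.
text = 'hello,h\nllo,h llo'

# By default the dot skips the newline, so 'h\nllo' is not matched.
default_dot = re.findall('h.llo', text)
print(default_dot)  # ['hello', 'h llo']

# With re.DOTALL the dot matches every character, including '\n'.
dotall_dot = re.findall('h.llo', text, flags=re.DOTALL)
print(dotall_dot)  # ['hello', 'h\nllo', 'h llo']
```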

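Character sets []

[aeiou]  any one of the listed characters
[a-z] [A-Z] [0-9]  any one character in the range
[_#?0-9a-z]  mixed form; by convention the ranges are written last

[character set] matches any single character in the set.
re.findall('[a-z]|[A-Z]','How are you')   # ['H', 'o', 'w', 'a', 'r', 'e', 'y', 'o', 'u']
re.findall('[aeiou]','how are you')   # ['o', 'a', 'e', 'o', 'u']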
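As a small illustration of the mixed form, the sketch below applies [_#?0-9a-z] to an invented sample string and checks that two ranges inside one set behave like an alternation of two sets.

```python
import re

# Mixed set: three literal characters plus two ranges.
sample = 'user_01? #Tag'          # invented sample input
mixed = re.findall('[_#?0-9a-z]', sample)
print(''.join(mixed))  # user_01?#ag  (the space and the uppercase T are skipped)

# Two ranges in one set are equivalent to an alternation of two sets.
letters_set = re.findall('[a-zA-Z]', 'How are you')
letters_alt = re.findall('[a-z]|[A-Z]', 'How are you')
print(letters_set == letters_alt)  # True
```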

Negated character sets [^character set]

Matches any single character that is not in the set.
re.findall('[^aeiou ]','how are you')   # ['h', 'w', 'r', 'y']
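The following sketch joins the result of the negated set into a string and shows that ^ only negates when it is the first character inside the brackets. The second test string is invented for the example.

```python
import re

# Everything except vowels and the space character.
consonants = ''.join(re.findall('[^aeiou ]', 'how are you'))
print(consonants)  # hwry

# When ^ is not the first character in the set, it is an ordinary character.
literal_caret = re.findall('[a^]', 'a^b')
print(literal_caret)  # ['a', '^']
```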

Repetition

*

Matches the preceding character repeated 0 or more times.
re.findall('wo*','woooo---~w')   # ['woooo', 'w']

+

Matches the preceding character repeated 1 or more times.
re.findall('wo+','woooo---~w')   # ['woooo']

Exercise:
今天是2021年3月30日  (Today is March 30, 2021)
Extract the numbers:
re.findall('[0-9]+','今天是2021年3月30日')   # ['2021', '3', '30']
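One possible follow-up to the exercise, sketched below, turns the extracted digit strings into integers. The unpacking assumes the sentence contains exactly three numbers, which holds for this sample.

```python
import re

sentence = '今天是2021年3月30日'

# findall returns strings; convert each run of digits to an int.
year, month, day = (int(n) for n in re.findall('[0-9]+', sentence))
print(year, month, day)  # 2021 3 30
```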

There are moments in life when you miss someone so much that you just want to pick them from your dreams and hug them for real! Dream what you want to dream;go where you want to go;be what you want to be,because you have only one life and one chance to do all the things you want to do.

　　May you have enough happiness to make you sweet,enough trials to make you strong,enough sorrow to keep you human,enough hope to make you happy? Always put yourself in others’shoes.If you feel that it hurts you,it probably hurts the other person, too.
Match the words in the English text above that begin with a capital letter (text holds the passage).
re.findall('[A-Z][a-z]*',text)
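A runnable version of this exercise might look like the sketch below. To keep it short it uses a shortened excerpt of the passage rather than the full text, so the variable excerpt is an assumption of the example.

```python
import re

# Shortened excerpt of the passage above (not the full text).
excerpt = ('May you have enough hope to make you happy? Always put yourself '
           'in others’shoes.If you feel that it hurts you,it probably hurts '
           'the other person, too.')

# One uppercase letter followed by any number of lowercase letters.
capitalized = re.findall('[A-Z][a-z]*', excerpt)
print(capitalized)  # ['May', 'Always', 'If']
```

Requiring a trailing space in the pattern would miss capitalized words that are followed by punctuation or that end the text, so the pattern here stops at the last lowercase letter.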

?

Matches the preceding character 0 or 1 times.
re.findall('wo?','woooo-w')   # ['wo', 'w']

Matching numbers:
-12°的气温，战士背着30Kg重装备。  (At -12°, the soldiers carry 30 kg of heavy gear.)
re.findall('-?[0-9]+','-12°的气温，战士背着30Kg重装备。')   # ['-12', '30']
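The sketch below converts the signed matches to integers and points out one limitation of the optional-sign pattern. The range string '3-5' is an invented input.

```python
import re

sentence = '-12°的气温，战士背着30Kg重装备。'

# The optional '-' keeps the sign attached to the digits.
numbers = [int(n) for n in re.findall('-?[0-9]+', sentence)]
print(numbers)  # [-12, 30]

# Limitation: a hyphen used as a range separator is also read as a sign.
range_parts = re.findall('-?[0-9]+', '3-5')
print(range_parts)  # ['3', '-5']
```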

{n}

Matches the preceding character exactly n times.
re.findall('wo{3}','wooooo')   # ['wooo']

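Matching mobile phone numbers (11 digits; text is the string to search)
re.findall('13[0-9]{9}',text)

A stricter pattern that covers more number prefixes, written as a /.../ regex literal:
/^(13[0-9]|14[01456879]|15[0-35-9]|16[2567]|17[0-8]|18[0-9]|19[0-35-9])\d{8}$/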
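The /.../ form above is a regex literal as used in languages such as JavaScript; in Python the pattern body is passed as a string. A minimal sketch of a validator built on it follows. The names MOBILE_RE and is_mobile are hypothetical, and the sample numbers are made up for testing.

```python
import re

# Pattern body copied from the literal above, without the / delimiters and
# without ^ and $, because fullmatch() already requires the whole string to match.
MOBILE_RE = re.compile(
    r'(13[0-9]|14[01456879]|15[0-35-9]|16[2567]|17[0-8]|18[0-9]|19[0-35-9])\d{8}'
)


def is_mobile(number: str) -> bool:
    """Hypothetical helper: True if number fits the 11-digit pattern."""
    return MOBILE_RE.fullmatch(number) is not None


print(is_mobile('13812345678'))   # True
print(is_mobile('16012345678'))   # False: 160 is not an accepted prefix
print(is_mobile('1381234567'))    # False: only 10 digits
```

A match object is used here instead of re.findall because findall returns only the captured group (the three-digit prefix) when the pattern contains one group.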
Matching QQ numbers (5 to 11 digits, no leading zero)

re.findall('[1-9][0-9]{4,10}','183403385')   # ['183403385']

{m,n}

Matches the preceding character m to n times.
re.findall('wo{2,5}','wooooooow')   # ['wooooo']

^  matches the start of the string

re.findall('^Jame','Jame,Hi')   # ['Jame']

$  matches the end of the string

re.findall('Jame$','Hi,Jame')   # ['Jame']

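Use a regular expression to check that a password consists only of digits, letters, and underscores and is 6 to 12 characters long (password is the string to check).
re.findall('^[0-9a-zA-Z_]{6,12}$',password)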
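The password rule can be wrapped in a small function, sketched below. The name is_valid_password and the sample passwords are assumptions made for the example; re.fullmatch is used in place of explicit ^ and $ anchors.

```python
import re

# Digits, ASCII letters, and underscore, 6 to 12 characters in total.
PASSWORD_RE = re.compile(r'[0-9a-zA-Z_]{6,12}')


def is_valid_password(password: str) -> bool:
    """Hypothetical helper: True if the whole password fits the rule."""
    return PASSWORD_RE.fullmatch(password) is not None


print(is_valid_password('abc_123'))    # True
print(is_valid_password('abc'))        # False: too short
print(is_valid_password('abc-1234'))   # False: '-' is not allowed
print(is_valid_password('a' * 13))     # False: too long
```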

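Digit characters \d \D

\d  matches any digit character; the same as [0-9] for ASCII text
\D  matches any non-digit character; the same as [^0-9] for ASCII text
re.findall(r'\d{2,8}','mysql:3306,ssh:80')   # ['3306', '80']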

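Word characters \w \W

\w  matches any word character: digits, letters, underscore, Chinese characters
\W  matches any non-word character
re.findall(r'\w+','server_port = 8888')   # ['server_port', '8888']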
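Because \w also covers Chinese characters in Python 3 string patterns, a mixed-language line can give longer matches than expected. The sketch below uses an invented line to show this, along with the re.ASCII flag, which restricts \w to [a-zA-Z0-9_].

```python
import re

line = '端口port_1 = 8888'   # invented sample mixing Chinese and ASCII

# Default (Unicode) matching: Chinese characters count as word characters.
unicode_words = re.findall(r'\w+', line)
print(unicode_words)  # ['端口port_1', '8888']

# With re.ASCII only ASCII letters, digits, and underscore count.
ascii_words = re.findall(r'\w+', line, flags=re.ASCII)
print(ascii_words)  # ['port_1', '8888']
```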

Whitespace characters \s \S

\s  matches any whitespace character: space, \r, \n, \t, \v, \f
\S  matches any non-whitespace character
re.findall(r'\w+\s+\w+','hello world')   # ['hello world']
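A common use of \s is splitting text on runs of mixed whitespace. The sketch below does this with re.split on a made-up string, then extracts the same fields with \S+.

```python
import re

messy = 'hello \t world\nagain'   # invented sample with mixed whitespace

# Split wherever one or more whitespace characters occur.
fields = re.split(r'\s+', messy)
print(fields)  # ['hello', 'world', 'again']

# The complementary view: collect every run of non-whitespace characters.
tokens = re.findall(r'\S+', messy)
print(tokens)  # ['hello', 'world', 'again']
```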

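Word boundaries \b \B

A word boundary is the position where a word character (digit, letter, Chinese character, underscore) meets any other character.
\b  matches at a word boundary
\B  matches at a position that is not a word boundary
Write the pattern as a raw string (r'...'); in an ordinary string, \b means the backspace character.
re.findall(r'\bis\b','this is the test')   # ['is']
re.findall(r'\bis','this is the test')   # ['is']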
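The sketch below shows why the r prefix matters for \b, and uses match positions to tell \b and \B apart, since both patterns return the same text 'is' on this sentence.

```python
import re

sentence = 'this is the test'

# Without the r prefix, Python turns '\b' into a backspace character,
# so the pattern looks for backspaces and finds nothing.
no_raw = re.findall('\bis\b', sentence)
print(no_raw)  # []

# \bis: 'is' at the start of a word (index 5, the standalone word 'is').
boundary_starts = [m.start() for m in re.finditer(r'\bis', sentence)]
print(boundary_starts)  # [5]

# \Bis: 'is' that does not start a word (index 2, inside 'this').
inside_starts = [m.start() for m in re.finditer(r'\Bis', sentence)]
print(inside_starts)  # [2]
```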

Summary:
Matching characters: . [...] [^...] \d \D \w \W \s \S
Matching repetition: * + ? {n} {m,n}
Matching positions:  ^ $ \b \B
Other:               | () \

Exercise: extract the numbers -12, 30, 1.25, -3.6 and 5.
Answer: re.findall(r'-?\d+\.?\d*','-12 30 1.25 -3.6 5')   # ['-12', '30', '1.25', '-3.6', '5']
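As a sketch of how the extracted strings might be used, the code below converts them to floats. It also compares an escaped dot (\.) with \S in the decimal-point position on an invented input, to show that \S lets stray characters into the match.

```python
import re

data = '-12 30 1.25 -3.6 5'

# Optional sign, digits, optional decimal point, optional fraction digits.
values = [float(n) for n in re.findall(r'-?\d+\.?\d*', data)]
print(values)  # [-12.0, 30.0, 1.25, -3.6, 5.0]

# \S accepts any non-whitespace character where the decimal point belongs.
loose = re.findall(r'-?\d+\S?\d*', '12a5')
print(loose)  # ['12a5']

# \. accepts only a literal dot, so the letter splits the match.
strict = re.findall(r'-?\d+\.?\d*', '12a5')
print(strict)  # ['12', '5']
```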

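Greedy and non-greedy matching

Greedy: by default, the repetition metacharacters * + ? {m,n} match as much of the following text as possible.
Non-greedy: the repetition metacharacter matches as little as possible. Appending ? to it makes it non-greedy. The fewest repetitions each form can match:
*?      0
+?      1
{m,n}?  m
??      0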
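The difference is easiest to see side by side. The sketch below runs each non-greedy form on 'woooo' and then contrasts greedy and non-greedy matching on a small tag-like string, which is an invented sample.

```python
import re

# Each non-greedy form stops at its minimum when nothing follows it in the pattern.
star_lazy = re.findall('wo*?', 'woooo')
plus_lazy = re.findall('wo+?', 'woooo')
range_lazy = re.findall('wo{2,4}?', 'woooo')
optional_lazy = re.findall('wo??', 'woooo')
print(star_lazy, plus_lazy, range_lazy, optional_lazy)
# ['w'] ['wo'] ['woo'] ['w']

markup = '<b>bold</b> and <i>italic</i>'   # invented sample

# Greedy: .+ runs to the last '>' in the string.
greedy = re.findall('<.+>', markup)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall('<.+?>', markup)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```

A non-greedy quantifier still matches more than its minimum when the rest of the pattern requires it, as the '>' does in the last example.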

()  subgroups