Natural Language Processing with Python, Chapter 3: Processing Raw Text

For more sophisticated processing of HTML content, see Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

3.11 Further Reading

PEP-100 http://www.python.org/dev/peps/pep-0100/
http://amk.ca/
Frederik Lundh, Python Unicode Objects, http://effbot.org/zone/unicode-objects.htm
Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html

http://sighan.org/

http://www.aclweb.org/

3.12 Exercises

1.

s = 'colorless'
s[:4] + 'u' + s[4:]

2.

s = ['dishes', 'running', 'nationality', 'undo', 'preheat']
s[0][:4]
s[1][:3]
s[2][:6]
s[3][2:]
s[4][3:]

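3. Negative indices count back from the end of the string, so s[-1] is its last character. An index further left than -len(s) raises an IndexError.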
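As a small illustration of the indexing behaviour described here, the sketch below reuses the string from exercise 1. The variable names are only illustrative.

```python
s = 'colorless'

last = s[-1]          # the last character
first = s[-len(s)]    # the same character as s[0]

# Going further left than -len(s) does not wrap again: it raises IndexError.
try:
    s[-len(s) - 1]
    too_far_left_fails = False
except IndexError:
    too_far_left_fails = True

print(last, first, too_far_left_fails)
```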

4, 5.

monty = 'Monty Python'

monty[6:11:2] => 'Pto'

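The first number is the start, the second is the end (exclusive), and the third is the step.

monty[10:5:-2] => 'otP'

With a negative step, characters are taken in the reverse direction (right to left).

monty[::-1] => 'nohtyP ytnoM'

This takes the whole string in reverse order.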
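To see which positions each slice actually visits, one option is to expand the slice into explicit indices. This is only a sketch; picked_indices is a hypothetical helper, not something from the book or NLTK. It relies on the standard slice.indices() method.

```python
monty = 'Monty Python'

def picked_indices(text, start, stop, step):
    """Return the list of indices that text[start:stop:step] selects."""
    return list(range(*slice(start, stop, step).indices(len(text))))

forward = picked_indices(monty, 6, 11, 2)              # start 6, stop before 11, step 2
backward = picked_indices(monty, 10, 5, -2)            # start 10, stop before 5, step -2
all_reversed = picked_indices(monty, None, None, -1)   # every index, last to first

for indices in (forward, backward, all_reversed):
    print(indices, ''.join(monty[i] for i in indices))
```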

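6.

a. [a-zA-Z]+ : a word made up only of letters (one or more).

b. [A-Z][a-z]* : a capitalized alphabetic word: one uppercase letter followed by zero or more lowercase letters, so the word is at least one character long.

c. p[aeiou]{,2}t : the letter p, followed by zero to two lowercase vowels, followed by the letter t.

d. \d+(\.\d+)? : an integer or a decimal: one or more digits, optionally followed by a decimal point and one or more digits.

e. ([^aeiou][aeiou][^aeiou])* : zero or more repetitions of the group (one character that is not a lowercase vowel, which may be an uppercase vowel, then one lowercase vowel, then one character that is not a lowercase vowel).

Pattern e. matches more than expected, so here is a test.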

import re

wl = ['9iy', 'LoL', 'WoW', 'AoE', 'abc', '123']

# With the trailing *, every word matches, because the group may repeat zero times.
[w for w in wl if re.search('([^aeiou][aeiou][^aeiou])*', w)]
# => ['9iy', 'LoL', 'WoW', 'AoE', 'abc', '123']

# Without the *
[w for w in wl if re.search('([^aeiou][aeiou][^aeiou])', w)]
# => ['9iy', 'LoL', 'WoW', 'AoE']

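In the first case, '123' matches because the group is repeated zero times. Why does 'abc' also get through? For the same reason: nltk.re_show('([^aeiou][aeiou][^aeiou])*', 'abc') prints {}a{}b{}c{}, that is, nothing but empty matches.

The second case behaves as expected.

The two patterns differ only by a single *, yet the results are very different.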

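f. \w+|[^\w\s]+ : either one or more word characters (letters, digits, underscore), or one or more characters that are neither word characters nor whitespace.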
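The difference between searching and matching the whole string can be checked directly with the standard re module. The sketch below is one way to do it; the test words and the sample sentence are made up for illustration.

```python
import re

starred = r'([^aeiou][aeiou][^aeiou])*'
unstarred = r'([^aeiou][aeiou][^aeiou])'

# With *, search succeeds at position 0 by matching the empty string.
span_abc = re.search(starred, 'abc').span()

# Without *, 'abc' contains no (non-vowel, vowel, non-vowel) triple, so there is no match.
no_match = re.search(unstarred, 'abc')

# fullmatch requires the pattern to cover the entire word, which is closer
# to what the pattern description suggests.
words = ['9iy', 'LoL', 'abc', '123', 'LoLWoW']
whole_words = [w for w in words if re.fullmatch(starred, w)]

# Pattern f. used as a simple tokenizer with findall.
tokens = re.findall(r'\w+|[^\w\s]+', "Hello, world... it's 3.5!")

print(span_abc, no_match, whole_words)
print(tokens)
```

Note that fullmatch with the starred pattern would still accept the empty string, since zero repetitions are allowed.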

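7.

a. Not solved.

b. ^-?[1-9]\d*$ matches a nonzero integer.

\* matches a literal multiplication sign.

\+ matches a literal plus sign.

Combined into one expression such as 2*3+8, with ^ and $ kept only at the two ends (anchors in the middle would prevent any match):

^-?[1-9]\d*\*-?[1-9]\d*\+-?[1-9]\d*$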
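The integer pattern above can be reused to accept any chain of additions and multiplications, not just one fixed shape. This is a sketch under my own assumptions: no spaces inside the expression, and the names INTEGER, EXPRESSION and is_expression are hypothetical.

```python
import re

INTEGER = r'-?[1-9]\d*'   # same integer pattern as above (it does not match 0)

# One integer, followed by any number of "*integer" or "+integer" parts.
EXPRESSION = re.compile(rf'{INTEGER}(?:[*+]{INTEGER})*')

def is_expression(text):
    """Return True if the whole text is an integer arithmetic expression."""
    return EXPRESSION.fullmatch(text) is not None

for candidate in ['2*3+8', '12', '-2*3', '2*+8', '2*3+', '2 * 3']:
    print(candidate, is_expression(candidate))
```

Using fullmatch plays the role of the ^ and $ anchors, so they appear nowhere inside the pattern.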

8, 9. Tokenizing with regular expressions.

10.

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
[(word, len(word)) for word in sent]

11. Define a string raw containing a sentence of your own choosing. Now, split raw on some character other than space, such as 's'.

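The Chinese edition translates this sentence very confusingly: “分裂raw的一些字符以外的空间，例如's'”.

A better rendering would be “以其他字符（非空格），如's' 来分词”, that is, split the string on some character other than a space, such as 's'.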

s = 'The dog gave John the newspaper'
s.split()
s.split(' ')
s.split('a')
s.split('a ')
s.split('o')
s.split('J')

13.

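split() treats \t, like any other whitespace, as a separator and collapses runs of whitespace. split(' ') uses exactly one space as the separator, so \t is kept as content inside an item of the resulting word list, and consecutive spaces produce empty strings.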

s = 'The dog     gave J\t ohn the news paper.'
s.split()
s.split(' ')
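The same contrast is easier to read on a very short string. The sketch below uses a made-up input with two spaces and one tab.

```python
text = 'a  b\tc'   # two spaces between a and b, a tab between b and c

by_whitespace = text.split()    # any run of whitespace is one separator
by_space = text.split(' ')      # every single space is a separator

print(by_whitespace)
print(by_space)
```

The empty string in the second result comes from the two adjacent spaces, and the tab survives inside 'b\tc'.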

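14.

words.sort() sorts the list in place, so words itself becomes ordered.

sorted(words) returns a new sorted list; words itself is unchanged and keeps its original order.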
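A short check of that difference, using a made-up word list:

```python
words = ['the', 'dog', 'gave', 'John']

new_list = sorted(words)          # returns a new list
after_sorted = list(words)        # snapshot: words has not changed yet

return_value = words.sort()       # sorts in place and returns None

print(new_list, after_sorted, return_value, words)
```

Uppercase letters sort before lowercase ones, which is why 'John' comes first.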

15.

3 * 7 => 21

# "3" * 7 repeats the string "3" seven times.

"3" * 7 => '3333333'

int("3") => 3

str(3) => '3'
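Put together, the conversions decide whether * and + mean arithmetic or string operations. A minimal sketch with illustrative variable names:

```python
repeated = "3" * 7        # string repetition
product = int("3") * 7    # integer multiplication after conversion
joined = str(3) + "7"     # string concatenation after conversion

# Mixing a string and an integer with + is an error, not an implicit conversion.
try:
    "3" + 7
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(repeated, product, joined, mixing_allowed)
```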

16. Omitted.

17.

>>> s ='HelloWorld'
>>> s
'HelloWorld'
>>> '%6s' % s
'HelloWorld'
>>> '%-6s' % s
'HelloWorld'
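Both results above are unchanged because the string is already longer than the field width of 6, and the width is only a minimum. With a shorter string the padding becomes visible; the sketch below uses a made-up two-letter example.

```python
short = 'Hi'
long_word = 'HelloWorld'

right_aligned = '%6s' % short     # padded on the left to width 6
left_aligned = '%-6s' % short     # padded on the right to width 6
not_truncated = '%6s' % long_word # longer than 6, so returned as is

print(repr(right_aligned), repr(left_aligned), repr(not_truncated))
```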

=====End=====

(To be continued...)

Posted on 2015-03-24 20:42 by 猩球崛起
