Python自然语言处理学习笔记(6):1.4 回到Python:决策和控制

新手上路,翻译不恰之处,恳请指出,不胜感谢 

Updated log

1st:2011/8/6

 

 

1.4 Back to Python: Making Decisions and Taking Control

 回到Python:决策和控制

 

So far, our little programs have had some interesting qualities: the ability to work with language, and the potential to save human effort through automation. A key feature of programming is the ability of machines to make decisions on our behalf, executing instructions when certain conditions are met, or repeatedly looping through text data until some condition is satisfied. This feature is known as control, and is the focus of this section.编程的一个关键特性是机器能有按照我们的意愿决策,遇到特定条件时执行命令,或者对文本数据从头到尾重复循环直到条件满足。这一性质也称为控制,这是本小节的内容。

 

Conditionals 条件 

 

Python supports a wide range of operators, such as < and >=, for testing the relationship between values. The full set of these relational operators(关系运算符) are shown in Table 1-3.

          Table 1-3. Numerical comparison operators

Operator

Relationship

<

Less than

<=

Less than or equal to

==

Equal to (note this is two “=”signs, not one)

!=

Not equal to

>

Greater than

>=

Greater than or equal to

 

We can use these to select different words from a sentence of news text. Here are some examples—notice only the operator is changed from one line to the next. They all use sent7, the first sentence from text7 (Wall Street Journal). As before, if you get an error saying that sent7 is undefined, you need to first type: from nltk.book import *.

>>> sent7

[
'Pierre''Vinken'',''61''years''old'',''will''join''the',

'board''as''a''nonexecutive''director''Nov.''29''.']

>>> [w for w in sent7 if len(w) < 4]

[
',''61''old'',''the''as''a''29''.']

>>> [w for w in sent7 if len(w) <= 4]

[
',''61''old'',''will''join''the''as''a''Nov.''29''.']

>>> [w for w in sent7 if len(w) == 4]

[
'will''join''Nov.']

>>> [w for w in sent7 if len(w) != 4]

[
'Pierre''Vinken'',''61''years''old'',''the''board',

'as''a''nonexecutive''director''29''.']

>>> 

 

There is a common pattern to all of these examples: [w for w in text if condition], where condition is a Python “test” that yields either true or false. In the cases shown in the previous code example, the condition is always a numerical comparison(数值比较). However, we can also test various properties of words, using the functions listed in Table 1-4.

 

                 Table 1-4. Some word comparison operators

Function

Meaning

s.startswith(t)

Test if s starts with t

s.endswith(t)

Test if s ends with t

t in s

Test if t is contained inside s

s.islower()

Test if all cased characters in s are lowercase

s.isupper()

Test if all cased characters in s are uppercase

s.isalpha()

Test if all characters in s are alphabetic 字母

s.isalnum()

Test if all characters in s are alphanumeric 字母数字

s.isdigit()

Test if all characters in s are digits 数字

s.istitle()

Test if s is titlecased (all words in s have initial capitals)

Here are some examples of these operators being used to select words from our texts: words ending with -ableness; words containing gnt; words having an initial capital; and words consisting entirely of digits.

>>> sorted([w for w in set(text1) if w.endswith('ableness')])

[
'comfortableness''honourableness''immutableness''indispensableness', ...]

>>> sorted([term for term in set(text4) if 'gnt' in term])

[
'Sovereignty''sovereignties''sovereignty']

>>> sorted([item for item in set(text6) if item.istitle()])

[
'A''Aaaaaaaaah''Aaaaaaaah''Aaaaaah''Aaaah''Aaaaugh''Aaagh', ...]

>>> sorted([item for item in set(sent7) if item.isdigit()])

[
'29''61']

>>> 

 

We can also create more complex conditions(对条件进行组合). If c is a condition, then not c is also a condition. If we have two conditions c1 and c2, then we can combine them to form a new condition using conjunction(合取) and disjunction(析取): c1 and c2, c1 or c2.

 

Your Turn: Run the following examples and try to explain what is going on in each one. Next, try to make up some conditions of your own.


>>> sorted([w for w in set(text7) if '-' in w and 'index' in w]) 

>>> sorted([wd for wd in set(text3) if wd.istitle() and len(wd) > 10])
>>> sorted([w for w in set(sent7) if not w.islower()])

>>> sorted([t for t in set(text2) if 'cie' in t or 'cei' in t]) 

 

Operating on Every Element 对每个元素上进行操作

  

In Section 1.3, we saw some examples of counting items other than words. Let’s take a closer look at the notation we used:

>>> [len(w) for w in text1]

[
144268419118214115217613452, ...]

>>> [w.upper() for w in text1]

[
'[''MOBY''DICK''BY''HERMAN''MELVILLE''1851'']''ETYMOLOGY''.', ...]

>>> 

These expressions have the form [f(w) for ...] or [w.f() for ...], where f is a function that operates on a word to compute its length, or to convert it to uppercase. For now, you don’t need to understand the difference between the notations f(w) and w.f(). Instead, simply learn this Python idiom which performs the same operation on every element of a list. In the preceding examples, it goes through each word in text1, assigning each one in turn to the variable w and performing the specified operation on the variable.

 

The notation just described is called a “list comprehension.列表推导” This is our first example of a Python idiom(习语), a fixed notation that we use habitually without bothering to analyze each time. Mastering such idioms is an important part of becoming a fluent Python programmer.

Let’s return to the question of vocabulary size, and apply the same idiom here:

>>> len(text1) 单词量

260819

>>> len(set(text1)) 去掉重复的

19317

>>> len(set([word.lower() for word in text1])) 去掉大小写一样的

17231

>>> 

 

Now that(由于) we are not double-counting words like This and this, which differ only in capitalization, we’ve wiped 2,000 off the vocabulary count! We can go a step further and eliminate numbers and punctuation from the vocabulary count by filtering out any non-alphabetic items(接下来是时候消灭数字和标点符号了):

>>> len(set([word.lower() for word in text1 if word.isalpha()]))

16948

>>> 

 

This example is slightly complicated: it lowercases all the purely alphabetic items. Perhaps it would have been simpler just to count the lowercase-only items, but this gives the wrong answer (why?).(这里我没看明白啊,小写的只是去掉大小写一样的单词而已)

Don’t worry if you don’t feel confident with list comprehensions yet, since you’ll see many more examples along with explanations in the following chapters.

 

Nested Code Blocks 嵌套代码块

 

Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. We already saw examples of conditional tests in code like [w for w in sent7 if len(w) < 4]. In the following program, we have created a variable called word containing the string value 'cat'. The if statement checks whether the test len(word) < 5 is true. It is, so the body of the if statement is invoked and the print statement is executed, displaying a message to the user. Remember to indent(缩进,Python的风格) the print statement by typing four spaces.

>>> word = 'cat'

>>> if len(word) < 5:

... 
print 'word length is less than 5'

... ①

word length 
is less than 5

>>> 

 

When we use the Python interpreter we have to add an extra blank line ① in order for it to detect that the nested block is complete.

If we change the conditional test to len(word) >= 5, to check that the length of word is greater than or equal to 5, then the test will no longer be true. This time, the body of the if statement will not be executed, and no message is shown to the user:

>>> if len(word) >= 5:

... 
print 'word length is greater than or equal to 5'

...

>>> 

 

An if statement is known as a control structure(控制结构) because it controls whether the code in the indented block will be run. Another control structure is the for loop. Try the following, and remember to include the colon(冒号) and the four spaces:

>>> for word in ['Call''me''Ishmael''.']:

... 
print word

...

Call

me

Ishmael

.

>>> 

 

This is called a loop(循环) because Python executes the code in circular fashion. It starts by performing the assignment word = 'Call', effectively using the word variable to name the first item of the list. Then, it displays the value of word to the user. Next, it goes back to the for statement, and performs the assignment word = 'me' before displaying this new value to the user, and so on. It continues in this fashion until every item of the list has been processed.

 

Looping with Conditions 条件循环

 

Now we can combine the if and for statements. We will loop over every item of the list, and print the item only if it ends with the letter l(是字母l不是数字). We’ll pick another name for the variable to demonstrate that Python doesn’t try to make sense of variable names.

 

>>> sent1 = ['Call''me''Ishmael''.']

>>> for xyzzy in sent1:

... 
if xyzzy.endswith('l'):

... 
print xyzzy

...

Call

Ishmael

>>>

 

You will notice that if and for statements have a colon at the end of the line, before the indentation begins. In fact, all Python control structures end with a colon. The colon indicates that the current statement relates to the indented block that follows.

We can also specify an action to be taken if the condition of the if statement is not met. Here we see the elif (else if) statement, and the else statement. Notice that these also have colons before the indented code.

>>> for token in sent1:

... 
if token.islower():

... 
print token, 'is a lowercase word'

... 
elif token.istitle():

... 
print token, 'is a titlecase word'

... 
else:

... 
print token, 'is punctuation'

...

Call 
is a titlecase word

me 
is a lowercase word

Ishmael 
is a titlecase word

is punctuation

>>> 

 

As you can see, even with this small amount of Python knowledge, you can start to build multiline Python programs. It’s important to develop such programs in pieces(按块), testing that each piece does what you expect before combining them into a program. This is why the Python interactive interpreter is so invaluable(无价的), and why you should get comfortable using it.

 

Finally, let’s combine the idioms we’ve been exploring. First, we create a list of cie and cei words, then we loop over each item and print it. Notice the comma at the end of the print statement, which tells Python to produce its output on a single line.(最后的逗号是表示单行输出)

 

>>> tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])

>>> for word in tricky:

... 
print word,

ancient ceiling conceit conceited conceive conscience

conscientious conscientiously deceitful deceive ...

>>> 

 

posted @ 2011-07-10 21:49  牛皮糖NewPtone  阅读(1300)  评论(0编辑  收藏  举报