Natural Language Processing with Python, Study Notes (70): 8.2 What's the Use of Syntax?

8.2   What's the Use of Syntax?

Beyond n-grams

We gave an example in Chapter 2 of how to use the frequency information in bigrams to generate text that seems perfectly acceptable for small sequences of words but rapidly degenerates into nonsense. Here's another pair of examples that we created by computing the bigrams over the text of a children's story, The Adventures of Buster Brown (http://www.gutenberg.org/files/22816/22816.txt):

(4)

a.

He roared with me the pail slip down his back

b.

The worst part and clumsy looking for whoever heard light
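To make the source of such word-salad concrete, here is a minimal sketch of bigram-based generation in plain Python. It is not the code used in the book: the toy text, the function names `build_bigram_table` and `generate`, and the fixed random seed are all assumptions made for illustration. Each word is chosen only by looking at the word before it, which is why short stretches look fine while the whole sequence drifts.

```python
import random
from collections import defaultdict

def build_bigram_table(words):
    """Map each word to the list of words that follow it in the text."""
    followers = defaultdict(list)
    for first, second in zip(words, words[1:]):
        followers[first].append(second)
    return followers

def generate(followers, start, length, seed=0):
    """Random walk over the bigram table, starting from `start`."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        candidates = followers.get(output[-1])
        if not candidates:  # dead end: nothing ever followed this word
            break
        output.append(rng.choice(candidates))
    return output

# Toy stand-in text (an assumption; the book used a full story).
toy_text = ("he roared with rage . he looked at me . "
            "the pail slipped down his back .").split()
table = build_bigram_table(toy_text)
sample = generate(table, "he", 8)
print(" ".join(sample))
```

Every adjacent pair in the output occurs somewhere in the toy text, yet nothing constrains the sequence as a whole.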

You intuitively know that these sequences are "word-salad", but you probably find it hard to pin down what's wrong with them. One benefit of studying grammar is that it provides a conceptual framework and vocabulary for spelling out these intuitions. Let's take a closer look at the sequence "the worst part and clumsy looking". This looks like a coordinate structure, where two phrases are joined by a coordinating conjunction such as "and", "but", or "or". Here's an informal (and simplified) statement of how coordination works syntactically:

Coordinate Structure:

If v1 and v2 are both phrases of grammatical category X, then "v1 and v2" is also a phrase of category X.

Here are a couple of examples. In the first, two NPs (noun phrases) have been conjoined to make an NP, while in the second, two APs (adjective phrases) have been conjoined to make an AP.

(5)

a.

The book's ending was (NP the worst part and the best part) for me.

b.

On land they are (AP slow and clumsy looking).
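The coordination rule can be sketched as a small function. This is only an illustration under stated assumptions: the function name `coordinate` and the (category, words) pair representation are hypothetical, and the category labels are assigned by hand rather than computed by a parser.

```python
def coordinate(phrase1, phrase2, conjunction="and"):
    """Apply the coordination rule to two (category, words) pairs.

    Returns a new (category, words) pair when both phrases share a
    category, and None when the categories differ.
    """
    cat1, words1 = phrase1
    cat2, words2 = phrase2
    if cat1 != cat2:
        return None
    return (cat1, f"{words1} {conjunction} {words2}")

# Hand-labelled phrases taken from examples (5a), (5b) and (4b).
np_result = coordinate(("NP", "the worst part"), ("NP", "the best part"))
ap_result = coordinate(("AP", "slow"), ("AP", "clumsy looking"))
mixed_result = coordinate(("NP", "the worst part"), ("AP", "clumsy looking"))

print(np_result)     # ('NP', 'the worst part and the best part')
print(ap_result)     # ('AP', 'slow and clumsy looking')
print(mixed_result)  # None
```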

What we can't do is conjoin an NP and an AP, which is why "the worst part and clumsy looking" is ungrammatical. Before we can formalize these ideas, we need to understand the concept of constituent structure.

Constituent structure is based on the observation that words combine with other words to form units. The evidence that a sequence of words forms such a unit is given by substitutability: a sequence of words in a well-formed sentence can be replaced by a shorter sequence without rendering the sentence ill-formed. To clarify this idea, consider the following sentence:

(6)

The little bear saw the fine fat trout in the brook.

The fact that we can substitute He for The little bear indicates that the latter sequence is a unit. By contrast, we cannot replace little bear saw in the same way.

(7)

a.

He saw the fine fat trout in the brook.

b.

*The he the fine fat trout in the brook.

In Figure 8.1, we systematically substitute longer sequences by shorter ones in a way which preserves grammaticality. Each sequence that forms a unit can in fact be replaced by a single word, and we end up with just two elements.


Figure 8.1: Substitution of Word Sequences: working from the top row, we can replace particular sequences of words (e.g. the brook) with individual words (e.g. it); repeating this process we arrive at a grammatical two-word sentence.
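The substitution steps can be mimicked with a short helper. This is a sketch, not the book's code: the function name `substitute` is an assumption, and the two steps shown are the ones named in the text ("The little bear" replaced by "He", "the brook" replaced by "it"). The code only performs the replacement; judging whether the result is still grammatical remains the reader's job.

```python
def substitute(words, old, new):
    """Replace the first occurrence of the word sequence `old` with `new`."""
    n = len(old)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == old:
            return words[:i] + new + words[i + n:]
    raise ValueError(f"sequence {old!r} not found")

sentence = "the little bear saw the fine fat trout in the brook".split()

# Each step swaps a candidate unit for a single word.
step1 = substitute(sentence, "the little bear".split(), ["he"])
step2 = substitute(step1, "the brook".split(), ["it"])

print(" ".join(step1))  # he saw the fine fat trout in the brook
print(" ".join(step2))  # he saw the fine fat trout in it
```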

In Figure 8.2, we have added grammatical category labels to the words we saw in the earlier figure. The labels NP, VP, and PP stand for noun phrase, verb phrase and prepositional phrase respectively.


Figure 8.2: Substitution of Word Sequences Plus Grammatical Categories: This diagram reproduces Figure 8.1 along with grammatical categories corresponding to noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and nominals (Nom).

If we now strip out the words apart from the topmost row, add an S node, and flip the figure over, we end up with a standard phrase structure tree, shown in (8). Each node in this tree (including the words) is called a constituent. The immediate constituents of S are NP and VP.

(8)

[Image: phrase structure tree for sentence (6)]

As we will see in the next section, a grammar specifies how the sentence can be subdivided into its immediate constituents, and how these can be further subdivided until we reach the level of individual words.
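A phrase structure tree can be represented as nested tuples, which makes "immediate constituents" easy to inspect. The sketch below rests on assumptions: the bracketing is one plausible analysis of sentence (6), not a transcription of the tree image, and the word-level labels (Det, Adj, N, V, P) and helper names are chosen for illustration. Only the split of S into NP and VP is stated in the text.

```python
# A tree is (label, child, child, ...); a leaf is a plain string.
# Assumed bracketing of sentence (6); only S -> NP VP comes from the text.
tree = ("S",
        ("NP", ("Det", "the"), ("Nom", ("Adj", "little"), ("N", "bear"))),
        ("VP",
         ("VP", ("V", "saw"),
          ("NP", ("Det", "the"),
           ("Nom", ("Adj", "fine"), ("Nom", ("Adj", "fat"), ("N", "trout"))))),
         ("PP", ("P", "in"), ("NP", ("Det", "the"), ("Nom", ("N", "brook"))))))

def label(node):
    """Return the category label of a node, or the word itself for a leaf."""
    return node if isinstance(node, str) else node[0]

def immediate_constituents(node):
    """Labels of the direct children of a node (empty for a leaf)."""
    return [] if isinstance(node, str) else [label(c) for c in node[1:]]

def leaves(node):
    """The words of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:
        result.extend(leaves(child))
    return result

print(immediate_constituents(tree))  # ['NP', 'VP']
print(" ".join(leaves(tree)))
```

Reading the leaves from left to right recovers the original sentence, while each tuple groups the words that form one constituent.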

Note

As we saw in Section 8.1, sentences can have arbitrary length. Consequently, phrase structure trees can have arbitrary depth. The cascaded chunk parsers we saw in Section 7.4 can only produce structures of bounded depth, so chunking methods aren't applicable here.
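The point about unbounded depth can be illustrated with a toy example. The sketch assumes a nested-tuple tree representation and a hypothetical embedding pattern ("she said ..."); the helper names `depth` and `embed` are not from the book. Each extra layer of embedding makes the tree deeper, so no fixed depth bound covers every sentence.

```python
def depth(node):
    """Depth of a (label, child, ...) tree; a bare word has depth 0."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(child) for child in node[1:])

def embed(sentence_tree, times):
    """Wrap a sentence in `times` layers of 'she said ...'."""
    for _ in range(times):
        sentence_tree = ("S", ("NP", "she"),
                         ("VP", ("V", "said"), sentence_tree))
    return sentence_tree

# Hypothetical two-word base sentence.
base = ("S", ("NP", "he"), ("VP", ("V", "ran")))

depths = [depth(embed(base, n)) for n in range(4)]
print(depths)  # [3, 5, 7, 9]
```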

posted @ 2012-05-14 21:35  牛皮糖NewPtone