Python自然语言处理学习笔记(46):5.7 如何判断词的分类

5.7 How to Determine the Category of a Word  如何判断词的分类

Now that we have examined word classes in detail, we turn to a more basic question: how do we decide what category a word belongs to in the first place? In general, linguists use morphological(形态学的), syntactic(语法的), and semantic clues to determine the category of a word.

 

Morphological Clues   形态线索

The internal structure of a word may give useful clues as to the word’s category. For example, -ness is a suffix that combines with an adjective to produce a noun, e.g., happy happiness, ill illness. So if we encounter a word that ends in -ness, this is very likely to be a noun. Similarly, -ment is a suffix that combines with some verbs to produce a noun, e.g., govern government and establish establishment.

English verbs can also be morphologically complex. For instance, the present participle of a verb ends in -ing, and expresses the idea of ongoing, incomplete action (e.g., falling, eating). The -ing suffix also appears on nouns derived from verbs, e.g., the falling of the leaves (this is known as the gerund 动名词).

 

Syntactic Clues   语法线索

Another source of information is the typical contexts in which a word can occur. For example, assume that we have already determined the category of nouns. Then we might say that a syntactic criterion for an adjective in English is that it can occur immediately before a noun, or immediately following the words be or very. According to these tests, near should be categorized as an adjective:

       (2)

     a. the near window

          b. The end is (very) near.

 

Semantic Clues 语义线索

Finally, the meaning of a word is a useful clue as to its lexical category. For example, the best-known definition of a noun is semantic: “the name of a person, place, or thing.” Within modern linguistics, semantic criteria for word classes are treated with suspicion(怀疑), mainly because they are hard to formalize. Nevertheless, semantic criteria underpin(巩固) many of our intuitions about word classes, and enable us to make a good guess about the categorization of words in languages with which we are unfamiliar. For example, if all we know about the Dutch word verjaardag is that it means the same as the English word birthday, then we can guess that verjaardag is a noun in Dutch. However, some care is needed: although we might translate zij is vandaag jarig as it’s her birthday today, the word jarig is in fact an adjective in Dutch, and has no exact equivalent in English.

 

New Words 新词

All languages acquire new lexical items. A list of words recently added to the Oxford Dictionary of English includes cyberslacker, fatoush, blamestorm, SARS, cantopop, bupkis, noughties, muggle, and robata. Notice that all these new words are nouns, and this is reflected in calling nouns an open class(开放类). By contrast, prepositions are regarded as a closed class(封闭类). That is, there is a limited set of words belonging to the class (e.g., above, along, at, below, beside, between, during, for, from, in, near, on, outside, over, past, through, towards, under, up, with), and membership of the set only changes very gradually over time.

 

Morphology in Part-of-Speech Tagsets 词性标注集合中的词态学

 

Common tagsets often capture some morphosyntactic information, that is, information about the kind of morphological markings that words receive by virtue of(借助) their syntactic role. Consider, for example, the selection of distinct grammatical forms of the word go illustrated in the following sentences:

  (3)  a. Go away!

          b. He sometimes goes to the cafe.

          c. All the cakes have gone.

          d. We went on the excursion(旅行).

Each of these forms—go, goes, gone, and went—is morphologically distinct from the othersgo的词态上是不同滴). Consider the form goes. This occurs in a restricted set of grammatical contexts, and requires a third person singular subject. Thus, the following sentences are ungrammatical.

       (4) a. *They sometimes goes to the cafe.

          b. *I sometimes goes to the cafe.

 

By contrast, gone is the past participle form; it is required after have (and cannot be replaced in this context by goes), and cannot occur as the main verb of a clause.

              (5) a. *All the cakes have goes.

                 b. *He sometimes gone to the café

 

We can easily imagine a tagset in which the four distinct grammatical forms just discussed were all tagged as VB. Although this would be adequate for some purposes, a more fine-grained tagset provides useful information about these forms that can help other processors that try to detect patterns in tag sequences(细粒度的标记集能偶提供有用的信息). The Brown tagset captures these distinctions, as summarized in Table 5-7.

 

Table 5-7. Some morphosyntactic distinctions in the Brown tagset

Form            Category          Tag

go                  base                VB

goes      third singular present VBZ

gone      past participle          VBN

going            gerund              VBG

went            simple past         VBD

 

 

In addition to this set of verb tags, the various forms of the verb to be have special tags: be/BE, being/BEG, am/BEM, are/BER, is/BEZ, been/BEN, were/BED, and was/BEDZ (plus extra tags for negative forms(否定形式?) of the verb). All told(总共), this fine-grained tagging of verbs means that an automatic tagger that uses this tagset is effectively carrying out a limited amount of morphological analysis.

Most part-of-speech tagsets make use of the same basic categories, such as noun, verb, adjective, and preposition. However, tagsets differ both in how finely they divide words into categories, and in how they define their categories. For example, is might be tagged simply as a verb in one tagset, but as a distinct form of the lexeme be in another tagset (as in the Brown Corpus). This variation in tagsets is unavoidable, since part-of-speech tags are used in different ways for different tasks. In other words, there is no one “right way” to assign tags, only more or less useful ways depending on one’s goals.

posted @ 2011-08-30 22:45  牛皮糖NewPtone  阅读(1958)  评论(0编辑  收藏  举报