Python Natural Language Processing Study Notes (9): 2.1 Accessing Text Corpora

 
Updated: 2011-08-06

 

CHAPTER 2   

Accessing Text Corpora and Lexical Resources


 

Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora. The goal of this chapter is to answer the following questions:

 

1. What are some useful text corpora and lexical resources, and how can we access them with Python?


2. Which Python constructs are most helpful for this work?


3. How do we avoid repeating ourselves when writing Python code?


 

This chapter continues to present programming concepts by example, in the context of a linguistic processing task. We will wait until later before exploring each Python construct systematically. Don’t worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and—if you’re game—modify it by substituting some part of the code with a different text or word. This way you will associate a task with a programming idiom, and learn the hows and whys later.

 

2.1 Accessing Text Corpora

 

As just mentioned, a text corpus is a large body of text. Many corpora are designed to contain a careful balance of material in one or more genres. We examined some small text collections in Chapter 1, such as the speeches known as the US Presidential Inaugural Addresses. This particular corpus actually contains dozens of individual texts—one per address—but for convenience we glued them end-to-end and treated them as a single text. Chapter 1 also used various predefined texts that we accessed by typing from book import *. However, since we want to be able to work with other texts, this section examines a variety of text corpora. We’ll see how to select individual texts, and how to work with them.

 

Gutenberg Corpus

 

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books (now about 36,000), hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

 


 >>> import nltk

 
>>> nltk.corpus.gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

 

Let’s pick out the first of these texts—Emma by Jane Austen—and give it a short name, emma, then find out how many words it contains:

 


>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')

>>> len(emma)

192427

 

In Section 1.1, we showed how you could carry out concordancing of a text such as text1 with the command text1.concordance(). However, this assumes that you are using one of the nine texts obtained as a result of doing from nltk.book import *. Now that you have started examining data from nltk.corpus, as in the previous example, you have to employ the following pair of statements to perform concordancing and other tasks from Section 1.1:


>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

 
>>> emma.concordance("surprize")

 

When we defined emma, we invoked the words() function of the gutenberg object in NLTK’s corpus package. But since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows:

 


>>> from nltk.corpus import gutenberg

>>> gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...]

>>> emma = gutenberg.words('austen-emma.txt')

 

Let’s write a short program to display other information about each text, by looping over all the values of fileid corresponding to the gutenberg file identifiers listed earlier and then computing statistics for each text. For a compact output display, we will make sure that the numbers are all integers, using int().


>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid)) ①
...     num_words = len(gutenberg.words(fileid))
...     num_sents = len(gutenberg.sents(fileid))
...     num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))
...     print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid
...

4 21 26 austen-emma.txt

4 23 16 austen-persuasion.txt

4 24 22 austen-sense.txt

4 33 79 bible-kjv.txt

4 18 5 blake-poems.txt

4 17 14 bryant-stories.txt

4 17 12 burgess-busterbrown.txt

4 16 12 carroll-alice.txt

4 17 11 chesterton-ball.txt

4 19 11 chesterton-brown.txt

4 16 10 chesterton-thursday.txt

4 18 24 edgeworth-parents.txt

4 24 15 melville-moby_dick.txt

4 52 10 milton-paradise.txt

4 12 8 shakespeare-caesar.txt

4 13 7 shakespeare-hamlet.txt

4 13 6 shakespeare-macbeth.txt

4 35 12 whitman-leaves.txt

 

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3, not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
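
A quick sketch (not from the book) to confirm that claim: each word token is followed by roughly one whitespace character, so subtracting one character per word from the character count brings the average down from about 4 to about 3.

>>> from nltk.corpus import gutenberg
>>> num_chars = len(gutenberg.raw('austen-emma.txt'))
>>> num_words = len(gutenberg.words('austen-emma.txt'))
>>> print int(num_chars/num_words)                  # about 4, counting the space after each word
4
>>> print int((num_chars - num_words)/num_words)    # about 3 once one space per word is discounted
3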

 

The previous example also showed how we can access the “raw” text of the book, not split up into tokens. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many letters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words:


 

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')

>>> macbeth_sentences

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
['Actus', 'Primus', '.'], ...]

>>> macbeth_sentences[1037]

['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']

>>> longest_len = max([len(s) for s in macbeth_sentences])

>>> [s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', ...], ...]

 

Most NLTK corpus readers include a variety of access methods apart from words(), raw(), and sents(). Richer linguistic content is available from some corpora, such as part-of-speech tags, dialogue tags, syntactic trees, and so forth; we will see these in later chapters.

 

Web and Chat Text

 

Although Project Gutenberg contains thousands of books, it represents established literature. It is important to consider less formal language as well. NLTK’s small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Carribean, personal advertisements, and wine reviews:

 

>>> from nltk.corpus import webtext

>>> for fileid in webtext.fileids():

...     print fileid, webtext.raw(fileid)[:65], '...'
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there!  [clop...
overheard.txt White guy: So, do you have any plans for this evening? Asian girl...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb...

 

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form “UserNNN”, and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). The filename contains the date, chatroom, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chat room on 10/19/2006.

 


>>> from nltk.corpus import nps_chat

>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')

>>> chatroom[123]

['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']

 

Brown Corpus

 

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 2-1 gives an example of each genre (for a complete list, see http://icame.uib.no/brown/bcm-los.html).

 

ID File Genre Description
A16 ca16 news Chicago Tribune: Society Reportage
B02 cb02 editorial Christian Science Monitor: Editorials
C17 cc17 reviews Time Magazine: Reviews
D12 cd12 religion Underwood: Probing the Ethics of Realtors
E36 ce36 hobbies Norling: Renting a Car in Europe
F25 cf25 lore Boroff: Jewish Teenage Culture
G22 cg22 belles_lettres Reiner: Coping with Runaway Technology
H15 ch15 government US Office of Civil and Defence Mobilization: The Family Fallout Shelter
J17 cj19 learned Mosteller: Probability with Statistical Applications
K04 ck04 fiction W.E.B. Du Bois: Worlds of Color
L13 cl13 mystery Hitchens: Footsteps in the Night
M01 cm01 science_fiction Heinlein: Stranger in a Strange Land
N14 cn15 adventure Field: Rattlesnake Ridge
P12 cp12 romance Callaghan: A Passion in Rome
R06 cr06 humor Thurber: The Future, If Any, of Comedy

 

                Table 2-1. Example document for each section of the Brown Corpus

 

We can access the corpus as a list of words or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

 


>>> from nltk.corpus import brown

>>> brown.categories()

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', ...], ['The', 'jury', 'further', ...], ...]

 

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics. Let’s compare genres in their usage of modal verbs. The first step is to produce the counts for a particular genre. Remember to import nltk before doing the following:

 


>>> from nltk.corpus import brown

>>> news_text = brown.words(categories='news')

>>> fdist = nltk.FreqDist([w.lower() for w in news_text])

>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

 

Your Turn: Choose a different section of the Brown Corpus, and adapt the preceding example to count a selection of wh words, such as what, when, where, who and why.
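
One possible way to do this Your Turn, using the 'humor' section as an arbitrary example (any other category works the same way; the counts are printed one per wh word):

>>> import nltk
>>> from nltk.corpus import brown
>>> humor_text = brown.words(categories='humor')
>>> fdist = nltk.FreqDist([w.lower() for w in humor_text])
>>> wh_words = ['what', 'when', 'where', 'who', 'why']
>>> for w in wh_words:
...     print w + ':', fdist[w],
...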

 

Next, we need to obtain counts for each genre of interest. We’ll use NLTK’s support for conditional frequency distributions. These are presented systematically in Section 2.2, where we also unpick the following code line by line. For the moment, you can ignore the details and just concentrate on the output.


>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
                  can could  may might must will
           news    93    86   66    38   50  389
       religion    82    59   78    12   54   71
        hobbies   268    58  131    22   83  264
science_fiction    16    49    4    12    8   16
        romance    74   193   11    51   45   43
          humor    16    30    8     8    9   13

 

Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this? The idea that word counts might distinguish genres will be taken up again in Chapter 6.

 

Reuters Corpus

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words. The documents have been classified into 90 topics, and grouped into two sets, called “training” and “test”; thus, the text with fileid 'test/14826' is a document drawn from the test set. This split is for training and testing algorithms that automatically detect the topic of a document, as we will see in Chapter 6.

 


>>> from nltk.corpus import reuters

>>> reuters.fileids()

['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa',
'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn',
'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', ...]

 

Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other, simply because a news story often covers multiple topics. We can ask for the topics covered by one or more documents, or for the documents included in one or more categories. For convenience, the corpus methods accept a single fileid or a list of fileids.


>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', ...]
>>> reuters.fileids(['barley', 'corn'])
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106',
'test/15287', 'test/15341', 'test/15618', 'test/15618', 'test/15648', ...]

 

Similarly, we can specify the words or sentences we want in terms of files or categories. The first handful of words in each of these texts are the titles, which by convention are stored as uppercase.

 

>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS',
'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
>>> reuters.words(['training/9865', 'training/9880'])
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories='barley')
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
>>> reuters.words(categories=['barley', 'corn'])
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]

 

Inaugural Address Corpus

 

In Section 1.1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in Figure 1-2 used “word offset” as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension (the current NLTK data also includes Obama's address):

 


>>> from nltk.corpus import inaugural

>>> inaugural.fileids()   

['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

 

Notice that the year of each text appears in its filename. To get the year out of the filename, we extracted the first four characters, using fileid[:4].

 

Let’s look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks whether they start with either of the “targets” america or citizen using startswith(). Thus it will count words such as American’s and Citizens. We’ll learn about conditional frequency distributions in Section 2.2; for now, just consider the output, shown in Figure 2-1.


>>> cfd = nltk.ConditionalFreqDist(
...           (target, fileid[:4])
...           for fileid in inaugural.fileids()
...           for w in inaugural.words(fileid)
...           for target in ['america', 'citizen']
...           if w.lower().startswith(target)) ①
>>> cfd.plot()

 

Note: when I first ran this I typed file[:4] instead of fileid[:4], which produces a type error:

Traceback (most recent call last):
  File "E:/Test/NLTK/2.1.py", line 6, in <module>
    for fileid in inaugural.fileids()
  File "C:\Python26\lib\site-packages\nltk\probability.py", line 1740, in __init__
    for (cond, sample) in cond_samples:
  File "E:/Test/NLTK/2.1.py", line 9, in <genexpr>
    if w.lower().startswith(target))
TypeError: 'type' object is unsubscriptable

The reason is that file is a built-in type in Python 2, and slicing a type object fails; with fileid[:4], as shown above, the code runs correctly.

Figure 2-1. Plot of a conditional frequency distribution: All words in the Inaugural Address Corpus that begin with america or citizen are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.

 

Annotated Text Corpora

 

Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth. NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. Table 2-2 lists some of the corpora. For information about downloading them, see http://www.nltk.org/data. For more examples of how to access NLTK corpora, please consult the Corpus HOWTO at http://www.nltk.org/howto.
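
If one of these data packages is not yet installed locally, it can be fetched with NLTK's downloader; a minimal sketch (the package name 'brown' is just an example):

>>> import nltk
>>> nltk.download()          # opens the interactive downloader
>>> nltk.download('brown')   # or fetch a single package by name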

 

                                    Table 2-2. Some of the corpora and corpus samples distributed with NLTK

Corpus Compiler Contents
Brown Corpus Francis, Kucera 15 genres, 1.15M words, tagged, categorized
CESS Treebanks CLiC-UB 1M words, tagged and parsed (Catalan, Spanish)
Chat-80 Data Files Pereira & Warren World Geographic Database
CMU Pronouncing Dictionary CMU 127k entries
CoNLL 2000 Chunking Data CoNLL 270k words, tagged and chunked
CoNLL 2002 Named Entity CoNLL 700k words, pos- and named-entity-tagged (Dutch, Spanish)
CoNLL 2007 Dependency Treebanks (sel) CoNLL 150k words, dependency parsed (Basque, Catalan)
Dependency Treebank Narad Dependency parsed version of Penn Treebank sample
Floresta Treebank Diana Santos et al 9k sentences, tagged and parsed (Portuguese)
Gazetteer Lists Various Lists of cities and countries
Genesis Corpus Misc web sources 6 texts, 200k words, 6 languages
Gutenberg (selections) Hart, Newby, et al 18 texts, 2M words
Inaugural Address Corpus CSpan US Presidential Inaugural Addresses (1789-present)
Indian POS-Tagged Corpus Kumaran et al 60k words, tagged (Bangla, Hindi, Marathi, Telugu)
MacMorpho Corpus NILC, USP, Brazil 1M words, tagged (Brazilian Portuguese)
Movie Reviews Pang, Lee 2k movie reviews with sentiment polarity classification
Names Corpus Kantrowitz, Ross 8k male and female names
NIST 1999 Info Extr (selections) Garofolo 63k words, newswire and named-entity SGML markup
NPS Chat Corpus Forsyth, Martell 10k IM chat posts, POS-tagged and dialogue-act tagged
PP Attachment Corpus Ratnaparkhi 28k prepositional phrases, tagged as noun or verb modifiers
Proposition Bank Palmer 113k propositions, 3300 verb frames
Question Classification Li, Roth 6k questions, categorized
Reuters Corpus Reuters 1.3M words, 10k news documents, categorized
Roget's Thesaurus Project Gutenberg 200k words, formatted text
RTE Textual Entailment Dagan et al 8k sentence pairs, categorized
SEMCOR Rus, Mihalcea 880k words, part-of-speech and sense tagged
Senseval 2 Corpus Pedersen 600k words, part-of-speech and sense tagged
Shakespeare texts (selections) Bosak 8 books in XML format
State of the Union Corpus CSPAN 485k words, formatted text
Stopwords Corpus Porter et al 2,400 stopwords for 11 languages
Swadesh Corpus Wiktionary comparative wordlists in 24 languages
Switchboard Corpus (selections) LDC 36 phonecalls, transcribed, parsed
Univ Decl of Human Rights United Nations 480k words, 300+ languages
Penn Treebank (selections) LDC 40k words, tagged and parsed
TIMIT Corpus (selections) NIST/LDC audio files and transcripts for 16 speakers
VerbNet 2.1 Palmer et al 5k verbs, hierarchically organized, linked to WordNet
Wordlist Corpus OpenOffice.org et al 960k words and 20k affixes for 8 languages
WordNet 3.0 (English) Miller, Fellbaum 145k synonym sets

 

Corpora in Other Languages

 

NLTK comes with corpora for many languages, though in some cases you will need to learn how to manipulate character encodings in Python before using these corpora (see Section 3.3).

 

>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricit\xe9_de_France', ...]
>>> nltk.corpus.floresta.words()
['Um', 'revivalismo', 'refrescante', 'O', '7_e_Meio', ...]
>>> nltk.corpus.indian.words('hindi.pos')
['\xe0\xa4\xaa\xe0\xa5\x82\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xa3',
'\xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xac\xe0\xa4\x82\xe0\xa4\xa7', ...]
>>> nltk.corpus.udhr.fileids()
['Abkhaz-Cyrillic+Abkh', 'Abkhaz-UTF8', 'Achehnese-Latin1', 'Achuar-Shiwiar-Latin1',
'Adja-UTF8', 'Afaan_Oromo_Oromiffa-Latin1', 'Afrikaans-Latin1', 'Aguaruna-Latin1',
'Akuapem_Twi-UTF8', 'Albanian_Shqip-Latin1', 'Amahuaca', 'Amahuaca-Latin1', ...]
>>> nltk.corpus.udhr.words('Javanese-Latin1')[11:]
[u'Saben', u'umat', u'manungsa', u'lair', u'kanthi', ...]

 

The last of these corpora, udhr, contains the Universal Declaration of Human Rights in over 300 languages. The fileids for this corpus include information about the character encoding used in the file, such as UTF8 or Latin1. Let’s use a conditional frequency distribution to examine the differences in word lengths for a selection of languages included in the udhr corpus. The output is shown in Figure 2-2 (run the program yourself to see a color plot). Note that True and False are Python’s built-in Boolean values.

 


>>> from nltk.corpus import udhr

>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...           (lang, len(word))
...           for lang in languages
...           for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)

  

 

Figure 2-2. Cumulative word length distributions: Six translations of the Universal Declaration of Human Rights are processed; this graph shows that words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.

 

Your Turn: Pick a language of interest in udhr.fileids(), and define a variable raw_text = udhr.raw(Language-Latin1). Now plot a frequency distribution of the letters of the text using nltk.FreqDist(raw_text).plot().

I don't know why Chinese_Mandarin-UTF8 doesn't work here; leaving that question open for now. (Possibly because the UTF-8 encoded Chinese text is not handled cleanly in this setup: under Python 2 the "letters" may come back as raw bytes of multi-byte characters, and the plot may also lack a font that can display them; see Section 3.3 on character encodings.)
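
A minimal sketch of this Your Turn, using German_Deutsch as the example language (substitute any fileid from udhr.fileids()):

>>> import nltk
>>> from nltk.corpus import udhr
>>> raw_text = udhr.raw('German_Deutsch-Latin1')
>>> nltk.FreqDist(raw_text).plot()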

 

Unfortunately, for many languages, substantial corpora are not yet available. Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or reuse. Some languages have no established writing system, or are endangered. (See Section 2.7 for suggestions on how to locate language resources.)

  

Text Corpus Structure

 

We have seen a variety of corpus structures so far; these are summarized in Figure 2-3. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories, as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example. NLTK’s corpus readers support efficient access to a variety of corpora, and can be used to work with new corpora. Table 2-3 lists functionality provided by the corpus readers.

 

Figure 2-3. Common structures for text corpora: The simplest kind of corpus is a collection of isolated texts with no particular organization; some corpora are structured into categories, such as genre (Brown Corpus); some categorizations overlap, such as topic categories (Reuters Corpus); other corpora represent language use over time (Inaugural Address Corpus). These are four different ways of organizing a corpus.


Table 2-3. Basic corpus functionality defined in NLTK: More documentation can be found using help(nltk.corpus.reader) and by reading the online Corpus HOWTO at http://www.nltk.org/howto.

Example                      Description
fileids()                    The files of the corpus
fileids([categories])        The files of the corpus corresponding to these categories
categories()                 The categories of the corpus
categories([fileids])        The categories of the corpus corresponding to these files
raw()                        The raw content of the corpus
raw(fileids=[f1,f2,f3])      The raw content of the specified files
raw(categories=[c1,c2])      The raw content of the specified categories
words()                      The words of the whole corpus
words(fileids=[f1,f2,f3])    The words of the specified fileids
words(categories=[c1,c2])    The words of the specified categories
sents()                      The sentences of the whole corpus
sents(fileids=[f1,f2,f3])    The sentences of the specified fileids
sents(categories=[c1,c2])    The sentences of the specified categories
abspath(fileid)              The location of the given file on disk
encoding(fileid)             The encoding of the file (if known)
open(fileid)                 Open a stream for reading the given corpus file
root()                       The path to the root of locally installed corpus
readme()                     The contents of the README file of the corpus

 

We illustrate the difference between some of the corpus access methods here:

 

>>> raw = gutenberg.raw("burgess-busterbrown.txt")
>>> raw[1:20]    # raw is indexed by individual characters
'The Adventures of B'
>>> words = gutenberg.words("burgess-busterbrown.txt")
>>> words[1:20]  # words is indexed by tokens: words, punctuation, and numbers
['The', 'Adventures', 'of', 'Buster', 'Bear', 'by', 'Thornton', 'W', '.',
'Burgess', '1920', ']', 'I', 'BUSTER', 'BEAR', 'GOES', 'FISHING', 'Buster',
'Bear']
>>> sents = gutenberg.sents("burgess-busterbrown.txt")
>>> sents[1:20]  # sents is indexed by sentences
[['I'], ['BUSTER', 'BEAR', 'GOES', 'FISHING'], ['Buster', 'Bear', 'yawned', 'as',
'he', 'lay', 'on', 'his', 'comfortable', 'bed', 'of', 'leaves', 'and', 'watched',
'the', 'first', 'early', 'morning', 'sunbeams', 'creeping', 'through', ...], ...]

(Why does the lone 'I' count as a sentence of its own? It appears to be the chapter number of the heading "I BUSTER BEAR GOES FISHING", which the sentence tokenizer treats as a separate sentence.)

 

Loading Your Own Corpus

 

If you have your own collection of text files that you would like to access using the methods discussed earlier, you can easily load them with the help of NLTK’s PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict (a Unix/Linux path). Whatever the location, set this to be the value of corpus_root. The second parameter of the PlaintextCorpusReader initializer can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see Section 3.4 for information about regular expressions).

 

>>> from nltk.corpus import PlaintextCorpusReader

>>> corpus_root = '/usr/share/dict' ①

>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') ②

>>> wordlists.fileids()

['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

 

As another example, suppose you have your own local copy of Penn Treebank (release 3), in C:\corpora. We can use the BracketParseCorpusReader to access this corpus. We specify the corpus_root to be the location of the parsed Wall Street Journal component of the corpus, and give a file_pattern that matches the files contained within its subfolders (using forward slashes).

 

>>> from nltk.corpus import BracketParseCorpusReader

>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj" ①

>>> file_pattern = r".*/wsj_.*\.mrg"  ②

>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)

>>> ptb.fileids()

['00/wsj_0001.mrg', '00/wsj_0002.mrg', '00/wsj_0003.mrg', '00/wsj_0004.mrg', ...]
>>> len(ptb.sents())
49208
>>> ptb.sents(fileids='20/wsj_2013.mrg')[19]
['The', '55-year-old', 'Mr.', 'Noriega', 'is', "n't", 'as', 'smooth', 'as', 'the',
'shah', 'of', 'Iran', ',', 'as', 'well-born', 'as', 'Nicaragua', "'s", 'Anastasio',
'Somoza', ',', 'as', 'imperial', 'as', 'Ferdinand', 'Marcos', 'of', 'the', 'Philippines',
'or', 'as', 'bloody', 'as', 'Haiti', "'s", 'Baby', 'Doc', 'Duvalier', '.']