corpus

"matter of any kind," literally "a body," (plural corpora), late 14c., "body," from Latin corpus, literally "body" (see corporeal). The sense of "body of a person" (mid-15c. in English) and "collection of facts or things" (1727 in English) both were present in Latin.

Also used in various medical phrases, such as corpus callosum (1706, literally "tough body"), corpus luteum (1788, literally "yellow body").

corporeal

physical. Gym class: Physical Education (PE)

corporal

  • Oxford
    • 1. of the human body; 2. a low-ranking non-commissioned officer
  • LDOCE
    • a low rank in the army, air force etc, from Old Italian caporale, from capo 'head'
  • Collins:
    • 1. a non-commissioned officer in the army or United States Marines.
    • 2. Corporal punishment is the punishment of people by hitting them.

In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory.
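As a rough illustration of the "statistical analysis" and "checking occurrences" mentioned above, here is a minimal sketch that counts word frequencies in a toy corpus. The sample sentences, the simple regex tokenizer, and the function names are assumptions made for this example; real corpus tools use far more careful tokenization.

```python
import re
from collections import Counter

# Toy corpus: made-up sample sentences, not taken from any real corpus.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(texts):
    """Count how often each word token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

freqs = word_frequencies(corpus)
print(freqs.most_common(2))  # → [('the', 4), ('cat', 2)]
```

Counts like these are the raw material for testing a hypothesis such as "word X is more frequent than word Y in this variety of the language."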

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).

In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.
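To make the idea of annotation concrete, the sketch below stores a hand-annotated sentence in which every word carries a part-of-speech tag and a lemma. The `Token` class, the tag labels (DET, NOUN, VERB), and the `word/TAG` output format are illustrative assumptions, not the scheme of any particular corpus.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (illustrative labels)
    lemma: str  # base form of the word

# A hand-annotated example sentence (annotations written by hand here,
# not produced by an automatic tagger).
sentence = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("slept", "VERB", "sleep"),
]

def to_tagged_text(tokens):
    """Render tokens in a simple word/TAG format."""
    return " ".join(f"{t.form}/{t.pos}" for t in tokens)

def lemmas(tokens):
    """Return the lemma of every token, in order."""
    return [t.lemma for t in tokens]

print(to_tagged_text(sentence))  # → The/DET cats/NOUN slept/VERB
print(lemmas(sentence))          # → ['the', 'cat', 'sleep']
```

With the lemma stored alongside each word, a search for "sleep" can find "slept", "sleeps", and "sleeping" in one query.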

Some corpora have further structured levels of analysis applied. In particular, a number of smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.
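Parsed sentences in a treebank are often written as bracketed trees. The sketch below reads one such string into nested tuples and recovers the words; the bracket notation, the category labels, and the helper names `parse_brackets` and `leaves` are assumptions for illustration, and the parser does no error recovery on malformed input.

```python
import re

def parse_brackets(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]          # category label, e.g. S or NP
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())   # nested constituent
            else:
                children.append(tokens[pos])    # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree

def leaves(tree):
    """Collect the words of a tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

tree = parse_brackets("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
print(leaves(tree))  # → ['the', 'cat', 'sat']
```

Every word has to be placed correctly in such a structure, which gives a sense of why fully parsed corpora are harder to build than tagged ones.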

Corpora are the main knowledge base in corpus linguistics. Other notable areas of application include:

  • Language technology, natural language processing, computational linguistics
  • Machine translation
  • Philology: text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship.

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition or speaker identification engine). In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.
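One simple way to picture a speech corpus is as a list of records, each pairing an audio file with its transcription. The sketch below assumes a hypothetical `Utterance` record with invented file names and speaker IDs; real speech corpora define their own metadata formats.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Utterance:
    audio_path: str  # hypothetical file name of the recording
    speaker: str     # hypothetical speaker identifier
    transcript: str  # text transcription of the audio

# Invented sample records for illustration only.
utterances = [
    Utterance("rec_001.wav", "spk_a", "good morning"),
    Utterance("rec_002.wav", "spk_b", "good morning to you"),
    Utterance("rec_003.wav", "spk_a", "how are you"),
]

def transcripts_by_speaker(items):
    """Group transcriptions by speaker, keeping the original order."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.speaker].append(item.transcript)
    return dict(grouped)

print(transcripts_by_speaker(utterances))
```

Grouping by speaker is the kind of lookup that speaker identification or dialect studies would start from.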

Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in their natural context ("realia"), and with minimal experimental interference.

CET-6 / postgraduate entrance exam vocabulary: literal, plural, physics, educate, corporal, rank, unite, marine, punish, linguistic, nowadays, electron, hypothesis, valid, territory, data, multiple, seldom, tag, verb, noun, adjective, farther, consistent, million, notable, translate, script, bible, audio, converse, feasible, interfere

posted on 2022-03-05 08:58 by 华容道专家