Natural Language Processing

*How do we process human language?

  *Human speech; web documents written in human language

*Natural Language Processing (NLP)

  *Morphological/syntactic analysis

  *Semantic analysis

  *Advanced NLP technology

 

*Related NLP fields:

  *Web Content Mining -- Analyzing web content;

  *Speech Processing -- Processing and analysing spoken language;

  *Information Extraction -- Extracting semantic information from text;

  *Document Classification/Clustering -- Classifying/clustering web documents;

*Basic steps of NLP:

  1, Sentence splitting: dividing a string of written language into its component sentences. A simple boundary rule: sentence boundary = period + space(s) + capital letter.

eg: All 298 passengers on board a Malaysia Airlines plane died after the airliner crashed in eastern Ukraine, close to the border with Russia. Flight MH17 was travelling over the conflict-hit region when it disappeared from radar. A total of 283 passengers, including 80 children, and 15 crew members were on board.

==> All 298 passengers on board a Malaysia Airlines plane died after the airliner crashed in eastern Ukraine, close to the border with Russia. 

==> Flight MH17 was travelling over the conflict-hit region when it disappeared from radar.

==> A total of 283 passengers, including 80 children, and 15 crew members were on board.
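The boundary rule above (period + space(s) + capital letter) can be sketched as a single regular expression. This is only a minimal illustration of that one rule; the names `SENTENCE_BOUNDARY` and `split_sentences` are made up for this example. The rule is naive: it would also split after abbreviations such as "Dr. Smith".

```python
import re

# Boundary rule from the text: period + whitespace + capital letter.
# The lookbehind keeps the period with the sentence it ends; the lookahead
# leaves the capital letter at the start of the next sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")

def split_sentences(text):
    """Split text wherever the boundary rule matches (hypothetical helper)."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

text = (
    "All 298 passengers on board a Malaysia Airlines plane died after the "
    "airliner crashed in eastern Ukraine, close to the border with Russia. "
    "Flight MH17 was travelling over the conflict-hit region when it "
    "disappeared from radar. A total of 283 passengers, including 80 "
    "children, and 15 crew members were on board."
)

for sentence in split_sentences(text):
    print("==>", sentence)
```

Run on the example paragraph, this prints the three sentences, one per line.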

  2, Tokenization: converting a sentence into a sequence of tokens, i.e. dividing the text into its smallest units (usually words).

*A total of 283 passengers, including 80 children, and 15 crew members were on board.

 ==>A, total, of, 283, passengers, including, 80, children, and, 15, crew, members, were, on, board (15 units)
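A minimal tokenizer sketch for the example above, assuming a token is a run of letters or digits (with internal hyphens and apostrophes kept) and that punctuation is dropped, as in the example output. The pattern and the name `tokenize` are illustrative choices; many real tokenizers keep punctuation marks as separate tokens instead.

```python
import re

# Assumption: a token is a run of letters/digits; hyphens and apostrophes
# are kept only when they sit between two such runs (e.g. "conflict-hit").
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")

def tokenize(sentence):
    """Return the word/number tokens of a sentence, dropping punctuation."""
    return TOKEN.findall(sentence)

sentence = ("A total of 283 passengers, including 80 children, "
            "and 15 crew members were on board.")
tokens = tokenize(sentence)
print(", ".join(tokens))
print(len(tokens), "units")  # 15 units
```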

  3, Part-of-speech tagging (POS tagging): classifying words into their parts of speech and labelling them accordingly.

NN   Noun, singular             VB   Verb, base form
NNS  Noun, plural               VBD  Verb, past tense
DT   Determiner                 VBG  Verb, gerund or present participle
IN   Preposition                JJ   Adjective
CC   Coordinating conjunction   CD   Cardinal number

*A, total, of, 283, passengers, including, 80, children, and, 15, crew, members, were, on, board

==>A/DT, total/NN, of/IN, 283/CD, passengers/NNS, including/VBG, 80/CD, children/NNS, and/CC, 15/CD, crew/NN, members/NNS, were/VBD, on/IN, board/NN.
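To make the tagging step concrete, here is a toy lookup tagger. It is a sketch only: `LEXICON` is a hypothetical hand-written dictionary that covers just the example sentence, digits are tagged CD, and unknown words fall back to NN. Because it ignores context, it cannot tell "board" the noun from "board" the verb; practical taggers are trained on annotated corpora and use the surrounding words.

```python
# Hypothetical hand-written lexicon covering only the example sentence.
LEXICON = {
    "a": "DT", "total": "NN", "of": "IN", "passengers": "NNS",
    "including": "VBG", "children": "NNS", "and": "CC", "crew": "NN",
    "members": "NNS", "were": "VBD", "on": "IN", "board": "NN",
}

def pos_tag(tokens):
    """Tag each token by lexicon lookup, with two simple fallback rules."""
    tagged = []
    for token in tokens:
        if token.isdigit():
            tag = "CD"  # a run of digits is a cardinal number
        else:
            tag = LEXICON.get(token.lower(), "NN")  # unknown words default to NN
        tagged.append((token, tag))
    return tagged

tokens = ["A", "total", "of", "283", "passengers", "including", "80",
          "children", "and", "15", "crew", "members", "were", "on", "board"]
print(", ".join(f"{word}/{tag}" for word, tag in pos_tag(tokens)))
```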

  4, Shallow parsing (chunking): an analysis of a sentence that identifies its constituents (noun phrase {NP}, prepositional phrase {PP}, verb phrase {VP}, etc.).

*A/DT, total/NN, of/IN, 283/CD, passengers/NNS, including/VBG, 80/CD, children/NNS, and/CC, 15/CD, crew/NN, members/NNS, were/VBD, on/IN, board/NN.

==>{A/DT, total/NN}NP, {of/IN}PP, {283/CD, passengers/NNS}NP, {including/VBG}PP, {80/CD, children/NNS}NP, {and/CC}O, {15/CD, crew/NN, members/NNS}NP, {were/VBD}VP, {on/IN}PP, {board/NN}NP.
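A rough rule-based sketch of chunking: map each POS tag to a chunk type and group consecutive tokens that share a type. The mapping `CHUNK_OF_TAG` is an assumption made for this example (in particular, VBG is treated as PP only so that "including" comes out as in the sentence above), and under it a preposition always forms its own PP chunk. Real chunkers use grammars or trained models, and this grouping would wrongly merge two adjacent noun phrases.

```python
from itertools import groupby

# Assumed tag-to-chunk mapping; any tag not listed is left outside a chunk ("O").
CHUNK_OF_TAG = {
    "DT": "NP", "CD": "NP", "JJ": "NP", "NN": "NP", "NNS": "NP",
    "IN": "PP", "VBG": "PP",  # VBG as PP is a simplification for "including"
    "VB": "VP", "VBD": "VP",
}

def chunk(tagged):
    """Group consecutive (word, tag) pairs that map to the same chunk type."""
    def chunk_type(pair):
        return CHUNK_OF_TAG.get(pair[1], "O")
    return [(label, list(group)) for label, group in groupby(tagged, key=chunk_type)]

tagged = [("A", "DT"), ("total", "NN"), ("of", "IN"), ("283", "CD"),
          ("passengers", "NNS"), ("including", "VBG"), ("80", "CD"),
          ("children", "NNS"), ("and", "CC"), ("15", "CD"), ("crew", "NN"),
          ("members", "NNS"), ("were", "VBD"), ("on", "IN"), ("board", "NN")]

chunks = chunk(tagged)
print(", ".join(
    "{" + ", ".join(f"{w}/{t}" for w, t in words) + "}" + label
    for label, words in chunks
))
```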

  5, Named entity recognition (Optional)

*Word Sense Disambiguation (WSD): semantic analysis

  *Words can have multiple distinct meanings, or senses:

    *How many apples do you have? (apple the fruit, or Apple products?)

  *WSD: determine which of the senses of an ambiguous word is invoked in a particular use of the word.

  *Approaches

    *Dictionary-based word sense disambiguation

    *Supervised word sense disambiguation

    *Unsupervised word sense disambiguation
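As an illustration of the dictionary-based approach, here is a sketch in the spirit of the simplified Lesk algorithm: pick the sense whose dictionary gloss shares the most words with the context. The sense inventory `SENSES`, its glosses, and the stopword list are all invented for this example; a real system would take glosses from a dictionary or a lexical database such as WordNet.

```python
import re

# Hypothetical two-sense inventory; the glosses are written for illustration.
SENSES = {
    "apple": {
        "fruit": "round fruit of a tree that you eat, with red or green skin",
        "company": "technology company that makes computers phones and tablets",
    }
}

# Small illustrative stopword list; these words carry no sense information.
STOPWORDS = {"a", "an", "the", "of", "that", "you", "with", "or", "and",
             "do", "have", "how", "many", "i", "to", "my", "on", "in"}

def content_words(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def disambiguate(word, context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_words = content_words(context)
    overlaps = {
        sense: len(content_words(gloss) & context_words)
        for sense, gloss in SENSES[word].items()
    }
    return max(overlaps, key=overlaps.get)  # ties go to the first sense listed

print(disambiguate("apple", "I picked a red apple from the tree to eat"))  # fruit
print(disambiguate("apple", "Apple makes phones and computers"))           # company
```

A sentence like "How many apples do you have?" shares no words with either gloss, so the overlap is zero for both senses and this sketch simply falls back to the first one; more context is needed to disambiguate it.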

posted on 2016-04-26 12:28  yeatschen
