Natural Language Processing
*How do we process human language?
*Human speech; web documents containing human language;
*Natural Language Processing (NLP)
*Morphological/syntactic analysis
*Semantic analysis
*Advanced NLP technology
*Related NLP fields:
*Web Content Mining -- Analyzing web content;
*Speech Processing -- Processing and analysing spoken language;
*Information extraction -- Extracting semantic information from text;
*Document Classification/Clustering -- Classifying/clustering web documents;
*Basic steps of NLP:
1, Sentence splitting: dividing a string of written language into its component sentences. A simple rule: sentence boundary = period + space(s) + capital letter.
eg: All 298 passengers on board a Malaysia Airlines plane died after the airliner crashed in eastern Ukraine, close to the border with Russia. Flight MH17 was travelling over the conflict-hit region when it disappeared from radar. A total of 283 passengers, including 80 children, and 15 crew members were on board.
==> All 298 passengers on board a Malaysia Airlines plane died after the airliner crashed in eastern Ukraine, close to the border with Russia.
==> Flight MH17 was travelling over the conflict-hit region when it disappeared from radar.
==> A total of 283 passengers, including 80 children, and 15 crew members were on board.
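The boundary rule above (period + space(s) + capital letter) can be sketched with a regular expression. This is a minimal illustration, not a production splitter; the names `SENTENCE_BOUNDARY` and `split_sentences` are made up for this example.

```python
import re

# Boundary rule from the notes: period + space(s) + capital letter.
# The lookbehind keeps the period with the sentence on its left;
# the lookahead leaves the capital letter with the sentence on its right.
SENTENCE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def split_sentences(text):
    """Split text wherever a period is followed by whitespace and a capital."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


text = (
    "All 298 passengers on board a Malaysia Airlines plane died after the "
    "airliner crashed in eastern Ukraine, close to the border with Russia. "
    "Flight MH17 was travelling over the conflict-hit region when it "
    "disappeared from radar. A total of 283 passengers, including 80 "
    "children, and 15 crew members were on board."
)

sentences = split_sentences(text)
for sentence in sentences:
    print("==>", sentence)
```

The rule is deliberately naive: it would also split after an abbreviation such as "Dr." when a capitalised name follows, which is why practical splitters add abbreviation lists or trained models.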
2, Tokenization: converting a sentence into a sequence of tokens, i.e. dividing the text into its smallest units (usually words).
*A total of 283 passengers, including 80 children, and 15 crew members were on board.
==>A, total, of, 283, passengers, including, 80, children, and, 15, crew, members, were, on, board (15 units)
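A rough sketch of word-level tokenization for the example sentence, assuming that punctuation is simply discarded (as in the output above). The function name `tokenize` and the regular expression are illustrative choices, not a standard.

```python
import re


def tokenize(sentence):
    """Return runs of letters/digits, allowing internal hyphens or apostrophes.

    Punctuation such as commas and the final period is dropped.
    """
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence)


tokens = tokenize(
    "A total of 283 passengers, including 80 children, "
    "and 15 crew members were on board."
)
print(", ".join(tokens))
print(len(tokens), "units")
```

Many tokenizers keep punctuation marks as separate tokens instead of dropping them; which choice is right depends on the later processing steps.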
3, Part-of-speech (POS) tagging: classifying words into their parts of speech and labelling them accordingly.
| Tag | Meaning | Tag | Meaning |
| --- | --- | --- | --- |
| NN | Noun, singular | VB | Verb, base form |
| NNS | Noun, plural | VBD | Verb, past tense |
| DT | Determiner | VBG | Verb, gerund or present participle |
| IN | Preposition | JJ | Adjective |
| CC | Coordinating conjunction | CD | Cardinal number |
*A, total, of, 283, passengers, including, 80, children, and, 15, crew, members, were, on, board
==>A/DT, total/NN, of/IN, 283/CD, passengers/NNS, including/VBG, 80/CD, children/NNS, and/CC, 15/CD, crew/NN, members/NNS, were/VBD, on/IN, board/NN.
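To make the tagging step concrete, here is a toy lookup tagger. It is only a sketch: `TOY_LEXICON` is a hypothetical hand-written dictionary that covers just the example sentence, whereas real taggers are trained on annotated corpora and use context to resolve ambiguous words.

```python
# Hypothetical toy lexicon covering only the example sentence.
TOY_LEXICON = {
    "a": "DT", "total": "NN", "of": "IN", "passengers": "NNS",
    "including": "VBG", "children": "NNS", "and": "CC", "crew": "NN",
    "members": "NNS", "were": "VBD", "on": "IN", "board": "NN",
}


def tag(tokens):
    """Tag digits as CD, known words from the lexicon, anything else as NN."""
    tagged = []
    for token in tokens:
        if token.isdigit():
            tagged.append((token, "CD"))
        else:
            # Falling back to NN for unknown words is an assumption of this sketch.
            tagged.append((token, TOY_LEXICON.get(token.lower(), "NN")))
    return tagged


tokens = ["A", "total", "of", "283", "passengers", "including", "80",
          "children", "and", "15", "crew", "members", "were", "on", "board"]
tagged = tag(tokens)
print(", ".join(f"{word}/{pos}" for word, pos in tagged))
```

A lookup table cannot handle words whose tag depends on context (for example "board" as a noun versus a verb), which is the main reason statistical taggers are used in practice.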
4, Shallow parsing (chunking): an analysis of a sentence that identifies its constituents (noun phrase {NP}, prepositional phrase {PP}, verb phrase {VP}, etc.).
*A/DT, total/NN, of/IN, 283/CD, passengers/NNS, including/VBG, 80/CD, children/NNS, and/CC, 15/CD, crew/NN, members/NNS, were/VBD, on/IN, board/NN.
==>{A/DT, total/NN}NP, {of/IN}PP, {283/CD, passengers/NNS}NP, {including/VBG}PP, {80/CD, children/NNS}NP, {and/CC}O, {15/CD, crew/NN, members/NNS}NP, {were/VBD}VP, {on/IN}PP, {board/NN}NP.
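A minimal rule-based chunker can illustrate the idea: map each tagged token to a chunk type, then merge adjacent tokens of the same type. The tag sets and the `PREPOSITION_LIKE` word list are assumptions made for this sketch ("including" is tagged VBG but acts like a preposition here); practical chunkers use grammars or trained models.

```python
from itertools import groupby

# Assumed tag-to-chunk rules for this sketch only.
NP_TAGS = {"DT", "CD", "JJ", "NN", "NNS"}
VP_TAGS = {"VB", "VBD", "VBG"}
PREPOSITION_LIKE = {"including"}  # VBG words treated as prepositions


def chunk_type(word, pos):
    """Map one tagged token to NP, PP, VP, or O (outside any chunk)."""
    if pos == "IN" or word.lower() in PREPOSITION_LIKE:
        return "PP"
    if pos in NP_TAGS:
        return "NP"
    if pos in VP_TAGS:
        return "VP"
    return "O"


def chunk(tagged):
    """Group maximal runs of adjacent tokens that share a chunk type."""
    return [(label, list(group))
            for label, group in groupby(tagged, key=lambda wt: chunk_type(*wt))]


tagged = [("A", "DT"), ("total", "NN"), ("of", "IN"), ("283", "CD"),
          ("passengers", "NNS"), ("including", "VBG"), ("80", "CD"),
          ("children", "NNS"), ("and", "CC"), ("15", "CD"), ("crew", "NN"),
          ("members", "NNS"), ("were", "VBD"), ("on", "IN"), ("board", "NN")]

chunks = chunk(tagged)
print(", ".join(
    "{" + ", ".join(f"{w}/{t}" for w, t in group) + "}" + label
    for label, group in chunks
))
```

Under these rules every preposition forms its own PP chunk, so on/IN is separated from board/NN in the same way that of/IN is separated from the noun phrase that follows it.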
5, Named entity recognition (Optional)
*Semantic analysis: Word Sense Disambiguation (WSD)
*Words can have multiple distinct meanings, or senses:
*How many apples do you have? (apple the fruit, or an Apple product?)
*WSD: determining which sense of an ambiguous word is invoked in a particular use of the word.
*Approaches
*Dictionary-based word sense disambiguation
*Supervised word sense disambiguation
*Unsupervised word sense disambiguation
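As an illustration of the dictionary-based approach, the sketch below picks the sense whose dictionary gloss shares the most words with the surrounding context (a simplified, Lesk-style overlap count). The glosses in `SENSES` and the stop-word list are invented for this example and are not taken from any real dictionary.

```python
import re

# Hypothetical mini-dictionary: glosses are invented for illustration.
SENSES = {
    "apple": {
        "fruit": "round sweet fruit that grows on a tree and is eaten",
        "company": "technology company that makes phone and computer products",
    }
}

# Small assumed stop-word list so that function words do not count as overlap.
STOP_WORDS = {"a", "an", "the", "that", "and", "is", "on", "do", "you",
              "i", "my", "of", "to", "in", "how", "many", "have"}


def content_words(text):
    """Lower-case the text and return its set of non-stop-word tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def disambiguate(word, context):
    """Return the sense whose gloss overlaps most with the context words."""
    context_set = content_words(context)
    overlaps = {sense: len(content_words(gloss) & context_set)
                for sense, gloss in SENSES[word].items()}
    # On a tie, max() returns the first sense listed in the dictionary.
    return max(overlaps, key=overlaps.get)


print(disambiguate("apple", "I picked a sweet apple from the tree"))
print(disambiguate("apple", "My new apple phone replaced my old computer"))
```

A sentence such as "How many apples do you have?" gives no overlapping context words for either sense, which shows why purely dictionary-based methods struggle with short contexts.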