SentiAnalysis

“Opinion Lexicon” Based

1. Refer to “Mining Twitter for Airline Consumer Sentiment” (1)

Load Twitter data into R with the twitteR package

Load Hu and Liu’s “opinion lexicon” (2), which classifies nearly 6,800 words as positive or negative

Sentiment scoring algorithm: score each tweet as the number of positive lexicon matches minus the number of negative matches (a minimal sketch follows these steps)

Set score thresholds to map each tweet’s score to a level: positive, neutral, or negative
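Below is a minimal Python sketch of this scoring scheme (the referenced article uses R). It assumes the two word-list files from the opinion lexicon (2), positive-words.txt and negative-words.txt, have been extracted to the working directory; the threshold values are illustrative, not taken from the original code.

```python
import re

def load_lexicon(path):
    # Skip the ';' comment/header lines at the top of Hu & Liu's lexicon files.
    with open(path, encoding="latin-1") as f:
        return {line.strip() for line in f if line.strip() and not line.startswith(";")}

positive_words = load_lexicon("positive-words.txt")   # from the opinion lexicon (2)
negative_words = load_lexicon("negative-words.txt")

def score_sentiment(text):
    # Positive matches minus negative matches, as in the referenced R approach.
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in positive_words for w in words) - sum(w in negative_words for w in words)

def to_level(score, pos_threshold=1, neg_threshold=-1):
    # Threshold values are illustrative; tune them for the data at hand.
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"

print(to_level(score_sentiment("The flight was great, crew was friendly")))   # likely "positive"
print(to_level(score_sentiment("Delayed again, terrible customer service")))  # likely "negative"
```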

- Advantage

R code already exists, so we can use the method in R directly or port it to Python easily.

A quick and effective method for a demo.

Easy to extend by changing the opinion lexicon.

- Disadvantage

A simple method; it cannot handle more complex sentiment problems.

(1)http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment

(2)http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

 

2. SentiStrength, http://sentistrength.wlv.ac.uk/

SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language.

See the two papers below; brief notes on them are recorded later in this post.

Thelwall, M., Buckley, K., Paltoglou, G. Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544–2558.

Thelwall, M., Buckley, K., & Paltoglou, G. (2012). Sentiment strength detection for the social Web. Journal of the American Society for Information Science and Technology, 63(1), 163–173.

 

Classification Method

- Refer to “Twitter Sentiment”, http://twittersentiment.appspot.com/

Collect training data consisting of tweets with emoticons (distant supervision); see http://www.stanford.edu/~alecmgo/papers/TwitterDistantSupervision09.pdf

Positive: :)  :-)  : )  :D  =)

Negative: :(  :-(  : (

Machine learning classifiers: Naive Bayes, Maximum Entropy, and SVM.

Feature extractors: unigrams, bigrams, unigrams + bigrams, and unigrams with part-of-speech tags (a minimal training sketch follows).
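A minimal sketch of this distant-supervision pipeline is shown below, using scikit-learn rather than the original authors’ code; the toy tweets, label names, and helper functions are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

POS_EMOTICONS = (":)", ":-)", ": )", ":D", "=)")
NEG_EMOTICONS = (":(", ":-(", ": (")

def distant_label(tweet):
    # Label a tweet by its emoticons (distant supervision); None if ambiguous or absent.
    pos = any(e in tweet for e in POS_EMOTICONS)
    neg = any(e in tweet for e in NEG_EMOTICONS)
    if pos == neg:
        return None
    return "positive" if pos else "negative"

def strip_emoticons(tweet):
    # Remove the emoticons so the classifier cannot simply memorise the labels.
    for e in POS_EMOTICONS + NEG_EMOTICONS:
        tweet = tweet.replace(e, " ")
    return tweet

# Toy corpus; in practice the tweets would come from the open corpus linked below.
raw_tweets = [
    "loving the new seats on this flight :)",
    "stuck on the tarmac for two hours :(",
    "great crew today :-)",
    "lost my luggage again :-(",
]
labelled = [(strip_emoticons(t), distant_label(t)) for t in raw_tweets if distant_label(t)]
texts, labels = zip(*labelled)

# Unigram + bigram features with a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["my flight got cancelled again"]))  # classify an unseen tweet
```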

- Advantage

Training data is collected automatically rather than labeled by humans

Open corpus: http://www.stanford.edu/~alecmgo/cs224n/trainingandtestdata.zip

A more advanced method based on machine learning techniques

- Disadvantage

Does not include a neutral class

No open source code

Takes a long time to build and debug the classification model

 

 

 

Sentiment Strength Detection for the Social Web

 

Sentiment analysis

The two most common sentiment analysis tasks are subjectivity and polarity detection. The former predicts whether a given text is subjective or not and the latter predicts whether a subjective text is positive or negative overall. Less common is sentiment strength detection, which predicts the strength of positive or negative sentiment within a text.

These are the three sentiment analysis tasks; the notes below mainly concern techniques related to polarity detection.

 

A common approach for sentiment analysis is to select a machine learning algorithm and a method of extracting features from texts and then train the classifier with a human-coded corpus. The features used are typically words but can also be stemmed words or part-of-speech tagged words, and also may be combined into bigrams (e.g., two consecutive words) and trigrams (Pang & Lee, 2008).

 

An alternative polarity detection method is to identify the likely average polarity of words within texts by estimating how often they co-occur with a set of seed words of known and unambiguous sentiment (e.g., good, terrible), typically using web search engines to estimate relative co-occurrence frequencies (Turney, 2002).

Start with a seed set of words whose sentiment is already known. We could run sentiment analysis using only these words, but words outside the seed set also carry some sentiment, so it is more reasonable to take them into account as well. To estimate the sentiment strength of another word, measure how often it co-occurs with the obvious sentiment words: a high co-occurrence frequency suggests the word carries the same sentiment. The co-occurrence frequencies are estimated with a web search engine. (A simplified sketch appears after the advantages below.)

This method is quite inspiring; its advantages are as follows:

This approach needs relatively little lexical input knowledge.

This approach is flexible for different domains in the sense that a small set of initial general keywords can be used to generate a different lexicon for each application domain.
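The co-occurrence idea can be sketched roughly as follows. Turney’s method estimates co-occurrence from web search hit counts; the sketch below substitutes a tiny local corpus with window-based counts, so the seed words, documents, and the simplified PMI-style score are illustrative only.

```python
import math
from collections import Counter

POSITIVE_SEEDS = {"good", "excellent"}
NEGATIVE_SEEDS = {"bad", "terrible"}

def co_occurrence_counts(docs, window=10):
    # Count how often each word appears within `window` words of a seed word.
    near_pos, near_neg = Counter(), Counter()
    for doc in docs:
        words = doc.lower().split()
        for i, w in enumerate(words):
            context = set(words[max(0, i - window):i] + words[i + 1:i + window + 1])
            if context & POSITIVE_SEEDS:
                near_pos[w] += 1
            if context & NEGATIVE_SEEDS:
                near_neg[w] += 1
    return near_pos, near_neg

def semantic_orientation(word, near_pos, near_neg):
    # PMI-style score: positive if the word co-occurs more with positive seeds.
    # Add-one smoothing avoids log(0) for unseen combinations.
    return math.log2((near_pos[word] + 1) / (near_neg[word] + 1))

docs = [
    "the crew was good and the food was excellent",
    "terrible delay and bad service at the gate",
    "friendly staff and an excellent boarding experience",
]
near_pos, near_neg = co_occurrence_counts(docs)
print(semantic_orientation("friendly", near_pos, near_neg))   # > 0: leans positive
print(semantic_orientation("delay", near_pos, near_neg))      # < 0: leans negative
```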

 

Sentiment Strength Detection in Short Informal Text

 

The SentiStrength Sentiment Strength Detection Algorithm

The SentiStrength emotion detection algorithm was developed on an initial set of 2,600 MySpace classifications used for pilot testing. The key elements of SentiStrength are listed below; a much simplified scoring sketch follows the list.

  • The core of the algorithm is the sentiment word strength list. This is a collection of 298 positive terms and 465 negative terms classified for either positive or negative sentiment strength with a value from 2 to 5. The default classifications are based upon human judgements during the development stage, with automatic modification occurring later during the training phase (see below). Following LIWC, some of the words include wild cards (e.g., xx*, which matches any number ≥2 of consecutive xs). Some terms are standard English words and others are non-standard but common in MySpace (e.g., luv, xox, lol, haha, muah). The emotion strength is specific to the contexts in which the words tend to be used in MySpace. For example, “love” was originally classified as strength 4 positive but was reduced to strength 3 due to many casual uses such as “Just showin love 2 ur page”. Some of the words explicitly express emotion, such as “love” or “hate”, but others, normally given a weak strength 2, are indirectly associated with positive or negative contexts (e.g., appreciate, help, birthday). The SentiStrength algorithm includes procedures (described below) to fine-tune the sentiment strengths using a set of training data.
  • The above default manual word strengths are modified by a training algorithm to optimise the sentiment word strengths. This algorithm starts with the baseline human-allocated term strengths for the predefined list and then for each term assesses whether an increase or decrease of the strength by 1 would increase the accuracy of the classifications. Any change that increases the overall accuracy by at least 2 is kept. The minimum increase could also be set to 1, which would risk over-fitting, whereas 2 risks losing useful changes to rare words. Here 2 was selected to make the algorithm run faster, due to fewer changes, rather than for any theoretical reason (in fact the algorithm worked better on the test data with 1, as the results show). The algorithm tests all words in the sentiment list in random order and is repeated until all words have been checked without their strengths being changed.
  • The word “miss” was allocated a positive and negative strength of 2. This was the only word classed as both positive and negative. It was typically used in the phrase “I miss you”, suggesting both sadness and love.
  • A spelling correction algorithm identifies the standard spellings of words that have been misspelled by the inclusion of repeated letters. For example, hellllloooo would be identified as “hello” by this algorithm. The algorithm (a) automatically deletes repeated letters above twice (e.g., helllo -> hello); (b) deletes repeated letters occurring twice for letters rarely occurring twice in English (e.g., niice -> nice); and (c) deletes letters occurring twice if the word is not a standard word but would form a standard word if one were deleted (e.g., nnice -> nice but not hoop -> hop nor baaz -> baz). Formal spelling correction algorithms (see Pollock & Zamora, 1984) were tried but not used as they made very few corrections and had problems with names and slang.
  • A booster word list contains words that boost or reduce the emotion of subsequent words, whether positive or negative. Each word increases emotion strength by 1 or 2 (e.g., very, extremely) or decreases it by 1 (e.g., some).
  • A negating word list contains words that invert subsequent emotion words (including any preceding booster words). For example, if “very happy” had positive strength 4 then “not very happy” would have negative strength 4. The possibility that some negating terms do not negate was not incorporated as this did not seem to occur often in the pilot data set.
  • Repeated letters above those needed for correct spelling are used to give a strength boost of 1 to emotion words, as long as there are at least two additional letters. The use of repeated letters is a common device for expressing emotion or energy in MySpace comments, but one repeated letter often appeared to be a typing error.
  • An emoticon list with associated strengths (positive or negative 2) supplements the sentiment word strength list (and punctuation included in emoticons is not processed further for the purposes below).
  • Any sentence with an exclamation mark was allocated a minimum positive strength of 2.
  • Repeated punctuation including at least one exclamation mark gives a strength boost of 1 to the immediately preceding emotion word (or sentence).
  • Negative emotion was ignored in questions. For example, the question “are you angry?” would be classified as not containing sentiment, despite the presence of the word “angry”. This was not applied to positive sentiment because many question sentences appeared to contain mild positive sentiment. In particular, sentences like “whats up?” were typically classified as containing mild positive sentiment (strength 2).
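The elements above can be combined into a much simplified toy scorer, sketched below. This is not the actual SentiStrength implementation: the mini word, emoticon, booster, and negation lists are illustrative stand-ins for the lists described above, the spelling correction approximates only rules (a) and (c), and question handling, punctuation repetition, and the training phase are omitted.

```python
import re

# Illustrative mini-lexicons; the real SentiStrength lists contain 298 positive and
# 465 negative terms with strengths from 2 to 5, plus larger emoticon/booster lists.
WORD_STRENGTH = {"love": 3, "happy": 4, "hate": -4, "angry": -3, "help": 2}
EMOTICON_STRENGTH = {":)": 2, ":(": -2}
BOOSTERS = {"very": 1, "extremely": 2, "some": -1}
NEGATIONS = {"not", "never", "don't"}

def normalise(word):
    # Simplified spelling correction: collapse letters repeated 3+ times to 2 (rule a);
    # if the result is still unknown, try collapsing doubled letters (roughly rule c).
    # Also report whether enough extra letters were removed to count as emphasis.
    collapsed = re.sub(r"(.)\1{2,}", r"\1\1", word)
    emphatic = len(word) - len(collapsed) >= 2
    if collapsed not in WORD_STRENGTH:
        singled = re.sub(r"(.)\1", r"\1", collapsed)
        if singled in WORD_STRENGTH:
            collapsed = singled
    return collapsed, emphatic

def sentence_strengths(sentence):
    # Return (positive, negative) strengths on 1..5 scales, 1 meaning "no sentiment".
    pos, neg = 1, 1
    booster, negated = 0, False
    for raw in sentence.lower().split():
        if raw in EMOTICON_STRENGTH:
            s = EMOTICON_STRENGTH[raw]
        else:
            word, emphatic = normalise(raw.strip(".,!?"))
            if word in BOOSTERS:
                booster += BOOSTERS[word]
                continue
            if word in NEGATIONS:
                negated = True
                continue
            if word not in WORD_STRENGTH:
                continue
            s = WORD_STRENGTH[word]
            s += booster if s > 0 else -booster   # boosters amplify either polarity
            if emphatic:
                s += 1 if s > 0 else -1           # repeated letters add emphasis
            if negated:
                s = -s                            # negation inverts the polarity
        booster, negated = 0, False
        if s > 0:
            pos = max(pos, min(s, 5))
        else:
            neg = max(neg, min(-s, 5))
    if "!" in sentence:
        pos = max(pos, 2)                         # exclamation mark: minimum positive 2
    return pos, neg

print(sentence_strengths("i loooove you!!"))              # emphatic positive: (4, 1)
print(sentence_strengths("not very happy with this :("))  # negated, boosted negative: (1, 5)
```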

 

Some additional modifications were added to SentiStrength but subsequently rejected after additional testing, or were found to be impractical.

  • Phrase identification was not extensively used except for a few frequent examples found in the initial 2,600 development comments. Although idiomatic phrases were common, their variety was such that it did not seem practical to systematically identify them. Future work could perhaps identify booster phrases like “so much” and “a lot”, and use phrase identification to separate weak uses of the word “love” from stronger uses, such as “I love you”.
  • Semantic disambiguation was not used for ambiguous words because of the problems caused by highly non-standard grammar. This could potentially improve the algorithm but would require considerable computational effort. For example, the word “rock” was sometimes strongly positive (e.g., you rock!!!) and sometimes neutral (e.g., do you listen to rock music?).
