
Language Model

Posted on 2017-08-16 15:07 by Emma_zha

## 20170801
## Notes for lec2-2.pdf: language models

Evaluating a Language Model

Intuition about Perplexity

Evaluating N‐grams with Perplexity
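For reference (the slides' formula is not reproduced in these notes, so this is the textbook definition): a model is evaluated by the perplexity of held-out text. For a test set $W = w_1 w_2 \ldots w_N$,

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$

so a lower perplexity on held-out data indicates a better model; for a bigram model the history reduces to just $w_{i-1}$.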

Sparsity is Always a Problem
    Dealing with Sparsity
        General approach: modify observed counts to improve estimates
            – Discounting: allocate probability mass for unobserved events by discounting the counts of observed events
            – Interpolation: approximate the counts of an N-gram using a combination of estimates from related, denser histories
            – Back-off: approximate the counts of an unobserved N-gram based on the proportion of back-off events (e.g., the (N−1)-gram)

            Add-One Smoothing
                • We have V words in the vocabulary; N is the number of words in the training set
                • Smooth the observed counts by adding one to every count and renormalizing
                – Unigram case: $P_{\text{add-1}}(w_i) = \dfrac{\text{count}(w_i) + 1}{N + V}$
                – Bigram case: $P_{\text{add-1}}(w_i \mid w_{i-1}) = \dfrac{\text{count}(w_{i-1}, w_i) + 1}{\text{count}(w_{i-1}) + V}$
                • More general case: add-α, where α is added instead of one (a small code sketch follows below)
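A minimal runnable sketch of add-α bigram estimation (the function and variable names are my own, not from the slides); setting alpha=1.0 gives add-one smoothing:

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams from tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_add_alpha(w, prev, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(w | prev); alpha=1 is add-one smoothing."""
    return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * vocab_size)

# Toy usage
sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_counts(sents)
V = len(uni)  # vocabulary size (here including <s> and </s>)
print(p_add_alpha("cat", "the", uni, bi, V, alpha=1.0))
```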

            Linear Interpolation
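The standard formulation (the slide's own formula is not reproduced in these notes) mixes the higher-order estimate with lower-order ones using weights that sum to one, e.g. for trigrams:

$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1, \; \lambda_j \ge 0$$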

            Tuning Hyperparameters
                • Both add-α and linear interpolation have hyperparameters
                • The selection of their values is crucial for smoothing performance
                • Their values are tuned to maximize the likelihood of held-out data (a grid-search sketch follows below)
                – For linear interpolation, we will use EM to find the optimal parameters (in a few lectures)
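A hedged sketch of the held-out tuning idea, reusing p_add_alpha and the counts uni, bi, V from the add-α sketch above: try a small grid of candidate α values and keep the one with the highest held-out log-likelihood (the EM procedure for the interpolation weights is deferred to the later lectures, as noted):

```python
import math

def heldout_log_likelihood(heldout_sents, unigrams, bigrams, vocab_size, alpha):
    """Sum of log P(w | prev) over a held-out corpus under the add-alpha model."""
    ll = 0.0
    for sent in heldout_sents:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            ll += math.log(p_add_alpha(w, prev, unigrams, bigrams, vocab_size, alpha))
    return ll

# Pick the alpha with the best held-out log-likelihood (toy held-out set).
heldout = [["the", "cat", "ran"]]
best_alpha = max([0.01, 0.1, 0.5, 1.0],
                 key=lambda a: heldout_log_likelihood(heldout, uni, bi, V, a))
print(best_alpha)
```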


            Kneser‐Ney Smoothing
                • Observed N-grams occur more often in the training data than in new data
                • Absolute discounting: count*(x) = count(x) − d
                $P_{\text{ad}}(w_i \mid w_{i-1}) = \dfrac{\text{count}(w_{i-1}, w_i) - d}{\text{count}(w_{i-1})} + \alpha \, \hat{P}(w_i)$
                • Distribute the remaining mass based on the skewness in the distribution of the lower-order N-gram (i.e., the number of words it can follow):
                $\hat{P}(w_i) \propto \left| \{ w_{i-1} : \text{count}(w_{i-1}, w_i) > 0 \} \right|$
                • Kneser-Ney has repeatedly been shown to be a very successful estimator (a code sketch follows below)
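A minimal sketch of the estimator above (absolute discounting of bigram counts, interpolated with a Kneser-Ney continuation unigram), assuming the Counter-based counts from the add-α sketch earlier; the discount d = 0.75 is a common illustrative choice, not a value from the slides:

```python
from collections import Counter

def kneser_ney_bigram(w, prev, unigrams, bigrams, d=0.75):
    """Interpolated Kneser-Ney-style bigram estimate: absolute discounting of the
    bigram count plus a continuation unigram P_hat(w) proportional to the number
    of distinct words that w can follow."""
    # Continuation counts over distinct bigram types (recomputed here for simplicity).
    continuation = Counter(w2 for (_, w2) in bigrams)
    p_hat = continuation[w] / len(bigrams)  # sums to 1 over the vocabulary

    c_prev = unigrams[prev]
    if c_prev == 0:
        return p_hat  # unseen history: fall back to the continuation distribution

    discounted = max(bigrams[(prev, w)] - d, 0.0) / c_prev
    # alpha(prev): the probability mass freed by discounting, spread over P_hat.
    n_followers = sum(1 for (w1, _) in bigrams if w1 == prev)
    alpha = d * n_followers / c_prev
    return discounted + alpha * p_hat

# Toy usage with the counts built in the add-alpha sketch above:
# print(kneser_ney_bigram("cat", "the", uni, bi))
```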