
Language Model

Posted on 2017-08-16 15:07 by Emma_zha

## 20170801
## Notes for lec2-2.pdf: language models

Evaluating a Language Model

Intuition about Perplexity

Evaluating N‐grams with Perplexity
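For reference (the slides' formula is not reproduced in these notes, so this is the textbook definition): a model is evaluated by the perplexity of held-out text. For a test set $W = w_1 w_2 \ldots w_N$,

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$

so a lower perplexity on held-out data indicates a better model; for a bigram model the history reduces to just $w_{i-1}$.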

Sparsity is Always a Problem
    Dealing with Sparsity
        General approach: modify observed counts to improve estimates
            – Discounting: allocate probability mass for unobserved events by discounting the counts of observed events
            – Interpolation: approximate the counts of an N-gram using a combination of estimates from related, denser histories
            – Back-off: approximate the counts of an unobserved N-gram based on the proportion of back-off events (e.g., the (N−1)-gram)

            Add-One Smoothing
                • We have V words in the vocabulary; N is the number of words in the training set
                • Smooth the observed counts by adding one to every count and renormalizing
                – Unigram case: $P_{\text{add-1}}(w_i) = \dfrac{\text{count}(w_i) + 1}{N + V}$
                – Bigram case: $P_{\text{add-1}}(w_i \mid w_{i-1}) = \dfrac{\text{count}(w_{i-1}, w_i) + 1}{\text{count}(w_{i-1}) + V}$
                • More general case: add-α, where α is added instead of one (a small code sketch follows below)
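A minimal runnable sketch of add-α bigram estimation (the function and variable names are my own, not from the slides); setting alpha=1.0 gives add-one smoothing:

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams from tokenized sentences (lists of words)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_add_alpha(w, prev, unigrams, bigrams, vocab_size, alpha=1.0):
    """Add-alpha estimate of P(w | prev); alpha=1 is add-one smoothing."""
    return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * vocab_size)

# Toy usage
sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_counts(sents)
V = len(uni)  # vocabulary size (here including <s> and </s>)
print(p_add_alpha("cat", "the", uni, bi, V, alpha=1.0))
```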

            Linear Interpolation
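The standard formulation (the slide's own formula is not reproduced in these notes) mixes the higher-order estimate with lower-order ones using weights that sum to one, e.g. for trigrams:

$$\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1, \; \lambda_j \ge 0$$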

            Tuning Hyperparameters
                • Both add-α and linear interpolation have hyperparameters
                • The selection of their values is crucial for smoothing performance
                • Their values are tuned to maximize the likelihood of held-out data (a grid-search sketch follows below)
                – For linear interpolation, we will use EM to find the optimal parameters (in a few lectures)
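A hedged sketch of the held-out tuning idea, reusing p_add_alpha and the counts uni, bi, V from the add-α sketch above: try a small grid of candidate α values and keep the one with the highest held-out log-likelihood (the EM procedure for the interpolation weights is deferred to the later lectures, as noted):

```python
import math

def heldout_log_likelihood(heldout_sents, unigrams, bigrams, vocab_size, alpha):
    """Sum of log P(w | prev) over a held-out corpus under the add-alpha model."""
    ll = 0.0
    for sent in heldout_sents:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            ll += math.log(p_add_alpha(w, prev, unigrams, bigrams, vocab_size, alpha))
    return ll

# Pick the alpha with the best held-out log-likelihood (toy held-out set).
heldout = [["the", "cat", "ran"]]
best_alpha = max([0.01, 0.1, 0.5, 1.0],
                 key=lambda a: heldout_log_likelihood(heldout, uni, bi, V, a))
print(best_alpha)
```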


            Kneser‐Ney Smoothing
                • Observed N-grams occur more often in the training data than in new data
                • Absolute discounting: count*(x) = count(x) − d
                $P_{\text{ad}}(w_i \mid w_{i-1}) = \dfrac{\text{count}(w_{i-1}, w_i) - d}{\text{count}(w_{i-1})} + \alpha \, \hat{P}(w_i)$
                • Distribute the remaining mass based on the skewness in the distribution of the lower-order N-gram (i.e., the number of words it can follow):
                $\hat{P}(w_i) \propto \left| \{ w_{i-1} : \text{count}(w_{i-1}, w_i) > 0 \} \right|$
                • Kneser-Ney has repeatedly been shown to be a very successful estimator (a code sketch follows below)
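A minimal sketch of the estimator above (absolute discounting of bigram counts, interpolated with a Kneser-Ney continuation unigram), assuming the Counter-based counts from the add-α sketch earlier; the discount d = 0.75 is a common illustrative choice, not a value from the slides:

```python
from collections import Counter

def kneser_ney_bigram(w, prev, unigrams, bigrams, d=0.75):
    """Interpolated Kneser-Ney-style bigram estimate: absolute discounting of the
    bigram count plus a continuation unigram P_hat(w) proportional to the number
    of distinct words that w can follow."""
    # Continuation counts over distinct bigram types (recomputed here for simplicity).
    continuation = Counter(w2 for (_, w2) in bigrams)
    p_hat = continuation[w] / len(bigrams)  # sums to 1 over the vocabulary

    c_prev = unigrams[prev]
    if c_prev == 0:
        return p_hat  # unseen history: fall back to the continuation distribution

    discounted = max(bigrams[(prev, w)] - d, 0.0) / c_prev
    # alpha(prev): the probability mass freed by discounting, spread over P_hat.
    n_followers = sum(1 for (w1, _) in bigrams if w1 == prev)
    alpha = d * n_followers / c_prev
    return discounted + alpha * p_hat

# Toy usage with the counts built in the add-alpha sketch above:
# print(kneser_ney_bigram("cat", "the", uni, bi))
```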