# A Brief Introduction to Naive Bayes Text Classification

The goal of classification is to take a single observation, extract some useful
features, and thereby classify the observation into one of a set of discrete classes.

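Let d_i denote the i-th document and c_i the i-th class. The goal is to find a classifier that, when handed a new document d, outputs the class c that d most likely belongs to.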

①Generative classifier

②Discriminative classifier

Generative classifiers like naive Bayes build a model of each class. Given an observation, they return the class most likely to have generated the observation.
Discriminative classifiers like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes.

We represent a text document as if it were a bag-of-words,
that is, an unordered set of words with their position ignored, keeping only their frequency in the document.
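As a minimal sketch of the bag-of-words idea, the snippet below maps a document to word frequencies with `collections.Counter`. The helper name `bag_of_words`, the lowercasing, the whitespace tokenization, and the two sample sentences are assumptions made for illustration only.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map a document to word -> frequency, discarding word order."""
    # Lowercasing and whitespace splitting are simplifying assumptions.
    return Counter(text.lower().split())

# Two made-up documents with the same words in a different order.
doc_a = "the film was fun and the cast was great"
doc_b = "great was the cast and fun was the film"

# Position is ignored, so both documents yield the same bag.
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
print(bag_of_words(doc_a)["the"])                  # 2
```

Because only the counts survive, any classifier built on this representation cannot distinguish the two sentences.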

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d)$$

(Equation 1)

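The predicted class c is the one, among all classes C = {c1, c2, ..., cm}, that maximizes the conditional probability P(c|d). Applying Bayes' rule transforms (Equation 1) into the following form: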

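$$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\; \frac{P(d \mid c)\, P(c)}{P(d)}$$

(Equation 2)

The denominator P(d) is the same for every class, so it can be dropped:

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\, P(c)$$

(Equation 3)

Representing the document d as a set of features f_1, ..., f_n:

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\; P(f_1, f_2, \dots, f_n \mid c)\, P(c)$$

(Equation 4)

The naive Bayes assumption treats the features as conditionally independent given the class:

$$P(f_1, f_2, \dots, f_n \mid c) = P(f_1 \mid c) \cdot P(f_2 \mid c) \cdots P(f_n \mid c)$$

(Equation 5)

With the words w_i of the document as features, the classifier becomes:

$$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\; P(c) \prod_{i} P(w_i \mid c)$$

(Equation 6)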
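The decision rule this derivation arrives at picks the class that maximizes P(c) times the product of the word likelihoods P(w_i|c). A minimal sketch follows, assuming the probabilities are already known. Summing logarithms instead of multiplying probabilities is a common implementation choice to avoid underflow on long documents, not something the text above prescribes; the function name `classify` and all probability values are hypothetical.

```python
import math

def classify(words, log_prior, log_likelihood):
    """Return argmax_c [ log P(c) + sum_i log P(w_i | c) ] and all scores."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in words:
            # Assumption of this sketch: every class table covers the same
            # words, and words outside the tables are skipped.
            if w in log_likelihood[c]:
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical probabilities, chosen only for illustration.
prior = {"-": 0.6, "+": 0.4}
likelihood = {
    "-": {"boring": 0.20, "fun": 0.05},
    "+": {"boring": 0.05, "fun": 0.20},
}
log_prior = {c: math.log(p) for c, p in prior.items()}
log_likelihood = {
    c: {w: math.log(p) for w, p in table.items()}
    for c, table in likelihood.items()
}

label, scores = classify(["fun", "fun", "boring"], log_prior, log_likelihood)
print(label)  # +
```

The logarithm is monotonic, so the class with the largest log score is also the class with the largest product.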

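① Computing the prior probability P(c)

P(c) answers the question: among all documents, how likely is a document to belong to class c? Suppose the training data contains Ndoc documents in total. Counting the documents labeled with class c gives Nc, and the prior is computed as follows:

$$\hat{P}(c) = \frac{N_c}{N_{doc}}$$

(Equation 7)

[A prior probability is simply the information you already have in hand before you set out to do something.] For more on understanding prior information, see: this article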

For the document prior P(c) we ask what percentage of the documents in our training set are in each class c.
Let Nc be the number of documents in our training data with
class c and Ndoc be the total number of documents.
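A minimal sketch of this count-based estimate, assuming the training labels are available as a plain list. The function name `class_priors` and the label list are hypothetical; `Fraction` is used so the result stays an exact ratio Nc / Ndoc.

```python
from collections import Counter
from fractions import Fraction

def class_priors(labels):
    """Return P(c) = Nc / Ndoc for every class c that occurs in `labels`."""
    n_doc = len(labels)        # Ndoc: total number of training documents
    counts = Counter(labels)   # Nc: number of documents per class
    return {c: Fraction(n_c, n_doc) for c, n_c in counts.items()}

# Hypothetical training labels: three negative and two positive documents.
train_labels = ["-", "-", "-", "+", "+"]
priors = class_priors(train_labels)
print(priors["-"], priors["+"])  # 3/5 2/5
```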

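② Computing the likelihood P(wi|c)

P(wi|c) is the fraction of all word tokens in the documents of class c that are the word wi:

$$\hat{P}(w_i \mid c) = \frac{\operatorname{count}(w_i, c)}{\sum_{w \in V} \operatorname{count}(w, c)}$$

(Equation 8)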

Here the vocabulary V consists of the union of all the word types in all classes, not just the words in one class c.

But since naive Bayes naively multiplies all the feature likelihoods together, zero
probabilities in the likelihood term for any class will cause the probability of the
class to be zero, no matter the other evidence!

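Add-one (Laplace) smoothing avoids this by adding 1 to every count:

$$\hat{P}(w_i \mid c) = \frac{\operatorname{count}(w_i, c) + 1}{\sum_{w \in V} \left(\operatorname{count}(w, c) + 1\right)} = \frac{\operatorname{count}(w_i, c) + 1}{\left(\sum_{w \in V} \operatorname{count}(w, c)\right) + |V|}$$

(Equation 9)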
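To make the zero-probability problem concrete, here is a minimal sketch that compares the unsmoothed estimate with add-one smoothing. The word counts, the vocabulary size, and the names `likelihood` and `product` are hypothetical; setting `alpha=0` disables smoothing and `alpha=1` adds one to every count.

```python
from fractions import Fraction

def likelihood(word, class_counts, vocab_size, alpha=0):
    """Estimate P(word | c) from the word counts of one class.

    alpha=0 is the unsmoothed estimate; alpha=1 is add-one smoothing.
    """
    total = sum(class_counts.values())
    return Fraction(class_counts.get(word, 0) + alpha, total + alpha * vocab_size)

# Hypothetical word counts for a single class.
counts_pos = {"great": 3, "fun": 2, "film": 1}
vocab_size = 5  # assumes two more word types occur only in other classes

# "boring" is in the vocabulary but was never seen with this class.
doc = ["great", "boring", "film"]

def product(word_list, alpha):
    """Multiply the likelihoods of all words, as naive Bayes does."""
    p = Fraction(1)
    for w in word_list:
        p *= likelihood(w, counts_pos, vocab_size, alpha)
    return p

print(product(doc, alpha=0))  # 0
print(product(doc, alpha=1))  # 8/1331
```

A single unseen word wipes out the unsmoothed product, while the smoothed product stays positive.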

Training documents (the leading '-' or '+' is the class label):

-  just plain boring
-  entirely predictable and lacks energy
-  no surprises and very few laughs

+  very powerful
+  the most fun film of the summer

Test document:

predictable with no fun

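The training data contains 23 word tokens (14 in class '-' and 9 in class '+'). Three words (and, very, the) each appear twice, so the vocabulary V has size 20. The word "with" does not occur in the training data, so it is ignored. The smoothed likelihoods of the remaining test words are:

p(predictable|'-')=(1+1)/(14+20)=2/34          p(predictable|'+')=(0+1)/(9+20)=1/29

p(no|'-')=(1+1)/(14+20)=2/34                   p(no|'+')=(0+1)/(9+20)=1/29

p(fun|'-')=(0+1)/(14+20)=1/34                  p(fun|'+')=(1+1)/(9+20)=2/29

The priors are P('-')=3/5 and P('+')=2/5. Multiplying each prior by the corresponding likelihoods gives:

P('-') × p(predictable|'-') × p(no|'-') × p(fun|'-') = 3/5 × (2×2×1)/34^3 ≈ 6.1 × 10^-5

P('+') × p(predictable|'+') × p(no|'+') × p(fun|'+') = 2/5 × (1×1×2)/29^3 ≈ 3.3 × 10^-5

Comparing these two scores, "predictable with no fun" is assigned to the '-' class.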
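The worked example can be reproduced with a short script. This is a sketch under a few assumptions: whitespace tokenization, exact arithmetic via `Fraction`, and dropping test words that are not in the vocabulary. The helper names `prior`, `likelihood`, and `score` are illustrative.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("-", "just plain boring"),
    ("-", "entirely predictable and lacks energy"),
    ("-", "no surprises and very few laughs"),
    ("+", "very powerful"),
    ("+", "the most fun film of the summer"),
]
test_doc = "predictable with no fun"

# Count documents per class and word tokens per class.
doc_counts = Counter(label for label, _ in train)
word_counts = {c: Counter() for c in doc_counts}
for label, text in train:
    word_counts[label].update(text.split())

# The vocabulary is the union of word types over all classes.
vocab = set().union(*word_counts.values())

def prior(c):
    """P(c) = Nc / Ndoc."""
    return Fraction(doc_counts[c], len(train))

def likelihood(w, c):
    """Add-one smoothing: (count(w, c) + 1) / (tokens in c + |V|)."""
    return Fraction(word_counts[c][w] + 1, sum(word_counts[c].values()) + len(vocab))

def score(text, c):
    """P(c) times the product of P(w | c) over the known words of `text`."""
    p = prior(c)
    for w in text.split():
        if w in vocab:  # unknown words such as "with" are dropped
            p *= likelihood(w, c)
    return p

scores = {c: score(test_doc, c) for c in doc_counts}
print(len(vocab))                      # 20
print(likelihood("predictable", "-"))  # 1/17, i.e. 2/34 reduced
print(max(scores, key=scores.get))     # -
```

`Fraction` reduces 2/34 to 1/17 when printing, but the value matches the hand computation exactly.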

CS 124: From Languages to Information

posted @ 2017-12-29 19:19  大熊猫同学