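Naive Bayes
The naive Bayes classifier is built on Bayes' theorem. In probabilistic terms, let x=<x1,x2,...,xn> be an n-dimensional feature vector and let y∈{c1,c2,...,ck} be the class of x, chosen from k possible classes. Write
P(y=ci|x) for the probability that the feature vector x belongs to class ci. Bayes' theorem states:
P(y|x)=P(x|y)P(y)/P(x)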
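The "naive" part is the assumption that the features are conditionally independent given the class, so P(x|y) factors into P(x1|y)P(x2|y)...P(xn|y). Since P(x) is the same for every class, the classifier only has to pick the class ci that maximizes P(y=ci) times that product. The following is a minimal pure-Python sketch of this idea for word counts. The toy corpus, the function names `train_naive_bayes` and `predict`, and the choice of additive smoothing with alpha=1.0 are illustrative assumptions, not part of the original post.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log P(y) and log P(word|y) with additive smoothing (alpha is an assumption)."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    # Prior: fraction of training documents in each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    # Likelihood: smoothed relative frequency of each word within a class
    log_likelihood = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class maximizing log P(y) + sum of log P(word|y)."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for word in doc.split():
            if word in log_likelihood[c]:  # words never seen in training are skipped
                score += log_likelihood[c][word]
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus, only for illustration
docs = [
    "goal puck ice hockey",
    "hockey playoffs goal",
    "rocket orbit launch",
    "orbit moon rocket launch",
]
labels = ["hockey", "hockey", "space", "space"]

log_prior, log_likelihood = train_naive_bayes(docs, labels)
print(predict("goal in the playoffs", log_prior, log_likelihood))
print(predict("rocket to the moon", log_prior, log_likelihood))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow when a document contains many words.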
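# Code 1: load the 20 Newsgroups text data and inspect it
# Import the news data fetcher fetch_20newsgroups from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups
# The data is downloaded from the internet on first use
news=fetch_20newsgroups(subset='all')
# Number of posts, followed by the raw text of the first post
print(len(news.data))
print(news.data[0])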
18846
From: Mamatha Devineni Ratnam <mr47+@Andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
Possible problem: fetch_20newsgroups fails to download the dataset. A workaround:
1. Manually download 20news-bydate.tar.gz (http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz) and save it to C:User\Adminster\scikit_learn_data\20news_home.
2. Edit the download_20newsgroups() function in C:\Anaconda\Lib\site-packages\sklearn\datasets\twenty_newsgroups.py. Comment out the following code:
#    if os.path.exists(archive_path):
#        # Download is not complete as the .tar.gz file is removed after
#        # download.
#        logger.warning("Download was incomplete, downloading again.")
#        os.remove(archive_path)
#
#    logger.warning("Downloading dataset from %s (14 MB)", URL)
#    opener = urlopen(URL)
#    with open(archive_path, 'wb') as f:
#        f.write(opener.read())
and set the archive path so that it points at the downloaded file:
archive_path = os.path.join(target_dir, r'20news-bydate.tar.gz')
3. Run the script again. fetch_20newsgroups then extracts 20news-bydate.tar.gz automatically and generates the cache file 20news-bydate_py3.pkz (at C:User\Adminster\scikit_learn_data\20news-bydate_py3.pkz).
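The folder in the steps above is specific to one machine. As a small sketch, `sklearn.datasets.get_data_home()` reports where scikit-learn keeps its dataset cache on the current machine (by default a `scikit_learn_data` folder under the user's home directory, or the location given by the `SCIKIT_LEARN_DATA` environment variable). The subfolder and file names below are taken from the steps above and may differ between scikit-learn versions, so treat them as assumptions.

```python
import os

from sklearn.datasets import get_data_home

# Directory where scikit-learn stores downloaded datasets (created if missing)
data_home = get_data_home()
print("scikit-learn data home:", data_home)

# Names taken from the workaround above; they are assumptions for other versions
archive_path = os.path.join(data_home, "20news_home", "20news-bydate.tar.gz")
cache_path = os.path.join(data_home, "20news-bydate_py3.pkz")

archive_exists = os.path.isfile(archive_path)
cache_exists = os.path.isfile(cache_path)
print("manually downloaded archive present:", archive_exists)
print("extracted cache file present:", cache_exists)
```

If the cache file is already present, fetch_20newsgroups loads it directly and no download is attempted.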
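# Code 2: split the 20 Newsgroups text data
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# Randomly hold out 25% of the data for testing; the remaining 75% forms the training set
X_train,X_test,y_train,y_test=train_test_split(news.data,news.target,test_size=0.25,random_state=33)
# Use a naive Bayes classifier to predict the category of each news post
# Import the text-to-feature-vector converter from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import CountVectorizer
vec=CountVectorizer()
# Learn the vocabulary on the training set only, then reuse it for the test set
X_train=vec.fit_transform(X_train)
X_test=vec.transform(X_test)
# Import the naive Bayes model from sklearn.naive_bayes
from sklearn.naive_bayes import MultinomialNB
mnb=MultinomialNB()
# Estimate the model parameters from the training data
mnb.fit(X_train,y_train)
# Store the predictions in the variable y_predict
y_predict=mnb.predict(X_test)
# Evaluate the naive Bayes classifier on the news text data
# Use the model's built-in score function to measure accuracy
print('The Accuracy of Naïve Bayes is',mnb.score(X_test,y_test))
# Import classification_report from sklearn.metrics
from sklearn.metrics import classification_report
print(classification_report(y_test,y_predict,target_names=news.target_names))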
The Accuracy of Naïve Bayes is 0.8397707979626485
                           precision    recall  f1-score   support
             alt.atheism        0.86      0.86      0.86       201
           comp.graphics        0.59      0.86      0.70       250
 comp.os.ms-windows.misc        0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware        0.60      0.88      0.72       240
   comp.sys.mac.hardware        0.93      0.78      0.85       242
          comp.windows.x        0.82      0.84      0.83       263
            misc.forsale        0.91      0.70      0.79       257
               rec.autos        0.89      0.89      0.89       238
         rec.motorcycles        0.98      0.92      0.95       276
      rec.sport.baseball        0.98      0.91      0.95       251
        rec.sport.hockey        0.93      0.99      0.96       233
               sci.crypt        0.86      0.98      0.91       238
         sci.electronics        0.85      0.88      0.86       249
                 sci.med        0.92      0.94      0.93       245
               sci.space        0.89      0.96      0.92       221
  soc.religion.christian        0.78      0.96      0.86       232
      talk.politics.guns        0.88      0.96      0.92       251
   talk.politics.mideast        0.90      0.98      0.94       231
      talk.politics.misc        0.79      0.89      0.84       188
      talk.religion.misc        0.93      0.44      0.60       158
             avg / total        0.86      0.84      0.82      4712
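To read the report: precision is the share of posts predicted as a class that really belong to it, recall is the share of posts of a class that the model found, and the f1-score is the harmonic mean of the two. The sketch below recomputes the f1-score of one row from its printed precision and recall. The helper name `f1_from` is hypothetical and only for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# talk.religion.misc row of the report above: precision 0.93, recall 0.44
f1 = f1_from(0.93, 0.44)
print(round(f1, 2))
```

Because the report prints rounded values, recomputing from the printed precision and recall can differ in the last digit for some rows (comp.os.ms-windows.misc is one such case). The harmonic mean also explains why that class scores so low overall: a recall of 0.10 pulls the f1-score down even though precision is 0.89.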