1. Data preparation: collect and read the data
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
2. Data preprocessing: clean the data
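file_path = r'SMSSpamCollectionjsn.txt'
a_data = []   # message texts (second column)
a_label = []  # class labels (first column)
# Each line is tab-separated: label first, then the message text.
with open(file_path, 'r', encoding='utf-8') as a:
    csv_reader = csv.reader(a, delimiter='\t')
    for line in csv_reader:
        a_label.append(line[0])
        a_data.append(line[1])

def preprocessing(text):
    # Placeholder: returns the text unchanged.
    preprocessing_text = text
    return preprocessing_text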
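The preprocessing function above is a stub that returns its input unchanged. A minimal sketch of what a fuller version might do, using only the standard library. The name preprocess_sms, the cleaning steps (lowercasing, keeping alphabetic tokens only, dropping tokens shorter than three characters) and the sample message are assumptions for illustration, not the original author's choices:

```python
import re

def preprocess_sms(text, min_len=3):
    """Lowercase the text, keep alphabetic tokens, drop very short ones."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if len(t) >= min_len)

# Hypothetical sample message, not taken from the data file.
print(preprocess_sms("WINNER!! You have won a 1000 cash prize, call NOW"))
# prints: winner you have won cash prize call now
```

If a function like this were used, it would be applied to each message as it is read, before the train/test split.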
3. Training and test sets: split the labelled data in a fixed ratio (here 70% for training, 30% for testing).
# Stratified split, so both sets keep the same class proportions.
x_train, x_test, y_train, y_test = train_test_split(
    a_data, a_label, test_size=0.3, random_state=0, stratify=a_label)
4. Feature extraction: turn each text into a TF-IDF term vector.
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english',
                             strip_accents='unicode', norm='l2')
# Fit the vocabulary on the training texts only, then reuse it for the test texts.
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)
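To make the vectorization step more concrete, here is a rough pure-Python sketch of the weights a TF-IDF vectorizer produces. It assumes raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1 and L2 row normalisation, which are scikit-learn's documented defaults. It uses whitespace tokenization and unigrams only, with no stop-word removal and no min_df filter, so it is a simplification rather than a reimplementation of TfidfVectorizer. The function name and the toy documents are made up:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit Euclidean length (norm='l2').
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Toy corpus, for illustration only.
vocab, rows = tfidf_vectors(["free prize now", "call me now", "free free call"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "prize" ends up with a larger weight than "free" because it occurs in fewer documents.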
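5. Train the model: build the model and fit it on the training data. That is, from the training samples, estimate the probability P(xi|y) of each term given the class, which yields one vector of term probabilities per class.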
# Train on the TF-IDF matrix X_train, not on the raw texts x_train.
clf = MultinomialNB().fit(X_train, y_train)
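The estimation of P(xi|y) described in this step can be sketched by hand. The version below is a simplified illustration, not what MultinomialNB does internally line by line: it works on raw word counts rather than TF-IDF weights, and uses additive (Laplace) smoothing with alpha = 1. The function names and the four toy messages are hypothetical:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(y) and log P(x_i | y) with additive smoothing."""
    vocab = sorted({w for d in docs for w in d.split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Smoothed denominator: all word occurrences in class c plus alpha per vocabulary term.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(text, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c])
    return max(scores, key=scores.get)

# Toy training data, for illustration only.
docs = ["free prize now", "win free cash", "see you at lunch", "call me at home"]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb("free cash prize", log_prior, log_likelihood))  # prints: spam
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.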
6. Test the model: use the test set to measure how accurately the model predicts.
Confusion matrix
Accuracy, precision, recall, F-score
# Predict labels for the test set, then compare them with the true labels.
y_nb_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_nb_pred)
print(cm)
cr = classification_report(y_test, y_nb_pred)
print(cr)
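The four metrics listed in this step can all be derived from a 2x2 confusion matrix. A small sketch, assuming the layout that confusion_matrix returns (rows are true classes, columns are predicted classes) and treating the second class as the positive one. The helper name binary_metrics is hypothetical, and the counts are made up for illustration; they are not results from this dataset:

```python
def binary_metrics(cm):
    """cm = [[tn, fp], [fn, tp]]: rows are true classes, columns are predicted classes."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Made-up counts, for illustration only (not results from this dataset).
example_cm = [[90, 10], [5, 45]]
accuracy, precision, recall, f1 = binary_metrics(example_cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.900 precision=0.818 recall=0.900 f1=0.857
```

For spam filtering, precision on the spam class is often the figure to watch, since a false positive means a legitimate message was flagged as spam.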
7. Predict the class of a new email
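new_email = ['new email text']  # placeholder: put the message to classify here
# Vectorize with the already fitted vectorizer before predicting.
new_email_vec = vectorizer.transform(new_email)
print(clf.predict(new_email_vec))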