Credit Scoring Prediction Model (6) -- Support Vector Machine Algorithm

Author: LieDra
https://www.cnblogs.com/LieDra/

Preface

In this part, the support vector machine algorithm is used to process and analyze the data.

Introduction to Support Vector Machines

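A support vector machine (SVM) is a binary classification model. Its basic form is the linear classifier with the largest margin in the feature space; maximizing the margin is what distinguishes it from the perceptron.
SVM also includes the kernel trick, which in effect makes it a nonlinear classifier.
The learning strategy of SVM is margin maximization.
For a linearly separable dataset there are infinitely many separating hyperplanes, but the one with the largest margin is unique. Finding this optimal hyperplane can be posed as a quadratic optimization problem: maximize the classification margin subject to every training sample lying on the correct side of the hyperplane.
For a dataset that is not linearly separable, the concept of a kernel function is introduced. In SVM classification, the main purpose of the kernel function is to map data that cannot be separated linearly in the original low-dimensional space into a higher-dimensional space, where a separating hyperplane can be constructed.
Different kernel functions give different results on different classification problems. The main kernels are the linear kernel, the polynomial kernel, the sigmoid kernel, and the Gaussian radial basis function (RBF) kernel; choose among them as the problem requires.
Once training is complete, most of the training samples no longer need to be kept: the final model depends only on the support vectors.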
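The two points above (a kernel makes non-separable data separable, and the fitted model depends only on the support vectors) can be illustrated with a small sketch. The data here is synthetic, generated with `make_circles`; it is an assumption for illustration and has nothing to do with the credit dataset used later.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration): two concentric rings,
# which no straight line can separate in the original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

results = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    results[kernel] = {
        "test_accuracy": clf.score(X_test, y_test),
        # Only these training samples define the fitted decision function.
        "n_support_vectors": int(clf.n_support_.sum()),
        "n_training_samples": len(X_train),
    }

for kernel, info in results.items():
    print(kernel, info)
```

On data like this the linear kernel should stay close to chance level while the RBF kernel separates the rings, and the number of support vectors is at most the number of training samples.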

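The code below applies SVC() to the raw data, SVC() to the standardized data, and SVC() with explicitly set parameters to the standardized data. SVM does clearly better on the standardized data (test accuracy 0.790 versus 0.713 in the results listed after the code). This is easy to understand: before standardization, attributes whose values are much larger or smaller than the others distort the result.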
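The effect of standardization can also be tried on synthetic data. The following is a minimal sketch, not the author's script: the data comes from `make_classification`, one column is deliberately inflated by a factor of 1000, and the scaling is done with a `StandardScaler` inside a pipeline. All of these choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit data (an assumption, not german.xls).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
# Inflate one column so its magnitude dwarfs the others, similar to an
# amount column sitting next to small categorical codes.
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.6, random_state=38)

# SVC fitted directly on the unscaled features.
raw_svm = SVC().fit(X_train, y_train)

# The pipeline learns mean and standard deviation from the training data
# only and reuses them when the test data is scored.
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

raw_acc = raw_svm.score(X_test, y_test)
scaled_acc = scaled_svm.score(X_test, y_test)
print("raw features:    {:.3f}".format(raw_acc))
print("scaled features: {:.3f}".format(scaled_acc))
```

With an RBF kernel the distance between two samples is dominated by the inflated column, so the scaled model usually scores higher on data like this, although that is not guaranteed for every dataset.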

Code Example

# Imports
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
 
# Path of the data file
# readFileName="D:/study/5/code/python/python Data analysis and mining/class/dataset/german-全标准化.xls"
readFileName="D:/study/5/code/python/python Data analysis and mining/class/dataset/german.xls"



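# Read the Excel file
df=pd.read_excel(readFileName)

# list_columns=list(df.columns[:-1])
# Features are all columns except the last; the last column is the label.
# (.ix was removed in pandas 1.0; .iloc does the same positional selection.)
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
names=x.columns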

# random_state is the random seed, which makes the split reproducible
# x_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,random_state=38)

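# 60% of the rows are used for training.
x_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,train_size=0.6,random_state=38)
# The remaining rows are split again; x_check (75% of them) is the subset the script evaluates on.
x_test2,x_check,y_test2,y_check=train_test_split(x_test,y_test,train_size=0.25,random_state=38)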
svm=SVC()
svm.fit(x_train,y_train)
print("accuracy on the training subset:{:.3f}".format(svm.score(x_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(svm.score(x_check,y_check)))

'''
accuracy on the training subset:1.000
accuracy on the test subset:0.700
 
'''

# Check whether the features are on comparable scales
# plt.tick_params(labelsize=8.5)
# plt.plot(names,x_train.min(axis=0),'o',label='Min')
# plt.plot(names,x_train.max(axis=0),'v',label='Max')
# plt.xlabel('Feature Index')
# plt.ylabel('Feature magnitude in log scale')
# plt.yscale('log')
# plt.xticks(rotation=90)

# plt.legend(loc='upper right')

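# Standardize the data (zero mean, unit variance per column).
# Note: the test set is scaled with its own statistics here, not with those
# of the training set; fitting a scaler on the training data is the more usual practice.
x_train_scaled = preprocessing.scale(x_train)
x_test_scaled = preprocessing.scale(x_check)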
svm1=SVC()
svm1.fit(x_train_scaled,y_train)
print("accuracy on the scaled training subset:{:.3f}".format(svm1.score(x_train_scaled,y_train)))
print("accuracy on the scaled test subset:{:.3f}".format(svm1.score(x_test_scaled,y_check)))
'''
accuracy on the scaled training subset:0.867
accuracy on the scaled test subset:0.800
'''


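# Set the parameters explicitly. kernel is the kernel function; probability controls whether class probabilities are computed
# C is the penalty on the slack variables: a smaller C penalizes misclassification less, tolerates more errors, and tends to generalize better
# gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default was 'auto' in older scikit-learn versions and is 'scale' since 0.22
# kernel: the kernel function, 'rbf' by default; can be 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'
# probability: whether to enable probability estimates. It must be enabled before calling fit() and makes fit() slower

# Note: the print labels below say "c parameter=10", but the code passes C=1.
svm2=SVC(C=1,gamma="auto",kernel='rbf',probability=True)
svm2.fit(x_train_scaled,y_train)
print("after c parameter=10,accuracy on the scaled training subset:{:.3f}".format(svm2.score(x_train_scaled,y_train)))
print("after c parameter=10,accuracy on the scaled test subset:{:.3f}".format(svm2.score(x_test_scaled,y_check)))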
'''
after c parameter=10,accuracy on the scaled training subset:0.972
after c parameter=10,accuracy on the scaled test subset:0.716
'''
 
# plt.show()
# Value of the decision function for each sample (its functional distance to the separating hyperplane)
#print (svm2.decision_function(x_train_scaled))
 
#print (svm2.decision_function(x_train_scaled)[:20]>0)
# Class labels known to the classifier
print(svm2.classes_)
 
# Output the probabilities
# print(svm2.predict_proba(x_test_scaled))
print(len(svm2.predict_proba(x_test_scaled)))
# a = svm2.predict_proba(x_test_scaled)
# b = svm2.predict(x_test_scaled)
# c = []
# for i in range(len(a)):
#     if (a[i][1] > 0.5 and b[i] == 0):
#         print(a[i],b[i],i+1)
# Predict which class each sample belongs to, reported as 0 or 1
print(svm2.predict(x_test_scaled))
# plt.show()

The results are as follows:

accuracy on the training subset:0.715
accuracy on the test subset:0.713
accuracy on the scaled training subset:0.857
accuracy on the scaled test subset:0.790
after c parameter=10,accuracy on the scaled training subset:0.857
after c parameter=10,accuracy on the scaled test subset:0.790
[0 1]
300
[1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1
 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0
 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 1
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1
 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 1 1 1]
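The last three outputs come from `classes_`, `predict_proba` and `predict`. How these relate to `decision_function` can be checked with a short sketch. The data is synthetic (`make_classification`), which is an assumption for illustration; only the SVC calls mirror the script.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary data (an assumption for illustration, not german.xls).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# probability=True makes fit() run an extra internal cross-validation
# so that predict_proba() becomes available.
clf = SVC(kernel="rbf", gamma="auto", probability=True,
          random_state=0).fit(X, y)

scores = clf.decision_function(X)   # signed score per sample
labels = clf.predict(X)             # hard class labels
proba = clf.predict_proba(X)        # one column per entry of clf.classes_

# For a binary SVC, a positive score means the second entry of classes_.
labels_from_scores = np.where(scores > 0, clf.classes_[1], clf.classes_[0])

# Labels come from the score, probabilities from a separate calibration
# step, so the two can disagree for samples near the boundary.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
n_disagree = int(np.sum(labels_from_proba != labels))

print(clf.classes_, proba.shape)
print("samples where predict and predict_proba disagree:", n_disagree)
```

This is the situation the commented-out loop in the script looks for: rows where the probability of class 1 exceeds 0.5 while `predict` returns 0.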

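The third model in the script sets C=1 and gamma="auto" by hand, and its scores are identical to those of the second model. One way the parameters could be searched systematically is a cross-validated grid search. The sketch below uses synthetic data and a hypothetical parameter grid; both are assumptions, and the values are not tuned for the German credit data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit data (an assumption, not german.xls).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.6, random_state=38)

# Scaling sits inside the pipeline so each cross-validation fold is
# standardized with statistics from its own training part only.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Hypothetical grid, chosen only to illustrate the mechanism.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: {:.3f}".format(search.best_score_))
print("held-out accuracy: {:.3f}".format(search.score(X_test, y_test)))
```

Reporting the held-out accuracy separately from the cross-validated score keeps the test rows out of the parameter choice.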
posted @ 2020-01-01 08:02 LieDra
