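I have recently been studying data mining, in particular using scikit-learn to process data. My impression is that sklearn is a very quick way to get started and begin solving practical problems.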

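General steps for solving a data mining problem

1. Load the dataset

2. Preprocess the data

3. Choose a suitable model

4. Tune the model parameters

5. Validate the model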

 

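The most important of these is data preprocessing. The sklearn.preprocessing module offers many preprocessing methods, including StandardScaler and MinMaxScaler, and for feature selection the sklearn.feature_selection module can select features automatically.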
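
Before the full script, here is a minimal sketch of how the two scalers mentioned above differ. The toy arrays and variable names are made up for illustration and are not part of the original experiment; the point is only that each scaler is fitted on the training data and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (hypothetical): two features on very different scales.
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0]])

# StandardScaler: each feature gets zero mean and unit variance,
# using statistics learned from the training data only.
standard_scaler = StandardScaler().fit(X_train_toy)
X_train_std = standard_scaler.transform(X_train_toy)
X_test_std = standard_scaler.transform(X_test_toy)

# MinMaxScaler: each feature is rescaled so the training data spans [0, 1].
minmax_scaler = MinMaxScaler().fit(X_train_toy)
X_train_mm = minmax_scaler.transform(X_train_toy)
X_test_mm = minmax_scaler.transform(X_test_toy)

print("standardized mean/std:", X_train_std.mean(axis=0), X_train_std.std(axis=0))
print("min-max range:", X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print("test point:", X_test_std, X_test_mm)
```

Which scaler fits better depends on the model; for an RBF-kernel SVR such as the one used below, standardization is a common default.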

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
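import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older version.
boston = load_boston()  # load the dataset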
print(boston.DESCR)
X = boston.data
y = boston.target
names = boston.feature_names
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)
print("shape of X_train:{}".format(X_train.shape))
print("shape of X_test:{}".format(X_test.shape))
import pandas as pd
boston_dataframe = pd.DataFrame(X_train,columns = names)
print(boston_dataframe)
grr = pd.plotting.scatter_matrix(boston_dataframe, c=y_train, figsize=(15, 15),
                                 marker='X', hist_kwds={"bins": 20}, s=20, alpha=0.8)  # get the big picture of the data
standard = StandardScaler()
X_train_standard = standard.fit_transform(X_train)
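X_test_standard = standard.transform(X_test)  # reuse the training-set statistics; never refit on the test set
select = SelectPercentile(f_regression, percentile=50)  # keep the top 50% of features by F-score
select.fit(X_train_standard, y_train)  # fit on the same scaled data that is transformed below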
X_train_select = select.transform(X_train_standard)
X_test_select = select.transform(X_test_standard)
print("x_train_select shape:{}".format(X_train_select.shape))
print('x_test_select shape:{}'.format(X_test_select.shape))
print('x_train shape:{}'.format(X_train.shape))
svr =SVR(kernel="rbf")
param_grid = {'C':[0.001,0.01,0.1,1,10,100],
             'gamma':[0.001,0.01,0.1,1,10,100]}
grid = GridSearchCV(svr,param_grid=param_grid,cv=5)
grid.fit(X_train_select,y_train)
print("best cross-validation R^2 score:{:.3f}".format(grid.best_score_))  # SVR.score is R^2, not accuracy
print("test set score:{:.3f}".format(grid.score(X_test_select, y_test)))
print("best parameters:{}".format(grid.best_params_))

Note: I still need to get better at feature selection, and I should also learn to use Pipeline.
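
As a starting point for that, here is a rough sketch of how the same scale, select, and fit steps might be chained with sklearn.pipeline.Pipeline. It assumes a synthetic dataset from make_regression as a stand-in for the Boston data, and the step names ("scale", "select", "svr") and the smaller parameter grid are arbitrary choices for illustration. Because the whole pipeline is passed to GridSearchCV, the scaler and the selector are refitted on the training part of each fold, so no information leaks from the validation folds.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the Boston data used above).
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=5,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Chain preprocessing, feature selection and the model into one estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectPercentile(f_regression, percentile=50)),
    ("svr", SVR(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid_pipe = {
    "svr__C": [0.1, 1, 10, 100],
    "svr__gamma": [0.001, 0.01, 0.1],
}

grid_pipe = GridSearchCV(pipe, param_grid=param_grid_pipe, cv=5)
grid_pipe.fit(X_tr, y_tr)

print("best cross-validation R^2: {:.3f}".format(grid_pipe.best_score_))
print("test set R^2: {:.3f}".format(grid_pipe.score(X_te, y_te)))
print("best parameters: {}".format(grid_pipe.best_params_))

# Inspect which features the selector kept in the best pipeline.
selected_mask = grid_pipe.best_estimator_.named_steps["select"].get_support()
print("selected feature mask: {}".format(selected_mask))
```

The percentile itself could also be tuned by adding a "select__percentile" entry to the grid, which is one way to experiment with feature selection.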