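Recently I have been studying data mining, in particular using scikit-learn to process data. scikit-learn strikes me as a very quick way to get started and to solve practical problems.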
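The general steps for solving a data mining problem:
1. Load the data set
2. Preprocess the data
3. Choose a suitable model
4. Tune the model parameters
5. Validate the model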
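The most important of these is data preprocessing. The sklearn.preprocessing module offers many preprocessing methods, including StandardScaler and MinMaxScaler, and for feature selection the sklearn.feature_selection module can select features automatically.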
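To make the difference between the two scalers concrete, here is a minimal sketch on a made-up toy array (the numbers are assumptions for illustration only, not taken from any real data set). It also shows the usual habit of fitting a scaler on the training data and reusing the same statistics for the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# StandardScaler: subtract the training mean and divide by the training std.
standard = StandardScaler().fit(X_train)
X_train_std = standard.transform(X_train)
X_test_std = standard.transform(X_test)  # reuses the training statistics

# MinMaxScaler: map each feature's training min/max to the range [0, 1].
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)  # values outside the training range can exceed 1

print("standardized train mean:", X_train_std.mean(axis=0))
print("min-max train range:", X_train_mm.min(axis=0), X_train_mm.max(axis=0))
print("min-max test row:", X_test_mm)
```

Because the test row contains a value larger than anything seen in training, its min-max scaled value falls outside [0, 1]. That is expected when the scaler is fitted on the training data only.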
# Note: load_boston was removed in scikit-learn 1.2, so this script needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import pandas as pd

# Load the data set
boston = load_boston()
print(boston.DESCR)
X = boston.data
y = boston.target
names = boston.feature_names

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print("shape of X_train:{}".format(X_train.shape))
print("shape of X_test:{}".format(X_test.shape))

# Get the big picture of the data
boston_dataframe = pd.DataFrame(X_train, columns=names)
print(boston_dataframe)
grr = pd.plotting.scatter_matrix(boston_dataframe, c=y_train, figsize=(15, 15), marker='X', hist_kwds={"bins": 20}, s=20, alpha=0.8)
plt.show()

# Standardize: fit on the training set only, then apply the same transform to the test set
standard = StandardScaler()
X_train_standard = standard.fit_transform(X_train)
X_test_standard = standard.transform(X_test)

# Select features based on percentile (keep the top 50%)
select = SelectPercentile(f_regression, percentile=50)
select.fit(X_train_standard, y_train)
X_train_select = select.transform(X_train_standard)
X_test_select = select.transform(X_test_standard)
print("x_train_select shape{}".format(X_train_select.shape))
print('x_test_select shape{}'.format(X_test_select.shape))
print('x_train shape:{}'.format(X_train.shape))

# Grid search over C and gamma for an RBF-kernel SVR
svr = SVR(kernel="rbf")
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(svr, param_grid=param_grid, cv=5)
grid.fit(X_train_select, y_train)

# For a regressor the default score is R^2, not accuracy
print("best cross-validation R^2:{:.3f}".format(grid.best_score_))
print("test set R^2:{:.3f}".format(grid.score(X_test_select, y_test)))
print("best parameters:{}".format(grid.best_params_))
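The script imports r2_score, mean_squared_error and mean_absolute_error but never calls them. Below is a minimal sketch of how they could be used for the "validate the model" step. It assumes synthetic data from make_regression as a stand-in for the Boston data set, and the C and gamma values are arbitrary choices for illustration, not tuned results.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; not the Boston data set).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale with training statistics only, then fit an RBF-kernel SVR.
scaler = StandardScaler().fit(X_train)
model = SVR(kernel="rbf", C=100, gamma=0.1)  # arbitrary illustrative values
model.fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Three common regression metrics computed on the held-out test set.
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("R^2: {:.3f}".format(r2))
print("MSE: {:.3f}".format(mse))
print("MAE: {:.3f}".format(mae))
```

The R^2 value here is the same quantity that model.score returns for a regressor, while MSE and MAE report the error in the squared and original units of the target.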
Note: I still need to get better at feature selection, and I should also learn to use the Pipeline approach.
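As a rough idea of what the Pipeline approach might look like, here is a minimal sketch that chains scaling, feature selection and the SVR into one estimator and tunes them together with GridSearchCV. The synthetic data, the step names ("scale", "select", "svr") and the reduced parameter grid are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; not the Boston data set).
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each step is refitted on the training folds only during cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectPercentile(f_regression, percentile=50)),
    ("svr", SVR(kernel="rbf")),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "select__percentile": [50, 100],
    "svr__C": [1, 10, 100],
    "svr__gamma": [0.01, 0.1],
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print("best cross-validation R^2: {:.3f}".format(grid.best_score_))
print("test set R^2: {:.3f}".format(grid.score(X_test, y_test)))
print("best parameters: {}".format(grid.best_params_))
```

With a pipeline, the scaler and the feature selector never see the validation fold while being fitted, and the selection percentile can be tuned alongside C and gamma.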