scikit-learn
Statistical learning: the setting and the estimator object in scikit-learn
Data layout
The first axis of the data is the samples axis; the second axis is the features axis.
A simple example shipped with the scikit: iris dataset
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)
That is, the data's shape must be (n_samples, n_features); if it is not, the data must be reshaped.
An example of reshaping data would be the digits dataset
The digits dataset is made of 1797 8x8 images of hand-written digits
>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> data = digits.images.reshape((digits.images.shape[0], -1))
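As a quick check (my addition, not in the tutorial), the flattened array now has the required (n_samples, n_features) shape:

```python
from sklearn import datasets

digits = datasets.load_digits()
# Flatten each 8x8 image into a 64-element feature vector
data = digits.images.reshape((digits.images.shape[0], -1))
print(data.shape)  # (1797, 64)
```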
Estimator objects
Definition: a Python object that implements fit(X, y) and predict(T).
The main API implemented by scikit-learn is the estimator: an object that learns from data.
Every estimator therefore implements
>>> estimator.fit(data)
Estimator parameters
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1
Estimated parameters (learned from the data) carry a trailing underscore:
`estimator.estimated_param_`
Parameter values are typically chosen via grid search and cross-validation.
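As a sketch of that idea (my addition, using GridSearchCV from sklearn.model_selection in current scikit-learn versions), the search can be automated instead of looping over alphas by hand:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Search a log-spaced grid of alpha values with 5-fold cross-validation
diabetes = datasets.load_diabetes()
param_grid = {'alpha': np.logspace(-4, -1, 6)}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(diabetes.data, diabetes.target)
print(search.best_params_)
```

The best alpha found by cross-validation is then available as `search.best_params_['alpha']`, and `search.best_estimator_` is the refitted model.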
Model persistence
Use pickle.
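A minimal sketch of saving and restoring a fitted estimator with pickle (my addition):

```python
import pickle
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear')
clf.fit(iris.data, iris.target)

# Serialize the fitted estimator to bytes, then restore it
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
print(clf2.predict(iris.data[:1]))
```

For large models, scikit-learn's documentation recommends joblib (`joblib.dump` / `joblib.load`), which is more efficient on estimators carrying big numpy arrays.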
A tutorial on statistical-learning for scientific data processing
Supervised learning: predicting an output variable from high-dimensional observations
Linear regression
- Without regularization:
regr = linear_model.LinearRegression()
- With regularization: Ridge
When data are scarce, noise in the observations induces high variance,
so the regression coefficients are shrunk toward zero:
regr = linear_model.Ridge(alpha=.1)
Example of searching for the parameter:
>>> import numpy as np
>>> alphas = np.logspace(-4, -1, 6)
>>> from __future__ import print_function
>>> print([regr.set_params(alpha=alpha
... ).fit(diabetes_X_train, diabetes_y_train,
... ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
- Sparsity: Lasso
With many features the "curse of dimensionality" looms; since some features are not informative,
Lasso sets the coefficients of the non-informative ones to zero.
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
... ).fit(diabetes_X_train, diabetes_y_train
... ).score(diabetes_X_test, diabetes_y_test)
... for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
max_iter=1000, normalize=False, positive=False, precompute='auto',
tol=0.0001, warm_start=False)
>>> print(regr.coef_)
[ 0. -212.43764548 517.19478111 313.77959962 -160.8303982 -0.
-187.19554705 69.38229038 508.66011217 71.84239008]
- Classification: LogisticRegression.
When labeling the iris data with linear regression, points far from the decision boundary are given too much weight.
The linear remedy is to fit a sigmoid, or logistic, function:
\(y = \textrm{sigmoid}(X\beta - \textrm{offset}) + \epsilon = \frac{1}{1 + \textrm{exp}(- X\beta + \textrm{offset})} + \epsilon\)
>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, penalty='l2',
random_state=None, tol=0.0001)
Support vector machines (SVMs)
SVMs belong to the discriminant-model family.
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
Note: for many estimators, scaling the data so that each feature has unit standard deviation improves performance.
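A minimal sketch of that scaling step (my addition), using StandardScaler from sklearn.preprocessing:

```python
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
scaler = StandardScaler()
# Center each feature and divide by its standard deviation
X_scaled = scaler.fit_transform(iris.data)
print(X_scaled.std(axis=0))  # each feature now has unit standard deviation
```

In practice the scaler is fit on the training set only, and the same transform is then applied to the test set to avoid leaking test statistics into training.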