scikit-learn

Statistical learning: the setting and the estimator object in scikit-learn

Data types

The first axis of the data is the samples axis; the second axis is the features axis.

A simple example shipped with the scikit: iris dataset
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

That is, the data's shape must be (n_samples, n_features);
if it is not, the data must be reshaped.

An example of reshaping data would be the digits dataset
The digits dataset is made of 1797 8x8 images of hand-written digits
>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> data = digits.images.reshape((digits.images.shape[0], -1))

Estimator objects

Definition: a Python object that implements fit(X, y) and predict(T).
The main API implemented by scikit-learn is the estimator: an object that learns from data.
It must therefore implement a fit method: >>> estimator.fit(data)
About parameters

>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
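The `Estimator` above is a placeholder; as a concrete illustration (using SVC purely as an example, not because the tutorial does), constructor arguments are stored unchanged as public attributes on the instance:

```python
# Constructor parameters of any scikit-learn estimator are kept as
# attributes with the same names, before and after fitting.
from sklearn.svm import SVC

estimator = SVC(C=1.0, kernel='linear')
print(estimator.C)       # 1.0
print(estimator.kernel)  # linear
```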

Parameters estimated from the data carry a trailing underscore:

>>> estimator.estimated_param_
Parameters are generally chosen by grid search and cross-validation.
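A minimal sketch of that idea with GridSearchCV, on synthetic data (the data, estimator, and parameter grid here are illustrative; recent scikit-learn versions expose GridSearchCV under sklearn.model_selection):

```python
# Exhaustively try each alpha with 5-fold cross-validation and
# keep the setting with the best mean validation score.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ rng.randn(5) + 0.1 * rng.randn(100)

search = GridSearchCV(Ridge(), {'alpha': np.logspace(-4, -1, 6)}, cv=5)
search.fit(X, y)
print(search.best_params_)   # e.g. {'alpha': ...}
```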

Model persistence

Use pickle
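A small persistence sketch: fit an estimator, serialize it with pickle, restore it, and check the restored copy predicts identically (for large models the scikit-learn docs also suggest joblib):

```python
# Fit, serialize to bytes, deserialize, and reuse the estimator.
import pickle
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear').fit(iris.data, iris.target)

blob = pickle.dumps(clf)    # serialize the fitted estimator
clf2 = pickle.loads(blob)   # restore it later, e.g. from disk
print((clf2.predict(iris.data) == clf.predict(iris.data)).all())  # True
```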

A tutorial on statistical-learning for scientific data processing

Supervised learning: predicting an output variable from high-dimensional observations

Linear regression

  • No regularization:
    regr = linear_model.LinearRegression()
  • With regularization: Ridge
    With few data points, noise in the observations induces high variance,
    so the regression coefficients need to be shrunk toward zero.
    regr = linear_model.Ridge(alpha=.1)
    An example of searching for the alpha parameter:
>>> alphas = np.logspace(-4, -1, 6)
>>> from __future__ import print_function
>>> print([regr.set_params(alpha=alpha
...             ).fit(diabetes_X_train, diabetes_y_train,
...             ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas]) 
[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
  • Sparsity: Lasso
    With many features, the 'curse of dimensionality' tends to appear; since some
    features are not informative, Lasso sets the coefficients of the
    non-informative ones to 0.
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
...             ).fit(diabetes_X_train, diabetes_y_train
...             ).score(diabetes_X_test, diabetes_y_test)
...        for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute='auto',
   tol=0.0001, warm_start=False)
>>> print(regr.coef_)
[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.
 -187.19554705   69.38229038  508.66011217   71.84239008]
  • Classification: LogisticRegression
    When linear regression is used to label the iris data, it gives too much
    weight to data far from the decision frontier. The linear approach is to
    fit a sigmoid, or logistic, function:
    \(y = \textrm{sigmoid}(X\beta - \textrm{offset}) + \epsilon = \frac{1}{1 + \textrm{exp}(- X\beta + \textrm{offset})} + \epsilon\)
>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, penalty='l2',
          random_state=None, tol=0.0001)

Support vector machines (SVMs)

SVMs belong to the discriminant-model family.

>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)    
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

Note: for many estimators, scaling the data to unit standard deviation improves performance.
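A sketch of that preprocessing step with StandardScaler (the toy matrix here is illustrative):

```python
# Standardize each feature to zero mean and unit standard deviation
# before handing the data to an estimator.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaler = StandardScaler().fit(X)   # learns per-feature mean and std
X_scaled = scaler.transform(X)
print(X_scaled.std(axis=0))        # [1. 1.]
print(X_scaled.mean(axis=0))       # [0. 0.]
```

The same fitted scaler should then be applied, via transform, to any test data so that train and test features share one scale.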

posted @ 2015-03-10 10:08  marquis