[Machine Learning in Practice] -- Titanic Dataset (3) -- Logistic Regression

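1. Preface:

This post belongs to the hands-on part of the series and focuses on applying the algorithm in a real project. For a detailed treatment of logistic regression itself, see the references below, which helped me a great deal while I was learning:

Statistical Learning Methods (统计学习方法), Li Hang

Summary of Logistic Regression Principles (逻辑回归原理小结) https://www.cnblogs.com/pinard/p/6029432.html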

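2. Dataset:

Dataset URL: https://www.kaggle.com/c/titanic

Titanic is one of the Kaggle projects with the most participants. The data is small and simple, which makes it a good starting point for beginners and a convenient basis for studying and comparing machine learning algorithms.

The dataset contains 11 variables: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. These are used to predict whether a passenger survived the Titanic disaster.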

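3. Algorithm overview:

This section briefly introduces logistic regression. Despite the word "regression" in its name, logistic regression is a classic classification algorithm and belongs to the family of log-linear models.

3.1 The logistic regression model:

3.1.1 Binary logistic regression:

Let $x\in R^{n}$, $y \in \{0,1\}$, let $w$ be the weight vector including the bias term, and let $h_{w}(x)$ be the model output. Then:

$h_{w}(x) = \frac{1}{1+e^{-w \cdot x}}$

As a function of $w \cdot x$, $h_{w}(x)$ is the logistic distribution function: an S-shaped curve that is point-symmetric about $(0, \frac{1}{2})$. When the model output satisfies $h_{w}(x) > 0.5$, the predicted class is 1; when $h_{w}(x) < 0.5$, the predicted class is 0.

The conditional probabilities are therefore: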

$P(Y=1|x) = \frac{e^{w \cdot x }}{1 + e^{w \cdot x }}$

$P(Y=0|x) = \frac{1}{1 + e^{w \cdot x }}$

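Now introduce the odds. The odds of an event are the ratio of the probability that it occurs to the probability that it does not occur, and the log-odds are denoted logit. With $p = P(Y=1|x)$:

$\mathrm{logit}(p) = \log\frac{P(Y=1|x)}{P(Y=0|x)} = w \cdot x$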
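The relations above can be checked numerically. The following is a minimal sketch, assuming hypothetical weights and a single sample whose last component is a constant 1 for the bias; none of these values come from the Titanic data.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical values for illustration: the last weight is the bias,
# and the sample is extended with a constant 1 to match it.
w = np.array([0.8, -1.5, 0.3])
x = np.array([2.0, 1.0, 1.0])

z = w @ x                    # w . x
p1 = sigmoid(z)              # P(Y=1|x)
p0 = 1.0 - p1                # P(Y=0|x)
log_odds = np.log(p1 / p0)   # logit(p), which should equal w . x
label = int(p1 > 0.5)        # decision rule from the text

print(z, p1, log_odds, label)
```

Since $w \cdot x > 0$ for this sample, the probability is above 0.5 and the predicted class is 1.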

3.1.2 Multinomial logistic regression:

Let $x\in R^{n}$, $y \in \{1,2,...,K\}$, let $w_{k}$ be the weight vector (including bias) of class $k$, and let $h_{w_{k}}(x)$ be the model output. Then:

$h_{w_{k}}(x) = P(Y=k|x) = \frac{e^{w_{k} \cdot x}}{1+\sum_{j=1}^{K-1}e^{w_{j} \cdot x}}$, $k=1,2,...,K-1$

$h_{w_{K}}(x) = P(Y=K|x) = \frac{1}{1+\sum_{j=1}^{K-1}e^{w_{j} \cdot x}}$

Moreover, the log-odds of each class relative to class $K$ are linear in $x$:

$\ln\frac{P(Y=1|x, w)}{P(Y=K|x, w)} = w_{1} \cdot x$

...

$\ln\frac{P(Y=K-1|x, w)}{P(Y=K|x, w)} = w_{K-1} \cdot x$
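As a rough illustration of the multinomial formulas, the sketch below computes the $K$ class probabilities with class $K$ as the reference class. The helper name `multinomial_proba` and the weight values are hypothetical, chosen only for the example.

```python
import numpy as np


def multinomial_proba(W, x):
    """Class probabilities with class K as the reference class.

    W has shape (K-1, n): one weight vector per non-reference class.
    Returns an array of length K: P(Y=1|x), ..., P(Y=K|x).
    """
    scores = np.exp(W @ x)              # e^{w_k . x} for k = 1..K-1
    denom = 1.0 + scores.sum()          # shared denominator
    return np.append(scores, 1.0) / denom


# Hypothetical weights for K = 3 classes (two non-reference classes).
W = np.array([[0.5, -1.0, 0.2],
              [-0.3, 0.8, 0.0]])
x = np.array([1.0, 2.0, 1.0])

proba = multinomial_proba(W, x)
log_ratios = np.log(proba[:-1] / proba[-1])   # ln P(Y=k|x) / P(Y=K|x)

print(proba, proba.sum())
print(log_ratios, W @ x)
```

The probabilities sum to 1, and each log ratio against class $K$ matches the corresponding $w_{k} \cdot x$.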

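Note: unlike the perceptron, whose model is a function from the input space to the output space, the logistic regression model is a conditional probability distribution. So when we classify with logistic regression, we get not only the predicted class but also the probability of each class. In sklearn this corresponds to having a predict_proba() method in addition to predict().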
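A small sketch of the difference between the two methods, using synthetic data from make_classification rather than the Titanic data (the dataset and its size are assumptions made for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)

labels = clf.predict(X[:5])        # predicted classes
proba = clf.predict_proba(X[:5])   # one column per class, ordered as clf.classes_

print(labels)
print(proba)
```

Each row of `proba` sums to 1, and the predicted label is the class with the larger probability.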

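3.2 The logistic regression loss function:

The loss function of logistic regression is obtained by maximum likelihood. Let $P(Y=1|x) = \frac{1}{1+e^{-w \cdot x}} = \pi (x)$ and $P(Y=0|x) = \frac{e^{-w \cdot x}}{1+e^{-w \cdot x}} = 1-\pi (x)$.

The likelihood function of the parameter $w$ is then:

$$ L(w) = \prod_{i=1}^{N} \pi(x_{i})^{y_{i}}\left [ 1-\pi (x_{i}) \right ]^{1 - y_{i}} $$

The log-likelihood is:

$$ l(w) = \sum _{i=1}^{N}\left [ y_{i}\log \pi (x_{i}) + (1-y_{i})\log\left [ 1-\pi (x_{i}) \right ] \right ] $$

The loss function of logistic regression is defined as the negative log-likelihood:

$$ J(w) = -\sum _{i=1}^{N}\left [ y_{i}\log \pi (x_{i}) + (1-y_{i})\log\left [ 1-\pi (x_{i}) \right ] \right ] $$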
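The loss can be computed directly from its definition. The sketch below uses a small random dataset and hypothetical weights (both assumptions for illustration), and compares the result with sklearn's log_loss with normalize=False, which returns the summed rather than averaged log loss.

```python
import numpy as np
from sklearn.metrics import log_loss


def neg_log_likelihood(w, X, y):
    """J(w) = -sum_i [ y_i log pi(x_i) + (1 - y_i) log(1 - pi(x_i)) ]."""
    pi = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))


# Small illustrative dataset; the last column of ones carries the bias.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(6, 2)), np.ones((6, 1))])
y = np.array([0, 1, 1, 0, 1, 0])
w = np.array([0.5, -0.25, 0.1])   # hypothetical weights

J = neg_log_likelihood(w, X, y)

# Cross-check against sklearn: normalize=False gives the sum over samples.
pi = 1.0 / (1.0 + np.exp(-(X @ w)))
J_sklearn = log_loss(y, pi, normalize=False)

print(J, J_sklearn)
```

With $w = 0$ every sample has $\pi(x_{i}) = 0.5$, so the loss reduces to $N \log 2$, which is a handy sanity check.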

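3.3 Training logistic regression:

Training logistic regression means minimizing the logistic regression loss function. This is usually done with gradient descent or a quasi-Newton method. The gradient descent solution is outlined below.

First, write the loss function in matrix form, where $X$ is the matrix whose rows are the samples $x_{i}$, $Y$ is the label vector, $\pi(x)$ is the vector of the $\pi(x_{i})$, $\textbf{E}$ is the all-ones vector, and $\log$ is applied elementwise:

$$ J(w) = -Y^{\mathsf{T}}\log\pi(x) - (\textbf{E}- Y)^{\mathsf{T}}\log\left [ \textbf{E} - \pi(x) \right ] $$

Next, differentiate with respect to $w$, using $\frac{\partial \pi(x_{i})}{\partial w} = \pi(x_{i})\left [ 1 - \pi(x_{i}) \right ] x_{i}$ (here $\odot$ and the fractions denote elementwise operations):

$$
\begin{align*}
\frac{\partial J(w)}{\partial w} &= -X^{\mathsf{T}}\left [ \frac{1}{\pi (x)} \odot \pi(x) \odot \left ( \textbf{E} -\pi(x) \right ) \odot Y \right ] + X^{\mathsf{T}} \left [ \frac{1}{\textbf{E} - \pi(x)} \odot \pi(x) \odot \left ( \textbf{E} - \pi(x) \right ) \odot (\textbf{E} - Y) \right ] \\
&= -X^{\mathsf{T}} \left [ \left ( \textbf{E} - \pi(x) \right ) \odot Y \right ] + X^{\mathsf{T}} \left [ \pi(x) \odot (\textbf{E}-Y) \right ]\\
&= X^{\mathsf{T}} \left [ -Y + \pi(x) \odot Y + \pi(x) - \pi(x) \odot Y \right ]\\
&= X^{\mathsf{T}} \left [ \pi(x) - Y \right ]
\end{align*}
$$

Finally, by the gradient descent rule, $w$ is updated in every iteration with learning rate $\alpha$:

$$ w = w - \alpha X^{\mathsf{T}}\left [ \pi(x) - Y \right ] $$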
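The update rule can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the solver sklearn uses; the learning rate, iteration count, and the "true" weights used to generate labels are all assumptions for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, X, y):
    """Negative log-likelihood J(w)."""
    pi = sigmoid(X @ w)
    return -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))


def gradient(w, X, y):
    """Gradient of J(w): X^T (pi(x) - Y)."""
    return X.T @ (sigmoid(X @ w) - y)


def fit_gd(X, y, alpha=0.005, n_iter=1000):
    """Plain batch gradient descent: w = w - alpha * X^T (pi(x) - Y)."""
    w = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iter):
        history.append(loss(w, X, y))
        w = w - alpha * gradient(w, X, y)
    return w, history


# Synthetic data: two features plus a constant column for the bias.
rng = np.random.default_rng(0)
N = 200
X = np.hstack([rng.normal(size=(N, 2)), np.ones((N, 1))])
true_w = np.array([2.0, -1.0, 0.5])   # hypothetical generating weights
y = (rng.random(N) < sigmoid(X @ true_w)).astype(float)

w, history = fit_gd(X, y)
print(w)
print(history[0], history[-1])
```

Because the gradient here is a sum over all samples rather than a mean, the learning rate has to be fairly small; dividing the gradient by $N$ would allow a larger $\alpha$.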

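4. In practice:

1. Sklearn provides two main classes for logistic regression: LogisticRegression and LogisticRegressionCV. As the name suggests, LogisticRegressionCV has cross-validation built in. According to the official sklearn documentation, "The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process." In other words, estimators of this kind can warm-start from the results of earlier cross-validation steps, which usually speeds up computation. In my own use, however, I found that its built-in grid search only covers one hyperparameter, the regularization strength Cs. Other hyperparameters, such as the penalty type penalty and the class weights class_weight, cannot be searched over, so I ended up using LogisticRegression together with GridSearchCV for parameter selection.

2. In param_grid, penalty = 'l1' and penalty = 'l2' are kept in separate grids, because the best range of C for the two penalties may not be of the same order of magnitude. Separating them makes it possible to give each its own suitable range during tuning. In this example, penalty='l1' underfits noticeably at C<=0.01, whereas penalty='l2' only underfits noticeably at C<0.0001. This can be verified by inspecting the grid search results in grid_search.cv_results_.

3. The cv_results_ attribute of GridSearchCV looks messy at first glance, but pd.DataFrame() converts it into a DataFrame that is much easier to read. It can then be sorted by any column of interest, such as 'mean_test_score', which is very convenient.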
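The three notes above can be sketched on synthetic data as follows. The dataset, the grid values, and the selected columns are assumptions for illustration and are not the settings behind the Titanic results reported in this post.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Note 1: LogisticRegressionCV searches over Cs with built-in cross-validation.
Cs = [1e-2, 1e-1, 1, 10]
clf_cv = LogisticRegressionCV(Cs=Cs, cv=cv, max_iter=1000).fit(X, y)
print(np.ravel(clf_cv.C_))  # selected regularization strength

# Note 2: separate grids for l1 and l2 so each can have its own range of C.
param_grid = [
    {'penalty': ['l1'], 'C': [1e-2, 1e-1, 1, 10],
     'class_weight': ['balanced', None]},
    {'penalty': ['l2'], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 1],
     'class_weight': ['balanced', None]},
]
grid_search = GridSearchCV(
    LogisticRegression(solver='liblinear', max_iter=1000),
    param_grid=param_grid, cv=cv, scoring='accuracy',
    return_train_score=True,
)
grid_search.fit(X, y)

# Note 3: cv_results_ as a DataFrame, sorted by mean test score.
df = pd.DataFrame(grid_search.cv_results_)
cols = ['param_penalty', 'param_C', 'param_class_weight',
        'mean_train_score', 'mean_test_score', 'rank_test_score']
top = df.sort_values('mean_test_score', ascending=False)[cols]
print(top.head())
```

Comparing mean_train_score with mean_test_score row by row is one way to spot the underfitting described in note 2: both scores drop together when C is too small.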

The code and its output are given below:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.base import TransformerMixin, BaseEstimator


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_name):
        self.attribute_name = attribute_name

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.attribute_name].values


# Load data
data_train = pd.read_csv('train.csv')

train_x = data_train.drop('Survived', axis=1)
train_y = data_train['Survived']

# Data cleaning: categorical, discrete and continuous attributes
cat_attribs = ['Pclass', 'Sex', 'Embarked']
dis_attribs = ['SibSp', 'Parch']
con_attribs = ['Age', 'Fare']

# encoder options: OneHotEncoder() or OrdinalEncoder()
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

# StandardScaler ignores NaN when fitting and keeps it when transforming,
# so the imputer can run after the scaler.
dis_pipeline = Pipeline([
    ('selector', DataFrameSelector(dis_attribs)),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

con_pipeline = Pipeline([
    ('selector', DataFrameSelector(con_attribs)),
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
])

full_pipeline = FeatureUnion(
    transformer_list=[
        ('con_pipeline', con_pipeline),
        ('dis_pipeline', dis_pipeline),
        ('cat_pipeline', cat_pipeline),
    ]
)

train_x_cleaned = full_pipeline.fit_transform(train_x)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
clf1 = LogisticRegression(tol=1e-4, max_iter=1000, solver='liblinear')

# Separate grids for l1 and l2 so that each can have its own range of C.
param_grid = [{'penalty': ['l1'],
               'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10],
               'class_weight': ['balanced', None],
               },
              {'penalty': ['l2'],
               'C': [1e-4, 1e-3, 1e-2, 1e-1, 1, 10],
               'class_weight': ['balanced', None],
               },
              # No regularization: liblinear does not support it, so use lbfgs.
              # Older scikit-learn versions (before 1.2) expect the string 'none' here.
              {'penalty': [None],
               'class_weight': ['balanced', None],
               'solver': ['lbfgs']
               }
              ]

grid_search = GridSearchCV(clf1, param_grid=param_grid, cv=cv, scoring='accuracy', n_jobs=-1, return_train_score=True)

grid_search.fit(train_x_cleaned, train_y)
# Predictions of the refitted best estimator on the training set
predicted_y = grid_search.predict(train_x_cleaned)

df_cv_results = pd.DataFrame(grid_search.cv_results_)
print(accuracy_score(train_y, predicted_y))
print(precision_score(train_y, predicted_y))
print(recall_score(train_y, predicted_y))

Output:

0.8069584736251403
0.7814569536423841
0.6900584795321637

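With the same data cleaning steps, logistic regression (a log-linear model) and the perceptron (a linear model) reach broadly similar classification accuracy, with logistic regression a few points higher in the table below.

As in the previous posts, the best parameters "C = 0.1, class_weight = None, penalty = 'l2'" are applied to the test set and the predictions are uploaded to Kaggle. The results are:

Model	Training accuracy	Training precision	Training recall	Test accuracy (obtained by uploading to Kaggle)
Naive Bayes (best)	0.790	0.731	0.716	0.756
Perceptron	0.771	0.694	0.722	0.722
Logistic regression	0.807	0.781	0.690	0.768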

posted @ 2020-12-26 17:08 古渡镇