14. Case Study: Predicting Titanic Survival

Original post: https://blog.csdn.net/JonyHwang/article/details/78932466

In this section:

  • Passenger data analysis
  • Data preprocessing
  • Prediction with regression algorithms
  • Improving the model with random forests
  • Random forest feature importance analysis

 

1. Passenger Data Analysis

PassengerId, the passenger's ID number. I can't see this affecting survival; an ID number won't decide whether someone lives. This column can be ignored.

Survived, the survival label: a value of 1 means the person was lucky and survived; 0 means, sadly, they did not.

Pclass, the cabin class. Ships have classes of service just like high-speed rail and planes do. This attribute should affect the survival rate, since generally only the wealthy and well-connected traveled first class. Keep it.

Name, the passenger's name. I assumed this has no effect on survival, so this column can be skipped for now (we will come back to it later to extract titles and name lengths).

Sex, the passenger's sex. The whole world says "ladies first", so this column stays.

Age. Women, children, and the elderly were given priority in the rescue, so keep it.

SibSp, the number of siblings and spouses aboard. Some passengers boarded with siblings, and this could matter: trying to save them might cost you a place on a lifeboat. Keep this column.

Parch, the number of parents and children aboard. Same reasoning: rescuing parents or children could delay getting to a lifeboat. Keep it.

Ticket, the ticket number. This shouldn't have any effect. Ignore it.

Fare, the ticket fare. Same logic as Pclass: the wealthy and powerful had more influence. Keep this column.

Cabin, the cabin number. We'll ignore it (in the data it is also missing for most passengers).

Embarked, the port of embarkation. This might matter; different boarding ports could reflect differences in passengers' social standing. Keep it for now.

So we now know roughly which columns to extract:

Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
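Before modeling, a quick groupby sanity-checks these hunches. This is a sketch I ran on the standard Kaggle training file, so treat the numbers as approximate:

import pandas

titanic = pandas.read_csv("titanic_train.csv")
# Survival rate by sex and by class. Women (~0.74 vs ~0.19 for men) and
# first-class passengers (~0.63 vs ~0.24 for third class) survived at
# markedly higher rates, supporting the Sex/Pclass/Fare hunches above.
print(titanic.groupby("Sex")["Survived"].mean())
print(titanic.groupby("Pclass")["Survived"].mean())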

 

Alright, that's the table analyzed; this is the dataset we'll be working with. At that point, I figured the next step was to build a model to do the analysis for us.

So the next step was to open a Jupyter notebook and start writing code.

 

import pandas  # pandas is a library built specifically for data analysis
titanic = pandas.read_csv("titanic_train.csv")  # read the titanic_train dataset
titanic.head(5)
# print(titanic.describe())  # describe() summarizes each numeric column

 

2. Data Preprocessing

titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median()) #fillna()表示补充,median()表示求平均值
print titanic.describe()

###
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.361582    0.523008   
std     257.353842    0.486592    0.836071   13.019697    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   22.000000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   35.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
###

Running the code above produces the table shown:

count is the number of non-null values.

mean is the average.

std is the standard deviation (not the variance).

min is the minimum value.

max is the maximum value.

25%, 50%, 75% are percentiles: the values that 25%, 50%, and 75% of the data fall at or below. For Age, for example, the 25th percentile is 22, meaning a quarter of the passengers are 22 or younger (fractional ages such as 0.42 indicate infants under one year old).

 

From describe() we learn:

Before filling, the Age column had only 714 values, while every other column had 891. So Age had missing entries, nearly 200 of them (891 - 714 = 177, to be exact).

What to do? We decided at the start that Age matters for survival, so we can't simply drop the column; we have to fill in the gaps.

The question is what value to fill in without distorting the data. The median is a sensible choice, and that is what the fillna call above uses; a sketch of the reasoning follows.
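A quick check of why the median (a sketch; it re-reads the raw file so the unfilled column is visible):

raw_age = pandas.read_csv("titanic_train.csv")["Age"]
print(raw_age.isnull().sum())  # 177 missing values (891 - 714)
print(raw_age.mean())          # ~29.70, pulled up by a right-skewed tail of older passengers
print(raw_age.median())        # 28.0, robust to that tail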

 

print titanic["Sex"].unique()  #['male' 'female']
#返回其参数数组中所有不同的值,并且按照从小到大的顺序排列 # Replace all the occurences of male with the number 0. 
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic[
"Sex"] == "female", "Sex"] = 1

Machine learning algorithms generally can't work with string-valued features directly, and we want to classify the Survived column into "0" and "1", so the inputs need to be numeric too.

Hence we convert the "Sex" column to numbers, replacing "male" and "female" with 0 and 1 respectively (an equivalent one-liner is sketched below).
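Incidentally, pandas can do the same replacement in one line with Series.map; a sketch (run it instead of the .loc assignments above, since mapping an already-converted 0/1 column would produce NaN):

titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})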

 

We give the "Embarked" column the same treatment:

print titanic["Embarked"].unique()  #['S' 'C' 'Q' nan]
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

 

3. Prediction with Regression Algorithms

Below we analyze survival with three machine learning models in turn: linear regression, logistic regression, and random forests.

(1) First, the linear model.

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation.
# (Note: sklearn.cross_validation was removed in scikit-learn 0.20; a port to
# sklearn.model_selection is sketched after this section.)
from sklearn.cross_validation import KFold  # 3-fold cross validation; we average the fold scores at the end

# The columns we'll use to predict the target (the ones judged relevant above)
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset. It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)  # n_folds is the number of folds; the samples are split evenly into 3 parts

predictions = []
for train, test in kf:  # iterate over the (train, test) index pairs
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train, :])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

 

import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the true labels
accuracy = (predictions == titanic["Survived"]).mean()
print(accuracy)  # 0.783389450056
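For reference, here is a minimal port of the same loop to modern scikit-learn (0.20+), where KFold lives in sklearn.model_selection; with shuffling off the splits are the same contiguous thirds, so the result should match:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
import numpy as np

kf = KFold(n_splits=3)  # unshuffled, like the old KFold default
predictions = []
for train, test in kf.split(titanic):
    alg = LinearRegression()
    alg.fit(titanic[predictors].iloc[train, :], titanic["Survived"].iloc[train])
    predictions.append(alg.predict(titanic[predictors].iloc[test, :]))

predictions = np.concatenate(predictions, axis=0)
accuracy = ((predictions > .5).astype(int) == titanic["Survived"]).mean()
print(accuracy)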

 

Here we are calling models straight out of sklearn. scikit-learn has been called the nuclear weapon of machine learning: it packages nearly every algorithm we could want, ready to be called directly.

(2) Next, the logistic regression model:

from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean()) #0.787878787879
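In scikit-learn 0.20 and later the cross_validation module is gone; the same two lines with the model_selection replacement would be (a sketch):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())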

 

# Apply the same preprocessing to the Kaggle test set.
titanic_test = pandas.read_csv("test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())  # fill with the *training* median
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())  # Fare has a missing value in the test set
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

 

4. Improving the Model with Random Forests

A random forest works in three steps (a toy sketch follows this list):

(1) randomly sample the training rows, with replacement (bootstrapping);

(2) randomly select a subset of the features at each split;

(3) build many decision trees and combine them by majority vote.
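To make the voting idea concrete, here is a tiny hand-rolled sketch; it is purely illustrative (not from the original post) and assumes the preprocessed titanic frame from section 2:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = titanic[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].values.astype(float)
y = titanic["Survived"].values

rng = np.random.RandomState(1)
votes = []
for i in range(3):
    idx = rng.randint(0, len(X), len(X))  # (1) bootstrap: sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)  # (2) random feature subset at each split
    tree.fit(X[idx], y[idx])
    votes.append(tree.predict(X))

majority = (np.mean(votes, axis=0) > 0.5).astype(int)  # (3) majority vote of the three trees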

from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to make
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples we can have at the place where a tree branch ends (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean()) # 0.785634118967

Tuning the parameters:

alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())  # 0.814814814815
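Rather than guessing values by hand, a grid search can try the combinations for us. A sketch with GridSearchCV (found in sklearn.model_selection in modern scikit-learn, sklearn.grid_search in old versions; the grid values here are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(titanic[predictors].astype(float), titanic["Survived"])
print(search.best_params_)
print(search.best_score_)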

 

# Generating a family size column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series: here, the length of each passenger's name
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

 

import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)  获取所有标题并打印每个标题出现的频率。
print(pandas.value_counts(titles))

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))

# Add in the title column.
titanic["Title"] = titles

###


Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Countess      1
Ms            1
Lady          1
Jonkheer      1
Don           1
Mme           1
Capt          1
Sir           1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64
###

 

 

5. Random Forest Feature Importance Analysis

A feature's importance can be judged by adding noise to it (for example, shuffling its values) and measuring how much the error rate rises: the bigger the rise, the more the model relied on that feature. A sketch of that idea follows; note that the code block after it ranks features with a simpler univariate test (SelectKBest with the ANOVA F-statistic) instead.
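Here is a minimal sketch of that shuffle-a-feature idea using sklearn.inspection.permutation_importance (scikit-learn 0.22+), assuming the titanic frame and the seven-column predictors list from section 4:

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X = titanic[predictors].astype(float)
y = titanic["Survived"]

rf = RandomForestClassifier(random_state=1, n_estimators=100).fit(X, y)
# Shuffle each column 10 times and record the average drop in accuracy;
# a big drop means the model relied heavily on that feature.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=1)
for name, drop in zip(predictors, result.importances_mean):
    print(name, round(drop, 4))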

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)  # select the k features with the highest scores
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)


 

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title",]],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross validation folds
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy by comparing to the training data.
accuracy = (predictions == titanic["Survived"]).mean()
print(accuracy)
#0.821548821549
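For reference, scikit-learn ships this averaging scheme as VotingClassifier with voting="soft". A sketch follows; one caveat is that it feeds both models the same feature set, unlike the per-model predictor lists used above:

from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

features = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]
voter = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3)),
        ("lr", LogisticRegression(random_state=1)),
    ],
    voting="soft",  # average the predicted probabilities, then threshold
)
voter.fit(titanic[features].astype(float), titanic["Survived"])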

 

titles = titanic_test["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set, but not the training set
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k,v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles
# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"]))

# Now, we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

###
 
1     240
2      79
3      72
4      21
7       2
6       2
10      1
5       1
Name: Title, dtype: int64
###

 

predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]

algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

full_predictions = []
for alg, predictors in algorithms:
    # Fit the algorithm using the full training data.
    alg.fit(titanic[predictors], titanic["Survived"])
    # Predict using the test dataset.  We have to convert all the columns to floats to avoid an error.
    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]
    full_predictions.append(predictions)

# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4
predictions  # in a notebook, this cell displays the blended probabilities
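The predictions above are averaged probabilities. To turn them into a Kaggle submission you would threshold them at 0.5 and pair them with PassengerId; a sketch (the output filename is arbitrary):

submission = pandas.DataFrame({
    "PassengerId": titanic_test["PassengerId"],
    "Survived": (predictions > .5).astype(int),
})
submission.to_csv("kaggle_submission.csv", index=False)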


 
