14. Case Study: Titanic Survival Prediction
Original post: https://blog.csdn.net/JonyHwang/article/details/78932466
This section covers:
- Passenger data analysis
- Data preprocessing
- Prediction with regression algorithms
- Improving the model with random forests
- Random forest feature importance analysis
1. Passenger Data Analysis
PassengerId: the passenger's ID number. An ID number has no bearing on whether someone survives, so this column can be ignored.
Survived: the survival label. A value of 1 means the person was lucky and survived; 0 means, sadly, they did not.
Pclass: cabin class. Just as high-speed rail and airplanes have classes of service, so do ships. This attribute should affect the survival rate, since first-class cabins were generally occupied by the wealthy and powerful. Keep it.
Name: the passenger's name. At first glance this shouldn't affect survival, so we can ignore this column (we will revisit it later when we engineer the Title and NameLength features).
Sex: gender. Since "ladies first" applies worldwide, keep this column.
Age: age. Since the elderly and children are rescued first, keep it.
SibSp: the number of siblings and spouses aboard. This could matter: trying to save them might cost someone a seat on a lifeboat. Keep this column.
Parch: the number of parents and children aboard. Likewise, rescuing parents or children could delay boarding a lifeboat. Keep it.
Ticket: the ticket number. This shouldn't have any effect. Ignore it.
Fare: the ticket fare. Same reasoning as Pclass: the wealthy and powerful had more clout and influence. Keep this column.
Cabin: the cabin number. The cabin number shouldn't matter (and the column is mostly missing anyway). Ignore it.
Embarked: the port of embarkation. This column may matter: different boarding locations could reflect differences in social standing. Keep it for now.
So we roughly know which columns to extract:
Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
With the table analyzed, the next step is to build a model to do the analysis. So I opened a Jupyter notebook and started coding.
import pandas  # pandas is a library designed for data analysis
titanic = pandas.read_csv("titanic_train.csv")  # read the titanic_train dataset
titanic.head(5)
# print(titanic.describe())  # describe() summarizes each numeric column
2. Data Preprocessing
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median()) #fillna()表示补充,median()表示求平均值
print titanic.describe()
###
PassengerId Survived Pclass Age SibSp \
count 891.000000 891.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.361582 0.523008
std 257.353842 0.486592 0.836071 13.019697 1.102743
min 1.000000 0.000000 1.000000 0.420000 0.000000
25% 223.500000 0.000000 2.000000 22.000000 0.000000
50% 446.000000 0.000000 3.000000 28.000000 0.000000
75% 668.500000 1.000000 3.000000 35.000000 1.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000
Parch Fare
count 891.000000 891.000000
mean 0.381594 32.204208
std 0.806057 49.693429
min 0.000000 0.000000
25% 0.000000 7.910400
50% 0.000000 14.454200
75% 0.000000 31.000000
max 6.000000 512.329200
###
The code above prints the summary shown above:
count is the number of non-missing values.
mean is the average.
std is the standard deviation.
min is the minimum value.
max is the maximum value.
25%, 50%, 75% are the quartiles: the value at each of those percentile positions. For Age, for example, the 25th percentile is 22, meaning a quarter of passengers are 22 or younger (fractional ages represent infants under one year old).
From the output we can see:
the Age column originally has only 714 values, while the other columns have 891. That means Age has missing values: nearly 200 entries are lost.
What to do? As analyzed at the start, Age matters for survival, so we must keep this column rather than drop it. I chose to fill in the missing values.
The question is: what fill value is reasonable and won't distort the data? Here we use the median (the fillna call above), which is less sensitive to outliers than the mean.
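As a quick check (a minimal sketch, run against the raw file before the fillna step), you can count the missing entries per column directly:

import pandas
raw = pandas.read_csv("titanic_train.csv")  # re-read so the check runs before any fillna
print(raw.isnull().sum())  # Age should show 177 missing entries (891 - 714)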
print titanic["Sex"].unique() #['male' 'female']
#返回其参数数组中所有不同的值,并且按照从小到大的顺序排列 # Replace all the occurences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
Machine learning algorithms generally cannot handle string-valued features directly, and our task is to classify the Survived column into 0 and 1.
So we convert the Sex column to numeric form, replacing "male" and "female" with 0 and 1 respectively.
We apply the same treatment to the Embarked column:
print titanic["Embarked"].unique() #['S' 'C' 'Q' nan] titanic["Embarked"] = titanic["Embarked"].fillna('S') titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0 titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1 titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
3. Prediction with Regression Algorithms
Below we analyze the survival rate with three machine learning models in turn: linear regression, logistic regression, and random forests.
(1) First, the linear regression model.
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# sklearn also has a helper that makes it easy to do cross-validation
# (this is the pre-0.18 sklearn API; newer versions moved KFold to sklearn.model_selection)
from sklearn.cross_validation import KFold

# The columns we'll use to predict the target; each was judged relevant to survival above
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross-validation folds for the titanic dataset. It returns the row indices
# corresponding to train and test. We set random_state to get the same splits every run.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)  # n_folds=3: split the samples into three folds

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = titanic[predictors].iloc[train, :]
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Train the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold.
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

import numpy as np
# The predictions are in three separate numpy arrays; concatenate them into one along axis 0.
predictions = np.concatenate(predictions, axis=0)
# Map predictions to outcomes (the only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <= .5] = 0
# Accuracy is the fraction of predictions that match the true labels
accuracy = sum(predictions == titanic["Survived"]) / float(len(predictions))
print(accuracy)  # 0.783389450056
Here we call models from sklearn, which has been called the nuclear weapon of machine learning: it packages almost every algorithm we could want, ready to be called directly.
(2) Now the logistic regression model:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross-validation folds (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())  # 0.787878787879
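Note that the cross_validation module used here comes from an older sklearn; from version 0.18 on it was removed in favor of model_selection. The equivalent call in current sklearn:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

alg = LogisticRegression(random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
print(scores.mean())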
# Apply the same preprocessing to the test set
titanic_test = pandas.read_csv("test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2
4. Improving the Model with Random Forests
A random forest works by:
(1) randomly sampling the training rows (with replacement);
(2) randomly selecting features;
(3) building many decision trees and letting them vote on the prediction (see the sketch below).
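To make the voting mechanism concrete, here is a minimal hand-rolled sketch (illustrative only, assuming numpy arrays X_train, y_train, X_test; a real random forest re-picks features at every split, while this sketch picks one subset per tree). The actual model below uses sklearn's RandomForestClassifier:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tiny_forest_predict(X_train, y_train, X_test, n_trees=10, seed=1):
    rng = np.random.RandomState(seed)
    n_rows, n_feats = X_train.shape
    n_sub = max(1, int(np.sqrt(n_feats)))  # size of the random feature subset
    votes = []
    for _ in range(n_trees):
        rows = rng.randint(0, n_rows, n_rows)              # (1) bootstrap sample, with replacement
        feats = rng.choice(n_feats, n_sub, replace=False)  # (2) random subset of features
        tree = DecisionTreeClassifier(random_state=seed)
        tree.fit(X_train[rows][:, feats], y_train[rows])
        votes.append(tree.predict(X_test[:, feats]))       # (3) each tree casts a vote
    return (np.mean(votes, axis=0) > 0.5).astype(int)      # majority vote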
Now the sklearn implementation:
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize our algorithm with the default parameters:
# n_estimators is the number of trees we want to build
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples at a leaf (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
# Compute the accuracy score for all the cross-validation folds
kf = cross_validation.KFold(titanic.shape[0], n_folds=3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())  # 0.785634118967
Tuning the parameters (more trees, more conservative splits):
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
# Compute the accuracy score for all the cross-validation folds
kf = cross_validation.KFold(titanic.shape[0], 3, random_state=1)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())  # 0.814814814815
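Rather than tuning by hand, a grid search can try the combinations systematically. A sketch using sklearn's GridSearchCV (the parameter ranges here are illustrative, not tuned values):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [10, 50, 100],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
search.fit(titanic[predictors], titanic["Survived"])
print(search.best_params_, search.best_score_)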
Next, we engineer new features from the existing columns:
# Generating a FamilySize column: siblings/spouses plus parents/children aboard
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
# The .apply method generates a new series: here, the length of each passenger's name
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
import re

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title. Titles always consist of
    # capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer. Some titles are very rare and are compressed
# into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v
# Verify that we converted everything.
print(pandas.value_counts(titles))
# Add in the Title column.
titanic["Title"] = titles
###
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Countess      1
Ms            1
Lady          1
Jonkheer      1
Don           1
Mme           1
Capt          1
Sir           1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64
###
5. Random Forest Feature Importance Analysis
One way to judge a feature's importance is to add noise to (or shuffle) its values and measure how much the error rate rises: the bigger the increase, the more important the feature. The code below actually scores features with univariate selection (SelectKBest with an ANOVA F-test); a sketch of the noise-based approach follows it.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "NameLength"]

# Perform feature selection: keep the k features with the highest scores
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])
# Get the raw p-value for each feature, and transform the p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores. See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
(The bar chart shows one score per predictor; Pclass, Sex, Title, and Fare stand out.)
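The noise-based idea described above is essentially permutation importance: shuffle one column at a time and measure how much accuracy drops. A minimal sketch (the helper name is mine; newer sklearn also ships a built-in sklearn.inspection.permutation_importance):

import numpy as np

def permutation_importance_sketch(model, X, y, seed=1):
    # Baseline accuracy of an already-fitted model
    rng = np.random.RandomState(seed)
    base = (model.predict(X) == y).mean()
    drops = {}
    for col in X.columns:
        X_shuffled = X.copy()
        # Destroy the information in one column by shuffling it
        X_shuffled[col] = rng.permutation(X_shuffled[col].values)
        drops[col] = base - (model.predict(X_shuffled) == y).mean()
    return drops  # larger drop = more important feature

# Usage (assumes alg has been fitted, e.g. alg.fit(titanic[predictors], titanic["Survived"])):
# print(permutation_importance_sketch(alg, titanic[predictors], titanic["Survived"]))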
Finally, on the training data we ensemble a gradient boosting classifier with logistic regression by averaging their predicted probabilities:
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3),
     ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"]],
    [LogisticRegression(random_state=1),
     ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross-validation folds
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        # Select and predict on the test fold.
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 is assumed to be a 1 prediction, and below .5 is a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)
# Compute accuracy as the fraction of predictions matching the training labels.
accuracy = sum(predictions == titanic["Survived"]) / float(len(predictions))
print(accuracy)  # 0.821548821549
We apply the same title and family-size feature engineering to the test set:
titles = titanic_test["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set but not the training set
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k, v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles
# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"]))
# Now, we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]
predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title"] algorithms = [ [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors], [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]] ] full_predictions = [] for alg, predictors in algorithms: # Fit the algorithm using the full training data. alg.fit(titanic[predictors], titanic["Survived"]) # Predict using the test dataset. We have to convert all the columns to floats to avoid an error. predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1] full_predictions.append(predictions) # The gradient boosting classifier generates better predictions, so we weight it higher. predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4 predictions