Kaggle Tutorial 5: Categorical Data and One-Hot Encoding

# Select the columns whose dtype is object and that have fewer than 10 distinct values (low cardinality)

low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if
    candidate_train_predictors[cname].nunique() < 10 and
    candidate_train_predictors[cname].dtype == "object"]
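As a quick check of the filter above, here is a minimal sketch on a made-up DataFrame. The frame toy and its column names are hypothetical, chosen only for illustration, and the string columns are created with an explicit object dtype so the dtype test behaves the same across pandas versions.

```python
import pandas as pd

n = 12
# Hypothetical toy data, not the tutorial's dataset.
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl"] * 6, dtype=object),            # 2 distinct values
    "ParcelId": pd.Series([f"p{i}" for i in range(n)], dtype=object),   # 12 distinct values
    "LotArea": list(range(n)),                                          # numeric column
})

# Same filter as above: object dtype and fewer than 10 distinct values.
toy_low_cardinality_cols = [cname for cname in toy.columns
                            if toy[cname].nunique() < 10 and
                            toy[cname].dtype == "object"]
print(toy_low_cardinality_cols)
```

ParcelId is excluded because it has too many distinct values, and LotArea because it is not of object dtype.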

 

# For each object-dtype column, count how many distinct categories it contains
object_cols_nunique = {cname: candidate_train_predictors[cname].nunique()
    for cname in candidate_train_predictors.columns
    if candidate_train_predictors[cname].dtype == "object"}

 

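# The feature columns used for training and testing are the low-cardinality object columns plus the numeric columns

# numeric_cols holds the 'int64' and 'float64' columns (step 7 of the summary)
numeric_cols = [cname for cname in candidate_train_predictors.columns if
    candidate_train_predictors[cname].dtype in ['int64', 'float64']]
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]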

 

# Inspect the dtype of each column; object usually means the column contains strings

train_predictors.dtypes

 

# Columns of object dtype must be one-hot encoded before a model can be trained on them; the pandas function get_dummies() does this

import pandas as pd
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
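To show what get_dummies() produces, here is a minimal sketch on a hypothetical two-column frame (the data is made up for illustration). The numeric column passes through unchanged, and the object column is replaced by one indicator column per category, named column_value.

```python
import pandas as pd

# Hypothetical toy data; the object dtype is set explicitly.
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
    "LotArea": [8450, 9600, 11250],
})

toy_encoded = pd.get_dummies(toy)

# Street is gone; Street_Grvl and Street_Pave take its place.
print(sorted(toy_encoded.columns))
# Cast to int for display: newer pandas versions return booleans, older ones integers.
print(toy_encoded["Street_Grvl"].astype(int).tolist())
```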

 

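# There are two ways to choose the feature columns: (1) drop every object column and train on the numeric columns only; (2) one-hot encode the object columns and train on them together with the numeric columns

# One-hot encoding does not always improve the score, so it is worth comparing both approaches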

 

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Train a random forest on the training features and labels, score it with cross-validation, and return the mean MAE
def get_mae(X, y):
    # multiply by -1 to get a positive MAE score; sklearn's convention is to return the negated value
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()
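The sign flip in get_mae can be seen on synthetic data. The sketch below is only an illustration: X_demo and y_demo are made-up, and random_state is fixed (an addition that is not in the function above) so that repeated runs give the same result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"x1": rng.normal(size=60), "x2": rng.normal(size=60)})
y_demo = 3 * X_demo["x1"] + rng.normal(scale=0.1, size=60)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# With this scorer, cross_val_score returns one negated MAE per fold.
raw_scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_absolute_error")
print(raw_scores)        # every entry is <= 0

# Negating the mean gives an ordinary MAE, where lower is better.
demo_mae = -1 * raw_scores.mean()
print(demo_mae)
```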

 

# Compare the scores of two models: one trained on the numeric columns only, the other on the numeric columns plus the one-hot encoded object columns

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

 

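# Align the columns so that the one-hot encoded test set has the same columns, in the same order, as the training set
# join='left' keeps the training set's columns: columns found only in the test set are dropped, and training columns missing from the test set are added to it filled with NaN
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left',
                                                                    axis=1)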
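The effect of align with join='left' is easiest to see when the two sets contain different categories. The sketch below uses two hypothetical frames, toy_train and toy_test, with made-up data. The final fillna(0) is an optional extra step that the code above does not perform; it is shown as one possible way to deal with the NaN column that alignment creates.

```python
import pandas as pd

# Hypothetical toy data: "Grvl" appears only in train, "Alley" only in test.
toy_train = pd.DataFrame({
    "LotArea": [1, 2, 3],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
})
toy_test = pd.DataFrame({
    "LotArea": [4, 5],
    "Street": pd.Series(["Pave", "Alley"], dtype=object),
})

enc_train = pd.get_dummies(toy_train)   # LotArea, Street_Grvl, Street_Pave
enc_test = pd.get_dummies(toy_test)     # LotArea, Street_Alley, Street_Pave

# join='left' uses the columns of the left frame (train) for both results.
aligned_train, aligned_test = enc_train.align(enc_test, join="left", axis=1)

print(list(aligned_test.columns))                # same columns as enc_train
print(aligned_test["Street_Grvl"].isna().all())  # column added to test, all NaN

# Optional (an assumption, not part of the original code): treat the missing
# category as "not present" by filling the NaN values with 0.
aligned_test_filled = aligned_test.fillna(0)
```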

 

 

########################## Summary #############################

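1 Read the data into train_data and test_data

2 Drop the rows of train_data whose label column is empty

3 Take the label column SalePrice from train_data and store it in target

4 Drop from train_data every column that contains missing values, plus the 'Id' and 'SalePrice' columns, and store the result in candidate_train_predictors

5 Drop from test_data every column that contains missing values, plus the 'Id' column, and store the result in candidate_test_predictors

6 From candidate_train_predictors, select the object-dtype columns with fewer than 10 distinct values and store their names in low_cardinality_cols

7 From candidate_train_predictors, select the columns of dtype 'int64' or 'float64' and store their names in numeric_cols

8 my_cols = low_cardinality_cols + numeric_cols

9 Select the my_cols columns of candidate_train_predictors and store them in train_predictors

10 Select the my_cols columns of candidate_test_predictors and store them in test_predictors

11 One-hot encode train_predictors and store the result in one_hot_encoded_training_predictors

12 Define the model evaluation function get_mae(X, y)

13 From train_predictors, select only the columns that are not of object dtype and store them in predictors_without_categoricals

14 mae_without_categoricals = get_mae(predictors_without_categoricals, target)

15 mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

16 One-hot encode train_predictors again and store the result in one_hot_encoded_training_predictors

17 One-hot encode test_predictors and store the result in one_hot_encoded_test_predictors

18 Call one_hot_encoded_training_predictors.align to align the columns of one_hot_encoded_training_predictors and one_hot_encoded_test_predictors, and store the results in final_train and final_test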
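Steps 2 to 5 of the summary have no code in the notes above, so here is a minimal sketch of one way they could look. The helper split_predictors and the toy frames are hypothetical. The sketch also assumes that the columns with missing values are found on the training data and that the same columns are then dropped from the test data, so both sets keep identical columns; the summary does not say which set decides. Step 1 would normally be a pd.read_csv call for each file and is replaced here by in-memory frames.

```python
import pandas as pd

def split_predictors(train_data, test_data):
    """Hypothetical helper covering steps 2-5 of the summary."""
    # Step 2: drop the rows whose label (SalePrice) is missing.
    train_data = train_data.dropna(axis=0, subset=["SalePrice"])
    # Step 3: take the label column.
    target = train_data["SalePrice"]
    # Steps 4-5 (assumption: missing-value columns are decided on the training data).
    cols_with_missing = [col for col in train_data.columns
                         if col not in ("Id", "SalePrice")
                         and train_data[col].isnull().any()]
    candidate_train_predictors = train_data.drop(
        ["Id", "SalePrice"] + cols_with_missing, axis=1)
    candidate_test_predictors = test_data.drop(
        ["Id"] + cols_with_missing, axis=1)
    return target, candidate_train_predictors, candidate_test_predictors

# Made-up frames standing in for the files read in step 1.
toy_train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [10, 20, 30, 40],
    "Alley": pd.Series([None, "Grvl", None, "Pave"], dtype=object),
    "Street": pd.Series(["Pave", "Pave", "Grvl", "Pave"], dtype=object),
    "SalePrice": [100.0, None, 300.0, 400.0],
})
toy_test = pd.DataFrame({
    "Id": [5, 6],
    "LotArea": [50, 60],
    "Alley": pd.Series(["Grvl", None], dtype=object),
    "Street": pd.Series(["Pave", "Grvl"], dtype=object),
})

toy_target, toy_cand_train, toy_cand_test = split_predictors(toy_train, toy_test)
print(list(toy_cand_train.columns), list(toy_cand_test.columns))
```

In this toy data the row with the missing SalePrice is removed first, and Alley is dropped from both sets because it still contains missing values in the remaining training rows.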

posted on 2019-03-01 09:57  wangzhonghan
