Kaggle Tutorial 5: Categorical Data and One-Hot Encoding
# Find the low-cardinality columns: dtype is object and fewer than 10 distinct values
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if
                        candidate_train_predictors[cname].nunique() < 10 and
                        candidate_train_predictors[cname].dtype == "object"]
# For every object-dtype column, count how many distinct values it has
object_cols_nunique = {cname: candidate_train_predictors[cname].nunique()
                       for cname in candidate_train_predictors.columns if
                       candidate_train_predictors[cname].dtype == "object"}
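The two filters above can be tried on a small frame. This is a minimal sketch: the `toy` DataFrame and its column values are made up for illustration, and the threshold is lowered from 10 to 3 so that the tiny frame has a column on each side of it. The string columns are created with an explicit object dtype, on the assumption that newer pandas versions may otherwise infer a dedicated string dtype.

```python
import pandas as pd

# Hypothetical toy data standing in for candidate_train_predictors
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype="object"),  # 2 distinct values
    "Owner": pd.Series(["ann", "bob", "cat", "dan"], dtype="object"),       # 4 distinct values
    "LotArea": [8450, 9600, 11250, 9550],                                   # numeric column
})

# Same filter as in the notes, but with a threshold of 3 instead of 10
low_card = [c for c in toy.columns
            if toy[c].nunique() < 3 and toy[c].dtype == "object"]

# Distinct-value count for every object column
object_nunique = {c: toy[c].nunique() for c in toy.columns
                  if toy[c].dtype == "object"}

print(low_card)        # ['Street']
print(object_nunique)  # {'Street': 2, 'Owner': 4}
```

`Owner` is left out because every row has a different value; one-hot encoding such a column would add one new column per row.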
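# numeric_cols (step 7 of the summary): columns whose dtype is int64 or float64
numeric_cols = [cname for cname in candidate_train_predictors.columns if
                candidate_train_predictors[cname].dtype in ['int64', 'float64']]
# Feature columns for training and testing: the object columns with fewer than 10 distinct values, plus the numeric columns
my_cols = low_cardinality_cols + numeric_cols
train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]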
# Check the dtype of each column; object usually means the column contains strings
train_predictors.dtypes
# Object-dtype columns must be one-hot encoded with pandas' get_dummies() before they can be used to train a model
import pandas as pd
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
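To see what `pd.get_dummies()` does to a frame, here is a small sketch on made-up data (the `toy` frame is hypothetical, not the tutorial's housing data). Numeric columns pass through unchanged, and each object column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one object column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype="object"),
})

encoded = pd.get_dummies(toy)

# The numeric column is kept; Street becomes Street_Grvl and Street_Pave
print(list(encoded.columns))  # ['LotArea', 'Street_Grvl', 'Street_Pave']
```

The indicator columns are named `<column>_<value>`, and their dtype (boolean or small integer) depends on the pandas version.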
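# There are two ways to choose the feature columns: (1) drop every object column and train on the numeric columns only, or (2) one-hot encode the object columns and train on them together with the numeric columns
# One-hot encoding helps on some datasets and not on others, so compare the scores of both approaches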
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
# Train a random forest on the training features and labels, evaluate it with cross-validated MAE, and return the MAE score
def get_mae(X, y):
    # Multiply by -1 to get a positive MAE, because sklearn's convention is to return the negated value
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50),
                                X, y,
                                scoring='neg_mean_absolute_error').mean()
# Compare the scores of two models: one trained on the numeric columns only, the other trained on the one-hot encoded object columns plus the numeric columns
predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])
mae_without_categoricals = get_mae(predictors_without_categoricals, target)
mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)
print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))
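The comparison above can be reproduced end to end on synthetic data. The sketch below is only an illustration: the `area`, `quality` and `price` values are generated, not taken from the tutorial's dataset, and `random_state=0` is added so that repeated runs give the same scores. The categorical column is built to carry signal here, so one would expect the encoded model to score better, but on real data the result can go either way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def get_mae(X, y):
    # cross_val_score returns negated MAE, so flip the sign
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    return -1 * cross_val_score(model, X, y,
                                scoring="neg_mean_absolute_error").mean()

# Synthetic data (hypothetical): price depends on area and on a categorical quality level
rng = np.random.default_rng(0)
n = 120
area = rng.uniform(50, 200, n)
quality = rng.choice(["low", "mid", "high"], n)
bonus = np.array([{"low": 0, "mid": 50, "high": 100}[q] for q in quality])
price = area * 2 + bonus + rng.normal(0, 5, n)

X = pd.DataFrame({"area": area,
                  "quality": pd.Series(quality, dtype="object")})

# Approach 1: drop the object column; approach 2: one-hot encode it
mae_dropped = get_mae(X.select_dtypes(exclude=["object"]), price)
mae_one_hot = get_mae(pd.get_dummies(X), price)

print("MAE when dropping categoricals:", round(mae_dropped, 2))
print("MAE with one-hot encoding:", round(mae_one_hot, 2))
```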
# Align the columns so that the one-hot encoded test set has exactly the same columns as the one-hot encoded training set
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors,
                                                                    join='left',
                                                                    axis=1)
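Alignment matters when a category appears in only one of the two sets. The sketch below uses two made-up frames in which `Grvl` occurs only in the training data. With `join='left'` the test frame receives the training frame's columns, and the column it was missing is filled with NaN. Passing `fill_value=0` to `align` is shown as one possible way to fill that gap; the notes above do not do this.

```python
import pandas as pd

# Hypothetical toy frames: "Grvl" appears in the training data only
train = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl"], dtype="object"),
    "LotArea": [8450, 9600],
})
test = pd.DataFrame({
    "Street": pd.Series(["Pave", "Pave"], dtype="object"),
    "LotArea": [11250, 9550],
})

train_enc = pd.get_dummies(train)  # LotArea, Street_Grvl, Street_Pave
test_enc = pd.get_dummies(test)    # LotArea, Street_Pave

# join='left' keeps exactly the columns of the left (training) frame
aligned_train, aligned_test = train_enc.align(test_enc, join="left", axis=1)
print(list(aligned_test.columns))             # ['LotArea', 'Street_Grvl', 'Street_Pave']
print(aligned_test["Street_Grvl"].isna().all())  # True

# Optional variant (an assumption, not part of the notes): fill the missing column with 0
_, filled_test = train_enc.align(test_enc, join="left", axis=1, fill_value=0)
```

With `join='left'`, a column that exists only in the test frame would be dropped, so the model always sees the columns it was trained on.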
##########################Summary#############################
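1 Read the data into train_data and test_data
2 Drop the rows of train_data whose label column is empty
3 Take the label column SalePrice from train_data and store it in the variable target
4 From train_data, drop every column that contains missing values plus the 'Id' and 'SalePrice' columns, and store the result in candidate_train_predictors
5 From test_data, drop every column that contains missing values plus the 'Id' column, and store the result in candidate_test_predictors
6 From candidate_train_predictors, select the object-dtype columns with fewer than 10 distinct values and store their names in low_cardinality_cols
7 From candidate_train_predictors, select the columns of dtype 'int64' or 'float64' and store their names in numeric_cols
8 my_cols = low_cardinality_cols + numeric_cols
9 Select the my_cols columns of candidate_train_predictors and store them in train_predictors
10 Select the my_cols columns of candidate_test_predictors and store them in test_predictors
11 One-hot encode train_predictors and store the result in one_hot_encoded_training_predictors
12 Define the model evaluation function get_mae(X, y)
13 From train_predictors, select only the non-object columns and store them in predictors_without_categoricals
14 mae_without_categoricals = get_mae(predictors_without_categoricals, target)
15 mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)
16 One-hot encode train_predictors and store the result in one_hot_encoded_training_predictors
17 One-hot encode test_predictors and store the result in one_hot_encoded_test_predictors
18 Use one_hot_encoded_training_predictors.align to align the columns of one_hot_encoded_training_predictors and one_hot_encoded_test_predictors, and store the results in final_train and final_test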
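Steps 1 to 5 of the summary have no code in these notes, so here is a hedged sketch of how they might look. The two frames are small made-up stand-ins for the result of reading the training and test files, so the block runs on its own. One assumption to flag: the list of columns with missing values is computed on the training data and reused for the test data, which keeps the two frames on the same columns; step 5 could also be read as checking the test data separately.

```python
import numpy as np
import pandas as pd

# Step 1: hypothetical stand-ins for the data read from the train and test files
train_data = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [208500.0, 181500.0, np.nan, 140000.0],
})
test_data = pd.DataFrame({
    "Id": [5, 6],
    "LotArea": [11622, 14267],
    "Alley": [np.nan, np.nan],
    "Street": ["Pave", "Pave"],
})

# Step 2: drop the rows whose label is missing
train_data = train_data.dropna(axis=0, subset=["SalePrice"])

# Step 3: take the label column
target = train_data["SalePrice"]

# Steps 4 and 5: drop Id, the label, and the columns with missing values
# (assumption: the list comes from the training data and is reused for the test data)
cols_with_missing = [c for c in train_data.columns
                     if train_data[c].isnull().any()]
candidate_train_predictors = train_data.drop(["Id", "SalePrice"] + cols_with_missing, axis=1)
candidate_test_predictors = test_data.drop(["Id"] + cols_with_missing, axis=1)

print(list(candidate_train_predictors.columns))  # ['LotArea', 'Street']
```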
posted on 2019-03-01 09:57 by wangzhonghan