Kaggle Tutorial 10: Data Leakage

Data leakage

There are two common types of data leakage: Leaky Predictors and Leaky Validation Strategies.

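Leaky Predictors: any feature whose value is created or updated after the target value is known should be excluded from the features used for training. Such a feature will not be available when the model makes real predictions, so it inflates the validation score without improving the model.

How to avoid it:

1 Screen for feature columns that may leak, starting with columns that are statistically correlated with the target column

2 If your model's predictions are extremely accurate, check whether there is data leakage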
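
The screening step above can be sketched as follows. This is only a minimal illustration on made-up data: the helper `rank_features_by_target_correlation`, the toy columns, and the 0.8 threshold are all assumptions, not part of the tutorial. A high correlation is a reason to check when a column gets its value, not proof of leakage.

```python
import numpy as np
import pandas as pd

def rank_features_by_target_correlation(df, target):
    """Absolute Pearson correlation of each numeric/bool feature with the target, highest first."""
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    corr = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return corr.sort_values(ascending=False)

# Hypothetical toy data: `took_antibiotic` is recorded after the diagnosis,
# so it is updated as a consequence of the target and leaks it.
rng = np.random.default_rng(0)
n = 1000
got_pneumonia = rng.random(n) < 0.3
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "weight": rng.normal(70, 12, size=n),
    # Most diagnosed patients get the drug afterwards; almost nobody else does.
    "took_antibiotic": np.where(got_pneumonia, rng.random(n) < 0.95, rng.random(n) < 0.02),
    "got_pneumonia": got_pneumonia,
})

ranking = rank_features_by_target_correlation(toy, "got_pneumonia")
print(ranking)

# The 0.8 cut-off is an arbitrary assumption; flagged columns still need a manual review.
suspects = ranking[ranking > 0.8].index.tolist()
print("Columns to review for leakage:", suspects)
```

Columns flagged this way are only candidates: the deciding question is still whether the value would be available at the moment a prediction is made.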

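Leaky Validation Strategies: if you fit a preprocessing step (for example, imputation of missing values) on the whole dataset first and only then call train_test_split, the validation data has already influenced the preprocessing. This is a leaky validation strategy.

How to avoid it:

1 If validation is based on a simple train-test split, exclude the validation data from any fitting, including the fitting of preprocessing steps such as imputation

2 Use scikit-learn Pipelines, so that preprocessing is fitted inside each cross-validation fold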
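
Both options can be sketched on synthetic data as follows. The data, the choice of `SimpleImputer` with a mean strategy, and the forest settings are assumptions made for illustration; they are not taken from the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data with missing values, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # target is computed before values are removed
X[rng.random(X.shape) < 0.1] = np.nan     # knock out roughly 10% of the entries

# Option 1: split first, then fit the imputer on the training rows only.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)  # column means come from training rows
X_valid_imputed = imputer.transform(X_valid)      # validation rows are transformed, never fitted
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train_imputed, y_train)
holdout_accuracy = model.score(X_valid_imputed, y_valid)
print("Hold-out accuracy: %f" % holdout_accuracy)

# Option 2: put the imputer in a pipeline; cross_val_score refits the whole
# pipeline on the training part of every fold, so no fold sees its own validation rows.
pipeline = make_pipeline(SimpleImputer(strategy="mean"),
                         RandomForestClassifier(n_estimators=50, random_state=0))
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Cross-val accuracy: %f" % cv_scores.mean())
```

The leaky variant would call `imputer.fit_transform(X)` on the full array before splitting, which lets the validation rows contribute to the column means.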

Example:

import pandas as pd

# Read 'yes'/'no' columns as booleans
data = pd.read_csv('../input/AER_credit_card_data.csv',
                   true_values=['yes'],
                   false_values=['no'])
print(data.head())

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = data.card
X = data.drop(['card'], axis=1)

# Since there was no preprocessing, we didn't need a pipeline here. Used anyway as best practice
modeling_pipeline = make_pipeline(RandomForestClassifier())
cv_scores = cross_val_score(modeling_pipeline, X, y, scoring='accuracy')
print("Cross-val accuracy: %f" %cv_scores.mean())

Cross-val accuracy: 0.979528

expenditures_cardholders = data.expenditure[data.card]      # expenditures of people who have the card
expenditures_noncardholders = data.expenditure[~data.card]  # expenditures of people who do not have the card

# Share of cardholders whose expenditure is exactly 0
print('Fraction of those who received a card with no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))
# Share of non-cardholders whose expenditure is exactly 0
print('Fraction of those who did not receive a card with no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))

Fraction of those who received a card with no expenditures: 0.02
Fraction of those who did not receive a card with no expenditures: 1.00

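The result shows that only 2% of cardholders have no expenditure, while 100% of people without the card have no expenditure. This strongly suggests that the expenditure column records spending on this card, so its value is only known after the card has been issued.

The expenditure feature therefore causes data leakage and should be dropped before training the model.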
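
A minimal sketch of that last step, comparing cross-validation accuracy with and without the leaky column. Because the CSV is not bundled here, the code builds a hypothetical stand-in dataset; the columns `income` and `age`, the way `card` is generated, and the forest settings are all assumptions chosen only to reproduce the pattern found above (expenditure is 0 for everyone without the card).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the credit card data, for illustration only.
rng = np.random.default_rng(0)
n = 600
income = np.clip(rng.normal(3.5, 1.5, size=n), 0.5, None)
age = rng.integers(18, 70, size=n)
card = (income + rng.normal(0, 1.5, size=n)) > 3.0
# Spending on the card exists only for cardholders, mimicking the leak.
expenditure = np.where(card, rng.gamma(2.0, 100.0, size=n), 0.0)
toy = pd.DataFrame({"income": income, "age": age,
                    "expenditure": expenditure, "card": card})

y = toy.card
X_leaky = toy.drop(["card"], axis=1)                 # still contains expenditure
X_clean = toy.drop(["card", "expenditure"], axis=1)  # leaky column removed

def mean_cv_accuracy(X, y):
    """Mean 5-fold cross-validation accuracy of a random forest pipeline."""
    pipeline = make_pipeline(RandomForestClassifier(n_estimators=50, random_state=0))
    return cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()

leaky_accuracy = mean_cv_accuracy(X_leaky, y)
clean_accuracy = mean_cv_accuracy(X_clean, y)
print("Cross-val accuracy with expenditure: %f" % leaky_accuracy)
print("Cross-val accuracy without expenditure: %f" % clean_accuracy)
```

The lower score without the column is the more honest estimate, since it reflects only information that exists before the card decision is made.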

posted on 2019-03-12 15:03 by wangzhonghan
