Kaggle Tutorial -- 10 -- Data Leakage

Data leakage

There are two common kinds of data leakage: Leaky Predictors and Leaky Validation Strategies.

 

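Leaky Predictors: any feature that is created or updated after the target value is determined should be excluded from the training set, because it will not be available when you make real predictions.

How to avoid it:

1 Screen for features that might leak: look for feature columns that are statistically correlated with the target column.

2 If your model's predictions are extremely accurate, check whether there is data leakage.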
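The screening step above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's own code: the helper name rank_features_by_target_correlation, the toy DataFrame and all of its column names are made up for the example, and a plain Pearson correlation is only one of several ways to spot suspicious columns.

```python
import numpy as np
import pandas as pd

def rank_features_by_target_correlation(df, target):
    """Absolute Pearson correlation of each numeric/bool feature with the target, highest first."""
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    corr = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return corr.sort_values(ascending=False)

# Hypothetical toy data: "leaky_feature" is only non-zero when the target is True,
# mimicking a column that gets filled in after the target is known.
rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n).astype(bool)
toy = pd.DataFrame({
    "target": target,
    "honest_feature": rng.normal(size=n) + 0.5 * target,
    "leaky_feature": np.where(target, rng.uniform(1, 100, size=n), 0.0),
    "noise": rng.normal(size=n),
})

ranking = rank_features_by_target_correlation(toy, "target")
print(ranking)
```

A column at the top of such a ranking is not automatically a leak; it is a candidate to inspect by asking when and how its value is recorded.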

 

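Leaky Validation Strategies: if you run data preprocessing (for example, imputing missing values) first and only then call train_test_split, the preprocessing has already seen the validation data. That is a leaky validation strategy.

How to avoid it:

1 If validation is based on a simple train-test split, keep the validation data out of any kind of fitting, including the fitting of preprocessing steps such as imputation.

2 Use scikit-learn Pipelines.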
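As a rough sketch of the pipeline approach, the snippet below puts the imputer inside the pipeline so that cross_val_score refits it on the training part of each fold only. The data here is synthetic and invented for the illustration; the choice of SimpleImputer with the mean strategy and of 5 folds is an assumption, not something the tutorial prescribes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (hypothetical): the label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
# Knock out about 10% of the entries so that imputation is actually needed.
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer is a pipeline step, so within each fold it is fitted on the
# training rows only and then applied to the held-out rows.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Cross-val accuracy: %f" % scores.mean())
```

Fitting the imputer on all of X before splitting would let statistics from the validation rows influence the training data, which is exactly the leak described above.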

 

 

Example:

import pandas as pd

data = pd.read_csv('../input/AER_credit_card_data.csv',
                   true_values=['yes'],
                   false_values=['no'])
print(data.head())

 

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

y = data.card
X = data.drop(['card'], axis=1)

# Since there was no preprocessing, we didn't need a pipeline here. Used anyway as best practice
modeling_pipeline = make_pipeline(RandomForestClassifier())
cv_scores = cross_val_score(modeling_pipeline, X, y, scoring='accuracy')
print("Cross-val accuracy: %f" %cv_scores.mean())

 

Cross-val accuracy: 0.979528

 

expenditures_cardholders = data.expenditure[data.card]  # expenditures of people who have the card
expenditures_noncardholders = data.expenditure[~data.card]  # expenditures of people who do not have the card

# Fraction of cardholders whose expenditure is 0
print('Fraction of those who received a card with no expenditures: %.2f' \
      % ((expenditures_cardholders == 0).mean()))
# Fraction of non-cardholders whose expenditure is 0
print('Fraction of those who did not receive a card with no expenditures: %.2f' \
      % ((expenditures_noncardholders == 0).mean()))

 

Fraction of those who received a card with no expenditures: 0.02
Fraction of those who did not receive a card with no expenditures: 1.00

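The results show that only 2% of the people who have the card had no expenditures, while 100% of the people without the card had none. This indicates that the expenditure column most likely records spending on this very card, so its value only exists after the card has been granted.

The expenditure feature therefore causes data leakage and should be dropped when training the model.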
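To illustrate the effect of dropping such a column, here is a small sketch on synthetic data. Only the column names card and expenditure come from the example above; the helper cv_accuracy_without, the toy income column and the way the toy data is generated are assumptions made for this illustration, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy_without(df, target, drop_cols=()):
    """Mean 5-fold accuracy of a random forest after dropping the given columns."""
    y = df[target]
    X = df.drop(columns=[target, *drop_cols])
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Hypothetical toy data: card approval depends noisily on income, and
# expenditure is non-zero only for people who already hold the card.
rng = np.random.default_rng(0)
n = 400
income = rng.normal(3.0, 1.0, size=n)
card = (income + rng.normal(0.0, 1.0, size=n)) > 3.0
toy = pd.DataFrame({
    "card": card,
    "income": income,
    "expenditure": np.where(card, rng.uniform(1, 500, size=n), 0.0),
})

acc_with_leak = cv_accuracy_without(toy, "card")
acc_without_leak = cv_accuracy_without(toy, "card", ["expenditure"])
print("Accuracy with expenditure:    %.3f" % acc_with_leak)
print("Accuracy without expenditure: %.3f" % acc_without_leak)
```

The lower score without the leaky column is the more honest estimate of how the model would behave on new applicants, for whom no card spending exists yet.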

 

posted on 2019-03-12 15:03  wangzhonghan