Kaggle Tutorial -- 10 -- Data Leakage
Data leakage
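There are two common kinds of data leakage: Leaky Predictors and Leaky Validation Strategies.
Leaky Predictors: any feature that is created or updated after the target value is known should be excluded from the training set, because it will not be available when the model makes real predictions.
How to avoid it:
1 Screen for feature columns that could leak: look for columns that are statistically correlated with the target column and check when their values become known.
2 If your model's predictions are extremely accurate, check whether there is data leakage.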
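The screening step above can be sketched roughly as follows. This is only an illustration on synthetic data: the helper name rank_features_by_target_correlation and the columns approved, age, income and card_spend are made up for this example and are not part of the tutorial's dataset. A high correlation does not prove leakage; it only marks a column worth inspecting by hand.

```python
import numpy as np
import pandas as pd


def rank_features_by_target_correlation(df, target):
    """Absolute Pearson correlation of every numeric/bool feature with the target.

    Hypothetical helper for illustration; the result is a pandas Series
    sorted from the most to the least correlated feature.
    """
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    correlations = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return correlations.sort_values(ascending=False)


# Synthetic data (an assumption for this sketch, not the real credit card data).
rng = np.random.default_rng(0)
n = 500
approved = rng.random(n) < 0.7
toy = pd.DataFrame({
    "approved": approved,
    "age": rng.normal(35, 10, n),
    "income": rng.normal(3.0, 1.0, n),
    # Spending only exists once a card has been approved, so this column leaks.
    "card_spend": np.where(approved, rng.gamma(2.0, 100.0, n), 0.0),
})

ranking = rank_features_by_target_correlation(toy, "approved")
print(ranking)
```

Columns at the top of the ranking are candidates for a closer look at how and when their values are recorded.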
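Leaky Validation Strategies: if you preprocess the data first (for example, fit an imputer for missing values on the whole dataset) and only then call train_test_split, information from the validation rows has already influenced the preprocessing. That is a leaky validation strategy.
How to avoid it:
1 If validation is a simple train-test split, exclude the validation data from any kind of fitting (including fitting preprocessing steps such as imputation).
2 Use scikit-learn Pipelines.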
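Both points can be illustrated with a short sketch. It assumes a synthetic feature matrix with missing values and a mean SimpleImputer as the preprocessing step; neither comes from the tutorial's dataset. The first part fits the imputer on the training rows only, and the second part lets a pipeline repeat that per cross-validation fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # blank out roughly 10% of the entries

# 1) Simple train-test split: split first, then fit preprocessing on train only.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)  # means come from training rows only
X_valid_imputed = imputer.transform(X_valid)      # validation rows are transformed, never fitted
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train_imputed, y_train)
holdout_accuracy = model.score(X_valid_imputed, y_valid)
print("Hold-out accuracy: %f" % holdout_accuracy)

# 2) Pipeline: the imputer is re-fitted inside every cross-validation fold,
#    so the held-out fold never influences the imputation statistics.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print("Cross-val accuracy: %f" % scores.mean())
```

The pipeline version is usually the safer choice, since it is hard to forget a step when the preprocessing and the model are fitted together.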
Example:
import pandas as pd
data = pd.read_csv('../input/AER_credit_card_data.csv',
                   true_values=['yes'],
                   false_values=['no'])
print(data.head())
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
y = data.card
X = data.drop(['card'], axis=1)
# Since there was no preprocessing, we didn't need a pipeline here. Used anyway as best practice
modeling_pipeline = make_pipeline(RandomForestClassifier())
cv_scores = cross_val_score(modeling_pipeline, X, y, scoring='accuracy')
print("Cross-val accuracy: %f" %cv_scores.mean())
Cross-val accuracy: 0.979528
expenditures_cardholders = data.expenditure[data.card]  # expenditures of people who received a card
expenditures_noncardholders = data.expenditure[~data.card]  # expenditures of people who did not receive a card
print('Fraction of those who received a card with no expenditures: %.2f' \
      %((expenditures_cardholders == 0).mean()))  # share of cardholders whose expenditure is 0
print('Fraction of those who did not receive a card with no expenditures: %.2f' \
      %((expenditures_noncardholders == 0).mean()))  # share of non-cardholders whose expenditure is 0
Fraction of those who received a card with no expenditures: 0.02
Fraction of those who did not receive a card with no expenditures: 1.00
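The result shows that only 2% of the people who received a card have no expenditure, while 100% of the people who did not receive a card have none. This strongly suggests that the expenditure column records spending on this card itself, which can only be known after the card has been issued.
The expenditure feature therefore leaks the target and should be dropped when training the model.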
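A minimal sketch of that last step: drop the leaky column and score the model again. The CSV is not bundled here, so the sketch uses a synthetic stand-in whose columns card, income, age and expenditure are assumptions that only mimic the pattern found above. The numbers it prints are not results for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the credit card data (an assumption for this sketch).
rng = np.random.default_rng(0)
n = 600
income = rng.normal(3.0, 1.0, n)
age = rng.normal(35, 10, n)
card = (income + rng.normal(0.0, 1.0, n)) > 2.5  # approval depends only noisily on income
expenditure = np.where(card, rng.gamma(2.0, 100.0, n), 0.0)  # spending exists only with a card
toy = pd.DataFrame({"card": card, "income": income, "age": age, "expenditure": expenditure})

y = toy.card
X_leaky = toy.drop(["card"], axis=1)
X_clean = toy.drop(["card", "expenditure"], axis=1)  # leaky column removed


def mean_cv_accuracy(X, y):
    """Mean 5-fold cross-validated accuracy of a random forest (illustrative helper)."""
    pipeline = make_pipeline(RandomForestClassifier(n_estimators=50, random_state=0))
    return cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()


leaky_accuracy = mean_cv_accuracy(X_leaky, y)
clean_accuracy = mean_cv_accuracy(X_clean, y)
print("Cross-val accuracy with expenditure: %f" % leaky_accuracy)
print("Cross-val accuracy without expenditure: %f" % clean_accuracy)
```

A lower score after dropping the column is expected and is the more honest estimate, because the dropped information would not exist at the time a card application has to be decided.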
posted on 2019-03-12 15:03 wangzhonghan