kaggle-Elo Merchant Category Recommendation (Data Exploration) (Part 2)

I. Field value overview

　　1. Checking for missing data

DataFrame.isnull().any()   # per column: is any value missing?
or: DataFrame.isnull().sum()   # per column: number of missing values
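A minimal sketch of both checks on a small made-up DataFrame. The column names and values are illustrative assumptions, not the competition data; the extra `isnull().mean()` line is an addition that gives the missing fraction per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real train set.
df = pd.DataFrame({
    'feature_1': [1, 2, np.nan, 4],
    'feature_2': ['a', None, 'b', None],
    'target': [0.1, 0.2, 0.3, 0.4],
})

has_missing = df.isnull().any()      # per column: is any value missing?
missing_count = df.isnull().sum()    # per column: number of missing values
missing_ratio = df.isnull().mean()   # per column: fraction of missing values

print(missing_count)
```

The counts are usually more informative than the boolean flags, since they show whether a column is missing a handful of rows or most of them.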

　　2. Continuous data

DataFrame.describe()             # value range and quartiles of each numeric column
DataFrame[col].value_counts()    # frequency of each value in one column; useful for spotting outliers
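As a rough illustration, the two calls can be applied to a single continuous column like this. The series below is invented for the example; the repeated extreme value is there only to show how `value_counts()` can surface a suspicious point.

```python
import pandas as pd

# Hypothetical continuous column; -33.0 is a deliberately repeated extreme value.
target = pd.Series([0.5, -1.2, 0.3, -33.0, 0.8, -33.0, 1.1], name='target')

summary = target.describe()    # count, mean, std, min, quartiles, max
freq = target.value_counts()   # distinct values, most frequent first

print(summary[['min', 'max']])
print(freq.head(3))
```

A continuous value that repeats exactly, far away from the rest of the range, is often a placeholder or an anomaly worth a closer look.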

　　3. Discrete data

DataFrame[col].value_counts()    # frequency of each category in one column

II. Differences in data distribution

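　　For the relationships a model learns to generalize, the training, validation, and test sets need similar data distributions. In particular, the joint distribution of features and label should be consistent across them.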
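As a rough numeric complement to visual checks, one can compare the normalized value frequencies of a single feature in two sets. The helper name `frequency_gap` and the toy columns are assumptions made for this sketch; it only compares the marginal distribution of one feature, not the joint distribution with the label.

```python
import pandas as pd

def frequency_gap(train_col, test_col):
    """Largest absolute difference in value frequency between two columns."""
    train_ratio = train_col.value_counts(normalize=True)
    test_ratio = test_col.value_counts(normalize=True)
    # Align on the union of observed values; a value absent on one side counts as 0.
    train_ratio, test_ratio = train_ratio.align(test_ratio, fill_value=0)
    return (train_ratio - test_ratio).abs().max()

# Hypothetical discrete feature in train and test.
train_col = pd.Series([1, 1, 2, 3])
test_col = pd.Series([1, 2, 2, 4])
gap = frequency_gap(train_col, test_col)
print(gap)
```

A gap near 0 suggests the feature is distributed similarly in both sets; what counts as "too large" depends on the data and is left to judgment here.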

# Single-variable distributions: value counts of each feature, train vs. test
# (assumes the train and test DataFrames are already loaded)
import matplotlib.pyplot as plt

features = ['first_active_month','feature_1','feature_2','feature_3']
for feature in features:
    train[feature].value_counts().sort_index().plot()
    test[feature].value_counts().sort_index().plot()
    plt.xlabel(feature)
    plt.legend(['train','test'])
    plt.ylabel('count')
    plt.show()

　　An example plot is shown below:

# Multi-variable distributions: joint frequency of each feature pair, train vs. test
# (assumes the train and test DataFrames are already loaded)
import pandas as pd
import matplotlib.pyplot as plt

features = ['first_active_month','feature_1','feature_2','feature_3']
train_count = train.shape[0]
test_count = test.shape[0]

def combine_feature(df):
    # Join the two columns of df into one 'a&b' string key per row
    cols = df.columns
    feature1 = df[cols[0]].astype(str).values.tolist()
    feature2 = df[cols[1]].astype(str).values.tolist()
    return pd.Series([feature1[i]+'&'+feature2[i] for i in range(df.shape[0])])

n = len(features)
for i in range(n-1):
    for j in range(i+1, n):
        cols = [features[i], features[j]]
        print(cols)
        # Share of rows holding each combined key
        train_dis = combine_feature(train[cols]).value_counts().sort_index()/train_count
        test_dis = combine_feature(test[cols]).value_counts().sort_index()/test_count
        # Sorted union of the keys seen in either set, re-indexed 0..k-1 for the x-axis
        index_dis = pd.Series(train_dis.index.tolist() + test_dis.index.tolist()).drop_duplicates().sort_values().reset_index(drop=True)
        (index_dis.map(train_dis).fillna(0)).plot()
        (index_dis.map(test_dis).fillna(0)).plot()
        plt.legend(['train','test'])
        plt.xlabel('&'.join(cols))
        plt.ylabel('ratio')
        plt.show()

　　An example plot is shown below:

III. Encoding discrete data (integer codes in lexicographic order, not one-hot encoding)

def change_object_cols(se):
    value = se.unique().tolist()
    value.sort()
    return se.map(pd.Series(range(len(value)), index=value)).values
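To make the behavior concrete, here is the function applied to a small invented series of month strings. The input values are assumptions for illustration only.

```python
import pandas as pd

def change_object_cols(se):
    value = se.unique().tolist()
    value.sort()
    return se.map(pd.Series(range(len(value)), index=value)).values

# Hypothetical month strings; sorted order is 2017-01 < 2017-03 < 2018-02.
months = pd.Series(['2017-03', '2017-01', '2018-02', '2017-01'])
codes = change_object_cols(months)
print(codes.tolist())  # [1, 0, 2, 0]
```

Each distinct value gets the position it holds in the sorted list of distinct values, so equal inputs always share a code. If a string column still contains missing values, sorting strings together with NaN raises a TypeError, so it is probably safest to fill them before encoding.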

IV. Handling abnormal data

Handling missing data:
1. Discrete data can be filled with -1.
2. Continuous data is filled with the mean.
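The two rules above might look like this in pandas. The DataFrame and its column names are hypothetical stand-ins chosen for the sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical columns: one discrete, one continuous.
df = pd.DataFrame({
    'category_2': [1.0, np.nan, 3.0, 1.0],
    'purchase_amount': [10.0, np.nan, 30.0, 20.0],
})

# Discrete: -1 acts as an explicit "missing" category.
df['category_2'] = df['category_2'].fillna(-1)

# Continuous: fill with the mean of the observed values.
amount_mean = df['purchase_amount'].mean()
df['purchase_amount'] = df['purchase_amount'].fillna(amount_mean)

print(df)
```

Using -1 only works as a marker when -1 is not already a legitimate category value, which is assumed here.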

Handling infinite values:
1. Fill with the maximum and minimum values (for positive and negative infinity, respectively).
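One possible reading of this rule, sketched below, is to replace positive infinity with the largest finite value in the column and negative infinity with the smallest. That interpretation is an assumption; the original text does not say which maximum and minimum are meant.

```python
import numpy as np
import pandas as pd

# Hypothetical column containing both infinities.
s = pd.Series([1.0, np.inf, -2.0, -np.inf, 5.0])

# Assumption: "maximum/minimum" means the extreme finite values of this column.
finite = s[np.isfinite(s)]
cleaned = s.replace({np.inf: finite.max(), -np.inf: finite.min()})

print(cleaned.tolist())  # [1.0, 5.0, -2.0, -2.0, 5.0]
```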

V. Handling time

Convert to Unix timestamps.
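A small sketch of the conversion in pandas, assuming the times are stored as strings and can be treated as UTC. The sample values are made up.

```python
import pandas as pd

# Hypothetical timestamps stored as strings, treated as UTC.
dates = pd.Series(['1970-01-02 00:00:00', '2018-02-01 12:30:00'])
parsed = pd.to_datetime(dates)

# Whole seconds elapsed since 1970-01-01 00:00:00.
unix_seconds = (parsed - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(unix_seconds.tolist())
```

Subtracting the epoch and dividing by a one-second Timedelta avoids depending on the internal resolution of the datetime dtype.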

posted @ 2022-02-15 11:17 麦扣
