Rent Prediction: Problem Analysis
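The goal of this analysis is to get familiar with the competition data through exploratory data analysis (EDA), which makes the later feature engineering and model building easier. It covers missing-value analysis, feature-value analysis, a check for monotonic feature columns, the nunique distribution of the features, feature values that occur at least 100 times, the label distribution, and the label distribution across samples with different feature values.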
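1. Loading the data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the training and test sets and tag each row with its origin,
# so the two can be concatenated for joint analysis and told apart later.
data_train = pd.read_csv('./train_data.csv')
data_train['Type'] = 'Train'
data_test = pd.read_csv('./test_a.csv')
data_test['Type'] = 'Test'
data_all = pd.concat([data_train, data_test], ignore_index=True)
data_train.head(5)
Output (not shown): the first five rows of data_train.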
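The Type tag is what allows the combined frame to be split back into its two parts after joint processing. A minimal sketch of that round trip, using two small made-up frames in place of the competition files (the area column and all values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for train_data.csv and test_a.csv (values are made up).
toy_train = pd.DataFrame({"area": [50.0, 80.0, 65.0],
                          "tradeMoney": [3000, 5200, 4100]})
toy_test = pd.DataFrame({"area": [70.0, 45.0]})  # test rows carry no label

toy_train["Type"] = "Train"
toy_test["Type"] = "Test"
toy_all = pd.concat([toy_train, toy_test], ignore_index=True)

# Recover each part from the combined frame by filtering on the tag.
train_part = toy_all[toy_all["Type"] == "Train"].drop(columns="Type")
test_part = toy_all[toy_all["Type"] == "Test"].drop(columns=["Type", "tradeMoney"])

print(toy_all.shape)    # (5, 3)
print(test_part.shape)  # (2, 1)
```

Note that the label column of the test rows is filled with NaN after concatenation, so statistics on the label should be computed on the training part only.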
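2. Overview
# info() prints its summary directly and returns None, so it is not wrapped in print().
data_train.info()
print(data_train.describe())
Output:
The output shows the dtype and non-null count of every feature column, along with summary statistics for the numeric columns.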
3. Missing-value analysis
def missing_values(df):
    """Summarize the columns of df that contain missing values."""
    # Count nulls per column; to_frame() names the resulting column.
    alldata_na = df.isnull().sum().to_frame('missingNum')
    alldata_na['existNum'] = len(df) - alldata_na['missingNum']
    alldata_na['sum'] = len(df)
    alldata_na['missingRatio'] = alldata_na['missingNum'] / len(df) * 100  # in percent
    alldata_na['dtype'] = df.dtypes
    # Keep only the columns with missing values, most affected first.
    alldata_na = alldata_na[alldata_na['missingNum'] > 0].reset_index().sort_values(by=['missingNum', 'index'], ascending=[False, True])
    alldata_na.set_index('index', inplace=True)
    return alldata_na

missing_values(data_train)
Output:
Only two feature columns, pv and uv, contain missing values, affecting 18 records in total.
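To make the shape of such a summary concrete, here is a compact sketch of the same idea on a toy frame. The values are made up, and the area column is an illustrative assumption; only pv and uv are taken from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in pv and uv (made-up values).
toy = pd.DataFrame({
    "pv": [100.0, np.nan, 250.0, np.nan],
    "uv": [10.0, 12.0, np.nan, 9.0],
    "area": [50.0, 80.0, 65.0, 70.0],
})

# One row per column: missing count, missing percentage, and dtype.
summary = pd.DataFrame({
    "missingNum": toy.isnull().sum(),
    "missingRatio": toy.isnull().mean() * 100,
    "dtype": toy.dtypes,
})

# Keep only the columns that actually have gaps, most affected first.
summary = summary[summary["missingNum"] > 0].sort_values("missingNum", ascending=False)
print(summary)
```

Here isnull().mean() gives the missing fraction directly, because the mean of a boolean column is the share of True values.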
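4. Monotonic feature columns
def increasing(vals):
    """Count the positions where the next value is strictly greater than the current one."""
    cnt = 0
    len_ = len(vals)
    for i in range(len_ - 1):
        if vals[i + 1] > vals[i]:
            cnt += 1
    return cnt

fea_cols = list(data_train.columns)
for col in fea_cols:
    cnt = increasing(data_train[col].values)
    # Flag a column when at least 55% of its rows are followed by a larger value.
    if cnt / data_train.shape[0] >= 0.55:
        print('Monotonic feature:', col)
        print('Number of increasing steps:', cnt)
        print('Proportion of increasing steps:', cnt / data_train.shape[0])
Output:
One column is flagged as monotonic: tradeTime (the transaction time).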
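The same count can be computed without an explicit Python loop. The sketch below is a vectorized variant under the same definition (strictly increasing steps divided by the number of rows); the function name increasing_fraction and the toy columns are assumptions made for illustration, and numeric data is used to keep the comparison simple.

```python
import numpy as np
import pandas as pd

def increasing_fraction(series: pd.Series) -> float:
    """Fraction of rows that are followed by a strictly larger value."""
    vals = series.to_numpy()
    if len(vals) < 2:
        return 0.0
    # Compare every value with its predecessor in one vectorized step.
    steps = vals[1:] > vals[:-1]
    return float(np.sum(steps)) / len(vals)

toy = pd.DataFrame({
    "day_index": [1, 2, 3, 4, 5, 6, 7, 8],         # steadily increasing
    "area": [50, 40, 60, 30, 70, 20, 80, 10],      # zig-zag
})

for col in toy.columns:
    frac = increasing_fraction(toy[col])
    print(col, frac, "monotonic" if frac >= 0.55 else "not monotonic")
```

For a strict yes-or-no answer, pandas also offers the Series.is_monotonic_increasing property, but it fails on a single out-of-order row, whereas the ratio with a threshold tolerates some noise.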
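5. Feature nunique distribution
# categorical_feas is not defined in this post. As an assumption, take the
# string-typed columns, leaving out the Type tag and the tradeTime column.
categorical_feas = [col for col in data_train.select_dtypes(include='object').columns
                    if col not in ('Type', 'tradeTime')]
for feature in categorical_feas:
    print("Value distribution of " + feature + ":")
    print(data_train[feature].value_counts())
    # communityName has too many distinct values to plot usefully.
    if feature != 'communityName':
        plt.hist(data_all[feature], bins=3)
        plt.show()
Output:
The output shows how the values of each categorical feature are distributed.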
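The loop above prints full value counts; the number of distinct values per column can be read off more directly with DataFrame.nunique(). A small sketch on made-up data follows (communityName appears in the code above, while rentType, houseToward, and all values are illustrative assumptions):

```python
import pandas as pd

# Toy categorical columns (made-up values).
toy = pd.DataFrame({
    "rentType": ["whole", "shared", "whole", "whole"],
    "houseToward": ["S", "N", "S", "E"],
    "communityName": ["XQ001", "XQ002", "XQ003", "XQ004"],
})

# Distinct values per column, highest cardinality first.
nunique = toy.nunique().sort_values(ascending=False)
print(nunique)

# Share of distinct values relative to the row count; a value of 1.0
# means every row is different, as with an identifier-like column.
print(nunique / len(toy))
```

A column whose distinct-value share is close to 1.0 usually needs grouping or encoding before it is useful as a categorical feature.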
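6. Feature values that occur at least 100 times
for feature in categorical_feas:
    # Turn the value counts into a two-column frame: the value and its frequency.
    df_value_counts = pd.DataFrame(data_train[feature].value_counts())
    df_value_counts = df_value_counts.reset_index()
    df_value_counts.columns = [feature, 'counts']
    print(df_value_counts[df_value_counts['counts'] >= 100])
Output (not shown): for each categorical feature, the values that occur at least 100 times.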
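The filtering step can also be applied to the value_counts() result directly, without building an intermediate frame. A minimal sketch on a toy column, where the column name plate, the values, and the lowered threshold are all assumptions chosen for illustration:

```python
import pandas as pd

# Toy categorical column (made-up values).
toy = pd.Series(["A", "A", "A", "B", "B", "C"], name="plate")

min_count = 2  # the section above uses 100; lowered here to suit the toy data

# value_counts() returns a Series indexed by value, so a boolean mask
# on the counts keeps only the frequent values.
counts = toy.value_counts()
frequent = counts[counts >= min_count]
print(frequent)
```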
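7. Label distribution
# Plot tradeMoney overall and within four value ranges.
# histplot(kde=True, stat='density') replaces the deprecated sns.distplot (needs seaborn >= 0.11).
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
sns.histplot(data_train['tradeMoney'], kde=True, stat='density', ax=axes[0][0])
sns.histplot(data_train[(data_train['tradeMoney'] <= 20000)]['tradeMoney'], kde=True, stat='density', ax=axes[0][1])
sns.histplot(data_train[(data_train['tradeMoney'] > 20000) & (data_train['tradeMoney'] <= 50000)]['tradeMoney'], kde=True, stat='density', ax=axes[0][2])
sns.histplot(data_train[(data_train['tradeMoney'] > 50000) & (data_train['tradeMoney'] <= 100000)]['tradeMoney'], kde=True, stat='density', ax=axes[1][0])
sns.histplot(data_train[(data_train['tradeMoney'] > 100000)]['tradeMoney'], kde=True, stat='density', ax=axes[1][1])
plt.show()
Output:
The plots show how the transaction amount (tradeMoney) is distributed overall and within each amount range.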
# Count the training records that fall into each amount range.
print("money<=10000", len(data_train[(data_train['tradeMoney'] <= 10000)]['tradeMoney']))
print("10000<money<=20000", len(data_train[(data_train['tradeMoney'] > 10000) & (data_train['tradeMoney'] <= 20000)]['tradeMoney']))
print("20000<money<=50000", len(data_train[(data_train['tradeMoney'] > 20000) & (data_train['tradeMoney'] <= 50000)]['tradeMoney']))
print("50000<money<=100000", len(data_train[(data_train['tradeMoney'] > 50000) & (data_train['tradeMoney'] <= 100000)]['tradeMoney']))
print("100000<money", len(data_train[(data_train['tradeMoney'] > 100000)]['tradeMoney']))
Output:
The output gives the number of transactions in each amount range.
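The five range counts can also be produced in one pass with pd.cut, which assigns every amount to a bucket. The sketch below uses made-up amounts; the bucket edges mirror the ranges printed above, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy transaction amounts (made-up values).
toy_money = pd.Series([800, 9500, 10000, 15000, 30000, 75000, 120000],
                      name="tradeMoney")

edges = [-np.inf, 10000, 20000, 50000, 100000, np.inf]
labels = ["money<=10000", "10000<money<=20000", "20000<money<=50000",
          "50000<money<=100000", "100000<money"]

# With the default right=True each bucket is (lower, upper], which matches
# the "> lower and <= upper" conditions used in the print statements.
buckets = pd.cut(toy_money, bins=edges, labels=labels)
counts = buckets.value_counts(sort=False)
print(counts)
```

Keeping the edges in a single list makes it harder to leave a gap or an overlap between neighbouring ranges than writing each condition by hand.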
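8. Evaluation metric
The competition is a regression problem, scored by R-squared (the coefficient of determination): the closer the score is to 1, the better the fit.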
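The post does not spell out the formula, so the sketch below assumes the standard definition, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The function name r_square and the sample numbers are illustrative.

```python
import numpy as np

def r_square(y_true, y_pred) -> float:
    """Coefficient of determination under the standard definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)

# Made-up rents and predictions.
y_true = [3000, 5000, 4000, 6000]
y_pred = [3200, 4800, 4100, 5900]
print(r_square(y_true, y_pred))  # close to 1 for a good fit
```

Under this definition a model that always predicts the mean of the true values scores 0, and a model worse than that scores below 0. scikit-learn's sklearn.metrics.r2_score computes the same quantity.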
