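Averaging DataFrame rows or columns subject to a missing-data rate
The built-in DataFrame mean either skips NaN values (the default) or, with skipna=False, returns NaN as soon as any value is missing. It cannot decide whether to report a mean based on how often NaN occurs.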
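To make that limitation concrete, here is a minimal sketch using a hypothetical one-column frame named demo (it is not part of the original example). Three of its four values are missing, yet neither skipna setting takes that fraction into account; the missing rate has to be computed separately.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame: three of the four values are missing.
demo = pd.DataFrame({"x": [1.0, np.nan, np.nan, np.nan]})

# Default (skipna=True): the mean is taken over the single valid value.
mean_skip = demo.mean()

# skipna=False: any NaN at all makes the result NaN.
mean_strict = demo.mean(skipna=False)

# The fraction of missing values is a separate calculation.
missing_rate = demo.isna().mean()

print(mean_skip["x"], mean_strict["x"], missing_rate["x"])
```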
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, np.nan, np.nan, np.nan, np.nan],
                   [2, 3, np.nan, np.nan, np.nan],
                   [3, 4, 5, np.nan, np.nan],
                   [4, 5, 6, 7, np.nan]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['A', 'B', 'C', 'D', 'E'])
print(df)
Output:
   A    B    C    D   E
a  1  NaN  NaN  NaN NaN
b  2  3.0  NaN  NaN NaN
c  3  4.0  5.0  NaN NaN
d  4  5.0  6.0  7.0 NaN
Below is the custom function I wrote:
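def df_mean(u0, axis, limit):
    """
    Average a DataFrame along rows or columns, subject to a missing-data rate.
    :param u0: the DataFrame to average
    :param axis: 1 gives one mean per row (averaging across columns); 0 gives one mean per column (averaging across rows)
    :param limit: missing-rate threshold, 0-1; where the missing rate is greater than or equal to limit, the mean is set to NaN
    :return: a Series of means, indexed by u0.index (axis=1) or u0.columns (axis=0)
    """
    umean = []
    if axis == 1:
        for ij in u0.index:
            # Missing rate of this row = number of NaN values / row length
            if u0.loc[ij, :].isna().sum() / len(u0.loc[ij, :]) >= limit:
                umean.append(np.nan)
            else:
                umean.append(u0.loc[ij, :].mean())
        umean = pd.Series(umean, index=u0.index)
    elif axis == 0:
        for ij in u0.columns:
            # Missing rate of this column = number of NaN values / column length
            if u0.loc[:, ij].isna().sum() / len(u0.loc[:, ij]) >= limit:
                umean.append(np.nan)
            else:
                umean.append(u0.loc[:, ij].mean())
        umean = pd.Series(umean, index=u0.columns)
    else:
        # Raise instead of printing a message and returning an empty list
        raise ValueError('axis must be 0 or 1')
    return umean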
Result of calling df.mean() directly:
print(df.mean(axis=0))
A 2.5
B 4.0
C 5.5
D 7.0
E NaN
dtype: float64
Result of the custom function:
print(df_mean(df, axis=0, limit=0.6))
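A 2.5
B 4.0
C 5.5
D NaN
E NaN
dtype: float64
As shown, columns 'D' and 'E' have missing rates of 0.75 and 1.0, both at or above the limit of 0.6, so their means are set to NaN. Column 'C' has a missing rate of 0.5, below the limit, so its mean is kept.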
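The same rule can also be written without explicit loops. The following is a minimal sketch, not the function from this post: the name df_mean_vectorized is hypothetical, and it assumes that isna().mean(axis=...) is an acceptable way to compute the missing rate and that Series.where is used to mask the means.

```python
import numpy as np
import pandas as pd


def df_mean_vectorized(u0, axis, limit):
    """Mean along `axis`, set to NaN where the missing rate is >= `limit`."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    # isna() yields a boolean frame; its mean along the axis is the missing rate.
    missing_rate = u0.isna().mean(axis=axis)
    # where() keeps a mean only if its missing rate is below the limit.
    return u0.mean(axis=axis).where(missing_rate < limit)


df = pd.DataFrame([[1, np.nan, np.nan, np.nan, np.nan],
                   [2, 3, np.nan, np.nan, np.nan],
                   [3, 4, 5, np.nan, np.nan],
                   [4, 5, 6, 7, np.nan]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['A', 'B', 'C', 'D', 'E'])

col_means = df_mean_vectorized(df, axis=0, limit=0.6)  # one value per column
row_means = df_mean_vectorized(df, axis=1, limit=0.6)  # one value per row
print(col_means)
print(row_means)
```

With axis=0 this should reproduce the column result above. With axis=1 the threshold applies per row instead, so rows 'c' and 'd' (missing rates 0.4 and 0.2) keep their means of 4.0 and 5.5, while row 'a' (missing rate 0.8) becomes NaN.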