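Averaging DataFrame rows or columns based on the missing-data rate

The built-in DataFrame.mean() can only skip NaN values (or, with skipna=False, return NaN as soon as any value is missing). It cannot decide whether to report a mean based on how often NaN occurs.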

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan, np.nan],
     [2, 3, np.nan, np.nan, np.nan],
     [3, 4, 5, np.nan, np.nan],
     [4, 5, 6, 7, np.nan]],
    index=['a', 'b', 'c', 'd'],
    columns=['A', 'B', 'C', 'D', 'E'])
print(df)

Output:

   A    B    C    D   E
a  1  NaN  NaN  NaN NaN
b  2  3.0  NaN  NaN NaN
c  3  4.0  5.0  NaN NaN
d  4  5.0  6.0  7.0 NaN

Below is the custom function I wrote:

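def df_mean(u0, axis, limit):
    """
    Average a DataFrame along rows or columns, subject to a missing-rate threshold.
    :param u0: the DataFrame to average
    :param axis: 1 gives one mean per row (averaging across columns),
                 0 gives one mean per column (averaging across rows)
    :param limit: missing-rate threshold, between 0 and 1. If the missing rate
                  is greater than or equal to limit, the mean is set to NaN
    :return: pd.Series of means, indexed by u0.index (axis=1) or u0.columns (axis=0)
    """
    umean = []
    if axis == 1:
        for ij in u0.index:
            row = u0.loc[ij, :]
            # Missing rate = number of NaN entries / total entries in the row
            if row.isna().sum() / len(row) >= limit:
                umean.append(np.nan)
            else:
                umean.append(row.mean())
        umean = pd.Series(umean, index=u0.index)
    elif axis == 0:
        for ij in u0.columns:
            col = u0.loc[:, ij]
            # Missing rate = number of NaN entries / total entries in the column
            if col.isna().sum() / len(col) >= limit:
                umean.append(np.nan)
            else:
                umean.append(col.mean())
        umean = pd.Series(umean, index=u0.columns)
    else:
        # Fail loudly instead of printing a message and returning an empty list
        raise ValueError('axis must be 0 or 1')
    return umean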
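The same rule can also be written without explicit loops. What follows is only a minimal sketch, not the function above: the name `df_mean_vectorized` is hypothetical, and it relies on `isna().mean(axis=...)` to get the fraction of missing values along the chosen axis and on `Series.mask` to blank out the means where that fraction reaches `limit`.

```python
import numpy as np
import pandas as pd


def df_mean_vectorized(u0, axis, limit):
    """Hypothetical loop-free variant of df_mean (illustrative sketch)."""
    if axis not in (0, 1):
        raise ValueError("axis must be 0 or 1")
    # Fraction of NaN entries per column (axis=0) or per row (axis=1)
    missing_rate = u0.isna().mean(axis=axis)
    # Ordinary NaN-skipping mean along the same axis
    means = u0.mean(axis=axis)
    # Replace the mean with NaN wherever the missing rate reaches the limit
    return means.mask(missing_rate >= limit)


df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan, np.nan],
     [2, 3, np.nan, np.nan, np.nan],
     [3, 4, 5, np.nan, np.nan],
     [4, 5, 6, 7, np.nan]],
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C", "D", "E"],
)

col_means = df_mean_vectorized(df, axis=0, limit=0.6)  # one value per column
row_means = df_mean_vectorized(df, axis=1, limit=0.5)  # one value per row
print(col_means)
print(row_means)
```

With `axis=1` and `limit=0.5`, rows 'a' and 'b' (missing rates 0.8 and 0.6) come back as NaN, while rows 'c' and 'd' keep their means of 4.0 and 5.5.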

Result of calling df.mean() directly:

print(df.mean(axis=0))
A    2.5
B    4.0
C    5.5
D    7.0
E    NaN
dtype: float64
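For comparison, here is a short sketch of the two behaviours the built-in method offers through its `skipna` parameter. Neither of them applies a missing-rate threshold. The variable names `lenient` and `strict` are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan, np.nan],
     [2, 3, np.nan, np.nan, np.nan],
     [3, 4, 5, np.nan, np.nan],
     [4, 5, 6, 7, np.nan]],
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C", "D", "E"],
)

# Default: NaN values are skipped, so 'D' is averaged from a single value
lenient = df.mean(axis=0)
# skipna=False: any NaN in a column makes that column's mean NaN
strict = df.mean(axis=0, skipna=False)

print(lenient)
print(strict)
```

Here `strict` keeps only column 'A' (2.5), because every other column contains at least one NaN.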

Result of using the custom function:

print(df_mean(df, axis=0, limit=0.6))
A    2.5
B    4.0
C    5.5
D    NaN
E    NaN
dtype: float64

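As you can see, the missing rates of columns 'D' and 'E' (0.75 and 1.0) are at or above the 0.6 threshold, so their means are set to NaN.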
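To see where those rates come from, the per-column missing rate can be computed directly. This is a small sketch using the same example data. The names `missing_rate` and `masked` are illustrative and do not appear in the function above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, np.nan, np.nan, np.nan],
     [2, 3, np.nan, np.nan, np.nan],
     [3, 4, 5, np.nan, np.nan],
     [4, 5, 6, 7, np.nan]],
    index=["a", "b", "c", "d"],
    columns=["A", "B", "C", "D", "E"],
)

# Fraction of NaN entries in each column: A 0.0, B 0.25, C 0.5, D 0.75, E 1.0
missing_rate = df.isna().mean(axis=0)
# Columns whose mean would be set to NaN with limit=0.6
masked = missing_rate >= 0.6

print(missing_rate)
print(masked)
```

Because the comparison is `>=`, a column whose missing rate is exactly equal to `limit` is also treated as missing.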

posted @ 2022-05-13 16:35  气象小白