Python-groupby

1. 有行索引的df进行df.groupby(cols)

     df原本的行索引不会自动转为列保留,而是被直接替换成cols。

cols_3 = ['val1','val2','val3']
cols_2 = ['val1','val2']
df1 = pd.DataFrame({'val1':['杭州','大连','大连','大连'],
                    'val2':[1,2,2,2],
                    'val3':[2,3,4,5],
                    'val4':[3,4,5,6],
                    'i0':[3,4,3,2],
                    'i1':[3,4,5,5]}).set_index(cols_3)
display(df1)
df1_max = df1[['i0','i1']].reset_index(cols_2)   # df1_max的列为cols_2 + ['i0','i1'],行索引为'val3'
display(df1_max)
df1_max = df1_max.groupby(cols_2).max()   # 有行索引的df1_max进行groupby; df.max()默认是对df每个列求该列的最大值
display(df1_max)

 

2. df中新增一列, 求每个分组内的记录 在某指定列上的和、平均值(会把 0 也加上去算平均值)、记录个数、最大值、最小值

df = pd.DataFrame({'val1':['杭州','大连','大连','大连'],'val2':[1,2,3,3],'val3':[2,3,4,0],'val4':[3,4,5,6],'i0':[3,4,3,2]})
display(df)

df['val2_sum_grp'] = df.groupby(['val1'])['val2'].transform('sum')
# df = df.set_index(['val1'])
# df['val2_sum_grp'] = df['val2'].sum(level=['val1'])   # 分组内求和

df['val2_mean_grp'] = df.groupby(['val1'])['val2'].transform('mean')
# 等价于以下两句
# df = df.set_index(['val1'])
# df['val2_mean_grp'] = df['val2'].mean(level=['val1'])   # 分组内求平均值

df['val2_count_grp'] = df.groupby(['val1'])['val2'].transform('count')   # 分组内求记录个数

df['val2_max_grp'] = df.groupby(['val1'])['val2'].transform('max')   # 分组内求某一列的最大值
# 等价于以下两句
# df = df.set_index(['val1'])
# df['val2_max_grp'] = df['val2'].max(level=['val1'])

df['val2_min_grp'] = df.groupby(['val1'])['val2'].transform('min')   # 分组内求某一列的最小值
# 等价于以下两句
# df = df.set_index(['val1'])
# df['val2_min_grp'] = df['val2'].min(level=['val1'])

display(df)

 

3. df中不新增列、只留分组依据列+某一列,并将某一列替换为:每个分组内的记录在该列上的统计指标(平均值、和、记录数、最大值、最小值)

df = pd.DataFrame({'val1':['杭州','大连','大连','大连'],'val2':[1,2,3,3],'val3':[2,3,4,0],'val4':[3,4,5,6],'i0':[3,4,3,2]})
display(df)
df = df.groupby(['val1','val2'])[['val3']].mean()      
# df = df.groupby(['val1','val2'])[['val3']].sum()     
# df = df.groupby(['val1','val2'])[['val3']].count()   
# df = df.groupby(['val1','val2'])[['val3']].max()     
# df = df.groupby(['val1','val2'])[['val3']].min()     
display(df)

 

posted @ 2021-11-05 17:03  花与  阅读(272)  评论(0)    收藏  举报