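1 Preface

Once you reach group iteration, the basic pandas series is nearly complete. 知识追寻者 (the author) has used pandas to process some real data and found it quite handy.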

2 Grouping

2.1 Preparing the data

# -*- coding: utf-8 -*-

import pandas as pd
import numpy as np

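# 'hobby' is the grouping column used throughout. The sample output only shows
# the craler and rose rows, so the value 'reading' for zszxz is an assumption.
frame = pd.DataFrame({
    'user': ['zszxz', 'craler', 'rose', 'zszxz', 'rose'],
    'hobby': ['reading', 'running', 'hiking', 'reading', 'hiking'],
    'price': np.random.randn(5),   # unseeded, so the values change on every run
    'number': np.random.randn(5)
})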
print(frame)


     user    hobby     price    number
1  craler  running -1.410682  0.259869
2    rose   hiking -0.353269 -0.392659
4    rose   hiking -1.348315  2.492047
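Because np.random.randn is unseeded, the numbers in the outputs below differ from one block to the next. If you want repeatable output, a seeded generator is one option. This is a minimal sketch: the seed value 0 is arbitrary, and the 'reading' hobby for zszxz is an assumption rather than something the printed output confirms.

```python
import numpy as np
import pandas as pd

# Seeded generator: the same seed produces the same "random" numbers on every run.
rng = np.random.default_rng(0)  # 0 is an arbitrary choice

frame = pd.DataFrame({
    'user': ['zszxz', 'craler', 'rose', 'zszxz', 'rose'],
    'hobby': ['reading', 'running', 'hiking', 'reading', 'hiking'],  # 'reading' is assumed
    'price': rng.standard_normal(5),
    'number': rng.standard_normal(5),
})
print(frame)
```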


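2.2 Group means

# groupby returns a lazy SeriesGroupBy object (not a generator); nothing is computed yet
group = frame['price'].groupby(frame['hobby'])
# compute the mean of each group
print(group.mean())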


hobby
hiking    -0.850792
running   -1.410682
Name: price, dtype: float64


Tip: Read this as "group by hobby, then query price". The queried column must be numeric, otherwise computing the mean raises an exception.
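To see the failure the tip describes, you can try averaging a text column. The sketch below uses a small hand-written frame (the values are made up for illustration). The exact exception class depends on the pandas version, so it catches Exception broadly instead of naming one.

```python
import pandas as pd

# Small illustrative frame; the values are made up.
frame = pd.DataFrame({
    'user': ['zszxz', 'craler', 'rose'],
    'hobby': ['reading', 'running', 'hiking'],
    'price': [1.0, 2.0, 3.0],
})

# 'user' holds strings, so a mean cannot be computed for it.
try:
    frame['user'].groupby(frame['hobby']).mean()
    failed = False
except Exception as exc:  # the exception class varies across pandas versions
    failed = True
    print(type(exc).__name__)
```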

group = frame['price'].groupby([frame['hobby'],frame['user']])
print(group.mean())


hobby    user
hiking   rose      0.063972
running  craler   -1.395186
Name: price, dtype: float64


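group = frame.groupby(frame['hobby'])
# numeric_only=True skips the text column 'user'; pandas 2.0+ raises a TypeError without it
print(group.mean(numeric_only=True))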


            price    number
hobby
hiking  -0.116659 -0.316222
running -0.282676 -0.585124


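Tip: Older pandas versions silently skipped non-numeric columns when computing group means. Since pandas 2.0, mean() raises a TypeError on them unless numeric_only=True is passed.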
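There are two common ways to keep text columns out of the aggregation: pass numeric_only=True, or select the numeric columns before aggregating. A small sketch with made-up values shows that both give the same result:

```python
import pandas as pd

# Illustrative data; the values are made up.
frame = pd.DataFrame({
    'user': ['craler', 'rose', 'rose'],
    'hobby': ['running', 'hiking', 'hiking'],
    'price': [1.0, 2.0, 4.0],
    'number': [10.0, 20.0, 40.0],
})

# Option 1: ask pandas to skip non-numeric columns.
by_flag = frame.groupby('hobby').mean(numeric_only=True)

# Option 2: pick the numeric columns explicitly, then aggregate.
by_selection = frame.groupby('hobby')[['price', 'number']].mean()

print(by_flag)
```

Explicit selection is the more predictable choice when a frame has many columns, since it states exactly what gets averaged.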

2.3 Group sizes

group = frame.groupby(frame['hobby'])
print(group.size())


hobby
hiking     2
running    1
dtype: int64
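size() is easy to confuse with count(). size() counts every row in a group, while count() counts only non-missing values. A short sketch with made-up data that contains one NaN:

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing price.
frame = pd.DataFrame({
    'hobby': ['hiking', 'hiking', 'running'],
    'price': [1.5, np.nan, 2.0],
})

grouped = frame.groupby('hobby')
sizes = grouped.size()              # rows per group, NaN rows included
counts = grouped['price'].count()   # non-missing prices per group

print(sizes)
print(counts)
```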


2.4 Iterating over groups

group = frame['price'].groupby(frame['hobby'])
# each iteration yields the group key and that group's data
for key, data in group:
    print(key)
    print(data)


hiking
2   -0.669410
4   -0.246816
Name: price, dtype: float64
reading
0    1.362191
3   -0.052538
Name: price, dtype: float64
running
1    0.8963
Name: price, dtype: float64


group = frame['price'].groupby([frame['hobby'],frame['user']])
# with two grouping keys, each group key is a tuple
for (key1, key2), data in group:
    print(key1, key2)
    print(data)


hiking rose
2   -0.019423
4   -2.642912
Name: price, dtype: float64
reading zszxz
0    0.405016
3    0.422182
Name: price, dtype: float64
running craler
1   -0.724752
Name: price, dtype: float64
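If you only need one group, you do not have to loop over all of them. A minimal sketch using get_group, with made-up values:

```python
import pandas as pd

# Illustrative data; the values are made up.
frame = pd.DataFrame({
    'user': ['craler', 'rose', 'rose'],
    'hobby': ['running', 'hiking', 'hiking'],
    'price': [1.0, 2.0, 4.0],
})

grouped = frame.groupby('hobby')

# .groups maps each group key to the row labels that belong to it.
print(sorted(grouped.groups))

# Fetch a single group directly by its key.
hiking = grouped.get_group('hiking')
print(hiking)
```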


2.5 Converting groups to a dictionary

dic = dict(list(frame.groupby(frame['hobby'])))
print(dic)


{'hiking':    user   hobby     price    number
2  rose  hiking  0.351633  0.523272
4  rose  hiking  0.800039  0.331646,
'running':      user    hobby     price    number
1  craler  running -2.525633  0.895776}


print(dic['hiking'])


   user   hobby     price    number
2  rose  hiking  0.382225 -0.242055
4  rose  hiking  1.055785 -0.328943
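dict(list(...)) works because iterating a groupby yields (key, data) pairs. A dict comprehension says the same thing more explicitly and lets you transform each group on the way. The sketch below uses made-up values:

```python
import pandas as pd

# Illustrative data; the values are made up.
frame = pd.DataFrame({
    'user': ['craler', 'rose', 'rose'],
    'hobby': ['running', 'hiking', 'hiking'],
    'price': [1.0, 2.0, 4.0],
})

# One dictionary entry per group: key -> sub-DataFrame.
groups = {key: data for key, data in frame.groupby('hobby')}

# Same idea, but keep only the price column of each group.
prices = {key: data['price'].tolist() for key, data in frame.groupby('hobby')}
print(prices)
```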


2.6 Selecting columns after grouping

mean = frame.groupby('hobby')['price'].mean()
print(type(mean))
print(mean)


<class 'pandas.core.series.Series'>
hobby
hiking     0.973211
running   -0.286236
Name: price, dtype: float64


Tip: frame.groupby('hobby')['price'] is equivalent to frame['price'].groupby(frame['hobby']).
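You can check this equivalence yourself. The sketch below, with made-up values, computes the mean both ways and compares the results:

```python
import pandas as pd

# Illustrative data; the values are made up.
frame = pd.DataFrame({
    'hobby': ['running', 'hiking', 'hiking'],
    'price': [1.0, 2.0, 4.0],
})

a = frame.groupby('hobby')['price'].mean()
b = frame['price'].groupby(frame['hobby']).mean()

# Raises an AssertionError if the two Series differ in values, index, or name.
pd.testing.assert_series_equal(a, b)
print(a)
```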

mean = frame.groupby('hobby')[['price']].mean()
print(type(mean))
print(mean)


<class 'pandas.core.frame.DataFrame'>
            price
hobby
hiking   0.973211
running -0.286236


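2.7 Grouping by a Series

# The Series is aligned with frame by index, so rows 0-2 get a label and rows 3-4 are dropped
ser = pd.Series(['hiking','reading','running'])
# numeric_only=True skips the text columns; pandas 2.0+ raises a TypeError without it
data = frame.groupby(ser).mean(numeric_only=True)
print(data)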


            price    number
hiking   1.233396  0.313839
running -0.797734 -1.230811


Tip: All of these are ultimately arrays of labels. Besides a Series, you can use a dict, list, array, or function as the group key.
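Here is a rough sketch of three of those key types on a small frame. The index labels, the mapping, and the values are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative frame with string index labels.
frame = pd.DataFrame(
    {'price': [1.0, 2.0, 3.0, 4.0]},
    index=['apple', 'avocado', 'banana', 'blueberry'],
)

# Array: one label per row, matched by position.
by_array = frame.groupby(np.array(['x', 'y', 'x', 'y'])).sum()

# Dict: maps each index label to a group name.
mapping = {'apple': 'sweet', 'avocado': 'savory', 'banana': 'sweet', 'blueberry': 'sweet'}
by_dict = frame.groupby(mapping).sum()

# Function: called once per index label; here the first letter is the key.
by_func = frame.groupby(lambda label: label[0]).sum()

print(by_array)
print(by_dict)
print(by_func)
```

A dict or function is applied to the index labels, while an array is matched to the rows by position.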

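2.8 Grouping by index level

# Build a two-level column index and name the levels
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['a', 'b', 'a', 'b', 'c']], names=['language', 'alpha'])
frame = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(frame)

# Group the columns by the 'language' level.
# groupby(..., axis=1) is deprecated since pandas 2.1, so transpose, group, and transpose back.
print(frame.T.groupby(level='language').sum().T)
# Group the columns by the 'alpha' level
print(frame.T.groupby(level='alpha').sum().T)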


The frame prints as follows:

language Python Java Python Java Python
alpha         a    b      a    b      c
0             9    9      7    4      5
1             3    4      7    6      6
2             6    6      3    9      1
3             1    1      8    5      2
4             6    5      9    5      4


Grouped by language:

language  Java  Python
0           13      21
1           10      16
2           15      10
3            6      11
4           10      19


Grouped by alpha:

alpha   a   b  c
0      16  13  5
1      10  10  6
2       9  15  1
3       9   6  2
4      15  10  4
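The same level argument works when the MultiIndex sits on the rows, which is the more common layout. A minimal sketch with made-up scores; the name scores is only for illustration:

```python
import pandas as pd

# Two-level row index with named levels.
index = pd.MultiIndex.from_arrays(
    [['Python', 'Java', 'Python', 'Java'], ['a', 'b', 'a', 'b']],
    names=['language', 'alpha'],
)
scores = pd.DataFrame({'score': [1, 2, 3, 4]}, index=index)

# A level can be referenced by name or by position.
by_language = scores.groupby(level='language').sum()
by_position = scores.groupby(level=0).sum()

print(by_language)
```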

Posted 2020-05-07 14:54 by 知识追寻者