Python Pandas Tutorial

1. Reading Data

2. Slicing

`pandas.IndexSlice`

pandas.IndexSlice is used to slice a MultiIndex. It is an object that is indexed with square brackets rather than called like a function.

import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['A0','A1'], ['B0','B1','B2','B3']])
cols = ['foo', 'bar']
df = pd.DataFrame(np.arange(16).reshape((len(midx), len(cols))),
                  index = midx, columns = cols)
#        foo  bar
# A0 B0    0    1
#    B1    2    3
#    B2    4    5
#    B3    6    7
# A1 B0    8    9
#    B1   10   11
#    B2   12   13
#    B3   14   15

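# Using slice objects: select rows in A0:A1 on the first level and B0:B2 on the second
df.loc[(slice('A1'), slice('B2')), :]    # slice('A1') means "from the start up to and including A1"
df.loc[(slice('A0', 'A1'), slice('B0', 'B2')), :]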
#        foo  bar
# A0 B0    0    1
#    B1    2    3
#    B2    4    5
# A1 B0    8    9
#    B1   10   11
#    B2   12   13

# Select the row (A0, B2)
df.loc[('A0', 'B2'), :]
# foo    4
# bar    5

# Using pandas.IndexSlice: select all rows whose second level is B2
df.loc[pd.IndexSlice[:, 'B2'], :]
df.loc[pd.IndexSlice[:, ['B2']], :]
#        foo  bar
# A0 B2    4    5
# A1 B2   12   13
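As a further illustration, pandas.IndexSlice also accepts label ranges on any level. The following is a minimal sketch that rebuilds the DataFrame from above; the alias `idx` and the variable `subset` are names chosen here for illustration only.

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1", "B2", "B3"]])
df = pd.DataFrame(np.arange(16).reshape(8, 2), index=midx, columns=["foo", "bar"])

idx = pd.IndexSlice  # short alias, purely a naming convenience

# Any first-level label, second-level labels B1 through B2 (both ends included),
# restricted to the "foo" column.
subset = df.loc[idx[:, "B1":"B2"], "foo"]
print(subset)
```

Label-range slicing of this kind relies on the MultiIndex being sorted, which is the case for an index built with from_product from sorted inputs.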

3. Index Operations

pandas.Index object: returned by the pandas.DataFrame.index or pandas.DataFrame.columns attribute

pandas.Index.name: the name attribute of the Index
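A small sketch of these attributes, using a made-up DataFrame (the labels "x", "y" and the name "key" are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["x", "y"], name="key"))

row_index = df.index    # pandas.Index holding the row labels
col_index = df.columns  # pandas.Index holding the column labels

print(row_index.name)   # the name given to the row index
print(col_index.name)   # None, because the columns axis was never named
```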

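3.1. Slicing and Selection

(1) `DataFrame.reindex()` and `DataFrame.reindex_like()`: reselecting labels

DataFrame.reindex(): returns a new DataFrame conformed to the given labels (the result is still a DataFrame even when only a single row or column is selected). Labels that are not in the original index appear in the result with their data marked as missing.

It can be called in two ways:

(index=index_labels, columns=column_labels, ...)
(labels, axis={'index', 'columns'}, ...)

Main parameters:

method (str) {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}: how to fill the missing values introduced by reindexing; only applicable when the index is monotonically increasing or decreasing
fill_value: the value used to fill missing values
copy (bool): whether to return a new object even when the new labels match the existing ones

DataFrame.reindex_like() works similarly: the other parameter specifies another DataFrame whose index and columns are used for the reindex.

Example: DataFrame.reindex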

df = pd.DataFrame({'http_status'  : [200, 200, 404, 404, 301],
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
                   index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror'])
new_ixs = ['Safari', 'Iceweasel']
new_cols = ['http_status']

# Select by index
df.reindex(new_ixs)
df.reindex(index = new_ixs)
df.reindex(labels = new_ixs, axis="index")

# Select by column
df.reindex(columns = new_cols)
df.reindex(labels = new_cols, axis = "columns")

# Select by index and column
df.reindex(index = new_ixs, columns = new_cols)
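The fill_value parameter and reindex_like() can be sketched as follows. This is an illustrative example only: the shortened DataFrame, the `template` frame, and the label "Opera" are assumptions introduced here.

```python
import pandas as pd

df = pd.DataFrame({"http_status": [200, 200, 404],
                   "response_time": [0.04, 0.02, 0.07]},
                  index=["Firefox", "Chrome", "Safari"])

# "Iceweasel" is not in the original index, so its row is filled with fill_value.
filled = df.reindex(["Safari", "Iceweasel"], fill_value=0)

# reindex_like() borrows both the index and the columns of another DataFrame.
template = pd.DataFrame(index=["Chrome", "Opera"], columns=["http_status"])
like = df.reindex_like(template)

print(filled)
print(like)
```

Without fill_value, the "Opera" row in `like` is left as a missing value.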

3.2. Changing the Index name

(1) `Index.rename()` and `Index.set_names()`: changing the name of an Index

Index.rename(name, inplace=False): changes the name of an Index or MultiIndex

Main parameters: name: (label or list of labels)

Index.set_names(names, level=None, inplace=False): changes the name of an Index or MultiIndex

Main parameters: names: (label or list of label or dict-like for MultiIndex)

Main difference: .rename() has no level parameter, so the names of all levels of a MultiIndex must be changed together. .set_names() can change all names, or only the name of one level.
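The difference can be seen in a short sketch (the level names "a", "b", "first", and "second" are placeholders chosen for this example):

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"]], names=["a", "b"])

# rename() has no level argument: a name must be supplied for every level.
renamed = midx.rename(["first", "second"])

# set_names() can target one level and leave the others untouched.
partial = midx.set_names("second", level=1)

print(list(renamed.names))
print(list(partial.names))
```

Both calls return a new index by default; the original `midx` keeps its names unless inplace=True is passed.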

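3.3. Modifying the Index

(1) `DataFrame.reset_index()`: dropping (resetting) the index

DataFrame.reset_index(): resets the index to the default integer index (0, 1, 2, ...), either discarding the old index or moving it into columns

Main parameters:

level (int, str, tuple, or list): the index levels to remove; used with a MultiIndex
drop (bool): True discards the index; False saves the index into columns.
col_level (int or str): only relevant when drop = False. If the columns are a MultiIndex, col_level specifies the column level into which the index labels are inserted.
col_fill (object): if the columns are a MultiIndex, col_fill specifies the label used on the other column levels for the inserted index columns.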
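A minimal sketch of the drop and level parameters, using small made-up frames (all labels and names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]},
                  index=pd.Index(["x", "y", "z"], name="key"))

kept = df.reset_index()              # the old index becomes the "key" column
dropped = df.reset_index(drop=True)  # the old index is discarded

# With a MultiIndex, level selects which levels are moved out of the index.
midx_df = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.MultiIndex.from_tuples([("A0", "B0"), ("A0", "B1")], names=["a", "b"]))
only_b = midx_df.reset_index(level="b")  # level "a" stays in the index

print(kept)
print(dropped)
print(only_b)
```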

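(2) `DataFrame.set_index()`: setting the index

DataFrame.set_index(): sets the index from existing columns or from arrays

Main parameters:

keys (label or array-like or list of labels/arrays):
- label or list of labels: set the index from the columns with these keys
- array-like: a Series, Index, np.ndarray, or Iterator to use as the index
drop (bool): whether to delete the columns that are used as the new index
append (bool): whether to append the new index levels to the existing index instead of replacing it
verify_integrity (bool): whether to check the new index for duplicates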
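The parameters above can be tried out with a short sketch. The column names and the `dup` frame are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "city": ["X", "Y", "Z"], "value": [10, 20, 30]})

by_id = df.set_index("id")                   # "id" is moved into the index
kept = df.set_index("id", drop=False)        # "id" also stays as a column
both = by_id.set_index("city", append=True)  # MultiIndex with levels (id, city)

# verify_integrity=True raises ValueError when the new index has duplicates.
dup = pd.DataFrame({"id": [1, 1], "value": [10, 20]})
rejected = False
try:
    dup.set_index("id", verify_integrity=True)
except ValueError:
    rejected = True

print(list(both.index.names), rejected)
```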

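(3) `DataFrame.rename()`, `DataFrame.rename_axis()`, `DataFrame.set_axis()`

DataFrame.rename() changes index or column labels, while DataFrame.rename_axis() changes the name of the index or columns axis. DataFrame.rename() also has two extra parameters:

level: for a MultiIndex, only rename labels in the specified level
errors ({'ignore', 'raise'}, default 'ignore'): how to handle labels in the mapper that are not present in the index (or columns)

Both functions can be called in two ways:

# Modify the index
(index = index_mapper, ...)
(mapper, axis = 'index', ...)
(mapper, axis = 0, ...)

# Modify the columns
(columns = columns_mapper, ...)
(mapper, axis = 'columns', ...)
(mapper, axis = 1, ...)

(index = index_mapper, columns = columns_mapper, ...)
(mapper, axis = {'index', 'columns'}, ...)

DataFrame.set_axis() differs from the previous two methods in that it can only replace all labels of an axis (given by the labels parameter); it cannot change only some of the labels through a mapper.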
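The three methods can be compared side by side in a small sketch (the DataFrame and all new labels are invented for this example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# rename(): change selected labels through a mapper.
relabeled = df.rename(index={"x": "row_x"}, columns={"a": "A"})

# rename_axis(): change the name of the axis itself; the labels stay the same.
named = df.rename_axis("key", axis="index")

# set_axis(): replace every label on the chosen axis.
replaced = df.set_axis(["c1", "c2"], axis="columns")

print(relabeled)
print(named)
print(replaced)
```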

(4) `Series.rename()`

4. MultiIndex

5. Map, Apply, Groupby, Aggregate, Transform

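(1) Map and Apply Map

.map(arg, na_action=None): a pandas.Series method that operates on each element

.applymap(func, na_action=None): a pandas.DataFrame method that operates on each element (deprecated since pandas 2.1 in favor of DataFrame.map())

Main parameters:

arg / func: the mapping function; Series.map() also accepts a dict (or Series) that defines the mapping
na_action: {None, 'ignore'}, how to handle NA values; 'ignore' propagates NA values without passing them to the function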
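A short sketch of element-wise mapping follows. Because .applymap() is deprecated in recent pandas releases, the example picks DataFrame.map() when it is available and falls back to .applymap() otherwise; that fallback and the sample data are assumptions of this sketch.

```python
import pandas as pd

s = pd.Series(["cat", "dog", None])

# A dict maps values by lookup; values without a matching key become missing.
mapped = s.map({"cat": "kitten", "dog": "puppy"})

# na_action="ignore" skips missing values instead of passing them to the function.
upper = s.map(lambda v: v.upper(), na_action="ignore")

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
# Use DataFrame.map where it exists, otherwise the older applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
squared = elementwise(lambda v: v ** 2)

print(mapped.tolist(), upper.tolist())
print(squared)
```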

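(2) Apply

.apply(): covers pandas.Series.apply() and pandas.DataFrame.apply().

Main parameters of pandas.DataFrame.apply():

func :
- the function to apply; each column (or row) is passed to it as a pandas.Series
axis : {0 or 'index', 1 or 'columns'}, default 0
- 0 applies the function to each column, 1 applies it to each row
raw : bool, default False
- False : pass each row (or column) as a pandas.Series
- True : pass each row (or column) as a numpy.ndarray, which is faster
result_type : {'expand', 'reduce', 'broadcast', None}, default None
- how to handle results with multiple values; only takes effect when axis=1 (row-wise)
- 'expand' : list-like results are expanded into columns
- 'reduce' : return a pandas.Series if possible instead of expanding list-like results
- 'broadcast' : results are broadcast to the original shape of the DataFrame, keeping the original index and columns
- None : depends on the return value: list-like results are kept as a Series of lists; if the function returns a pandas.Series, it is expanded into columns.

Example: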

df.apply(lambda x: [1, 2], axis=1)
df.apply(lambda x: [1, 2], axis=1, result_type='expand')
df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
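To make the effect of result_type concrete, here is a self-contained sketch on a small made-up DataFrame; the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Default (None): each list is kept as one element of a Series.
as_lists = df.apply(lambda row: [row["A"], row["B"] * 2], axis=1)

# 'expand': each list is spread over new columns labeled 0, 1, ...
expanded = df.apply(lambda row: [row["A"], row["B"] * 2], axis=1, result_type="expand")

# 'broadcast': the result keeps the original columns and index.
broadcast = df.apply(lambda row: [1, 2], axis=1, result_type="broadcast")

# Returning a Series expands to columns named after the Series index.
labeled = df.apply(lambda row: pd.Series({"foo": row["A"], "bar": row["B"] * 2}), axis=1)

print(type(as_lists).__name__, expanded.shape, list(labeled.columns))
```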

(3) Group By

pandas.DataFrame.groupby(): returns a DataFrameGroupBy object. Columns can also be selected from a DataFrameGroupBy object with group['col_name'].

The groups can be iterated over with a for loop

for group_name, group in df.groupby("id"):
    # group_name is the group key
    # group is a pandas.DataFrame
    print(group_name, group)

for group_name, group in df.groupby("id")["id"]:
    print(group_name, group)
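A runnable version of the same idea, on a small invented DataFrame, might look like this:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b", "a"], "value": [1, 2, 3]})
grouped = df.groupby("id")

# Iterating yields (group key, sub-DataFrame) pairs.
sizes = {name: len(group) for name, group in grouped}

# Selecting a column first yields a SeriesGroupBy, which can be aggregated directly.
value_sums = grouped["value"].sum()

print(sizes)
print(value_sums)
```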

(4) Transform

pandas.DataFrame.transform() calls a function (func parameter) along the specified axis (axis parameter); the result has the same length along that axis as the original DataFrame.

Example: .transform() and .applymap()

df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
df_1 = df.transform(lambda x: x + 1)
df_2 = df.applymap(lambda x: x + 1)
# In this example, .transform() and .applymap() produce the same result
print((df_1 - df_2).abs().sum().sum())

Example: combining .groupby() and .transform()

df = pd.DataFrame({"type"  : ["m", "n", "o", "m", "m", "n", "n"],
                   "value" : [1, 2, 3, 4, 5, 6, 7] })
df = df.sort_values("type")

df['sum'] = df.groupby('type')["value"].transform("sum")   # per-group sum
df['cnt'] = df.groupby('type')["type"].transform("count")  # per-group count
#   type  value  sum  cnt
# 0    m      1   10    3
# 3    m      4   10    3
# 4    m      5   10    3
# 1    n      2   15    3
# 5    n      6   15    3
# 6    n      7   15    3
# 2    o      3    3    1

(5) Aggregate

Example: custom aggregation

# data_df.columns = ['id', 'weight', 'value']
data_df1 = data_df.groupby(['id'], as_index = False).agg(
    sum_weight = ("weight", "sum"),
    ave_value  = ("value",  "mean"))
# Aggregate the "weight" column by sum to get the "sum_weight" column
# Aggregate the "value" column by mean to get the "ave_value" column
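Since data_df is not defined in the snippet above, the following sketch supplies a small made-up data_df so the named aggregation can be run end to end:

```python
import pandas as pd

# Hypothetical sample data with the columns named in the comment above.
data_df = pd.DataFrame({"id": [1, 1, 2],
                        "weight": [1.0, 2.0, 3.0],
                        "value": [10.0, 20.0, 30.0]})

# Each keyword is an output column: (input column, aggregation function).
data_df1 = data_df.groupby("id", as_index=False).agg(
    sum_weight=("weight", "sum"),
    ave_value=("value", "mean"))

print(data_df1)
```

With as_index=False, "id" stays as an ordinary column instead of becoming the index of the result.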

Example: grouped weighted average

# data_df.columns = ["id", "weight", "value"]
data_df1 = data_df.groupby(["id"]).agg(
    weighted_value = ("value", lambda x : np.average(x, weights = data_df.loc[x.index, "weight"])))
# "weighted_value" is the name of the new column
# "value" is the column being aggregated; each group's values are passed to the function that follows

# Equivalent to
def weighted_average(data_df):
    # assign() returns a new DataFrame, so the caller's data is not modified
    data_df = data_df.assign(weighted_value = data_df['weight'] * data_df['value'])
    data_df = data_df.groupby("id").sum()
    data_df['weighted_value'] = data_df['weighted_value'] / data_df['weight']
    data_df.drop(["weight", "value"], axis=1, inplace=True)
    return data_df
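As a sanity check, the two approaches can be compared on a small made-up data set. The sample values and the helper names `via_agg`, `via_sum`, and `wv` are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

data_df = pd.DataFrame({"id": ["a", "a", "b"],
                        "weight": [1.0, 3.0, 2.0],
                        "value": [10.0, 20.0, 5.0]})

# Approach 1: named aggregation with a lambda that looks up the weights by index.
via_agg = data_df.groupby("id").agg(
    weighted_value=("value",
                    lambda x: np.average(x, weights=data_df.loc[x.index, "weight"])))

# Approach 2: sum(weight * value) / sum(weight) per group.
sums = (data_df.assign(wv=data_df["weight"] * data_df["value"])
               .groupby("id")[["wv", "weight"]].sum())
via_sum = (sums["wv"] / sums["weight"]).rename("weighted_value")

print(via_agg["weighted_value"].tolist(), via_sum.tolist())
```

The lambda approach depends on data_df having a unique index, because the weights are looked up through x.index.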

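Iteration

pandas.DataFrame.iterrows(): iterate row by row, yielding (index, Series) pairs

pandas.DataFrame.itertuples(index=True, name='Pandas'): iterate row by row, yielding namedtuples

pandas.DataFrame.items(): iterate column by column, yielding (label, Series) pairs

Example:

# Iterate over rows
for index, content in df.iterrows():
    index, content
    # index is the row's index label
    # content is one row, as a pandas.Series

# Iterate over rows
for content in df.itertuples():
    content
    # content is one row, as a tuple-like object (namedtuple)
    # getattr(content, 'Index') returns the index label
    # getattr(content, 'column_name') returns the value of one element

# Iterate over columns
for label, content in df.items():
    label, content
    # label is the column name
    # content is one column, as a pandas.Series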
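The three iterators can be exercised on a concrete, made-up DataFrame as a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# iterrows(): each row arrives as a Series, so dtypes may be upcast across columns.
row_sums = {index: int(row.sum()) for index, row in df.iterrows()}

# itertuples(): each row arrives as a namedtuple with an Index field.
tuples = [(row.Index, row.a, row.b) for row in df.itertuples()]

# items(): each column arrives as a (label, Series) pair.
col_sums = {label: int(col.sum()) for label, col in df.items()}

print(row_sums, tuples, col_sums)
```

For large frames, itertuples() is generally the faster of the two row iterators, and vectorized operations are usually preferable to either.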

Wide Table and Long Table

pandas.DataFrame.pivot(index, columns, values) converts a Long Table to a Wide Table

pandas.DataFrame.melt() converts a Wide Table to a Long Table

Example: converting between Wide Table and Long Table

df_1 = pd.DataFrame({'t': {0: 'a', 1: 'b', 2: 'c'},
                     'A': {0: 1, 1: 4, 2: 7},
                     'B': {0: 2, 1: 5, 2: 8},
                     'C': {0: 3, 1: 6, 2: 9}})
# Original data, df_1, Wide Table
#    t  A  B  C
# 0  a  1  2  3
# 1  b  4  5  6
# 2  c  7  8  9

df_2 = df_1.melt(id_vars=['t'], value_vars=['A', 'B', 'C'],
                 var_name='type', value_name='count')
# .melt() converts the Wide Table to a Long Table
#    t type  count
# 0  a    A      1
# 1  b    A      4
# 2  c    A      7
# 3  a    B      2
# 4  b    B      5
# 5  c    B      8
# 6  a    C      3
# 7  b    C      6
# 8  c    C      9

df_3 = df_2.pivot(index='t', columns='type', values='count')
df_3 = df_3.reset_index()
# .pivot() converts the Long Table to a Wide Table
# type  t  A  B  C
# 0     a  1  2  3
# 1     b  4  5  6
# 2     c  7  8  9

pandas.DataFrame.pivot_table() reshapes like .pivot() and additionally aggregates duplicate entries

pandas.wide_to_long() is used in a similar way to .melt()
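The difference between .pivot() and .pivot_table() shows up when the long table has duplicate (index, columns) pairs. A minimal sketch with invented data:

```python
import pandas as pd

# The pair (t="a", type="A") appears twice.
long_df = pd.DataFrame({"t": ["a", "a", "b"],
                        "type": ["A", "A", "B"],
                        "count": [1, 3, 5]})

# pivot() cannot place two values in one cell and raises ValueError.
pivot_failed = False
try:
    long_df.pivot(index="t", columns="type", values="count")
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates, here by summing them.
wide = long_df.pivot_table(index="t", columns="type", values="count", aggfunc="sum")

print(pivot_failed)
print(wide)
```

Cells with no matching rows, such as (t="a", type="B"), are left as missing values unless fill_value is given.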

posted @ 2022-06-29 23:39 veager
