Title

Pandas 基本用法快速入门

Pandas 基本用法快速入门

文章基于Labex-Pandas 实验整理

Pandas创建DataFrame

从字典创建DataFrame

import pandas as pd

student_data = {
    'Name' : ['Alice', 'Bob', 'Charlie'],
    'Score' : [85, 92, 78]
}

df = pd.DataFrame(student_data)
print(df)
      Name  Score
0    Alice     85
1      Bob     92
2  Charlie     78

在DataFrame中指定列名

指定列名相当于指定了列之间的顺序

import pandas as pd

student_data = {
    'Name' : ['Alice', 'Bob', 'Charlie'],
    'Score' : [85, 92, 78]
}

df = pd.DataFrame(student_data, columns = ['Score', 'Name'])
print(df)
   Score     Name
0     85    Alice
1     92      Bob
2     78  Charlie

通过 index 参数添加索引标签

import pandas as pd

student_data = {
    'Name' : ['Alice', 'Bob', 'Charlie'],
    'Score' : [85, 92, 78]
}

index_labels = ['ID1', 'ID2', 'ID3']

df = pd.DataFrame(student_data, index = index_labels)
print(df)
        Name  Score
ID1    Alice     85
ID2      Bob     92
ID3  Charlie     78

使用点符号访问 DataFrame 列

import pandas as pd

student_data = {
    'Name' : ['Alice', 'Bob', 'Charlie'],
    'Score' : [85, 92, 78]
}

index_labels = ['ID1', 'ID2', 'ID3']

df = pd.DataFrame(student_data)
print(df.Name)

输出是一个 Pandas Series,它本质上是 DataFrame 的单列

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

使用 info 方法显示 DataFrame 信息

import pandas as pd

student_data = {
    'Name' : ['Alice', 'Bob', 'Charlie'],
    'Score' : [85, 92, 78]
}

index_labels = ['ID1', 'ID2', 'ID3']

df = pd.DataFrame(student_data)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    3 non-null      object
 1   Score   3 non-null      int64
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes

Pandas读取外部数据

使用read_csv()读取csv文件

pandas.read_csv() 的参数 comment 用来指定“注释起始字符”。作用如下:

  1. 指定一个单字符(例如 '#')。当解析到该字符时,该字符及其后面的内容都会被忽略。
  2. 如果这一字符出现在一行的开头,则整行都会被跳过(常用于跳过文件里的说明/元数据行)。
  3. 如果出现在行中间(且不在引号内),则从该字符起到行尾的内容被当作注释丢弃,相当于“行内注释”。
  4. 在引号包裹的字段内出现的注释字符会被视为普通文本,不会触发注释。
  5. 只能是单个字符;不能使用多字符注释(例如不能直接用 '//')。
  6. 注意:注释判定是从行首匹配该字符;如果行首有空格再跟注释符,通常不会被当作“以注释开头的行”而跳过。

data.csv

# This is a comment line
# Another comment line
id,name,age,city
1,Alice,25,New York
2,Bob,30,Los Angeles
3,Charlie,Not Available,London
4,David,35,N/A
5,Eve,22,Paris
import pandas as pd

df = pd.read_csv('data.csv', comment = '#')
print(df)
   id     name            age         city
0   1    Alice             25     New York
1   2      Bob             30  Los Angeles
2   3  Charlie  Not Available       London
3   4    David             35          NaN
4   5      Eve             22        Paris

在read_csv中指定表头行

默认情况下,read_csv假定文件中的第一行(非注释行,非跳过行)是包含列名的表头行。
header参数允许你显示地指定哪一行作为表头,它接受一个整数,表示行索引(从0开始)。由于第一行数据(在注释之后)是我们地表头,其索引是0

import pandas as pd

df = pd.read_csv('data.csv', comment='#', header = 0)
print(df)

使用 na_values 参数处理缺失值

Pandas 会自动识别一些常见的缺失值指示符,如空字符串、NAN/A,但不会识别自定义的如 "Not Available"。我们可以使用 na_values 参数提供一个字符串列表,这些字符串将被解释为 NaN(Not a Number),这是 Pandas 用于缺失数据的标准标记。

import pandas as pd

missing_value = ["Not Available", "N/A"]

df = pd.read_csv('data.csv', comment = '#', na_values = missing_value)
print(df)
   id     name   age         city
0   1    Alice  25.0     New York
1   2      Bob  30.0  Los Angeles
2   3  Charlie   NaN       London
3   4    David  35.0          NaN
4   5      Eve  22.0        Paris

使用.head()方法显示前几行

.head() 方法是一种方便的方式来快速预览你的数据。默认情况下,它返回前 5 行。.head(x)返回前x

import pandas as pd

missing_value = ["Not Available", "N/A"]

df = pd.read_csv('data.csv', comment = '#', na_values = missing_value)
print(df.head(3))
   id     name   age         city
0   1    Alice  25.0     New York
1   2      Bob  30.0  Los Angeles
2   3  Charlie   NaN       London

使用shape属性检查DataFrame的形状

DataFrame 有一个 .shape 属性,它返回一个包含行数和列数的元组。

请注意,.shape 是一个属性,而不是一个方法,因此在访问它时不需要使用括号 ()

import pandas as pd

missing_value = ["Not Available", "N/A"]

df = pd.read_csv('data.csv', comment = '#', na_values = missing_value)
print(df.shape)
(5, 4)

Pandas 数据选择

数据
students.csv

name,age,major,score
Alice,20,Physics,85
Bob,22,Computer Science,92
Charlie,21,Mathematics,95
David,23,Engineering,88
Eve,20,Biology,90

使用方括号表示选择单列

选择单列会返回一个Series,它本质上是一个一维的带标签数组

import pandas as pd

df = pd.read_csv('students.csv')
name_column = df['name']
print(name_column)

输出

0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: name, dtype: object

通过列表选择多列

与选择单列返回Series不同,选择多列会返回一个新的DataFrame

import pandas as pd
df = pd.read_csv('students.csv')
subset = df[['name', 'score']]
print(subset)

输出

      name  score
0    Alice     85
1      Bob     92
2  Charlie     95
3    David     88
4      Eve     90

使用loc标签选择行

.loc访问器根据标签选择数据。.loc索引器主要基于标签,这意味着你使用索引 名称 (或标签) 来进行选择。默认情况下,当你加载CSV文件而不指定索引列时,Pandas会分配一个从0开始的默认整数索引。这些整数充当行的标签

import pandas as pd
df = pd.read_csv('students.csv')
charlie_data = df.loc[2]
print(charlie_data)

输出

name        Charlie
age              21
major     Mathematics
score            95
Name: 2, dtype: object

使用iloc按整数位置选择行

.iloc索引器主要基于整数位置。它的工作方式与Python列表切片类似,你使用整数索引来访问元素。这与使用标签.loc不同。虽然我们的默认标签索引也是整数,但当你拥有非整数标签时,这种区别就变得至关重要

import pandas as pd

df = pd.read_csv('students.csv')
first_row = df.iloc[0]
print(first_row)

使用loc切片行和列

.loc.iloc都可以使用df.loc[row_selector, column_selector]语法同时选择行和列

.loc的一个关键特性是,当你使用标签进行切片时(例如1:3),结束标签(3)是包含在内

import pandas as pd
df = pd.read_csv('students.csv')
data_slice = df.loc[1:3, 'name' : 'major']
print(data_slice)

输出

      name  age           major
1      Bob   22  Computer Science
2  Charlie   21       Mathematics
3    David   23       Engineering

使用布尔条件过滤

创建一个布尔条件,该条件会返回一个包含 TrueFalse 值的 Pandas Series。当你将此 Series 传递给 DataFrame 时,它只会返回值为 True 的那些行。

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
    'Age': [25, 45, 38, 52, 29, 33],
    'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}

df = pd.DataFrame(data)

older_than_30 = df[df['Age'] > 30]
print("Employees older than 30:")
print(older_than_30)

输出

Employees older than 30:
      Name Department  Age   Salary
1      Bob         HR   45  80000.0
2  Charlie      Sales   38  75000.0
3    David         IT   52  95000.0
5    Frank    Finance   33      NaN

使用 & 运算符组合过滤条件

import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
    'Age': [25, 45, 38, 52, 29, 33],
    'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
it_and_older = df[(df['Department'] == 'IT') & (df['Age'] > 30)]
print("IT employees older than 30:")
print(it_and_older)

输出

IT employees older than 30:
    Name Department  Age   Salary
3  David         IT   52  95000.0

使用 isin 方法进行值匹配

此方法接受一个值列表,并为列值在该列表中的每一行返回True

import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
    'Age': [25, 45, 38, 52, 29, 33],
    'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
hr_or_finance = df[df["Department"].isin(["HR", "Finance"])]
print("Employees in HR or Finance:")
print(hr_or_finance)

输出

Employees in HR or Finance:
    Name Department  Age   Salary
0  Alice         HR   25  50000.0
1    Bob         HR   45  80000.0
5  Frank    Finance   33      NaN

使用 notnull 过滤以移除 NaN

notnull()方法返回一个布尔Series,对于非缺失值返回True, 对于缺失值(NaN)返回False。你可以使用它轻松地移除特定列中包含NaN的行

import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
    'Age': [25, 45, 38, 52, 29, 33],
    'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
valid_salary = df[df['Salary'].notnull()]
print("Employees with valid salary information:")
print(valid_salary)

输出

Employees with valid salary information:
      Name Department  Age   Salary
0    Alice         HR   25  50000.0
1      Bob         HR   45  80000.0
2  Charlie      Sales   38  75000.0
3    David         IT   52  95000.0
4      Eve         IT   29  62000.0

使用 len 函数计算过滤后的行数

返回DataFrame中的行数

import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
    'Age': [25, 45, 38, 52, 29, 33],
    'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
it_and_older = df[(df['Department'] == 'IT') & (df['Age'] > 30)]
hr_or_finance = df[df["Department"].isin(["HR", "Finance"])]
valid_salary = df[df['Salary'].notnull()]
it_employees = df[df['Department'] == 'IT']
num_it_employees = len(it_employees)
print(f"Number of employees in IT:{num_it_employees}")

输出

Number of employees in IT:2

Pandas 数据排序

使用 sort_values 按单列排序 DataFrame

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 22, 25, 28, 22],
    'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)

print("Orignal DataFrame:")
print(df)

df_sorted_by_age = df.sort_values(by = 'Age')
print("\nDataFrame sorted by Age:")
print(df_sorted_by_age)

输出

Orignal DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   22     91
2  Charlie   25     88
3    David   28     79
4      Eve   22     91

DataFrame sorted by Age:
      Name  Age  Score
1      Bob   22     91
4      Eve   22     91
2  Charlie   25     88
0    Alice   25     85
3    David   28     79

按多列升序排列

要按多列排序,我们需要将列名列表传递给sort_values()方法的by参数。Pandas 将首先按照列表中的第一列进行排序,然后相同的按照第二列排序,以此类推。

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 22, 25, 28, 22],
    'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)

print("Orignal DataFrame:")
print(df)

df_sorted_multiple = df.sort_values(by=['Age', 'Score'])
print("\nDataFrame sorted by Age and then Score:")
print(df_sorted_multiple)

输出

DataFrame sorted by Age and then Score:
      Name  Age  Score
1      Bob   22     91
4      Eve   22     91
0    Alice   25     85
2  Charlie   25     88
3    David   28     79

使用 ascending = False 按降序排列

默认情况下,sort_values() 按升序排序
要按降序排序,请将 ascending=False 设置为 True
当按多列排序时,我们可以通过传递布尔值列表来为每列指定不同的顺序,例如 ascending=[True, False]

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 22, 25, 28, 22],
    'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)

print("Orignal DataFrame:")
print(df)

df_sorted_mixed = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("\nDataFrame sorted by Age (asc) and Score (desc):")
print(df_sorted_mixed)

输出

Orignal DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   22     91
2  Charlie   25     88
3    David   28     79
4      Eve   22     91

DataFrame sorted by Age (asc) and Score (desc):
      Name  Age  Score
1      Bob   22     91
4      Eve   22     91
2  Charlie   25     88
0    Alice   25     85
3    David   28     79

使用 sort_index 按索引排序

在执行sort_values()操作后,DataFrame的索引会变得混乱,那么sort_index()方法允许我们根据索引标签对DataFrame进行排序,如果索引是简单的范围,则可以恢复原始顺序

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 22, 25, 28, 22],
    'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)

print("Orignal DataFrame:")
print(df)

df_sorted_mixed = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("\nDataFrame sorted by Age (asc) and Score (desc):")
print(df_sorted_mixed)

df_reordered_by_index = df_sorted_mixed.sort_index()
print("\nDataFrame re-sorted by index:")
print(df_reordered_by_index)

输出

Orignal DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   22     91
2  Charlie   25     88
3    David   28     79
4      Eve   22     91

DataFrame sorted by Age (asc) and Score (desc):
      Name  Age  Score
1      Bob   22     91
4      Eve   22     91
2  Charlie   25     88
0    Alice   25     85
3    David   28     79

DataFrame re-sorted by index:
      Name  Age  Score
0    Alice   25     85
1      Bob   22     91
2  Charlie   25     88
3    David   28     79
4      Eve   22     91

使用 reset_index 在排序后重置索引

reset_index()会用一个默认的整数索引(0, 1, 2...)替换当前索引。通常使用drop = True参数来完全丢弃就索引。如果不使用drop = True,旧索引将被添加一个名为index的新列

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 22, 25, 28, 22],
    'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)

print("Orignal DataFrame:")
print(df)

df_sorted_mixed = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("\nDataFrame sorted by Age (asc) and Score (desc):")
print(df_sorted_mixed)

df_reordered_by_index = df_sorted_mixed.sort_index()
print("\nDataFrame re-sorted by index:")
print(df_reordered_by_index)

df_with_reset_index = df_sorted_mixed.reset_index(drop=True)
print("\nSorted DataFrame with reset index:")
print(df_with_reset_index)

df_with_reset_index_without_drop = df_sorted_mixed.reset_index()
print("\nWithout Drop:")
print(df_with_reset_index_without_drop)

输出

Orignal DataFrame:
      Name  Age  Score
0    Alice   25     85
1      Bob   22     91
2  Charlie   25     88
3    David   28     79
4      Eve   22     91

DataFrame sorted by Age (asc) and Score (desc):
      Name  Age  Score
1      Bob   22     91
4      Eve   22     91
2  Charlie   25     88
0    Alice   25     85
3    David   28     79

DataFrame re-sorted by index:
      Name  Age  Score
0    Alice   25     85
1      Bob   22     91
2  Charlie   25     88
3    David   28     79
4      Eve   22     91

Sorted DataFrame with reset index:
      Name  Age  Score
0      Bob   22     91
1      Eve   22     91
2  Charlie   25     88
3    Alice   25     85
4    David   28     79

Without Drop:
   index     Name  Age  Score
0      1      Bob   22     91
1      4      Eve   22     91
2      2  Charlie   25     88
3      0    Alice   25     85
4      3    David   28     79

Pandas 基本数据清理

使用 dropna方法删除行

原始数据

data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Eva', 'Frank'],
    'age': [25, 30, 35, 40, 25, np.nan, 45],
    'city': ['New York', 'Los Angeles', 'New York', 'Chicago', 'New York', 'Boston', np.nan],
    'SALARY_IN_USD': ['50000', '60000', '70000', '80000', '50000', '90000', '100000']
}

默认情况下,dropna()会删除包含至少一个NaN值得任何行

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("\n DataFrame after dropping rows with any missing values:")
df_dropped = df.dropna()
print(df_dropped)

输出

Original DataFrame:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000
5      Eva   NaN       Boston         90000
6    Frank  45.0          NaN        100000

 DataFrame after dropping rows with any missing values:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000

使用fillna填充缺失值

通常用一个常量(0或者Unkonwn)来填充NaN值,或者用一个计算值(比如平均数或者中位数)来填充

mean_age = df['age'].mean()
df.fillna({'age' : mean_age, 'city' : 'Unknown'}, inplace = True)
print("\n DataFrame after filling missing values:")
print(df)

注意
inplace = True参数会就地修改DataFrame,不会返回一个对象。而inplace = False不会就地修改,而是会返回一个DataFrame对象

输出

Original DataFrame:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000
5      Eva   NaN       Boston         90000
6    Frank  45.0          NaN        100000

 DataFrame after filling missing values:
      name        age         city SALARY_IN_USD
0    Alice  25.000000     New York         50000
1      Bob  30.000000  Los Angeles         60000
2  Charlie  35.000000     New York         70000
3    David  40.000000      Chicago         80000
4    Alice  25.000000     New York         50000
5      Eva  33.333333       Boston         90000
6    Frank  45.000000      Unknown        100000

使用drop_duplictates 删除重复行

inplace参数用法同上

print("Original DataFrame:")
print(df)
mean_age = df['age'].mean()
df.fillna({'age' : mean_age, 'city' : 'Unknown'}, inplace = True)
df.drop_duplicates(inplace = True)
print("\nDataFrame after dropping duplicates:")
print(df)

输出

Original DataFrame:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000
5      Eva   NaN       Boston         90000
6    Frank  45.0          NaN        100000

 DataFrame after filling missing values:

DataFrame after dropping duplicates:
      name        age         city SALARY_IN_USD
0    Alice  25.000000     New York         50000
1      Bob  30.000000  Los Angeles         60000
2  Charlie  35.000000     New York         70000
3    David  40.000000      Chicago         80000
5      Eva  33.333333       Boston         90000
6    Frank  45.000000      Unknown        100000

使用rename方法重命名列

将一个字典传递给columns参数,其中键是旧名称,值是新名称

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
mean_age = df['age'].mean()
df.fillna({'age' : mean_age, 'city' : 'Unknown'}, inplace = True)
print("\n DataFrame after filling missing values:")
df.drop_duplicates(inplace = True)
print("\nDataFrame after dropping duplicates:")
print(df)

df.rename(columns = {'SALARY_IN_USD': 'salary'}, inplace = True)
print("\nDataFrame after renaming colmns:")
print(df)

输出

Original DataFrame:
      name   age         city SALARY_IN_USD
0    Alice  25.0     New York         50000
1      Bob  30.0  Los Angeles         60000
2  Charlie  35.0     New York         70000
3    David  40.0      Chicago         80000
4    Alice  25.0     New York         50000
5      Eva   NaN       Boston         90000
6    Frank  45.0          NaN        100000

 DataFrame after filling missing values:

DataFrame after dropping duplicates:
      name        age         city SALARY_IN_USD
0    Alice  25.000000     New York         50000
1      Bob  30.000000  Los Angeles         60000
2  Charlie  35.000000     New York         70000
3    David  40.000000      Chicago         80000
5      Eva  33.333333       Boston         90000
6    Frank  45.000000      Unknown        100000

DataFrame after renaming colmns:
      name        age         city  salary
0    Alice  25.000000     New York   50000
1      Bob  30.000000  Los Angeles   60000
2  Charlie  35.000000     New York   70000
3    David  40.000000      Chicago   80000
5      Eva  33.333333       Boston   90000
6    Frank  45.000000      Unknown  100000

使用astype装换列类型

对最初得DataFrame执行df.info()输出,我们会注意到SALARY_IN_USD列得dtypeobject,这意味着它将数字存储为字符串,如果想对其进行运算,我们可以装换为数字类型

df['salary'] = df['salary'].astype(int)

Pandas 数据统计

数据

data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'age': [24, 27, 22, 32, 29],
        'score': [85, 90, 78, 95, 88],
        'grade': ['B', 'A', 'C', 'A', 'B']}

使用mean方法计算均值

.mean()可以对Series(DataFrame的一列)计算均值

print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")
score_mean = df['score'].mean()
print(f"Mean Score: {score_mean}")

输出

Original DataFrame:
      name  age  score grade
0    Alice   24     85     B
1      Bob   27     90     A
2  Charlie   22     78     C
3    David   32     95     A
4      Eve   29     88     B

==============================

Mean Score: 87.2

使用 median方法计算中位数

score_mdian = df['score'].median()
print(f"Median Score: {score_mdian}")

输出

Median Score: 88.0

查找最大值和最小值

.max()查找最大值,.min()查找最小值

score_min = df['score'].min()
score_max = df['score'].max()
print(f"Minimum Score: {score_min}")
print(f"Maximum Score: {score_max}")

输出

Minimum Score: 78
Maximum Score: 95

使用 describe 生成汇总统计信息

.describe() 方法。这个方法可以为 DataFrame 中的所有数值列生成全面的描述性统计信息摘要,包括计数(count)均值(mean)标准差(standard deviation)最小值(min)最大值(max)以及四分位数(quartile)值。

summary_stats = df.describe()
print("Descriptive Statistics Summary:")
print(summary_stats)

输出

Descriptive Statistics Summary:
             age      score
count   5.000000   5.000000
mean   26.800000  87.200000
std     4.024922   6.379655
min    22.000000  78.000000
25%    24.000000  85.000000
50%    27.000000  88.000000
75%    29.000000  90.000000
max    32.000000  95.000000

使用 value_counts计数唯一值

.value_counts()方法返回一个包含唯一值计数的Series(也就是说去重后计数)

grade_counts = df['grade'].value_counts()
print("Grade Counts:")
print(grade_counts)

输出

... (previous output) ...

Grade Counts:
grade
B    2
A    2
C    1
Name: count, dtype: int64

Pandas 分组与聚合

使用 groupby 按单列分组

groupby方法将DataFrame拆分成组,返回一个DataFrameGroupby对象,此对象包含了之后对每个组应用计算所需的所有信息

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")

group_by_categeory = df.groupby('Category')

print("Type of the grouped object:")
print(type(group_by_categeory))

print("\nIterating over groups to see their content:")
for name, group in group_by_categeory:
    print(f"\nGroup: {name}")
    print(group)

输出

Original DataFrame:
      Category Region  Sales  Units
0  Electronics  North   1200     10
1     Clothing  South    800     25
2  Electronics  North   1500      8
3        Books   West    300     15
4     Clothing   East    900     20
5        Books   West    450     18

==============================

Type of the grouped object:
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

Iterating over groups to see their content:

Group: Books
  Category Region  Sales  Units
3    Books   West    300     15
5    Books   West    450     18

Group: Clothing
   Category Region  Sales  Units
1  Clothing  South    800     25
4  Clothing   East    900     20

Group: Electronics
      Category Region  Sales  Units
0  Electronics  North   1200     10
2  Electronics  North   1500      8

对分组应用sum聚合

在数据分组后,最常见的下一步是计算每个组的某个值,例如sum(), mean(), count()max()

category_sales_sum = df.groupby('Category')['Sales'].sum()
print("Total sales per category:")
print(category_sales_sum)
Total sales per category:
Category
Books           750
Clothing       1700
Electronics    2700
Name: Sales, dtype: int64

使用agg聚合多个函数

agg()方法可以接受一个字符串列表,其中每个字符串都是一个聚合函数的名称。

category_agg = df.groupby('Category')['Sales'].agg(['sum', 'mean'])
print("\nSum and mean of sales per category:")
print(category_agg)

输出

... (previous output) ...

Sum and mean of sales per category:
              sum    mean
Category
Books         750   375.0
Clothing     1700   850.0
Electronics  2700  1350.0

按多列分组

只需要将列名列表传递给groupby方法

multi_group_sum = df.groupby(['Region', 'Category'])['Sales'].sum()
print("\nTotal sales per Region and Category:")
print(multi_group_sum)

输出

... (previous output) ...

Total sales per Region and Category:
Region  Category
East    Clothing        900
North   Electronics    2700
South   Clothing        800
West    Books           750
Name: Sales, dtype: int64

重置分组 DataFrame的索引

如何将分组后的输出转换回常规 DataFrame? 其中分组键是列而不是索引。默认情况下,groupby() 会将分组键设置为结果 Series 或 DataFrame 的索引。有时,我们可能需要一个扁平化DataFrame 以便进行进一步处理或可视化。

实现这一目标的最简单方法是在 groupby() 方法中使用 as_index=False 参数。

category_sales_flat = df.groupby('Category', as_index=False)['Sales'].sum()
print("\nGrouped data with 'Category' as a column:")
print(category_sales_flat)

输出

Grouped data with 'Category' as a column:
      Category  Sales
0        Books    750
1     Clothing   1700
2  Electronics   2700
posted @ 2025-09-23 21:50  栗悟饭与龟功気波  阅读(5)  评论(0)    收藏  举报