Pandas 基本用法快速入门
Pandas 基本用法快速入门
文章基于Labex-Pandas 实验整理
Pandas创建DataFrame
从字典创建DataFrame
import pandas as pd
student_data = {
'Name' : ['Alice', 'Bob', 'Charlie'],
'Score' : [85, 92, 78]
}
df = pd.DataFrame(student_data)
print(df)
Name Score
0 Alice 85
1 Bob 92
2 Charlie 78
在DataFrame中指定列名
指定列名相当于指定了列之间的顺序
import pandas as pd
student_data = {
'Name' : ['Alice', 'Bob', 'Charlie'],
'Score' : [85, 92, 78]
}
df = pd.DataFrame(student_data, columns = ['Score', 'Name'])
print(df)
Score Name
0 85 Alice
1 92 Bob
2 78 Charlie
通过 index 参数添加索引标签
import pandas as pd
student_data = {
'Name' : ['Alice', 'Bob', 'Charlie'],
'Score' : [85, 92, 78]
}
index_labels = ['ID1', 'ID2', 'ID3']
df = pd.DataFrame(student_data, index = index_labels)
print(df)
Name Score
ID1 Alice 85
ID2 Bob 92
ID3 Charlie 78
使用点符号访问 DataFrame 列
import pandas as pd
student_data = {
'Name' : ['Alice', 'Bob', 'Charlie'],
'Score' : [85, 92, 78]
}
index_labels = ['ID1', 'ID2', 'ID3']
df = pd.DataFrame(student_data)
print(df.Name)
输出是一个 Pandas Series,它本质上是 DataFrame 的单列
0 Alice
1 Bob
2 Charlie
Name: Name, dtype: object
使用 info 方法显示 DataFrame 信息
import pandas as pd
student_data = {
'Name' : ['Alice', 'Bob', 'Charlie'],
'Score' : [85, 92, 78]
}
index_labels = ['ID1', 'ID2', 'ID3']
df = pd.DataFrame(student_data)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 3 non-null object
1 Score 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
Pandas读取外部数据
使用read_csv()读取csv文件
pandas.read_csv() 的参数 comment 用来指定“注释起始字符”。作用如下:
- 指定一个单字符(例如 '#')。当解析到该字符时,该字符及其后面的内容都会被忽略。
- 如果这一字符出现在一行的开头,则整行都会被跳过(常用于跳过文件里的说明/元数据行)。
- 如果出现在行中间(且不在引号内),则从该字符起到行尾的内容被当作注释丢弃,相当于“行内注释”。
- 在引号包裹的字段内出现的注释字符会被视为普通文本,不会触发注释。
- 只能是单个字符;不能使用多字符注释(例如不能直接用 '//')。
- 注意:注释判定是从行首匹配该字符;如果行首有空格再跟注释符,通常不会被当作“以注释开头的行”而跳过。
data.csv
# This is a comment line
# Another comment line
id,name,age,city
1,Alice,25,New York
2,Bob,30,Los Angeles
3,Charlie,Not Available,London
4,David,35,N/A
5,Eve,22,Paris
import pandas as pd
df = pd.read_csv('data.csv', comment = '#')
print(df)
id name age city
0 1 Alice 25 New York
1 2 Bob 30 Los Angeles
2 3 Charlie Not Available London
3 4 David 35 NaN
4 5 Eve 22 Paris
在read_csv中指定表头行
默认情况下,read_csv假定文件中的第一行(非注释行,非跳过行)是包含列名的表头行。
header参数允许你显示地指定哪一行作为表头,它接受一个整数,表示行索引(从0开始)。由于第一行数据(在注释之后)是我们地表头,其索引是0
import pandas as pd
df = pd.read_csv('data.csv', comment='#', header = 0)
print(df)
使用 na_values 参数处理缺失值
Pandas 会自动识别一些常见的缺失值指示符,如空字符串、NA 或 N/A,但不会识别自定义的如 "Not Available"。我们可以使用 na_values 参数提供一个字符串列表,这些字符串将被解释为 NaN(Not a Number),这是 Pandas 用于缺失数据的标准标记。
import pandas as pd
missing_value = ["Not Available", "N/A"]
df = pd.read_csv('data.csv', comment = '#', na_values = missing_value)
print(df)
id name age city
0 1 Alice 25.0 New York
1 2 Bob 30.0 Los Angeles
2 3 Charlie NaN London
3 4 David 35.0 NaN
4 5 Eve 22.0 Paris
使用.head()方法显示前几行
.head() 方法是一种方便的方式来快速预览你的数据。默认情况下,它返回前 5 行。.head(x)返回前x行
import pandas as pd
missing_value = ["Not Available", "N/A"]
df = pd.read_csv('data.csv', comment = '#', na_values = missing_value)
print(df.head(3))
id name age city
0 1 Alice 25.0 New York
1 2 Bob 30.0 Los Angeles
2 3 Charlie NaN London
使用shape属性检查DataFrame的形状
DataFrame 有一个 .shape 属性,它返回一个包含行数和列数的元组。
请注意,.shape 是一个属性,而不是一个方法,因此在访问它时不需要使用括号 ()。
import pandas as pd
missing_value = ["Not Available", "N/A"]
df = pd.read_csv('data.csv', comment = '#', na_values = missing_value)
print(df.shape)
(5, 4)
Pandas 数据选择
数据
students.csv
name,age,major,score
Alice,20,Physics,85
Bob,22,Computer Science,92
Charlie,21,Mathematics,95
David,23,Engineering,88
Eve,20,Biology,90
使用方括号表示选择单列
选择单列会返回一个Series,它本质上是一个一维的带标签数组
import pandas as pd
df = pd.read_csv('students.csv')
name_column = df['name']
print(name_column)
输出
0 Alice
1 Bob
2 Charlie
3 David
4 Eve
Name: name, dtype: object
通过列表选择多列
与选择单列返回Series不同,选择多列会返回一个新的DataFrame
import pandas as pd
df = pd.read_csv('students.csv')
subset = df[['name', 'score']]
print(subset)
输出
name score
0 Alice 85
1 Bob 92
2 Charlie 95
3 David 88
4 Eve 90
使用loc标签选择行
.loc访问器根据标签选择数据。.loc索引器主要基于标签,这意味着你使用索引 名称 (或标签) 来进行选择。默认情况下,当你加载CSV文件而不指定索引列时,Pandas会分配一个从0开始的默认整数索引。这些整数充当行的标签
import pandas as pd
df = pd.read_csv('students.csv')
charlie_data = df.loc[2]
print(charlie_data)
输出
name Charlie
age 21
major Mathematics
score 95
Name: 2, dtype: object
使用iloc按整数位置选择行
.iloc索引器主要基于整数位置。它的工作方式与Python列表切片类似,你使用整数索引来访问元素。这与使用标签.loc不同。虽然我们的默认标签索引也是整数,但当你拥有非整数标签时,这种区别就变得至关重要
import pandas as pd
df = pd.read_csv('students.csv')
first_row = df.iloc[0]
print(first_row)
使用loc切片行和列
.loc和.iloc都可以使用df.loc[row_selector, column_selector]语法同时选择行和列
.loc的一个关键特性是,当你使用标签进行切片时(例如1:3),结束标签(3)是包含在内的
import pandas as pd
df = pd.read_csv('students.csv')
data_slice = df.loc[1:3, 'name' : 'major']
print(data_slice)
输出
name age major
1 Bob 22 Computer Science
2 Charlie 21 Mathematics
3 David 23 Engineering
使用布尔条件过滤
创建一个布尔条件,该条件会返回一个包含 True 和 False 值的 Pandas Series。当你将此 Series 传递给 DataFrame 时,它只会返回值为 True 的那些行。
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
'Age': [25, 45, 38, 52, 29, 33],
'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
older_than_30 = df[df['Age'] > 30]
print("Employees older than 30:")
print(older_than_30)
输出
Employees older than 30:
Name Department Age Salary
1 Bob HR 45 80000.0
2 Charlie Sales 38 75000.0
3 David IT 52 95000.0
5 Frank Finance 33 NaN
使用 & 运算符组合过滤条件
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
'Age': [25, 45, 38, 52, 29, 33],
'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
it_and_older = df[(df['Department'] == 'IT') & (df['Age'] > 30)]
print("IT employees older than 30:")
print(it_and_older)
输出
IT employees older than 30:
Name Department Age Salary
3 David IT 52 95000.0
使用 isin 方法进行值匹配
此方法接受一个值列表,并为列值在该列表中的每一行返回True
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
'Age': [25, 45, 38, 52, 29, 33],
'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
hr_or_finance = df[df["Department"].isin(["HR", "Finance"])]
print("Employees in HR or Finance:")
print(hr_or_finance)
输出
Employees in HR or Finance:
Name Department Age Salary
0 Alice HR 25 50000.0
1 Bob HR 45 80000.0
5 Frank Finance 33 NaN
使用 notnull 过滤以移除 NaN
notnull()方法返回一个布尔Series,对于非缺失值返回True, 对于缺失值(NaN)返回False。你可以使用它轻松地移除特定列中包含NaN的行
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
'Age': [25, 45, 38, 52, 29, 33],
'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
valid_salary = df[df['Salary'].notnull()]
print("Employees with valid salary information:")
print(valid_salary)
输出
Employees with valid salary information:
Name Department Age Salary
0 Alice HR 25 50000.0
1 Bob HR 45 80000.0
2 Charlie Sales 38 75000.0
3 David IT 52 95000.0
4 Eve IT 29 62000.0
使用 len 函数计算过滤后的行数
返回DataFrame中的行数
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Department': ['HR', 'HR', 'Sales', 'IT', 'IT', 'Finance'],
'Age': [25, 45, 38, 52, 29, 33],
'Salary': [50000, 80000, 75000, 95000, 62000, np.nan]
}
df = pd.DataFrame(data)
it_and_older = df[(df['Department'] == 'IT') & (df['Age'] > 30)]
hr_or_finance = df[df["Department"].isin(["HR", "Finance"])]
valid_salary = df[df['Salary'].notnull()]
it_employees = df[df['Department'] == 'IT']
num_it_employees = len(it_employees)
print(f"Number of employees in IT:{num_it_employees}")
输出
Number of employees in IT:2
Pandas 数据排序
使用 sort_values 按单列排序 DataFrame
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 22, 25, 28, 22],
'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)
print("Orignal DataFrame:")
print(df)
df_sorted_by_age = df.sort_values(by = 'Age')
print("\nDataFrame sorted by Age:")
print(df_sorted_by_age)
输出
Orignal DataFrame:
Name Age Score
0 Alice 25 85
1 Bob 22 91
2 Charlie 25 88
3 David 28 79
4 Eve 22 91
DataFrame sorted by Age:
Name Age Score
1 Bob 22 91
4 Eve 22 91
2 Charlie 25 88
0 Alice 25 85
3 David 28 79
按多列升序排列
要按多列排序,我们需要将列名列表传递给sort_values()方法的by参数。Pandas 将首先按照列表中的第一列进行排序,然后相同的按照第二列排序,以此类推。
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 22, 25, 28, 22],
'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)
print("Orignal DataFrame:")
print(df)
df_sorted_multiple = df.sort_values(by=['Age', 'Score'])
print("\nDataFrame sorted by Age and then Score:")
print(df_sorted_multiple)
输出
DataFrame sorted by Age and then Score:
Name Age Score
1 Bob 22 91
4 Eve 22 91
0 Alice 25 85
2 Charlie 25 88
3 David 28 79
使用 ascending = False 按降序排列
默认情况下,sort_values() 按升序排序
要按降序排序,请将 ascending=False 设置为 True。
当按多列排序时,我们可以通过传递布尔值列表来为每列指定不同的顺序,例如 ascending=[True, False]。
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 22, 25, 28, 22],
'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)
print("Orignal DataFrame:")
print(df)
df_sorted_mixed = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("\nDataFrame sorted by Age (asc) and Score (desc):")
print(df_sorted_mixed)
输出
Orignal DataFrame:
Name Age Score
0 Alice 25 85
1 Bob 22 91
2 Charlie 25 88
3 David 28 79
4 Eve 22 91
DataFrame sorted by Age (asc) and Score (desc):
Name Age Score
1 Bob 22 91
4 Eve 22 91
2 Charlie 25 88
0 Alice 25 85
3 David 28 79
使用 sort_index 按索引排序
在执行sort_values()操作后,DataFrame的索引会变得混乱,那么sort_index()方法允许我们根据索引标签对DataFrame进行排序,如果索引是简单的范围,则可以恢复原始顺序
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 22, 25, 28, 22],
'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)
print("Orignal DataFrame:")
print(df)
df_sorted_mixed = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("\nDataFrame sorted by Age (asc) and Score (desc):")
print(df_sorted_mixed)
df_reordered_by_index = df_sorted_mixed.sort_index()
print("\nDataFrame re-sorted by index:")
print(df_reordered_by_index)
输出
Orignal DataFrame:
Name Age Score
0 Alice 25 85
1 Bob 22 91
2 Charlie 25 88
3 David 28 79
4 Eve 22 91
DataFrame sorted by Age (asc) and Score (desc):
Name Age Score
1 Bob 22 91
4 Eve 22 91
2 Charlie 25 88
0 Alice 25 85
3 David 28 79
DataFrame re-sorted by index:
Name Age Score
0 Alice 25 85
1 Bob 22 91
2 Charlie 25 88
3 David 28 79
4 Eve 22 91
使用 reset_index 在排序后重置索引
reset_index()会用一个默认的整数索引(0, 1, 2...)替换当前索引。通常使用drop = True参数来完全丢弃就索引。如果不使用drop = True,旧索引将被添加一个名为index的新列
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 22, 25, 28, 22],
'Score': [85, 91, 88, 79, 91]
}
df = pd.DataFrame(data)
print("Orignal DataFrame:")
print(df)
df_sorted_mixed = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print("\nDataFrame sorted by Age (asc) and Score (desc):")
print(df_sorted_mixed)
df_reordered_by_index = df_sorted_mixed.sort_index()
print("\nDataFrame re-sorted by index:")
print(df_reordered_by_index)
df_with_reset_index = df_sorted_mixed.reset_index(drop=True)
print("\nSorted DataFrame with reset index:")
print(df_with_reset_index)
df_with_reset_index_without_drop = df_sorted_mixed.reset_index()
print("\nWithout Drop:")
print(df_with_reset_index_without_drop)
输出
Orignal DataFrame:
Name Age Score
0 Alice 25 85
1 Bob 22 91
2 Charlie 25 88
3 David 28 79
4 Eve 22 91
DataFrame sorted by Age (asc) and Score (desc):
Name Age Score
1 Bob 22 91
4 Eve 22 91
2 Charlie 25 88
0 Alice 25 85
3 David 28 79
DataFrame re-sorted by index:
Name Age Score
0 Alice 25 85
1 Bob 22 91
2 Charlie 25 88
3 David 28 79
4 Eve 22 91
Sorted DataFrame with reset index:
Name Age Score
0 Bob 22 91
1 Eve 22 91
2 Charlie 25 88
3 Alice 25 85
4 David 28 79
Without Drop:
index Name Age Score
0 1 Bob 22 91
1 4 Eve 22 91
2 2 Charlie 25 88
3 0 Alice 25 85
4 3 David 28 79
Pandas 基本数据清理
使用 dropna方法删除行
原始数据
data = {
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Eva', 'Frank'],
'age': [25, 30, 35, 40, 25, np.nan, 45],
'city': ['New York', 'Los Angeles', 'New York', 'Chicago', 'New York', 'Boston', np.nan],
'SALARY_IN_USD': ['50000', '60000', '70000', '80000', '50000', '90000', '100000']
}
默认情况下,dropna()会删除包含至少一个NaN值得任何行
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\n DataFrame after dropping rows with any missing values:")
df_dropped = df.dropna()
print(df_dropped)
输出
Original DataFrame:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
5 Eva NaN Boston 90000
6 Frank 45.0 NaN 100000
DataFrame after dropping rows with any missing values:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
使用fillna填充缺失值
通常用一个常量(0或者Unkonwn)来填充NaN值,或者用一个计算值(比如平均数或者中位数)来填充
mean_age = df['age'].mean()
df.fillna({'age' : mean_age, 'city' : 'Unknown'}, inplace = True)
print("\n DataFrame after filling missing values:")
print(df)
注意
inplace = True参数会就地修改DataFrame,不会返回一个对象。而inplace = False不会就地修改,而是会返回一个DataFrame对象
输出
Original DataFrame:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
5 Eva NaN Boston 90000
6 Frank 45.0 NaN 100000
DataFrame after filling missing values:
name age city SALARY_IN_USD
0 Alice 25.000000 New York 50000
1 Bob 30.000000 Los Angeles 60000
2 Charlie 35.000000 New York 70000
3 David 40.000000 Chicago 80000
4 Alice 25.000000 New York 50000
5 Eva 33.333333 Boston 90000
6 Frank 45.000000 Unknown 100000
使用drop_duplictates 删除重复行
inplace参数用法同上
print("Original DataFrame:")
print(df)
mean_age = df['age'].mean()
df.fillna({'age' : mean_age, 'city' : 'Unknown'}, inplace = True)
df.drop_duplicates(inplace = True)
print("\nDataFrame after dropping duplicates:")
print(df)
输出
Original DataFrame:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
5 Eva NaN Boston 90000
6 Frank 45.0 NaN 100000
DataFrame after filling missing values:
DataFrame after dropping duplicates:
name age city SALARY_IN_USD
0 Alice 25.000000 New York 50000
1 Bob 30.000000 Los Angeles 60000
2 Charlie 35.000000 New York 70000
3 David 40.000000 Chicago 80000
5 Eva 33.333333 Boston 90000
6 Frank 45.000000 Unknown 100000
使用rename方法重命名列
将一个字典传递给columns参数,其中键是旧名称,值是新名称
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
mean_age = df['age'].mean()
df.fillna({'age' : mean_age, 'city' : 'Unknown'}, inplace = True)
print("\n DataFrame after filling missing values:")
df.drop_duplicates(inplace = True)
print("\nDataFrame after dropping duplicates:")
print(df)
df.rename(columns = {'SALARY_IN_USD': 'salary'}, inplace = True)
print("\nDataFrame after renaming colmns:")
print(df)
输出
Original DataFrame:
name age city SALARY_IN_USD
0 Alice 25.0 New York 50000
1 Bob 30.0 Los Angeles 60000
2 Charlie 35.0 New York 70000
3 David 40.0 Chicago 80000
4 Alice 25.0 New York 50000
5 Eva NaN Boston 90000
6 Frank 45.0 NaN 100000
DataFrame after filling missing values:
DataFrame after dropping duplicates:
name age city SALARY_IN_USD
0 Alice 25.000000 New York 50000
1 Bob 30.000000 Los Angeles 60000
2 Charlie 35.000000 New York 70000
3 David 40.000000 Chicago 80000
5 Eva 33.333333 Boston 90000
6 Frank 45.000000 Unknown 100000
DataFrame after renaming colmns:
name age city salary
0 Alice 25.000000 New York 50000
1 Bob 30.000000 Los Angeles 60000
2 Charlie 35.000000 New York 70000
3 David 40.000000 Chicago 80000
5 Eva 33.333333 Boston 90000
6 Frank 45.000000 Unknown 100000
使用astype装换列类型
对最初得DataFrame执行df.info()输出,我们会注意到SALARY_IN_USD列得dtype是object,这意味着它将数字存储为字符串,如果想对其进行运算,我们可以装换为数字类型
df['salary'] = df['salary'].astype(int)
Pandas 数据统计
数据
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [24, 27, 22, 32, 29],
'score': [85, 90, 78, 95, 88],
'grade': ['B', 'A', 'C', 'A', 'B']}
使用mean方法计算均值
.mean()可以对Series(DataFrame的一列)计算均值
print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")
score_mean = df['score'].mean()
print(f"Mean Score: {score_mean}")
输出
Original DataFrame:
name age score grade
0 Alice 24 85 B
1 Bob 27 90 A
2 Charlie 22 78 C
3 David 32 95 A
4 Eve 29 88 B
==============================
Mean Score: 87.2
使用 median方法计算中位数
score_mdian = df['score'].median()
print(f"Median Score: {score_mdian}")
输出
Median Score: 88.0
查找最大值和最小值
.max()查找最大值,.min()查找最小值
score_min = df['score'].min()
score_max = df['score'].max()
print(f"Minimum Score: {score_min}")
print(f"Maximum Score: {score_max}")
输出
Minimum Score: 78
Maximum Score: 95
使用 describe 生成汇总统计信息
.describe() 方法。这个方法可以为 DataFrame 中的所有数值列生成全面的描述性统计信息摘要,包括计数(count)、均值(mean)、标准差(standard deviation)、最小值(min)、最大值(max)以及四分位数(quartile)值。
summary_stats = df.describe()
print("Descriptive Statistics Summary:")
print(summary_stats)
输出
Descriptive Statistics Summary:
age score
count 5.000000 5.000000
mean 26.800000 87.200000
std 4.024922 6.379655
min 22.000000 78.000000
25% 24.000000 85.000000
50% 27.000000 88.000000
75% 29.000000 90.000000
max 32.000000 95.000000
使用 value_counts计数唯一值
.value_counts()方法返回一个包含唯一值计数的Series(也就是说去重后计数)
grade_counts = df['grade'].value_counts()
print("Grade Counts:")
print(grade_counts)
输出
... (previous output) ...
Grade Counts:
grade
B 2
A 2
C 1
Name: count, dtype: int64
Pandas 分组与聚合
使用 groupby 按单列分组
groupby方法将DataFrame拆分成组,返回一个DataFrameGroupby对象,此对象包含了之后对每个组应用计算所需的所有信息
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")
group_by_categeory = df.groupby('Category')
print("Type of the grouped object:")
print(type(group_by_categeory))
print("\nIterating over groups to see their content:")
for name, group in group_by_categeory:
print(f"\nGroup: {name}")
print(group)
输出
Original DataFrame:
Category Region Sales Units
0 Electronics North 1200 10
1 Clothing South 800 25
2 Electronics North 1500 8
3 Books West 300 15
4 Clothing East 900 20
5 Books West 450 18
==============================
Type of the grouped object:
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Iterating over groups to see their content:
Group: Books
Category Region Sales Units
3 Books West 300 15
5 Books West 450 18
Group: Clothing
Category Region Sales Units
1 Clothing South 800 25
4 Clothing East 900 20
Group: Electronics
Category Region Sales Units
0 Electronics North 1200 10
2 Electronics North 1500 8
对分组应用sum聚合
在数据分组后,最常见的下一步是计算每个组的某个值,例如sum(), mean(), count()或max()
category_sales_sum = df.groupby('Category')['Sales'].sum()
print("Total sales per category:")
print(category_sales_sum)
Total sales per category:
Category
Books 750
Clothing 1700
Electronics 2700
Name: Sales, dtype: int64
使用agg聚合多个函数
agg()方法可以接受一个字符串列表,其中每个字符串都是一个聚合函数的名称。
category_agg = df.groupby('Category')['Sales'].agg(['sum', 'mean'])
print("\nSum and mean of sales per category:")
print(category_agg)
输出
... (previous output) ...
Sum and mean of sales per category:
sum mean
Category
Books 750 375.0
Clothing 1700 850.0
Electronics 2700 1350.0
按多列分组
只需要将列名列表传递给groupby方法
multi_group_sum = df.groupby(['Region', 'Category'])['Sales'].sum()
print("\nTotal sales per Region and Category:")
print(multi_group_sum)
输出
... (previous output) ...
Total sales per Region and Category:
Region Category
East Clothing 900
North Electronics 2700
South Clothing 800
West Books 750
Name: Sales, dtype: int64
重置分组 DataFrame的索引
如何将分组后的输出转换回常规 DataFrame? 其中分组键是列而不是索引。默认情况下,groupby() 会将分组键设置为结果 Series 或 DataFrame 的索引。有时,我们可能需要一个扁平化的 DataFrame 以便进行进一步处理或可视化。
实现这一目标的最简单方法是在 groupby() 方法中使用 as_index=False 参数。
category_sales_flat = df.groupby('Category', as_index=False)['Sales'].sum()
print("\nGrouped data with 'Category' as a column:")
print(category_sales_flat)
输出
Grouped data with 'Category' as a column:
Category Sales
0 Books 750
1 Clothing 1700
2 Electronics 2700

浙公网安备 33010602011771号