Basic Data Operations in pandas
I. Deleting data from a DataFrame
- Deletion comes in two kinds: dropping rows or columns, and removing duplicates
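Only drop() is walked through in this section, so here is a minimal sketch of the second kind of deletion, removing duplicates, using drop_duplicates(). The `scores` table and its values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly, and "ann" appears three times.
scores = pd.DataFrame(
    {"name": ["ann", "bob", "ann", "ann"], "yuwen": [90, 85, 90, 70]}
)

# Drop rows that are identical across all columns (the first occurrence is kept).
unique_rows = scores.drop_duplicates()
print(unique_rows)

# Compare only the "name" column and keep the last occurrence of each name.
last_per_name = scores.drop_duplicates(subset="name", keep="last")
print(last_per_name)
```

Like drop(), drop_duplicates() returns a new DataFrame by default and leaves the original unchanged.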
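1. drop(): deleting rows or columns
# Syntax:
df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
# Parameters:
labels: the row labels or column names to delete, given as a single label or a list-like. To delete columns this way, also set axis to 1 or 'columns'.
axis: whether to delete rows or columns. 0 or 'index' deletes rows; 1 or 'columns' deletes columns. The default is 0.
index: the rows to delete; equivalent to passing labels with axis set to 0 or 'index'.
columns: the columns to delete; equivalent to passing labels with axis set to 1 or 'columns'.
level: for a MultiIndex, the level whose labels are removed, given as a level position or a level name.
inplace: whether to modify the DataFrame itself. The default, False, leaves the original untouched and returns a new DataFrame without the dropped data. With True, drop() modifies the calling DataFrame and returns None.
errors: either 'ignore' or 'raise'. The default, 'raise', raises an error when a label is not found. With 'ignore', labels that do not exist are skipped, and the labels that do exist are still dropped as usual.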
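As a quick illustration of the index=, columns=, and errors= parameters listed above, here is a minimal sketch. It builds its own small table so it can be run on its own; the label "wuli" is made up and deliberately absent from the table.

```python
import numpy as np
import pandas as pd

subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=dates, columns=subjects)

# columns=... is equivalent to passing the labels with axis=1.
by_keyword = df.drop(columns=["yuwen", "shuxue"])
by_axis = df.drop(["yuwen", "shuxue"], axis=1)
print(by_keyword.equals(by_axis))

# index= and columns= can be combined in a single call.
trimmed = df.drop(index=["20250301"], columns=["lishi"])
print(trimmed.shape)

# errors="ignore" skips the missing label "wuli" and still drops "tiyu".
safe = df.drop(columns=["tiyu", "wuli"], errors="ignore")
print(list(safe.columns))
```

None of these calls changes df itself, because inplace is left at its default of False.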
# Example:
import pandas as pd
import numpy as np
a = np.arange(48).reshape(8,6)
# Build the column labels (subjects)
xueke =["yuwen","shuxue","yingyu","zhengzhi","tiyu","lishi"]
# Build the row labels (dates)
m_time =['2025030'+str(i) for i in range(1,a.shape[0]+1)]
df = pd.DataFrame(a,index=m_time,columns=xueke)
print(df)
# data = df.drop(["yuwen","shuxue"]) # raises KeyError: with the default axis=0 the labels are looked up in the row index, which does not contain them
data = df.drop(["yuwen","shuxue"],axis=1)
print(data)
data1 = df.drop(["20250301","20250302"],axis=0)
print(data1)
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
yingyu zhengzhi tiyu lishi
20250301 2 3 4 5
20250302 8 9 10 11
20250303 14 15 16 17
20250304 20 21 22 23
20250305 26 27 28 29
20250306 32 33 34 35
20250307 38 39 40 41
20250308 44 45 46 47
yuwen shuxue yingyu zhengzhi tiyu lishi
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
"""
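II. Indexing and selecting values in a DataFrame
- The NumPy notes already covered selecting by index and by slice. pandas supports similar operations, and can also select directly by column name or row label, or combine these.
1. Selecting directly with column and row labels (column first, then row)
- Note: the column comes first, then the row
print(df['yuwen']['20250301']) # prints: 0
# Unsupported operations
# Error 1: row first, then column is not supported (KeyError, because '20250301' is not a column name)
# df['20250301']['yuwen']
# Error 2: NumPy-style positional indexing is not supported with df[...]
# df[:1,:2]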
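The column-first rule can be checked with a short, self-contained sketch. It rebuilds the same table, shows the KeyError from the row-first form, and, as an alternative to chaining two [] lookups, reads the same cell in one step with loc and at (loc is covered in part 3 below).

```python
import numpy as np
import pandas as pd

subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=dates, columns=subjects)

# Column first, then row: df["yuwen"] is a Series indexed by the row labels.
print(df["yuwen"]["20250301"])

# Row first fails: df[...] with a string looks the label up among the columns.
try:
    df["20250301"]["yuwen"]
except KeyError as exc:
    print("KeyError:", exc)

# Single-step lookups by (row label, column label).
print(df.loc["20250301", "yuwen"])
print(df.at["20250301", "yuwen"])
```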
2. Getting the integer position of a label
- Syntax: get_indexer()
- A label that does not exist returns -1
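One place these positions are useful is mixing label lookups with position-based selection. The sketch below is one possible way to do that, and it assumes the iloc indexer described in part 4. Because -1 means "last" when used as a position, the missing-label marker is filtered out before the positions are used.

```python
import numpy as np
import pandas as pd

subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=dates, columns=subjects)

# Translate column labels into integer positions.
col_pos = df.columns.get_indexer(["yuwen", "lishi"])
print(col_pos)

# Use the positions with iloc: first two rows by position, columns found by label.
first_two = df.iloc[:2, col_pos]
print(first_two)

# "20230302" is not in the index, so its position comes back as -1.
row_pos = df.index.get_indexer(["20250302", "20230302"])
valid = row_pos[row_pos >= 0]  # drop the -1 markers before indexing
print(df.iloc[valid])
```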
print(df)
a = df.columns.get_indexer(["yuwen", "lishi"]) # column positions of the given labels
print(a)
b = df.index.get_indexer(["20230302"]) # returns -1 because the label does not exist
print(b)
c= df.index.get_indexer(["20250302"]) # row position of the given label
print(c)
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
[0 5]
[-1]
[1]
"""
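3. Selecting with loc
- loc selects data by label, by boolean values, or by condition. It is one of the most commonly used ways to select data.
- Syntax: df.loc[row selection, column selection]
- Compared with NumPy:
- Same: the indexing and slicing syntax matches NumPy arrays
- Different: a slice includes both endpoints
- Different: with a list on each axis, loc returns every listed row crossed with every listed column (the rows and columns need not be contiguous), whereas NumPy pairs the two lists up into individual points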
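The two differences from NumPy can be seen side by side in a minimal sketch that uses the same 8x6 array both as a plain NumPy array and as a labelled DataFrame.

```python
import numpy as np
import pandas as pd

a = np.arange(48).reshape(8, 6)
subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(a, index=dates, columns=subjects)

# Slices: NumPy excludes the stop position, loc includes the stop label.
print(a[0:2].shape)                          # 2 rows
print(df.loc["20250301":"20250303"].shape)   # 3 rows

# Two lists in NumPy are paired up into the points (0, 0) and (2, 3).
points = a[[0, 2], [0, 3]]
print(points)

# Two lists in loc select a block: every listed row with every listed column.
block = df.loc[["20250301", "20250303"], ["yuwen", "zhengzhi"]]
print(block)
```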
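(1) Selecting a single row or column by label
- Works like indexing a NumPy array
print(df)
print('pppppppppppppppp')
print(df.loc['20250301']) # select the first row
print('pppppppppppppppp')
print(df.loc[:,['yuwen']]) # select the yuwen column as a DataFrame; the list form keeps the column header
print(df.loc[:,'yuwen']) # select the yuwen column as a Series; without the list there is no column header, only the values with Name: yuwen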
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
pppppppppppppppp
yuwen 0
shuxue 1
yingyu 2
zhengzhi 3
tiyu 4
lishi 5
Name: 20250301, dtype: int32
pppppppppppppppp
yuwen
20250301 0
20250302 6
20250303 12
20250304 18
20250305 24
20250306 30
20250307 36
20250308 42
20250301 0
20250302 6
20250303 12
20250304 18
20250305 24
20250306 30
20250307 36
20250308 42
Name: yuwen, dtype: int32
"""
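(2) Selecting multiple rows and columns by label
- The slicing syntax matches NumPy arrays. Note the difference: in loc both ends of a slice are inclusive, whereas a NumPy slice includes the start and excludes the stop.
print(df)
print('pppppppppppppppp')
print(df.loc['20250301':'20250303',:]) # select consecutive rows
print(df.loc[['20250301','20250303'],['yuwen']]) # select non-consecutive rows
print(df.loc[:,'yuwen':'zhengzhi']) # select consecutive columns; both ends of the slice are inclusive
print(df.loc[:,['yuwen','zhengzhi']]) # select non-consecutive columns
print(df.loc['20250301':'20250303',['yuwen','zhengzhi']]) # select several rows and columns; both ends of the slice are inclusive
print('*'*50)
print(df.index[0:2])
print(df.loc[df.index[0:2],['yuwen','zhengzhi']]) # select several rows and columns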
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
pppppppppppppppp
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
yuwen
20250301 0
20250303 12
yuwen shuxue yingyu zhengzhi
20250301 0 1 2 3
20250302 6 7 8 9
20250303 12 13 14 15
20250304 18 19 20 21
20250305 24 25 26 27
20250306 30 31 32 33
20250307 36 37 38 39
20250308 42 43 44 45
yuwen zhengzhi
20250301 0 3
20250302 6 9
20250303 12 15
20250304 18 21
20250305 24 27
20250306 30 33
20250307 36 39
20250308 42 45
yuwen zhengzhi
20250301 0 3
20250302 6 9
20250303 12 15
**************************************************
Index(['20250301', '20250302'], dtype='object')
yuwen zhengzhi
20250301 0 3
20250302 6 9
"""
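(3) Selecting by condition
- Note: in the examples below, the part before the comma is the row condition and the part after the comma selects the columns
~ means "not" (negation), | means "or", and & means "and". Wrap each condition in parentheses when combining them.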
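The ~ operator is easy to overlook, so here is a minimal, self-contained sketch of negating a condition and of combining a negated condition with &. The parentheses matter because & and | bind more tightly than comparisons such as >.

```python
import numpy as np
import pandas as pd

subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=dates, columns=subjects)

# ~ flips a boolean mask: rows where yuwen is NOT greater than 20.
low = df.loc[~(df.yuwen > 20), ["yuwen", "lishi"]]
print(low)

# yuwen greater than 20 AND lishi not greater than 40.
mask = (df.yuwen > 20) & ~(df.lishi > 40)
middle = df.loc[mask, "yuwen":"shuxue"]
print(middle)
```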
print('*'*50)
print(df)
print(df.loc[df.yuwen > 20,['lishi','zhengzhi']]) # rows where yuwen is greater than 20, columns lishi and zhengzhi
print(df.loc[df['yuwen'] > 20,['lishi','zhengzhi']]) # same result as df.yuwen > 20 above
print(df.loc[(df.yuwen > 20) & (df.lishi>30),['lishi','zhengzhi']]) # & combines the conditions with "and"
print(df.loc[(df.yuwen > 20) | (df.yingyu == 14),['yuwen','lishi']]) # | combines the conditions with "or"
print(df.loc[(df.yuwen > 20) & (df.lishi>30),'zhengzhi':'lishi'])
"""
**************************************************
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
lishi zhengzhi
20250305 29 27
20250306 35 33
20250307 41 39
20250308 47 45
lishi zhengzhi
20250305 29 27
20250306 35 33
20250307 41 39
20250308 47 45
lishi zhengzhi
20250306 35 33
20250307 41 39
20250308 47 45
yuwen lishi
20250303 12 17
20250305 24 29
20250306 30 35
20250307 36 41
20250308 42 47
zhengzhi tiyu lishi
20250306 33 34 35
20250307 39 40 41
20250308 45 46 47
"""
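4. Selecting with iloc
- The iloc indexer selects by integer position
- Syntax: df.iloc[row selection, column selection]
- Compared with NumPy:
- Same: the indexing and slicing syntax matches NumPy arrays, and a slice includes the start and excludes the stop
- Different: with a list on each axis, iloc returns every listed row crossed with every listed column (the rows and columns need not be contiguous), whereas NumPy pairs the two lists up into individual points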
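A minimal sketch of both points: an iloc slice stops before the end position (unlike a loc slice, which includes the end label), and two lists select a block of cells. On the NumPy side, np.ix_ is used here as one way to build the same block from a plain array.

```python
import numpy as np
import pandas as pd

a = np.arange(48).reshape(8, 6)
subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(a, index=dates, columns=subjects)

# Positions 2 and 3 by iloc are the same rows as labels 20250303..20250304 by loc.
by_position = df.iloc[2:4]
by_label = df.loc["20250303":"20250304"]
print(by_position.equals(by_label))

# Lists on both axes: rows 0 and 3 crossed with columns 1 and 4.
block = df.iloc[[0, 3], [1, 4]]
print(block.to_numpy())

# The same block from the plain array needs np.ix_.
print(a[np.ix_([0, 3], [1, 4])])
```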
(1) Selecting a single row or column
print(df)
print(df.iloc[2])
print(df.iloc[:,[1]])
print(df.iloc[:,1]) # unlike the list form above, this returns a Series, so there is no column header
print(df.iloc[:,-1])
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
yuwen 12
shuxue 13
yingyu 14
zhengzhi 15
tiyu 16
lishi 17
Name: 20250303, dtype: int32
shuxue
20250301 1
20250302 7
20250303 13
20250304 19
20250305 25
20250306 31
20250307 37
20250308 43
20250301 1
20250302 7
20250303 13
20250304 19
20250305 25
20250306 31
20250307 37
20250308 43
Name: shuxue, dtype: int32
20250301 5
20250302 11
20250303 17
20250304 23
20250305 29
20250306 35
20250307 41
20250308 47
Name: lishi, dtype: int32
"""
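(2) Selecting multiple rows and columns
- A slice includes the start position and excludes the stop position
print(df)
print(df.iloc[2:4]) # positions 2 and 3 (stop excluded), i.e. the 3rd and 4th rows
print(df.iloc[:,1:4]) # positions 1 to 3 (stop excluded), i.e. the 2nd to 4th columns
print('*'*50)
print(df.iloc[[0,3],1:4])
print(df.iloc[[0,3],[1,4]]) # the 1st and 4th rows, the 2nd and 5th columns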
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250301 0 1 2 3 4 5
20250302 6 7 8 9 10 11
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
20250305 24 25 26 27 28 29
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
yuwen shuxue yingyu zhengzhi tiyu lishi
20250303 12 13 14 15 16 17
20250304 18 19 20 21 22 23
shuxue yingyu zhengzhi
20250301 1 2 3
20250302 7 8 9
20250303 13 14 15
20250304 19 20 21
20250305 25 26 27
20250306 31 32 33
20250307 37 38 39
20250308 43 44 45
**************************************************
shuxue yingyu zhengzhi
20250301 1 2 3
20250304 19 20 21
shuxue tiyu
20250301 1 4
20250304 19 22
"""
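III. Assigning and modifying values
- To change data, assign directly to a selection, such as a column or a loc/iloc selection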
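A minimal sketch of the common assignment forms, on a freshly built copy of the same table. The replacement values (100, 99, 88, 40) are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=dates, columns=subjects)

# Overwrite a whole column with one value.
df["tiyu"] = 100

# Overwrite a single cell by label, and another by position.
df.loc["20250301", "yuwen"] = 99
df.iloc[1, 0] = 88

# Overwrite only the cells that match a condition: cap lishi at 40.
df.loc[df["lishi"] > 40, "lishi"] = 40

print(df)
```

Assigning through a single loc or iloc call is generally safer than the chained form df['yuwen']['20250301'] = ..., which may end up writing to a temporary copy instead of the original DataFrame.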
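IV. Sorting
- A DataFrame can be sorted in two ways: by its index or by its values
- df.sort_values(by=,ascending=): sort by one key or by several keys
- df.sort_index(): sort by the index
1. Sorting a DataFrame by values
# Syntax:
df.sort_values(by=,ascending=)
Parameters:
- by: the key or keys (column names) to sort by
- ascending: the sort direction; ascending by default
ascending=False: descending
ascending=True: ascending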
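With several keys, the second key only matters when the first key has ties, which the 8x6 table used in this post never has. The sketch below therefore uses a small hypothetical table with repeated yuwen values, and also shows that ascending accepts one direction per key.

```python
import pandas as pd

# Hypothetical scores with ties in "yuwen".
scores = pd.DataFrame(
    {"yuwen": [90, 85, 90, 85], "tiyu": [70, 60, 80, 95]},
    index=["a", "b", "c", "d"],
)

# Both keys descending: ties on yuwen are broken by tiyu, largest first.
both_desc = scores.sort_values(by=["yuwen", "tiyu"], ascending=False)
print(both_desc)

# One direction per key: yuwen descending, tiyu ascending within each tie.
mixed = scores.sort_values(by=["yuwen", "tiyu"], ascending=[False, True])
print(mixed)
```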
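# Example:
print(df.sort_values(by='yuwen',ascending=False).head(3)) # sort by the yuwen column from largest to smallest and show the first 3 rows
# Sort by several keys: rows are ordered by the first key, and ties on the first key are broken by the second key
print(df.sort_values(by=['yuwen','tiyu'],ascending=False))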
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250308 42 43 44 45 46 47
20250307 36 37 38 39 40 41
20250306 30 31 32 33 34 35
yuwen shuxue yingyu zhengzhi tiyu lishi
20250308 42 43 44 45 46 47
20250307 36 37 38 39 40 41
20250306 30 31 32 33 34 35
20250305 24 25 26 27 28 29
20250304 18 19 20 21 22 23
20250303 12 13 14 15 16 17
20250302 6 7 8 9 10 11
20250301 0 1 2 3 4 5
"""
2. Sorting a DataFrame by index
# Syntax:
df.sort_index() # sort by the row index in ascending order
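sort_index() also takes ascending and axis arguments. A minimal sketch on the same table: descending order by row label, and sorting the column labels instead of the rows with axis=1.

```python
import numpy as np
import pandas as pd

subjects = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
dates = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=dates, columns=subjects)

# Row labels from largest to smallest.
newest_first = df.sort_index(ascending=False)
print(newest_first.head(2))

# axis=1 sorts the column labels (alphabetically here) and leaves the rows alone.
by_column_name = df.sort_index(axis=1)
print(by_column_name.columns.tolist())
```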
# Example:
data = df.sort_values(by='yuwen',ascending=False).head(3)
print(data)
print(data.sort_index())
"""
yuwen shuxue yingyu zhengzhi tiyu lishi
20250308 42 43 44 45 46 47
20250307 36 37 38 39 40 41
20250306 30 31 32 33 34 35
yuwen shuxue yingyu zhengzhi tiyu lishi
20250306 30 31 32 33 34 35
20250307 36 37 38 39 40 41
20250308 42 43 44 45 46 47
"""
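3. Sorting a Series
- A Series can also be sorted by values or by index
- series.sort_values(ascending=True): a Series has only one column of values, so no by argument is needed
- series.sort_index(): sort by the index in ascending order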
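A minimal sketch of both Series methods. The Series `s` and its values are hypothetical, chosen so that sorting by value and sorting by index give different orders.

```python
import pandas as pd

# Hypothetical Series whose value order differs from its index order.
s = pd.Series([30, 10, 20], index=["b", "c", "a"])

# Sort by value; no "by" argument is needed for a Series.
by_value = s.sort_values()
print(by_value)

# Sort by value, largest first.
by_value_desc = s.sort_values(ascending=False)
print(by_value_desc)

# Sort by index label.
by_index = s.sort_index()
print(by_index)
```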
