Basic Data Operations in pandas

I. Deleting from a DataFrame

  • There are two kinds of deletion: dropping rows or columns, and removing duplicates
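Row and column deletion is covered in detail in this section; deduplication is only named. As a minimal sketch of the second kind, assuming a small made-up `scores` table (not part of the original data), `drop_duplicates` could be used like this:

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly, and "li" appears three times.
scores = pd.DataFrame({
    "name": ["li", "wang", "li", "li"],
    "yuwen": [90, 85, 90, 70],
})

# Drop rows that are identical in every column (the first occurrence is kept).
unique_rows = scores.drop_duplicates()

# Compare only the "name" column and keep the last occurrence of each name.
last_per_name = scores.drop_duplicates(subset=["name"], keep="last")

print(unique_rows)
print(last_per_name)
```

Like drop(), drop_duplicates returns a new DataFrame by default and leaves the original untouched.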

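1. drop(): deleting rows and columns

# Syntax:
df.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

# Parameters:
labels: the row labels or column names to delete, passed as a single label or a list-like. To delete columns this way, also set axis to 1 or 'columns'.
axis: whether to delete rows or columns. 0 or 'index' deletes rows, 1 or 'columns' deletes columns. The default is 0.
index: the rows to delete; equivalent to passing labels with axis set to 0 or 'index'.
columns: the columns to delete; equivalent to passing labels with axis set to 1 or 'columns'.
level: for a MultiIndex, the level from which the labels are removed, given as the level's position or name.
inplace: whether to modify the DataFrame itself. The default is False: the deletion is done on a copy, and the resulting DataFrame is returned. If set to True, the deletion is done on the DataFrame that drop() was called on, and the return value is None.
errors: whether to raise on missing labels, one of {'ignore', 'raise'}. The default 'raise' raises an error. 'ignore' suppresses it: labels that do not exist are skipped, and the labels that do exist are still deleted.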


# Example:
import pandas as pd
import numpy as np

a = np.arange(48).reshape(8,6)
# Build the column labels (subjects)
xueke =["yuwen","shuxue","yingyu","zhengzhi","tiyu","lishi"]
# Build the row labels (dates)
m_time =['2025030'+str(i) for i in range(1,a.shape[0]+1)]
df = pd.DataFrame(a,index=m_time,columns=xueke)
print(df)

# data = df.drop(["yuwen","shuxue"])  # raises KeyError: with the default axis=0 the labels are looked up in the row index, which does not contain them
data = df.drop(["yuwen","shuxue"],axis=1)
print(data)

data1 = df.drop(["20250301","20250302"],axis=0)
print(data1)

"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47
          yingyu  zhengzhi  tiyu  lishi
20250301       2         3     4      5
20250302       8         9    10     11
20250303      14        15    16     17
20250304      20        21    22     23
20250305      26        27    28     29
20250306      32        33    34     35
20250307      38        39    40     41
20250308      44        45    46     47
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47
"""
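The example above only uses labels together with axis. As a minimal sketch of the other parameters described earlier (index, columns, errors, inplace), rebuilding the same df so the block runs on its own; the column name "wuli" is made up here to stand in for a label that does not exist:

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)

# index= and columns= replace the labels/axis pair and can be combined in one call.
by_keyword = df.drop(index=["20250301"], columns=["yuwen", "shuxue"])

# errors="ignore" skips the missing label "wuli" and still drops "yuwen".
tolerant = df.drop(columns=["yuwen", "wuli"], errors="ignore")

# inplace=True changes copy_df itself and returns None.
copy_df = df.copy()
result = copy_df.drop(index=["20250308"], inplace=True)

print(by_keyword.shape, tolerant.shape, copy_df.shape, result)
```

Because inplace=True returns None, writing df = df.drop(..., inplace=True) would replace df with None; reassigning the result of a normal drop() call avoids that trap.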

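II. Indexing and selecting values in a DataFrame

  • NumPy supports selection by index and by slice; pandas supports similar operations, and it can also select directly by column name, by row label, or by a combination of the two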

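1. Selecting directly by column and row label (column first, then row)

  • Note: the column comes first, then the row
print(df['yuwen']['20250301'])  # prints: 0

# Unsupported operations (commented out because they raise)
# Error 1: row first, then column is not supported (raises KeyError)
# df['20250301']['yuwen']
# Error 2: NumPy-style positional indexing on both axes is not supported
# df[:1,:2]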
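To make the rules for plain square brackets concrete, here is a small sketch on the same df (rebuilt so it runs on its own). It shows what df[...] does accept: a column label, a list of column labels, or a row slice.

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)

# Column first, then row: df["yuwen"] is a Series indexed by the row labels.
value = df["yuwen"]["20250301"]

# Row label first fails, because a single label inside df[...] is looked up in the columns.
try:
    df["20250301"]["yuwen"]
    row_first_failed = False
except KeyError:
    row_first_failed = True

# A plain slice inside df[...] selects rows by position.
first_two_rows = df[:2]

# A list of labels inside df[...] selects columns.
two_columns = df[["yuwen", "lishi"]]

print(value, row_first_failed, first_two_rows.shape, two_columns.shape)
```

For anything that needs rows and columns at once, loc and iloc are the clearer choice.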

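2. Getting the position of a label

  • Syntax: get_indexer(), called on df.index or df.columns with a list of labels

  • A label that does not exist returns -1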

print(df)

a = df.columns.get_indexer(["yuwen", "lishi"])  # column positions of the given labels
print(a)
b = df.index.get_indexer(["20230302"])  # a label that does not exist returns -1
print(b)
c = df.index.get_indexer(["20250302"])  # row position of the given label
print(c)
print(c)

"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47
[0 5]
[-1]
[1]
"""
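One common use of these positions is to feed them to iloc. The following is a sketch on the same df (rebuilt here); the filtering step is one possible way to handle -1, not the only one.

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)

# Translate labels into integer positions, then select by position.
rows = df.index.get_indexer(["20250302", "20250304"])
cols = df.columns.get_indexer(["yuwen", "lishi"])
subset = df.iloc[rows, cols]

# A missing label yields -1, which iloc would read as "the last row",
# so drop those positions before indexing.
maybe = df.index.get_indexer(["20250302", "20230302"])
valid = maybe[maybe != -1]
safe_subset = df.iloc[valid]

# For a single label, get_loc returns one position and raises KeyError if it is missing.
pos = df.columns.get_loc("tiyu")

print(subset)
print(safe_subset.index.tolist(), pos)
```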

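3. Selecting with loc

  • loc selects data by label, by boolean mask, or by condition; it is one of the most commonly used ways to select data
  • Syntax: df.loc[row selection, column selection]
  • Compared with NumPy
    • Same: the indexing and slicing syntax matches that of NumPy arrays
    • Different: a slice includes both of its endpoints (closed on both sides)
    • Different: when lists are passed for both rows and columns, the result is every listed row crossed with every listed column, whereas NumPy pairs the two lists up into individual points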
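The two differences from NumPy can be checked side by side. This is a small sketch on the same df (rebuilt here), with `arr` holding the underlying values as a NumPy array:

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)
arr = df.to_numpy()

# Slices: NumPy excludes the stop position, loc includes the stop label.
numpy_slice = arr[0:2]                         # rows at positions 0 and 1
loc_slice = df.loc["20250301":"20250303"]      # three rows, both end labels included

# Lists on both axes: NumPy pairs them element by element,
# loc returns the grid of listed rows by listed columns.
numpy_points = arr[[0, 2], [0, 3]]             # elements (0, 0) and (2, 3)
loc_grid = df.loc[["20250301", "20250303"], ["yuwen", "zhengzhi"]]

print(numpy_slice.shape, loc_slice.shape)
print(numpy_points)
print(loc_grid)
```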

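(1) Selecting a single row or column by label

  • Works the same way as indexing a NumPy array
print(df)
print('pppppppppppppppp')
print(df.loc['20250301']) # select the first row
print('pppppppppppppppp')
print(df.loc[:,['yuwen']]) # select the yuwen column as a one-column DataFrame
print(df.loc[:,'yuwen']) # select the yuwen column as a Series: without the brackets the result has no column header, only the row index and the values (the name yuwen is kept as the Series name)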

"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47
pppppppppppppppp
yuwen       0
shuxue      1
yingyu      2
zhengzhi    3
tiyu        4
lishi       5
Name: 20250301, dtype: int32
pppppppppppppppp
          yuwen
20250301      0
20250302      6
20250303     12
20250304     18
20250305     24
20250306     30
20250307     36
20250308     42

20250301     0
20250302     6
20250303    12
20250304    18
20250305    24
20250306    30
20250307    36
20250308    42
Name: yuwen, dtype: int32
"""

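(2) Selecting multiple rows and columns by label

  • The slicing syntax matches NumPy arrays. Note the difference: in loc both ends of a slice are included, whereas a NumPy slice includes the start and excludes the stop
print(df)
print('pppppppppppppppp')
print(df.loc['20250301':'20250303',:])  # select consecutive rows
print(df.loc[['20250301','20250303'],['yuwen']])   # select non-consecutive rows
print(df.loc[:,'yuwen':'zhengzhi'])  # select consecutive columns; both ends of the slice are included
print(df.loc[:,['yuwen','zhengzhi']])  # select non-consecutive columns
print(df.loc['20250301':'20250303',['yuwen','zhengzhi']])  # select rows and columns together; both ends of the slice are included
print('*'*50)
print(df.index[0:2])
print(df.loc[df.index[0:2],['yuwen','zhengzhi']])  # select rows and columns, with the row labels taken from df.index by position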
"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47
pppppppppppppppp
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17

          yuwen
20250301      0
20250303     12

          yuwen  shuxue  yingyu  zhengzhi
20250301      0       1       2         3
20250302      6       7       8         9
20250303     12      13      14        15
20250304     18      19      20        21
20250305     24      25      26        27
20250306     30      31      32        33
20250307     36      37      38        39
20250308     42      43      44        45

          yuwen  zhengzhi
20250301      0         3
20250302      6         9
20250303     12        15
20250304     18        21
20250305     24        27
20250306     30        33
20250307     36        39
20250308     42        45

          yuwen  zhengzhi
20250301      0         3
20250302      6         9
20250303     12        15
**************************************************
Index(['20250301', '20250302'], dtype='object')
          yuwen  zhengzhi
20250301      0         3
20250302      6         9
"""

(3) Selecting by condition

  • Note: in the expressions below, the condition before the comma selects rows, and the part after the comma selects columns
  • ~ means "not" (negation).
  • | means "or".
  • & means "and".
print('*'*50)
print(df)
print(df.loc[df.yuwen > 20,['lishi','zhengzhi']])  # rows where yuwen > 20; columns lishi and zhengzhi
print(df.loc[df['yuwen'] > 20,['lishi','zhengzhi']])  # same result as df.yuwen > 20 above
print(df.loc[(df.yuwen > 20) & (df.lishi>30),['lishi','zhengzhi']])  # & combines the conditions with "and"
print(df.loc[(df.yuwen > 20) | (df.yingyu == 14),['yuwen','lishi']])  # | combines the conditions with "or"
print(df.loc[(df.yuwen > 20) & (df.lishi>30),'zhengzhi':'lishi'])  # a condition for the rows, a label slice for the columns

"""
**************************************************
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47

          lishi  zhengzhi
20250305     29        27
20250306     35        33
20250307     41        39
20250308     47        45

          lishi  zhengzhi
20250305     29        27
20250306     35        33
20250307     41        39
20250308     47        45

          lishi  zhengzhi
20250306     35        33
20250307     41        39
20250308     47        45

          yuwen  lishi
20250303     12     17
20250305     24     29
20250306     30     35
20250307     36     41
20250308     42     47

          zhengzhi  tiyu  lishi
20250306        33    34     35
20250307        39    40     41
20250308        45    46     47
"""
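The ~ operator is listed above but not used in the example. Here is a short sketch on the same df (rebuilt here) that adds negation and, as one further option, the isin method for matching against a list of values:

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)

# ~ inverts a boolean mask: rows where yuwen is NOT greater than 20.
not_gt_20 = df.loc[~(df["yuwen"] > 20), ["yuwen", "lishi"]]

# Each comparison needs its own parentheses, because & | ~ bind tighter than > and ==.
middle = df.loc[(df["yuwen"] > 10) & ~(df["lishi"] > 30), "yuwen"]

# isin builds a mask from a list of allowed values.
picked = df.loc[df["yingyu"].isin([2, 14]), :]

print(not_gt_20)
print(middle)
print(picked)
```

Python's own and, or, and not do not work on a whole Series and raise a ValueError, which is why the element-wise operators are used instead.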

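4. Selecting with iloc

  • The iloc indexer selects by integer position
  • Syntax: df.iloc[row selection, column selection]
  • Compared with NumPy
    • Same: the indexing and slicing syntax matches that of NumPy arrays, and slices are half-open (start included, stop excluded)
    • Different: when lists are passed for both rows and columns, the result is every listed row crossed with every listed column, whereas NumPy pairs the two lists up into individual points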

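(1) Selecting a single row or column

print(df)
print(df.iloc[2])  # the row at position 2, i.e. the 3rd row
print(df.iloc[:,[1]])  # the column at position 1 as a one-column DataFrame
print(df.iloc[:,1])  # compare with the bracketed form above: without the brackets the result is a Series with no column header
print(df.iloc[:,-1])  # the last column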
"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47

yuwen       12
shuxue      13
yingyu      14
zhengzhi    15
tiyu        16
lishi       17
Name: 20250303, dtype: int32

          shuxue
20250301       1
20250302       7
20250303      13
20250304      19
20250305      25
20250306      31
20250307      37
20250308      43

20250301     1
20250302     7
20250303    13
20250304    19
20250305    25
20250306    31
20250307    37
20250308    43
Name: shuxue, dtype: int32

20250301     5
20250302    11
20250303    17
20250304    23
20250305    29
20250306    35
20250307    41
20250308    47
Name: lishi, dtype: int32
"""

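(2) Selecting multiple rows and columns

  • Slices are half-open: the start position is included and the stop position is excluded
print(df)
print(df.iloc[2:4])  # half-open: positions 2 and 3, i.e. the 3rd and 4th rows
print(df.iloc[:,1:4])  # half-open: positions 1 to 3, i.e. the 2nd to 4th columns
print('*'*50)
print(df.iloc[[0,3],1:4])
print(df.iloc[[0,3],[1,4]])  # the 1st and 4th rows, the 2nd and 5th columns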

"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250301      0       1       2         3     4      5
20250302      6       7       8         9    10     11
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
20250305     24      25      26        27    28     29
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250303     12      13      14        15    16     17
20250304     18      19      20        21    22     23
          shuxue  yingyu  zhengzhi
20250301       1       2         3
20250302       7       8         9
20250303      13      14        15
20250304      19      20        21
20250305      25      26        27
20250306      31      32        33
20250307      37      38        39
20250308      43      44        45
**************************************************
          shuxue  yingyu  zhengzhi
20250301       1       2         3
20250304      19      20        21
          shuxue  tiyu
20250301       1     4
20250304      19    22
"""
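Since loc and iloc treat the end of a slice differently, it can help to see the two select the same rows. This sketch uses the same df (rebuilt here) and also shows negative positions and steps, which iloc handles the way Python lists do:

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)

# iloc stops before position 4; loc includes the end label.
by_position = df.iloc[2:4]
by_label = df.loc["20250303":"20250304"]
same = by_position.equals(by_label)

# Negative positions count from the end; a step skips entries.
last_two_cols = df.iloc[:, -2:]
every_other_row = df.iloc[::2, 0]

print(same)
print(last_two_cols.columns.tolist())
print(every_other_row.tolist())
```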

III. Assigning and modifying values

  • Select the values with any of the methods above and assign to the selection directly
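As a minimal sketch of what that looks like in practice, on the same df (rebuilt here); the column name "total" is made up for illustration:

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)

# Set a single cell by label.
df.loc["20250301", "yuwen"] = 100

# Assigning to a column name that does not exist yet creates the column.
df["total"] = df[["yuwen", "shuxue"]].sum(axis=1)

# Set one column on the rows matched by a condition.
df.loc[df["lishi"] > 40, "tiyu"] = 0

# Set a single cell by position.
df.iloc[1, 1] = -1

print(df)
```

Assigning through a single loc or iloc call is the safer habit. The chained form df['yuwen']['20250301'] = 100 may write to a temporary object instead of df, and pandas warns about it.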

IV. Sorting

  • A DataFrame can be sorted in two ways: by its index or by its values
    • df.sort_values(by=, ascending=): sort by one key or by several keys
    • df.sort_index(): sort by the index

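1. Sorting a DataFrame by values

# Syntax:
df.sort_values(by=,ascending=)
# Parameters:

 - by: the key (column name) or list of keys to sort by
 - ascending: the sort direction, ascending by default
    ascending=False: descending
    ascending=True: ascending
    
# Example:
print(df.sort_values(by='yuwen',ascending=False).head(3))  # sort by the yuwen column from largest to smallest and show the first 3 rows

# Sort by several keys: rows are ordered by the first key, and ties in the first key are broken by the second key
print(df.sort_values(by=['yuwen','tiyu'],ascending=False))  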

"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250308     42      43      44        45    46     47
20250307     36      37      38        39    40     41
20250306     30      31      32        33    34     35

          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250308     42      43      44        45    46     47
20250307     36      37      38        39    40     41
20250306     30      31      32        33    34     35
20250305     24      25      26        27    28     29
20250304     18      19      20        21    22     23
20250303     12      13      14        15    16     17
20250302      6       7       8         9    10     11
20250301      0       1       2         3     4      5
"""    
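In the df above every yuwen value is distinct, so the second key never comes into play. The sketch below uses a small made-up `scores` table with ties to show the tie-breaking, and also passes one ascending flag per key:

```python
import pandas as pd

# Hypothetical scores with ties in "yuwen" so that the second key matters.
scores = pd.DataFrame(
    {"yuwen": [90, 85, 90, 85], "tiyu": [70, 95, 80, 60]},
    index=["a", "b", "c", "d"],
)

# Both keys descending: ties in yuwen are broken by tiyu, larger first.
both_desc = scores.sort_values(by=["yuwen", "tiyu"], ascending=False)

# One flag per key: yuwen descending, tiyu ascending.
mixed = scores.sort_values(by=["yuwen", "tiyu"], ascending=[False, True])

print(both_desc)
print(mixed)
```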

2. Sorting a DataFrame by index

# Syntax:
df.sort_index()  # sort by the row index in ascending order

# Example:
data = df.sort_values(by='yuwen',ascending=False).head(3)
print(data)
print(data.sort_index())

"""
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250308     42      43      44        45    46     47
20250307     36      37      38        39    40     41
20250306     30      31      32        33    34     35
          yuwen  shuxue  yingyu  zhengzhi  tiyu  lishi
20250306     30      31      32        33    34     35
20250307     36      37      38        39    40     41
20250308     42      43      44        45    46     47
"""
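sort_index also accepts ascending and axis. A brief sketch on the same df (rebuilt here), sorting the rows newest first and the columns alphabetically by name:

```python
import numpy as np
import pandas as pd

xueke = ["yuwen", "shuxue", "yingyu", "zhengzhi", "tiyu", "lishi"]
m_time = ["2025030" + str(i) for i in range(1, 9)]
df = pd.DataFrame(np.arange(48).reshape(8, 6), index=m_time, columns=xueke)

# Sort the row index in descending order.
newest_first = df.sort_index(ascending=False)

# axis=1 sorts the column labels instead of the row labels.
by_column_name = df.sort_index(axis=1)

print(newest_first.index.tolist())
print(by_column_name.columns.tolist())
```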

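3. Sorting a Series

  • A Series can also be sorted by values or by index
    • series.sort_values(ascending=True): a Series has only one column of values, so no by key is needed
    • series.sort_index(): sort by the index in ascending order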
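A short sketch of both calls, using a small made-up Series `s`:

```python
import pandas as pd

# Hypothetical Series with unordered values and labels.
s = pd.Series([12, 0, 18, 6], index=["c", "a", "d", "b"])

# Sort by values, largest first; no "by" argument is needed.
by_value = s.sort_values(ascending=False)

# Sort by index labels in ascending order.
by_index = s.sort_index()

print(by_value)
print(by_index)
```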
posted @ 2025-03-18 15:36  BigSun丶