Python pandas Primer (11): The pandas Index - 一只牧羊虎

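I. What the index is for

1. More convenient queries

2. Better query performance

3. Automatic alignment

4. Support for more powerful data structures

II. Examples

1. Convenient queries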

import pandas as pd
df=pd.read_csv('./ratings.csv')
df.head()
'''
userId    movieId    rating    timestamp
0    1    1    4.0    964982703
1    1    3    4.0    964981247
2    1    6    4.0    964982224
3    1    47    5.0    964983815
4    1    50    5.0    964982931'''
# Query using the index
# Replace the default index with the userId column
df.set_index('userId',inplace=True,drop=False) # drop=False keeps userId as a regular column as well
df.head()
'''
    userId    movieId    rating    timestamp
userId                
1    1    1    4.0    964982703
1    1    3    4.0    964981247
1    1    6    4.0    964982224
1    1    47    5.0    964983815
1    1    50    5.0    964982931
'''
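df.loc[df['userId']==500].head() # query by userId with a boolean mask, without the index (the output below shows the default integer index, so it was apparently captured before set_index)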
'''
    userId    movieId    rating    timestamp
79907    500    1    4.0    1005527755
79908    500    11    1.0    1005528017
79909    500    39    1.0    1005527926
79910    500    101    1.0    1005527980
79911    500    104    4.0    1005528065
'''
df.loc[500].head()  # query through the index: shorter and simpler
'''
    userId    movieId    rating    timestamp
userId                
500    500    1    4.0    1005527755
500    500    11    1.0    1005528017
500    500    39    1.0    1005527926
500    500    101    1.0    1005527980
500    500    104    4.0    1005528065
'''
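
Since ratings.csv may not be at hand, here is a minimal self-contained sketch of the same comparison. The small DataFrame is a hypothetical stand-in that only borrows the column names; it is not the real data.

```python
import pandas as pd

# Hypothetical stand-in for ratings.csv (same column names, made-up rows)
df = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 3, 1, 6, 47],
    "rating": [4.0, 4.0, 3.5, 5.0, 2.0],
})

# Boolean-mask query on the column, no index involved
by_mask = df.loc[df["userId"] == 2]

# Label-based query after making userId the index
indexed = df.set_index("userId", drop=False)
by_label = indexed.loc[2]

# Both approaches select the same rows; only the row labels differ
same = by_mask.reset_index(drop=True).equals(by_label.reset_index(drop=True))
print(same)
```

One thing to watch for: when a label occurs only once (userId 3 here), `indexed.loc[3]` returns a Series rather than a DataFrame. Passing a list, as in `indexed.loc[[3]]`, always gives a DataFrame.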

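2. Better performance

The lookup cost depends on whether the index is unique and whether it is sorted; the three cases are listed at the end of this post.

3. Automatic alignment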

s1=pd.Series([1,2,3],index=list('abc'))
'''
a    1
b    2
c    3
dtype: int64
'''
s2=pd.Series([2,3,4],index=list('bcd'))
'''
b    2
c    3
d    4
dtype: int64
'''
s1+s2
'''
a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
'''

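The labels b and c exist in both Series, so they are aligned automatically and added. The labels a and d each exist in only one Series, so they cannot be aligned and the result is filled with NaN.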
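
If NaN is not wanted for the unmatched labels, one option is `Series.add` with `fill_value`. The sketch below reuses s1 and s2 from above; treating a missing value as 0 is an assumption made for illustration, not something the original example requires.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# A label missing from one Series is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```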

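4. Support for more powerful data structures

1) CategoricalIndex: an index based on categorical data, which can improve performance

2) MultiIndex: a multi-level index, used for example for multi-dimensional aggregation with groupby

3) DatetimeIndex: a time-based index with powerful date and time methods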
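
A brief sketch of the three index types listed above. All data and variable names are made up for illustration, and each snippet shows only one typical use.

```python
import pandas as pd

# CategoricalIndex: labels drawn from a small fixed set of categories
cat_s = pd.Series([10, 20, 30], index=pd.CategoricalIndex(["low", "high", "low"]))
low_values = cat_s.loc["low"]

# MultiIndex: grouping by two keys yields a two-level index
df = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [1, 1, 1, 3],
    "rating": [4.0, 5.0, 3.0, 2.0],
})
mean_rating = df.groupby(["userId", "movieId"])["rating"].mean()
user1_movie1 = mean_rating.loc[(1, 1)]  # look up with a tuple of labels

# DatetimeIndex: select a whole month with a partial date string
ts = pd.Series(range(60), index=pd.date_range("2023-01-01", periods=60, freq="D"))
january = ts.loc["2023-01"]

print(list(low_values), user1_movie1, len(january))
```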

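How the index affects lookup performance (point 2 above):

1) If the index values are unique, pandas uses a hash table, and a lookup costs O(1)

2) If the index values are not unique but are sorted, pandas uses binary search, and a lookup costs O(log N)

3) If the index values are neither unique nor sorted, every lookup scans the whole index, and a lookup costs O(N)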
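
To see which of the three cases an index falls into, its `is_unique` and `is_monotonic_increasing` properties can be inspected. The sketch below uses three small made-up indexes and only reports the properties; it does not measure timings.

```python
import pandas as pd

unique_idx = pd.Index([3, 1, 2])         # unique labels
sorted_dup_idx = pd.Index([1, 1, 2, 3])  # duplicates, but sorted
random_dup_idx = pd.Index([2, 1, 3, 1])  # duplicates and unsorted

for idx in (unique_idx, sorted_dup_idx, random_dup_idx):
    print(idx.is_unique, idx.is_monotonic_increasing)

# Sorting a non-unique, unsorted index moves it from case 3 to case 2
s = pd.Series([10, 20, 30, 40], index=random_dup_idx).sort_index()
print(s.index.is_monotonic_increasing)
```

So when a frequently queried index has duplicates and is not sorted, calling `sort_index()` once before the queries is a cheap way to avoid the full scan.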

Posted 2023-01-28 21:13 by 一只牧羊虎