Pandas_Index

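# 1) Query data through the index. With drop=False, the index column is also kept as a regular column:
df.set_index('userId', inplace=True, drop=False)		# Set userId as the DataFrame index; drop=False keeps the column in the data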

# 2) Index-based query: the first 4 rows whose userId is 1. The first argument of loc is the index label:
df.loc[1].head(4)

# 3) Query with a condition on a column:
df.loc[df['userId'] == 1].head()
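
The two query styles above can be compared on a small table. This is a minimal sketch: the sample data and the columns movieId and rating are hypothetical and only stand in for the real df.

```python
import pandas as pd

# Hypothetical sample data standing in for the real df
df = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# drop=False keeps userId as a column in addition to using it as the index
df.set_index("userId", inplace=True, drop=False)

by_index = df.loc[1]                   # label lookup on the index
by_column = df.loc[df["userId"] == 1]  # boolean mask on the column

# Both approaches select the same rows
same = by_index.equals(by_column)
print(same)
```

Because label 1 appears more than once in this sample, df.loc[1] returns a DataFrame. With a unique index it would return a single row as a Series.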

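# 4) Using the index improves query performance
If the index is unique, Pandas uses a hash table, so a lookup is O(1);
If the index is not unique but is sorted, Pandas uses binary search, so a lookup is O(log N);
If the index is in completely random order, every lookup has to scan the whole table, so a lookup is O(N);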

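from sklearn.utils import shuffle
df_shuffle = shuffle(df)                        # Shuffle the rows into random order
df_shuffle.index.is_monotonic_increasing        # Check whether the index is sorted in increasing order
df_shuffle.index.is_unique                      # Check whether the index is unique
%timeit df_shuffle.loc[1]                       # Time the lookup of userId 1; %timeit is an IPython magic command that runs the statement many times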
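
Outside IPython, the same comparison can be sketched with the standard timeit module. The data size, the key 500, and the three Series below are assumptions made for illustration. pandas' own sample method stands in for sklearn's shuffle here, and the timings depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # hypothetical data size
key = 500    # hypothetical label to look up

# Unique and sorted index
unique_idx = pd.Series(np.arange(n), index=np.arange(n))
# Non-unique but sorted index: every label appears 4 times
sorted_dup = pd.Series(np.arange(n), index=np.arange(n) // 4)
# Non-unique index in random order
shuffled_dup = sorted_dup.sample(frac=1, random_state=0)

cases = [
    ("unique", unique_idx),
    ("sorted, non-unique", sorted_dup),
    ("shuffled, non-unique", shuffled_dup),
]
for name, s in cases:
    seconds = timeit.timeit(lambda: s.loc[key], number=100)
    print(name, s.index.is_unique, s.index.is_monotonic_increasing, seconds)
```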

# 5) The index aligns data automatically
s1 = pd.Series([1,2,3], index=list('abc'))       # Specify the index of the Series
s2 = pd.Series([2,3,4], index=list('bcd'))
s1 + s2                                         # a and d have no matching label in the other Series
# Result:
    # a    NaN
    # b    4.0
    # c    6.0
    # d    NaN
    # dtype: float64
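
If the NaN values for unmatched labels are not wanted, one option is Series.add with fill_value, which treats a label missing from one side as the given value. This is a small sketch of that alternative, not something the original notes use:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# A label missing from one Series is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
# a    1.0
# b    4.0
# c    6.0
# d    4.0
# dtype: float64
```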

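# 6) The index supports more, and more powerful, data structures
# Many powerful index data structures are available:
# CategoricalIndex: an Index based on categorical data, which improves performance
# MultiIndex: a multi-level index, used for example for the result of a groupby aggregation over several keys
# DatetimeIndex: an index of datetime values, with rich support for date and time methods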
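
The three index types listed above can each be built directly. The labels and values in this sketch are made up for illustration:

```python
import pandas as pd

# CategoricalIndex: labels drawn from a fixed set of categories
cat_idx = pd.CategoricalIndex(["low", "high", "low"], categories=["low", "high"])

# MultiIndex: two levels, similar to what a groupby over two keys produces
multi_idx = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("B", 1)], names=["group", "item"]
)
s = pd.Series([10, 20, 30], index=multi_idx)
group_a = s.loc["A"]  # every row under the outer label "A"

# DatetimeIndex: three consecutive days
date_idx = pd.date_range("2022-03-01", periods=3, freq="D")
ts = pd.Series([1, 2, 3], index=date_idx)
march = ts.loc["2022-03"]  # partial-string indexing selects the whole month

print(group_a.tolist(), len(march))
```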
posted @ 2022-03-08 16:20 aall_blue