Pandas Performance Optimization: Tips to Make pandas Run Faster

Reference article: Top 5 tips to make your pandas code absurdly fast

1. Use vectorized computation

Vectorized computation can deliver a large performance improvement over an ordinary loop.

import os
import re
import time
import random
from functools import lru_cache
 
import numpy as np
import pandas as pd
 
# Uncomment the two lines below when parallel computation is needed
# from pandarallel import pandarallel
# pandarallel.initialize(nb_workers=min(os.cpu_count(), 12), progress_bar=True)
 
 
def time_compare(func1, func2, times=10, *args):
    # Run func1 and func2 `times` times each and report how many times faster func1 is than func2.
    start = time.time()
    for i in range(times):
        func1(*args)
    end = time.time()
    for i in range(times):
        func2(*args)
    end2 = time.time()
    print("%s is faster than %s by a factor of %s!" % (func1.__name__, func2.__name__, (end2 - end) / (end - start)))
 
 
def count_loop(x_df, target: int) -> int:
    # Count with a plain Python loop how many rows in x_df's column_1 equal target.
    return sum(x == target for x in x_df['column_1'])
 
 
def count_vectorized(x_df, target: int) -> int:
    # Count the same thing with a vectorized comparison.
    return (x_df['column_1'] == target).sum()
 
 
def test_demo1():
    import random
    df = pd.DataFrame()
    df['column_1'] = [random.randint(1, 10) for i in range(1000000)]
    print(df)
 
    count_loop(df, 5)
    count_vectorized(df, 5)
    time_compare(count_vectorized, count_loop, 10, df, 5)

The output is:

  count_vectorized is faster than count_loop by a factor of 68.43278322570357!

If test_demo1() is changed to the following:

def test_demo1():
    import random
    df = pd.DataFrame()
    df['column_1'] = [random.randint(1, 10) for i in range(600)]
    time_compare(count_vectorized, count_loop, 1000, df, 5)

With a smaller DataFrame and more iterations, the output becomes:
  count_vectorized is faster than count_loop by a factor of 0.9146907454143715!

Modifying the code further:

def test_demo1():
    import random
    df = pd.DataFrame()
    df['column_1'] = [random.randint(1, 10) for i in range(100)]
    time_compare(count_vectorized, count_loop, 1000, df, 5)

The output is:

  count_vectorized is faster than count_loop by a factor of 0.19698566724982886!

As the results show, code performance depends on the machine's hardware, but it is also clear that when the dataset is small enough, vectorization is not necessarily faster than a for loop and can even be slower.

If the size of df is reduced further, vectorized computation keeps losing ground, eventually dropping to only about one tenth of the for loop's performance.

In short: with a large dataset, vectorization is significantly faster than a for loop, but with a small dataset the for loop may win. Keep in mind during development that pandas vectorization has its own range of applicability.
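
On a given machine the break-even point can be located empirically. Below is a minimal sketch of my own (not from the original article) that uses the standard library's timeit together with the count_loop and count_vectorized functions above; the size grid is arbitrary:

import timeit

def find_crossover():
    # Time both implementations over increasing row counts and print the loop/vectorized time ratio.
    for n in (100, 1000, 10000, 100000, 1000000):
        df = pd.DataFrame({'column_1': [random.randint(1, 10) for _ in range(n)]})
        t_loop = timeit.timeit(lambda: count_loop(df, 5), number=20)
        t_vec = timeit.timeit(lambda: count_vectorized(df, 5), number=20)
        print(f"n={n}: loop/vectorized time ratio = {t_loop / t_vec:.2f}")

A ratio above 1 means the vectorized version wins at that size.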


Another demo comparing vectorization with a for loop:

def offset_loop(x_df, days: int) -> pd.DataFrame:
    # Add `days` days to column_10 using a loop.
    d = pd.Timedelta(days=days)
    x_df['column_const'] = [x + d for x in x_df['column_10']]
    return x_df
     
     
def offset_vectorized(x_df, days: int) -> pd.DataFrame:
    # Add `days` days to column_10 with a vectorized Timedelta addition.
    x_df['column_const'] = x_df['column_10'] + pd.Timedelta(days=days)
    return x_df
     
     
def test_demo2():
    import datetime
    df = pd.DataFrame()
    df['column_10'] = pd.date_range(datetime.date(1900, 1, 1), datetime.date(2025, 1, 1))
    time_compare(offset_vectorized, offset_loop, 10, df, 5)

The result is:

  offset_vectorized is faster than offset_loop by a factor of 410.74743248383413!

 

2. Use a suitable iteration method to improve performance

The example uses a string-filtering routine to examine how different iteration methods affect performance.

The core filtering function is:

def remove_words(remove_from: str, words_to_remove: str, min_include_word_length: int = 4, rr=re) -> str:
    """
    从remove_from字符串中移除words_to_remove中的字符串。
    """
    words_to_exclude = set(words_to_remove.split(" "))
    no_html = rr.sub("<.*?>", " ", remove_from)
    include_words = [
        x
        for x in rr.findall(r"\w+", no_html)
        if (len(x) >= min_include_word_length) and (x not in words_to_exclude)
    ]
    return " ".join(include_words)

 

Comparing the different implementations

The implementations are as follows:

1. A plain for loop

def loop(df: pd.DataFrame, remove_col: str, words_to_remove_col: str, *args) -> list[str]:
    # Implemented with a for loop and df.iat indexing.
    res = []
    i_remove_col = df.columns.get_loc(remove_col)
    i_words_to_remove_col = df.columns.get_loc(words_to_remove_col)
    for i_row in range(df.shape[0]):
        res.append(
            remove_words(
                df.iat[i_row, i_remove_col], df.iat[i_row, i_words_to_remove_col]
            )
        )
    return res

2. apply

def apply(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Implemented with DataFrame.apply.
    return df.apply(
        func=lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()

3. Column subset + apply

def apply_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Select only the needed columns first, then use apply.
    return df[[remove_col, words_to_remove_col]].apply(
        func=lambda x: remove_words(x[remove_col], x[words_to_remove_col]), axis=1
    ).tolist()

4. itertuples + list comprehension + column subset

def iter_tuples_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Implemented with itertuples and a list comprehension.
    return [
        remove_words(x[0], x[1])
        for x in df[[remove_col, words_to_remove_col]].itertuples(
            index=False, name=None
        )
    ]

5. values + list comprehension + column subset

def values_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Implemented with .values and a list comprehension.
    return [
        remove_words(x[0], x[1])
        for x in df[[remove_col, words_to_remove_col]].values
    ]

6. to_numpy + list comprehension + column subset

def to_numpy_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Implemented with to_numpy and a list comprehension.
    return [
        remove_words(x[0], x[1])
        for x in df[[remove_col, words_to_remove_col]].to_numpy()
    ]

7. zip + list comprehension

def zip_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Implemented with zip and a list comprehension.
    return [remove_words(x, y) for x, y in zip(df[remove_col], df[words_to_remove_col])]

8. to_dict + list comprehension

def to_dict_only_used_columns(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Implemented with to_dict(orient="records") and a list comprehension.
    return [remove_words(row[remove_col], row[words_to_remove_col])
            for row in df[[remove_col, words_to_remove_col]].to_dict(orient="records")
            ]

Comparing the performance of each implementation:

def test_demo3():
    length = 10000
    df = pd.DataFrame(
        {
            "column_1": [31, 41] * length,
            "column_2": [
                "<p>The Apple iPhone 14, launched in 2022, comes in black,"
                " has metallic bezels and 2 or 3 cameras on the back.</p>",
                "<p>The Samsung Galaxy S22 Ultra, launched in 2022, is slim, "
                "comes in purple, has metallic bezels and multiple cameras on the back.</p>",
            ] * length,
            "column_3": [59, 26] * length,
            "column_4": ["Apple iPhone", "Samsung Galaxy"] * length,
        }
    )
 
    remove_col = "column_2"
    words_to_remove_col = "column_4"
    times = 10
    time_compare(apply, loop, times, df, remove_col, words_to_remove_col)
    time_compare(apply_only_used_cols, loop, times, df, remove_col, words_to_remove_col)
    time_compare(iter_tuples_only_used_cols, loop, times, df, remove_col, words_to_remove_col)
    time_compare(values_only_used_cols, loop, times, df, remove_col, words_to_remove_col)
    time_compare(to_numpy_only_used_cols, loop, times, df, remove_col, words_to_remove_col)
    time_compare(zip_only_used_cols, loop, times, df, remove_col, words_to_remove_col)
    time_compare(to_dict_only_used_columns, loop, times, df, remove_col, words_to_remove_col)

With length = 10000 the DataFrame has 20,000 rows (each two-item list is repeated length times); each function runs 10 times and the total time is compared with that of the loop function. The results:

  apply is faster than loop by a factor of 2.488778827656623!
  apply_only_used_cols is faster than loop by a factor of 2.5450880432143665!
  iter_tuples_only_used_cols is faster than loop by a factor of 4.227697069094404!
  values_only_used_cols is faster than loop by a factor of 4.230125072178648!
  to_numpy_only_used_cols is faster than loop by a factor of 4.193605421800908!
  zip_only_used_cols is faster than loop by a factor of 4.285874304788437!
  to_dict_only_used_columns is faster than loop by a factor of 3.5297519386949587!


After several runs, the conclusions are:

(1) Using apply is faster than a plain loop without apply.
(2) Using a column subset is faster than using the full DataFrame, though not by much.
(3) List comprehensions outperform apply.
(4) itertuples, values and to_numpy perform about the same, and all beat to_dict.
(5) zip performs about the same as itertuples, values and to_numpy.


Reducing the data size:


Starting from this setup, gradually reduce the length parameter from 10000 while raising times, so that no single run finishes so quickly that its measured time is 0 and the ratio raises a ZeroDivisionError.
With length=10 and times=100 the output is:

  apply is faster than loop by a factor of 1.297782568395623!
  apply_only_used_cols is faster than loop by a factor of 0.9615337464678227!
  iter_tuples_only_used_cols is faster than loop by a factor of 1.381812570902026!
  values_only_used_cols is faster than loop by a factor of 1.6666648958776031!
  to_numpy_only_used_cols is faster than loop by a factor of 1.6888828682380566!
  zip_only_used_cols is faster than loop by a factor of 4.168200592321075!
  to_dict_only_used_columns is faster than loop by a factor of 0.9736854360251224!

After several runs, two things stand out:

  (1) As the data size shrinks, the speedups of apply, itertuples, values, to_numpy and to_dict over the for loop approach 1, and can even drop below 1.
  (2) The zip + list comprehension variant still holds a speedup of roughly 4x.

Increasing the data size:

With length raised to 1000000 and times set to 1, the output is:

  apply is faster than loop by a factor of 2.285570314346545!
  apply_only_used_cols is faster than loop by a factor of 2.3677366568650133!
  iter_tuples_only_used_cols is faster than loop by a factor of 4.200264759040428!
  values_only_used_cols is faster than loop by a factor of 4.164087932501857!
  to_numpy_only_used_cols is faster than loop by a factor of 4.178357045572744!
  zip_only_used_cols is faster than loop by a factor of 4.177424999835447!
  to_dict_only_used_columns is faster than loop by a factor of 3.5472882071889926!

With length raised to 10000000 and times set to 1, the output is:

  apply is faster than loop by a factor of 2.3469223840251487!
  apply_only_used_cols is faster than loop by a factor of 2.469951612921611!
  iter_tuples_only_used_cols is faster than loop by a factor of 4.239984442624696!
  values_only_used_cols is faster than loop by a factor of 4.206735378444697!
  to_numpy_only_used_cols is faster than loop by a factor of 4.212792473103955!
  zip_only_used_cols is faster than loop by a factor of 4.27282418691667!
  to_dict_only_used_columns is faster than loop by a factor of 3.46571638926062!


So at least across this range (length from ten thousand to ten million), increasing the data size does not change the speedup ratios.

Adding a cache on top of iteration:

@lru_cache
def remove_words2(remove_from: str, words_to_remove: str, min_include_word_length: int = 4) -> str:
    # Same as remove_words, but memoized with functools.lru_cache.
    words_to_exclude = set(words_to_remove.split(" "))
    no_html = re.sub("<.*?>", " ", remove_from)
    include_words = [
        x
        for x in re.findall(r"\w+", no_html)
        if (len(x) >= min_include_word_length) and (x not in words_to_exclude)
    ]
    return " ".join(include_words)
 
 
def zip_only_used_cols_cached(df: pd.DataFrame, remove_col: str, words_to_remove_col: str) -> list[str]:
    # Implemented with zip + list comprehension + lru_cache.
    return [remove_words2(x, y) for x, y in zip(df[remove_col], df[words_to_remove_col])]
 
def test_demo3():
    length = 10000000
    df = pd.DataFrame(
        {
            "column_1": [31, 41] * length,
            "column_2": [
                "<p>The Apple iPhone 14, launched in 2022, comes in black,"
                " has metallic bezels and 2 or 3 cameras on the back.</p>",
                "<p>The Samsung Galaxy S22 Ultra, launched in 2022, is slim, "
                "comes in purple, has metallic bezels and multiple cameras on the back.</p>",
            ] * length,
            "column_3": [59, 26] * length,
            "column_4": ["Apple iPhone", "Samsung Galaxy"] * length,
        }
    )
 
    remove_col = "column_2"
    words_to_remove_col = "column_4"
    times = 1
    time_compare(zip_only_used_cols_cached, loop, times, df, remove_col, words_to_remove_col)

The output is:

  zip_only_used_cols_cached is faster than loop by a factor of 154.78049914392133!

Because the generated data is highly repetitive, the resulting speedup is also very high. If real data has little repetition, the extra memory consumed by the cache may not buy enough of a performance gain.
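
To judge whether caching is worthwhile on real data, one option is to inspect the hit rate after a run. A minimal sketch using the cache_info() counters that functools.lru_cache exposes:

info = remove_words2.cache_info()
total = info.hits + info.misses
hit_rate = info.hits / total if total else 0.0
print(f"hits={info.hits}, misses={info.misses}, hit rate={hit_rate:.2%}, cached entries={info.currsize}")

A low hit rate means the cache is mostly paying memory for nothing, and the plain zip version is probably the better choice.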

Adding parallelism on top of iteration:

import os
import re
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=min(os.cpu_count(), 12), progress_bar=True)
     
def parapply_only_used_cols(df: pd.DataFrame, remove_col: str, words_to_remove_col: str, func, rr) -> list[str]:
    # Implemented with pandarallel's parallel_apply.
    return df[[remove_col, words_to_remove_col]].parallel_apply(
        lambda x: func(x[remove_col], x[words_to_remove_col], rr=rr), axis=1
    )
     
     
def test_demo3():
    length = 10000000
    df = pd.DataFrame(
        {
            "column_1": [31, 41] * length,
            "column_2": [
                "<p>The Apple iPhone 14, launched in 2022, comes in black,"
                " has metallic bezels and 2 or 3 cameras on the back.</p>",
                "<p>The Samsung Galaxy S22 Ultra, launched in 2022, is slim, "
                "comes in purple, has metallic bezels and multiple cameras on the back.</p>",
            ] * length,
            "column_3": [59, 26] * length,
            "column_4": ["Apple iPhone", "Samsung Galaxy"] * length,
        }
    )
     
    remove_col = "column_2"
    words_to_remove_col = "column_4"
    times = 1
    time_compare(parapply_only_used_cols, loop, times, df, remove_col, words_to_remove_col, remove_words, re)

The machine running this program reports cpu_count() == 6; the result is:

  parapply_only_used_cols is faster than loop by a factor of 8.581086267629177!

Note that parallelizing pandas code with the pandarallel module consumes more CPU resources.
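
If adding the pandarallel dependency is undesirable, a similar effect can be sketched with the standard library's concurrent.futures. This is my own illustration, not part of the original article; it assumes remove_words is defined at module level (so it can be pickled) and, on Windows, that the call happens under an if __name__ == "__main__": guard:

from concurrent.futures import ProcessPoolExecutor

def _process_chunk(pairs):
    # Each worker handles a list of (remove_from, words_to_remove) tuples.
    return [remove_words(a, b) for a, b in pairs]

def multiprocessing_zip(df: pd.DataFrame, remove_col: str, words_to_remove_col: str, workers: int = 6) -> list[str]:
    # Split the row pairs into one chunk per worker and process the chunks in parallel.
    pairs = list(zip(df[remove_col], df[words_to_remove_col]))
    chunk_size = max(1, len(pairs) // workers)
    chunks = [pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(_process_chunk, chunks)
    return [item for chunk in results for item in chunk]

As with pandarallel, the pickling and inter-process transfer overhead means this only pays off when the per-row work or the data volume is large.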

3. Use NumPy to improve performance

 

def count_numpy(x_df, target: int) -> int:
    # Convert the column to a NumPy array and count matches with np.count_nonzero.
    return np.count_nonzero(x_df["column_1"].to_numpy() == target)


def count_numpy_sum(x_df, target: int) -> int:
    # Convert to a NumPy array and sum the boolean mask.
    return np.sum(x_df["column_1"].to_numpy() == target)


def count_vectorized(x_df, target: int) -> int:
    # Pure-pandas vectorized comparison (same as in section 1).
    return (x_df['column_1'] == target).sum()


def count_vectorized2(x_df, target: int) -> int:
    # Filter with a boolean mask and take the length of the result.
    return len(x_df[(x_df['column_1'] == target)])
     
     
def test_demo4():
    import random
    df = pd.DataFrame()
    df['column_1'] = [random.randint(1, 10) for i in range(1000000)]
    time_compare(count_vectorized, count_loop, 10, df, 5)
    time_compare(count_numpy, count_loop, 10, df, 5)
    time_compare(count_numpy_sum, count_loop, 10, df, 5)
    time_compare(count_vectorized2, count_loop, 10, df, 5)

Four different implementations count the rows in df whose value equals target.

The results:

  count_vectorized is faster than count_loop by a factor of 41.38230345601147!
  count_numpy is faster than count_loop by a factor of 108.2795955926289!
  count_numpy_sum is faster than count_loop by a factor of 66.99983564918571!
  count_vectorized2 is faster than count_loop by a factor of 26.87428957323596!

The results show that the two NumPy-based counting implementations are faster than the pandas-only ones.
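
When swapping pandas operations for NumPy ones it is worth confirming that all implementations agree. A small sanity check of my own (not from the original article) on a fresh DataFrame:

df_check = pd.DataFrame({'column_1': [random.randint(1, 10) for _ in range(1000)]})
assert (count_loop(df_check, 5)
        == count_vectorized(df_check, 5)
        == count_numpy(df_check, 5)
        == count_numpy_sum(df_check, 5)
        == count_vectorized2(df_check, 5))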

Comparing datetime arithmetic:

def offset_loop(x_df, days: int) -> pd.DataFrame:
    # Add `days` days to column_10 using a loop.
    d = pd.Timedelta(days=days)
    x_df['column_const'] = [x + d for x in x_df['column_10']]
    return x_df
     
     
def offset_vectorized(x_df, days: int) -> pd.DataFrame:
    # Add `days` days with a vectorized pandas Timedelta.
    x_df['column_const'] = x_df['column_10'] + pd.Timedelta(days=days)
    return x_df
     
def offset_numpy(x_df, days: int) -> pd.DataFrame:
    # Add `days` days using NumPy's timedelta64 on the underlying array.
    x_df["column_const"] = x_df["column_10"].to_numpy() + np.timedelta64(days, "D")
    return x_df
     
     
def test_demo5():
    import datetime
    df = pd.DataFrame()
    df['column_10'] = pd.date_range(datetime.date(1900, 1, 1), datetime.date(2025, 1, 1))
    time_compare(offset_vectorized, offset_loop, 10, df.copy(), 5)
    time_compare(offset_numpy, offset_loop, 10, df.copy(), 5)

The results:

  offset_vectorized is faster than offset_loop by a factor of 310.55895912304067!
  offset_numpy is faster than offset_loop by a factor of 493.59031363549434!

The NumPy version is faster than the pandas one.

4. Adjust data types to improve performance

Choosing appropriate pandas data types can speed up the code.

def mean(df):
    df["column_0"].mean()
     
     
def mean_object(df):
    df["column_0_o"].mean()
     
     
def mean_32(df):
    df["column_0_32"].mean()
     
     
def std(df):
    df["column_0"].std()
     
     
def std_object(df):
    df["column_0_o"].std()
     
     
def value_counts(df):
    df["column_1"].value_counts()
     
     
def value_counts_object(df):
    df["column_1_o"].value_counts()
     
     
def test_demo6():
    df = pd.DataFrame()
    df["column_0"] = [random.random() * 10 for i in range(1000000)]
    df["column_0_o"] = df["column_0"].astype(object)
    df["column_0_32"] = df["column_0_o"].astype('float32')
 
    times = 10
    time_compare(mean, mean_object, times, df)
    time_compare(std, std_object, times, df)
 
    df["column_1"] = [random.randint(1, 10) for i in range(1000000)]
    df["column_1_o"] = df["column_1"].astype(object)
    time_compare(value_counts, value_counts_object, times, df)
     
    time_compare(mean_32, mean, times, df)

The output is:

  mean is faster than mean_object by a factor of 13.919992732697125!
  std is faster than std_object by a factor of 16.70126141201752!
  value_counts is faster than value_counts_object by a factor of 5.388198857493551!
  mean_32 is faster than mean by a factor of 2.0620422535211267!

From this:
  1. column_0, which has a proper numeric dtype, is several to more than ten times faster in computations than column_0_o, which is stored as object.
  2. A float32 column computes faster than a float64 column.

So give columns that take part in computations a proper pandas dtype rather than Python's object type. And if the business does not need high precision and the value range is small, consider a narrower, lower-precision dtype to speed things up further.
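
Smaller dtypes also reduce memory use. A quick sketch of my own, based on pandas' memory_usage and astype (the column name follows test_demo6 above):

df_mem = pd.DataFrame({"column_0": [random.random() * 10 for _ in range(1000000)]})
print("float64:", df_mem["column_0"].memory_usage(deep=True), "bytes")
print("float32:", df_mem["column_0"].astype("float32").memory_usage(deep=True), "bytes")
# pd.to_numeric can downcast integer columns automatically:
ints = pd.Series([random.randint(1, 10) for _ in range(1000000)])
print(pd.to_numeric(ints, downcast="integer").dtype)  # expected to be a small integer dtype such as int8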

 

5. Grouping by a single column is faster than grouping by multiple columns

When a grouped aggregation has to group on several columns, consider combining them into a single column to speed things up.

def timer(func):
    # Decorator: run func 10 times and print the total elapsed time.
    def wrapper(*_args, **_kwargs):
        start_time = time.time()
        for i in range(10):
            data = func(*_args, **_kwargs)
        end_time = time.time()
        print("%s: %s" % (func.__name__, end_time - start_time))
        return data
    return wrapper
 
 
def groupby(df):
    df.groupby(['B', 'C']).A.sum()
 
 
def groupby_single(df):
    df.groupby('id').A.sum()
 
 
def groupby2(df):
    df.groupby(['B', 'C', 'D']).A.sum()
 
 
def groupby_single2(df):
    df.groupby('id2').A.sum()
 
 
def test_demo7():
    # Compare single-column and multi-column grouping.
    import random
    df = pd.DataFrame()
    data_length = 1000000
    df['A'] = [random.random() for _ in range(data_length)]
    df['B'] = [random.randint(1, 10) for _ in range(data_length)]
    df['C'] = [random.randint(1, 10) for _ in range(data_length)]
    df['D'] = [random.randint(1, 10) for _ in range(data_length)]
    df['B'] = df.B.astype(str)
    df['C'] = df.C.astype(str)
    df['D'] = df.D.astype(str)
    start = time.time()
    df['id'] = df['B'].astype(str) + "_" + df['C'].astype(str)
    print(f"合成id列的时间为{time.time() - start}秒")
    start = time.time()
    df['id2'] = df['B'].astype(str) + "_" + df['C'].astype(str) + "_" + df['D'].astype(str)
    print(f"合成id2列的时间为{time.time() - start}秒")
    times = 10
    time_compare(groupby_single, groupby, times, df)
    time_compare(groupby_single2, groupby2, times, df)
    timer(groupby)(df)
    timer(groupby_single)(df)
    timer(groupby2)(df)
    timer(groupby_single2)(df)

Output:
  Building the id column took 0.45378637313842773 seconds
  Building the id2 column took 0.6722023487091064 seconds
  groupby_single is faster than groupby by a factor of 1.816462146412564!
  groupby_single2 is faster than groupby2 by a factor of 3.1309729995336744!
  groupby: 1.3633546829223633
  groupby_single: 0.6931476593017578
  groupby2: 2.0515127182006836
  groupby_single2: 0.6333067417144775

The results show that grouping on a single column is faster than grouping on multiple columns, and the more columns are merged into one, the bigger the improvement.

Note that whether this optimization pays off depends on whether the time spent building the combined grouping column exceeds the time it saves.
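
A related option worth benchmarking, not covered in the original article, is converting the grouping columns to pandas' category dtype, which often speeds up groupby without building a combined key. A sketch assuming the same df built in test_demo7:

def groupby_category(df):
    # Group on categorical columns; observed=True only emits combinations that actually occur.
    df.groupby(['B', 'C'], observed=True).A.sum()

df_cat = df.copy()
df_cat['B'] = df_cat['B'].astype('category')
df_cat['C'] = df_cat['C'].astype('category')
timer(groupby_category)(df_cat)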

6. Use an appropriate file format to speed up I/O and reduce disk usage

@timer
def to_parquet(df):
    # Requires pyarrow or fastparquet.
    df.to_parquet("test.parquet")
     
     
@timer
def to_pickle(df):
    df.to_pickle("test.pickle")
     
     
@timer
def to_dta(df):
    df.to_stata("test.dta")
     
     
@timer
def to_csv(df):
    df.to_csv("test.csv")
     
     
def test_demo8():
    import random
    df = pd.DataFrame([[random.randint(1, 10) for i in range(20)]
                       for _ in range(100000)], columns=[f"col{i}" for i in range(20)])
 
    to_parquet(df)
    to_pickle(df)
    to_csv(df)
    to_dta(df)


Running test_demo8 gives:

to_parquet: 0.5604836940765381
to_pickle: 0.1137237548828125
to_csv: 3.3569953441619873
to_dta: 0.6413149833679199

Execution time and output file size:

Function     Time (s)   File size (KB)
to_parquet   0.56       993
to_pickle    0.11       15626
to_csv       3.36       4775
to_dta       0.64       8208

In terms of file size, from smallest to largest: to_parquet < to_csv < to_dta < to_pickle. In terms of execution time, from fastest to slowest: to_pickle < to_parquet < to_dta < to_csv.

Taking both into account, to_parquet is fast and produces small files, so it should be the first choice. If speed is the only concern, to_pickle is also worth considering.

Compression was not considered here; with compression the files would shrink, but execution time would grow by tens of times, so the comparison would no longer be meaningful.
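
To reproduce the size column and measure read speed as well, a short sketch of my own (using os.path.getsize and the matching pandas readers) can be run after test_demo8 has written the files:

for path, reader in [("test.parquet", pd.read_parquet),
                     ("test.pickle", pd.read_pickle),
                     ("test.csv", pd.read_csv),
                     ("test.dta", pd.read_stata)]:
    start = time.time()
    reader(path)
    print(f"{path}: {os.path.getsize(path) / 1024:.0f} KB, read in {time.time() - start:.3f} s")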

 

posted @ 2025-01-02 22:13 桑胡