Battle of the Ducks

towardsdatascience.com/battle-of-the-ducks-24fd55260fae/

Image created by AI (Dalle-3)

As some of you may know, I'm a big fan of the DuckDB Python library and have written many articles about it. I was also one of the first to write about a newer Python library called FireDucks and helped bring it to people's attention.

If you've never heard of these useful libraries, check out the links below for an introduction to them.

DuckDB

New Pandas rival, FireDucks, brings the smoke!

Both libraries are taking an increasing share of data science workloads, where it can be argued that data manipulation and general wrangling are at least as important as the machine learning and insight side of data analysis.

The two tools are very different at their core: DuckDB is a modern, embedded analytics database designed to process and query gigabytes of data from various sources efficiently. FireDucks is designed to be a much faster drop-in replacement for Pandas.

Their key commonality, however, is that both are well suited to general medium-scale data processing tasks, and both are extremely performant. If that's your use case, which one should you choose? That's exactly what we'll find out today.

Here are the tests I'll be performing.

  • Read a large CSV file into memory, i.e. a DuckDB table and a FireDucks DataFrame

  • Perform some typical data processing tasks against both sets of in-memory data

  • Create a new column in the in-memory datasets based on existing table/DataFrame column data

  • Write the updated in-memory datasets out to CSV and Parquet formats
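
Each test below brackets the work with `time.time()` calls. To keep that pattern in one place, a small timing helper can be used instead (a stdlib-only sketch; the `timed` helper is my own convenience, not part of either library):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print the wall-clock time taken by the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.3f} seconds")

# Example usage with a stand-in workload
with timed("example: summing a million ints"):
    total = sum(range(1_000_000))
```

Note that `time.perf_counter()` is generally preferable to `time.time()` for timing code, since it is monotonic and has higher resolution.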

The input dataset

I created a CSV file containing 100 million records of fake sales data.

The input data schema is as follows,

  • order_id (integer)

  • order_date (date)

  • customer_id (integer)

  • customer_name (string)

  • product_id (integer)

  • product_names (string)

  • categories (string)

  • quantity (integer)

  • price (float)

  • total (float)

Here is a Python program you can use to create the CSV file. On my system, this resulted in a file of around 7.5 GB.

# generate the 100m CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )

    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )

    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100

    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days

    # Create random dates as np.array and convert to string format
    order_dates = np.array([(start_date + timedelta(days=np.random.randint(0, date_range))).strftime('%Y-%m-%d') for _ in range(nrows)])

    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }

    # Create Polars DataFrame and write to CSV with explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=',',include_header=True)  # Ensure comma is used as the delimiter

# Generate data with random order_date and save to CSV
generate(100_000_000, "/mnt/d/sales_data/sales_data_100m.csv")

Installing WSL2 Ubuntu

FireDucks only runs under Linux, so as I normally run Windows, I'll be using WSL2 Ubuntu for my Linux environment, but the same code should work on any Linux/Unix setup. I have a full guide on installing WSL2 Ubuntu on Windows here

Setting up a development environment

OK, before we start coding the examples, we should set up a separate development environment. That way, whatever we do won't interfere with other versions of libraries, programs, etc., that we might be using for other projects.

I use Miniconda for this, but you can use whatever method suits you best.

If you want to go down the Miniconda route and don't already have it, you must first install Miniconda. Get it using this link,

Miniconda – Anaconda documentation

Once the environment is created, switch to it using the activate command, and then install Jupyter and any required Python libraries.

# Create our test environment
(base) $ conda create -n duck_battle python=3.11 -y
# Now activate it
(base) $ conda activate duck_battle
# Install python libraries, etc ...
(duck_battle) $ pip install jupyter fireducks duckdb

Test 1 – Read a large CSV file and display the last 10 records

DuckDB

import duckdb

print(duckdb.__version__)

'1.1.3'
# DuckDB read CSV file 
#
import duckdb
import time

# Start the timer
start_time = time.time()

# Create a connection to an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Create a table from the CSV file
con.execute("CREATE TABLE sales AS SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv', header=true)")

# Fetch the last 10 rows
query = "SELECT * FROM sales ORDER BY rowid DESC LIMIT 10"
df = con.execute(query).df()

# Display the last 10 rows
print("\nLast 10 rows of the file:")
print(df)

# End the timer and calculate the total elapsed time
total_elapsed_time = time.time() - start_time

print(f"DuckDB: Time taken to read the CSV file and display the last 10 records: {total_elapsed_time} seconds")

#
# DuckDB output
#

Last 10 rows of the file:
   order_id order_date  customer_id   customer_name  product_id product_names  
0  99999999 2023-06-16          102   Customer_9650         203         Chair   
1  99999998 2022-03-02          709  Customer_23966         208      Notebook   
2  99999997 2019-05-10          673  Customer_25709         202          Desk   
3  99999996 2011-10-21          593  Customer_29352         200        Laptop   
4  99999995 2011-10-24          501  Customer_29289         202          Desk   
5  99999994 2023-09-27          119  Customer_15532         209  Coffee Maker   
6  99999993 2015-01-15          294  Customer_27081         200        Laptop   
7  99999992 2016-04-07          379   Customer_1353         207           Pen   
8  99999991 2010-09-19          253  Customer_29439         204       Monitor   
9  99999990 2016-05-19          174  Customer_11294         210       Cabinet   

    categories  quantity  price   total  
0       Office         4  59.58  238.32  
1   Stationery         1  78.91   78.91  
2       Office         5   9.12   45.60  
3  Electronics         3  67.42  202.26  
4       Office         7  53.78  376.46  
5  Electronics         2  55.10  110.20  
6  Electronics         9  86.01  774.09  
7   Stationery         5  21.56  107.80  
8  Electronics         4   5.17   20.68  
9       Office         9  65.10  585.90  

DuckDB: Time taken to read the CSV file and display the last 10 records: 59.23184013366699 seconds

Fireducks

import fireducks
import fireducks.pandas as pd

print(fireducks.__version__)
print(pd.__version__)

1.1.6
2.2.3
# Fireducks read CSV
#
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "/mnt/d/sales_data/sales_data_100m.csv"

# Read the CSV file into a DataFrame
df_fire = pd.read_csv(file_path)

# Display the last 10 rows of the DataFrame
print(df_fire.tail(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to read the CSV file and display the last 10 records: {elapsed_time} seconds")         

#
# Fireducks output
#

          order_id  order_date  customer_id   customer_name  product_id  
99999990  99999990  2016-05-19          174  Customer_11294         210   
99999991  99999991  2010-09-19          253  Customer_29439         204   
99999992  99999992  2016-04-07          379   Customer_1353         207   
99999993  99999993  2015-01-15          294  Customer_27081         200   
99999994  99999994  2023-09-27          119  Customer_15532         209   
99999995  99999995  2011-10-24          501  Customer_29289         202   
99999996  99999996  2011-10-21          593  Customer_29352         200   
99999997  99999997  2019-05-10          673  Customer_25709         202   
99999998  99999998  2022-03-02          709  Customer_23966         208   
99999999  99999999  2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total  
99999990       Cabinet       Office         9  65.10  585.90  
99999991       Monitor  Electronics         4   5.17   20.68  
99999992           Pen   Stationery         5  21.56  107.80  
99999993        Laptop  Electronics         9  86.01  774.09  
99999994  Coffee Maker  Electronics         2  55.10  110.20  
99999995          Desk       Office         7  53.78  376.46  
99999996        Laptop  Electronics         3  67.42  202.26  
99999997          Desk       Office         5   9.12   45.60  
99999998      Notebook   Stationery         1  78.91   78.91  
99999999         Chair       Office         4  59.58  238.32 

Fireducks: Time taken to read the CSV file and display the last 10 records: 65.69259881973267 seconds

Not much in it; DuckDB shades it by around 6 seconds.

Test 2 – Calculate total sales by category

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    categories, 
    SUM(total) AS total_sales
FROM sales
GROUP BY categories
ORDER BY total_sales DESC
"""
start_time = time.time()

# 1. Total sales by category
results = con.execute(query_sql).df()

print(f"DuckDB: Time for sales by category calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for sales by category calculation: 0.1401681900024414 seconds

  categories  total_sales
0 Electronics 1.168493e+10
1 Stationery  7.014109e+09
2 Office      7.006807e+09
3 Sundry      2.338428e+09

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

total_sales_by_category = df_fire.groupby('categories')['total'].sum().sort_values(ascending=False)
print(total_sales_by_category)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate sales by category: {elapsed_time} seconds")

#
# Fireducks output
#

categories
Electronics    1.168493e+10
Stationery     7.014109e+09
Office         7.006807e+09
Sundry         2.338428e+09
Name: total, dtype: float64

Fireducks: Time taken to calculate sales by category:  0.13571524620056152 seconds

Not much in it there, either. FireDucks just edges it.

Test 3 – Top 5 customer spend

DuckDB

# duckdb process data
#
import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    customer_id, 
    customer_name, 
    SUM(total) AS total_purchase
FROM sales
GROUP BY customer_id, customer_name
ORDER BY total_purchase DESC
LIMIT 5
"""
start_time = time.time()

# 2. Top 5 customers by total purchase
results = con.execute(query_sql).df()

print(f"DuckDB: Time to calculate top 5 customers: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time to calculate top 5 customers: 1.4588654041290283 seconds

  customer_id customer_name  total_purchase
0 681         Customer_20387 6892.96
1 740         Customer_30499 6613.11
2 389         Customer_22686 6597.35
3 316         Customer_185   6565.38
4 529         Customer_1609  6494.35

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

top_5_customers = df_fire.groupby(['customer_id', 'customer_name'])['total'].sum().sort_values(ascending=False).head(5)
print(top_5_customers)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate top 5 customers: {elapsed_time} seconds")

#
# Fireducks output
#

customer_id  customer_name 
681          Customer_20387    6892.96
740          Customer_30499    6613.11
389          Customer_22686    6597.35
316          Customer_1859     6565.38
529          Customer_1609     6494.35
Name: total, dtype: float64
Fireducks: Time taken to calculate top 5 customers: 2.823930263519287 seconds

DuckDB wins this one, running almost twice as fast as FireDucks.
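
Both versions above perform the same three steps: group by customer, sum `total`, and keep the five biggest spenders. Stripped of the libraries, the logic reduces to a dict accumulation plus `heapq.nlargest` (a sketch on made-up toy rows, just to make the shape of the computation explicit):

```python
import heapq
from collections import defaultdict

# Toy rows: (customer_id, customer_name, total)
rows = [
    (681, "Customer_20387", 4000.00),
    (681, "Customer_20387", 2892.96),
    (740, "Customer_30499", 6613.11),
    (389, "Customer_22686", 6597.35),
    (101, "Customer_1", 10.00),
]

# Group by (customer_id, customer_name) and sum the totals
purchase_totals = defaultdict(float)
for cust_id, cust_name, total in rows:
    purchase_totals[(cust_id, cust_name)] += total

# Keep only the top 3 spenders, without sorting the whole aggregate
top3 = heapq.nlargest(3, purchase_totals.items(), key=lambda kv: kv[1])
for (cust_id, cust_name), spend in top3:
    print(cust_id, cust_name, round(spend, 2))
```

`heapq.nlargest` avoids sorting the entire aggregate, which mirrors what `ORDER BY ... LIMIT 5` asks the engine to do.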

Test 4 – Monthly sales figures

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total) AS monthly_sales
FROM sales
GROUP BY month
ORDER BY month
"""
start_time = time.time()

# 3. Monthly sales
results = con.execute(query_sql).df()

print(f"DuckDB: Time for seasonal trend calculation: {time.time() - start_time} seconds")

results

# 
# DuckDB output
#

DuckDB: Time for seasonal trend calculation: 0.16109275817871094 seconds

  month        monthly_sales
0 2010-01-01   1.699500e+08
1 2010-02-01   1.535730e+08
2 2010-03-01   1.702968e+08
3 2010-04-01   1.646421e+08
4 2010-05-01   1.704506e+08
... ... ...
163 2023-08-01 1.699263e+08
164 2023-09-01 1.646018e+08
165 2023-10-01 1.692184e+08
166 2023-11-01 1.644883e+08
167 2023-12-01 1.643962e+08

168 rows × 2 columns

Fireducks

import fireducks.pandas as pd
import time

def seasonal_trend():
    # Ensure 'order_date' is datetime
    df_fire['order_date'] = pd.to_datetime(df_fire['order_date'])

    # Extract 'month' as string
    df_fire['month'] = df_fire['order_date'].dt.strftime('%Y-%m')

    # Group by 'month' and sum 'total'
    results = (
        df_fire.groupby('month')['total']
        .sum()
        .reset_index()
        .sort_values('month')
    )
    print(results)

start_time = time.time()
seasonal_trend()
# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for seasonal trend calculation: {elapsed_time} seconds")

#
# Fireducks Output
#

       month         total
0    2010-01  1.699500e+08
1    2010-02  1.535730e+08
2    2010-03  1.702968e+08
3    2010-04  1.646421e+08
4    2010-05  1.704506e+08
..       ...           ...
163  2023-08  1.699263e+08
164  2023-09  1.646018e+08
165  2023-10  1.692184e+08
166  2023-11  1.644883e+08
167  2023-12  1.643962e+08

[168 rows x 2 columns]
Fireducks: Time for seasonal trend calculation: 3.109074354171753 seconds

DuckDB was significantly faster in this example.
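
DuckDB's `DATE_TRUNC('month', order_date)` and the `strftime('%Y-%m')` call in the FireDucks version both collapse each order date into a month bucket before summing. The same keying with only the stdlib (a minimal sketch on toy data) looks like this:

```python
from datetime import date
from collections import defaultdict

# Toy orders: (order_date, total)
orders = [
    (date(2010, 1, 5), 100.0),
    (date(2010, 1, 28), 50.0),
    (date(2010, 2, 3), 70.0),
]

monthly = defaultdict(float)
for d, total in orders:
    # Equivalent of DATE_TRUNC('month', d), expressed as the
    # same '%Y-%m' string key the FireDucks version uses
    monthly[d.strftime("%Y-%m")] += total

print(sorted(monthly.items()))
# → [('2010-01', 150.0), ('2010-02', 70.0)]
```

The only real difference between the two implementations is the key type: DuckDB keeps a date (the first of the month), while the FireDucks code keys on a string.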

Test 5 – Average order by product

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
SELECT 
    product_id,
    product_names,
    AVG(total) AS avg_order_value
FROM sales
GROUP BY product_id, product_names
ORDER BY avg_order_value DESC
"""
start_time = time.time()

# 4. Average order value by product
results = con.execute(query_sql).df()

print(f"DuckDB: Time for average order by product calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

DuckDB: Time for average order by product calculation: 0.13720130920410156 seconds

  product_id product_names avg_order_value
0 206        Paper         280.529144
1 208        Notebook      280.497268
2 201        Smartphone    280.494779
3 207        Pen           280.491508
4 205        Printer       280.470150
5 200        Laptop        280.456913
6 209        Coffee Maker  280.445365
7 211        Plastic Cups  280.440161
8 210        Cabinet       280.426960
9 202        Desk          280.367135
10 203       Chair         280.364045
11 204       Monitor       280.329706

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

avg_order_value = df_fire.groupby(['product_id', 'product_names'])['total'].mean().sort_values(ascending=False)
print(avg_order_value)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time

print(f"Fireducks: Time for average order calculation: {elapsed_time} seconds")

#
# Fireducks output
#

product_id  product_names
206         Paper            280.529144
208         Notebook         280.497268
201         Smartphone       280.494779
207         Pen              280.491508
205         Printer          280.470150
200         Laptop           280.456913
209         Coffee Maker     280.445365
211         Plastic Cups     280.440161
210         Cabinet          280.426960
202         Desk             280.367135
203         Chair            280.364045
204         Monitor          280.329706
Name: total, dtype: float64
Fireducks: Time for average order calculation: 0.06766319274902344 seconds

FireDucks takes the win there, running twice as fast as DuckDB.

Test 6 – Product performance analysis

DuckDB

import duckdb
import time

# Start total runtime timer
query_sql="""
WITH yearly_sales AS (
    SELECT 
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(total) AS total_sales
    FROM sales
    GROUP BY year
)
SELECT 
    year,
    total_sales,
    LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales,
    (total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100 AS yoy_growth
FROM yearly_sales
ORDER BY year
"""
start_time = time.time()

# 5. Year-over-year sales growth
results = con.execute(query_sql).df()

print(f"DuckDB: Time for product performance analysis calculation: {time.time() - start_time} seconds")

results

#
# DuckDb output
#

Time for product performance analysis  calculation: 0.03958845138549805 seconds

   year total_sales prev_year_sales yoy_growth
0  2010 2.002066e+09 NaN            NaN
1  2011 2.002441e+09 2.002066e+09   0.018739
2  2012 2.008966e+09 2.002441e+09   0.325848
3  2013 2.002901e+09 2.008966e+09  -0.301900
4  2014 2.000773e+09 2.002901e+09  -0.106225
5  2015 2.001931e+09 2.000773e+09   0.057855
6  2016 2.008762e+09 2.001931e+09   0.341229
7  2017 2.002164e+09 2.008762e+09  -0.328457
8  2018 2.002383e+09 2.002164e+09   0.010927
9  2019 2.002891e+09 2.002383e+09   0.025383
10 2020 2.008585e+09 2.002891e+09   0.284318
11 2021 2.000244e+09 2.008585e+09  -0.415281
12 2022 2.004500e+09 2.000244e+09   0.212756
13 2023 1.995672e+09 2.004500e+09  -0.440401

Fireducks

import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

df_fire['year'] = pd.to_datetime(df_fire['order_date']).dt.year
yearly_sales = df_fire.groupby('year')['total'].sum().sort_index()
yoy_growth = yearly_sales.pct_change() * 100

result = pd.DataFrame({
    'year': yearly_sales.index,
    'total_sales': yearly_sales.values,
    'prev_year_sales': yearly_sales.shift().values,
    'yoy_growth': yoy_growth.values
})

print(result)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time for product performance analysis  calculation: {elapsed_time} seconds")

#
# Fireducks output
#

    year   total_sales  prev_year_sales  yoy_growth
0   2010  2.002066e+09              NaN         NaN
1   2011  2.002441e+09     2.002066e+09    0.018739
2   2012  2.008966e+09     2.002441e+09    0.325848
3   2013  2.002901e+09     2.008966e+09   -0.301900
4   2014  2.000773e+09     2.002901e+09   -0.106225
5   2015  2.001931e+09     2.000773e+09    0.057855
6   2016  2.008762e+09     2.001931e+09    0.341229
7   2017  2.002164e+09     2.008762e+09   -0.328457
8   2018  2.002383e+09     2.002164e+09    0.010927
9   2019  2.002891e+09     2.002383e+09    0.025383
10  2020  2.008585e+09     2.002891e+09    0.284318
11  2021  2.000244e+09     2.008585e+09   -0.415281
12  2022  2.004500e+09     2.000244e+09    0.212756
13  2023  1.995672e+09     2.004500e+09   -0.440401

Time for product performance analysis  calculation: 0.17495489120483398 seconds

DuckDB was quicker this time around.
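
Both versions compute year-over-year growth the same way: `(current - previous) / previous * 100`, which is exactly what the SQL `LAG` expression and pandas' `pct_change() * 100` expand to. A quick stdlib check of the formula:

```python
def yoy_growth(prev_sales: float, curr_sales: float) -> float:
    """Year-over-year growth in percent: (current - previous) / previous * 100."""
    return (curr_sales - prev_sales) / prev_sales * 100

# 2010 -> 2011, using the (rounded) totals from the results table above
growth = yoy_growth(2.002066e9, 2.002441e9)
print(growth)  # close to the table's 0.018739; the inputs here are rounded sums
```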

Test 7 – Add a new column to the dataset and update its values

DuckDB

import duckdb
import time

start_time = time.time()

# Add new columns
con.execute("""
ALTER TABLE sales ADD COLUMN total_with_tax FLOAT
"""
)

# Perform the calculations and update the table
con.execute("""
UPDATE sales
SET total_with_tax = CASE 
    WHEN total <= 100 THEN total * 1.125  -- 12.5% tax
    WHEN total > 100 AND total <= 200 THEN total * 1.15   -- 15% tax
    WHEN total > 200 AND total <= 500 THEN total * 1.17   -- 17% tax
    WHEN total > 500 THEN total * 1.20   -- 20% tax
END;
""")

print(f"Time to add new column: {time.time() - start_time} seconds")

# Verify the new columns
result = con.execute("""
    SELECT 
        *
    FROM sales
    LIMIT 10;
""").fetchdf()

print(result)

#
# DuckDB output
#

Time to add new column: 2.4016575813293457 seconds

   order_id order_date  customer_id   customer_name  product_id product_names  
0         0 2021-11-25          238  Customer_25600         211  Plastic Cups   
1         1 2017-06-10          534  Customer_14188         209  Coffee Maker   
2         2 2010-02-15          924  Customer_14013         207           Pen   
3         3 2011-01-26          633   Customer_6120         211  Plastic Cups   
4         4 2014-01-11          561   Customer_1352         205       Printer   
5         5 2021-04-19          533   Customer_5342         208      Notebook   
6         6 2012-03-14          684  Customer_21604         207           Pen   
7         7 2017-07-01          744  Customer_30291         201    Smartphone   
8         8 2013-02-13          678  Customer_32618         204       Monitor   
9         9 2023-01-04          340  Customer_16898         207           Pen   

    categories  quantity  price   total  total_with_tax  
0       Sundry         2  99.80  199.60      229.539993  
1  Electronics         8   7.19   57.52       64.709999  
2   Stationery         6  70.98  425.88      498.279602  
3       Sundry         6  94.38  566.28      679.536011  
4  Electronics         4  44.68  178.72      205.528000  
5   Stationery         4  21.85   87.40       98.324997  
6   Stationery         3  93.66  280.98      328.746613  
7  Electronics         6  39.41  236.46      276.658203  
8  Electronics         2   4.30    8.60        9.675000  
9   Stationery         2   6.67   13.34       15.007500 

Fireducks

import numpy as np
import time
import fireducks.pandas as pd

# Start total runtime timer
start_time = time.time()
# Define tax rate conditions and choices
conditions = [
    (df_fire['total'] <= 100),
    (df_fire['total'] > 100) & (df_fire['total'] <= 200),
    (df_fire['total'] > 200) & (df_fire['total'] <= 500),
    (df_fire['total'] > 500)
]

choices = [1.125, 1.15, 1.17, 1.20]

# Calculate total_with_tax using np.select for efficiency
df_fire['total_with_tax'] = df_fire['total'] * np.select(conditions, choices)

# Print total runtime
print(f"Fireducks: Time to add new column: {time.time() - start_time} seconds")
print(df_fire)

#
# Fireducks output
#

Fireducks: Time to add new column: 2.7112433910369873 seconds

          order_id order_date  customer_id   customer_name  product_id  
0                0 2021-11-25          238  Customer_25600         211   
1                1 2017-06-10          534  Customer_14188         209   
2                2 2010-02-15          924  Customer_14013         207   
3                3 2011-01-26          633   Customer_6120         211   
4                4 2014-01-11          561   Customer_1352         205   
...            ...        ...          ...             ...         ...   
99999995  99999995 2011-10-24          501  Customer_29289         202   
99999996  99999996 2011-10-21          593  Customer_29352         200   
99999997  99999997 2019-05-10          673  Customer_25709         202   
99999998  99999998 2022-03-02          709  Customer_23966         208   
99999999  99999999 2023-06-16          102   Customer_9650         203   

         product_names   categories  quantity  price   total    month  year  
0         Plastic Cups       Sundry         2  99.80  199.60  2021-11  2021   
1         Coffee Maker  Electronics         8   7.19   57.52  2017-06  2017   
2                  Pen   Stationery         6  70.98  425.88  2010-02  2010   
3         Plastic Cups       Sundry         6  94.38  566.28  2011-01  2011   
4              Printer  Electronics         4  44.68  178.72  2014-01  2014   
...                ...          ...       ...    ...     ...      ...   ...   
99999995          Desk       Office         7  53.78  376.46  2011-10  2011   
99999996        Laptop  Electronics         3  67.42  202.26  2011-10  2011   
99999997          Desk       Office         5   9.12   45.60  2019-05  2019   
99999998      Notebook   Stationery         1  78.91   78.91  2022-03  2022   
99999999         Chair       Office         4  59.58  238.32  2023-06  2023   

          total_with_tax  
0              229.54000  
1               64.71000  
2              498.27960  
3              679.53600  
4              205.52800  
...                  ...  
99999995       440.45820  
99999996       236.64420  
99999997        51.30000  
99999998        88.77375  
99999999       278.83440  

[100000000 rows x 13 columns]

Their run times are very similar; it's a tie once again.
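
The SQL `CASE` expression and the `np.select` call encode the same tiered tax table. A plain-Python version of the rule (a sketch, handy for spot-checking either implementation against individual rows) is:

```python
def total_with_tax(total: float) -> float:
    """Apply the tiered tax rates used in both implementations above."""
    if total <= 100:
        return total * 1.125   # 12.5% tax
    elif total <= 200:
        return total * 1.15    # 15% tax
    elif total <= 500:
        return total * 1.17    # 17% tax
    else:
        return total * 1.20    # 20% tax

# Spot-check against the first row of the DuckDB output (total = 199.60)
print(round(total_with_tax(199.60), 2))  # → 229.54
```

The first row of the DuckDB output has `total = 199.60`, which falls in the 15% band and gives 229.54, matching the `total_with_tax` value shown (the DuckDB column is FLOAT, hence the 229.539993 display).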

Test 8 – Write the updated data to a CSV file

DuckDB

import time

start_time = time.time()

# Write the modified sales table to a CSV file
con.execute("""
    COPY (SELECT * FROM sales) TO '/mnt/d/sales_data/final_sales_data_duckdb.csv' WITH (HEADER TRUE, DELIMITER ',')
""")

print(f"DuckDB: Time to write CSV to file: {time.time() - start_time} seconds")

DuckDB: Time to write CSV to file: 54.899176597595215 seconds

Fireducks

# fireducks write data back to CSV
#
import fireducks.pandas as pd
import time

# Tidy up DF before writing out
cols_to_drop = ['year', 'month']
df_fire = df_fire.drop(columns=cols_to_drop)
df_fire['total_with_tax'] = df_fire['total_with_tax'].round(2) 
df_fire['order_date'] = df_fire['order_date'].dt.date

# Start total runtime timer
start_time = time.time()

df_fire.to_csv('/mnt/d/sales_data/fireducks_sales.csv',quoting=0,index=False)

# Print total runtime
print(f"Fireducks: Time to write CSV  to file: {time.time() - start_time} seconds")

Fireducks: Time to write CSV  to file: 54.490307331085205 seconds

It's too close to call yet again.

Test 9 – Write the updated data to a Parquet file

DuckDB

# DuckDB write Parquet data
#
import time

start_time = time.time()

# Write the modified sales table to a Parquet file
con.execute("COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet' (FORMAT 'parquet');")

print(f"DuckDB: Time to write parquet to file: {time.time() - start_time} seconds")

DuckDB: Time to write parquet to file: 30.011869192123413 seconds

Fireducks

import fireducks.pandas as pd
import time

# Start total runtime timer
start_time = time.time()

df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet')

# Print total runtime
print(f"Fireducks: Time to write Parquet to file: {time.time() - start_time} seconds")

Fireducks: Time to write Parquet to file: 86.29632377624512 seconds

This is the first major discrepancy between the run times. FireDucks took almost a minute longer than DuckDB to write its data out to Parquet.

Summary

So, what are we to make of all this? Put simply, there isn't much to choose between the two libraries. Both are very fast and can handle large datasets with ease. Once your data is in memory, whether in a DuckDB table or a FireDucks DataFrame, both libraries will process it at speed.

Your choice of which to use will depend on your existing infrastructure and skill set.

If you're a database person, DuckDB is the obvious library to choose, as your SQL skills would be instantly transferable.

Alternatively, if you're already deeply embedded within the Pandas world, FireDucks would be a great choice for you.

_OK, that's all for me for now. Hopefully, you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories, follow me, or subscribe to get notified when I publish new content._

If you liked this content, you might also find these articles interesting.

Build a Data Dashboard

Speeding up Pandas code with Numpy

posted @ 2026-03-28 10:17 布客飞龙V