The Battle of the Ducks

Image created by AI (DALL·E 3)
As some of you will know, I'm a big fan of the DuckDB Python library and have written several articles about it. I was also one of the first to write about a newer Python library called Fireducks and help bring it to people's attention.
If you've never heard of either of these useful libraries, check out the links below for an introduction to them.
Both libraries are winning an increasing share of data science workloads, where it's arguable that data manipulation and general wrangling are at least as important as the machine learning side of analysis and insight.
At their core, the two tools are very different: DuckDB is a modern, embedded analytics database designed to process and query gigabytes of data from a variety of sources efficiently, while Fireducks is designed as a faster drop-in replacement for Pandas.
Their key commonality, however, is that both are well suited to general medium-scale data processing and are extremely performant. If that's your use case, which one should you choose? That's exactly what we're going to find out today.
Here are the tests I'll be running:

- Read a large CSV file into memory, i.e. into a DuckDB table and a Fireducks DataFrame
- Carry out some typical data-processing tasks against both sets of in-memory data
- Create a new column in the in-memory datasets based on existing table/DataFrame column data
- Write the updated in-memory datasets out in both CSV and Parquet format
The input dataset
I created a CSV file containing 100 million records of fake sales data.
The input data schema is as follows:
- order_id (integer)
- order_date (date)
- customer_id (integer)
- customer_name (string)
- product_id (integer)
- product_names (string)
- categories (string)
- quantity (integer)
- price (float)
- total (float)
Here is a Python program you can use to create the CSV file. On my system, this produced a file of around 7.5 GB.
```python
# generate the 100m record CSV file
#
import polars as pl
import numpy as np
from datetime import datetime, timedelta

def generate(nrows: int, filename: str):
    names = np.asarray(
        [
            "Laptop",
            "Smartphone",
            "Desk",
            "Chair",
            "Monitor",
            "Printer",
            "Paper",
            "Pen",
            "Notebook",
            "Coffee Maker",
            "Cabinet",
            "Plastic Cups",
        ]
    )
    categories = np.asarray(
        [
            "Electronics",
            "Electronics",
            "Office",
            "Office",
            "Electronics",
            "Electronics",
            "Stationery",
            "Stationery",
            "Stationery",
            "Electronics",
            "Office",
            "Sundry",
        ]
    )
    product_id = np.random.randint(len(names), size=nrows)
    quantity = np.random.randint(1, 11, size=nrows)
    price = np.random.randint(199, 10000, size=nrows) / 100
    # Generate random dates between 2010-01-01 and 2023-12-31
    start_date = datetime(2010, 1, 1)
    end_date = datetime(2023, 12, 31)
    date_range = (end_date - start_date).days
    # Create random dates as an np.array in string format
    order_dates = np.array(
        [
            (start_date + timedelta(days=np.random.randint(0, date_range))).strftime("%Y-%m-%d")
            for _ in range(nrows)
        ]
    )
    # Define columns
    columns = {
        "order_id": np.arange(nrows),
        "order_date": order_dates,
        "customer_id": np.random.randint(100, 1000, size=nrows),
        "customer_name": [f"Customer_{i}" for i in np.random.randint(2**15, size=nrows)],
        "product_id": product_id + 200,
        "product_names": names[product_id],
        "categories": categories[product_id],
        "quantity": quantity,
        "price": price,
        "total": price * quantity,
    }
    # Create a Polars DataFrame and write it to CSV with an explicit delimiter
    df = pl.DataFrame(columns)
    df.write_csv(filename, separator=",", include_header=True)

# Generate data with random order_date values and save to CSV
generate(100_000_000, "/mnt/d/sales_data/sales_data_100m.csv")
```
Installing WSL2 Ubuntu
Fireducks only runs under Linux, so since I normally run Windows, I'll use WSL2 Ubuntu to set up my Linux environment, but the same code should work on any Linux/Unix setup. I have a full guide on installing WSL2 Ubuntu on Windows here.
Setting up a development environment
OK, before we get to the example code, we should set up a separate development environment. That way, nothing we do will interfere with the versions of libraries, programs, and so on that we might be using for other projects.
I use Miniconda for this, but you can use whichever method suits you best.
If you want to go the Miniconda route and don't already have it, you'll need to install Miniconda first. Get it using this link.
Once the environment is created, switch to it with the activate command, then install Jupyter and the required Python libraries.
```bash
# create our test environment
(base) $ conda create -n duck_battle python=3.11 -y
# Now activate it
(base) $ conda activate duck_battle
# Install python libraries, etc ...
(duck_battle) $ pip install jupyter fireducks duckdb
```
Test 1 – Reading a large CSV file and displaying the last 10 records
DuckDB
```python
import duckdb

print(duckdb.__version__)
```

```
'1.1.3'
```
```python
# DuckDB read CSV file
#
import duckdb
import time

# Start the timer
start_time = time.time()

# Create a connection to an in-memory DuckDB database
con = duckdb.connect(':memory:')

# Create a table from the CSV file
con.execute("CREATE TABLE sales AS SELECT * FROM read_csv('/mnt/d/sales_data/sales_data_100m.csv', header=true)")

# Fetch the last 10 rows
query = "SELECT * FROM sales ORDER BY rowid DESC LIMIT 10"
df = con.execute(query).df()

# Display the last 10 rows
print("\nLast 10 rows of the file:")
print(df)

# End the timer and calculate the total elapsed time
total_elapsed_time = time.time() - start_time
print(f"DuckDB: Time taken to read the CSV file and display the last 10 records: {total_elapsed_time} seconds")
```
```
#
# DuckDB output
#
Last 10 rows of the file:
   order_id  order_date  customer_id   customer_name  product_id product_names
0  99999999  2023-06-16          102   Customer_9650         203         Chair
1  99999998  2022-03-02          709  Customer_23966         208      Notebook
2  99999997  2019-05-10          673  Customer_25709         202          Desk
3  99999996  2011-10-21          593  Customer_29352         200        Laptop
4  99999995  2011-10-24          501  Customer_29289         202          Desk
5  99999994  2023-09-27          119  Customer_15532         209  Coffee Maker
6  99999993  2015-01-15          294  Customer_27081         200        Laptop
7  99999992  2016-04-07          379   Customer_1353         207           Pen
8  99999991  2010-09-19          253  Customer_29439         204       Monitor
9  99999990  2016-05-19          174  Customer_11294         210       Cabinet

    categories  quantity  price   total
0       Office         4  59.58  238.32
1   Stationery         1  78.91   78.91
2       Office         5   9.12   45.60
3  Electronics         3  67.42  202.26
4       Office         7  53.78  376.46
5  Electronics         2  55.10  110.20
6  Electronics         9  86.01  774.09
7   Stationery         5  21.56  107.80
8  Electronics         4   5.17   20.68
9       Office         9  65.10  585.90

DuckDB: Time taken to read the CSV file and display the last 10 records: 59.23184013366699 seconds
```
Fireducks
```python
import fireducks
import fireducks.pandas as pd

print(fireducks.__version__)
print(pd.__version__)
```

```
1.1.6
2.2.3
```
```python
# Fireducks read CSV
#
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

# Path to the CSV file
file_path = "/mnt/d/sales_data/sales_data_100m.csv"

# Read the CSV file into a DataFrame
df_fire = pd.read_csv(file_path)

# Display the last 10 rows of the DataFrame
print(df_fire.tail(10))

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to read the CSV file and display the last 10 records: {elapsed_time} seconds")
```
```
#
# Fireducks output
#
          order_id  order_date  customer_id   customer_name  product_id
99999990  99999990  2016-05-19          174  Customer_11294         210
99999991  99999991  2010-09-19          253  Customer_29439         204
99999992  99999992  2016-04-07          379   Customer_1353         207
99999993  99999993  2015-01-15          294  Customer_27081         200
99999994  99999994  2023-09-27          119  Customer_15532         209
99999995  99999995  2011-10-24          501  Customer_29289         202
99999996  99999996  2011-10-21          593  Customer_29352         200
99999997  99999997  2019-05-10          673  Customer_25709         202
99999998  99999998  2022-03-02          709  Customer_23966         208
99999999  99999999  2023-06-16          102   Customer_9650         203

         product_names   categories  quantity  price   total
99999990       Cabinet       Office         9  65.10  585.90
99999991       Monitor  Electronics         4   5.17   20.68
99999992           Pen   Stationery         5  21.56  107.80
99999993        Laptop  Electronics         9  86.01  774.09
99999994  Coffee Maker  Electronics         2  55.10  110.20
99999995          Desk       Office         7  53.78  376.46
99999996        Laptop  Electronics         3  67.42  202.26
99999997          Desk       Office         5   9.12   45.60
99999998      Notebook   Stationery         1  78.91   78.91
99999999         Chair       Office         4  59.58  238.32

Fireducks: Time taken to read the CSV file and display the last 10 records: 65.69259881973267 seconds
```
There's not much in it; DuckDB leads by around six seconds.
Test 2 – Calculating total sales by category
DuckDB
```python
# duckdb process data
#
import duckdb
import time

query_sql = """
SELECT
    categories,
    SUM(total) AS total_sales
FROM sales
GROUP BY categories
ORDER BY total_sales DESC
"""

# Start the timer
start_time = time.time()

# 1. Total sales by category
results = con.execute(query_sql).df()
print(f"DuckDB: Time for sales by category calculation: {time.time() - start_time} seconds")
results
```
```
#
# DuckDB output
#
DuckDB: Time for sales by category calculation: 0.1401681900024414 seconds

    categories   total_sales
0  Electronics  1.168493e+10
1   Stationery  7.014109e+09
2       Office  7.006807e+09
3       Sundry  2.338428e+09
```
Fireducks
```python
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

total_sales_by_category = df_fire.groupby('categories')['total'].sum().sort_values(ascending=False)
print(total_sales_by_category)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate sales by category: {elapsed_time} seconds")
```
```
#
# Fireducks output
#
categories
Electronics    1.168493e+10
Stationery     7.014109e+09
Office         7.006807e+09
Sundry         2.338428e+09
Name: total, dtype: float64
Fireducks: Time taken to calculate sales by category: 0.13571524620056152 seconds
```
Not much difference there either. Fireducks shades it.
Test 3 – Top 5 customers by spend
DuckDB
```python
# duckdb process data
#
import duckdb
import time

query_sql = """
SELECT
    customer_id,
    customer_name,
    SUM(total) AS total_purchase
FROM sales
GROUP BY customer_id, customer_name
ORDER BY total_purchase DESC
LIMIT 5
"""

# Start the timer
start_time = time.time()

# 1. Top 5 customers by total purchase
results = con.execute(query_sql).df()
print(f"DuckDB: Time to calculate top 5 customers: {time.time() - start_time} seconds")
results
```
```
#
# DuckDB output
#
DuckDB: Time to calculate top 5 customers: 1.4588654041290283 seconds

   customer_id   customer_name  total_purchase
0          681  Customer_20387         6892.96
1          740  Customer_30499         6613.11
2          389  Customer_22686         6597.35
3          316    Customer_185         6565.38
4          529   Customer_1609         6494.35
```
Fireducks
```python
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

top_5_customers = df_fire.groupby(['customer_id', 'customer_name'])['total'].sum().sort_values(ascending=False).head(5)
print(top_5_customers)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time taken to calculate top 5 customers: {elapsed_time} seconds")
```
```
#
# Fireducks output
#
customer_id  customer_name
681          Customer_20387    6892.96
740          Customer_30499    6613.11
389          Customer_22686    6597.35
316          Customer_1859     6565.38
529          Customer_1609     6494.35
Name: total, dtype: float64
Fireducks: Time taken to calculate top 5 customers: 2.823930263519287 seconds
```
DuckDB wins this one, coming in almost twice as fast as Fireducks.
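As a side note, the pandas API (and therefore Fireducks) also offers `nlargest`, which keeps only the top k entries instead of fully sorting the summed series; whether it actually narrows the gap here would need measuring. A small sketch with made-up data:

```python
import pandas as pd

# Tiny made-up data standing in for the sales DataFrame
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "total": [5.0, 7.0, 3.0, 9.0, 1.0],
})

# nlargest(k) avoids sorting the whole series just to keep k rows;
# with fireducks.pandas the call is identical
top2 = df.groupby("customer_id")["total"].sum().nlargest(2)
print(top2)
```

Here customer 1 totals 12.0 and customer 3 totals 10.0, so those two come back, largest first.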
Test 4 – Monthly sales figures
DuckDB
```python
import duckdb
import time

query_sql = """
SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total) AS monthly_sales
FROM sales
GROUP BY month
ORDER BY month
"""

# Start the timer
start_time = time.time()

# 1. Monthly sales
results = con.execute(query_sql).df()
print(f"DuckDB: Time for seasonal trend calculation: {time.time() - start_time} seconds")
results
```
```
#
# DuckDB output
#
DuckDB: Time for seasonal trend calculation: 0.16109275817871094 seconds

          month  monthly_sales
0    2010-01-01   1.699500e+08
1    2010-02-01   1.535730e+08
2    2010-03-01   1.702968e+08
3    2010-04-01   1.646421e+08
4    2010-05-01   1.704506e+08
...         ...            ...
163  2023-08-01   1.699263e+08
164  2023-09-01   1.646018e+08
165  2023-10-01   1.692184e+08
166  2023-11-01   1.644883e+08
167  2023-12-01   1.643962e+08

168 rows × 2 columns
```
Fireducks
```python
import fireducks.pandas as pd
import time

def seasonal_trend():
    # Ensure 'order_date' is datetime
    df_fire['order_date'] = pd.to_datetime(df_fire['order_date'])
    # Extract 'month' as a string
    df_fire['month'] = df_fire['order_date'].dt.strftime('%Y-%m')
    # Group by 'month' and sum 'total'
    results = (
        df_fire.groupby('month')['total']
        .sum()
        .reset_index()
        .sort_values('month')
    )
    print(results)

# Start the timer
start_time = time.time()
seasonal_trend()

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time for seasonal trend calculation: {elapsed_time} seconds")
```
```
#
# Fireducks output
#
       month         total
0    2010-01  1.699500e+08
1    2010-02  1.535730e+08
2    2010-03  1.702968e+08
3    2010-04  1.646421e+08
4    2010-05  1.704506e+08
..       ...           ...
163  2023-08  1.699263e+08
164  2023-09  1.646018e+08
165  2023-10  1.692184e+08
166  2023-11  1.644883e+08
167  2023-12  1.643962e+08

[168 rows x 2 columns]
Fireducks: Time for seasonal trend calculation: 3.109074354171753 seconds
```
DuckDB is significantly faster in this example.
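For what it's worth, the pandas API also offers `dt.to_period('M')` for month bucketing, which skips building strings with `strftime`; whether it closes the gap under Fireducks would need benchmarking. A small sketch with made-up data (plain pandas standing in for the API-compatible fireducks.pandas):

```python
import pandas as pd

# Tiny made-up data standing in for the sales DataFrame
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2010-01-05", "2010-01-20", "2010-02-01"]),
    "total": [10.0, 5.0, 7.0],
})

# dt.to_period('M') buckets rows by calendar month without string formatting,
# which is usually cheaper than strftime('%Y-%m')
monthly = df.groupby(df["order_date"].dt.to_period("M"))["total"].sum()
print(monthly)
```

The resulting index holds `Period` values such as `2010-01`, which render the same way the strftime strings did.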
Test 5 – Average order value by product
DuckDB
```python
import duckdb
import time

query_sql = """
SELECT
    product_id,
    product_names,
    AVG(total) AS avg_order_value
FROM sales
GROUP BY product_id, product_names
ORDER BY avg_order_value DESC
"""

# Start the timer
start_time = time.time()

# 1. Average order value by product
results = con.execute(query_sql).df()
print(f"DuckDB: Time for average order by product calculation: {time.time() - start_time} seconds")
results
```
```
#
# DuckDB output
#
DuckDB: Time for average order by product calculation: 0.13720130920410156 seconds

    product_id product_names  avg_order_value
0          206         Paper       280.529144
1          208      Notebook       280.497268
2          201    Smartphone       280.494779
3          207           Pen       280.491508
4          205       Printer       280.470150
5          200        Laptop       280.456913
6          209  Coffee Maker       280.445365
7          211  Plastic Cups       280.440161
8          210       Cabinet       280.426960
9          202          Desk       280.367135
10         203         Chair       280.364045
11         204       Monitor       280.329706
```
Fireducks
```python
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

avg_order_value = df_fire.groupby(['product_id', 'product_names'])['total'].mean().sort_values(ascending=False)
print(avg_order_value)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Fireducks: Time for average order calculation: {elapsed_time} seconds")
```
```
#
# Fireducks output
#
product_id  product_names
206         Paper            280.529144
208         Notebook         280.497268
201         Smartphone       280.494779
207         Pen              280.491508
205         Printer          280.470150
200         Laptop           280.456913
209         Coffee Maker     280.445365
211         Plastic Cups     280.440161
210         Cabinet          280.426960
202         Desk             280.367135
203         Chair            280.364045
204         Monitor          280.329706
Name: total, dtype: float64
Fireducks: Time for average order calculation: 0.06766319274902344 seconds
```
Fireducks takes the win there, twice as fast as DuckDB.
Test 6 – Year-over-year sales growth
DuckDB
```python
import duckdb
import time

query_sql = """
WITH yearly_sales AS (
    SELECT
        EXTRACT(YEAR FROM order_date) AS year,
        SUM(total) AS total_sales
    FROM sales
    GROUP BY year
)
SELECT
    year,
    total_sales,
    LAG(total_sales) OVER (ORDER BY year) AS prev_year_sales,
    (total_sales - LAG(total_sales) OVER (ORDER BY year)) / LAG(total_sales) OVER (ORDER BY year) * 100 AS yoy_growth
FROM yearly_sales
ORDER BY year
"""

# Start the timer
start_time = time.time()

# 1. Year-over-year sales growth
results = con.execute(query_sql).df()
print(f"DuckDB: Time for product performance analysis calculation: {time.time() - start_time} seconds")
results
```
```
#
# DuckDB output
#
Time for product performance analysis calculation: 0.03958845138549805 seconds

    year   total_sales  prev_year_sales  yoy_growth
0   2010  2.002066e+09              NaN         NaN
1   2011  2.002441e+09     2.002066e+09    0.018739
2   2012  2.008966e+09     2.002441e+09    0.325848
3   2013  2.002901e+09     2.008966e+09   -0.301900
4   2014  2.000773e+09     2.002901e+09   -0.106225
5   2015  2.001931e+09     2.000773e+09    0.057855
6   2016  2.008762e+09     2.001931e+09    0.341229
7   2017  2.002164e+09     2.008762e+09   -0.328457
8   2018  2.002383e+09     2.002164e+09    0.010927
9   2019  2.002891e+09     2.002383e+09    0.025383
10  2020  2.008585e+09     2.002891e+09    0.284318
11  2021  2.000244e+09     2.008585e+09   -0.415281
12  2022  2.004500e+09     2.000244e+09    0.212756
13  2023  1.995672e+09     2.004500e+09   -0.440401
```
Fireducks
```python
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()

df_fire['year'] = pd.to_datetime(df_fire['order_date']).dt.year
yearly_sales = df_fire.groupby('year')['total'].sum().sort_index()
yoy_growth = yearly_sales.pct_change() * 100
result = pd.DataFrame({
    'year': yearly_sales.index,
    'total_sales': yearly_sales.values,
    'prev_year_sales': yearly_sales.shift().values,
    'yoy_growth': yoy_growth.values
})
print(result)

# End the timer and calculate the elapsed time
elapsed_time = time.time() - start_time
print(f"Time for product performance analysis calculation: {elapsed_time} seconds")
```
```
#
# Fireducks output
#
    year   total_sales  prev_year_sales  yoy_growth
0   2010  2.002066e+09              NaN         NaN
1   2011  2.002441e+09     2.002066e+09    0.018739
2   2012  2.008966e+09     2.002441e+09    0.325848
3   2013  2.002901e+09     2.008966e+09   -0.301900
4   2014  2.000773e+09     2.002901e+09   -0.106225
5   2015  2.001931e+09     2.000773e+09    0.057855
6   2016  2.008762e+09     2.001931e+09    0.341229
7   2017  2.002164e+09     2.008762e+09   -0.328457
8   2018  2.002383e+09     2.002164e+09    0.010927
9   2019  2.002891e+09     2.002383e+09    0.025383
10  2020  2.008585e+09     2.002891e+09    0.284318
11  2021  2.000244e+09     2.008585e+09   -0.415281
12  2022  2.004500e+09     2.000244e+09    0.212756
13  2023  1.995672e+09     2.004500e+09   -0.440401

Time for product performance analysis calculation: 0.17495489120483398 seconds
```
DuckDB was the faster of the two this time.
Test 7 – Adding a new column to the dataset and updating its values
DuckDB
```python
import duckdb
import time

# Start the timer
start_time = time.time()

# Add the new column
con.execute("""
ALTER TABLE sales ADD COLUMN total_with_tax FLOAT
""")

# Perform the calculations and update the table
con.execute("""
UPDATE sales
SET total_with_tax = CASE
    WHEN total <= 100 THEN total * 1.125                  -- 12.5% tax
    WHEN total > 100 AND total <= 200 THEN total * 1.15   -- 15% tax
    WHEN total > 200 AND total <= 500 THEN total * 1.17   -- 17% tax
    WHEN total > 500 THEN total * 1.20                    -- 20% tax
END;
""")

print(f"Time to add new column: {time.time() - start_time} seconds")

# Verify the new column
result = con.execute("""
SELECT
    *
FROM sales
LIMIT 10;
""").fetchdf()
print(result)
```
```
#
# DuckDB output
#
Time to add new column: 2.4016575813293457 seconds

   order_id  order_date  customer_id   customer_name  product_id product_names
0         0  2021-11-25          238  Customer_25600         211  Plastic Cups
1         1  2017-06-10          534  Customer_14188         209  Coffee Maker
2         2  2010-02-15          924  Customer_14013         207           Pen
3         3  2011-01-26          633   Customer_6120         211  Plastic Cups
4         4  2014-01-11          561   Customer_1352         205       Printer
5         5  2021-04-19          533   Customer_5342         208      Notebook
6         6  2012-03-14          684  Customer_21604         207           Pen
7         7  2017-07-01          744  Customer_30291         201    Smartphone
8         8  2013-02-13          678  Customer_32618         204       Monitor
9         9  2023-01-04          340  Customer_16898         207           Pen

    categories  quantity  price   total  total_with_tax
0       Sundry         2  99.80  199.60      229.539993
1  Electronics         8   7.19   57.52       64.709999
2   Stationery         6  70.98  425.88      498.279602
3       Sundry         6  94.38  566.28      679.536011
4  Electronics         4  44.68  178.72      205.528000
5   Stationery         4  21.85   87.40       98.324997
6   Stationery         3  93.66  280.98      328.746613
7  Electronics         6  39.41  236.46      276.658203
8  Electronics         2   4.30    8.60        9.675000
9   Stationery         2   6.67   13.34       15.007500
```
Fireducks
```python
import numpy as np
import time
import fireducks.pandas as pd

# Start the timer
start_time = time.time()

# Define tax rate conditions and choices
conditions = [
    (df_fire['total'] <= 100),
    (df_fire['total'] > 100) & (df_fire['total'] <= 200),
    (df_fire['total'] > 200) & (df_fire['total'] <= 500),
    (df_fire['total'] > 500)
]
choices = [1.125, 1.15, 1.17, 1.20]

# Calculate total_with_tax using np.select for efficiency
df_fire['total_with_tax'] = df_fire['total'] * np.select(conditions, choices)

# Print total runtime
print(f"Fireducks: Time to add new column: {time.time() - start_time} seconds")
print(df_fire)
```
```
#
# Fireducks output
#
Fireducks: Time to add new column: 2.7112433910369873 seconds

          order_id  order_date  customer_id   customer_name  product_id
0                0  2021-11-25          238  Customer_25600         211
1                1  2017-06-10          534  Customer_14188         209
2                2  2010-02-15          924  Customer_14013         207
3                3  2011-01-26          633   Customer_6120         211
4                4  2014-01-11          561   Customer_1352         205
...            ...         ...          ...             ...         ...
99999995  99999995  2011-10-24          501  Customer_29289         202
99999996  99999996  2011-10-21          593  Customer_29352         200
99999997  99999997  2019-05-10          673  Customer_25709         202
99999998  99999998  2022-03-02          709  Customer_23966         208
99999999  99999999  2023-06-16          102   Customer_9650         203

         product_names   categories  quantity  price   total    month  year
0         Plastic Cups       Sundry         2  99.80  199.60  2021-11  2021
1         Coffee Maker  Electronics         8   7.19   57.52  2017-06  2017
2                  Pen   Stationery         6  70.98  425.88  2010-02  2010
3         Plastic Cups       Sundry         6  94.38  566.28  2011-01  2011
4              Printer  Electronics         4  44.68  178.72  2014-01  2014
...                ...          ...       ...    ...     ...      ...   ...
99999995          Desk       Office         7  53.78  376.46  2011-10  2011
99999996        Laptop  Electronics         3  67.42  202.26  2011-10  2011
99999997          Desk       Office         5   9.12   45.60  2019-05  2019
99999998      Notebook   Stationery         1  78.91   78.91  2022-03  2022
99999999         Chair       Office         4  59.58  238.32  2023-06  2023

          total_with_tax
0              229.54000
1               64.71000
2              498.27960
3              679.53600
4              205.52800
...                  ...
99999995       440.45820
99999996       236.64420
99999997        51.30000
99999998        88.77375
99999999       278.83440

[100000000 rows x 13 columns]
```
Their run times were very similar; it's a tie again.
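One thing worth knowing about `np.select`: rows that match none of the conditions fall back to the `default` argument, which is 0 unless you say otherwise, so a NaN total would silently pick up a zero tax rate. A small sketch with made-up values showing an explicit default:

```python
import numpy as np

# Made-up totals, one in each tax band
total = np.array([50.0, 150.0, 300.0, 600.0])

conditions = [
    total <= 100,
    (total > 100) & (total <= 200),
    (total > 200) & (total <= 500),
    total > 500,
]
choices = [1.125, 1.15, 1.17, 1.20]

# An explicit default (NaN here) makes unmatched rows visible
# instead of silently multiplying them by 0
rate = np.select(conditions, choices, default=np.nan)
print(total * rate)
```

With the bands above covering every non-NaN value, the default never fires here, but it is cheap insurance on messier data.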
Test 8 – Writing the updated data to a CSV file
DuckDB
```python
import time

# Start the timer
start_time = time.time()

# Write the modified sales table to a CSV file
con.execute("""
COPY (SELECT * FROM sales) TO '/mnt/d/sales_data/final_sales_data_duckdb.csv' WITH (HEADER TRUE, DELIMITER ',')
""")
print(f"DuckDB: Time to write CSV to file: {time.time() - start_time} seconds")
```
```
DuckDB: Time to write CSV to file: 54.899176597595215 seconds
```
Fireducks
```python
# fireducks write data back to CSV
#
import fireducks.pandas as pd
import time

# Tidy up the DataFrame before writing it out
cols_to_drop = ['year', 'month']
df_fire = df_fire.drop(columns=cols_to_drop)
df_fire['total_with_tax'] = df_fire['total_with_tax'].round(2)
df_fire['order_date'] = df_fire['order_date'].dt.date

# Start the timer
start_time = time.time()
df_fire.to_csv('/mnt/d/sales_data/fireducks_sales.csv', quoting=0, index=False)

# Print total runtime
print(f"Fireducks: Time to write CSV to file: {time.time() - start_time} seconds")
```
```
Fireducks: Time to write CSV to file: 54.490307331085205 seconds
```
Another dead heat.
Test 9 – Writing the updated data to a Parquet file
DuckDB
```python
# DuckDB write Parquet data
#
import time

# Start the timer
start_time = time.time()

# Write the modified sales table to a Parquet file
con.execute("COPY sales TO '/mnt/d/sales_data/final_sales_data_duckdb.parquet' (FORMAT 'parquet');")
print(f"DuckDB: Time to write parquet to file: {time.time() - start_time} seconds")
```
```
DuckDB: Time to write parquet to file: 30.011869192123413 seconds
```
Fireducks
```python
import fireducks.pandas as pd
import time

# Start the timer
start_time = time.time()
df_fire.to_parquet('/mnt/d/sales_data/fireducks_sales.parquet')

# Print total runtime
print(f"Fireducks: Time to write Parquet to file: {time.time() - start_time} seconds")
```
```
Fireducks: Time to write Parquet to file: 86.29632377624512 seconds
```
This is the first major difference in run times. Fireducks took almost a minute longer than DuckDB to write the data out to Parquet.
Summary
So, what should we make of all this? Simply put, there isn't a huge amount between these two libraries. Both are seriously fast and well capable of handling very large datasets. Once your data is in memory, whether in a DuckDB table or a Fireducks DataFrame, either library can process it at blistering speed.
Which one you choose comes down to your existing infrastructure and skill set.
If you're a database person, DuckDB is the obvious choice, since your SQL skills will transfer immediately.
Alternatively, if you're already deeply embedded in the Pandas world, Fireducks would be an excellent option.
_OK, that's all from me for now. I hope you found this article useful. If you did, please visit my profile page at this link. From there, you can see my other published stories, and follow me or subscribe to be notified when I publish new content._
If you liked this content, you might find these other articles interesting, too.
