duckdb
duckdb 的限制
- 可以多线程并发读, 但不能多线程并发写数据库.
duckdb 的作用
- 数据交换格式, 尤其适合于用于较大的数据传输, 比csv格式更好, 有主外键约束, 有非空约束, 每列都有强数据类型, 避免出现脏数据, 列式数据库文件压缩效果好
- 数据处理引擎, 可以读写csv/json/parquet文件, 甚至支持RDBMS读写, 然后利用duckdb强大的SQL特性, 进行数据分析和处理.
- 数据探索工具, 可以非常容易集成到Jupyter Notebook中, 可以从Python DataFrame对象读取, 然后利用强大的SQL特性, 进行数据分析. 也可以将sql的结果回写到 DataFrame, 可以同时获得SQL和DataFrame的数据探索的优势.
duckdb 读取 DataFrame 的示例
import duckdb
import pandas as pd
# Create a DataFrame, in this case using Pandas
my_df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# Query it directly with SQL - no explicit conversion needed
result = duckdb.sql("SELECT * FROM my_df WHERE a > 1")
将 duckdb 转成 DataFrame 并可视化
import duckdb
import polars as pl
import plotly.express as px
# Use SQL to load data and apply a complex aggregation
regional_summary = duckdb.query("""
SELECT
region,
SUM(sales) as total_sales,
COUNT(DISTINCT customer_id) as customer_count,
SUM(sales) / COUNT(DISTINCT customer_id) as sales_per_customer
FROM read_csv('sales_data*.csv')
WHERE sale_date >= '2024-01-01'
GROUP BY region
ORDER BY total_sales DESC
""").pl()
# Use summarised data in Polars for visualization
fig = px.bar(
regional_summary,
x="region",
y="sales_per_customer",
)
fig.show()
参考
https://endjin.com/blog/2025/04/duckdb-in-depth-how-it-works-what-makes-it-fast
https://endjin.com/blog/2025/04/duckdb-in-practice-enterprise-integration-architectural-patterns
https://www.timestored.com/data/duckdb/
https://github.com/davidgasquez/awesome-duckdb
https://datawise.dev/a-portable-data-stack-with-dagster-docker-duckdb-dbt-and-superset