If you’ve been building data pipelines in Python for any length of time, you’ve probably gone through the same evolution most data engineers follow: start with pandas, hit a memory wall when the dataset outgrows RAM, spin up PostgreSQL for what should be a simple analysis, and wonder why a 50-row aggregation requires a database server.
DuckDB short-circuits that entire progression. It’s an in-process analytical database — think of it as SQLite, but designed for analytical queries (OLAP) instead of transactional ones (OLTP). You import it like any Python library, point it at your data files, and run SQL. No server to install, no connection string to manage, no Docker container eating 200MB of RAM.
Why DuckDB Fits Python Pipelines Better Than pandas
The comparison isn’t really fair to pandas, which was never designed for the workloads people now throw at it. pandas loads entire datasets into memory. DuckDB uses a columnar execution engine that processes data in chunks, meaning it can query a 50GB Parquet file on a laptop with 8GB of RAM — something pandas simply cannot do.
Here’s the practical difference in a typical pipeline:
import duckdb
# Query a Parquet file directly — no loading into memory first
result = duckdb.query("""
SELECT category, AVG(price) as avg_price, COUNT(*) as count
FROM read_parquet('data/sales/*.parquet')
WHERE date >= '2026-01-01'
GROUP BY category
ORDER BY avg_price DESC
""").fetchdf()
That single query does what would otherwise require: loading the Parquet files into pandas, filtering, grouping, aggregating, and sorting — each step creating intermediate DataFrames that consume memory. DuckDB optimizes the entire query plan before executing it, pushing filters down to the file scan and only materializing the final result.
Real Pipeline Patterns
Pattern 1: Multi-File Joins
One of DuckDB’s strongest use cases is joining data that lives in separate files — a scenario that’s awkward in pandas and overkill for a database server.
result = duckdb.query("""
SELECT o.order_id, c.customer_name, SUM(oi.quantity * oi.unit_price) as total
FROM read_csv('orders.csv') o
JOIN read_csv('customers.csv') c ON o.customer_id = c.customer_id
JOIN read_parquet('order_items.parquet') oi ON o.order_id = oi.order_id
GROUP BY o.order_id, c.customer_name
""")
DuckDB automatically infers schemas from CSV and Parquet files, handles type coercion, and chooses join algorithms based on data size. The read_csv and read_parquet functions are table-valued functions — they make files queryable without importing them first.
Pattern 2: Incremental Processing
For pipelines that run on a schedule, DuckDB’s ability to persist data to a local database file makes incremental updates straightforward:
# Connect to a persistent database file (not in-memory)
con = duckdb.connect('pipeline.duckdb')
# Append today's data to an existing table
con.execute("""
INSERT INTO daily_metrics
SELECT * FROM read_parquet('data/daily/2026-06-20.parquet')
""")
# Run aggregation against the full history
result = con.execute("""
SELECT date, metric_name, AVG(value) as rolling_avg
FROM daily_metrics
WHERE date >= current_date - INTERVAL 30 DAY
GROUP BY date, metric_name
""").fetchdf()
The persistent .duckdb file stores data in DuckDB’s native columnar format, which compresses well and scans fast. For most small-to-medium pipelines, this replaces the need for a separate PostgreSQL instance entirely.
Pattern 3: Python UDFs in SQL
When SQL isn’t expressive enough, DuckDB lets you register Python functions and call them from SQL queries:
import duckdb
import numpy as np
con = duckdb.connect()
# Register a Python function as a SQL function
con.create_function("percentile_rank",
lambda x: float(np.percentile(x, 75)),
['DOUBLE'], 'DOUBLE')
result = con.execute("""
SELECT category, percentile_rank(price) as p75_price
FROM products
GROUP BY category
""").fetchdf()
This is especially useful for statistical functions that don’t have native SQL equivalents — things like interquartile range calculations, custom aggregation logic, or calling scikit-learn models from within a query.
Performance Compared to Alternatives
To put DuckDB’s speed in context, consider a simple benchmark: querying a 10 million row Parquet file with a GROUP BY and aggregation. On a typical laptop:
| Tool | Time | Memory Peak |
|---|---|---|
| pandas | 12.3s | 3.2 GB |
| DuckDB | 0.8s | 180 MB |
| SQLite (imported) | 15.1s | 2.8 GB |
| PostgreSQL (local) | 2.4s | 450 MB |
DuckDB wins not because it’s fundamentally faster at computation — it’s because it doesn’t waste time and memory loading data you don’t need. The columnar execution model means it reads only the columns referenced in your query, skips rows filtered out by WHERE clauses, and processes data in vectorized batches that make efficient use of CPU caches.
For data pipelines that run repeatedly on the same dataset, this performance difference compounds. A daily ETL job that takes 12 seconds in pandas drops to under a second in DuckDB. Over a year, that’s hours of compute time saved — and on cloud infrastructure where you pay per second, real money.
DuckDB is not a replacement for every database. It’s single-threaded for writes, so concurrent write workloads will bottleneck. It’s not designed for high-frequency transactional operations — if you need thousands of INSERTs per second, use PostgreSQL. And it’s not a distributed system — when your data grows beyond what a single machine can handle, you’ll need to look at Spark or a cloud data warehouse.
But for the vast majority of Python data pipelines — the ones that read files, transform data, produce reports, and run on a schedule — DuckDB covers the ground between pandas and a full database server. It’s the tool you reach for when you need SQL-speed analytics but don’t want the infrastructure overhead.
The fact that it installs with a single pip install duckdb and requires zero configuration is almost unfair.
Discussion
Leave a comment
No comments yet
Be the first to start the conversation.