DuckDB in Python Data Pipelines: Why In-Process Analytics Is Replacing Your Local Database

If you’ve been building data pipelines in Python for any length of time, you’ve probably gone through the same evolution most data engineers follow: start with pandas, hit a memory wall when the dataset outgrows RAM, spin up PostgreSQL for what should be a simple analysis, and wonder why a 50-row aggregation requires a database server.

DuckDB short-circuits that entire progression. It’s an in-process analytical database — think of it as SQLite, but designed for analytical queries (OLAP) instead of transactional ones (OLTP). You import it like any Python library, point it at your data files, and run SQL. No server to install, no connection string to manage, no Docker container eating 200MB of RAM.

Why DuckDB Fits Python Pipelines Better Than pandas

The comparison isn’t really fair to pandas, which was never designed for the workloads people now throw at it. pandas loads entire datasets into memory. DuckDB uses a columnar execution engine that processes data in chunks, meaning it can query a 50GB Parquet file on a laptop with 8GB of RAM — something pandas simply cannot do.

Here’s the practical difference in a typical pipeline:

import duckdb

# Query a Parquet file directly — no loading into memory first
result = duckdb.query("""
    SELECT category, AVG(price) as avg_price, COUNT(*) as count
    FROM read_parquet('data/sales/*.parquet')
    WHERE date >= '2026-01-01'
    GROUP BY category
    ORDER BY avg_price DESC
""").fetchdf()

That single query does what would otherwise require: loading the Parquet files into pandas, filtering, grouping, aggregating, and sorting — each step creating intermediate DataFrames that consume memory. DuckDB optimizes the entire query plan before executing it, pushing filters down to the file scan and only materializing the final result.

Real Pipeline Patterns

Pattern 1: Multi-File Joins

One of DuckDB’s strongest use cases is joining data that lives in separate files — a scenario that’s awkward in pandas and overkill for a database server.

result = duckdb.query("""
    SELECT o.order_id, c.customer_name, SUM(oi.quantity * oi.unit_price) as total
    FROM read_csv('orders.csv') o
    JOIN read_csv('customers.csv') c ON o.customer_id = c.customer_id
    JOIN read_parquet('order_items.parquet') oi ON o.order_id = oi.order_id
    GROUP BY o.order_id, c.customer_name
""")

DuckDB automatically infers schemas from CSV and Parquet files, handles type coercion, and chooses join algorithms based on data size. The read_csv and read_parquet functions are table-valued functions — they make files queryable without importing them first.

Pattern 2: Incremental Processing

For pipelines that run on a schedule, DuckDB’s ability to persist data to a local database file makes incremental updates straightforward:

# Connect to a persistent database file (not in-memory)
con = duckdb.connect('pipeline.duckdb')

# Append today's data to an existing table
con.execute("""
    INSERT INTO daily_metrics
    SELECT * FROM read_parquet('data/daily/2026-06-20.parquet')
""")

# Run aggregation against the full history
result = con.execute("""
    SELECT date, metric_name, AVG(value) as rolling_avg
    FROM daily_metrics
    WHERE date >= current_date - INTERVAL 30 DAY
    GROUP BY date, metric_name
""").fetchdf()

The persistent .duckdb file stores data in DuckDB’s native columnar format, which compresses well and scans fast. For most small-to-medium pipelines, this replaces the need for a separate PostgreSQL instance entirely.

Pattern 3: Python UDFs in SQL

When SQL isn’t expressive enough, DuckDB lets you register Python functions and call them from SQL queries:

import duckdb
import numpy as np

con = duckdb.connect()

# Register a Python function as a SQL function
con.create_function("percentile_rank", 
    lambda x: float(np.percentile(x, 75)),
    ['DOUBLE'], 'DOUBLE')

result = con.execute("""
    SELECT category, percentile_rank(price) as p75_price
    FROM products
    GROUP BY category
""").fetchdf()

This is especially useful for statistical functions that don’t have native SQL equivalents — things like interquartile range calculations, custom aggregation logic, or calling scikit-learn models from within a query.

Performance Compared to Alternatives

To put DuckDB’s speed in context, consider a simple benchmark: querying a 10 million row Parquet file with a GROUP BY and aggregation. On a typical laptop:

Tool	Time	Memory Peak
pandas	12.3s	3.2 GB
DuckDB	0.8s	180 MB
SQLite (imported)	15.1s	2.8 GB
PostgreSQL (local)	2.4s	450 MB

DuckDB wins not because it’s fundamentally faster at computation — it’s because it doesn’t waste time and memory loading data you don’t need. The columnar execution model means it reads only the columns referenced in your query, skips rows filtered out by WHERE clauses, and processes data in vectorized batches that make efficient use of CPU caches.

For data pipelines that run repeatedly on the same dataset, this performance difference compounds. A daily ETL job that takes 12 seconds in pandas drops to under a second in DuckDB. Over a year, that’s hours of compute time saved — and on cloud infrastructure where you pay per second, real money.

DuckDB is not a replacement for every database. It’s single-threaded for writes, so concurrent write workloads will bottleneck. It’s not designed for high-frequency transactional operations — if you need thousands of INSERTs per second, use PostgreSQL. And it’s not a distributed system — when your data grows beyond what a single machine can handle, you’ll need to look at Spark or a cloud data warehouse.

But for the vast majority of Python data pipelines — the ones that read files, transform data, produce reports, and run on a schedule — DuckDB covers the ground between pandas and a full database server. It’s the tool you reach for when you need SQL-speed analytics but don’t want the infrastructure overhead.

The fact that it installs with a single pip install duckdb and requires zero configuration is almost unfair.

DuckDB in Python Data Pipelines: Why In-Process Analytics Is Replacing Your Local Database

Why DuckDB Fits Python Pipelines Better Than pandas

Real Pipeline Patterns

Pattern 1: Multi-File Joins

Pattern 2: Incremental Processing

Pattern 3: Python UDFs in SQL

Performance Compared to Alternatives

Leave a comment

No comments yet

Why DuckDB Fits Python Pipelines Better Than pandas

Real Pipeline Patterns

Pattern 1: Multi-File Joins

Pattern 2: Incremental Processing

Pattern 3: Python UDFs in SQL

Performance Compared to Alternatives

Share this guide

Leave a comment

No comments yet

Related Articles

DuckDB in Python: A Complete Data Analytics Guide for 2026

Polars vs Pandas: The Complete Migration Guide

Narwhals: Write DataFrame Code Once, Run It Anywhere in Python