Most Python data work follows the same pattern: load data into a DataFrame, filter a few columns, group by something, maybe join two tables. It works fine until the data gets big enough to make pandas choke.
That’s where DuckDB changes things. It runs SQL queries directly inside your Python process with zero server setup. You install it with pip install duckdb and start querying. For many analytical tasks, it outperforms pandas by a wide margin and can even rival distributed frameworks like Spark or Dask for datasets up to a few gigabytes.
What Is DuckDB and Why It Matters
DuckDB is an in-process SQL OLAP database. Unlike PostgreSQL or MySQL, it doesn’t run as a separate service. It lives inside your Python interpreter, sharing the same memory space. Think of it as SQLite, but built for analytical queries instead of transactional ones.
The architecture matters because it eliminates network overhead. No connection pooling, no serialization between processes, no server configuration. You write SQL against your data and get results back as Python objects. The entire query engine runs in C++ and processes data in vectors of 1,024 values at a time, which is why it handles large aggregations so quickly.
Getting Started
Installation takes one command:
pip install duckdb
That’s it. No server to start, no Docker containers, no environment variables. Here’s your first query:
import duckdb
# Run SQL directly on a Python object
result = duckdb.sql("""
SELECT species, AVG(body_mass_g) as avg_mass
FROM 'penguins.parquet'
GROUP BY species
ORDER BY avg_mass DESC
""").show()
DuckDB can read Parquet, CSV, and JSON files directly. No need to load them into memory first. It uses predicate pushdown to read only the columns and rows your query actually needs.
Querying Parquet Files Directly
One of DuckDB’s strongest features is its ability to query Parquet files without loading them into memory:
import duckdb
# Query a single file
result = duckdb.sql("""
SELECT region, COUNT(*) as sales_count, SUM(amount) as total
FROM 'sales_2026.parquet'
WHERE date >= '2026-01-01'
GROUP BY region
""").fetchdf()
# Query multiple files with a glob pattern
result = duckdb.sql("""
SELECT month, SUM(revenue)
FROM 'data/2026/month_*.parquet'
GROUP BY month
""").fetchdf()
This works because Parquet stores data in columnar chunks with metadata about min/max values. DuckDB reads the metadata first, then skips entire file sections that don’t match your filters. A 10 GB file with a date filter might only require reading 200 MB of actual data.
Working with DuckDB’s Persistent Storage
For workflows that run repeatedly, DuckDB offers a file-based database that persists between sessions:
import duckdb
# Connect to a database file (creates it if it doesn't exist)
conn = duckdb.connect('analytics.duckdb')
# Create a table from a Parquet file
conn.execute("""
CREATE TABLE sales AS
SELECT * FROM 'sales_2026.parquet'
""")
# Create a regular table with inserts
conn.execute("""
CREATE TABLE metadata (
key VARCHAR,
value VARCHAR
)
""")
conn.execute("INSERT INTO metadata VALUES ('last_update', '2026-06-13')")
# Query the persisted data
result = conn.execute("""
SELECT s.region, m.value as last_update
FROM sales s
CROSS JOIN metadata m
GROUP BY s.region, m.value
""").fetchdf()
conn.close()
When you reconnect to analytics.duckdb later, the sales and metadata tables are still there with all their data. This is useful for intermediate results that take time to compute but get reused across multiple analysis sessions.
Performance: When DuckDB Beats Pandas
The difference becomes clear when data gets large. DuckDB’s vectorized engine processes data in chunks of 1,024 values, running C++ code without Python overhead. Benchmarks consistently show DuckDB outperforming pandas on group-by operations, joins, and aggregations for datasets over a few million rows.
Consider a simple aggregation:
import duckdb
import pandas as pd
# With pandas — loads entire file into memory
df = pd.read_parquet('large_dataset.parquet')
result = df.groupby('category')['value'].sum()
# With DuckDB — streams data, only reads needed columns
result = duckdb.sql("""
SELECT category, SUM(value)
FROM 'large_dataset.parquet'
GROUP BY category
""").fetchdf()
The pandas version loads every column into memory before grouping. DuckDB reads only category and value columns, skipping the rest. On a 5 GB Parquet file with 50 columns, that’s the difference between loading 5 GB and loading 200 MB.
DuckDB Meets Polars: Two Complementary Tools
DuckDB and Polars solve overlapping but distinct problems. Polars is a DataFrame library with a lazy execution engine: you build an execution plan, then execute it. DuckDB is a SQL engine that happens to run inside your Python process.
Many teams use both:
import duckdb
import polars as pl
# Use DuckDB for complex SQL joins across multiple files
joined = duckdb.sql("""
SELECT c.customer_name, o.order_id, o.total
FROM 'customers.parquet' c
JOIN 'orders.parquet' o ON c.id = o.customer_id
WHERE o.date >= '2026-01-01'
""").pl() # Convert directly to Polars DataFrame
# Use Polars for downstream data transformations
result = (
joined
.with_columns(pl.col("total").log().alias("log_total"))
.group_by("customer_name")
.agg(pl.col("log_total").mean())
.sort("log_total", descending=True)
)
The fetchpl() method and .pl() conversion make switching between them seamless. DuckDB handles the heavy SQL lifting, Polars handles the DataFrame transformations.
Practical Analytics Patterns
Window Functions
DuckDB supports the full SQL window function suite:
result = duckdb.sql("""
SELECT
product_name,
sale_date,
amount,
SUM(amount) OVER (
PARTITION BY product_name
ORDER BY sale_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as rolling_7_day_total,
RANK() OVER (
PARTITION BY product_name
ORDER BY amount DESC
) as sales_rank
FROM 'daily_sales.parquet'
WHERE sale_date >= '2026-01-01'
""").fetchdf()
This calculates a 7-day rolling total and a sales rank for each product. Doing this in pandas requires multiple groupby operations or a complex apply pattern.
Time Series Aggregation
result = duckdb.sql("""
SELECT
date_trunc('week', sale_date) as week,
region,
COUNT(DISTINCT customer_id) as unique_customers,
AVG(amount) as avg_order_value,
PERCENTILE_CONT(0.95, amount) as p95_order_value
FROM 'transactions.parquet'
GROUP BY week, region
ORDER BY week, region
""").fetchdf()
DuckDB’s date_trunc, PERCENTILE_CONT, and other analytical functions handle common time series patterns without custom Python logic.
Working with Hugging Face Datasets
DuckDB can read datasets directly from Hugging Face Hub:
import duckdb
conn = duckdb.connect()
# Access Hugging Face datasets via hf:// prefix
result = conn.execute("""
SELECT text, emotion_id
FROM read_parquet('hf://datasets/dair-ai/emotion/unsplit/train-00000-of-00001.parquet')
WHERE emotion_id = 0
LIMIT 10
""").fetchdf()
The hf:// prefix lets DuckDB stream Parquet files from Hugging Face without downloading them first, which is useful for exploring large datasets before deciding what to load locally.
When Not to Use DuckDB
DuckDB excels at analytical queries on a single machine, but it’s not the right tool for every job.
Skip DuckDB when you need:
- Concurrent writes from multiple processes. DuckDB allows one writer at a time. For multi-writer scenarios, stick with PostgreSQL or a distributed system.
- Sub-second latency on small lookups. DuckDB is optimized for throughput, not latency. A primary key lookup on 100 rows will be faster in SQLite.
- Production OLTP workloads. DuckDB is an OLAP database. If your application needs to handle many small concurrent reads and writes, use a transactional database.
For data analysis, exploratory work, and analytical pipelines on a single machine, DuckDB is often the best choice available.
The Bottom Line
DuckDB has carved out a niche that didn’t exist a few years ago: serverless SQL analytics that run as fast as distributed frameworks for most practical dataset sizes. It fits naturally into Python workflows, reads Parquet files directly, and handles complex analytical queries with full SQL support.
If your pandas code is getting slow or running out of memory, DuckDB is worth a look. You don’t need to rewrite your entire pipeline. Start by replacing one heavy aggregation with a DuckDB SQL query and see what happens.
Discussion
Leave a comment
No comments yet
Be the first to start the conversation.