DuckDB in Python: A Complete Data Analytics Guide for 2026

DuckDB brings fast SQL analytics directly into Python without a server. Learn how to use it for data analysis, Parquet queries, and workflows that beat pandas on large datasets.

Most Python data work follows the same pattern: load data into a DataFrame, filter a few columns, group by something, maybe join two tables. It works fine until the data gets big enough to make pandas choke.

That’s where DuckDB changes things. It runs SQL queries directly inside your Python process with zero server setup. You install it with pip install duckdb and start querying. For many analytical tasks, it outperforms pandas by a wide margin and can even rival distributed frameworks like Spark or Dask for datasets up to a few gigabytes.

What Is DuckDB and Why It Matters

DuckDB is an in-process SQL OLAP database. Unlike PostgreSQL or MySQL, it doesn’t run as a separate service. It lives inside your Python interpreter, sharing the same memory space. Think of it as SQLite, but built for analytical queries instead of transactional ones.

The architecture matters because it eliminates network overhead. No connection pooling, no serialization between processes, no server configuration. You write SQL against your data and get results back as Python objects. The entire query engine runs in C++ and processes data in vectors of 1,024 values at a time, which is why it handles large aggregations so quickly.

Getting Started

Installation takes one command:

pip install duckdb

That’s it. No server to start, no Docker containers, no environment variables. Here’s your first query:

import duckdb

# Run SQL directly on a Python object
result = duckdb.sql("""
    SELECT species, AVG(body_mass_g) as avg_mass
    FROM 'penguins.parquet'
    GROUP BY species
    ORDER BY avg_mass DESC
""").show()

DuckDB can read Parquet, CSV, and JSON files directly. No need to load them into memory first. It uses predicate pushdown to read only the columns and rows your query actually needs.

Querying Parquet Files Directly

One of DuckDB’s strongest features is its ability to query Parquet files without loading them into memory:

import duckdb

# Query a single file
result = duckdb.sql("""
    SELECT region, COUNT(*) as sales_count, SUM(amount) as total
    FROM 'sales_2026.parquet'
    WHERE date >= '2026-01-01'
    GROUP BY region
""").fetchdf()

# Query multiple files with a glob pattern
result = duckdb.sql("""
    SELECT month, SUM(revenue)
    FROM 'data/2026/month_*.parquet'
    GROUP BY month
""").fetchdf()

This works because Parquet stores data in columnar chunks with metadata about min/max values. DuckDB reads the metadata first, then skips entire file sections that don’t match your filters. A 10 GB file with a date filter might only require reading 200 MB of actual data.

Working with DuckDB’s Persistent Storage

For workflows that run repeatedly, DuckDB offers a file-based database that persists between sessions:

import duckdb

# Connect to a database file (creates it if it doesn't exist)
conn = duckdb.connect('analytics.duckdb')

# Create a table from a Parquet file
conn.execute("""
    CREATE TABLE sales AS
    SELECT * FROM 'sales_2026.parquet'
""")

# Create a regular table with inserts
conn.execute("""
    CREATE TABLE metadata (
        key VARCHAR,
        value VARCHAR
    )
""")
conn.execute("INSERT INTO metadata VALUES ('last_update', '2026-06-13')")

# Query the persisted data
result = conn.execute("""
    SELECT s.region, m.value as last_update
    FROM sales s
    CROSS JOIN metadata m
    GROUP BY s.region, m.value
""").fetchdf()

conn.close()

When you reconnect to analytics.duckdb later, the sales and metadata tables are still there with all their data. This is useful for intermediate results that take time to compute but get reused across multiple analysis sessions.

Performance: When DuckDB Beats Pandas

The difference becomes clear when data gets large. DuckDB’s vectorized engine processes data in chunks of 1,024 values, running C++ code without Python overhead. Benchmarks consistently show DuckDB outperforming pandas on group-by operations, joins, and aggregations for datasets over a few million rows.

Consider a simple aggregation:

import duckdb
import pandas as pd

# With pandas — loads entire file into memory
df = pd.read_parquet('large_dataset.parquet')
result = df.groupby('category')['value'].sum()

# With DuckDB — streams data, only reads needed columns
result = duckdb.sql("""
    SELECT category, SUM(value)
    FROM 'large_dataset.parquet'
    GROUP BY category
""").fetchdf()

The pandas version loads every column into memory before grouping. DuckDB reads only category and value columns, skipping the rest. On a 5 GB Parquet file with 50 columns, that’s the difference between loading 5 GB and loading 200 MB.

DuckDB Meets Polars: Two Complementary Tools

DuckDB and Polars solve overlapping but distinct problems. Polars is a DataFrame library with a lazy execution engine: you build an execution plan, then execute it. DuckDB is a SQL engine that happens to run inside your Python process.

Many teams use both:

import duckdb
import polars as pl

# Use DuckDB for complex SQL joins across multiple files
joined = duckdb.sql("""
    SELECT c.customer_name, o.order_id, o.total
    FROM 'customers.parquet' c
    JOIN 'orders.parquet' o ON c.id = o.customer_id
    WHERE o.date >= '2026-01-01'
""").pl()  # Convert directly to Polars DataFrame

# Use Polars for downstream data transformations
result = (
    joined
    .with_columns(pl.col("total").log().alias("log_total"))
    .group_by("customer_name")
    .agg(pl.col("log_total").mean())
    .sort("log_total", descending=True)
)

The fetchpl() method and .pl() conversion make switching between them seamless. DuckDB handles the heavy SQL lifting, Polars handles the DataFrame transformations.

Practical Analytics Patterns

Window Functions

DuckDB supports the full SQL window function suite:

result = duckdb.sql("""
    SELECT 
        product_name,
        sale_date,
        amount,
        SUM(amount) OVER (
            PARTITION BY product_name 
            ORDER BY sale_date 
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as rolling_7_day_total,
        RANK() OVER (
            PARTITION BY product_name 
            ORDER BY amount DESC
        ) as sales_rank
    FROM 'daily_sales.parquet'
    WHERE sale_date >= '2026-01-01'
""").fetchdf()

This calculates a 7-day rolling total and a sales rank for each product. Doing this in pandas requires multiple groupby operations or a complex apply pattern.

Time Series Aggregation

result = duckdb.sql("""
    SELECT 
        date_trunc('week', sale_date) as week,
        region,
        COUNT(DISTINCT customer_id) as unique_customers,
        AVG(amount) as avg_order_value,
        PERCENTILE_CONT(0.95, amount) as p95_order_value
    FROM 'transactions.parquet'
    GROUP BY week, region
    ORDER BY week, region
""").fetchdf()

DuckDB’s date_trunc, PERCENTILE_CONT, and other analytical functions handle common time series patterns without custom Python logic.

Working with Hugging Face Datasets

DuckDB can read datasets directly from Hugging Face Hub:

import duckdb

conn = duckdb.connect()
# Access Hugging Face datasets via hf:// prefix
result = conn.execute("""
    SELECT text, emotion_id
    FROM read_parquet('hf://datasets/dair-ai/emotion/unsplit/train-00000-of-00001.parquet')
    WHERE emotion_id = 0
    LIMIT 10
""").fetchdf()

The hf:// prefix lets DuckDB stream Parquet files from Hugging Face without downloading them first, which is useful for exploring large datasets before deciding what to load locally.

When Not to Use DuckDB

DuckDB excels at analytical queries on a single machine, but it’s not the right tool for every job.

Skip DuckDB when you need:

  • Concurrent writes from multiple processes. DuckDB allows one writer at a time. For multi-writer scenarios, stick with PostgreSQL or a distributed system.
  • Sub-second latency on small lookups. DuckDB is optimized for throughput, not latency. A primary key lookup on 100 rows will be faster in SQLite.
  • Production OLTP workloads. DuckDB is an OLAP database. If your application needs to handle many small concurrent reads and writes, use a transactional database.

For data analysis, exploratory work, and analytical pipelines on a single machine, DuckDB is often the best choice available.

The Bottom Line

DuckDB has carved out a niche that didn’t exist a few years ago: serverless SQL analytics that run as fast as distributed frameworks for most practical dataset sizes. It fits naturally into Python workflows, reads Parquet files directly, and handles complex analytical queries with full SQL support.

If your pandas code is getting slow or running out of memory, DuckDB is worth a look. You don’t need to rewrite your entire pipeline. Start by replacing one heavy aggregation with a DuckDB SQL query and see what happens.

Spread The Article

Share this guide

Send this article to your network or keep a copy of the direct link.

X Facebook LinkedIn Reddit Telegram

Discussion

Leave a comment

No comments yet

Be the first to start the conversation.