How to Set Up DuckDB for Local Data Analysis
Last updated Apr 29, 2026

DuckDB is an in-process analytical database that runs entirely on your machine. You query CSV, Parquet, and JSON files directly with SQL, and results come back fast enough to be interactive on datasets with millions of rows. Version 1.5.2, released April 13, 2026, is the current stable build. It adds DuckLake v1.0 lakehouse format support and a round of performance improvements on top of the 1.5.0 "Variegata" release from March, which introduced a new VARIANT type, a built-in GEOMETRY type, and an experimental PEG parser that produces significantly better error messages.
For analysts who spend time wrestling with Excel limits or waiting for Pandas to finish groupby operations, DuckDB covers the full local workflow with less setup than any alternative. Analysts switching from Pandas to DuckDB for aggregation-heavy queries report latency reductions of up to 90% on the same hardware.
What DuckDB Is and When to Use It
DuckDB was released open source in 2019 by researchers at CWI Amsterdam. It is designed for OLAP workloads: reading large amounts of data, filtering it, aggregating it, and exporting results. It is not a transactional database. If you need an application that writes rows concurrently from multiple users, PostgreSQL or MySQL are better fits.
The sweet spot for DuckDB is everything that happens before a dashboard exists: getting a CSV from operations, cleaning it, grouping it by region or product, and exporting a summary. In Excel, that process involves manual steps and breaks at around 100,000 rows. In Pandas, it requires Python knowledge. In DuckDB, it is one SQL statement that runs in milliseconds on a laptop with no setup beyond a pip install.
DuckDB handles files up to roughly 100GB on typical analyst hardware. Beyond that, MotherDuck (the managed cloud version of DuckDB) or a distributed engine like Apache Spark scales further.
Step 1: Install DuckDB
DuckDB installs as a Python library or as a standalone CLI tool.
Python install (recommended for analysts already using Jupyter or VS Code):
pip install duckdb
No additional dependencies. The package includes the full DuckDB engine.
CLI install on macOS:
brew install duckdb
CLI install on Windows:
winget install DuckDB.cli
Or download the binary directly from duckdb.org/docs/installation. After installing the CLI, open it by typing duckdb in your terminal. To use a persistent database file instead of in-memory, pass a filename:
duckdb myanalysis.duckdb
All tables and data you create in that session are saved to the file. Future sessions that open the same file have access to them immediately.
Step 2: Query a CSV File Without Importing It
DuckDB's most useful feature for analysts is direct file querying. There is no import step.
SELECT region, SUM(revenue) AS total_revenue
FROM 'sales_data.csv'
GROUP BY region
ORDER BY total_revenue DESC;
DuckDB reads the CSV, infers column types, and returns results. For most CSVs with headers, this works without configuration.
If the file has inconsistent formatting or mixed types, use the explicit reader with auto-detection:
SELECT *
FROM read_csv('sales_data.csv', header=true, auto_detect=true)
LIMIT 10;
You can also query multiple files at once using a glob pattern:
SELECT month, SUM(revenue)
FROM 'monthly_data/*.csv'
GROUP BY month;
DuckDB concatenates all matching files and queries them as a single table. This handles the common case where monthly exports land as separate files in the same folder.
Step 3: Load Data into a Persistent Table
If you plan to run many queries against the same dataset, load it into a table. This avoids re-reading the file on every query and stores data in DuckDB's columnar format, which is faster for aggregation-heavy work:
CREATE TABLE sales AS
SELECT * FROM 'sales_data.csv';
After this, sales is available in every future session that opens the same .duckdb file.
Check what tables exist:
SHOW TABLES;
Inspect a table's column types:
DESCRIBE sales;
Step 4: Run Analytical SQL
DuckDB supports the full SQL spec including window functions, CTEs, pivots, and the QUALIFY clause -- features that require complex workarounds in other local tools.
Rolling 7-day average:
SELECT date, revenue,
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg
FROM sales;
Pivot revenue by month:
PIVOT sales ON month USING SUM(revenue);
Top 10 products by revenue with rank, no subquery needed:
SELECT product, SUM(revenue) AS total,
RANK() OVER (ORDER BY SUM(revenue) DESC) AS rnk
FROM sales
GROUP BY product
QUALIFY rnk <= 10;
The QUALIFY clause filters window function results without wrapping everything in a subquery. This is a DuckDB-specific shorthand that significantly reduces query length for ranked output.
Summarize by multiple dimensions with GROUPING SETS:
SELECT region, product, SUM(revenue)
FROM sales
GROUP BY GROUPING SETS ((region), (product), ());
This returns three result sets: one grouped by region, one by product, and one grand total -- in a single pass over the data.
Step 5: Export Results
When analysis is complete, export to CSV or Parquet for handoff:
COPY (
SELECT region, SUM(revenue) AS total
FROM sales
GROUP BY region
) TO 'summary.csv' (FORMAT CSV, HEADER);
For Parquet (smaller file size, faster to re-import, compatible with most BI tools):
COPY sales TO 'sales.parquet' (FORMAT PARQUET);
Parquet files are readable by Tableau, Power BI, Pandas, Spark, and DuckDB itself. If you are passing data to another analyst or loading it into a cloud warehouse, Parquet is the better format.
Step 6: Use DuckDB in Python
For analysts who want to combine DuckDB queries with Python plotting libraries, the integration adds three lines of code:
import duckdb
con = duckdb.connect('myanalysis.duckdb')
df = con.execute("SELECT region, SUM(revenue) AS total FROM sales GROUP BY region").df()
The .df() method returns a Pandas DataFrame. Any downstream library -- Matplotlib, Plotly, seaborn -- works without changes. The query itself runs in DuckDB, which is faster than the Pandas groupby equivalent on tables over a few hundred thousand rows.
You can also query Pandas DataFrames in memory without importing them into DuckDB:
import pandas as pd
import duckdb
df = pd.read_csv('sales_data.csv')
result = duckdb.sql("SELECT region, SUM(revenue) FROM df GROUP BY region").df()
DuckDB reads the df variable from Python's scope. This is useful in existing Pandas workflows where you want SQL-style aggregation on data already loaded in memory.
If you want to run this kind of query from a plain-English prompt on an uploaded CSV without writing SQL, VSLZ handles the full workflow from file upload to output with no configuration.
Common Setup Issues
"Column not found" errors on CSV import: DuckDB infers types on a sample of rows. If a column has mostly nulls in the sample, it may be typed as VARCHAR when you expect DOUBLE. Fix it with an explicit cast in the query: CAST(column_name AS DOUBLE).
File path errors on Windows: Use forward slashes or raw strings: r'C:\\Users\\username\\data.csv' in Python, or forward slashes in the CLI.
Out-of-memory errors on very large files: DuckDB spills intermediate results to disk by default, but you can increase the memory limit explicitly: SET memory_limit='8GB';
Summary
DuckDB installs in one command and queries CSV, Parquet, and JSON files directly with full analytical SQL including window functions, pivots, and CTEs. Version 1.5.2 is stable as of April 2026. The fastest path to a first result: pip install duckdb, open Python, and run duckdb.sql("SELECT * FROM 'yourfile.csv' LIMIT 10").df(). From there, every SQL aggregation that takes minutes in Excel takes milliseconds in DuckDB.
FAQ
How do I install DuckDB on Windows?
Run `winget install DuckDB.cli` in PowerShell or Command Prompt to install the DuckDB CLI. Alternatively, download the Windows binary directly from duckdb.org/docs/installation. For Python users, `pip install duckdb` works on Windows without any additional dependencies. DuckDB 1.5.2 is the current stable version as of April 2026.
Can DuckDB query CSV files directly without importing them?
Yes. DuckDB can query CSV, Parquet, and JSON files directly with SQL without any import step. The syntax is: SELECT * FROM 'yourfile.csv' LIMIT 10. DuckDB infers column types automatically from a sample of rows. You can also use read_csv() with explicit options like auto_detect=true for files with inconsistent formatting. Multiple files can be queried at once using glob patterns like 'data/*.csv'.
What is the difference between DuckDB and SQLite?
DuckDB and SQLite are both embedded databases (no server required), but they are designed for different workloads. SQLite is optimized for transactional (OLTP) workloads: many small reads and writes, row-by-row access, typical in application backends. DuckDB is optimized for analytical (OLAP) workloads: reading large amounts of data, aggregating columns, and exporting summaries. DuckDB is faster than SQLite for GROUP BY, window functions, and large scans by a significant margin. SQLite is faster for high-concurrency writes. For data analysis on CSVs or Parquet files, DuckDB is the better choice.
How much data can DuckDB handle on a laptop?
DuckDB handles datasets up to roughly 100GB on typical analyst hardware (16-32GB RAM). It spills intermediate query results to disk when memory is exhausted, which slows performance but does not crash. For datasets larger than your RAM, DuckDB still works but performance degrades on complex aggregations. Analysts at scale can use MotherDuck, the managed cloud version of DuckDB, which handles terabyte-scale data with the same SQL syntax.
Can I use DuckDB with Pandas in Python?
Yes. DuckDB integrates directly with Pandas in two ways. First, you can execute a DuckDB SQL query and convert the result to a Pandas DataFrame with .df(): `con.execute('SELECT * FROM sales').df()`. Second, DuckDB can query Pandas DataFrames in memory directly: `duckdb.sql('SELECT region, SUM(revenue) FROM df GROUP BY region')` where df is a Pandas DataFrame in Python's scope. This lets you use SQL aggregation on data already loaded in a Pandas workflow without converting your codebase.


