How to Analyze Large CSV Files with DuckDB
Last updated Apr 9, 2026

Why CSV Analysis Gets Painful at Scale
Most analysts hit the same wall. A CSV file grows past 500,000 rows and Excel freezes. Google Sheets caps out at 10 million cells. Python pandas works but eats RAM and requires writing code beyond what many ops managers or founders want to maintain.
DuckDB solves this. It is a free, open-source SQL database that runs entirely on your laptop with zero configuration. No database server. No cloud account. No Docker containers. You download a single binary, point it at a CSV file, and start querying with standard SQL.
According to benchmarks published by MotherDuck in 2026, DuckDB can scan 100 million CSV rows in under 4 seconds on a modern laptop. For comparison, loading the same file into pandas takes over 45 seconds before you can even begin analysis.
Step 1: Install DuckDB
DuckDB supports macOS, Windows, and Linux. Pick whichever method matches your setup.
macOS (Homebrew):
brew install duckdb
Windows (winget):
winget install DuckDB.cli
Python (pip):
pip install duckdb
Verify the install:
duckdb --version
You should see the version number printed. As of April 2026, the latest stable release is v1.2.x.
Step 2: Query a CSV File Directly
This is the core feature that makes DuckDB different. You do not need to import or load anything. Just point a SQL query at the file path.
Open a DuckDB shell:
duckdb
Then run:
SELECT * FROM 'sales_data.csv' LIMIT 10;
DuckDB auto-detects column names, data types, and delimiters. It handles quoted fields, mixed types, and most common CSV quirks without any configuration.
To count rows:
SELECT count(*) FROM 'sales_data.csv';
To aggregate by a column:
SELECT region, sum(revenue) as total_revenue
FROM 'sales_data.csv'
GROUP BY region
ORDER BY total_revenue DESC;
Every standard SQL operation works: WHERE filters, JOINs across multiple CSV files, window functions, CTEs, and subqueries.
Step 3: Join Multiple CSV Files
Real analysis often requires combining data from several files. DuckDB handles this natively.
SELECT o.order_id, o.amount, c.company_name
FROM 'orders.csv' AS o
JOIN 'customers.csv' AS c ON o.customer_id = c.id
WHERE o.amount > 1000
ORDER BY o.amount DESC;
You can also query all CSV files in a directory using a glob pattern:
SELECT * FROM 'monthly_reports/*.csv';
DuckDB unions the files automatically, which is useful when you receive data split across monthly or weekly exports.
Step 4: Use DuckDB from Python
If you prefer working in a notebook or script, DuckDB integrates directly with Python. Install it with pip (see Step 1), then:
import duckdb
result = duckdb.sql("""
SELECT department, avg(salary) as avg_salary
FROM 'employees.csv'
GROUP BY department
ORDER BY avg_salary DESC
""").fetchdf()
print(result)
The .fetchdf() method returns a pandas DataFrame, so you can pipe results directly into matplotlib, seaborn, or any visualization library. DuckDB also supports .fetchnumpy() and .fetch_arrow_table() for performance-critical workflows.
Step 5: Speed Up Repeated Queries with Parquet
CSV is a text format. Every time you query it, DuckDB must parse every character. For files you query repeatedly, convert to Parquet once and get 5x to 10x faster reads.
COPY (SELECT * FROM 'sales_data.csv') TO 'sales_data.parquet' (FORMAT PARQUET);
Now query the Parquet file instead:
SELECT region, sum(revenue) FROM 'sales_data.parquet' GROUP BY region;
Parquet files are compressed and columnar. DuckDB only reads the columns your query references, which dramatically reduces I/O on wide tables. A 2 GB CSV file typically compresses to 200-400 MB as Parquet.
Step 6: Export Results
After running your analysis, export results to a new CSV, Excel-compatible format, or Parquet.
Export to CSV:
COPY (
SELECT region, sum(revenue) as total
FROM 'sales_data.csv'
GROUP BY region
) TO 'summary.csv' (HEADER, DELIMITER ',');
Export to Parquet:
COPY (SELECT * FROM 'sales_data.csv' WHERE year = 2026) TO 'filtered.parquet' (FORMAT PARQUET);
You can then open the exported CSV in Excel or Google Sheets, or share the Parquet file with teammates who use DuckDB, Spark, or any Arrow-compatible tool.
Performance Tips for Very Large Files
For files over 10 million rows, a few settings make a noticeable difference.
Increase memory limit:
SET memory_limit = '8GB';
Use multiple threads (enabled by default, but configurable):
SET threads TO 8;
Avoid SELECT * on wide tables. Specify only the columns you need. DuckDB's columnar engine benefits enormously from column pruning.
Create a persistent database for repeated work:
duckdb my_analysis.duckdb
Then create tables from your CSVs:
CREATE TABLE sales AS SELECT * FROM 'sales_data.csv';
Subsequent queries against the sales table run against DuckDB's optimized internal format rather than re-parsing CSV text.
When DuckDB Fits and When It Does Not
DuckDB is ideal for analytical queries on local files: aggregations, joins, filtering, and exports. It handles files from a few thousand rows to several hundred million rows on a single machine.
It is not designed for high-concurrency web applications, real-time streaming, or transactional workloads with many simultaneous writes. For those, tools like PostgreSQL or dedicated streaming platforms are better suited.
If you want to skip manual SQL entirely and get analysis, charts, and statistical summaries from a single natural-language prompt, platforms like VSLZ AI let you upload a CSV and ask questions in plain English without writing any queries.
Summary
Install DuckDB in one command. Query any CSV file with standard SQL. Join multiple files, aggregate millions of rows, and export results in seconds. Convert to Parquet for repeated queries. No server, no cloud dependency, no license fees.
FAQ
Can DuckDB handle CSV files larger than my RAM?
Yes. DuckDB uses a streaming execution engine that processes data in chunks rather than loading the entire file into memory at once. You can query CSV files significantly larger than your available RAM. For best results with very large files, convert them to Parquet format first, which allows DuckDB to read only the columns and row groups your query needs.
How do I query multiple CSV files at once in DuckDB?
Use a glob pattern in your FROM clause. For example, SELECT * FROM 'data/*.csv' will union all CSV files in the data directory into a single result set. DuckDB automatically aligns columns by name across files. You can also use the read_csv_auto function with a list of file paths for more control over schema detection.
Is DuckDB faster than pandas for CSV analysis?
For most analytical queries on files over 100,000 rows, yes. DuckDB uses vectorized execution and columnar processing, which means aggregations and filters run significantly faster than pandas row-by-row operations. Benchmarks show DuckDB scanning 100 million CSV rows in under 4 seconds compared to over 45 seconds for pandas to load the same file. DuckDB also uses less memory because it streams data rather than loading everything at once.
Can I use DuckDB without knowing SQL?
DuckDB requires SQL for queries. However, basic SQL for data analysis (SELECT, WHERE, GROUP BY, ORDER BY) can be learned in an afternoon. DuckDB also integrates with Python and R, so you can use it alongside tools you already know. If you prefer natural-language queries, AI-powered analytics platforms can generate SQL or skip it entirely.
How do I install DuckDB on Windows?
The fastest method is winget install DuckDB.cli from a Windows terminal. Alternatively, download the binary from duckdb.org/docs/installation, extract it, and add the folder to your system PATH. For Python users, pip install duckdb works on all platforms including Windows.


