How to Analyze Multiple CSV Files with DuckDB
Last updated Mar 27, 2026

DuckDB lets you run SQL queries against CSV files from the command line without loading data into a database, installing a server, or writing Python. Install it with a single command, point it at one or more CSV files, and query them like database tables. For analysts working with monthly exports from CRMs, billing platforms, or logistics tools, this workflow replaces a patchwork of Excel manipulation, ad-hoc scripts, and manual joins.
Why DuckDB Works Well for CSV Analysis
DuckDB is an in-process analytical database. It runs inside your terminal or Python session with no server process, no configuration file, and no daemon to manage. Its CSV reader is one of the best-tested parsers available. In the Pollock benchmark, an independent test measuring how accurately databases handle real-world messy CSV files, DuckDB ranked first, correctly parsing 99.61% of test files across hundreds of edge-case formats including inconsistent delimiters, misquoted fields, and mixed line endings.
What separates DuckDB from a spreadsheet or pandas is scale. A file with five million rows that would freeze Excel, and that takes tens of seconds to load in pandas, loads in under two seconds in DuckDB. In side-by-side tests combining multiple large CSV files, DuckDB runs three to ten times faster than pandas concat on the same hardware, with the gap growing as file count increases.
DuckDB is free and open source under the MIT license. As of March 2026, the DuckDB ecosystem includes a VS Code extension for querying local and remote data sources, a SQL transpilation library covering 32 SQL dialects, and MotherDuck's AI-assisted analysis layer reaching 95% accuracy in text-to-SQL conversion on business datasets.
Install DuckDB in 60 Seconds
On macOS or Linux, install via Homebrew:
brew install duckdb
On Windows, download the standalone executable from the DuckDB documentation site. No Python environment, no Docker image, and no additional dependencies are required.
Launch the interactive shell:
duckdb
You should see a D prompt. That is the full setup.
Read a Single CSV with One SQL Query
DuckDB can query a CSV file without importing or copying it. Place a file called sales.csv in your working directory and run:
SELECT * FROM read_csv('sales.csv') LIMIT 10;
The CSV sniffer automatically detects the delimiter, infers column types, and reads the header row as column names. For most well-formed files from business tools, this works without any extra parameters.
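When a file defeats the sniffer, the detection can be overridden with explicit parameters. A minimal sketch, assuming a hypothetical semicolon-delimited legacy export with no header row (the file name and column names are illustrative, but delim, header, and columns are documented read_csv parameters):

```sql
-- Override autodetection: explicit delimiter, no header,
-- and hand-specified column names and types.
SELECT *
FROM read_csv('legacy_export.csv',
  delim = ';',
  header = false,
  columns = {'order_id': 'INTEGER', 'amount': 'DOUBLE', 'order_date': 'DATE'});
```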
Before running analysis on an unfamiliar file, use sniff_csv() to preview how DuckDB will interpret it:
SELECT * FROM sniff_csv('sales.csv');
This returns the detected delimiter, column names, inferred types, and sample values. Running it first saves debugging time when a file contains unexpected formatting.
Combine Monthly Export Files with Glob Patterns
The most common scenario for business analysts is a folder of monthly CSV exports from a SaaS tool. DuckDB reads all of them as a single virtual table using a glob pattern:
SELECT * FROM read_csv('reports/*.csv', union_by_name = true);
The union_by_name = true option aligns columns by header name rather than position. This is essential when files from different months have slightly different column orders, or when a column was added partway through the year. Missing columns in older files are filled with NULL rather than causing a parse error.
To see how many rows each file contributes before running the full analysis:
SELECT filename, COUNT(*) AS row_count
FROM read_csv('reports/*.csv', union_by_name = true, filename = true)
GROUP BY filename
ORDER BY filename;
A month with zero rows or an unexpectedly low count usually means a missing export. This check catches the problem before an analysis silently undercounts.
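The same filename = true trick helps when a column was added partway through the year: counting NULLs per file shows exactly which exports predate the new column. A sketch, where discount is a hypothetical column added mid-year:

```sql
-- Files where 'discount' is entirely NULL predate the column's addition.
SELECT filename, COUNT(*) - COUNT(discount) AS rows_missing_discount
FROM read_csv('reports/*.csv', union_by_name = true, filename = true)
GROUP BY filename
ORDER BY filename;
```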
Join Two Related CSV Files
Business data rarely lives in a single file. A common pattern is a transaction file paired with a reference table of customers, products, or regions. DuckDB joins them directly without importing either into a database.
For a file orders.csv with a customer_id column and a file customers.csv with id, name, and region columns:
SELECT
  o.order_id,
  o.amount,
  c.name AS customer_name,
  c.region
FROM read_csv('orders.csv') AS o
JOIN read_csv('customers.csv') AS c ON o.customer_id = c.id;
To aggregate by region and rank by revenue in the same query:
SELECT
  c.region,
  COUNT(*) AS order_count,
  SUM(o.amount) AS total_revenue
FROM read_csv('orders.csv') AS o
JOIN read_csv('customers.csv') AS c ON o.customer_id = c.id
GROUP BY c.region
ORDER BY total_revenue DESC;
Use a LEFT JOIN when some orders may not have a matching customer record and you want to keep them rather than drop them from results:
SELECT
  o.order_id,
  o.amount,
  COALESCE(c.region, 'Unassigned') AS region
FROM read_csv('orders.csv') AS o
LEFT JOIN read_csv('customers.csv') AS c ON o.customer_id = c.id;
Both queries run directly on the files. No staging step, no intermediate table creation, and no schema definition is required.
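The LEFT JOIN pattern also doubles as a data-quality check: filtering for rows where the join found no match surfaces orders whose customer_id does not appear in the reference file. A sketch using the same example files:

```sql
-- Anti-join: orders with no matching customer record.
SELECT o.order_id, o.customer_id
FROM read_csv('orders.csv') AS o
LEFT JOIN read_csv('customers.csv') AS c ON o.customer_id = c.id
WHERE c.id IS NULL;
```

Running this before the aggregation tells you how many rows will land in the 'Unassigned' bucket.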
Handle Missing Values and Malformed Rows
Real CSV exports from business systems contain nulls, blank fields, and rows that fail type conversion. DuckDB provides three practical options.
To skip bad rows silently and continue loading:
SELECT * FROM read_csv('orders.csv', ignore_errors = true);
To capture rejected rows in a temporary table for inspection rather than discarding them:
SELECT * FROM read_csv('orders.csv', store_rejects = true);
SELECT * FROM reject_errors;
The reject_errors table shows the line number, the raw row content, and the specific error message. Use this approach when data completeness matters and silent data loss is not acceptable.
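When a file produces many rejects, grouping them by error category shows whether the failures share a single cause, such as one consistently malformed column. A sketch, assuming the error_type column that recent DuckDB versions include in reject_errors (verify the column names against your DuckDB version):

```sql
-- Summarize rejected rows by error category.
SELECT error_type, COUNT(*) AS reject_count
FROM reject_errors
GROUP BY error_type
ORDER BY reject_count DESC;
```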
To filter nulls in your analysis:
SELECT *
FROM read_csv('orders.csv')
WHERE amount IS NOT NULL AND customer_id IS NOT NULL;
To replace nulls with a fallback value using COALESCE:
SELECT
  order_id,
  COALESCE(amount, 0) AS amount,
  COALESCE(region, 'Unknown') AS region
FROM read_csv('orders.csv');
These patterns handle most of what comes out of real export pipelines. The store_rejects approach is worth using whenever the source file has unknown data quality.
Export Results to a Clean Output File
After joining and cleaning, DuckDB writes results directly to a new CSV or Parquet file with the COPY command:
COPY (
  SELECT
    c.region,
    SUM(o.amount) AS total_revenue,
    COUNT(*) AS order_count
  FROM read_csv('orders.csv') AS o
  JOIN read_csv('customers.csv') AS c ON o.customer_id = c.id
  WHERE o.amount IS NOT NULL
  GROUP BY c.region
  ORDER BY total_revenue DESC
) TO 'revenue_by_region.csv' (HEADER, DELIMITER ',');
For large outputs you plan to query again, Parquet is more efficient. DuckDB reads Parquet faster than CSV and the files compress to a fraction of the original size:
COPY (...) TO 'revenue_by_region.parquet' (FORMAT PARQUET);
The COPY command accepts any valid SELECT query as its source with no limit on complexity.
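A Parquet file written this way can be queried again directly, just like a CSV, with no import step. A brief sketch reading back the export from the previous example:

```sql
-- Query the exported Parquet file directly.
SELECT * FROM read_parquet('revenue_by_region.parquet') LIMIT 10;
```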
Store Data in a Persistent Database
For files you query repeatedly, loading data into a persistent DuckDB database file is faster than re-reading the CSV on every run. Open a named database and create a table from your exports:
duckdb my_analysis.db
CREATE TABLE orders AS
SELECT * FROM read_csv('orders/*.csv', union_by_name = true);
Subsequent queries against the orders table skip CSV parsing entirely. The database file works with the DuckDB VS Code extension, the DuckDB Local UI (accessible via duckdb -ui), and MotherDuck's cloud service for shared team access.
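When a new monthly export arrives, the table can be extended without rebuilding it. A sketch assuming the new file follows the same schema as the existing table (the file name is hypothetical; BY NAME aligns the inserted columns by header, mirroring union_by_name):

```sql
-- Append one new month's export to the existing orders table.
INSERT INTO orders BY NAME
SELECT * FROM read_csv('orders/2026-03.csv');
```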
If you want to skip the SQL setup entirely, VSLZ AI handles this workflow from a file upload. Upload your CSV, ask your question in plain English, and get a joined, cleaned result without writing a query.
Practical Summary
The core DuckDB workflow for CSV analysis is: install with one command, query files directly with read_csv(), combine monthly exports using glob patterns with union_by_name = true, join related files without a schema, handle bad rows with store_rejects, and write the final result with COPY. The tool requires no server, no configuration, and no Python environment. For analysts who regularly work with downloaded exports from business tools, this replaces a combination of spreadsheet manipulation, fragile scripts, and manual joins with a single reproducible workflow.
FAQ
Can DuckDB read CSV files without Python or any programming language?
Yes. DuckDB includes a standalone command-line interface that runs SQL queries directly against CSV files. Download the executable for your platform, run `duckdb` in a terminal, and query any CSV file with `SELECT * FROM read_csv('file.csv')`. No Python, no Node.js, and no configuration is required.
How do I combine multiple CSV files into one table with DuckDB?
Use a glob pattern in the `read_csv()` function: `SELECT * FROM read_csv('reports/*.csv', union_by_name = true)`. The `union_by_name = true` parameter aligns columns by header name rather than position, which handles files where column order differs between exports or where new columns were added over time.
Does DuckDB handle large CSV files efficiently?
Yes. In benchmark tests combining multiple large CSV files, DuckDB runs three to ten times faster than pandas concat on the same hardware. Files with millions of rows that would be slow or unworkable in a spreadsheet load in seconds. For files you query repeatedly, loading them into a persistent DuckDB database file is even faster than re-reading the CSV each time.
How do I export DuckDB query results to a CSV or Parquet file?
Use the COPY command: `COPY (SELECT ...) TO 'output.csv' (HEADER, DELIMITER ',')`. For Parquet output, use `COPY (...) TO 'output.parquet' (FORMAT PARQUET)`. The COPY command accepts any valid SELECT query as its source, including queries with joins, aggregations, and filters.
How does DuckDB handle malformed or missing rows in CSV files?
DuckDB offers three options. Use `ignore_errors = true` to skip bad rows silently. Use `store_rejects = true` to capture rejected rows in a temporary table called `reject_errors`, which shows the line number, raw content, and specific error for each failed row. For null values in valid rows, use SQL WHERE clauses or COALESCE to filter or replace them during your query.


