Guides

How to Query CSV Files with SQL Using DuckDB

Arkzero ResearchApr 8, 20266 min read

Last updated Apr 8, 2026

DuckDB is a free, in-process SQL database built for analytical workloads. It reads CSV, Parquet, and JSON files directly with no import step, no Python environment, and no server configuration. Install the CLI with a single command, open a terminal, and run SQL queries on any CSV file immediately. DuckDB version 1.5.1 is the current release. This guide covers installation on Mac, Linux, and Windows, plus the most common analysis queries business analysts and operators actually need.
DuckDB logo for article: How to Query CSV Files with SQL Using DuckDB

DuckDB lets you run SQL directly on CSV files without importing them into a database. Install the CLI, point it at a file, and query. There is no server to start, no schema to define, and no Python required. On a standard laptop, DuckDB loads a 20 GB CSV in about 10 seconds at approximately 1.96 GB/s, according to benchmarks from its own CSV reader tests. For most CSV files analysts work with day-to-day, queries return in under a second.

Why DuckDB for CSV Analysis

Excel and Google Sheets struggle with files above 100,000 rows. Loading large CSVs into a proper database like Postgres means writing import scripts and managing a running server. DuckDB fills the gap.

It is an in-process OLAP database, meaning it runs inside the DuckDB process itself rather than as a separate service. The CLI binary weighs about 40 MB. You install it once, and it works offline. There is no account, no license, and no cloud dependency.

DuckDB's CSV reader ranked first in the Pollock benchmark, a test designed to measure SQL tools against malformed, non-standard, and pathological CSV files. In practice, DuckDB handles encoding mismatches, inconsistent quoting, and mixed null representations that would crash simpler tools. For analysts who receive data exports from third-party systems, this robustness matters more than benchmark numbers.

For analytical queries on large files, DuckDB consistently runs 100 to 1,000 times faster than SQLite or Postgres on the same data. The reason is columnar execution: DuckDB only reads the columns a query actually touches instead of fetching full rows.

Install DuckDB CLI

Choose the command for your operating system and run it in your terminal. No admin privileges are required on Mac or Linux.

macOS (Homebrew):

brew install duckdb

Linux:

curl https://install.duckdb.org | sh

Windows (winget):

winget install DuckDB.cli

Current version as of April 2026 is 1.5.1. Once installed, launch the interactive shell:

duckdb

The shell prompt shows D. Type exit or press Ctrl+D to quit. You can also open a one-off query without the interactive shell:

duckdb -c "SELECT COUNT(*) FROM 'sales_data.csv'"

Run Your First Query on a CSV File

No import step. Reference the file path directly in a FROM clause:

SELECT * FROM 'sales_data.csv' LIMIT 10;

DuckDB reads the header row automatically and infers column types. If the file is in a different directory, use the full path:

SELECT * FROM '/Users/yourname/downloads/sales_data.csv' LIMIT 10;

To inspect column names and detected types before writing queries:

DESCRIBE SELECT * FROM 'sales_data.csv';

This is useful when you receive an unfamiliar export and want to understand the schema before writing aggregations.

Aggregations and Grouping

The most common analytics pattern: sum or count by category, then sort by the result.

SELECT region,
       SUM(revenue)   AS total_revenue,
       AVG(revenue)   AS avg_deal_size,
       COUNT(*)       AS deal_count
FROM 'sales_data.csv'
GROUP BY region
ORDER BY total_revenue DESC;

Filter to a date range before aggregating:

SELECT rep_name,
       SUM(revenue) AS q1_revenue
FROM 'sales_data.csv'
WHERE close_date >= '2026-01-01'
  AND close_date <  '2026-04-01'
GROUP BY rep_name
ORDER BY q1_revenue DESC
LIMIT 10;

DuckDB accepts standard ISO date strings in WHERE clauses without casting.

Find Duplicates and Nulls

Two queries every analyst runs at the start of a data review:

-- Duplicate check by email
SELECT email, COUNT(*) AS occurrences
FROM 'leads.csv'
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
-- Null check across all key columns
SELECT
  COUNT(*) AS total_rows,
  COUNT(email) AS has_email,
  COUNT(company) AS has_company,
  COUNT(revenue) AS has_revenue
FROM 'leads.csv';

Join Two CSV Files Without Importing Either

DuckDB treats CSV files like database tables. Join them directly:

SELECT o.order_id,
       o.amount,
       c.company_name,
       c.industry
FROM 'orders.csv'    o
JOIN 'customers.csv' c  ON o.customer_id = c.id
WHERE o.amount > 5000
ORDER BY o.amount DESC;

Neither file is imported or modified. DuckDB reads both on the fly during the query. This replaces the common manual workaround of using VLOOKUP to combine two exports in Excel.

Handle Type Conversions

DuckDB's type inference is accurate for well-formed files, but revenue columns stored as text (common in CRM exports) need explicit casting:

SELECT company,
       CAST(REPLACE(revenue_str, ',', '') AS DOUBLE) AS revenue
FROM 'pipeline.csv'
WHERE revenue_str IS NOT NULL;

For dates stored in non-ISO formats:

SELECT strptime(close_date, '%m/%d/%Y') AS close_date_parsed,
       deal_name
FROM 'deals.csv';

strptime converts a string to a date using a format pattern. This handles the MM/DD/YYYY format common in US CRM exports.

Export Results to a New CSV

COPY (
  SELECT region, SUM(revenue) AS total
  FROM 'sales_data.csv'
  GROUP BY region
  ORDER BY total DESC
) TO 'revenue_by_region.csv' (HEADER, DELIMITER ',');

The result file lands in the current directory. Open it in Excel or pass it to another tool.

Query Multiple Files at Once

If you have monthly exports saved with a consistent naming pattern, DuckDB reads them all in one query using a glob:

SELECT month,
       SUM(revenue) AS monthly_revenue
FROM 'sales_2026_*.csv'
GROUP BY month
ORDER BY month;

DuckDB unions the files automatically. This removes the manual step of combining exports before analyzing.

Persist Data for Repeated Analysis

Re-reading a large CSV on every query is fast but adds up for daily work. Load the file into a persistent DuckDB database once:

duckdb my_analysis.duckdb

Inside the shell:

CREATE TABLE sales AS SELECT * FROM 'sales_data.csv';

Subsequent queries against sales skip the CSV read entirely and hit in-memory columnar storage. The .duckdb file is portable: copy it to another machine and query it there with the same CLI.

When SQL Is Too Much Setup

DuckDB is the right tool when you know SQL and need speed on large files without database infrastructure. If you want to skip the terminal entirely and ask questions in plain English about your CSV, VSLZ AI handles that from a file upload with no configuration needed.

Key Takeaways

  • Install DuckDB CLI with one command: brew install duckdb, curl https://install.duckdb.org | sh, or winget install DuckDB.cli
  • Query any CSV with SELECT * FROM 'file.csv' -- no import required
  • Join multiple CSVs with standard SQL JOIN syntax
  • DuckDB loads 20 GB CSVs in about 10 seconds at ~1.96 GB/s
  • Export results with the COPY ... TO command
  • Persist frequently queried data to a .duckdb file to skip repeated CSV reads

FAQ

Does DuckDB work on Windows without Python?

Yes. DuckDB provides a standalone CLI binary for Windows that installs via winget (winget install DuckDB.cli) or a direct download. No Python, no runtime environment, and no administrator privileges are required. Once installed, launch it from any terminal — Command Prompt or PowerShell — and query CSV files immediately.

How large of a CSV file can DuckDB handle?

DuckDB handles files larger than available RAM by streaming data from disk during query execution. In benchmarks, DuckDB reads CSV files at approximately 1.96 GB/s on standard hardware, loading a 20 GB file in about 10 seconds. Queries on files that exceed RAM use disk-based spilling automatically, though performance slows compared to fully in-memory processing. Files up to several hundred GB are practical on a modern laptop.

Can DuckDB read Excel files directly?

DuckDB can read Excel files (.xlsx) using the spatial or Excel extension: INSTALL excel; LOAD excel; SELECT * FROM st_read('data.xlsx');. However, Excel support is less reliable than CSV or Parquet. The recommended approach for Excel files is to save as CSV first, then query with DuckDB. For CSV files, DuckDB reads them natively with no extension needed.

Is DuckDB free to use?

Yes. DuckDB is open source under the MIT license and free for all uses, including commercial. There is no usage limit, no account required, and no telemetry by default. MotherDuck is a paid cloud service built on DuckDB that adds collaboration and hosted compute, but the local CLI and Python library are completely free.

How is DuckDB different from SQLite for CSV analysis?

SQLite is an OLTP database optimized for row-by-row reads and writes, which makes it slow for aggregations across millions of rows. DuckDB is an OLAP database that uses columnar execution, reading only the columns a query needs. For analytical queries on large datasets, DuckDB runs 100 to 1,000 times faster than SQLite on the same data. SQLite also requires importing CSV data before querying; DuckDB queries CSV files directly without import.

Related

OpenMetadata data catalog interface showing database schema discovery
Guides

How to Set Up OpenMetadata for Data Discovery

OpenMetadata is an open-source data catalog that gives teams a single place to discover, document, and govern their data assets. Setting it up takes under 30 minutes using Docker: spin up the containers, log into the UI at localhost:8585, then connect your first data source using one of 90+ pre-built connectors. Once ingestion runs, every table, column, and owner is searchable and lineage-linked across your entire stack.

Arkzero Research · Apr 29, 2026
Streamlit logo on a clean white background
Guides

How to Build a Data Dashboard with Streamlit

Streamlit is an open-source Python library that turns a script into a shareable web dashboard without any front-end code. Install it with pip, write a Python file that loads your CSV with pandas, add sidebar widgets for filtering, and render interactive charts with Plotly. Push the file to GitHub, connect it to Streamlit Community Cloud, and anyone with the URL can view live results. No server configuration required.

Arkzero Research · Apr 29, 2026
Airbyte Cloud data integration platform
Guides

How to Set Up Airbyte Cloud for Data Syncing

Airbyte Cloud is a managed data integration platform that syncs data from SaaS tools, databases, and APIs into a central warehouse without requiring Docker, infrastructure, or engineering resources. A free 30-day trial lets you connect sources like Salesforce, HubSpot, Stripe, or Google Sheets to destinations like BigQuery, Snowflake, or Postgres in minutes. This guide walks through the full setup from account creation to your first automated sync.

Arkzero Research · Apr 29, 2026