Guides

How to Analyze CSV Files with DuckDB

Arkzero ResearchApr 10, 20266 min read

Last updated Apr 10, 2026

DuckDB is an in-process SQL database that lets you query CSV files directly with standard SQL, without installing a database server or loading data into a separate tool. You can run it from the command line or inside a Python script, and it handles files of several gigabytes without significant setup. This guide walks through installing DuckDB, querying CSVs, running aggregations and window functions, and exporting results, covering the steps that most beginner tutorials skip.
DuckDB CSV file analysis tutorial

What DuckDB Is and Why It Matters for CSV Work

DuckDB is an embeddable analytical database that runs inside your process rather than as a separate server. You install it, point it at a file, and run SQL. There is no daemon to start, no connection string to configure, and no schema to define before you can query.

For ops managers and analysts who receive CSV exports from ERPs, payment processors, or ad platforms, this removes the biggest friction in ad hoc data work: getting the file into something queryable. You do not need pandas, Excel, or a hosted warehouse. You need DuckDB and a terminal.

DuckDB's columnar storage engine is built for analytics. In internal benchmarks published with the 1.1 release in late 2024, join optimization with dynamic filter pushdown delivered roughly 10x performance improvement, and the Python streaming interface dropped from 1.17 seconds to 0.12 seconds on standard query sets. For files in the hundreds of megabytes range, the difference between DuckDB and loading a CSV into pandas is noticeable.

Installing DuckDB

DuckDB ships as a single binary. There are no dependencies.

macOS:

brew install duckdb

Windows:

winget install DuckDB.cli

Python (if you prefer to work in notebooks or scripts):

pip install duckdb

Verify the install:

duckdb --version

Querying a CSV File Without Loading It

The fastest way to start is to treat the CSV file as a table directly. DuckDB can read it without any import step.

SELECT * FROM 'sales_data.csv' LIMIT 10;

DuckDB's built-in CSV sniffer automatically detects the delimiter, header row, and column types. For most standard exports this works without any configuration.

If the file uses a pipe delimiter or non-standard encoding:

SELECT * FROM read_csv('sales_data.csv', delim='|', header=true, encoding='UTF-8') LIMIT 10;

To see what columns and types DuckDB inferred:

DESCRIBE SELECT * FROM 'sales_data.csv';

Running Aggregations

Once you can query the file, standard SQL aggregations work immediately.

Count rows by category:

SELECT region, COUNT(*) as orders, SUM(revenue) as total_revenue
FROM 'sales_data.csv'
GROUP BY region
ORDER BY total_revenue DESC;

Filter before aggregating:

SELECT product_id, AVG(sale_price) as avg_price
FROM 'sales_data.csv'
WHERE order_date >= '2024-01-01'
GROUP BY product_id
HAVING AVG(sale_price) > 50
ORDER BY avg_price DESC;

DuckDB handles date parsing automatically from string columns in most common formats. You can also cast explicitly:

SELECT CAST(order_date AS DATE) as day, SUM(revenue) as daily_revenue
FROM 'sales_data.csv'
GROUP BY day
ORDER BY day;

Using Window Functions for Running Totals and Rankings

This is the step most beginner DuckDB tutorials skip. Window functions let you compute running totals, rankings, and period comparisons without subqueries or self-joins.

Running total by date:

SELECT
  order_date,
  revenue,
  SUM(revenue) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total
FROM 'sales_data.csv'
ORDER BY order_date;

Rank customers by total spend:

SELECT
  customer_id,
  SUM(revenue) as total_spend,
  RANK() OVER (ORDER BY SUM(revenue) DESC) as spend_rank
FROM 'sales_data.csv'
GROUP BY customer_id
ORDER BY spend_rank;

Month-over-month comparison using LAG:

WITH monthly AS (
  SELECT
    DATE_TRUNC('month', CAST(order_date AS DATE)) as month,
    SUM(revenue) as monthly_revenue
  FROM 'sales_data.csv'
  GROUP BY month
)
SELECT
  month,
  monthly_revenue,
  LAG(monthly_revenue) OVER (ORDER BY month) as prev_month,
  ROUND(
    (monthly_revenue - LAG(monthly_revenue) OVER (ORDER BY month))
    / LAG(monthly_revenue) OVER (ORDER BY month) * 100, 1
  ) as pct_change
FROM monthly
ORDER BY month;

Using CTEs to Break Down Complex Queries

Common Table Expressions (CTEs) let you name intermediate results and chain them, which makes complex logic readable without nesting subqueries.

WITH top_products AS (
  SELECT product_id, SUM(revenue) as total_revenue
  FROM 'sales_data.csv'
  GROUP BY product_id
  ORDER BY total_revenue DESC
  LIMIT 10
),
product_orders AS (
  SELECT product_id, COUNT(*) as order_count
  FROM 'sales_data.csv'
  GROUP BY product_id
)
SELECT
  t.product_id,
  t.total_revenue,
  p.order_count,
  ROUND(t.total_revenue / p.order_count, 2) as revenue_per_order
FROM top_products t
JOIN product_orders p ON t.product_id = p.product_id
ORDER BY t.total_revenue DESC;

Querying Multiple CSV Files at Once

If your exports are split by month or region, DuckDB can query all of them together using glob patterns:

SELECT * FROM 'sales_2024_*.csv';

DuckDB unions the files automatically, assuming they share a schema. This is useful for stitching together monthly exports without manual concatenation.

Exporting Results

Once you have a query that produces the output you want, write it directly to a new CSV or Parquet file:

COPY (
  SELECT region, SUM(revenue) as total
  FROM 'sales_data.csv'
  GROUP BY region
  ORDER BY total DESC
) TO 'region_summary.csv' (HEADER, DELIMITER ',');

Parquet is useful if you want to pass the result to another tool or archive it efficiently:

COPY (SELECT * FROM 'sales_data.csv' WHERE region = 'APAC') TO 'apac_sales.parquet' (FORMAT PARQUET);

Using DuckDB Inside Python

If you prefer to stay in a script or notebook:

import duckdb

result = duckdb.sql("""
  SELECT region, SUM(revenue) as total_revenue
  FROM 'sales_data.csv'
  GROUP BY region
  ORDER BY total_revenue DESC
""").df()

print(result)

The .df() method returns a pandas DataFrame. DuckDB integrates directly with pandas without copying data, so this works efficiently even for large files. If you prefer to skip the pandas dependency, call .fetchall() to get plain Python lists instead.

For teams that upload CSV files and need charts and statistical summaries without writing SQL, VSLZ AI provides an alternative: upload a file, ask a question in plain English, and get end-to-end output from a single prompt.

Troubleshooting Common Issues

Column type mismatch: If numeric columns are read as strings, specify types explicitly using read_csv('file.csv', types={'price': 'DOUBLE'}).

Encoding errors: Add encoding='Latin-1' or encoding='UTF-16' for non-UTF-8 files exported from older Windows tools.

Memory limits on very large files: DuckDB streams data from disk by default, but you can tune available memory with SET memory_limit = '4GB'; before running queries.

Missing rows: If the sniffer skips rows due to parsing errors, add ignore_errors=true to read_csv(). DuckDB will skip malformed lines and report counts separately.

Summary

DuckDB gives you a full SQL analytics engine that reads CSV files directly, with no server, no schema setup, and no data migration. The key workflow is: install, query with SELECT * FROM 'file.csv', then add aggregations, window functions, or CTEs as the analysis demands. Export results with COPY when you need to share or archive them. For large file sets or repeated analysis tasks, the Python client makes it straightforward to embed DuckDB queries in existing scripts.

FAQ

Does DuckDB require a database server to query CSV files?

No. DuckDB is an in-process database that runs entirely inside your application or terminal session. There is no separate server process to start or manage. You install the CLI or Python package, point it at a CSV file, and run SQL immediately.

How large a CSV file can DuckDB handle?

DuckDB streams data from disk rather than loading everything into memory, which means it can handle files significantly larger than available RAM. In practice, files in the range of several gigabytes work without configuration changes. For very large files, you can tune the memory limit with SET memory_limit = "8GB"; before running queries.

Can DuckDB query multiple CSV files at once?

Yes. DuckDB supports glob patterns in FROM clauses, so SELECT * FROM "sales_2024_*.csv" will union all matching files automatically. The files must share a compatible schema. This is useful for monthly or regional exports that have been split across multiple files.

How do I export DuckDB query results to a new CSV file?

Use the COPY statement: COPY (SELECT ...) TO "output.csv" (HEADER, DELIMITER ","). You can also export to Parquet format by specifying (FORMAT PARQUET) instead. The COPY statement accepts any valid SELECT query as its source.

Can I use DuckDB with Python and pandas?

Yes. Install duckdb via pip, then use duckdb.sql("SELECT ...").df() to return a pandas DataFrame. DuckDB integrates with pandas without copying data, which keeps memory usage efficient. If you do not need pandas, .fetchall() returns plain Python lists.

Related

OpenMetadata data catalog interface showing database schema discovery
Guides

How to Set Up OpenMetadata for Data Discovery

OpenMetadata is an open-source data catalog that gives teams a single place to discover, document, and govern their data assets. Setting it up takes under 30 minutes using Docker: spin up the containers, log into the UI at localhost:8585, then connect your first data source using one of 90+ pre-built connectors. Once ingestion runs, every table, column, and owner is searchable and lineage-linked across your entire stack.

Arkzero Research · Apr 29, 2026
Streamlit logo on a clean white background
Guides

How to Build a Data Dashboard with Streamlit

Streamlit is an open-source Python library that turns a script into a shareable web dashboard without any front-end code. Install it with pip, write a Python file that loads your CSV with pandas, add sidebar widgets for filtering, and render interactive charts with Plotly. Push the file to GitHub, connect it to Streamlit Community Cloud, and anyone with the URL can view live results. No server configuration required.

Arkzero Research · Apr 29, 2026
Airbyte Cloud data integration platform
Guides

How to Set Up Airbyte Cloud for Data Syncing

Airbyte Cloud is a managed data integration platform that syncs data from SaaS tools, databases, and APIs into a central warehouse without requiring Docker, infrastructure, or engineering resources. A free 30-day trial lets you connect sources like Salesforce, HubSpot, Stripe, or Google Sheets to destinations like BigQuery, Snowflake, or Postgres in minutes. This guide walks through the full setup from account creation to your first automated sync.

Arkzero Research · Apr 29, 2026