How to Set Up DuckLake for Local Analytics
Last updated Apr 26, 2026

DuckLake is a lakehouse format that stores all metadata in a SQL database instead of thousands of small files on object storage. Released as version 1.0 on April 13, 2026, it ships as a DuckDB extension and requires no catalog server to run. Install DuckDB v1.5.2, load the ducklake extension, and you have a production-ready lakehouse with ACID transactions, time travel, and schema evolution in under five minutes.
What DuckLake Actually Is
Most lakehouse formats store metadata as files. Delta Lake writes JSON transaction logs to object storage. Apache Iceberg writes manifest files and snapshot metadata. Both typically rely on a separate catalog service (Unity Catalog for Delta; Lakekeeper or Polaris for Iceberg) to coordinate access across multiple clients.
DuckLake takes a different approach. All metadata lives in a standard SQL database called the catalog. That catalog can be a local SQLite file, a shared PostgreSQL instance, or a DuckDB database. No server is needed for a local setup, and no third-party catalog service is needed for most team setups.
The performance difference is measurable. COUNT(*) queries run 8x to 258x faster than scanning Parquet files directly because the row count sits in the catalog rather than inside the data files. In streaming workloads with frequent small writes, DuckDB Labs benchmarks show 900x faster reads and 100x faster writes compared to Apache Iceberg, driven primarily by the data inlining feature covered below.
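To see why a catalog lookup beats a file scan for COUNT(*), here is a minimal sketch using Python's built-in sqlite3 module. The data_file table and its columns are a hypothetical, simplified stand-in for illustration, not the actual DuckLake catalog schema, and the file paths and row counts are made up.

```python
import sqlite3

# Hypothetical, simplified catalog: one row per data file with its row count.
# This is an illustration, not the actual DuckLake catalog schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_file (table_name TEXT, path TEXT, record_count INTEGER)"
)
conn.executemany(
    "INSERT INTO data_file VALUES (?, ?, ?)",
    [
        ("orders", "my_data/orders-0001.parquet", 1_000_000),
        ("orders", "my_data/orders-0002.parquet", 750_000),
        ("events", "my_data/events-0001.parquet", 42),
    ],
)


def count_rows(table_name: str) -> int:
    """Answer COUNT(*) from catalog metadata without opening any data file."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(record_count), 0) FROM data_file WHERE table_name = ?",
        (table_name,),
    ).fetchone()
    return total


print(count_rows("orders"))  # 1750000
```

The query touches two small catalog rows instead of reading 1.75 million rows of data, which is the effect the benchmark numbers describe.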
Prerequisites
You need DuckDB v1.5.2 or later. The ducklake extension shipped with this release. Check your version:
duckdb --version
If you are below v1.5.2, upgrade via your package manager:
# macOS
brew upgrade duckdb
# Python
pip install duckdb --upgrade
No other dependencies are required for a local SQLite-backed lakehouse.
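If you want to check the version requirement in a script rather than by eye, a small helper along these lines may help. The function names are hypothetical, and the sample strings in the comments are made up rather than real DuckDB output. Comparing tuples of integers avoids the trap of comparing version strings alphabetically, where "1.10.0" would sort before "1.5.2".

```python
import re

MINIMUM = (1, 5, 2)  # minimum DuckDB version stated in this guide


def parse_version(text: str) -> tuple:
    """Extract the first x.y.z version number from a string such as 'v1.5.2 ...'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text: str, minimum: tuple = MINIMUM) -> bool:
    """Return True if the version found in `text` is at least `minimum`."""
    return parse_version(text) >= minimum


# Made-up sample inputs for illustration:
print(meets_minimum("v1.5.2 abcdef"))  # True
print(meets_minimum("v1.4.9"))         # False
print(meets_minimum("1.10.0"))         # True (numeric, not alphabetical, comparison)
```

In a Python session you could pass duckdb.__version__ to meets_minimum; on the command line, pass the text printed by duckdb --version.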
Step 1: Install the ducklake Extension
Open a DuckDB shell, or start a Python session and run import duckdb. Then install and load the extension:
INSTALL ducklake;
LOAD ducklake;
Installation downloads the extension from the DuckDB extension repository and requires an internet connection once. After that, the extension loads offline.
Step 2: Attach a DuckLake
The ATTACH command creates a new lakehouse, or connects to an existing one if the catalog is already there. The ducklake: prefix tells DuckDB to use the ducklake extension. For a local setup, use SQLite as the catalog backend:
ATTACH 'ducklake:sqlite:my_catalog.db' AS lake (DATA_PATH 'my_data/');
This creates my_catalog.db (the catalog, stored as SQLite) and my_data/ (a directory where Parquet files will eventually land). Both are created automatically if they do not exist.
For a shared environment where multiple users or processes need to read and write simultaneously, use PostgreSQL:
ATTACH 'ducklake:postgres:dbname=my_catalog host=localhost' AS lake
    (DATA_PATH 's3://my-bucket/data/');
For a fully local setup with DuckDB as its own catalog:
ATTACH 'ducklake:duckdb:catalog.db' AS lake (DATA_PATH 'data/');
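The three ATTACH variants above differ only in the backend name, the connection string, and the data path. If you generate them from configuration, a helper like the following sketch can keep the quoting consistent. The function attach_statement is hypothetical and only builds the SQL text in the forms shown in this guide; it does not talk to DuckDB.

```python
ALLOWED_BACKENDS = {"sqlite", "postgres", "duckdb"}


def _quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def attach_statement(backend: str, connection: str, data_path: str,
                     alias: str = "lake") -> str:
    """Build an ATTACH statement in the form used in this guide (hypothetical helper)."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported catalog backend: {backend!r}")
    if not alias.isidentifier():
        raise ValueError(f"alias must be a plain identifier: {alias!r}")
    target = f"ducklake:{backend}:{connection}"
    return f"ATTACH {_quote(target)} AS {alias} (DATA_PATH {_quote(data_path)});"


print(attach_statement("sqlite", "my_catalog.db", "my_data/"))
# ATTACH 'ducklake:sqlite:my_catalog.db' AS lake (DATA_PATH 'my_data/');
```

Restricting the backend to a known set and validating the alias keeps a typo in a config file from turning into a confusing SQL error later.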
Step 3: Create Tables and Load Data
Once attached, write standard SQL against the lake schema:
CREATE TABLE lake.orders (
    order_id INTEGER,
    customer VARCHAR,
    amount DECIMAL(10, 2),
    created_at TIMESTAMP
);
INSERT INTO lake.orders VALUES
    (1, 'Acme Corp', 1200.00, '2026-04-01 09:00:00'),
    (2, 'Beta Ltd', 450.00, '2026-04-01 10:30:00'),
    (3, 'Gamma Inc', 8750.00, '2026-04-02 14:15:00');
To load from a CSV or Parquet file:
INSERT INTO lake.orders SELECT * FROM read_csv('raw_orders.csv');
INSERT INTO lake.orders SELECT * FROM 'orders_backup.parquet';
Query the table exactly as you would any DuckDB table:
SELECT customer, SUM(amount) AS total
FROM lake.orders
GROUP BY customer
ORDER BY total DESC;
Step 4: How Data Inlining Works
DuckLake solves the small file problem at write time rather than after the fact. By default, write operations touching 10 rows or fewer land directly in the catalog database, not in new Parquet files on storage. This is called data inlining.
To confirm that small writes stay inlined:
FROM ducklake_list_files('lake', 'orders');
-- returns empty after small inserts; data is in the catalog
To flush all inlined data to Parquet files on storage, run a checkpoint:
CHECKPOINT;
For streaming ingestion where you receive a constant flow of small batches, raise the threshold:
SET ducklake_data_inlining_row_limit = 100;
This is the feature behind the 900x read speedup in DuckDB Labs benchmarks against Iceberg. Iceberg creates a new data file for every small write and requires scheduled compaction to clean up. DuckLake absorbs those writes into the catalog and writes to storage only when the threshold is reached or CHECKPOINT is called.
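The effect of the inlining threshold on file counts can be shown with a toy model. This is a sketch under simplifying assumptions, not DuckLake's implementation: the ToyLake class is hypothetical, writes at or below the limit are kept in an in-memory list standing in for the catalog, larger writes each become one "file", and checkpoint() moves all inlined rows into a single file.

```python
class ToyLake:
    """Toy model of data inlining; not how DuckLake is actually implemented."""

    def __init__(self, inline_row_limit: int = 10):
        self.inline_row_limit = inline_row_limit
        self.inlined_rows = []  # rows held in the "catalog"
        self.files = []         # each entry stands for one Parquet file

    def insert(self, rows) -> None:
        rows = list(rows)
        if len(rows) <= self.inline_row_limit:
            self.inlined_rows.extend(rows)  # small write: no new file
        else:
            self.files.append(rows)         # large write: one new file

    def checkpoint(self) -> None:
        """Flush all inlined rows into a single file."""
        if self.inlined_rows:
            self.files.append(self.inlined_rows)
            self.inlined_rows = []

    def scan(self) -> list:
        """Readers see file rows and inlined rows together."""
        return [row for f in self.files for row in f] + list(self.inlined_rows)


lake = ToyLake(inline_row_limit=10)
for i in range(50):          # 50 single-row writes, as in a streaming feed
    lake.insert([i])
print(len(lake.files))       # 0
lake.checkpoint()
print(len(lake.files))       # 1
print(len(lake.scan()))      # 50
```

Without inlining, the same 50 writes would have produced 50 one-row files that a later compaction job has to merge.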
Step 5: Sorted Tables for Faster Reads
If queries frequently filter by a particular column, declare a sort order on the table. DuckLake will pre-sort new inserts and skip irrelevant files at read time:
ALTER TABLE lake.orders SET SORTED BY (created_at ASC);
Inserts after this statement are pre-sorted automatically. Existing data is not retroactively sorted. To sort existing data, re-insert from a sorted query or use the migration scripts in the DuckDB documentation.
Sort expressions support arbitrary SQL, which means you can sort by a computed expression:
ALTER TABLE lake.events SET SORTED BY (date_trunc('day', occurred_at) ASC);
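Why sorting helps at read time can be sketched with per-file minimum and maximum values, a common technique for file skipping. This guide does not describe how DuckLake implements skipping, so treat the mechanism below as an assumption; the FileStats class, the file names, and the day numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_value: int  # smallest sort-key value in the file (here: a day number)
    max_value: int  # largest sort-key value in the file


def files_to_scan(files, lo: int, hi: int) -> list:
    """Keep only files whose [min, max] range overlaps the filter range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


# Sorted inserts give each file a narrow, non-overlapping range.
sorted_files = [
    FileStats("a.parquet", 1, 10),
    FileStats("b.parquet", 11, 20),
    FileStats("c.parquet", 21, 30),
]
# Unsorted inserts spread every file across almost the whole range.
unsorted_files = [
    FileStats("x.parquet", 1, 29),
    FileStats("y.parquet", 2, 30),
    FileStats("z.parquet", 3, 28),
]

print(files_to_scan(sorted_files, 12, 15))    # ['b.parquet']
print(files_to_scan(unsorted_files, 12, 15))  # ['x.parquet', 'y.parquet', 'z.parquet']
```

The same filter reads one file in the sorted layout and all three in the unsorted one, even though both layouts hold the same data.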
Step 6: Bucket Partitioning
For high-cardinality columns you filter on frequently, bucket partitioning distributes data across a fixed number of buckets using a murmur3 hash. This is the same partitioning scheme used by Apache Iceberg v2, making DuckLake tables interoperable with Iceberg-compatible engines:
ALTER TABLE lake.orders SET PARTITIONED BY (bucket(8, customer));
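A query with WHERE customer = 'Acme Corp' now reads only the bucket that customer hashes to, one of eight, rather than the full table. If customers are spread evenly across buckets, that is roughly one eighth of the data, so the absolute savings grow with table size.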
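To make the bucket assignment concrete, here is a sketch that follows the bucket definition in the Apache Iceberg specification: a 32-bit murmur3 hash of the value's UTF-8 bytes, with the sign bit cleared, taken modulo the bucket count. Whether DuckLake matches this byte for byte is an assumption here; the function names are hypothetical, and the hash is written out in plain Python because the standard library has no murmur3.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit variant), returned as an unsigned integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    body_end = n - (n % 4)
    # Process the input four bytes at a time, little-endian.
    for i in range(0, body_end, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Mix in the remaining one to three bytes, if any.
    tail = data[body_end:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalization: fold in the length and scramble the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bucket(value: str, num_buckets: int) -> int:
    """Bucket index per the Iceberg-style definition (assumed to apply here)."""
    return (murmur3_32(value.encode("utf-8")) & 0x7FFFFFFF) % num_buckets


for customer in ["Acme Corp", "Beta Ltd", "Gamma Inc"]:
    print(customer, "->", bucket(customer, 8))
```

Because the hash is deterministic, every row for a given customer lands in the same bucket, and an equality filter on customer only needs to read that one bucket.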
Time Travel
DuckLake records every transaction as a snapshot. Query the table as it appeared at any past point:
SELECT * FROM lake.orders AT (TIMESTAMP => now() - INTERVAL '1 day');
List all available snapshots:
FROM ducklake_snapshots('lake');
Time travel is useful for debugging bad writes, auditing changes, and rolling back to a known-good state without maintaining separate backup copies.
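Conceptually, a time-travel read picks the newest snapshot committed at or before the requested timestamp. The sketch below models that lookup with a hypothetical SnapshotLog class. It stores a full copy of the rows per snapshot for simplicity, which is an assumption made for the example and not how a real lakehouse stores its history.

```python
from bisect import bisect_right
from datetime import datetime


class SnapshotLog:
    """Toy snapshot history: commit times in ascending order, one state per commit."""

    def __init__(self):
        self._times = []
        self._states = []

    def commit(self, when: datetime, rows) -> None:
        if self._times and when < self._times[-1]:
            raise ValueError("commits must be in chronological order")
        self._times.append(when)
        self._states.append(tuple(rows))

    def as_of(self, when: datetime) -> tuple:
        """Return the table state from the latest snapshot at or before `when`."""
        idx = bisect_right(self._times, when)
        if idx == 0:
            raise LookupError("no snapshot at or before that time")
        return self._states[idx - 1]


log = SnapshotLog()
log.commit(datetime(2026, 4, 1, 9, 0), ["order 1"])
log.commit(datetime(2026, 4, 1, 10, 30), ["order 1", "order 2"])
log.commit(datetime(2026, 4, 2, 14, 15), ["order 1", "order 2", "order 3"])

print(log.as_of(datetime(2026, 4, 1, 12, 0)))  # ('order 1', 'order 2')
```

Rolling back to a known-good state is then a matter of reading the snapshot from before the bad write and writing it back as the current state.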
When DuckLake Makes Sense
DuckLake is suited for local and single-team analytics that need lakehouse guarantees (ACID transactions, time travel, schema evolution) without running a catalog server. It works well in three situations: when data arrives in many small batches and compaction would otherwise become a recurring maintenance job; when teams share a PostgreSQL instance and need multiple DuckDB clients to coordinate safely; and when files are too large for in-memory DuckDB but the workload does not justify a full Databricks or Snowflake contract.
For enterprises already running workloads on Spark or Flink, DuckLake is not a replacement. It covers the local and mid-scale range where those platforms are overkill.
If you want to skip format setup entirely, VSLZ connects directly to file sources and runs analytics from a plain-English prompt without requiring a catalog or storage path configuration.
Next Steps
After your first DuckLake is running, explore schema evolution with ALTER TABLE ... ADD COLUMN, the VARIANT type for semi-structured event data where you need field-level filter pushdown, and the migration guide for moving an existing DuckDB database into DuckLake format. DuckLake v2.0 is not on the near-term roadmap. The DuckDB team has stated the focus through 2026 is maturing the current feature set and guaranteeing backward compatibility of the v1.0 specification.
FAQ
What databases can DuckLake use as a catalog?
DuckLake supports three catalog backends: SQLite (local file, no server needed), PostgreSQL (recommended for shared or multi-user setups), and DuckDB itself. You specify the backend in the ATTACH command using the ducklake: prefix followed by the backend type and connection string.
How is DuckLake different from Apache Iceberg?
Iceberg stores all metadata as files on object storage and requires a separate catalog service such as Lakekeeper or Polaris. DuckLake stores metadata in a SQL database, which eliminates the catalog server requirement for most setups. DuckLake also includes data inlining, which absorbs small writes into the catalog rather than creating individual Parquet files, avoiding the small file problem that affects Iceberg in streaming workloads. DuckDB Labs benchmarks show 900x faster reads and 100x faster writes than Iceberg in streaming scenarios.
Does DuckLake require object storage like S3?
No. For local development and single-machine analytics, DuckLake works entirely with a local directory as the data path. Object storage such as S3, GCS, or Azure Blob is supported for team and production setups, but is not required. You set the data path when attaching the lakehouse using the DATA_PATH parameter.
What is data inlining in DuckLake?
Data inlining is a feature that stages small write operations directly in the catalog database rather than writing new Parquet files on storage. The default threshold is 10 rows. When a write operation touches that many rows or fewer, no new file is created. Data is flushed to storage files when you run CHECKPOINT or when a write exceeds the threshold. This eliminates the small file accumulation problem that affects Delta Lake and Iceberg in high-frequency write scenarios. The threshold is configurable via SET ducklake_data_inlining_row_limit.
Which version of DuckDB is required for DuckLake?
DuckLake v1.0 requires DuckDB v1.5.2 or later. Both were released on April 13, 2026. You can check your current version with duckdb --version and upgrade via brew upgrade duckdb on macOS or pip install duckdb --upgrade in Python. The ducklake extension is installed via INSTALL ducklake; LOAD ducklake; inside a DuckDB session.


