
How to Set Up DuckLake for Your Data Lake

Arkzero Research · Apr 25, 2026 · 6 min read

Last updated Apr 25, 2026

DuckLake is an open table format, with v1.0 released in April 2026, that stores lakehouse metadata in a SQL database instead of thousands of JSON and Avro files. To set it up, install the ducklake DuckDB extension, attach a local catalog file for testing, then switch to PostgreSQL and S3 for production. The result is a multi-user, ACID-compliant data lake with time travel and branching, running on infrastructure most teams already have.

What DuckLake Solves

Most data lakes have a metadata problem. Apache Iceberg stores table metadata as a hierarchy of JSON and Avro files on object storage. A single streaming ingest run can generate more than 300 metadata files before a row of business data is queryable. Those files require periodic compaction, a separate catalog service to track them, and additional infrastructure to keep them consistent.

DuckLake v1.0, released on April 13, 2026, takes a different approach. All table metadata lives in a standard SQL database: PostgreSQL, MySQL, SQLite, or DuckDB itself. The actual data files remain Parquet on object storage, the same as any other lakehouse format. The catalog state that tells DuckDB what those files represent is stored as database rows, not as JSON files on S3.

The practical result: consistent reads, atomic writes, schema evolution, and time travel built on infrastructure most engineering teams already operate.

How the Architecture Works

DuckLake separates three concerns.

Data storage holds the Parquet files. For production this is an S3 bucket. For local experiments it is a directory on disk.

Metadata storage holds the catalog: table definitions, schema versions, snapshots, branch state, and transaction history. In production this is PostgreSQL or MySQL. For single-user work it is a .ducklake file backed by DuckDB.

Compute is DuckDB v1.5.2 or later, using the ducklake extension to bridge the catalog and the Parquet data.

This separation allows multiple users to connect to the same data lake through a shared catalog. When one analyst inserts rows and another runs a SELECT, the PostgreSQL catalog handles concurrency and isolation natively, without file-level locking on object storage.

Prerequisites

Before running any commands, confirm you have DuckDB v1.5.2 or later installed. Run duckdb --version to check. You also need an S3 bucket with read/write access, or a local directory for testing. If using S3, set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION in your environment. For production setup you need a PostgreSQL 14 or later instance reachable from your machine.

Install DuckDB on Linux:

curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip
unzip duckdb.zip && sudo mv duckdb /usr/local/bin/duckdb

On macOS: brew install duckdb
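Once the binary is installed, the same version check can be run from inside the DuckDB shell. The following is a minimal sketch: it assumes the extension is registered under the name ducklake, as used later in this guide, and the second query may show it as not installed (or return no row on some builds) until Step 1 has been run.

```sql
-- Report the version of the running DuckDB build
SELECT version() AS duckdb_version;

-- Show whether the ducklake extension is installed and loaded in this session
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'ducklake';
```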

Step 1: Local Setup

The fastest way to try DuckLake requires no external dependencies. Open a DuckDB shell and run:

INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:my_catalog.ducklake' AS lake;
USE lake;

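The ATTACH command creates a .ducklake file that stores all catalog metadata locally. Unless a DATA_PATH is given, data files are written to a directory created next to the catalog file and named after it (for example my_catalog.ducklake.files/). Standard SQL works from this point: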

CREATE TABLE lake.orders (
    order_id   INTEGER,
    customer   VARCHAR,
    amount     DECIMAL(10, 2),
    order_date DATE
);

INSERT INTO lake.orders VALUES
    (1, 'Acme Corp', 4500.00, '2026-04-01'),
    (2, 'Beta LLC',  1200.50, '2026-04-03');

SELECT customer, SUM(amount) AS total
FROM lake.orders
GROUP BY customer;

DuckLake writes each batch as a Parquet file and records the transaction snapshot in the metadata file. Run SELECT * FROM ducklake_snapshots('lake'); to view the history.
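To see the snapshot mechanism at work, one option is to make a second change and list the history again. This is a sketch against the lake catalog and orders table created above. The column names selected from ducklake_snapshots (snapshot_id, snapshot_time, changes) are an assumption about the function's output and may differ between extension versions.

```sql
-- A second write should be recorded as a new snapshot
UPDATE lake.orders
SET amount = 1300.00
WHERE order_id = 2;

-- List snapshots, oldest first.
-- If the assumed column names differ in your version, fall back to SELECT *.
SELECT snapshot_id, snapshot_time, changes
FROM ducklake_snapshots('lake')
ORDER BY snapshot_id;
```

The newest row should correspond to the UPDATE, while the earlier rows cover the CREATE TABLE and INSERT statements.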

Step 2: Production Setup with PostgreSQL and S3

For multi-user or production workloads, move the metadata to PostgreSQL and the data to S3.

Install the required extensions in DuckDB:

INSTALL ducklake; INSTALL postgres; INSTALL httpfs; INSTALL aws;
LOAD ducklake; LOAD postgres; LOAD httpfs; LOAD aws;

Configure S3 credentials if not already in your environment:

SET s3_region = 'us-east-1';
SET s3_access_key_id = 'YOUR_KEY';
SET s3_secret_access_key = 'YOUR_SECRET';
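As an alternative to session-level SET variables, DuckDB's secrets manager can hold the same credentials. The sketch below shows two options under stated assumptions: the secret name lake_s3 is hypothetical, the key values are placeholders, and the credential_chain provider relies on the aws extension loaded above to pick up credentials from the environment or an AWS profile.

```sql
-- Option 1: store explicit keys in a named secret (placeholder values)
CREATE OR REPLACE SECRET lake_s3 (
    TYPE s3,
    KEY_ID 'YOUR_KEY',
    SECRET 'YOUR_SECRET',
    REGION 'us-east-1'
);

-- Option 2: let the aws extension resolve credentials from the standard
-- AWS chain (environment variables, shared config, instance role).
-- This replaces the secret defined in Option 1; use one or the other.
CREATE OR REPLACE SECRET lake_s3 (
    TYPE s3,
    PROVIDER credential_chain
);
```

Option 2 keeps keys out of SQL scripts and shell history, which tends to matter once the setup is shared across a team.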

Attach the production catalog:

ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=your-pg-host user=your-user password=your-password'
    AS prod_lake (DATA_PATH 's3://your-bucket/ducklake/');

USE prod_lake;

On first ATTACH, DuckLake creates the catalog schema in PostgreSQL automatically. Every subsequent CREATE TABLE, INSERT, UPDATE, DELETE, and ALTER TABLE writes metadata to Postgres and data to S3 as a single atomic transaction. Any DuckDB client that attaches to the same Postgres connection string sees the same consistent lake state.
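Steps 3 and 4 query prod_lake.orders, so the production catalog needs its own copy of the table. A minimal sketch, reusing the schema and sample rows from Step 1 and wrapping them in an explicit transaction so that other clients see either all of the changes or none. Whether the whole transaction is recorded as a single snapshot is an assumption worth confirming with ducklake_snapshots('prod_lake').

```sql
BEGIN TRANSACTION;

-- Same schema as the local example in Step 1
CREATE TABLE IF NOT EXISTS prod_lake.orders (
    order_id   INTEGER,
    customer   VARCHAR,
    amount     DECIMAL(10, 2),
    order_date DATE
);

-- Sample rows; the Parquet file lands under the DATA_PATH on S3
INSERT INTO prod_lake.orders VALUES
    (1, 'Acme Corp', 4500.00, '2026-04-01'),
    (2, 'Beta LLC',  1200.50, '2026-04-03');

-- Metadata in PostgreSQL becomes visible to other clients only at this point
COMMIT;
```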

Step 3: Schema Evolution and Time Travel

Schema changes are a first-class operation. Adding a column does not rewrite existing Parquet files:

ALTER TABLE prod_lake.orders ADD COLUMN channel VARCHAR DEFAULT 'online';

DuckLake records the change as a new snapshot. Rows written before the change read back the column's default ('online' here), or NULL when the column is added without a default. To query a prior state, use the AT clause with a snapshot ID:

-- List available snapshots
SELECT * FROM ducklake_snapshots('prod_lake');

-- Query table as it was at snapshot 1
SELECT * FROM prod_lake.orders AT (VERSION => 1);

The time travel state is stored entirely in the PostgreSQL catalog. No separate log replay or compaction job is needed.
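Snapshot IDs are not the only handle on history: the AT clause also accepts a timestamp. The sketch below is illustrative. The one-hour interval and the snapshot ID 1 are placeholders, and either query fails if prod_lake.orders did not yet exist at the requested point.

```sql
-- Query the table as it was one hour ago
SELECT *
FROM prod_lake.orders AT (TIMESTAMP => now() - INTERVAL '1 hour');

-- Compare an older snapshot with the current state.
-- Replace 1 with a snapshot ID returned by ducklake_snapshots('prod_lake').
SELECT
    (SELECT count(*) FROM prod_lake.orders AT (VERSION => 1)) AS rows_then,
    (SELECT count(*) FROM prod_lake.orders)                    AS rows_now;
```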

Step 4: Branching for Safe Experimentation

DuckLake v1.0 ships branching as a first-class feature. Create a branch to test changes in isolation before committing:

CREATE BRANCH staging FROM main;
USE BRANCH staging;

DELETE FROM prod_lake.orders WHERE amount < 100;
SELECT COUNT(*) FROM prod_lake.orders;

-- Merge if the result is correct
MERGE BRANCH staging INTO main;

-- Or discard without touching main
DROP BRANCH staging;

Branches use copy-on-write semantics on the underlying Parquet files, so branching does not duplicate storage. The PostgreSQL catalog tracks lineage per branch with no additional tooling.

When DuckLake Fits and When It Does Not

DuckLake is the right choice when your team is DuckDB-centric and wants minimal catalog infrastructure. It eliminates the catalog service layer that Iceberg requires (Lakekeeper, Polaris, Nessie, Unity Catalog) and the Spark dependency that Delta Lake deployments typically carry. A streaming DuckLake ingest creates Parquet data files plus metadata rows in PostgreSQL. The equivalent Iceberg setup generates more than 300 metadata files on S3 for the same workload, all requiring compaction and a running catalog service to stay queryable.

Use Iceberg or Delta Lake if you run multi-engine workloads where Spark, Trino, and Flink all need to read the same tables. DuckLake's primary implementation is the DuckDB extension. Other engines can query DuckLake catalogs, but the native tooling is DuckDB-first.

Practical Summary

DuckLake v1.0 moves lakehouse metadata from object storage files into a relational database. For teams that run SQL workflows, the setup reduces operational overhead to a PostgreSQL instance and an S3 bucket. Time travel, schema evolution, branching, and multi-user access all work on infrastructure you already have. The full setup takes four extension installs in a DuckDB shell and a single ATTACH command. If you want to explore your lake data in plain English without writing SQL, VSLZ handles end-to-end analysis from a file upload with no configuration needed.

FAQ

What is DuckLake and how does it differ from other lakehouse formats?

DuckLake is an open table format that stores lakehouse metadata in a standard SQL database rather than as files on object storage. Apache Iceberg stores metadata as JSON and Avro files, and Delta Lake as a JSON transaction log with Parquet checkpoints; both create operational overhead. A streaming ingest in Iceberg generates more than 300 metadata files that need compaction and a running catalog service. DuckLake stores the same metadata as rows in PostgreSQL, MySQL, or a local DuckDB or SQLite file, eliminating the separate catalog layer.

Do I need PostgreSQL to use DuckLake?

No. For single-user or local testing, DuckLake uses a local .ducklake file backed by DuckDB as the metadata store with no external database required. PostgreSQL or MySQL is recommended for production and multi-user workloads where multiple DuckDB clients need consistent access to the same catalog. Switching from local to PostgreSQL-backed metadata is a single change to the ATTACH connection string.
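Side by side, the two ATTACH forms look like this. It is a sketch built from the statements used earlier in this guide: the connection values are placeholders, and the explicit local DATA_PATH ('lake_data/') is a hypothetical directory added here so that both variants state where the Parquet files go.

```sql
-- Local: metadata in a DuckDB-backed file, data in a local directory
ATTACH 'ducklake:my_catalog.ducklake' AS lake
    (DATA_PATH 'lake_data/');

-- Production: metadata in PostgreSQL, data on S3
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=your-pg-host user=your-user password=your-password'
    AS prod_lake (DATA_PATH 's3://your-bucket/ducklake/');
```

Note that attaching a new catalog does not move existing tables. Data written to the local catalog stays there unless it is copied across.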

Does DuckLake support AWS S3 for data storage?

Yes. DuckLake stores data as Parquet files on object storage. For S3, install the httpfs and aws DuckDB extensions, set your AWS credentials as DuckDB SET variables, and specify the S3 path in the DATA_PATH parameter of your ATTACH command. Any S3-compatible object store works, including Google Cloud Storage, Cloudflare R2, and MinIO.
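For an S3-compatible store such as MinIO, the endpoint has to be set explicitly. A minimal sketch using DuckDB's secrets manager, assuming a MinIO server on localhost:9000 without TLS; the secret name minio_s3, the endpoint, and the key values are placeholders.

```sql
-- Point the S3 client at a self-hosted, S3-compatible endpoint
CREATE OR REPLACE SECRET minio_s3 (
    TYPE s3,
    KEY_ID 'YOUR_KEY',
    SECRET 'YOUR_SECRET',
    ENDPOINT 'localhost:9000',  -- host:port of the MinIO server (assumed)
    URL_STYLE 'path',           -- MinIO usually expects path-style URLs
    USE_SSL false               -- set to true when the endpoint serves HTTPS
);
```

The DATA_PATH in the ATTACH command keeps its s3:// form. Only the secret changes.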

How does DuckLake time travel work?

Each INSERT, UPDATE, DELETE, or schema change creates a new snapshot recorded in the metadata catalog. List snapshots with SELECT * FROM ducklake_snapshots('your_catalog') and query any prior state with SELECT * FROM your_table AT (VERSION => snapshot_id). Because snapshots are catalog records in a database, there are no additional manifest files to compact compared to Iceberg's snapshot approach.

Which version of DuckDB is required for DuckLake?

DuckLake v1.0 requires DuckDB v1.5.2 or later. The ducklake extension installs from DuckDB's official extension repository with INSTALL ducklake; LOAD ducklake; from any DuckDB shell. Check your installed version with duckdb --version before running setup. DuckLake v1.0 ships with guaranteed backward compatibility, so catalogs created with this version remain readable in future DuckDB releases.
