Getting started with data lakes
TL;DR: A hands-on walkthrough of querying data lake tables, accelerating them with MergeTree, and writing results back to Iceberg. All steps use public datasets and work on both Cloud and OSS.
Query Iceberg data directly
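As a minimal sketch of direct querying (the bucket path below is a placeholder rather than a real dataset, and NOSIGN assumes a publicly readable bucket):

```sql
-- Count rows in an Iceberg table straight from S3 (path is illustrative)
SELECT count()
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits', NOSIGN);

-- Inspect the table's schema without reading any row data
DESCRIBE TABLE icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits', NOSIGN);
```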
The fastest way to start is with the icebergS3() table function — point it at an Iceberg table in S3 and query it immediately, with no setup required. You can inspect the schema with DESCRIBE. Equivalent table functions exist for the other supported formats: deltaLake(), hudi(), and paimon(). Learn more: Querying open table formats directly covers all four formats, cluster variants for distributed reads, and storage backend options (S3, Azure, HDFS, local).

Create a persistent table engine
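A persistent table over Iceberg data in S3 can be sketched as follows; the table name and S3 path are illustrative, and the example assumes your ClickHouse version infers the column list from the Iceberg metadata:

```sql
-- The table is a pointer to the Iceberg metadata in S3; no data is copied
CREATE TABLE iceberg_hits
ENGINE = IcebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits', NOSIGN);

-- Query it like any other ClickHouse table
SELECT count() FROM iceberg_hits;
```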
For repeated access, create a table using the Iceberg table engine so you don’t need to pass the path on every query. The data stays in S3 — no data is duplicated.

Connect to a catalog
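As a hedged sketch of a catalog connection through the DataLakeCatalog database engine — the region, credentials, and the `analytics.events` table are placeholders, and setting names may differ across ClickHouse versions:

```sql
CREATE DATABASE glue_catalog
ENGINE = DataLakeCatalog
SETTINGS catalog_type = 'glue',
         region = 'us-east-1',
         aws_access_key_id = '<key>',
         aws_secret_access_key = '<secret>';

-- Catalog namespace and table share one identifier, hence the backticks
SELECT count() FROM glue_catalog.`analytics.events`;
```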
Most organizations manage Iceberg tables through a data catalog to centralize table metadata and data discovery. ClickHouse supports connecting to your catalog with the DataLakeCatalog database engine, which exposes all catalog tables as a ClickHouse database. This is the more scalable path: as new Iceberg tables are created, they become accessible in ClickHouse without additional work. AWS Glue is one commonly used catalog. Note that backticks are required around <database>.<table> because ClickHouse doesn’t natively support more than one namespace.

Issue a query
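For illustration, here is the same aggregation phrased against each access method; every table and column name in this sketch is hypothetical:

```sql
-- Table function: the source is inline
SELECT toDate(eventtime) AS day, count() AS hits
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits', NOSIGN)
GROUP BY day
ORDER BY day;

-- Table engine: only the FROM clause changes
-- ... FROM iceberg_hits

-- Catalog database: again, only the FROM clause changes
-- ... FROM glue_catalog.`analytics.hits`
```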
Regardless of which method you used above — table function, table engine, or catalog — the same ClickHouse SQL works across all of them; only the FROM clause changes. All ClickHouse SQL functions, joins, and aggregations work the same way regardless of the data source.

Load a subset into ClickHouse
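One way to sketch the baseline-then-load sequence; the path and column names are illustrative, and the counterid = 38 filter mirrors the walkthrough that follows:

```sql
-- Baseline: filtered read straight from Iceberg in S3 (expect several seconds)
SELECT count()
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits', NOSIGN)
WHERE counterid = 38;

-- Native MergeTree copy, ordered so the filter column leads the primary key
CREATE TABLE hits_local
ENGINE = MergeTree
ORDER BY (counterid, eventtime)
AS SELECT *
FROM icebergS3('https://example-bucket.s3.amazonaws.com/warehouse/hits', NOSIGN);

-- Same filter; the sparse primary index now skips irrelevant granules
SELECT count() FROM hits_local WHERE counterid = 38;
```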
Querying Iceberg directly is convenient, but performance is bounded by network throughput and the file layout. For analytical workloads, load the data into a native MergeTree table.

First, run a filtered query over the Iceberg table to get a baseline — a query with a counterid filter — and expect it to take several seconds. Now create a MergeTree table and load the data. Because counterid is the first column in the ORDER BY key, ClickHouse’s sparse primary index skips directly to the relevant granules — only reading the rows for counterid = 38 instead of scanning all 100 million rows. The result is a dramatic speedup.

The accelerating analytics guide takes this further with LowCardinality types, full-text indices, and optimized ordering keys, demonstrating a ~40x improvement on a 283 million row dataset. Learn more: Accelerating analytics with MergeTree covers schema optimization, full-text indexing, and a complete before/after performance comparison.

Write back to Iceberg
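A hedged sketch of the write-back step; the output path, schema, and source table are placeholders, and depending on your ClickHouse version Iceberg writes may sit behind an experimental setting:

```sql
-- Output table in Iceberg format, readable by Spark, Trino, DuckDB, etc.
CREATE TABLE daily_summary (day Date, hits UInt64)
ENGINE = IcebergS3('https://example-bucket.s3.amazonaws.com/warehouse/daily_summary');

-- Publish an aggregate (hits_local is a hypothetical source table)
INSERT INTO daily_summary
SELECT toDate(eventtime) AS day, count() AS hits
FROM hits_local
GROUP BY day;
```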
ClickHouse can also write data back to Iceberg tables, enabling reverse ETL workflows — publishing aggregated results or subsets for consumption by other tools (Spark, Trino, DuckDB, etc.). Start by creating an Iceberg table for the output.

Next steps
Now that you’ve seen the full workflow, dive deeper into each area:
- Querying directly — All four formats, cluster variants, table engines, caching
- Connecting to catalogs — Full Unity Catalog walkthrough with Delta and Iceberg
- Accelerating analytics — Schema optimization, indexing, ~40x speedup demo
- Writing to data lakes — Raw writes, aggregated writes, type mapping
- Support matrix — Feature comparison across formats and storage backends