All quickstartsDocumentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
BeginnerReal-Time AnalyticsData WarehousingObservabilityAI/MLCloudOss
Prerequisites
To successfully follow this guide, you’ll need the following:- A running ClickHouse Cloud service. If you don’t have one yet, complete the Create your first Cloud service quickstart first.
What you’ll build
In this quickstart you’ll create a MergeTree table to store UK residential property sale records dating back to 1995. You’ll design a schema with appropriate column types, choose a meaningfulORDER BY and PARTITION BY, load data directly from S3, and then query system.parts to see how ClickHouse physically organizes data on disk.
By the end, you’ll understand why the MergeTree engine is the foundation of almost every ClickHouse table, and how its sorting and partitioning decisions directly shape query performance.
Understand how MergeTree works
Before writing any SQL, it helps to know what makes MergeTree different from a traditional database table.When you insert data into a MergeTree table, ClickHouse does not write rows one by one. Instead it writes a data part - a small, sorted, compressed chunk of rows - directly to disk. ClickHouse then merges these parts together in the background over time. This is where the name comes from: merge + tree.Every data part is sorted by the table’sORDER BY expression. This sort order becomes the primary key index, which allows ClickHouse to skip large blocks of data it doesn’t need to read during a query (we call this data pruning). The more selective your ORDER BY columns are for your most common queries, the less data ClickHouse reads.Three clauses control how MergeTree organizes your data:| Clause | What it does |
|---|---|
ORDER BY | Physically sorts data within each part. Determines the primary key. Required. |
PARTITION BY | Splits data into separate partitions, typically by a date range. Parts from different partitions are never merged together, enabling fast partition pruning. |
PRIMARY KEY | Defaults to ORDER BY unless you explicitly set a shorter prefix. The sparse index is built from this. |
Preview the source data
Before creating your table, inspect the source file using thes3 table function. This lets you query S3 directly without writing any data to ClickHouse first.Run the following in your SQL console:Nullable(String). ClickHouse is reading a raw CSV, so it doesn’t know the true data types - that’s something you’ll fix when you design your table schema in the next step.Preview a few rows:id, sale price, date, property type, address fields, and geographic identifiers. You’ll also notice two trailing columns (column15, column16) that are empty - these can be ignored.Verify this by confirming you can see rows with columns including id, price, date, postcode, type, town, and county.Design and create your MergeTree table
Now create a permanent table with an appropriate schema. The column types below are chosen deliberately:LowCardinality(String)is used for columns with limited unique values (postcodes, town names, county names). It uses dictionary encoding internally and dramatically reduces storage and improves performance for grouping and filtering on these columns.Enum8encodes thetypeanddurationcolumns as small integers on disk while keeping human-readable string labels in queries. The source CSV uses single-letter codes, so we’ll map them during the insert.PARTITION BY toYYYYMM(date)creates one partition per calendar month, allowing ClickHouse to skip entire months when yourWHEREclause filters ondate.ORDER BY (postcode, addr1, addr2)sorts data to support fast lookups by property address - the most natural access pattern for this dataset.
ENGINE = MergeTree, ClickHouse Cloud has created the table with SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}'). This is expected - Cloud automatically converts MergeTree to SharedMergeTree, which adds replication and shared storage support. The behaviour and query interface remain the same.Load data from S3
Insert the full dataset by selecting directly from thes3() table function. ClickHouse streams the compressed file from S3 and writes it into your table in sorted parts.T for terraced, F for freehold, Y/N for new-build), we use transform to map them to readable labels and toUInt32/if to cast numeric columns. The id, column15, and column16 columns are excluded since we don’t need them.This will take a minute or two depending on your service size. Once complete, confirm the row count:Inspect parts using system.parts
This is where MergeTree’s internals become visible. Thesystem.parts table tracks every data part on disk for every MergeTree table in your service.partition- theYYYYMMvalue derived from yourPARTITION BYexpression. Each month’s data is isolated.name- the part name encodes the partition, the block number range, and the merge level (e.g.199501_1_4_2would mean partition199501, blocks 1–4, merged twice).marks- the number of index granules. Each granule covers 8,192 rows by default, and the primary key index stores one entry per granule. This sparse index is what stays in memory and enables fast data skipping.bytes_on_disk- ClickHouse compresses each part column by column using LZ4 by default. Compare this to the raw size to appreciate the compression ratio.
active = true filter ensures you only see the current, merged parts rather than any older parts that are pending cleanup.Query the data and observe primary key behaviour
Now run some real analytical queries. First, find the most expensive sales ever recorded:price is not part of the ORDER BY key, ClickHouse cannot use the primary index to skip data and must perform a full table scan.Next, find the average sale price by county:county is not in the ORDER BY or PARTITION BY, so ClickHouse scans the entire table.Now run a query that combines aggregation with your ORDER BY. Since data is sorted by (postcode, addr1, addr2), filtering on a postcode prefix allows ClickHouse to skip most of the table. Here we find the average sale price per year for properties in the SW1A postcode area:postcode filtered aggregation should read a fraction of the table’s rows, demonstrating the primary key index at work. Compare this to the earlier queries that scan more broadly - the difference shows why choosing the right ORDER BY matters.