Skip to main content

Documentation Index

Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

All quickstarts
BeginnerReal-Time AnalyticsData WarehousingObservabilityAI/MLCloudOss

Prerequisites

To successfully follow this guide, you’ll need the following:

What you’ll build

In this quickstart you’ll create a MergeTree table to store UK residential property sale records dating back to 1995. You’ll design a schema with appropriate column types, choose a meaningful ORDER BY and PARTITION BY, load data directly from S3, and then query system.parts to see how ClickHouse physically organizes data on disk. By the end, you’ll understand why the MergeTree engine is the foundation of almost every ClickHouse table, and how its sorting and partitioning decisions directly shape query performance.
1

Understand how MergeTree works

Before writing any SQL, it helps to know what makes MergeTree different from a traditional database table.When you insert data into a MergeTree table, ClickHouse does not write rows one by one. Instead it writes a data part - a small, sorted, compressed chunk of rows - directly to disk. ClickHouse then merges these parts together in the background over time. This is where the name comes from: merge + tree.Every data part is sorted by the table’s ORDER BY expression. This sort order becomes the primary key index, which allows ClickHouse to skip large blocks of data it doesn’t need to read during a query (we call this data pruning). The more selective your ORDER BY columns are for your most common queries, the less data ClickHouse reads.Three clauses control how MergeTree organizes your data:
ClauseWhat it does
ORDER BYPhysically sorts data within each part. Determines the primary key. Required.
PARTITION BYSplits data into separate partitions, typically by a date range. Parts from different partitions are never merged together, enabling fast partition pruning.
PRIMARY KEYDefaults to ORDER BY unless you explicitly set a shorter prefix. The sparse index is built from this.
You should now be able to explain the relationship between data parts, the primary key, and query performance in a MergeTree table.
2

Preview the source data

Before creating your table, inspect the source file using the s3 table function. This lets you query S3 directly without writing any data to ClickHouse first.Run the following in your SQL console:
DESCRIBE s3(
'https://learn-clickhouse.s3.us-east-2.amazonaws.com/uk_property_prices/uk_prices.csv.zst'
);
Notice that almost every column is inferred as Nullable(String). ClickHouse is reading a raw CSV, so it doesn’t know the true data types - that’s something you’ll fix when you design your table schema in the next step.Preview a few rows:
SELECT *
FROM s3(
'https://learn-clickhouse.s3.us-east-2.amazonaws.com/uk_property_prices/uk_prices.csv.zst'
)
LIMIT 5;
The dataset contains residential property sales in England and Wales registered with HM Land Registry, including a transaction id, sale price, date, property type, address fields, and geographic identifiers. You’ll also notice two trailing columns (column15, column16) that are empty - these can be ignored.Verify this by confirming you can see rows with columns including id, price, date, postcode, type, town, and county.
3

Design and create your MergeTree table

Now create a permanent table with an appropriate schema. The column types below are chosen deliberately:
  • LowCardinality(String) is used for columns with limited unique values (postcodes, town names, county names). It uses dictionary encoding internally and dramatically reduces storage and improves performance for grouping and filtering on these columns.
  • Enum8 encodes the type and duration columns as small integers on disk while keeping human-readable string labels in queries. The source CSV uses single-letter codes, so we’ll map them during the insert.
  • PARTITION BY toYYYYMM(date) creates one partition per calendar month, allowing ClickHouse to skip entire months when your WHERE clause filters on date.
  • ORDER BY (postcode, addr1, addr2) sorts data to support fast lookups by property address - the most natural access pattern for this dataset.
CREATE TABLE uk_price_paid
(
price      UInt32,
date       Date,
postcode   LowCardinality(String),
type       Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
is_new     UInt8,
duration   Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
addr1      String,
addr2      String,
street     LowCardinality(String),
locality   LowCardinality(String),
town       LowCardinality(String),
district   LowCardinality(String),
county     LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (postcode, addr1, addr2);
Verify the table was created by running:
SHOW CREATE TABLE uk_price_paid;
Double-click the result cell to inspect the full output. Notice that although you specified ENGINE = MergeTree, ClickHouse Cloud has created the table with SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}'). This is expected - Cloud automatically converts MergeTree to SharedMergeTree, which adds replication and shared storage support. The behaviour and query interface remain the same.
4

Load data from S3

Insert the full dataset by selecting directly from the s3() table function. ClickHouse streams the compressed file from S3 and writes it into your table in sorted parts.
INSERT INTO uk_price_paid
SELECT
    toUInt32(price),
    date,
    postcode,
    transform(type, ['T', 'S', 'D', 'F', 'O'],
        ['terraced', 'semi-detached', 'detached', 'flat', 'other'], 'other') AS type,
    if(is_new = 'Y', 1, 0) AS is_new,
    transform(duration, ['F', 'L', 'U'],
        ['freehold', 'leasehold', 'unknown'], 'unknown') AS duration,
    addr1,
    addr2,
    street,
    locality,
    town,
    district,
    county
FROM s3(
'https://learn-clickhouse.s3.us-east-2.amazonaws.com/uk_property_prices/uk_prices.csv.zst'
);
Since the source CSV stores everything as strings with single-letter codes (e.g. T for terraced, F for freehold, Y/N for new-build), we use transform to map them to readable labels and toUInt32/if to cast numeric columns. The id, column15, and column16 columns are excluded since we don’t need them.This will take a minute or two depending on your service size. Once complete, confirm the row count:
SELECT formatReadableQuantity(count())
FROM uk_price_paid;
You should see approximately 30 million rows loaded.
5

Inspect parts using system.parts

This is where MergeTree’s internals become visible. The system.parts table tracks every data part on disk for every MergeTree table in your service.
SELECT
partition,
name,
rows,
bytes_on_disk,
marks
FROM system.parts
WHERE table = 'uk_price_paid'
AND active = true
ORDER BY partition
LIMIT 20;
Each row represents one active data part. Notice:
  • partition - the YYYYMM value derived from your PARTITION BY expression. Each month’s data is isolated.
  • name - the part name encodes the partition, the block number range, and the merge level (e.g. 199501_1_4_2 would mean partition 199501, blocks 1–4, merged twice).
  • marks - the number of index granules. Each granule covers 8,192 rows by default, and the primary key index stores one entry per granule. This sparse index is what stays in memory and enables fast data skipping.
  • bytes_on_disk - ClickHouse compresses each part column by column using LZ4 by default. Compare this to the raw size to appreciate the compression ratio.
To see the total number of parts and the overall compressed size of your table, run:
SELECT
count()          AS parts,
sum(rows)        AS total_rows,
formatReadableSize(sum(bytes_on_disk)) AS compressed_size
FROM system.parts
WHERE table = 'uk_price_paid'
AND active = true;
If you run this query again some time in the future you may notice the part count has decreased. This is the merge in MergeTree at work - ClickHouse continuously merges smaller parts into larger ones in the background, reducing the number of parts. The active = true filter ensures you only see the current, merged parts rather than any older parts that are pending cleanup.
6

Query the data and observe primary key behaviour

Now run some real analytical queries. First, find the most expensive sales ever recorded:
SELECT
addr1,
addr2,
town,
county,
price,
date
FROM uk_price_paid
ORDER BY price DESC
LIMIT 5;
Check the query statistics in the SQL console - notice that all 30,033,199 rows were read. Because price is not part of the ORDER BY key, ClickHouse cannot use the primary index to skip data and must perform a full table scan.Next, find the average sale price by county:
SELECT
county,
round(avg(price)) AS avg_price,
count()           AS sales
FROM uk_price_paid
GROUP BY county
ORDER BY avg_price DESC;
Again, all 30,033,199 rows are read - county is not in the ORDER BY or PARTITION BY, so ClickHouse scans the entire table.Now run a query that combines aggregation with your ORDER BY. Since data is sorted by (postcode, addr1, addr2), filtering on a postcode prefix allows ClickHouse to skip most of the table. Here we find the average sale price per year for properties in the SW1A postcode area:
SELECT
toYear(date) AS year,
round(avg(price)) AS avg_price,
count() AS sales,
min(price) AS cheapest,
max(price) AS most_expensive
FROM uk_price_paid
WHERE postcode LIKE 'SW1A%'
GROUP BY year
ORDER BY year DESC;
Check the query stats in the SQL console after each query. The postcode filtered aggregation should read a fraction of the table’s rows, demonstrating the primary key index at work. Compare this to the earlier queries that scan more broadly - the difference shows why choosing the right ORDER BY matters.

Next steps

In this quickstart you built a MergeTree table from scratch, loaded 30 million UK property sale records from S3, explored how ClickHouse organizes data into sorted parts and partitions, and ran queries that demonstrate the power of the primary key index. The MergeTree engine is the foundation - from here you can explore the specialized engines built on top of it, or learn how Materialized Views extend the pattern further. Check out the following quickstarts next: Or go deeper with the reference documentation:
ClickHouse Academy — Master ClickHouse with expert-designed training for every skill level