The MergeTree engine and other engines of the MergeTree family (e.g. ReplacingMergeTree, AggregatingMergeTree) are the most commonly used and most robust table engines in ClickHouse.
MergeTree-family table engines are designed for high data ingest rates and huge data volumes.
Insert operations create table parts which are merged by a background process with other table parts.
Main features of MergeTree-family table engines:
- The table’s primary key determines the sort order within each table part (clustered index). The primary key also does not reference individual rows but blocks of 8192 rows called granules. This makes primary keys of huge data sets small enough to remain loaded in main memory, while still providing fast access to on-disk data.
- Tables can be partitioned using an arbitrary partition expression. Partition pruning ensures partitions are omitted from reading when the query allows it.
- Data can be replicated across multiple cluster nodes for high availability, failover, and zero downtime upgrades. See Data replication.
- MergeTree table engines support various statistics kinds and sampling methods to help query optimization.
Despite a similar name, the Merge engine is different from *MergeTree engines.
Creating tables
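For reference, the general form of the CREATE TABLE statement for this family (clauses in square brackets are optional) is sketched below, following the documented syntax:

```sql
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) [GRANULARITY value1],
    ...
    PROJECTION projection_name_1 (SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY])
)
ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name = value, ...]
```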
Query clauses
ENGINE
ENGINE — Name and parameters of the engine. ENGINE = MergeTree(). The MergeTree engine has no parameters.
ORDER BY
ORDER BY — The sorting key.
A tuple of column names or arbitrary expressions. Example: ORDER BY (CounterID + 1, EventDate).
If no primary key is defined (i.e. PRIMARY KEY was not specified), ClickHouse uses the sorting key as the primary key.
If no sorting is required, you can use syntax ORDER BY tuple().
Alternatively, if setting create_table_empty_primary_key_by_default is enabled, ORDER BY () is implicitly added to CREATE TABLE statements. See Selecting a Primary Key.
PARTITION BY
PARTITION BY — The partitioning key. Optional. In most cases, you don’t need a partition key, and if you do need to partition, generally you do not need a partition key more granular than by month. Partitioning does not speed up queries (in contrast to the ORDER BY expression). You should never use too granular partitioning. Don’t partition your data by client identifiers or names (instead, make client identifier or name the first column in the ORDER BY expression).
For partitioning by month, use the toYYYYMM(date_column) expression, where date_column is a column with a date of the type Date. The partition names here have the "YYYYMM" format.
PRIMARY KEY
PRIMARY KEY — The primary key if it differs from the sorting key. Optional.
Specifying a sorting key (using ORDER BY clause) implicitly specifies a primary key.
It is usually not necessary to specify the primary key in addition to the sorting key.
SAMPLE BY
SAMPLE BY — A sampling expression. Optional.
If specified, it must be contained in the primary key.
The sampling expression must result in an unsigned integer.
Example: SAMPLE BY intHash32(UserID) ORDER BY (CounterID, EventDate, intHash32(UserID)).
TTL
TTL — A list of rules that specify the storage duration of rows and the logic of automatic parts movement between disks and volumes. Optional.
Expression must result in a Date or DateTime, e.g. TTL date + INTERVAL 1 DAY.
Type of the rule DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'|GROUP BY specifies an action to be done with the part if the expression is satisfied (reaches current time): removal of expired rows, moving a part (if expression is satisfied for all rows in a part) to specified disk (TO DISK 'xxx') or to volume (TO VOLUME 'xxx'), or aggregating values in expired rows. Default type of the rule is removal (DELETE). List of multiple rules can be specified, but there should be no more than one DELETE rule.
For more details, see TTL for columns and tables.
SETTINGS
See MergeTree Settings.
Example of Sections Setting
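The classic example from the documentation (CounterID, EventDate, and UserID come from the web-analytics sample schema) looks like this:

```sql
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192
```

Here partitioning is set by month and the sampling expression is a hash of the user ID, which pseudorandomizes the data in the table for each CounterID and EventDate. If you define a SAMPLE clause when selecting the data, ClickHouse will return an evenly pseudorandom data sample for a subset of users.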
The index_granularity setting can be omitted because 8192 is the default value.
Data storage
A table consists of data parts sorted by primary key. When data is inserted in a table, separate data parts are created and each of them is lexicographically sorted by primary key. For example, if the primary key is (CounterID, Date), the data in the part is sorted by CounterID, and within each CounterID, it is ordered by Date.
Data belonging to different partitions are separated into different parts. In the background, ClickHouse merges data parts for more efficient storage. Parts belonging to different partitions are not merged. The merge mechanism does not guarantee that all rows with the same primary key will be in the same data part.
Data parts can be stored in Wide or Compact format. In Wide format each column is stored in a separate file in a filesystem, in Compact format all columns are stored in one file. Compact format can be used to increase performance of small and frequent inserts.
The data storage format is controlled by the min_bytes_for_wide_part and min_rows_for_wide_part settings of the table engine. If the number of bytes or rows in a data part is less than the corresponding setting’s value, the part is stored in Compact format. Otherwise it is stored in Wide format. If none of these settings is set, data parts are stored in Wide format.
Each data part is logically divided into granules. A granule is the smallest indivisible data set that ClickHouse reads when selecting data. ClickHouse does not split rows or values, so each granule always contains an integer number of rows. The first row of a granule is marked with the value of the primary key for the row. For each data part, ClickHouse creates an index file that stores the marks. For each column, whether it’s in the primary key or not, ClickHouse also stores the same marks. These marks let you find data directly in column files.
The granule size is restricted by the index_granularity and index_granularity_bytes settings of the table engine. The number of rows in a granule lies in the [1, index_granularity] range, depending on the size of the rows. The size of a granule can exceed index_granularity_bytes if the size of a single row is greater than the value of the setting. In this case, the size of the granule equals the size of the row.
Primary Keys and Indexes in Queries
Take the (CounterID, Date) primary key as an example. In this case, the sorting and index can be illustrated as follows:
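The sparse index for this key can be sketched as in the documentation's illustration:

```text
Whole data:     [---------------------------------------------]
CounterID:      [aaaaaaaaaaaaaaaaaabbbbcdeeeeeeeeeeeeefgggggggghhhhhhhhhiiiiiiiiikllllllll]
Date:           [1111111222222233331233211111222222333211111112122222223111112223311122333]
Marks:           |      |      |      |      |      |      |      |      |      |      |
                  a,1    a,2    a,3    b,3    e,2    e,3    g,1    h,2    i,1    i,3    l,3
Marks numbers:   0      1      2      3      4      5      6      7      8      9      10
```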
If the data query specifies:
- CounterID IN ('a', 'h') — the server reads the data in the ranges of marks [0, 3) and [6, 8).
- CounterID IN ('a', 'h') AND Date = 3 — the server reads the data in the ranges of marks [1, 3) and [7, 8).
- Date = 3 — the server reads the data in the range of marks [1, 10].
The examples above show that it is always more effective to use an index than a full scan. A sparse index allows extra data to be read: when reading a single range of the primary key, up to index_granularity * 2 extra rows in each data block can be read.
Sparse indexes allow you to work with a very large number of table rows, because in most cases, such indexes fit in the computer’s RAM.
ClickHouse does not require a unique primary key. You can insert multiple rows with the same primary key.
You can use Nullable-typed expressions in the PRIMARY KEY and ORDER BY clauses but it is strongly discouraged. To allow this feature, turn on the allow_nullable_key setting. The NULLS_LAST principle applies for NULL values in the ORDER BY clause.
Selecting a primary key
The number of columns in the primary key is not explicitly limited. Depending on the data structure, you can include more or fewer columns in the primary key. This may:
- Improve the performance of an index. If the primary key is (a, b), then adding another column c will improve the performance if the following conditions are met:
  - There are queries with a condition on column c.
  - Long data ranges (several times longer than the index_granularity) with identical values for (a, b) are common. In other words, adding another column allows you to skip quite long data ranges.
- Improve data compression. ClickHouse sorts data by primary key, so the higher the consistency, the better the compression.
- Provide additional logic when merging data parts in the CollapsingMergeTree and SummingMergeTree engines. In this case it makes sense to specify a sorting key that is different from the primary key.
A long primary key will negatively affect the insert performance and memory consumption, but extra columns in the primary key do not affect ClickHouse performance during SELECT queries.
You can create a table without a primary key using the ORDER BY tuple() syntax. In this case, ClickHouse stores data in the order of insertion. If you want to preserve data order when inserting data with INSERT ... SELECT queries, set max_insert_threads = 1.
To select data in the initial order, use single-threaded SELECT queries.
Choosing a primary key that differs from the sorting key
It is possible to specify a primary key (an expression with values that are written in the index file for each mark) that is different from the sorting key (an expression for sorting the rows in data parts). In this case the primary key expression tuple must be a prefix of the sorting key expression tuple. This feature is helpful when using the SummingMergeTree and AggregatingMergeTree table engines. In a common case when using these engines, the table has two types of columns: dimensions and measures. Typical queries aggregate values of measure columns with arbitrary GROUP BY and filtering by dimensions. Because SummingMergeTree and AggregatingMergeTree aggregate rows with the same value of the sorting key, it is natural to add all dimensions to it. As a result, the key expression consists of a long list of columns and this list must be frequently updated with newly added dimensions.
In this case it makes sense to leave only a few columns in the primary key that will provide efficient range scans and add the remaining dimension columns to the sorting key tuple.
ALTER of the sorting key is a lightweight operation because when a new column is simultaneously added to the table and to the sorting key, existing data parts do not need to be changed. Since the old sorting key is a prefix of the new sorting key and there is no data in the newly added column, the data is sorted by both the old and new sorting keys at the moment of table modification.
Use of indexes and partitions in queries
For SELECT queries, ClickHouse analyzes whether an index can be used. An index can be used if the WHERE/PREWHERE clause has an expression (as one of the conjunction elements, or entirely) that represents an equality or inequality comparison operation, or if it has IN or LIKE with a fixed prefix on columns or expressions that are in the primary key or partitioning key, or on certain partially repetitive functions of these columns, or logical relationships of these expressions.
Let’s look at an engine configured as follows:
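A sketch following the documentation's web-analytics example:

```sql
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate)
SETTINGS index_granularity = 8192
```

In this case, in queries such as:

```sql
SELECT count() FROM table WHERE EventDate = toDate(now()) AND CounterID = 34
SELECT count() FROM table WHERE EventDate = toDate(now()) AND (CounterID = 34 OR CounterID = 42)
```

ClickHouse will use the primary key index to trim improper data and the monthly partitioning key to trim partitions that are in improper date ranges. Thus, it is possible to quickly run queries on one or many ranges of the primary key. Here, queries will be fast when run for a specific tracking tag, for a specific tag and date range, for a specific tag and date, for multiple tags with a date range, and so on.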
Use of index for deterministic expressions in primary keys
The primary key can contain expressions, not only column names. These expressions are not limited to simple function chains: they can be arbitrary expression trees (for example, nested functions and composite expressions), as long as they are deterministic. An expression is deterministic if it always returns the same result for the same input values (for example: length(), toDate(), lower(), left(), cityHash64(), toUUID(); unlike now() or rand()). If the primary key contains deterministic expressions, ClickHouse can apply them to constant values from the query and use the result to build conditions on the primary key index. This enables data skipping for predicates like =, IN, and has.
A common use case is to keep the primary key compact (e.g. store a hash instead of a long String), while still allowing predicates on the original column to use the index.
Example of a deterministic (but non-injective) primary key:
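A minimal sketch (the user_events table and its columns are hypothetical):

```sql
-- The index stores length(user_id): deterministic, but not injective,
-- since different strings can share the same length
CREATE TABLE user_events
(
    user_id String,
    event_time DateTime
)
ENGINE = MergeTree
ORDER BY length(user_id);

-- ClickHouse can evaluate length('alice') once and use the result
-- to narrow the scanned ranges of the primary key index
SELECT count() FROM user_events WHERE user_id = 'alice';
```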
ClickHouse evaluates length('alice') (and other constants) once and uses the length values to narrow the ranges in the primary key index. Since the length of a string is not injective, different user_id strings can share the same length, so the index may read extra granules (false positives). The result remains correct because the original predicate (user_id = ..., IN, etc.) is still applied after reading.
If the deterministic expression is also injective (different inputs cannot produce the same output for the argument types used), ClickHouse can additionally use the index effectively for the negated forms: !=, NOT IN, and NOT has(...). For example, reverse(p) and hex(p) are injective for String.
Example of an injective primary key:
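A sketch along the same lines (the pages table is hypothetical); since hex is injective for String, negated predicates can also use the index:

```sql
CREATE TABLE pages
(
    p String,
    visits UInt64
)
ENGINE = MergeTree
ORDER BY hex(p);

-- Injectivity lets the index serve !=, NOT IN, and NOT has(...) as well
SELECT sum(visits) FROM pages WHERE p != '/index.html';
```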
Use of index for partially-monotonic primary keys
Consider, for example, the days of the month. They form a monotonic sequence for one month, but not monotonic for more extended periods. This is a partially-monotonic sequence. If a user creates the table with a partially-monotonic primary key, ClickHouse creates a sparse index as usual. When a user selects data from this kind of table, ClickHouse analyzes the query conditions. If the user wants to get data between two marks of the index and both these marks fall within one month, ClickHouse can use the index in this particular case because it can calculate the distance between the parameters of a query and index marks. ClickHouse cannot use an index if the values of the primary key in the query parameter range do not represent a monotonic sequence. In this case, ClickHouse uses the full scan method. ClickHouse uses this logic not only for days of the month sequences, but for any primary key that represents a partially-monotonic sequence.
Data skipping indexes
For tables from the *MergeTree family, data skipping indices can be specified. The index declaration is in the columns section of the CREATE query.
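Its documented general form is:

```sql
INDEX index_name expr TYPE type(...) [GRANULARITY granularity_value]
```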
These indices aggregate some information about the specified expression on blocks, which consist of granularity_value granules (the size of the granule is specified using the index_granularity setting in the table engine). Then these aggregates are used in SELECT queries for reducing the amount of data to read from the disk by skipping big blocks of data where the WHERE clause cannot be satisfied.
The GRANULARITY clause can be omitted; the default value of granularity_value is 1.
Example
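A sketch of a table with several skip indexes, modeled on the documentation example (column and index names are illustrative):

```sql
CREATE TABLE table_name
(
    u64 UInt64,
    i32 Int32,
    s String,
    INDEX idx1 u64 TYPE bloom_filter GRANULARITY 3,
    INDEX idx2 u64 * i32 TYPE minmax GRANULARITY 3,
    INDEX idx3 u64 * length(s) TYPE set(1000) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY u64;
```

Indices from the example can then reduce the amount of data read from disk in queries like SELECT count() FROM table_name WHERE u64 == 10.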
Skip Index Types
The MergeTree table engine supports the following types of skip indexes.
For more information on how skip indexes can be used for performance optimization
see “Understanding ClickHouse data skipping indexes”.
- MinMax index
- Set index
- bloom_filter index
- ngrambf_v1 index (Deprecated)
- tokenbf_v1 index (Deprecated)
- text index
- vector_similarity index
MinMax skip index
For each index granule, the minimum and maximum values of an expression are stored. (If the expression is of type tuple, it stores the minimum and maximum for each tuple element.)
Syntax
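The documented declaration form:

```sql
INDEX index_name expr TYPE minmax [GRANULARITY granularity_value]
```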
Set
For each index granule, at most max_rows unique values of the specified expression are stored.
max_rows = 0 means “store all unique values”.
Syntax
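The documented declaration form:

```sql
INDEX index_name expr TYPE set(max_rows) [GRANULARITY granularity_value]
```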
Bloom filter
For each index granule, a bloom filter for the specified columns is stored.
Syntax
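The documented declaration form:

```sql
INDEX index_name expr TYPE bloom_filter([false_positive_rate]) [GRANULARITY granularity_value]
```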
The false_positive_rate parameter can take a value between 0 and 1 (default: 0.025) and specifies the probability of generating a false positive (which increases the amount of data to be read).
The following data types are supported:
(U)Int*, Float*, Enum, Date, DateTime, String, FixedString, Array, LowCardinality, Nullable, UUID, Map
JSON data type: indexing JSON paths
For the JSON data type, a bloom filter index can be created on the set of paths using the JSONAllPaths function. This allows skipping granules where a queried JSON path is absent. See Data skipping indexes for JSON for details.
N-gram bloom filter (Deprecated)
With general availability (GA) of the text index starting from ClickHouse version 26.2, the ngrambf_v1 index is no longer recommended for full-text search. See page “Full-text search with text indexes” for details.
Syntax
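The documented declaration form, with the parameters described below:

```sql
INDEX index_name expr TYPE ngrambf_v1(n, size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed) [GRANULARITY granularity_value]
```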
| Parameter | Description |
|---|---|
| n | ngram size |
| size_of_bloom_filter_in_bytes | Bloom filter size in bytes (you can use a large value here, for example 256 or 512, because it can be compressed well). |
| number_of_hash_functions | The number of hash functions used in the bloom filter. |
| random_seed | The seed for the bloom filter hash functions. |
To estimate the parameters of ngrambf_v1, you can use the following user defined functions (UDFs).
UDFs for ngrambf_v1
The estimation takes at least two parameters: total_number_of_all_grams and probability_of_false_positives. For example, suppose there are 4300 ngrams in the granule and you expect false positives to be less than 0.0001.
The other parameters can then be estimated by executing the following queries:
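A sketch assuming the estimation UDFs from the ClickHouse documentation (bfEstimateBmSize and bfEstimateFunctions) have been created:

```sql
-- Estimate the bloom filter size in bytes
SELECT bfEstimateBmSize(4300, 0.0001) / 8 AS size_of_bloom_filter_in_bytes;

-- Estimate the number of hash functions
SELECT bfEstimateFunctions(4300, bfEstimateBmSize(4300, 0.0001)) AS number_of_hash_functions;
```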
Token bloom filter
With general availability (GA) of the text index starting from ClickHouse version 26.2, the tokenbf_v1 index is no longer recommended for full-text search. See page “Full-text search with text indexes” for details.
Syntax
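The documented declaration form:

```sql
INDEX index_name expr TYPE tokenbf_v1(size_of_bloom_filter_in_bytes, number_of_hash_functions, random_seed) [GRANULARITY granularity_value]
```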
Sparse grams bloom filter
The sparse grams bloom filter is similar to ngrambf_v1 but uses sparse grams tokens instead of ngrams.
Syntax
Text index
Builds an inverted index over tokenized string data, enabling efficient and deterministic full-text search. See here for details.
Vector similarity
Supports approximate nearest neighbor search, see here for details.
Functions support
Conditions in the WHERE clause contain calls of functions that operate with columns. If the column is a part of an index, ClickHouse tries to use this index when performing the functions. ClickHouse supports different subsets of functions for using indexes.
Indexes of type set can be utilized by all functions. The other index types are supported as follows:
| Function (operator) / Index | primary key | minmax | ngrambf_v1 | tokenbf_v1 | bloom_filter | sparse_grams | text |
|---|---|---|---|---|---|---|---|
| equals (=, ==) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| notEquals(!=, <>) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✗ |
| like | ✔ | ✔ | ✔ | ✔ | ✗ | ✔ | ✔ |
| notLike | ✔ | ✔ | ✔ | ✔ | ✗ | ✔ | ✗ |
| match | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ | ✔ |
| startsWith | ✔ | ✔ | ✔ | ✔ | ✗ | ✔ | ✔ |
| endsWith | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ | ✔ |
| multiSearchAny | ✗ | ✗ | ✔ | ✗ | ✗ | ✗ | ✗ |
| in | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| notIn | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✗ |
| less (<) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ | ✗ |
| greater (>) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ | ✗ |
| lessOrEquals (<=) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ | ✗ |
| greaterOrEquals (>=) | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ | ✗ |
| empty | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ | ✗ |
| notEmpty | ✗ | ✔ | ✗ | ✗ | ✗ | ✔ | ✗ |
| has | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| hasAny | ✗ | ✗ | ✔ | ✔ | ✔ | ✔ | ✗ |
| hasAll | ✗ | ✗ | ✔ | ✔ | ✔ | ✔ | ✗ |
| hasToken | ✗ | ✗ | ✗ | ✔ | ✗ | ✗ | ✔ |
| hasTokenOrNull | ✗ | ✗ | ✗ | ✔ | ✗ | ✗ | ✔ |
| hasTokenCaseInsensitive (*) | ✗ | ✗ | ✗ | ✔ | ✗ | ✗ | ✗ |
| hasTokenCaseInsensitiveOrNull (*) | ✗ | ✗ | ✗ | ✔ | ✗ | ✗ | ✗ |
| hasAnyTokens | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✔ |
| hasAllTokens | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✔ |
| pointInPolygon | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ | ✗ |
| mapContains (mapContainsKey) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✔ |
| mapContainsKeyLike | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✔ |
| mapContainsValue | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✔ |
| mapContainsValueLike | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✔ |
Functions with a constant argument that is less than the ngram size can’t be used by ngrambf_v1 for query optimization.
(*) For hasTokenCaseInsensitive and hasTokenCaseInsensitiveOrNull to be effective, the tokenbf_v1 index must be created on lowercased data, for example INDEX idx (lower(str_col)) TYPE tokenbf_v1(512, 3, 0).
Bloom filters can have false positive matches, so the ngrambf_v1, tokenbf_v1, sparse_grams, and bloom_filter indexes cannot be used for optimizing queries where the result of a function is expected to be false.
For example:
- Can be optimized:
  - s LIKE '%test%'
  - NOT s NOT LIKE '%test%'
  - s = 1
  - NOT s != 1
  - startsWith(s, 'test')
- Cannot be optimized:
  - NOT s LIKE '%test%'
  - s NOT LIKE '%test%'
  - NOT s = 1
  - s != 1
  - NOT startsWith(s, 'test')
Projections
Projections are like materialized views but defined at the part level. They provide consistency guarantees along with automatic usage in queries. When implementing projections, you should also consider the force_optimize_projection setting.
Projections are not supported in SELECT statements with the FINAL modifier.
Projection query
A projection query is what defines a projection. It implicitly selects data from the parent table.
Syntax: SELECT <column list expr> [GROUP BY] <group keys expr> [ORDER BY] <expr>
Projection indexes
Projection indexes extend the projection subsystem by providing a lightweight and explicit way to define projection-level indexes. Externally, a projection index is still a projection, but with simplified syntax and clearer intent: it defines an expression which is dedicated to filtering, rather than serving materialized data. Internally, a projection index does not materialize the original table in permuted row order like a regular projection. Instead, the permutation is stored in the form of a numeric permutation column _part_offset, i.e. SELECT _part_offset ORDER BY <index_expr>.
Syntax
Index types
Currently supported:
- basic: equivalent to a normal MergeTree index on the expression.
Projection storage
Projections are stored inside the part directory. It’s similar to an index but contains a subdirectory that stores an anonymous MergeTree table’s part. The table is induced by the definition query of the projection. If there is a GROUP BY clause, the underlying storage engine becomes AggregatingMergeTree, and all aggregate functions are converted to AggregateFunction. If there is an ORDER BY clause, the MergeTree table uses it as its primary key expression. During the merge process the projection part is merged via its storage’s merge routine. The checksum of the parent table’s part is combined with the projection’s part. Other maintenance jobs are similar to skip indices.
Query analysis
- Check if the projection can be used to answer the given query, that is, it generates the same answer as querying the base table.
- Select the best feasible match, which contains the least granules to read.
- The query pipeline which uses projections will be different from the one that uses the original parts. If the projection is absent in some parts, we can add the pipeline to “project” it on the fly.
Concurrent data access
For concurrent table access, we use multi-versioning. In other words, when a table is simultaneously read and updated, data is read from a set of parts that is current at the time of the query. There are no lengthy locks. Inserts do not get in the way of read operations. Reading from a table is automatically parallelized.
TTL for columns and tables
Determines the lifetime of values. The TTL clause can be set for the whole table and for each individual column. Table-level TTL can also specify the logic of automatically moving data between disks and volumes, or recompressing parts where all the data has expired.
Expressions must evaluate to Date, Date32, DateTime or DateTime64 data type.
Syntax
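The documented form of the table-level clause:

```sql
TTL expr
    [DELETE|RECOMPRESS codec_name1|TO DISK 'xxx'|TO VOLUME 'xxx'][, DELETE|RECOMPRESS codec_name2|TO DISK 'aaa'|TO VOLUME 'bbb'] ...
    [WHERE conditions]
    [GROUP BY key_expr [SET v1 = aggr_func(v1) [, v2 = aggr_func(v2) ...]]]
```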
Setting time-to-live for a column:
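The two documented forms for a column:

```sql
TTL time_column
TTL time_column + interval
```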
To define an interval, use time interval operators, for example:
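```sql
-- documented examples of interval operators in a TTL expression
TTL date_time + INTERVAL 1 MONTH
TTL date_time + INTERVAL 15 HOUR
```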
Column TTL
When the values in the column expire, ClickHouse replaces them with the default values for the column data type. If all the column values in the data part expire, ClickHouse deletes this column from the data part in the filesystem. The TTL clause can’t be used for key columns.
Examples
Creating a table with TTL:
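A sketch modeled on the documentation example (the table name tab is illustrative):

```sql
CREATE TABLE tab
(
    d DateTime,
    a Int TTL d + INTERVAL 1 MONTH,
    b Int TTL d + INTERVAL 1 MONTH,
    c String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d;
```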
Adding TTL to a column of an existing table
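Continuing the tab example:

```sql
ALTER TABLE tab
    MODIFY COLUMN
    c String TTL d + INTERVAL 1 DAY;
```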
Altering TTL of the column
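Again for the tab example:

```sql
ALTER TABLE tab
    MODIFY COLUMN
    c String TTL d + INTERVAL 1 MONTH;
```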
Table TTL
A table can have an expression for removal of expired rows, and multiple expressions for automatic movement of parts between disks or volumes. When rows in the table expire, ClickHouse deletes all corresponding rows. For parts moving or recompressing, all rows of a part must satisfy the TTL expression criteria.
- DELETE — delete expired rows (default action);
- RECOMPRESS codec_name — recompress the data part with codec_name;
- TO DISK 'aaa' — move the part to the disk aaa;
- TO VOLUME 'bbb' — move the part to the volume bbb;
- GROUP BY — aggregate expired rows.
The DELETE action can be used together with a WHERE clause to delete only some of the expired rows based on a filtering condition:
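A sketch modeled on the documentation example:

```sql
CREATE TABLE table_with_where
(
    d DateTime,
    a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH DELETE WHERE toDayOfWeek(d) = 1;
```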
The GROUP BY expression must be a prefix of the table primary key.
If a column is not part of the GROUP BY expression and is not set explicitly in the SET clause, the result row contains an occasional value from the grouped rows (as if the aggregate function any were applied to it).
Examples
Creating a table with TTL:
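A sketch modeled on the documentation example, combining a delete rule with two move rules:

```sql
CREATE TABLE tab
(
    d DateTime,
    a Int
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(d)
ORDER BY d
TTL d + INTERVAL 1 MONTH DELETE,
    d + INTERVAL 1 WEEK TO VOLUME 'aaa',
    d + INTERVAL 2 WEEK TO DISK 'bbb';
```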
Altering TTL of the table:
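For the tab example:

```sql
ALTER TABLE tab
    MODIFY TTL d + INTERVAL 1 DAY;
```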
Creating a table, where expired rows are recompressed:
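A sketch modeled on the documentation example:

```sql
CREATE TABLE table_for_recompression
(
    d DateTime,
    key UInt64,
    value String
)
ENGINE = MergeTree()
ORDER BY tuple()
PARTITION BY key
TTL d + INTERVAL 1 MONTH RECOMPRESS CODEC(ZSTD(17)),
    d + INTERVAL 1 YEAR RECOMPRESS CODEC(LZ4HC(10))
SETTINGS min_rows_for_wide_part = 0, min_bytes_for_wide_part = 0;
```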
Creating a table where expired rows are aggregated. In the result rows, x contains the maximum value across the grouped rows, y the minimum value, and d any occasional value from the grouped rows.
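A sketch modeled on the documentation example:

```sql
CREATE TABLE table_for_aggregation
(
    d DateTime,
    k1 Int,
    k2 Int,
    x Int,
    y Int
)
ENGINE = MergeTree
ORDER BY (k1, k2)
TTL d + INTERVAL 1 MONTH GROUP BY k1, k2 SET x = max(x), y = min(y);
```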
Removing expired data
Data with an expired TTL is removed when ClickHouse merges data parts.
When ClickHouse detects that data is expired, it performs an off-schedule merge. To control the frequency of such merges, you can set merge_with_ttl_timeout. If the value is too low, it will perform many off-schedule merges that may consume a lot of resources.
If you perform the SELECT query between merges, you may get expired data. To avoid it, use the OPTIMIZE query before SELECT.
See Also
- ttl_only_drop_parts setting
Disk types
In addition to local block devices, ClickHouse supports these storage types:
- s3 for S3 and MinIO
- gcs for GCS
- blob_storage_disk for Azure Blob Storage
- hdfs for HDFS
- web for read-only from web
- cache for local caching
- s3_plain for backups to S3
- s3_plain_rewritable for immutable, non-replicated tables in S3
Using multiple block devices for data storage
Introduction
MergeTree family table engines can store data on multiple block devices. For example, this can be useful when the data of a certain table is implicitly split into “hot” and “cold”. The most recent data is regularly requested but requires only a small amount of space. On the contrary, the fat-tailed historical data is requested rarely. If several disks are available, the “hot” data may be located on fast disks (for example, NVMe SSDs or in memory), while the “cold” data is located on relatively slow ones (for example, HDD).
This applies to all disk types, including S3 and other object storage disks. For example, you can spread data across multiple S3 buckets within a single volume, or create tiered policies that move data from local disks to S3. See Using S3 disks with multiple volumes for details.
A data part is the minimum movable unit for MergeTree-engine tables. The data belonging to one part is stored on one disk. Data parts can be moved between disks in the background (according to user settings) as well as by means of ALTER queries.
Terms
- Disk — Block device mounted to the filesystem.
- Default disk — Disk that stores the path specified in the path server setting.
- Volume — Ordered set of equal disks (similar to JBOD).
- Storage policy — Set of volumes and the rules for moving data between them.
To apply one of the configured storage policies to a table, use the storage_policy setting of MergeTree-engine family tables.
Configuration
Disks, volumes and storage policies should be declared inside the <storage_configuration> tag, either in the main server configuration file or in a distinct file in the config.d directory.
Configuration structure:
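A sketch of that structure (disk names, paths, and policy names are placeholders):

```xml
<storage_configuration>
    <disks>
        <disk_name_1>
            <path>/mnt/fast_ssd/clickhouse/</path>
        </disk_name_1>
        <disk_name_2>
            <path>/mnt/hdd1/clickhouse/</path>
            <keep_free_space_bytes>10485760</keep_free_space_bytes>
        </disk_name_2>
    </disks>
    <policies>
        <policy_name_1>
            <volumes>
                <volume_name_1>
                    <disk>disk_name_1</disk>
                </volume_name_1>
            </volumes>
        </policy_name_1>
    </policies>
</storage_configuration>
```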
- <disk_name_N> — Disk name. Names must be different for all disks.
- path — the path under which a server will store data (data and shadow folders); should be terminated with ’/’.
- keep_free_space_bytes — the amount of free disk space to be reserved.
- policy_name_N — Policy name. Policy names must be unique.
- volume_name_N — Volume name. Volume names must be unique.
- disk — a disk within a volume.
- max_data_part_size_bytes — the maximum size of a part that can be stored on any of the volume’s disks. If the size of a merged part is estimated to be bigger than max_data_part_size_bytes, then this part will be written to the next volume. Basically, this feature allows keeping new/small parts on a hot (SSD) volume and moving them to a cold (HDD) volume when they reach a large size. Do not use this setting if your policy has only one volume.
- move_factor — when the amount of available space gets lower than this factor, data automatically starts to move to the next volume, if any (by default, 0.1). ClickHouse sorts existing parts by size from largest to smallest (in descending order) and selects parts with a total size sufficient to meet the move_factor condition. If the total size of all parts is insufficient, all parts will be moved.
- perform_ttl_move_on_insert — disables TTL move on data part INSERT. By default (if enabled), if we insert a data part that has already expired by the TTL move rule, it immediately goes to the volume/disk declared in the move rule. This can significantly slow down inserts if the destination volume/disk is slow (e.g. S3). If disabled, the already expired data part is written to the default volume and then immediately moved to the TTL volume.
- load_balancing — policy for disk balancing, round_robin or least_used.
- least_used_ttl_ms — timeout (in milliseconds) for updating available space on all disks (0 — always update, -1 — never update, default is 60000). Note: if the disk can be used by ClickHouse only and is not subject to online filesystem resize/shrink, you can use -1; in all other cases it is not recommended, since eventually it will lead to incorrect space distribution.
- prefer_not_to_merge — you should not use this setting. It disables merging of data parts on this volume (this is harmful and leads to performance degradation). When this setting is enabled (don’t do it), merging data on this volume is not allowed (which is bad). This allows (but you don’t need it) controlling (if you want to control something, you’re making a mistake) how ClickHouse works with slow disks (but ClickHouse knows better, so please don’t use this setting).
- volume_priority — defines the priority (order) in which volumes are filled. A lower value means higher priority. The parameter values should be natural numbers and collectively cover the range from 1 to N (lowest priority given) without skipping any numbers.
  - If all volumes are tagged, they are prioritized in the given order.
  - If only some volumes are tagged, those without the tag have the lowest priority, and they are prioritized in the order they are defined in the config.
  - If no volumes are tagged, their priority is set according to the order they are declared in the configuration.
  - Two volumes cannot have the same priority value.
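The two policies discussed below could be declared as follows (a sketch based on the documented configuration structure; the disk names fast_ssd, disk1, and disk2 are illustrative):

```xml
<storage_configuration>
    <policies>
        <hdd_in_order>
            <volumes>
                <single>
                    <disk>disk1</disk>
                    <disk>disk2</disk>
                </single>
            </volumes>
        </hdd_in_order>
        <moving_from_ssd_to_hdd>
            <volumes>
                <hot>
                    <disk>fast_ssd</disk>
                    <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                </hot>
                <cold>
                    <disk>disk1</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
        </moving_from_ssd_to_hdd>
    </policies>
</storage_configuration>
```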
The hdd_in_order policy implements the round-robin approach. Thus this policy defines only one volume (single), and the data parts are stored on all its disks in circular order. Such a policy can be quite useful if several similar disks are mounted to the system, but RAID is not configured. Keep in mind that each individual disk drive is not reliable and you might want to compensate for that with a replication factor of 3 or more.
If there are different kinds of disks available in the system, the moving_from_ssd_to_hdd policy can be used instead. The volume hot consists of an SSD disk (fast_ssd), and the maximum size of a part that can be stored on this volume is 1GB. All parts larger than 1GB will be stored directly on the cold volume, which contains an HDD disk disk1.
Also, once the disk fast_ssd is filled by more than 80%, data will be transferred to disk1 by a background process.
The order of volume enumeration within a storage policy is important in case at least one of the volumes listed has no explicit volume_priority parameter.
Once a volume is overfilled, data are moved to the next one. The order of disk enumeration is important as well because data are stored on them in turns.
When creating a table, one can apply one of the configured storage policies to it:
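A sketch modeled on the documentation example:

```sql
CREATE TABLE table_with_non_default_policy
(
    EventDate Date,
    OrderID UInt64,
    BannerID UInt64,
    SearchPhrase String
)
ENGINE = MergeTree
ORDER BY (OrderID, BannerID)
PARTITION BY toYYYYMM(EventDate)
SETTINGS storage_policy = 'moving_from_ssd_to_hdd';
```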
The default storage policy implies using only one volume, which consists of only one disk given in <path>.
You can change the storage policy after table creation with the ALTER TABLE … MODIFY SETTING query; the new policy should include all old disks and volumes with the same names.
The number of threads performing background moves of data parts can be changed by background_move_pool_size setting.
Details
In the case of MergeTree tables, data reaches disk in different ways:
- As a result of an insert (INSERT query).
- During background merges and mutations.
- When downloading from another replica.
- As a result of partition freezing (ALTER TABLE … FREEZE PARTITION).
In all of these cases except for mutations and partition freezing, a part is stored on a volume and a disk according to the given storage policy:
- The first volume (in the order of definition) that has enough disk space for storing a part (unreserved_space > current_part_size) and allows for storing parts of a given size (max_data_part_size_bytes > current_part_size) is chosen.
- Within this volume, that disk is chosen that follows the one which was used for storing the previous chunk of data, and that has free space more than the part size (unreserved_space - keep_free_space_bytes > current_part_size).
Under the hood, mutations and partition freezing make use of hard links. Hard links between different disks are not supported, so in such cases the resulting parts are stored on the same disks as the initial ones. In the background, parts are moved between volumes on the basis of the amount of free space (the move_factor parameter) according to the order the volumes are declared in the configuration file.
Data is never transferred from the last one and into the first one. One may use system tables system.part_log (field type = MOVE_PART) and system.parts (fields path and disk) to monitor background moves. Also, the detailed information can be found in server logs.
A user can force moving a part or a partition from one volume to another using the query ALTER TABLE … MOVE PART|PARTITION … TO VOLUME|DISK …; all the restrictions for background operations are taken into account. The query initiates a move on its own and does not wait for background operations to be completed. The user will get an error message if not enough free space is available or if any of the required conditions are not met.
Moving data does not interfere with data replication. Therefore, different storage policies can be specified for the same table on different replicas.
After the completion of background merges and mutations, old parts are removed only after a certain amount of time (old_parts_lifetime).
During this time, they are not moved to other volumes or disks. Therefore, until the parts are finally removed, they are still taken into account for evaluation of the occupied disk space.
A user can assign new big parts to different disks of a JBOD volume in a balanced way using the min_bytes_to_rebalance_partition_over_jbod setting.
Using external storage for data storage
MergeTree family table engines can store data in S3, AzureBlobStorage, or HDFS using disks with types s3, azure_blob_storage, or hdfs respectively. See configuring external storage options for more details.
Example for S3 as external storage using a disk with type s3.
Configuration markup:
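A sketch (the endpoint and credentials are placeholders; most optional parameters are omitted):

```xml
<storage_configuration>
    <disks>
        <s3>
            <type>s3</type>
            <endpoint>https://s3.us-east-1.amazonaws.com/your-bucket/data/</endpoint>
            <access_key_id>your_access_key_id</access_key_id>
            <secret_access_key>your_secret_access_key</secret_access_key>
        </s3>
    </disks>
    <policies>
        <s3_main>
            <volumes>
                <main>
                    <disk>s3</disk>
                </main>
            </volumes>
        </s3_main>
    </policies>
</storage_configuration>
```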
Using S3 disks with multiple volumes
S3 (and other object storage) disks can be used in multi-disk and multi-volume storage policies the same way as local disks. This allows you to spread data across multiple S3 buckets within a single volume (JBOD-style), or set up tiered storage policies with S3 volumes. For example, to distribute data across two S3 buckets in a round-robin fashion:
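A sketch (bucket endpoints and names are placeholders):

```xml
<storage_configuration>
    <disks>
        <s3_disk_1>
            <type>s3</type>
            <endpoint>https://s3.us-east-1.amazonaws.com/bucket-one/data/</endpoint>
            <use_environment_credentials>true</use_environment_credentials>
        </s3_disk_1>
        <s3_disk_2>
            <type>s3</type>
            <endpoint>https://s3.us-east-1.amazonaws.com/bucket-two/data/</endpoint>
            <use_environment_credentials>true</use_environment_credentials>
        </s3_disk_2>
    </disks>
    <policies>
        <s3_jbod>
            <volumes>
                <main>
                    <disk>s3_disk_1</disk>
                    <disk>s3_disk_2</disk>
                </main>
            </volumes>
        </s3_jbod>
    </policies>
</storage_configuration>
```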
When using use_environment_credentials for S3 authentication, the environment credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) are shared across all S3 disks. It is not possible to use different environment credentials for different disks. If you need different credentials for each S3 disk, use explicit access_key_id and secret_access_key settings per disk instead.
Tables can also be attached directly from a disk (using table_disk = true with a table-local disk). See refresh_parts_interval and table_disk.
Cache configuration
ClickHouse versions 22.3 through 22.7 use a different cache configuration; see using local cache if you are using one of those versions.
Virtual columns
- _part — Name of a part.
- _part_index — Sequential index of the part in the query result.
- _part_starting_offset — Cumulative starting row of the part in the query result.
- _part_offset — Number of the row in the part.
- _part_granule_offset — Number of the granule in the part.
- _partition_id — Name of a partition.
- _part_uuid — Unique part identifier (if the MergeTree setting assign_part_uuids is enabled).
- _part_data_version — Data version of the part (either the minimum block number or the mutation version).
- _partition_value — Values (a tuple) of a partition by expression.
- _sample_factor — Sample factor (from the query).
- _block_number — Original number of the block for the row, assigned at insert; persisted on merges when the setting enable_block_number_column is enabled.
- _block_offset — Original number of the row in the block, assigned at insert; persisted on merges when the setting enable_block_offset_column is enabled.
- _disk_name — Disk name used for the storage.
Column statistics
The statistics declaration is in the columns section of the CREATE query for tables from the *MergeTree* family:
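A sketch using the syntax names listed below (the table and columns are illustrative):

```sql
CREATE TABLE tab
(
    a Int64 STATISTICS(tdigest, uniq),
    b Float64
)
ENGINE = MergeTree
ORDER BY a;
```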
Statistics can also be manipulated with ALTER statements:
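For the tab example above:

```sql
ALTER TABLE tab ADD STATISTICS b TYPE tdigest, uniq;
ALTER TABLE tab DROP STATISTICS a;
```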
To let the query optimizer use statistics, set use_statistics = 1.
Part Pruning with Statistics
When use_statistics_for_part_pruning is enabled, statistics can be used for part pruning.
Currently, only MinMax statistics support part pruning. When MinMax statistics are defined on a column, ClickHouse tracks the minimum and maximum values for that column in each part.
Part pruning allows skipping the reading of entire data parts when the query filter condition cannot match any rows in that part.
Example:
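A sketch (the events table, the column names, and the exact scope of the settings are assumptions for illustration):

```sql
-- Hypothetical table with MinMax statistics on the ts column
CREATE TABLE events
(
    id UInt64,
    ts DateTime STATISTICS(minmax)
)
ENGINE = MergeTree
ORDER BY id;

SET use_statistics = 1, use_statistics_for_part_pruning = 1;

-- Parts whose [min(ts), max(ts)] range cannot satisfy the filter are skipped entirely
SELECT count() FROM events WHERE ts >= now() - INTERVAL 1 DAY;
```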
Available types of column statistics
- MinMax — The minimum and maximum column value, which allows estimating the selectivity of range filters on numeric columns. Syntax: minmax
- TDigest — TDigest sketches, which allow computing approximate percentiles (e.g. the 90th percentile) for numeric columns. Syntax: tdigest
- Uniq — HyperLogLog sketches, which provide an estimation of how many distinct values a column contains. Syntax: uniq
- NullCount — Tracks the number of NULL values in Nullable columns. Used for accurate selectivity estimation of IS NULL / IS NOT NULL predicates in PREWHERE optimization. Syntax: nullcount
- CountMin — CountMin sketches, which provide an approximate count of the frequency of each value in a column. Syntax: countmin
Supported data types
| | (U)Int*, Float*, Decimal(), Date, Boolean, Enum* | String or FixedString | Nullable() / LowCardinality(Nullable()) |
|---|---|---|---|
| CountMin | ✔ | ✔ | ✗ |
| MinMax | ✔ | ✗ | ✔ |
| NullCount | ✗ | ✗ | ✔ |
| TDigest | ✔ | ✗ | ✔ |
| Uniq | ✔ | ✔ | ✔ |
Supported operations
| | Equality filters (==) | Range filters (>, >=, <, <=) | IS NULL / IS NOT NULL |
|---|---|---|---|
| CountMin | ✔ | ✗ | ✗ |
| MinMax | ✗ | ✔ | ✗ |
| NullCount | ✗ | ✗ | ✔ |
| TDigest | ✗ | ✔ | ✗ |
| Uniq | ✔ | ✗ | ✗ |
Column-level settings
Certain MergeTree settings can be overridden at the column level:
- max_compress_block_size — Maximum size of blocks of uncompressed data before compressing for writing to a table.
- min_compress_block_size — Minimum size of blocks of uncompressed data required for compression when writing the next mark.
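A sketch modeled on the documentation example:

```sql
CREATE TABLE tab
(
    id Int64,
    document String SETTINGS (min_compress_block_size = 16777216, max_compress_block_size = 16777216)
)
ENGINE = MergeTree
ORDER BY id;
```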
Column-level settings can be modified or removed using ALTER MODIFY COLUMN. You can:
- Remove SETTINGS from a column declaration.
- Modify a setting.
- Reset one or more settings; this also removes the setting declaration in the column expression of the table’s CREATE query.
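For the tab example above:

```sql
-- Remove SETTINGS from the column declaration
ALTER TABLE tab MODIFY COLUMN document REMOVE SETTINGS;

-- Modify a setting
ALTER TABLE tab MODIFY COLUMN document MODIFY SETTING min_compress_block_size = 8192;

-- Reset a setting (also removes the declaration from the table's CREATE query)
ALTER TABLE tab MODIFY COLUMN document RESET SETTING min_compress_block_size;
```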