Text indexes in ClickHouse (also known as “inverted indexes”) provide fast full-text search capabilities on string data. The index maps each token in the column to the rows which contain the token. Tokens are generated by a process called tokenization. For example, ClickHouse tokenizes the English sentence “All cat like mice.” by default as [“All”, “cat”, “like”, “mice”] (note that the trailing dot is ignored). More advanced tokenizers are available, for example for log data.
Creating a Text Index
To create a text index, first enable the corresponding experimental setting and declare the index in the table definition:
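A minimal sketch (the setting name allow_experimental_full_text_index reflects recent ClickHouse releases; the table and column names are placeholders):

```sql
SET allow_experimental_full_text_index = 1;

CREATE TABLE tab
(
    id UInt64,
    comment String,
    INDEX idx(comment) TYPE text(tokenizer = splitByNonAlpha)
)
ENGINE = MergeTree
ORDER BY id;
```

The tokenizer argument specifies the tokenizer: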
- splitByNonAlpha splits strings along non-alphanumeric ASCII characters (also see function splitByNonAlpha).
- splitByString(S) splits strings along certain user-defined separator strings S (also see function splitByString). The separators can be specified using an optional parameter, for example, tokenizer = splitByString([', ', '; ', '\n', '\\']). Note that each string can consist of multiple characters (', ' in the example). The default separator list, if not specified explicitly (for example, tokenizer = splitByString), is a single whitespace [' '].
- ngrams(N) splits strings into equally large N-grams (also see function ngrams). The ngram length can be specified using an optional integer parameter between 2 and 8, for example, tokenizer = ngrams(3). The default ngram size, if not specified explicitly (for example, tokenizer = ngrams), is 3.
- array performs no tokenization, i.e. every row value is a token (also see function array).
- sparseGrams(min_length, max_length, min_cutoff_length) uses the same algorithm as the sparseGrams function to split a string into all ngrams of min_length and several ngrams of larger size up to max_length, inclusive. If min_cutoff_length is specified, only N-grams with length greater than or equal to min_cutoff_length are saved in the index. Unlike ngrams(N), which generates only fixed-length N-grams, sparseGrams produces a set of variable-length N-grams within the specified range, allowing for a more flexible representation of text context. For example, tokenizer = sparseGrams(3, 5, 4) will generate 3-, 4-, and 5-grams from the input string and save only the 4- and 5-grams in the index.
The
splitByString tokenizer applies the split separators left-to-right.
This can create ambiguities.
For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas switching the separator strings to ['%', '%21'] will output ['21abc'].
In most cases, you want matching to prefer longer separators first.
This can generally be done by passing the separator strings in order of descending length.
If the separator strings happen to form a prefix code, they can be passed in arbitrary order.

The preprocessor argument is an expression which transforms the input string before tokenization.
Typical use cases for the preprocessor argument include
- Lower-casing (or upper-casing) the input strings to enable case-insensitive matching, e.g., lower, lowerUTF8, see the first example below.
- UTF-8 normalization, e.g. normalizeUTF8NFC, normalizeUTF8NFD, normalizeUTF8NFKC, normalizeUTF8NFKD, toValidUTF8.
- Removing or transforming unwanted characters or substrings, e.g. extractTextFromHTML, substring, idnaEncode.
For example:

```sql
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = substringIndex(col, '\n', 1))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(extractTextFromHTML(col)))
```
Optional advanced parameters
The default values of the following advanced parameters will work well in virtually all situations.
We do not recommend changing them.

- Optional parameter dictionary_block_size (default: 128) specifies the size of dictionary blocks in rows.
- Optional parameter dictionary_block_frontcoding_compression (default: 1) specifies if the dictionary blocks use front coding as compression.
- Optional parameter max_cardinality_for_embedded_postings (default: 16) specifies the cardinality threshold below which posting lists should be embedded into dictionary blocks.
- Optional parameter bloom_filter_false_positive_rate (default: 0.1) specifies the false-positive rate of the dictionary bloom filter.

Using a Text Index
Using a text index in SELECT queries is straightforward, as common string search functions leverage the index automatically. If no index exists, the string search functions below fall back to slow brute-force scans.

Supported functions
The text index can be used if text functions are used in the WHERE clause of a SELECT query:
= and !=
= (equals) and != (notEquals) match the entire given search term.
Example:
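A sketch (assuming a text index with the array tokenizer on column comment of table tab):

```sql
SELECT count() FROM tab WHERE comment = 'ClickHouse is a columnar database';
```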
The text index supports = and !=, yet equality and inequality search only make sense with the array tokenizer (which causes the index to store entire row values).
IN and NOT IN
IN (in) and NOT IN (notIn) are similar to functions equals and notEquals but they match all (IN) or none (NOT IN) of the search terms.
Example:
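A sketch (again assuming the array tokenizer):

```sql
SELECT count() FROM tab WHERE comment IN ('ClickHouse', 'OLAP');
```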
The same considerations as for = and != apply, i.e. IN and NOT IN only make sense in conjunction with the array tokenizer.
LIKE, NOT LIKE and match
These functions currently use the text index for filtering only if the index tokenizer is either splitByNonAlpha or ngrams.
To use LIKE (like), NOT LIKE (notLike), and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term.
Example:
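A sketch of a substring query that cannot use the index:

```sql
SELECT count() FROM tab WHERE comment LIKE '%support%';
```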
The term support in the example could match support, supports, supporting, etc.
This kind of query is a substring query and it cannot be sped up by a text index.
To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:
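A sketch of the rewritten pattern:

```sql
SELECT count() FROM tab WHERE comment LIKE '% support %';
```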
The spaces to the left and right of support make sure that the term can be extracted as a token.
startsWith and endsWith
Similar to LIKE, functions startsWith and endsWith can only use a text index if complete tokens can be extracted from the search term.
Example:
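A sketch:

```sql
SELECT count() FROM tab WHERE startsWith(comment, 'clickhouse support');
```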
In this example, only clickhouse is considered a token.
support is not a token because it can match support, supports, supporting, etc.
To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:
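```sql
-- sketch: the trailing space makes 'supports' a complete token
SELECT count() FROM tab WHERE startsWith(comment, 'clickhouse supports ');
```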
Similarly, endsWith should be used with a leading space:
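```sql
-- sketch: the leading space makes the first term a complete token ('olap engine' is a made-up search term)
SELECT count() FROM tab WHERE endsWith(comment, ' olap engine');
```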
hasToken and hasTokenOrNull
Functions hasToken and hasTokenOrNull match against a single given token.
Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token).
Example:
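A sketch:

```sql
SELECT count() FROM tab WHERE hasToken(comment, 'clickhouse');
```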
hasToken and hasTokenOrNull are the most performant functions to use with the text index.
hasAnyTokens and hasAllTokens
Functions hasAnyTokens and hasAllTokens match against one or all of the given tokens.
These two functions accept the search tokens as either a string which will be tokenized using the same tokenizer used for the index column, or as an array of already processed tokens to which no tokenization will be applied prior to searching.
See the function documentation for more info.
Example:
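A sketch showing both the string form (tokenized with the index tokenizer) and the array form:

```sql
SELECT count() FROM tab WHERE hasAnyTokens(comment, 'clickhouse olap');
SELECT count() FROM tab WHERE hasAllTokens(comment, ['clickhouse', 'olap']);
```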
has
Array function has matches against a single token in the array of strings.
Example:
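A sketch, assuming an Array(String) column keywords:

```sql
SELECT count() FROM tab WHERE has(keywords, 'clickhouse');
```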
mapContains
Function mapContains (an alias of mapContainsKey) matches against a single token in the keys of a map.
Example:
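A sketch, assuming a Map(String, String) column attributes:

```sql
SELECT count() FROM tab WHERE mapContains(attributes, 'client_ip');
```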
operator[]
Access operator [] can be used with the text index to filter on map keys and values.
Example:
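A sketch, again with the assumed attributes map:

```sql
SELECT count() FROM tab WHERE attributes['client_ip'] = '192.168.1.1';
```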
Columns of type Array(T) and Map(K, V) can also be used with the text index.
The following examples demonstrate this Array and Map support.
Indexing Array(String)
In a simple blogging platform, authors assign keywords to their posts to categorize content. A common feature allows users to discover related content by clicking on keywords or searching for topics. Consider the table definition sketched below; without an index, finding posts with a specific keyword (e.g. clickhouse) requires scanning all entries:
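A sketch (the table and column names are assumptions):

```sql
CREATE TABLE posts
(
    post_id UInt64,
    title String,
    keywords Array(String)
)
ENGINE = MergeTree
ORDER BY post_id;

-- without an index, this scans every row:
SELECT count() FROM posts WHERE has(keywords, 'clickhouse');
```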
The solution is a text index on keywords, which creates a search-optimized structure that pre-processes all keywords, enabling fast lookups:
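A sketch (the index name is a placeholder):

```sql
ALTER TABLE posts ADD INDEX keywords_idx(keywords) TYPE text(tokenizer = array);
```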
Important: After adding the text index, you must rebuild it for existing data:
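```sql
ALTER TABLE posts MATERIALIZE INDEX keywords_idx;
```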
Indexing Map
In a logging system, server requests often store metadata in key-value pairs. Operations teams need to efficiently search through logs for debugging, security incidents, and monitoring. Consider the logs table sketched below. Without an index, these queries must scan all entries:
- Finding all logs with rate limiting
- Finding all logs from a specific IP
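A sketch (the table layout, attribute keys, and index definitions are assumptions):

```sql
CREATE TABLE logs
(
    timestamp DateTime,
    message String,
    attributes Map(String, String)
)
ENGINE = MergeTree
ORDER BY timestamp;

-- full scans without an index:
SELECT count() FROM logs WHERE mapContains(attributes, 'rate_limited');
SELECT count() FROM logs WHERE attributes['client_ip'] = '192.168.1.1';

-- add text indexes over the map keys and values:
ALTER TABLE logs ADD INDEX attr_keys_idx mapKeys(attributes) TYPE text(tokenizer = array);
ALTER TABLE logs ADD INDEX attr_vals_idx mapValues(attributes) TYPE text(tokenizer = array);
```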
Important: After adding the text index, you must rebuild it for existing data:
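```sql
ALTER TABLE logs MATERIALIZE INDEX attr_keys_idx;
ALTER TABLE logs MATERIALIZE INDEX attr_vals_idx;
```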
With the index in place, the same lookups become fast:
- Finding all rate-limited requests
- Finding all logs from a specific IP
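```sql
-- indexed lookup: rate-limited requests
SELECT count() FROM logs WHERE mapContains(attributes, 'rate_limited');

-- indexed lookup: logs from a specific IP
SELECT count() FROM logs WHERE attributes['client_ip'] = '192.168.1.1';
```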
Implementation
Index layout
Each text index consists of two (abstract) data structures:
- a dictionary which maps each token to a postings list, and
- a set of postings lists, each representing a set of row numbers.
Dictionary blocks file (.dct)
The sorted tokens of each index granule are divided into dictionary blocks (their size is configurable by parameter dictionary_block_size).
A dictionary blocks file (.dct) contains all the dictionary blocks of all index granules in a part.
Index granules file (.idx)
The index granules file contains for each dictionary block the block’s first token, its relative offset in the dictionary blocks file, and a bloom filter for all tokens in the block.
This sparse index structure is similar to ClickHouse’s sparse primary key index.
The bloom filter allows skipping dictionary blocks early if the searched token is not contained in the block.
Postings lists file (.pst)
The posting lists for all tokens are laid out sequentially in the postings list file.
To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps.
If the cardinality of a posting list is less than 16 (configurable by parameter max_cardinality_for_embedded_postings), it is embedded into the dictionary.
Direct read
Certain types of text queries can be sped up significantly by an optimization called “direct read”. More specifically, the optimization can be applied if the SELECT query does not project from the text column; see the sketch after this list. Direct read is controlled by:
- Setting query_plan_direct_read_from_text_index (default: 1), which specifies if direct read is generally enabled.
- Setting use_skip_indexes_on_data_read (default: 1), which is another prerequisite for direct read. Note that on ClickHouse databases with compatibility < 25.10, use_skip_indexes_on_data_read is disabled, so you either need to raise the compatibility setting value or SET use_skip_indexes_on_data_read = 1 explicitly.
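A sketch of a query eligible for direct read (it filters on the indexed text column but does not select it; names are placeholders):

```sql
SELECT count() FROM tab WHERE hasToken(comment, 'clickhouse');
```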
Direct read also requires that the text index physically exists in all parts (use ALTER TABLE ... MATERIALIZE INDEX for that).
Supported functions
The direct read optimization supports functions hasToken, hasAllTokens, and hasAnyTokens.
These functions can also be combined by AND, OR, and NOT operators.
The WHERE clause can also contain additional filters that are not text-search functions (on text columns or other columns); in that case, the direct read optimization is still used, but it is less effective (it only applies to the supported text search functions).
To understand whether a query utilizes direct read, run the query with EXPLAIN PLAN actions = 1.
As an example, run a query with direct read disabled, then again with query_plan_direct_read_from_text_index = 1.
With direct read enabled, the query plan contains a virtual column named __text_index_<index_name>_<function_name>_<id>.
If this column is present, then direct read is used.
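A sketch of such a check (names are placeholders):

```sql
EXPLAIN PLAN actions = 1
SELECT count() FROM tab WHERE hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 1;
```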
Example: Hackernews dataset
Let’s look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments from the popular Hacker News website. Here is the table without a text index, together with the data load into the hackernews table (sketched below):
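A sketch (the schema is reduced to the columns used here; the load statement is a placeholder for your copy of the dataset):

```sql
CREATE TABLE hackernews
(
    id UInt64,
    author String,
    comment String
)
ENGINE = MergeTree
ORDER BY id;

-- load the 28.7M comments, e.g.:
-- INSERT INTO hackernews SELECT id, author, comment FROM ...;
```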
Next, we use ALTER TABLE to add a text index on the comment column, then materialize it:
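A sketch (the index name and tokenizer are assumptions):

```sql
ALTER TABLE hackernews ADD INDEX comment_idx(comment) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx;
```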
We will run queries using the hasToken, hasAnyTokens, and hasAllTokens functions.
The following examples will show the dramatic performance difference between a standard index scan and the direct read optimization.
1. Using hasToken
hasToken checks if the text contains a specific single token.
We’ll search for the case-sensitive token ‘ClickHouse’.
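A sketch (direct read is enabled by default):

```sql
SELECT count() FROM hackernews WHERE hasToken(comment, 'ClickHouse');
```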
Direct read disabled (Standard scan)
By default, ClickHouse uses the skip index to filter granules and then reads the column data for those granules.
We can simulate this behavior by disabling direct read.
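A sketch:

```sql
SELECT count() FROM hackernews WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;
```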
2. Using hasAnyTokens
hasAnyTokens checks if the text contains at least one of the given tokens.
We’ll search for comments containing either ‘love’ or ‘ClickHouse’.
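A sketch (direct read enabled):

```sql
SELECT count() FROM hackernews WHERE hasAnyTokens(comment, ['love', 'ClickHouse']);
```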
Direct read disabled (Standard scan)
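A sketch:

```sql
SELECT count() FROM hackernews WHERE hasAnyTokens(comment, ['love', 'ClickHouse'])
SETTINGS query_plan_direct_read_from_text_index = 0;
```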
3. Using hasAllTokens
hasAllTokens checks if the text contains all of the given tokens.
We’ll search for comments containing both ‘love’ and ‘ClickHouse’.
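A sketch (direct read enabled):

```sql
SELECT count() FROM hackernews WHERE hasAllTokens(comment, ['love', 'ClickHouse']);
```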
Direct read disabled (Standard scan)
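A sketch:

```sql
SELECT count() FROM hackernews WHERE hasAllTokens(comment, ['love', 'ClickHouse'])
SETTINGS query_plan_direct_read_from_text_index = 0;
```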
Even with direct read disabled, the standard skip index is still effective.
It filters down the 28.7M rows to just 147.46K rows, but it still must read 57.03 MB from the column.
4. Compound search: OR, AND, NOT, …
The direct read optimization also applies to compound boolean expressions. Here, we’ll perform a case-insensitive search for ‘ClickHouse’ OR ‘clickhouse’, both with direct read and as a standard scan with direct read disabled (sketched below). Note that hasAnyTokens(comment, ['ClickHouse', 'clickhouse']) would be the preferred, more efficient syntax.
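A sketch of both variants:

```sql
-- direct read (default)
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse');

-- direct read disabled (standard scan)
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 0;
```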
Tuning the text index
Currently, there are caches for the deserialized dictionary blocks, headers, and posting lists of the text index to reduce I/O. They can be enabled via settings use_text_index_dictionary_cache, use_text_index_header_cache, and use_text_index_postings_cache, respectively. By default, they are disabled. Refer to the following server settings to configure the caches.

Server Settings
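For example, to enable all three caches in a session (these are the query-level settings named above):

```sql
SET use_text_index_dictionary_cache = 1;
SET use_text_index_header_cache = 1;
SET use_text_index_postings_cache = 1;
```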
Dictionary blocks cache settings
| Setting | Description | Default |
|---|---|---|
| text_index_dictionary_block_cache_policy | Text index dictionary block cache policy name. | SLRU |
| text_index_dictionary_block_cache_size | Maximum cache size in bytes. | 1073741824 |
| text_index_dictionary_block_cache_max_entries | Maximum number of deserialized dictionary blocks in cache. | 1'000'000 |
| text_index_dictionary_block_cache_size_ratio | The size of the protected queue in the text index dictionary block cache relative to the cache’s total size. | 0.5 |
Header cache settings
| Setting | Description | Default |
|---|---|---|
| text_index_header_cache_policy | Text index header cache policy name. | SLRU |
| text_index_header_cache_size | Maximum cache size in bytes. | 1073741824 |
| text_index_header_cache_max_entries | Maximum number of deserialized headers in cache. | 100'000 |
| text_index_header_cache_size_ratio | The size of the protected queue in the text index header cache relative to the cache’s total size. | 0.5 |
Posting lists cache settings
| Setting | Description | Default |
|---|---|---|
| text_index_postings_cache_policy | Text index postings cache policy name. | SLRU |
| text_index_postings_cache_size | Maximum cache size in bytes. | 2147483648 |
| text_index_postings_cache_max_entries | Maximum number of deserialized postings in cache. | 1'000'000 |
| text_index_postings_cache_size_ratio | The size of the protected queue in the text index postings cache relative to the cache’s total size. | 0.5 |