Full-text search with text indexes
Text indexes (also known as inverted indexes) enable fast full-text search on text data. A text index stores a mapping from tokens to the row numbers of the rows that contain each token. Tokens are generated by a process called tokenization. For example, ClickHouse’s default tokenizer converts the English sentence “The cat likes mice.” to the tokens [“The”, “cat”, “likes”, “mice”]. As an example, assume a table with a single column and three rows.

Creating a Text Index
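A text index is declared like any other skipping index in the table definition. A minimal sketch, assuming a hypothetical table tab with a String column col (names are illustrative, syntax mirrors the tokenizer examples later in this page):

```sql
CREATE TABLE tab
(
    key UInt64,
    col String,
    -- text index using the default splitByNonAlpha tokenizer
    INDEX text_idx(col) TYPE text(tokenizer = 'splitByNonAlpha')
)
ENGINE = MergeTree
ORDER BY key;
```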
Text indexes are generally available (GA) in ClickHouse version 26.2 and newer. In these versions, no special settings need to be configured to use the text index. We strongly recommend using ClickHouse versions >= 26.2 for production use cases. Text indexes can be used with any ClickHouse version >= 26.2, regardless of the compatibility setting.
Text indexes can be defined on columns of the following data types:
- String and FixedString,
- Array(String) and Array(FixedString),
- Map (via mapKeys and mapValues functions), and
- JSON (via JSONAllPaths and JSONAllValues functions).
Nullable types are not supported, e.g. it is not possible to create a text index on Array(Nullable(String)) or Array(Nullable(FixedString)).
Alternatively, to add a text index to an existing table:
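A sketch of the ALTER variant, assuming the same hypothetical table tab and column col; the MATERIALIZE step builds the index for already-existing data parts:

```sql
ALTER TABLE tab ADD INDEX text_idx(col) TYPE text(tokenizer = 'splitByNonAlpha');

-- build the index for existing data parts
ALTER TABLE tab MATERIALIZE INDEX text_idx;
```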
The tokenizer argument specifies the tokenizer:
- splitByNonAlpha splits strings along non-alphanumeric ASCII characters (see function splitByNonAlpha).
- splitByString(S) splits strings along certain user-defined separator strings S (see function splitByString). The separators can be specified using an optional parameter, for example, tokenizer = splitByString([', ', '; ', '\n', '\\']). Note that each separator string can consist of multiple characters (', ' in the example). The default separator list, if not specified explicitly (for example, tokenizer = splitByString), is a single whitespace [' '].
- asciiCJK splits strings into tokens using Unicode word boundary rules (similar to Unicode Text Segmentation, UAX #29). ASCII alphanumeric characters and underscores form tokens, with connectors (ASCII : for letters, . and ' for same-type characters). Non-ASCII Unicode characters, including CJK characters, become single-character tokens.
- ngrams(N) splits strings into equally large N-grams (see function ngrams). The ngram length can be specified using an optional integer parameter between 1 and 8, for example, tokenizer = ngrams(3). The default ngram size, if not specified explicitly (for example, tokenizer = ngrams), is 3.
- sparseGrams(min_length, max_length, min_cutoff_length) splits strings into variable-length n-grams of at least min_length and at most max_length (inclusive) characters (see function sparseGrams). Unless specified explicitly, min_length and max_length default to 3 and 100. If the parameter min_cutoff_length is provided, only n-grams with length greater than or equal to min_cutoff_length are returned. Compared to ngrams(N), the sparseGrams tokenizer produces variable-length n-grams, allowing for a more flexible representation of the original text. For example, tokenizer = sparseGrams(3, 5, 4) internally generates 3-, 4-, and 5-grams from the input string, but only the 4- and 5-grams are returned.
- array performs no tokenization, i.e. every row value is a token (see function array).
The splitByString tokenizer applies the split separators left-to-right.
This can create ambiguities.
For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas switching both separator strings to ['%', '%21'] will output ['21abc'].
In most cases, you want matching to prefer longer separators first.
This can generally be done by passing the separator strings in order of descending length.
If the separator strings happen to form a prefix code, they can be passed in arbitrary order.
The asciiCJK tokenizer is recommended as it correctly handles Unicode word boundaries, including CJK characters.
The preprocessor argument (optional) refers to an expression which is applied to the input string before tokenization.
Typical use cases for the preprocessor argument include:
- Lower/upper-casing, or case folding to enable case-insensitive matching, e.g., lower, lowerUTF8, caseFoldUTF8.
- UTF-8 normalization, e.g. normalizeUTF8NFC, normalizeUTF8NFD, normalizeUTF8NFKC, normalizeUTF8NFKD, normalizeUTF8NFKCCasefold, toValidUTF8.
- Removing or transforming unwanted characters or substrings, such as accents, e.g. extractTextFromHTML, substring, idnaEncode, translate, removeDiacriticsUTF8.
If the indexed column is a Nullable(T) or LowCardinality(T) column, then the preprocessor expression should accept nullable or low-cardinality values (i.e. not throw an exception).
Examples:
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = substringIndex(col, '\n', 1))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(extractTextFromHTML(col)))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = removeDiacriticsUTF8(caseFoldUTF8(col)))
Allowed:
INDEX idx(lower(col)) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = upper(lower(col)))
INDEX idx(lower(col)) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = concat(lower(col), lower(col)))
Not allowed:
INDEX idx(lower(col)) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = concat(col, col))
Using a Text Index
Using a text index in SELECT queries is straightforward, as common string search functions leverage the index automatically. If no index exists on a column or table part, the string search functions fall back to slow brute-force scans. We recommend using the functions hasAnyTokens and hasAllTokens to search the text index; see below.
These functions work with all available tokenizers and all possible preprocessor expressions.
As the other supported functions historically preceded the text index, they had to retain their legacy behavior in many cases (e.g. no preprocessor support).

Supported functions
The text index can be used if text functions appear in the WHERE or PREWHERE clause:
=
= (equals) matches the entire given search term.
Example:
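A sketch, assuming a hypothetical table tab with an indexed String column col:

```sql
-- matches rows whose value equals the entire search term
SELECT count()
FROM tab
WHERE col = 'clickhouse';
```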
IN
IN (in) is similar to equals but matches rows whose value equals any of the given search terms.
Example:
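A sketch, assuming the same hypothetical table tab and column col:

```sql
-- matches rows equal to any of the listed terms
SELECT count()
FROM tab
WHERE col IN ('clickhouse', 'olap');
```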
NOT IN (notIn) is not supported by the text index.

LIKE and match
These functions currently use the text index for filtering only if the index tokenizer is either splitByNonAlpha, ngrams, or sparseGrams. NOT LIKE (notLike) is not supported by the text index. To use LIKE (like) and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term.
For an index with the ngrams tokenizer, this is the case if the length of the searched strings between wildcards is equal to or longer than the ngram length.
Example for the text index with splitByNonAlpha tokenizer:
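A sketch of such a query, assuming a hypothetical table tab with an indexed String column col:

```sql
-- 'support' here is a substring pattern, not a complete token,
-- so the text index cannot be used
SELECT count()
FROM tab
WHERE col LIKE '%support%';
```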
The term support in the example could match support, supports, supporting, etc.
This kind of query is a substring query, and it cannot be sped up by a text index.
To leverage a text index for LIKE queries, the LIKE pattern must be rewritten as LIKE '% support %'.
The spaces around support make sure that the term can be extracted as a token.
Fortunately, there is a special case where ClickHouse can leverage the inverted index to speed up LIKE queries significantly.
See the LIKE/ILIKE performance tuning section for details.
startsWith and endsWith
Similar to LIKE, the functions startsWith and endsWith can only use a text index if complete tokens can be extracted from the search term.
For an index with the ngrams tokenizer, this is the case if the length of the searched strings between wildcards is equal to or longer than the ngram length.
Example for the text index with splitByNonAlpha tokenizer:
clickhouse is considered a token.
support is not a token because it can match support, supports, supporting, etc.
To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:
endsWith should be used with a leading space:
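A sketch of both variants, assuming a hypothetical table tab with an indexed String column col:

```sql
-- trailing space ensures 'supports' is extracted as a complete token
SELECT count() FROM tab WHERE startsWith(col, 'clickhouse supports ');

-- leading space ensures 'clickhouse' is extracted as a complete token
SELECT count() FROM tab WHERE endsWith(col, ' clickhouse');
```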
hasToken and hasTokenOrNull
Function hasToken looks straightforward to use, but it has certain pitfalls with non-default tokenizers and preprocessor expressions.
We recommend using functions hasAnyTokens and hasAllTokens instead.

hasAnyTokens and hasAllTokens
Functions hasAnyTokens and hasAllTokens match against one or all of the given tokens.
These two functions accept the search tokens as either a string which will be tokenized using the same tokenizer used for the index column, or as an array of already processed tokens to which no tokenization will be applied prior to searching.
See the function documentation for more info.
Example:
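A sketch of both call styles, assuming a hypothetical table tab with an indexed String column col:

```sql
-- needle given as a string: it is tokenized with the index's tokenizer
SELECT count() FROM tab WHERE hasAnyTokens(col, 'clickhouse olap');

-- needle given as an array: the tokens are used as-is, without tokenization
SELECT count() FROM tab WHERE hasAllTokens(col, ['clickhouse', 'olap']);
```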
hasPhrase
Function hasPhrase matches against a phrase: all tokens must appear consecutively and in the same order as in the search string.
Unlike hasAllTokens, which only requires all tokens to be present somewhere, hasPhrase requires them to appear as a consecutive sequence.
The search phrase is tokenized using the same tokenizer configured for the index column.
Note that the function requires one of the splitByNonAlpha, splitByString, ngrams, or asciiCJK tokenizers.
Example:
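A sketch, assuming a hypothetical table tab with an indexed String column col:

```sql
-- matches rows where 'quick' and 'brown' appear consecutively, in this order
SELECT count() FROM tab WHERE hasPhrase(col, 'quick brown');
```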
has
Array function has matches against a single token in the array of strings.
Example:
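A sketch, assuming arr is an indexed Array(String) column of a hypothetical table tab:

```sql
SELECT count() FROM tab WHERE has(arr, 'clickhouse');
```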
hasAny and hasAll
Array functions hasAny and hasAll test whether the indexed array column contains any or all of a constant set of needle strings.
Example:
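A sketch, assuming the same hypothetical Array(String) column arr:

```sql
-- any of the needles present in the array
SELECT count() FROM tab WHERE hasAny(arr, ['clickhouse', 'olap']);

-- all of the needles present in the array
SELECT count() FROM tab WHERE hasAll(arr, ['clickhouse', 'olap']);
```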
mapContains
Function mapContains (an alias of mapContainsKey) matches against tokens extracted from the searched string in the keys of a map.
The behaviour is similar to the equals function with a String column.
The text index is only used if it was created on a mapKeys(map) expression.
Example:
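A sketch, assuming map is a Map(String, String) column of a hypothetical table tab with a text index defined on mapKeys(map):

```sql
SELECT count() FROM tab WHERE mapContainsKey(map, 'error_code');
```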
mapContainsValue
Function mapContainsValue matches against tokens extracted from the searched string in the values of a map.
The behaviour is similar to the equals function with a String column.
The text index is only used if it was created on a mapValues(map) expression.
Example:
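A sketch, assuming the same hypothetical Map column, this time with a text index defined on mapValues(map):

```sql
SELECT count() FROM tab WHERE mapContainsValue(map, 'timeout');
```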
mapContainsKeyLike and mapContainsValueLike
The functions mapContainsKeyLike and mapContainsValueLike match a pattern against all keys or values (respectively) of a map.
Example:
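A sketch, assuming the same hypothetical Map column map:

```sql
-- pattern match against all map keys
SELECT count() FROM tab WHERE mapContainsKeyLike(map, 'error%');

-- pattern match against all map values
SELECT count() FROM tab WHERE mapContainsValueLike(map, '%timeout%');
```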
operator[]
Access operator[] can be used with the text index to filter out keys and values. The text index is only used if it is created on mapKeys(map) or mapValues(map) expressions, or both.
Example:
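A sketch, assuming text indexes exist on mapKeys(map) and mapValues(map) of a hypothetical table tab:

```sql
SELECT count() FROM tab WHERE map['error_code'] = '500';
```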
The following examples demonstrate the use of Array(T) and Map(K, V) columns with the text index.
Indexing Array(String) columns
Imagine a blogging platform where authors categorize their blog posts using keywords. We would like users to discover related content by searching for or clicking on topics. Without a text index, finding posts with a specific keyword (e.g. clickhouse) requires scanning all entries.
To avoid that, add a text index on the keywords column:
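A sketch of such a table and index (hypothetical names; the array tokenizer indexes each keyword as one token):

```sql
CREATE TABLE posts
(
    post_id UInt64,
    title String,
    keywords Array(String),
    -- array tokenizer: every keyword is indexed as a single token
    INDEX keywords_idx(keywords) TYPE text(tokenizer = 'array')
)
ENGINE = MergeTree
ORDER BY post_id;

-- served by the text index
SELECT post_id, title FROM posts WHERE has(keywords, 'clickhouse');
```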
Indexing Map columns
In many observability use cases, log messages are split into “components” and stored as appropriate data types, e.g. a date time for the timestamp, an enum for the log level, etc. Metrics fields are best stored as key-value pairs. Operations teams need to efficiently search through logs for debugging, security incidents, and monitoring. Consider this logs table:

Indexing JSON columns
Text indexes can be used with JSON columns in three ways:
- Indexes on specific subcolumns — create a text index on a known JSON path, just like on a regular column. This indexes the values at that path.
- Path-based indexes with JSONAllPaths — indexes all paths present in each granule to skip granules that cannot contain the queried path. Similar to Map columns.
- Value-based indexes with JSONAllValues — indexes all values across all JSON paths to accelerate full-text search on any JSON subcolumn with a single index.
Indexes on specific subcolumns
You can create a skip index on any JSON subcolumn using the same syntax as for regular columns. There are two ways to reference a JSON subcolumn in an index expression:
- Typed path declared in the JSON type hint — access by name directly: json.a.
- Dynamic path with explicit cast — use the :: cast syntax: json.b::String.
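A sketch combining both styles (hypothetical table and paths):

```sql
CREATE TABLE tab
(
    json JSON(a String),
    -- typed path from the JSON type hint, referenced by name
    INDEX idx_a(json.a) TYPE text(tokenizer = 'splitByNonAlpha'),
    -- dynamic path, referenced via an explicit cast
    INDEX idx_b(json.b::String) TYPE text(tokenizer = 'splitByNonAlpha')
)
ENGINE = MergeTree
ORDER BY tuple();
```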
Path-based indexes with JSONAllPaths
Similar to Map columns, text indexes can be created on JSON columns using JSONAllPaths.
The index stores the set of JSON paths present in each granule and uses them to skip granules where a queried path is absent.
Example index definition:
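A sketch of such an index (hypothetical table; the array tokenizer treats each JSON path as one token, since JSONAllPaths returns the paths as an array of strings):

```sql
CREATE TABLE tab
(
    json JSON,
    INDEX idx_paths(JSONAllPaths(json)) TYPE text(tokenizer = 'array')
)
ENGINE = MergeTree
ORDER BY tuple();
```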
Use EXPLAIN indexes = 1 to verify that the skip index is being used.
When a path exists only in one part, the index skips the other part.
Example:
IS NOT NULL also uses the index — it skips granules where the path is absent (since the value would be NULL):
Example:
Value-based indexes with JSONAllValues
Text indexes can be used to accelerate searches on JSON columns via the function JSONAllValues.
JSONAllValues returns all values from a JSON column as Array(String).
Values of non-string datatypes (e.g. integers and arrays) are converted to their text representation.
A text index built using JSONAllValues indexes these text representations across all JSON paths in each row.
This index can then accelerate queries that filter on individual JSON subcolumns.
When a query filters on a specific subcolumn (e.g. data.user_name = 'alice'), the text index can quickly skip rows (and granules) that do not contain the search tokens in any of their JSON values.
The index may produce false positives when different JSON paths contain the same tokens.
For example, if row 1 has {"a": "hello", "b": "world"} and a query searches for data.a = 'world', the text index cannot distinguish that world belongs to path b, not a.
In such cases, the index will not skip the row, and the filter on the actual column data will handle the final evaluation.
This is the same behavior as with other text index use cases, where the index acts as a fast pre-filter. The index supports text-search functions for String columns and the function equals for columns of all types.
Subcolumn access:
CAST:
IN operator:
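Sketches of the three query shapes (hypothetical JSON column data and paths):

```sql
-- subcolumn access
SELECT count() FROM tab WHERE data.user_name = 'alice';

-- explicit CAST of a dynamic path
SELECT count() FROM tab WHERE data.status::String = 'active';

-- IN operator
SELECT count() FROM tab WHERE data.user_name IN ('alice', 'bob');
```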
Phrase search
The text index supports phrase search via the hasPhrase function.
All tokens in the phrase must appear consecutively and in the same order in the document.
The text index accelerates phrase search by intersecting the posting lists for all tokens in the phrase to identify candidate granules.
Within those granules, ClickHouse then verifies exact token adjacency.
hasPhrase is supported with tokenizers splitByNonAlpha, splitByString, ngrams, and asciiCJK.
The phrase string is tokenized using the index’s configured tokenizer.
Tokenizer separator characters in the phrase are ignored: hasPhrase(text, 'quick+brown') is equivalent to hasPhrase(text, 'quick brown') for the splitByNonAlpha tokenizer.
Example
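A sketch of such an example, with assumed data chosen to match the row descriptions below and the assumed search phrase 'York weather':

```sql
CREATE TABLE msgs
(
    id UInt64,
    text String,
    INDEX idx(text) TYPE text(tokenizer = 'splitByNonAlpha')
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO msgs VALUES
    (1, 'New York weather'),
    (2, 'New weather in York'),
    (3, 'weather in New Orleans');

-- only row 1 contains 'York' immediately followed by 'weather'
SELECT id FROM msgs WHERE hasPhrase(text, 'York weather');
```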
Row 2 ('New weather in York') does not match because the tokens are in the wrong order.
Row 3 ('weather in New Orleans') does not match because it does not contain the token 'York'.
Performance Tuning
Direct read
Certain types of text queries can be sped up significantly by an optimization called “direct read”. Whether direct read is applied is controlled by the following settings:
- Setting query_plan_direct_read_from_text_index (true by default), which specifies if direct read is generally enabled.
- Setting use_skip_indexes_on_data_read, which was a prerequisite for direct read in ClickHouse versions < 26.4.
Direct read is supported for functions hasToken, hasAllTokens, and hasAnyTokens.
If the text index is defined with an array tokenizer, direct read is also supported for functions equals, has, hasAny, hasAll, mapContainsKey, and mapContainsValue.
These functions can also be combined by AND, OR, and NOT operators.
The WHERE or PREWHERE clause can also contain additional filters that are not text search functions (on text columns or other columns); in that case, the direct read optimization is still used but is less effective (it only applies to the supported text search functions).
To check whether a query utilizes direct read, run the query with EXPLAIN PLAN actions = 1.
As an example, run a query with direct read disabled, then re-run it with query_plan_direct_read_from_text_index = 1 and compare the plans: with direct read, the plan contains a virtual column named __text_index_<index_name>_<function_name>_<id>.
If this column is present, then direct read is used.
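A sketch of such a check (hypothetical table tab, column col, and index name):

```sql
EXPLAIN PLAN actions = 1
SELECT count()
FROM tab
WHERE hasToken(col, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 1;
-- look for a __text_index_* column in the plan output
```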
If the WHERE filter clause only contains text search functions, the query can avoid reading the column data entirely, and direct read yields its greatest performance benefit.
However, even if the text column is accessed elsewhere in the query, direct read will still provide a performance improvement.
Direct read as a hint
Direct read as a hint is based on the same principles as normal direct read, but instead adds an additional filter built from the text index data without removing the underlying text column.
It is used for functions when reading only from the text index would produce false positives.
Supported functions are: like, startsWith, endsWith, equals, has, hasPhrase, mapContainsKey, and mapContainsValue.
The additional filter can provide extra selectivity to further restrict the result set in combination with other filters, helping to reduce the amount of data read from other columns.
Direct read as a hint is controlled by setting query_plan_text_index_add_hint (enabled by default).
Example of a query without the hint:
Re-running the same query with query_plan_text_index_add_hint = 1 shows that an additional filter column (__text_index_...) has been added to the filter condition.
Thanks to the PREWHERE optimization, the filter condition is broken down into three separate conjuncts, which are applied in order of increasing computational complexity.
For this query, the application order is __text_index_..., then greaterOrEquals(...), and finally like(...).
This ordering enables skipping even more data granules than those skipped by the text index and the original filter alone, before reading the heavy columns referenced after the WHERE clause, further reducing the amount of data to read.
LIKE/ILIKE queries
When a LIKE/ILIKE query pattern is %<alpha-numeric-characters-without-spaces>% and the text index tokenizer is splitByNonAlpha or array, ClickHouse leverages the inverted index to speed up LIKE/ILIKE queries significantly. To achieve that, ClickHouse scans the inverted index dictionary instead of performing a full-table scan to find the matching pattern.
When the optimization is enabled, LIKE/ILIKE queries should be significantly faster than a full-table scan. However, when the pattern matches most dictionary tokens, the performance can be worse compared to a full-table scan. Luckily, there is a fallback mechanism to prevent that.
The optimization is controlled by a setting:
The fallback mechanism is controlled by two settings:
This optimization supports only functions like and ilike.
Caching
Different caches are available to buffer parts of the text index in memory (see section Implementation Details). Currently, there are caches for the deserialized headers, tokens, and posting lists of the text index to reduce I/O. They can be enabled via settings use_text_index_header_cache, use_text_index_tokens_cache, and use_text_index_postings_cache. By default, all caches are disabled. To clear the caches, use statement SYSTEM CLEAR TEXT INDEX CACHES. Please refer to the following server settings to configure the caches.

Tokens cache settings
| Setting | Description |
|---|---|
| text_index_tokens_cache_policy | Text index tokens cache policy name. |
| text_index_tokens_cache_size | Maximum cache size in bytes. |
| text_index_tokens_cache_max_entries | Maximum number of deserialized tokens in cache. |
| text_index_tokens_cache_size_ratio | The size of the protected queue in the text index tokens cache relative to the cache’s total size. |
Header cache settings
| Setting | Description |
|---|---|
| text_index_header_cache_policy | Text index header cache policy name. |
| text_index_header_cache_size | Maximum cache size in bytes. |
| text_index_header_cache_max_entries | Maximum number of deserialized headers in cache. |
| text_index_header_cache_size_ratio | The size of the protected queue in the text index header cache relative to the cache’s total size. |
Posting lists cache settings
| Setting | Description |
|---|---|
| text_index_postings_cache_policy | Text index postings cache policy name. |
| text_index_postings_cache_size | Maximum cache size in bytes. |
| text_index_postings_cache_max_entries | Maximum number of deserialized postings in cache. |
| text_index_postings_cache_size_ratio | The size of the protected queue in the text index postings cache relative to the cache’s total size. |
Limitations
The text index currently has the following limitations:
- The materialization of text indexes with a high number of tokens (e.g. 10 billion tokens) can consume significant amounts of memory. Text index materialization can happen directly (ALTER TABLE <table> MATERIALIZE INDEX <index>) or indirectly in part merges.
- It is not possible to materialize text indexes on parts with more than 4,294,967,296 (= 2^32, ca. 4.3 billion) rows. Without a materialized text index, queries fall back to slow brute-force search within the part. As a worst-case estimation, assume a part contains a single column of type String and MergeTree setting max_bytes_to_merge_at_max_space_in_pool (default: 150 GB) was not changed. In this case, the situation happens if the column contains fewer than 29.5 characters per row on average. In practice, tables also contain other columns, and the threshold is multiple times smaller than that (depending on the number, type, and size of the other columns).
Text Indexes vs Bloom-Filter-Based Indexes
String predicates can be sped up using text indexes and bloom-filter-based indexes (index types bloom_filter, ngrambf_v1, tokenbf_v1, sparse_grams), yet both are fundamentally different in their design and intended use cases:
Bloom filter indexes
- Are based on probabilistic data structures which may produce false positives.
- Are only able to answer set membership questions, i.e. the column may contain token X vs. definitely does not contain X.
- Store granule-level information to enable skipping coarse ranges during query execution.
- Are hard to tune properly (see here for an example).
- Are rather compact (a few kilobytes or megabytes per part).
Text indexes
- Build a deterministic inverted index over tokens. No false positives are possible by the index itself.
- Are specifically optimized for text search workloads.
- Store row-level information which enables efficient term lookup.
- Are rather large (dozens to hundreds of megabytes per part).
Compared to text indexes, bloom-filter-based indexes have several drawbacks:
- They do not support advanced tokenization and preprocessing.
- They do not support multi-token search.
- They do not provide the performance characteristics expected from an inverted index.
In contrast, text indexes have the following advantages:
- They provide tokenization and preprocessing.
- They provide efficient support for hasAllTokens, LIKE, match, and similar text-search functions.
- They have significantly better scalability for large text corpora.
Implementation Details
Each text index consists of two (abstract) data structures:
- a dictionary which maps each token to a postings list, and
- a set of postings lists, each representing a set of row numbers.

Dictionary blocks file (.dct)
The dictionary is split into blocks of a configurable size (index parameter dictionary_block_size). A dictionary blocks file (.dct) contains all the dictionary blocks of all index granules in a part.
Index header file (.idx)
The index header file contains for each dictionary block the block’s first token and its relative offset in the dictionary blocks file.
This sparse index structure is similar to ClickHouse’s sparse primary key index.
Postings lists file (.pst)
The posting lists for all tokens are laid out sequentially in the postings list file.
To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps.
If a posting list is larger than posting_list_block_size, it is split into multiple blocks that are stored sequentially in the postings lists file.
Merging of text indexes
When data parts are merged, the text index does not need to be rebuilt from scratch; instead, it can be merged efficiently in a separate step of the merge process.
During this step, the sorted dictionaries of the text indexes of each input part are read and combined into a new unified dictionary.
The row numbers in the postings lists are also recalculated to reflect their new positions in the merged data part, using a mapping of old to new row numbers that is created during the initial merge phase.
This method of merging text indexes is similar to how projections with _part_offset column are merged.
If the index is not materialized in a source part, it is built, written into a temporary file, and then merged together with the indexes from the other parts and from other temporary index files.
Debugging
Table function mergeTreeTextIndex can be used to introspect text indexes.
Example: Hackernews dataset
Let’s look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments from the popular Hacker News website. Here is the table without a text index:
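A sketch of the table definition (column set assumed; only the comment column matters for the examples below):

```sql
CREATE TABLE hackernews
(
    id UInt64,
    author String,
    comment String
)
ENGINE = MergeTree
ORDER BY id;
```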
We then use ALTER TABLE to add a text index on the comment column and materialize it:
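A sketch of these two statements (index name assumed):

```sql
ALTER TABLE hackernews ADD INDEX comment_idx(comment) TYPE text(tokenizer = 'splitByNonAlpha');
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx;
```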
The examples below use the hasToken, hasAnyTokens, and hasAllTokens functions.
The following examples will show the dramatic performance difference between a standard index scan and the direct read optimization.
Using hasToken
hasToken checks if the text contains a specific single token.
We’ll search for the case-sensitive token ‘ClickHouse’.
Direct read disabled (Standard scan)
By default, ClickHouse uses the skip index to filter granules and then reads the column data for those granules.
We can simulate this behavior by disabling direct read.
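A sketch of both runs (hackernews table as above):

```sql
-- standard scan: skip index filters granules, column data is still read
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;

-- direct read enabled (default): answered from the index data
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse');
```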
Using hasAnyTokens
hasAnyTokens checks if the text contains at least one of the given tokens.
We’ll search for comments containing either ‘love’ or ‘ClickHouse’.
Direct read disabled (Standard scan)
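A sketch of the standard-scan run (hackernews table as above):

```sql
SELECT count() FROM hackernews
WHERE hasAnyTokens(comment, ['love', 'ClickHouse'])
SETTINGS query_plan_direct_read_from_text_index = 0;
```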
Using hasAllTokens
hasAllTokens checks if the text contains all of the given tokens.
We’ll search for comments containing both ‘love’ and ‘ClickHouse’.
Direct read disabled (Standard scan)
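A sketch of the standard-scan run (hackernews table as above):

```sql
SELECT count() FROM hackernews
WHERE hasAllTokens(comment, ['love', 'ClickHouse'])
SETTINGS query_plan_direct_read_from_text_index = 0;
```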
Even with direct read disabled, the standard skip index is still effective.
It filters down the 28.7M rows to just 147.46K rows, but it still must read 57.03 MB from the column.
Compound search: OR, AND, NOT, …
The direct read optimization also applies to compound boolean expressions.
Here, we’ll perform a case-insensitive search for ‘ClickHouse’ OR ‘clickhouse’.
Direct read disabled (Standard scan)
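A sketch of the compound standard-scan run (hackernews table as above):

```sql
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 0;
```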
Note that hasAnyTokens(comment, ['ClickHouse', 'clickhouse']) would be the preferred, more efficient syntax.
Related content
- Presentation: https://github.com/ClickHouse/clickhouse-presentations/blob/master/2025-tumuchdata-munich/ClickHouse_%20full-text%20search%20-%2011.11.2025%20Munich%20Database%20Meetup.pdf
- Presentation: https://presentations.clickhouse.com/2026-fosdem-inverted-index/Inverted_indexes_the_what_the_why_the_how.pdf