Full-text search with text indexes

Text indexes (also known as inverted indexes) enable fast full-text search on text data. A text index stores a mapping from tokens to the row numbers which contain each token. Tokens are generated by a process called tokenization. For example, ClickHouse’s default tokenizer converts the English sentence “The cat likes mice.” to tokens [“The”, “cat”, “likes”, “mice”]. As an example, assume a table with a single column and three rows

The cat likes mice.
Mice are afraid of dogs.
I have two dogs and a cat.

The corresponding tokens are:

The, cat, likes, mice
Mice, are, afraid, of, dogs
I, have, two, dogs, and, a, cat

We usually like to search case-insensitively, therefore we lower-case the tokens:

the, cat, likes, mice
mice, are, afraid, of, dogs
i, have, two, dogs, and, a, cat

We will also remove fill words such as “I”, “the” and “and” as they occur in almost every row:

cat, likes, mice
mice, afraid, dogs
have, two, dogs, cat

A text index then (conceptually) contains this information:

afraid : [2]
cat    : [1, 3]
dogs   : [2, 3]
have   : [3]
likes  : [1]
mice   : [1]
two    : [3]

Given a search token, this index structure allows to find all matching rows quickly.

Creating a Text Index

Text indexes are generally available (GA) in ClickHouse version 26.2 and newer. In these versions, no special settings need to be configured to use the text index. We strongly recommend using ClickHouse versions >= 26.2 for production use cases.

Text indexes can be used with any ClickHouse version >= 26.2, regardless of the compatibility setting.

To create a text index use the following syntax:

CREATE TABLE table
(
    key UInt64,
    str String,
    INDEX text_idx(str) TYPE text(
                                -- Mandatory parameters:
                                tokenizer = splitByNonAlpha
                                            | splitByString[(S)]
                                            | asciiCJK
                                            | ngrams[(N)]
                                            | sparseGrams[(min_length[, max_length[, min_cutoff_length]])]
                                            | array
                                -- Optional parameters:
                                [, preprocessor = expression(str)]
                                -- Optional advanced parameters:
                                [, dictionary_block_size = D]
                                [, dictionary_block_frontcoding_compression = B]
                                [, posting_list_block_size = C]
                                [, posting_list_codec = 'none' | 'bitpacking' ]
                            )
)
ENGINE = MergeTree
ORDER BY key

Text indexes can be defined on columns of these types:

String and FixedString,
Array(String) and Array(FixedString),
Map (via mapKeys and mapValues functions), and
JSON (via JSONAllPaths and JSONAllValues functions).

Columns of type Nullable(T) and LowCardinality() are also supported, including Array(Nullable(String or FixedString)). Alternatively, to add a text index to an existing table:

ALTER TABLE table
    ADD INDEX text_idx(str) TYPE text(
                                -- Mandatory parameters:
                                tokenizer = splitByNonAlpha
                                            | splitByString[(S)]
                                            | asciiCJK
                                            | ngrams[(N)]
                                            | sparseGrams[(min_length[, max_length[, min_cutoff_length]])]
                                            | array
                                -- Optional parameters:
                                [, preprocessor = expression(str)]
                                -- Optional advanced parameters:
                                [, dictionary_block_size = D]
                                [, dictionary_block_frontcoding_compression = B]
                                [, posting_list_block_size = C]
                                [, posting_list_codec = 'none' | 'bitpacking' ]
                            )

If you add an index to an existing table, we recommend materializing the index for existing table parts (otherwise search on parts without index will fall back to slow brute-force scans).

ALTER TABLE table MATERIALIZE INDEX text_idx SETTINGS mutations_sync = 2;

To remove a text index, please run

ALTER TABLE table DROP INDEX text_idx;

Tokenizer argument (mandatory). The tokenizer argument specifies the tokenizer:

splitByNonAlpha splits strings along non-alphanumeric ASCII characters (see function splitByNonAlpha).
splitByString(S) splits strings along certain user-defined separator strings S (see function splitByString). The separators can be specified using an optional parameter, for example, tokenizer = splitByString([', ', '; ', '\n', '\\']). Note that each string can consist of multiple characters (', ' in the example). The default separator list, if not specified explicitly (for example, tokenizer = splitByString), is a single whitespace [' '].
asciiCJK splits strings into tokens using Unicode word boundary rules (similar to Unicode Text Segmentation (UAX #29)). ASCII alphanumeric characters and underscores form tokens with connectors (ASCII : for letters, . and ' for same-type characters). Non-ASCII Unicode characters, including CJK characters, become single-character tokens.
ngrams(N) splits strings into equally large N-grams (see function ngrams). The ngram length can be specified using an optional integer parameter between 1 and 8, for example, tokenizer = ngrams(3). The default ngram size, if not specified explicitly (for example, tokenizer = ngrams), is 3.
sparseGrams(min_length, max_length, min_cutoff_length) splits strings into variable-length n-grams of at least min_length and at most max_length (inclusive) characters (see function sparseGrams). Unless specified explicitly, min_length and max_length default to 3 and 100. If parameter min_cutoff_length is provided, only n-grams with length greater or equal than min_cutoff_length are returned. Compared to ngrams(N), the sparseGrams tokenizer produces variable-length N-grams, allowing for a more flexible representation of the original text. For example, tokenizer = sparseGrams(3, 5, 4) internally generates 3-, 4-, 5-grams from the input string but only the 4- and 5-grams are returned.
array performs no tokenization, i.e. every row value is a token (see function array).

All available tokenizers are listed in system.tokenizers.

The splitByString tokenizer applies the split separators left-to-right. This can create ambiguities. For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas switching both separators strings ['%', '%21'] will output ['21abc']. In the most cases, you want that matching prefers longer separators first. This can generally be done by passing the separator strings in order of descending length. If the separator strings happen to form a prefix code, they can be passed in arbitrary order.

To understand how a tokenizer split the input string, you can use the tokens and tokensForLikePattern functions: Example:

SELECT tokens('abc def', 'ngrams', 3);

Result:

['abc','bc ','c d',' de','def']

Working with non-ASCII inputs. Text indexes can be built on top of text data in any language and character set. For non-ASCII text, the asciiCJK tokenizer is recommended as it correctly handles Unicode word boundaries including CJK characters. ::: Preprocessor argument (optional). The preprocessor refers to an expression which is applied to the input string before tokenization. Typical use cases for the preprocessor argument include

Lower/upper-casing, or case folding to enable case-insensitive matching, e.g., lower, lowerUTF8, caseFoldUTF8.
UTF-8 normalization, e.g. normalizeUTF8NFC, normalizeUTF8NFD, normalizeUTF8NFKC, normalizeUTF8NFKD, normalizeUTF8NFKCCasefold, toValidUTF8.
Removing or transforming unwanted characters or substrings, such as accents e.g. extractTextFromHTML, substring, idnaEncode, translate, removeDiacriticsUTF8.

The preprocessor expression must transform an input value of type String or FixedString to a value of the same type. If the text index was build on a column of type Nullable(T) or LowCardinality(T) column, then the preprocessor expression should accept nullable or low-cardinality values (i.e. not throw an exception). Examples:

INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = substringIndex(col, '\n', 1))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(extractTextFromHTML(col)))
INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = removeDiacriticsUTF8(caseFoldUTF8(col)))

Also, the preprocessor expression must only reference the column or expression on top of which the text index is defined. Examples:

INDEX idx(lower(col)) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = upper(lower(col)))
INDEX idx(lower(col)) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = concat(lower(col), lower(col)))
Not allowed: INDEX idx(lower(col)) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = concat(col, col))

Using non-deterministic functions is disallowed. Functions hasToken, hasAllTokens and hasAnyTokens use the preprocessor to first transform the search term before tokenizing it. For example,

CREATE TABLE table
(
    str String,
    INDEX idx(str) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(str))
)
ENGINE = MergeTree
ORDER BY tuple();

SELECT count() FROM table WHERE hasToken(str, 'Foo');

is equivalent to:

CREATE TABLE table
(
    str String,
    INDEX idx(lower(str)) TYPE text(tokenizer = 'splitByNonAlpha')
)
ENGINE = MergeTree
ORDER BY tuple();

SELECT count() FROM table WHERE hasToken(str, lower('Foo'));

In this case, the preprocessor expression transforms the array elements individually. Example:

CREATE TABLE table
(
    arr Array(String),
    INDEX idx arr TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(arr))

    -- This is not legal:
    INDEX idx_illegal arr TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = arraySort(arr))
)
ENGINE = MergeTree
ORDER BY tuple();

SELECT count() FROM tab WHERE hasAllTokens(arr, 'foo');

To define a preprocessor in a text index on build Map-type columns, users need to decide if the index is build on the map keys or values. Example:

CREATE TABLE table
(
    map Map(String, String),
    INDEX idx mapKeys(map)  TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(mapKeys(map)))
)
ENGINE = MergeTree
ORDER BY tuple();

SELECT count() FROM tab WHERE hasAllTokens(mapKeys(map), 'foo');

Other arguments (optional). Index granularity. Text indexes are implemented within ClickHouse as a type of skip indexes. However, unlike other skip indexes, text indexes use an infinite granularity (100 million). This can be seen in the table definition of a text index. Example:

CREATE TABLE table(
    k UInt64,
    s String,
    INDEX idx(s) TYPE text(tokenizer = ngrams(2)))
ENGINE = MergeTree()
ORDER BY k;

SHOW CREATE TABLE table;

Result:

┌─statement──────────────────────────────────────────────────────────────┐
│ CREATE TABLE default.table                                            ↴│
│↳(                                                                     ↴│
│↳    `k` UInt64,                                                       ↴│
│↳    `s` String,                                                       ↴│
│↳    INDEX idx s TYPE text(tokenizer = ngrams(2)) GRANULARITY 100000000↴│ <-- here
│↳)                                                                     ↴│
│↳ENGINE = MergeTree                                                    ↴│
│↳ORDER BY k                                                            ↴│
│↳SETTINGS index_granularity = 8192                                      │
└────────────────────────────────────────────────────────────────────────┘

The huge index granularity ensures that the text index is created for the entire part. An explicitly specified index granularity is ignored.

Using a Text Index

Using a text index in SELECT queries is straightforward as common string search functions will leverage the index automatically. If no index exists on a column or table part, the string search functions will fall back to slow brute-force scans.

We recommend using functions hasAnyTokens and hasAllTokens to search the text index, please see below. These functions work with all available tokenizers and all possible preprocessor expressions. As the other supported functions historically preceded the text index, they had to retain their legacy behavior in many cases (e.g. no preprocessor support).

Supported functions

The text index can be used if text functions are used in the WHERE clause or PREWHERE clauses:

SELECT [...]
FROM [...]
WHERE string_search_function(column_with_text_index)

`=`

= (equals) matches the entire given search term. Example:

SELECT * from table WHERE str = 'Hello';

`IN`

IN (in) is similar to equals but matches all search terms. Example:

SELECT * from table WHERE str IN ('Hello', 'World');

NOT IN (notIn) is not supported by the text index.

`LIKE` and `match`

These functions currently use the text index for filtering only if the index tokenizer is either splitByNonAlpha, ngrams or sparseGrams.

NOT LIKE (notLike) is not supported by the text index.

In order to use LIKE (like) and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term. For the index with ngrams tokenizer, this is the case if the length of the searched strings between wildcards is equal or longer than the ngram length. Example for the text index with splitByNonAlpha tokenizer:

SELECT count() FROM table WHERE comment LIKE 'support%';

support in the example could match support, supports, supporting etc. This kind of query is a substring query and it cannot be sped up by a text index. To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:

SELECT count() FROM table WHERE comment LIKE ' support %'; -- or `% support %`

The spaces left and right of support make sure that the term can be extracted as a token. Fortunately, there is a special case where ClickHouse can leverage the inverted index to speed up LIKE queries significantly. See the LIKE/ILIKE performance tuning section for details.

`startsWith` and `endsWith`

Similar to LIKE, functions startsWith and endsWith can only use a text index, if complete tokens can be extracted from the search term. For the index with ngrams tokenizer, this is the case if the length of the searched strings between wildcards is equal or longer than the ngram length. Example for the text index with splitByNonAlpha tokenizer:

SELECT count() FROM table WHERE startsWith(comment, 'clickhouse support');

In the example, only clickhouse is considered a token. support is no token because it can match support, supports, supporting etc. To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:

startsWith(comment, 'clickhouse supports ')`

Similarly, endsWith should be used with a leading space:

SELECT count() FROM table WHERE endsWith(comment, ' olap engine');

`hasToken` and `hasTokenOrNull`

Function hasToken looks straightforward to use but it has certain pitfalls with non-default tokenizers and preprocessor expressions. We recommend using functions hasAnyTokens and hasAllTokens instead.

Functions hasToken and hasTokenOrNull match against a single given token. Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token). Example:

SELECT count() FROM table WHERE hasToken(comment, 'clickhouse');

`hasAnyTokens` and `hasAllTokens`

Functions hasAnyTokens and hasAllTokens match against one or all of the given tokens. These two functions accept the search tokens as either a string which will be tokenized using the same tokenizer used for the index column, or as an array of already processed tokens to which no tokenization will be applied prior to searching. See the function documentation for more info. Example:

-- Search tokens passed as string argument
SELECT count() FROM table WHERE hasAnyTokens(comment, 'clickhouse olap');
SELECT count() FROM table WHERE hasAllTokens(comment, 'clickhouse olap');

-- Search tokens passed as Array(String)
SELECT count() FROM table WHERE hasAnyTokens(comment, ['clickhouse', 'olap']);
SELECT count() FROM table WHERE hasAllTokens(comment, ['clickhouse', 'olap']);

`hasPhrase`

Function hasPhrase matches against a phrase: all tokens must appear consecutively and in the same order as in the search string. Unlike hasAllTokens, which only requires all tokens to be present somewhere, hasPhrase requires them to appear as a consecutive sequence. The search phrase is tokenized using the same tokenizer configured for the index column. Note that the function requires one of the splitByNonAlpha, splitByString, ngrams, or asciiCJK tokenizers. Example:

-- Matches: 'clickhouse' and 'olap' must appear consecutively in that order
SELECT count() FROM table WHERE hasPhrase(comment, 'clickhouse olap');

-- Does NOT match a row containing 'olap clickhouse' (wrong order)
-- Does NOT match a row containing 'clickhouse fast olap' (non-consecutive)

`has`

Array function has matches against a single token in the array of strings. Example:

SELECT count() FROM table WHERE has(array, 'clickhouse');

`hasAny` and `hasAll`

Array functions hasAny and hasAll test whether the indexed array column contains any or all of a constant set of needle strings. Example:

SELECT count() FROM table WHERE hasAny(tags, ['clickhouse', 'olap']);
SELECT count() FROM table WHERE hasAll(tags, ['clickhouse', 'olap']);

`mapContains`

Function mapContains (an alias of mapContainsKey) matches against tokens extracted from the searched string in the keys of a map. The behaviour is similar to the equals function with a String column. The text index is only used if it was created on a mapKeys(map) expression. Example:

SELECT count() FROM table WHERE mapContainsKey(map, 'clickhouse');
-- OR
SELECT count() FROM table WHERE mapContains(map, 'clickhouse');

`mapContainsValue`

Function mapContainsValue matches against tokens extracted from the searched string in the values of a map. The behaviour is similar to the equals function with a String column. The text index is only used if it was created on a mapValues(map) expression. Example:

SELECT count() FROM table WHERE mapContainsValue(map, 'clickhouse');

`mapContainsKeyLike` and `mapContainsValueLike`

The functions mapContainsKeyLike and mapContainsValueLike match a pattern against all keys or values (respectively) of a map. Example:

SELECT count() FROM table WHERE mapContainsKeyLike(map, '% clickhouse %');
SELECT count() FROM table WHERE mapContainsValueLike(map, '% clickhouse %');

`operator[]`

Access operator[] can be used with the text index to filter out keys and values. The text index is only used if it is created on mapKeys(map) or mapValues(map) expressions, or both. Example:

SELECT count() FROM table WHERE map['engine'] = 'clickhouse';

See the following examples for using columns of type Array(T) and Map(K, V) with the text index.

Indexing Array(String) columns

Imagine a blogging platform, where authors categorize their blog posts using keywords. We like users to discover related content by searching for or clicking on topics. Consider this table definition:

CREATE TABLE posts
(
    post_id UInt64,
    title String,
    content String,
    keywords Array(String)
)
ENGINE = MergeTree
ORDER BY (post_id);

Without a text index, finding posts with a specific keyword (e.g. clickhouse) requires scanning all entries:

SELECT count() FROM posts WHERE has(keywords, 'clickhouse'); -- slow full-table scan - checks every keyword in every post

As the platform grows, this becomes increasingly slow because the query must examine every keywords array in every row. To overcome this performance issue, we define a text index for column keywords:

ALTER TABLE posts ADD INDEX keywords_idx(keywords) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE posts MATERIALIZE INDEX keywords_idx; -- Don't forget to rebuild the index for existing data

Indexing Map columns

In many observability use cases, log messages are split into “components” and stored as appropriate data types, e.g. date time for the timestamp, enum for the log level etc. Metrics fields are best stored as key-value pairs. Operations teams need to efficiently search through logs for debugging, security incidents, and monitoring. Consider this logs table:

CREATE TABLE logs
(
    id UInt64,
    timestamp DateTime,
    message String,
    attributes Map(String, String)
)
ENGINE = MergeTree
ORDER BY (timestamp);

Without a text index, searching through Map data requires full table scans:

-- Finds all logs with rate limiting data:
SELECT * FROM logs WHERE has(mapKeys(attributes), 'rate_limit'); -- slow full-table scan

-- Finds all logs from a specific IP:
SELECT * FROM logs WHERE has(mapValues(attributes), '192.168.1.1'); -- slow full-table scan

As log volume grows, these queries become slow. The solution is creating a text index for the Map keys and values. Use mapKeys to create a text index when you need to find logs by field names or attribute types:

ALTER TABLE logs ADD INDEX attributes_keys_idx mapKeys(attributes) TYPE text(tokenizer = array);
ALTER TABLE posts MATERIALIZE INDEX attributes_keys_idx;

Use mapValues to create a text index when you need to search within the actual content of attributes:

ALTER TABLE logs ADD INDEX attributes_vals_idx mapValues(attributes) TYPE text(tokenizer = array);
ALTER TABLE posts MATERIALIZE INDEX attributes_vals_idx;

Example queries:

-- Find all rate-limited requests:
SELECT * FROM logs WHERE mapContainsKey(attributes, 'rate_limit'); -- fast

-- Finds all logs from a specific IP:
SELECT * FROM logs WHERE has(mapValues(attributes), '192.168.1.1'); -- fast

-- Finds all logs where any attribute includes an error:
SELECT * FROM logs WHERE mapContainsValueLike(attributes, '% error %'); -- fast

Indexing JSON columns

Text indexes can be used with JSON columns in three ways:

Indexes on specific subcolumns — create a text index on a known JSON path, just like on a regular column. This indexes the values at that path.
Path-based indexes with JSONAllPaths — indexes all paths present in each granule to skip granules that cannot contain the queried path. Similar to Map columns.
Value-based indexes with JSONAllValues — indexes all values across all JSON paths to accelerate full-text search on any JSON subcolumn with a single index.

Indexes on specific subcolumns

You can create a skip index on any JSON subcolumn using the same syntax as for regular columns. There are two ways to reference a JSON subcolumn in an index expression:

Typed path declared in the JSON type hint — access by name directly: json.a.
Dynamic path with explicit cast — use the :: cast syntax: json.b::String.

Example index definition:

CREATE TABLE sensor_data
(
    data JSON(sensor_id String),
    INDEX idx_sensor data.sensor_id TYPE text(tokenizer = splitByNonAlpha),
    INDEX idx_location data.location::String TYPE text(tokenizer = splitByNonAlpha)
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS index_granularity = 1;

INSERT INTO sensor_data SELECT toJSONString(map('sensor_id', 'id_' || number , 'location', 'room_' || toString(number))) FROM numbers(4);
INSERT INTO sensor_data SELECT toJSONString(map('sensor_id', 'id_' || number, 'location', 'room_' || toString(number))) FROM numbers(4, 4);

Example query:

EXPLAIN indexes = 1 SELECT * FROM sensor_data WHERE data.sensor_id = 'id_5';

Result:

...
    Indexes:
      Skip
        Name: idx_sensor
        Description: text
        Condition: (mode: All; tokens: ["5", "id"])
        Parts: 1/2
        Granules: 1/8

Example query:

EXPLAIN indexes = 1 SELECT * FROM sensor_data WHERE data.location::String = 'room_5';

Result:

...
    Indexes:
      Skip
        Name: idx_location
        Description: text
        Condition: (mode: All; tokens: ["5", "room"])
        Parts: 1/2
        Granules: 1/8

Path-based indexes with JSONAllPaths

Similar to Map columns, text indexes can be created on JSON columns using JSONAllPaths. The index stores the set of JSON paths present in each granule and uses them to skip granules where a queried path is absent. Example index definition:

CREATE TABLE events
(
    data JSON,
    INDEX idx JSONAllPaths(data) TYPE text(tokenizer = array)
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO events VALUES ('{"user": {"name": "Alice"}, "action": "login"}');
INSERT INTO events VALUES ('{"metric": {"cpu": 0.95}, "host": "srv1"}');

You can use EXPLAIN indexes = 1 to verify that the skip index is being used. When a path exists only in one part, the index skips the other part. Example:

EXPLAIN indexes = 1 SELECT * FROM events WHERE data.user.name = 'Alice';

Result:

...
    Indexes:
      Skip
        Name: idx
        Description: text
        Condition: (mode: All; tokens: ["user.name"])
        Parts: 1/2
        Granules: 1/2

When a path does not exist in any part, all parts and granules are skipped. Example:

EXPLAIN indexes = 1 SELECT * FROM events WHERE data.nonexistent = 1;

Result:

Response

...
    Indexes:
      Skip
        Name: idx
        Description: text
        Condition: (mode: All; tokens: ["nonexistent"])
        Parts: 0/2
        Granules: 0/2

IS NOT NULL also uses the index — it skips granules where the path is absent (since the value would be NULL): Example:

EXPLAIN indexes = 1 SELECT * FROM events WHERE data.user.name IS NOT NULL;

Result:

...
    Indexes:
      Skip
        Name: idx
        Description: text
        Condition: (mode: All; tokens: ["user.name"])
        Parts: 1/2
        Granules: 1/2

Value-based indexes with JSONAllValues

Text indexes can be used to accelerate searches on JSON columns via function JSONAllValues. JSONAllValues returns all values from a JSON column as Array(String). Values of non-string datatypes (e.g. integers and arrays) are converted to their text representation. A text index build using JSONAllValues indexes these text representations across all JSON paths in each row. This index can then accelerate queries that filter on individual JSON subcolumns. When a query filters on a specific subcolumn (e.g. data.user_name = 'alice'), the text index can quickly skip rows (and granules) that do not contain the search tokens in any of their JSON values.

The index may produce false positives when different JSON paths contain the same tokens. For example, if row 1 has {"a": "hello", "b": "world"} and a query searches for data.a = 'world', the text index cannot distinguish that world belongs to path b, not a. In such cases, the index will not skip the row, and the filter on the actual column data will handle the final evaluation. This is the same behavior as with other text index use cases where the index acts as a fast pre-filter.

Example index definition:

CREATE TABLE events
(
    id UInt64,
    data JSON,
    INDEX json_idx JSONAllValues(data) TYPE text(tokenizer = splitByNonAlpha)
)
ENGINE = MergeTree
ORDER BY id;

Once the index is created, it can speed up queries on JSON subcolumns using the same functions as for String columns and function equals for all columns. Subcolumn access:

SELECT * FROM events WHERE data.user_name = 'alice';
SELECT * FROM events WHERE data.message LIKE '% error %';
SELECT * FROM events WHERE startsWith(data.status, 'fail');
SELECT * FROM events WHERE hasToken(data.title, 'clickhouse');

Subcolumn access with explicit CAST:

SELECT * FROM events WHERE hasAllTokens(data.message::String, 'connection timeout');
SELECT * FROM events WHERE data.status_code::UInt64 = 404;
SELECT * FROM events WHERE has(data.tags::Array(String), 'bug')

IN operator:

SELECT * FROM events WHERE data.level IN ('error', 'critical');

Phrase search

Text index supports phrase search via the hasPhrase function. All tokens in the phrase must appear consecutively and in the same order in the document. The text index accelerates phrase search by intersecting the posting lists for all tokens in the phrase to identify candidate granules. Within those granules, ClickHouse then verifies exact token adjacency. hasPhrase is supported with tokenizers splitByNonAlpha, splitByString, ngrams, and asciiCJK. The phrase string is tokenized using the index’s configured tokenizer. Tokenizer separator characters in the phrase are ignored: hasPhrase(text, 'quick+brown') is equivalent to hasPhrase(text, 'quick brown') for the splitByNonAlpha tokenizer.

Example

CREATE TABLE tab (
    id UInt32,
    text String,
    INDEX idx(text) TYPE text(tokenizer = splitByNonAlpha)
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO tab VALUES
    (1, 'weather in New York'),
    (2, 'New weather in York'),
    (3, 'weather in New Orleans');

SELECT id, text FROM tab WHERE hasPhrase(text, 'weather in New York');

Result:

   ┌─id─┬─text────────────────┐
1. │  1 │ weather in New York │
   └────┴─────────────────────┘

Row 2 ('New weather in York') does not match because the tokens are in the wrong order. Row 3 ('weather in New Orleans') does not match because it does not contain the token 'York'.

Performance Tuning

Direct read

Certain types of text queries can be speed up significantly by an optimization called “direct read”. Example:

SELECT column_a, column_b, ...
FROM [...]
WHERE string_search_function(column_with_text_index)

The direct read optimization answers the query exclusively using the text index (i.e., text index lookups) without accessing the underlying text column. Text index lookups read relatively little data and are therefore much faster than usual skip indexes in ClickHouse (which do a skip index lookup, followed by loading and filtering remaining granules). Direct read is controlled by two settings:

Setting query_plan_direct_read_from_text_index (true by default) which specifies if direct read is generally enabled.
Setting use_skip_indexes_on_data_read was a prerequisite for direct read in ClickHouse versions < 26.4.

Supported functions The direct read optimization supports functions hasToken, hasAllTokens, and hasAnyTokens. If the text index is defined with an array tokenizer, direct read is also supported for functions equals, has, hasAny, hasAll, mapContainsKey, and mapContainsValue. These functions can also be combined by AND, OR, and NOT operators. The WHERE or PREWHERE clauses can also contain additional non-text-search-functions filters (for text columns or other columns) - in that case, the direct read optimization will still be used but less effective (it only applies to the supported text search functions). To understand a query utilizes direct read, run the query with EXPLAIN PLAN actions = 1. As an example, a query with disabled direct read

EXPLAIN PLAN actions = 1
SELECT count()
FROM table
WHERE hasToken(col, 'some_token')
SETTINGS query_plan_direct_read_from_text_index = 0, -- disable direct read

returns

[...]
Filter ((WHERE + Change column names to column identifiers))
Filter column: hasToken(__table1.col, 'some_token'_String) (removed)
Actions: INPUT : 0 -> col String : 0
         COLUMN Const(String) -> 'some_token'_String String : 1
         FUNCTION hasToken(col :: 0, 'some_token'_String :: 1) -> hasToken(__table1.col, 'some_token'_String) UInt8 : 2
[...]

whereas the same query run with query_plan_direct_read_from_text_index = 1

EXPLAIN PLAN actions = 1
SELECT count()
FROM table
WHERE hasToken(col, 'some_token')
SETTINGS query_plan_direct_read_from_text_index = 1, -- enable direct read

returns

[...]
Expression (Before GROUP BY)
Positions:
  Filter
  Filter column: __text_index_idx_hasToken_94cc2a813036b453d84b6fb344a63ad3 (removed)
  Actions: INPUT :: 0 -> __text_index_idx_hasToken_94cc2a813036b453d84b6fb344a63ad3 UInt8 : 0
[...]

The second EXPLAIN PLAN output contains a virtual column __text_index_<index_name>_<function_name>_<id>. If this column is present, then direct read is used. If the WHERE filter clause only contains text search functions, the query can avoid reading the column data entirely and has the greatest performance benefit by direct read. However, even if the text column is accessed elsewhere in the query, direct read will still provide performance improvement. Direct read as a hint Direct read as a hint is based on the same principles as normal direct read, but instead adds an additional filter build from the text index data without removing the underlying text column. It is used for functions when reading only from the text index would produce false positives. Supported functions are: like, startsWith, endsWith, equals, has, hasPhrase, mapContainsKey, and mapContainsValue. The additional filter can provide additional selectivity to restrict the result set in combination with other filters further, helping to reduce the amount of data read from other columns. Direct read as a hint is controlled by setting query_plan_text_index_add_hint (enabled by default). Example of query without hint:

EXPLAIN actions = 1
SELECT count()
FROM table
WHERE (col LIKE '%some-token%') AND (d >= today())
SETTINGS query_plan_text_index_add_hint = 0
FORMAT TSV

returns

[...]
Prewhere filter column: and(like(__table1.col, \'%some-token%\'_String), greaterOrEquals(__table1.d, _CAST(20440_Date, \'Date\'_String))) (removed)
[...]

whereas the same query run with query_plan_text_index_add_hint = 1

EXPLAIN actions = 1
SELECT count()
FROM table
WHERE col LIKE '%some-token%'
SETTINGS query_plan_text_index_add_hint = 1

returns

[...]
Prewhere filter column: and(__text_index_idx_col_like_d306f7c9c95238594618ac23eb7a3f74, like(__table1.col, \'%some-token%\'_String), greaterOrEquals(__table1.d, _CAST(20440_Date, \'Date\'_String))) (removed)
[...]

In the second EXPLAIN PLAN output, you can see that an additional conjunct (__text_index_...) has been added to the filter condition. Thanks to the PREWHERE optimization, the filter condition is broken down into three separate conjuncts, which are applied in order of increasing computational complexity. For this query, the application order is __text_index_..., then greaterOrEquals(...), and finally like(...). This ordering enables skipping even more data granules than the granules skipped by the text index and the original filter, before reading the heavy columns used in the query after WHERE clause further reducing the amount of data to read.

LIKE/ILIKE queries

When a LIKE/ILIKE query pattern is %<alpha-numeric-characters-without-spaces>% and the text index tokenizer is splitByNonAlpha or array, ClickHouse leverages the inverted index to speed up LIKE/ILIKE queries significantly. To achieve that, ClickHouse scans the inverted index dictionary instead of a full-table scan to find the matching pattern. When the optimization is enabled, LIKE/ILIKE queries should be significantly faster than a full-table scan. However, when the pattern matches most dictionary tokens, the performance can be worse compared to a full-table scan. Luckily, there is a fallback mechanism to prevent that. The optimization is controlled by a setting:

use_text_index_like_evaluation_by_dictionary_scan

The fallback mechanism is controlled by two settings:

This optimization supports only functions like and ilike.

Caching

Different caches are available to buffer parts of the text index in memory (see section Implementation Details): Currently, there are caches for the deserialized headers, tokens, and posting lists of the text index to reduce I/O. They can be enabled via settings use_text_index_header_cache, use_text_index_tokens_cache, and use_text_index_postings_cache. By default, all caches are disabled. To clear the caches, use statement SYSTEM CLEAR TEXT INDEX CACHES Please refer the following server settings to configure the caches.

Tokens cache settings

Setting	Description
text_index_tokens_cache_policy	Text index tokens cache policy name.
text_index_tokens_cache_size	Maximum cache size in bytes.
text_index_tokens_cache_max_entries	Maximum number of deserialized tokens in cache.
text_index_tokens_cache_size_ratio	The size of the protected queue in the text index tokens cache relative to the cache’s total size.

Header cache settings

Setting	Description
text_index_header_cache_policy	Text index header cache policy name.
text_index_header_cache_size	Maximum cache size in bytes.
text_index_header_cache_max_entries	Maximum number of deserialized headers in cache.
text_index_header_cache_size_ratio	The size of the protected queue in the text index header cache relative to the cache’s total size.

Posting lists cache settings

Setting	Description
text_index_postings_cache_policy	Text index postings cache policy name.
text_index_postings_cache_size	Maximum cache size in bytes.
text_index_postings_cache_max_entries	Maximum number of deserialized postings in cache.
text_index_postings_cache_size_ratio	The size of the protected queue in the text index postings cache relative to the cache’s total size.

Limitations

The text index currently has the following limitations:

The materialization of text indexes with a high number of tokens (e.g. 10 billion tokens) can consume significant amounts of memory. Text index materialization can happen directly (ALTER TABLE <table> MATERIALIZE INDEX <index>) or indirectly in part merges.
It is not possible to materialize text indexes on parts with more than 4.294.967.296 (= 2^32 = ca. 4.2 billion) rows. Without a materialized text index, queries fall back to slow brute-force search within the part. As a worst case estimation, assume a part contains a single column of type String and MergeTree setting max_bytes_to_merge_at_max_space_in_pool (default: 150 GB) was not changed. In this case, the situation happens if the column contains less than 29.5 characters per row on average. In practice, tables also contain other columns and the threshold is multiples times smaller than that (depending on the number, type and size of the other columns).

Text Indexes vs Bloom-Filter-Based Indexes

String predicates can be sped up using text indexes and bloom-filter-based based indexes (index type bloom_filter, ngrambf_v1, tokenbf_v1, sparse_grams), yet both are fundamentally different in their design and intended use cases: Bloom filter indexes

Are based on probabilistic data structures which may produce false positives.
Are only able to answer set membership questions, i.e. the column may contain token X vs. definitely does not contain X.
Store granule-level information to enable skipping coarse ranges during query execution.
Are hard to tune properly (see here for an example).
Are rather compact (a few kilobytes or megabytes per part).

Text indexes

Build a deterministic inverted index over tokens. No false positives are possible by the index itself.
Are specifically optimized for text search workloads.
Store row-level information which enables efficient term lookup.
Are rather large (dozens to hundreds of megabytes per part).

Bloom-filter-based indexes support full-text search only as a “side effect”:

They do not support advanced tokenization and preprocessing.
They do not support multi-token search.
They do not provide the performance characteristics expected from an inverted index.

Text indexes, in contrast, are purpose-built for full-text search:

They provide tokenization and preprocessing
They provide efficient support for hasAllTokens, LIKE, match, and similar text-search functions.
They have significantly better scalability for large text corpora.

Implementation Details

Each text index consists of two (abstract) data structures:

a dictionary which maps each token to a postings list, and
a set of postings lists, each representing a set of row numbers.

Text index is built for the whole part. Unlike other skip indexes, text index can be merged instead of rebuilt on merge of the data parts (see below). During index creation, three files are created (per part): Dictionary blocks file (.dct) The tokens in the text index are sorted and stored in dictionary blocks of 512 tokens each (the block size is configurable by parameter dictionary_block_size). A dictionary blocks file (.dct) consists all the dictionary blocks of all index granules in a part. Index header file (.idx) The index header file contains for each dictionary block the block’s first token and its relative offset in the dictionary blocks file. This sparse index structure is similar to ClickHouse’s sparse primary key index). Postings lists file (.pst) The posting lists for all tokens are laid out sequentially in the postings list file. To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps. If the posting list is larger than posting_list_block_size, it is split into multiple blocks that are stored sequentially to the postings lists file. Merging of text indexes When data parts are merged, the text index does not need to be rebuilt from scratch; instead, it can be merged efficiently in a separate step of the merge process. During this step, the sorted dictionaries of the text indexes of each input part are read and combined into a new unified dictionary. The row numbers in the postings lists are also recalculated to reflect their new positions in the merged data part, using a mapping of old to new row numbers that is created during the initial merge phase. This method of merging text indexes is similar to how projections with _part_offset column are merged. If index is not materialized in the source part, it is built, written into a temporary file and then merged together with indexes from the other parts and from other temporary index files. Debugging Table function mergeTreeTextIndex can be used to introspect text indexes.

Example: Hackernews dataset

Let’s look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments on the popular Hacker News website. Here is the table without text index:

CREATE TABLE hackernews (
    id UInt64,
    deleted UInt8,
    type String,
    author String,
    timestamp DateTime,
    comment String,
    dead UInt8,
    parent UInt64,
    poll UInt64,
    children Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32
)
ENGINE = MergeTree
ORDER BY (type, author);

The 28.7M rows are in a Parquet file in S3 - let’s insert them into the hackernews table:

INSERT INTO hackernews
    SELECT * FROM s3Cluster(
        'default',
        'https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet',
        'Parquet',
        '
    id UInt64,
    deleted UInt8,
    type String,
    by String,
    time DateTime,
    text String,
    dead UInt8,
    parent UInt64,
    poll UInt64,
    kids Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32');

We will use ALTER TABLE and add a text index on comment column, then materialize it:

-- Add the index
ALTER TABLE hackernews ADD INDEX comment_idx(comment) TYPE text(tokenizer = splitByNonAlpha);

-- Materialize the index for existing data
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx SETTINGS mutations_sync = 2;

Now, let’s run queries using hasToken, hasAnyTokens, and hasAllTokens functions. The following examples will show the dramatic performance difference between a standard index scan and the direct read optimization.

Using `hasToken`

hasToken checks if the text contains a specific single token. We’ll search for the case-sensitive token ‘ClickHouse’. Direct read disabled (Standard scan) By default, ClickHouse uses the skip index to filter granules and then reads the column data for those granules. We can simulate this behavior by disabling direct read.

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;

┌─count()─┐
│     516 │
└─────────┘

1 row in set. Elapsed: 0.362 sec. Processed 24.90 million rows, 9.51 GB

Direct read enabled (Fast index read) Now we run the same query with direct read enabled (the default).

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 1;

┌─count()─┐
│     516 │
└─────────┘

1 row in set. Elapsed: 0.008 sec. Processed 3.15 million rows, 3.15 MB

The direct read query is over 45 times faster (0.362s vs 0.008s) and processes significantly less data (9.51 GB vs 3.15 MB) by reading from the index alone.

Using `hasAnyTokens`

hasAnyTokens checks if the text contains at least one of the given tokens. We’ll search for comments containing either ‘love’ or ‘ClickHouse’. Direct read disabled (Standard scan)

SELECT count()
FROM hackernews
WHERE hasAnyTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;

┌─count()─┐
│  408426 │
└─────────┘

1 row in set. Elapsed: 1.329 sec. Processed 28.74 million rows, 9.72 GB

Direct read enabled (Fast index read)

SELECT count()
FROM hackernews
WHERE hasAnyTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 1;

┌─count()─┐
│  408426 │
└─────────┘

1 row in set. Elapsed: 0.015 sec. Processed 27.99 million rows, 27.99 MB

The speedup is even more dramatic for this common “OR” search. The query is nearly 89 times faster (1.329s vs 0.015s) by avoiding the full column scan.

Using `hasAllTokens`

hasAllTokens checks if the text contains all of the given tokens. We’ll search for comments containing both ‘love’ and ‘ClickHouse’. Direct read disabled (Standard scan) Even with direct read disabled, the standard skip index is still effective. It filters down the 28.7M rows to just 147.46K rows, but it still must read 57.03 MB from the column.

SELECT count()
FROM hackernews
WHERE hasAllTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;

┌─count()─┐
│      11 │
└─────────┘

1 row in set. Elapsed: 0.184 sec. Processed 147.46 thousand rows, 57.03 MB

Direct read enabled (Fast index read) Direct read answers the query by operating on the index data, reading only 147.46 KB.

SELECT count()
FROM hackernews
WHERE hasAllTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 1;

┌─count()─┐
│      11 │
└─────────┘

1 row in set. Elapsed: 0.007 sec. Processed 147.46 thousand rows, 147.46 KB

For this “AND” search, the direct read optimization is over 26 times faster (0.184s vs 0.007s) than the standard skip index scan.

Compound search: OR, AND, NOT, …

The direct read optimization also applies to compound boolean expressions. Here, we’ll perform a case-insensitive search for ‘ClickHouse’ OR ‘clickhouse’. Direct read disabled (Standard scan)

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 0;

┌─count()─┐
│     769 │
└─────────┘

1 row in set. Elapsed: 0.450 sec. Processed 25.87 million rows, 9.58 GB

Direct read enabled (Fast index read)

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 1;

┌─count()─┐
│     769 │
└─────────┘

1 row in set. Elapsed: 0.013 sec. Processed 25.87 million rows, 51.73 MB

By combining the results from the index, the direct read query is 34 times faster (0.450s vs 0.013s) and avoids reading the 9.58 GB of column data. For this specific case, hasAnyTokens(comment, ['ClickHouse', 'clickhouse']) would be the preferred, more efficient syntax.

Outdated material

SQL Reference

Documentation Index

​Full-text search with text indexes

​Creating a Text Index

​Using a Text Index

​Supported functions

​=

​IN

​LIKE and match

​startsWith and endsWith

​hasToken and hasTokenOrNull

​hasAnyTokens and hasAllTokens

​hasPhrase

​has

​hasAny and hasAll

​mapContains

​mapContainsValue

​mapContainsKeyLike and mapContainsValueLike

​operator[]

​Indexing Array(String) columns

​Indexing Map columns

​Indexing JSON columns

​Indexes on specific subcolumns

​Path-based indexes with JSONAllPaths

​Value-based indexes with JSONAllValues

​Phrase search

​Example

​Performance Tuning

​Direct read

​LIKE/ILIKE queries

​Caching

​Tokens cache settings

​Header cache settings

​Posting lists cache settings

​Limitations

​Text Indexes vs Bloom-Filter-Based Indexes

​Implementation Details

​Example: Hackernews dataset

​ Using hasToken

​ Using hasAnyTokens

​ Using hasAllTokens

​ Compound search: OR, AND, NOT, …

​Related content

Full-text search with text indexes

Creating a Text Index

Using a Text Index

Supported functions

`=`

`IN`

`LIKE` and `match`

`startsWith` and `endsWith`

`hasToken` and `hasTokenOrNull`

`hasAnyTokens` and `hasAllTokens`

`hasPhrase`

`has`

`hasAny` and `hasAll`

`mapContains`

`mapContainsValue`

`mapContainsKeyLike` and `mapContainsValueLike`

`operator[]`

Indexing Array(String) columns

Indexing Map columns

Indexing JSON columns

Indexes on specific subcolumns

Path-based indexes with JSONAllPaths

Value-based indexes with JSONAllValues

Phrase search

Example

Performance Tuning

Direct read

LIKE/ILIKE queries

Caching

Tokens cache settings

Header cache settings

Posting lists cache settings

Limitations

Text Indexes vs Bloom-Filter-Based Indexes

Implementation Details

Example: Hackernews dataset

Using `hasToken`

Using `hasAnyTokens`

Using `hasAllTokens`

Compound search: OR, AND, NOT, …

Related content