Functions for splitting strings - ClickHouse Documentation

The documentation below is generated from the system.functions system table.

alphaTokens

Introduced in: v1.1.0 Selects substrings of consecutive bytes from the ranges a-z and A-Z and returns an array of the selected substrings. Syntax

alphaTokens(s[, max_substrings])

Aliases: splitByAlpha Arguments

s — The string to split. String
max_substrings — Optional. When max_substrings > 0, the number of returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Int64

Returned value Returns an array of selected substrings of s. Array(String) Examples Usage example

Query

SELECT alphaTokens('abca1abc');

Response

┌─alphaTokens('abca1abc')─┐
│ ['abca','abc']          │
└─────────────────────────┘

arrayStringConcat

Introduced in: v1.1.0 Concatenates string representations of values listed in the array with the provided separator, which is an optional parameter set to an empty string by default. Syntax

arrayStringConcat(arr[, separator])

Arguments

arr — The array to concatenate. Array(T)
separator — Optional. Separator string. By default an empty string. const String

Returned value Returns the concatenated string. String Examples Usage example

Query

SELECT arrayStringConcat(['12/05/2021', '12:50:00'], ' ') AS DateString;

Response

┌─DateString──────────┐
│ 12/05/2021 12:50:00 │
└─────────────────────┘

extractAllGroupsVertical

Introduced in: v20.5.0 Matches all groups of a string using a regular expression and returns an array of arrays, where each array includes matching fragments from every group, grouped in order of appearance in the input string. Syntax

extractAllGroupsVertical(s, regexp)

Aliases: extractAllGroups Arguments

s — Input string to extract from. String or FixedString
regexp — Regular expression to match by. const String or const FixedString

Returned value Returns an array of arrays, where each inner array contains the captured groups from one match. Each match produces an array with elements corresponding to the capturing groups in the regular expression (group 1, group 2, etc.). If no matches are found, returns an empty array. Array(Array(String)) Examples Usage example

Query

WITH '< Server: nginx
< Date: Tue, 22 Jan 2019 00:26:14 GMT
< Content-Type: text/html; charset=UTF-8
< Connection: keep-alive
' AS s
SELECT extractAllGroupsVertical(s, '< ([\\w\\-]+): ([^\\r\\n]+)');

Response

[['Server','nginx'],['Date','Tue, 22 Jan 2019 00:26:14 GMT'],['Content-Type','text/html; charset=UTF-8'],['Connection','keep-alive']]

ngrams

Introduced in: v21.11.0 Splits a UTF-8 string into n-grams of length N. Syntax

ngrams(s, N)

Arguments

s — Input string. String or FixedString
N — The n-gram length. const UInt8/16/32/64

Returned value Returns an array with n-grams. Array(String) Examples Usage example

Query

SELECT ngrams('ClickHouse', 3);

Response

['Cli','lic','ick','ckH','kHo','Hou','ous','use']

reverseBySeparator

Introduced in: v26.2.0 Reverses the order of substrings in a string separated by a specified separator. This function splits the string by the separator, reverses the order of the resulting parts, and joins them back using the same separator. It is useful for parsing domain names, file paths, or other hierarchical data where you need to reverse the order of components. Examples:

reverseBySeparator(‘www.google.com’) returns ‘com.google.www’
reverseBySeparator(‘a/b/c’, ’/’) returns ‘c/b/a’
reverseBySeparator(‘x::y::z’, ’::’) returns ‘z::y::x’

Syntax

reverseBySeparator(string[, separator])

Arguments

string — The input string to reverse the order of its parts. String
separator — The separator string used to identify parts. If not provided, uses ’.’ (dot). Default: ’.’ String

Returned value Returns a string with substrings ordered from right to left of the original string, joined by the same separator. String Examples Basic domain reversal

Query

SELECT reverseBySeparator('www.google.com')

Response

'com.google.www'

Path reversal

Query

SELECT reverseBySeparator('a/b/c', '/')

Response

'c/b/a'

Custom separator

Query

SELECT reverseBySeparator('x::y::z', '::')

Response

'z::y::x'

Edge case with dots

Query

SELECT reverseBySeparator('.a.b.', '.')

Response

'.b.a.'

Single element

Query

SELECT reverseBySeparator('single')

Response

'single'

Empty separator

Query

SELECT reverseBySeparator('abcde', '')

Response

'edcba'

splitByChar

Introduced in: v1.1.0 Splits a string separated by a specified constant string separator of exactly one character into an array of substrings. Empty substrings may be selected if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Empty substrings may be selected when:

A separator occurs at the beginning or end of the string
There are multiple consecutive separators
The original string s is empty

Syntax

splitByChar(separator, s[, max_substrings])

Arguments

separator — The separator must be a single-byte character. String
s — The string to split. String
max_substrings — Optional. If max_substrings > 0, the returned array will contain at most max_substrings substrings, otherwise the function will return as many substrings as possible. The default value is 0. Int64

Returned value Returns an array of selected substrings. Array(String) Examples Usage example

Query

SELECT splitByChar(',', '1,2,3,abcde');

Response

┌─splitByChar(⋯2,3,abcde')─┐
│ ['1','2','3','abcde']    │
└──────────────────────────┘

splitByNonAlpha

Introduced in: v21.9.0 Splits a string separated by whitespace and punctuation characters into an array of substrings.

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Syntax

splitByNonAlpha(s[, max_substrings])

Arguments

s — The string to split. String
max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64

Returned value Returns an array of selected substrings of s. Array(String) Examples Usage example

Query

SELECT splitByNonAlpha('user@domain.com');

Response

['user','domain','com']

splitByRegexp

Introduced in: v21.6.0 Splits a string which is separated by the provided regular expression into an array of substrings. If the provided regular expression is empty, it will split the string into an array of single characters. If no match is found for the regular expression, the string won’t be split. Empty substrings may be selected when:

a non-empty regular expression match occurs at the beginning or end of the string
there are multiple consecutive non-empty regular expression matches
the original string string is empty while the regular expression is not empty.

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Syntax

splitByRegexp(regexp, s[, max_substrings])

Arguments

regexp — Regular expression. Constant. String or FixedString
s — The string to split. String
max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64

Returned value Returns an array of the selected substrings of s. Array(String) Examples Usage example

Query

SELECT splitByRegexp('\\d+', 'a12bc23de345f');

Response

┌─splitByRegex⋯c23de345f')─┐
│ ['a12bc23de345f']        │
└──────────────────────────┘

Empty regexp

Query

SELECT splitByRegexp('', 'abcde');

Response

┌─splitByRegexp('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByString

Introduced in: v1.1.0 Splits a string with a constant separator consisting of multiple characters into an array of substrings. If the string separator is empty, it will split the string s into an array of single characters. Empty substrings may be selected when:

A non-empty separator occurs at the beginning or end of the string
There are multiple consecutive non-empty separators
The original string s is empty while the separator is not empty

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Syntax

splitByString(separator, s[, max_substrings])

Arguments

separator — The separator. String
s — The string to split. String
max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64

Returned value Returns an array of selected substrings of s Array(String) Examples Usage example

Query

SELECT splitByString(', ', '1, 2 3, 4,5, abcde');

Response

┌─splitByStrin⋯4,5, abcde')─┐
│ ['1','2 3','4,5','abcde'] │
└───────────────────────────┘

Empty separator

Query

SELECT splitByString('', 'abcde');

Response

┌─splitByString('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByWhitespace

Introduced in: v21.9.0 Splits a string which is separated by whitespace characters into an array of substrings.

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Syntax

splitByWhitespace(s[, max_substrings])

Arguments

s — The string to split. String
max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64

Returned value Returns an array of the selected substrings of s. Array(String) Examples Usage example

Query

SELECT splitByWhitespace('  1!  a,  b.  ');

Response

['1!','a,','b.']

tokens

Introduced in: v21.11.0 Splits a string into tokens using the given tokenizer. Available tokenizers:

splitByNonAlpha splits strings along non-alphanumeric ASCII characters (also see function splitByNonAlpha).
splitByString(S) splits strings along certain user-defined separator strings S (also see function splitByString). The separators can be specified using an optional parameter, for example, tokens(value, 'splitByString', [', ', '; ', '\n', '\\']). Note that each string can consist of multiple characters (', ' in the example). The default separator list, if not specified explicitly, is a single whitespace [' '].
asciiCJK splits strings into tokens using Unicode word boundary rules (similar to UAX #29). ASCII alphanumeric characters and underscores form tokens with connectors (: for letters, . and ' for same-type characters). Non-ASCII Unicode characters become single-character tokens.
ngrams(N) splits strings into equally large N-grams (also see function ngrams). The ngram length can be specified using an optional integer parameter between 1 and 8, for example, tokens(value, 'ngrams', 3). The default ngram size, if not specified explicitly, is 3.
sparseGrams(min_length, max_length, min_cutoff_length) splits strings into variable-length n-grams of at least min_length and at most max_length (inclusive) characters (also see function sparseGrams). Unless specified explicitly, min_length and max_length default to 3 and 100. If parameter min_cutoff_length is provided, only n-grams with length greater or equal than min_cutoff_length are returned. Compared to ngrams(N), the sparseGrams tokenizer produces variable-length N-grams, allowing for a more flexible representation of the original text. For example, tokens(value, 'sparseGrams', 3, 5, 4) internally generates 3-, 4-, 5-grams from the input string but only the 4- and 5-grams are returned.
array performs no tokenization, i.e. every row value is a token (also see function array).

In case of the splitByString tokenizer, if the tokens do not form a prefix code, you likely want that the matching prefers longer separators first. To do so, pass the separators in order of descending length. For example, with separators = ['%21', '%'] string %21abc would be tokenized as ['abc'], whereas separators = ['%', '%21'] would tokenize to ['21ac'] (which is likely not what you wanted). Syntax

tokens(value) -- 'splitByNonAlpha' tokenizer
tokens(value, 'splitByNonAlpha')
tokens(value, 'splitByString'[, separators])
tokens(value, 'asciiCJK')
tokens(value, 'ngrams'[, n])
tokens(value, 'sparseGrams'[, min_length, max_length[, min_cutoff_length]])
tokens(value, 'array')

Arguments

value — The input string. String or FixedString
tokenizer — The tokenizer to use. Valid arguments are splitByNonAlpha, splitByString, asciiCJK, ngrams, sparseGrams, and array. Optional, if not set explicitly, defaults to splitByNonAlpha. const String
n — Only relevant if argument tokenizer is ngrams: An optional parameter which defines the length of the ngrams. If not set explicitly, defaults to 3. const UInt8
separators — Only relevant if argument tokenizer is split: An optional parameter which defines the separator strings. If not set explicitly, defaults to [' ']. const Array(String)
min_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum gram length, defaults to 3. const UInt8
max_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the maximum gram length, defaults to 100. const UInt8
min_cutoff_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum cutoff length. const UInt8

Returned value Returns the resulting array of tokens from input string. Array Examples Default tokenizer

Query

SELECT tokens('test1,;\\\\ test2,;\\\\ test3,;\\\\   test4') AS tokens;

Response

['test1','test2','test3','test4']

Ngram tokenizer

Query

SELECT tokens('abc def', 'ngrams', 3) AS tokens;

Response

['abc','bc ','c d',' de','def']

tokensForLikePattern

Introduced in: v26.3.0 Splits a LIKE pattern string into tokens using the specified tokenizer. Unlike the tokens function, this function is aware of LIKE pattern semantics (such as leading and trailing wildcard characters) and applies tokenizer-specific rules to extract meaningful tokens for pattern matching. It supports the same argument sets as the tokens function; additional arguments after tokenizer are interpreted according to the selected tokenizer (for example, n for ngrams, separators for splitByString, and min_length / max_length [/ min_cutoff_length] for sparseGrams). This function is primarily intended for debugging and testing purposes, and is used internally to analyze tokenization behavior for LIKE patterns. Syntax

tokensForLikePattern(value[, tokenizer[, tokenizer_specific_arguments...]])

Arguments

value — The input string. String or FixedString
tokenizer — The tokenizer to use. Valid arguments are splitByNonAlpha, splitByString, asciiCJK, ngrams, sparseGrams, and array. Optional, if not set explicitly, defaults to splitByNonAlpha. const String
n — Only relevant if argument tokenizer is ngrams: An optional parameter which defines the length of the ngrams. If not set explicitly, defaults to 3. const UInt8
separators — Only relevant if argument tokenizer is split: An optional parameter which defines the separator strings. If not set explicitly, defaults to [' ']. const Array(String)
min_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum gram length, defaults to 3. const UInt8
max_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the maximum gram length, defaults to 100. const UInt8
min_cutoff_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum cutoff length. const UInt8

Returned value Returns the resulting array of tokens from input string. Array Examples Default tokenizer

Query

SELECT tokensForLikePattern('%test1,test2,test3%') AS tokens;

Response

['test2']

SQL Reference

Documentation Index

​alphaTokens

​arrayStringConcat

​extractAllGroupsVertical

​ngrams

​reverseBySeparator

​splitByChar

​splitByNonAlpha

​splitByRegexp

​splitByString

​splitByWhitespace

​tokens

​tokensForLikePattern

alphaTokens

arrayStringConcat

extractAllGroupsVertical

ngrams

reverseBySeparator

splitByChar

splitByNonAlpha

splitByRegexp

splitByString

splitByWhitespace

tokens

tokensForLikePattern