Skip to main content

Documentation Index

Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt

Use this file to discover all available pages before exploring further.

The documentation below is generated from the system.functions system table.

alphaTokens

Introduced in: v1.1.0 Selects substrings of consecutive bytes from the ranges a-z and A-Z and returns an array of the selected substrings. Syntax
alphaTokens(s[, max_substrings])
Aliases: splitByAlpha Arguments
  • s — The string to split. String
  • max_substrings — Optional. When max_substrings > 0, the number of returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Int64
Returned value Returns an array of selected substrings of s. Array(String) Examples Usage example
Query
SELECT alphaTokens('abca1abc');
Response
┌─alphaTokens('abca1abc')─┐
│ ['abca','abc']          │
└─────────────────────────┘

arrayStringConcat

Introduced in: v1.1.0 Concatenates string representations of values listed in the array with the provided separator, which is an optional parameter set to an empty string by default. Syntax
arrayStringConcat(arr[, separator])
Arguments
  • arr — The array to concatenate. Array(T)
  • separator — Optional. Separator string. By default an empty string. const String
Returned value Returns the concatenated string. String Examples Usage example
Query
SELECT arrayStringConcat(['12/05/2021', '12:50:00'], ' ') AS DateString;
Response
┌─DateString──────────┐
│ 12/05/2021 12:50:00 │
└─────────────────────┘

extractAllGroupsVertical

Introduced in: v20.5.0 Matches all groups of a string using a regular expression and returns an array of arrays, where each array includes matching fragments from every group, grouped in order of appearance in the input string. Syntax
extractAllGroupsVertical(s, regexp)
Aliases: extractAllGroups Arguments Returned value Returns an array of arrays, where each inner array contains the captured groups from one match. Each match produces an array with elements corresponding to the capturing groups in the regular expression (group 1, group 2, etc.). If no matches are found, returns an empty array. Array(Array(String)) Examples Usage example
Query
WITH '< Server: nginx
< Date: Tue, 22 Jan 2019 00:26:14 GMT
< Content-Type: text/html; charset=UTF-8
< Connection: keep-alive
' AS s
SELECT extractAllGroupsVertical(s, '< ([\\w\\-]+): ([^\\r\\n]+)');
Response
[['Server','nginx'],['Date','Tue, 22 Jan 2019 00:26:14 GMT'],['Content-Type','text/html; charset=UTF-8'],['Connection','keep-alive']]

ngrams

Introduced in: v21.11.0 Splits a UTF-8 string into n-grams of length N. Syntax
ngrams(s, N)
Arguments Returned value Returns an array with n-grams. Array(String) Examples Usage example
Query
SELECT ngrams('ClickHouse', 3);
Response
['Cli','lic','ick','ckH','kHo','Hou','ous','use']

reverseBySeparator

Introduced in: v26.2.0 Reverses the order of substrings in a string separated by a specified separator. This function splits the string by the separator, reverses the order of the resulting parts, and joins them back using the same separator. It is useful for parsing domain names, file paths, or other hierarchical data where you need to reverse the order of components. Examples:
  • reverseBySeparator(‘www.google.com’) returns ‘com.google.www’
  • reverseBySeparator(‘a/b/c’, ’/’) returns ‘c/b/a’
  • reverseBySeparator(‘x::y::z’, ’::’) returns ‘z::y::x’
Syntax
reverseBySeparator(string[, separator])
Arguments
  • string — The input string to reverse the order of its parts. String
  • separator — The separator string used to identify parts. If not provided, uses ’.’ (dot). Default: ’.’ String
Returned value Returns a string with substrings ordered from right to left of the original string, joined by the same separator. String Examples Basic domain reversal
Query
SELECT reverseBySeparator('www.google.com')
Response
'com.google.www'
Path reversal
Query
SELECT reverseBySeparator('a/b/c', '/')
Response
'c/b/a'
Custom separator
Query
SELECT reverseBySeparator('x::y::z', '::')
Response
'z::y::x'
Edge case with dots
Query
SELECT reverseBySeparator('.a.b.', '.')
Response
'.b.a.'
Single element
Query
SELECT reverseBySeparator('single')
Response
'single'
Empty separator
Query
SELECT reverseBySeparator('abcde', '')
Response
'edcba'

splitByChar

Introduced in: v1.1.0 Splits a string separated by a specified constant string separator of exactly one character into an array of substrings. Empty substrings may be selected if the separator occurs at the beginning or end of the string, or if there are multiple consecutive separators.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.
Empty substrings may be selected when:
  • A separator occurs at the beginning or end of the string
  • There are multiple consecutive separators
  • The original string s is empty
Syntax
splitByChar(separator, s[, max_substrings])
Arguments
  • separator — The separator must be a single-byte character. String
  • s — The string to split. String
  • max_substrings — Optional. If max_substrings > 0, the returned array will contain at most max_substrings substrings, otherwise the function will return as many substrings as possible. The default value is 0. Int64
Returned value Returns an array of selected substrings. Array(String) Examples Usage example
Query
SELECT splitByChar(',', '1,2,3,abcde');
Response
┌─splitByChar(⋯2,3,abcde')─┐
│ ['1','2','3','abcde']    │
└──────────────────────────┘

splitByNonAlpha

Introduced in: v21.9.0 Splits a string separated by whitespace and punctuation characters into an array of substrings.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.
Syntax
splitByNonAlpha(s[, max_substrings])
Arguments
  • s — The string to split. String
  • max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64
Returned value Returns an array of selected substrings of s. Array(String) Examples Usage example
Query
SELECT splitByNonAlpha('user@domain.com');
Response
['user','domain','com']

splitByRegexp

Introduced in: v21.6.0 Splits a string which is separated by the provided regular expression into an array of substrings. If the provided regular expression is empty, it will split the string into an array of single characters. If no match is found for the regular expression, the string won’t be split. Empty substrings may be selected when:
  • a non-empty regular expression match occurs at the beginning or end of the string
  • there are multiple consecutive non-empty regular expression matches
  • the original string string is empty while the regular expression is not empty.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.
Syntax
splitByRegexp(regexp, s[, max_substrings])
Arguments
  • regexp — Regular expression. Constant. String or FixedString
  • s — The string to split. String
  • max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64
Returned value Returns an array of the selected substrings of s. Array(String) Examples Usage example
Query
SELECT splitByRegexp('\\d+', 'a12bc23de345f');
Response
┌─splitByRegex⋯c23de345f')─┐
│ ['a12bc23de345f']        │
└──────────────────────────┘
Empty regexp
Query
SELECT splitByRegexp('', 'abcde');
Response
┌─splitByRegexp('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByString

Introduced in: v1.1.0 Splits a string with a constant separator consisting of multiple characters into an array of substrings. If the string separator is empty, it will split the string s into an array of single characters. Empty substrings may be selected when:
  • A non-empty separator occurs at the beginning or end of the string
  • There are multiple consecutive non-empty separators
  • The original string s is empty while the separator is not empty
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.
Syntax
splitByString(separator, s[, max_substrings])
Arguments
  • separator — The separator. String
  • s — The string to split. String
  • max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64
Returned value Returns an array of selected substrings of s Array(String) Examples Usage example
Query
SELECT splitByString(', ', '1, 2 3, 4,5, abcde');
Response
┌─splitByStrin⋯4,5, abcde')─┐
│ ['1','2 3','4,5','abcde'] │
└───────────────────────────┘
Empty separator
Query
SELECT splitByString('', 'abcde');
Response
┌─splitByString('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByWhitespace

Introduced in: v21.9.0 Splits a string which is separated by whitespace characters into an array of substrings.
Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.
Syntax
splitByWhitespace(s[, max_substrings])
Arguments
  • s — The string to split. String
  • max_substrings — Optional. When max_substrings > 0, the returned substrings will be no more than max_substrings, otherwise the function will return as many substrings as possible. Default value: 0. Int64
Returned value Returns an array of the selected substrings of s. Array(String) Examples Usage example
Query
SELECT splitByWhitespace('  1!  a,  b.  ');
Response
['1!','a,','b.']

tokens

Introduced in: v21.11.0 Splits a string into tokens using the given tokenizer. Available tokenizers:
  • splitByNonAlpha splits strings along non-alphanumeric ASCII characters (also see function splitByNonAlpha).
  • splitByString(S) splits strings along certain user-defined separator strings S (also see function splitByString). The separators can be specified using an optional parameter, for example, tokens(value, 'splitByString', [', ', '; ', '\n', '\\']). Note that each string can consist of multiple characters (', ' in the example). The default separator list, if not specified explicitly, is a single whitespace [' '].
  • asciiCJK splits strings into tokens using Unicode word boundary rules (similar to UAX #29). ASCII alphanumeric characters and underscores form tokens with connectors (: for letters, . and ' for same-type characters). Non-ASCII Unicode characters become single-character tokens.
  • ngrams(N) splits strings into equally large N-grams (also see function ngrams). The ngram length can be specified using an optional integer parameter between 1 and 8, for example, tokens(value, 'ngrams', 3). The default ngram size, if not specified explicitly, is 3.
  • sparseGrams(min_length, max_length, min_cutoff_length) splits strings into variable-length n-grams of at least min_length and at most max_length (inclusive) characters (also see function sparseGrams). Unless specified explicitly, min_length and max_length default to 3 and 100. If parameter min_cutoff_length is provided, only n-grams with length greater or equal than min_cutoff_length are returned. Compared to ngrams(N), the sparseGrams tokenizer produces variable-length N-grams, allowing for a more flexible representation of the original text. For example, tokens(value, 'sparseGrams', 3, 5, 4) internally generates 3-, 4-, 5-grams from the input string but only the 4- and 5-grams are returned.
  • array performs no tokenization, i.e. every row value is a token (also see function array).
In case of the splitByString tokenizer, if the tokens do not form a prefix code, you likely want that the matching prefers longer separators first. To do so, pass the separators in order of descending length. For example, with separators = ['%21', '%'] string %21abc would be tokenized as ['abc'], whereas separators = ['%', '%21'] would tokenize to ['21ac'] (which is likely not what you wanted). Syntax
tokens(value) -- 'splitByNonAlpha' tokenizer
tokens(value, 'splitByNonAlpha')
tokens(value, 'splitByString'[, separators])
tokens(value, 'asciiCJK')
tokens(value, 'ngrams'[, n])
tokens(value, 'sparseGrams'[, min_length, max_length[, min_cutoff_length]])
tokens(value, 'array')
Arguments
  • value — The input string. String or FixedString
  • tokenizer — The tokenizer to use. Valid arguments are splitByNonAlpha, splitByString, asciiCJK, ngrams, sparseGrams, and array. Optional, if not set explicitly, defaults to splitByNonAlpha. const String
  • n — Only relevant if argument tokenizer is ngrams: An optional parameter which defines the length of the ngrams. If not set explicitly, defaults to 3. const UInt8
  • separators — Only relevant if argument tokenizer is split: An optional parameter which defines the separator strings. If not set explicitly, defaults to [' ']. const Array(String)
  • min_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum gram length, defaults to 3. const UInt8
  • max_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the maximum gram length, defaults to 100. const UInt8
  • min_cutoff_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum cutoff length. const UInt8
Returned value Returns the resulting array of tokens from input string. Array Examples Default tokenizer
Query
SELECT tokens('test1,;\\\\ test2,;\\\\ test3,;\\\\   test4') AS tokens;
Response
['test1','test2','test3','test4']
Ngram tokenizer
Query
SELECT tokens('abc def', 'ngrams', 3) AS tokens;
Response
['abc','bc ','c d',' de','def']

tokensForLikePattern

Introduced in: v26.3.0 Splits a LIKE pattern string into tokens using the specified tokenizer. Unlike the tokens function, this function is aware of LIKE pattern semantics (such as leading and trailing wildcard characters) and applies tokenizer-specific rules to extract meaningful tokens for pattern matching. It supports the same argument sets as the tokens function; additional arguments after tokenizer are interpreted according to the selected tokenizer (for example, n for ngrams, separators for splitByString, and min_length / max_length [/ min_cutoff_length] for sparseGrams). This function is primarily intended for debugging and testing purposes, and is used internally to analyze tokenization behavior for LIKE patterns. Syntax
tokensForLikePattern(value[, tokenizer[, tokenizer_specific_arguments...]])
Arguments
  • value — The input string. String or FixedString
  • tokenizer — The tokenizer to use. Valid arguments are splitByNonAlpha, splitByString, asciiCJK, ngrams, sparseGrams, and array. Optional, if not set explicitly, defaults to splitByNonAlpha. const String
  • n — Only relevant if argument tokenizer is ngrams: An optional parameter which defines the length of the ngrams. If not set explicitly, defaults to 3. const UInt8
  • separators — Only relevant if argument tokenizer is split: An optional parameter which defines the separator strings. If not set explicitly, defaults to [' ']. const Array(String)
  • min_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum gram length, defaults to 3. const UInt8
  • max_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the maximum gram length, defaults to 100. const UInt8
  • min_cutoff_length — Only relevant if argument tokenizer is sparseGrams: An optional parameter which defines the minimum cutoff length. const UInt8
Returned value Returns the resulting array of tokens from input string. Array Examples Default tokenizer
Query
SELECT tokensForLikePattern('%test1,test2,test3%') AS tokens;
Response
['test2']