Machine learning functions
evalMLMethod
Prediction using fitted regression models uses the `evalMLMethod` function. See the link in `linearRegression`.
stochasticLinearRegression
The `stochasticLinearRegression` aggregate function implements the stochastic gradient descent method using a linear model and the MSE loss function. Use `evalMLMethod` to predict on new data.
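The idea behind the aggregate function can be sketched outside of SQL. The following is a conceptual illustration, not ClickHouse's implementation: fit a linear model by stochastic gradient descent on MSE loss, then apply it to new data (the role `evalMLMethod` plays server-side). All function names and hyperparameter values here are illustrative.

```python
def sgd_linear_fit(rows, learning_rate=0.01, epochs=500):
    """rows: list of (target, features) pairs; returns (weights, bias).

    One SGD update per sample: step against the gradient of 0.5 * (pred - target)^2.
    """
    n_features = len(rows[0][1])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for target, features in rows:
            pred = bias + sum(w * x for w, x in zip(weights, features))
            err = pred - target  # d(0.5 * MSE) / d(pred)
            for i, x in enumerate(features):
                weights[i] -= learning_rate * err * x
            bias -= learning_rate * err
    return weights, bias

def predict(weights, bias, features):
    """The analogue of applying the fitted model to a new row."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Toy data following target = 2*x + 1; SGD should recover slope ~2, intercept ~1.
train = [(2 * x + 1, [float(x)]) for x in range(10)]
w, b = sgd_linear_fit(train)
```

In ClickHouse the fitted state is stored via the aggregate function and passed to `evalMLMethod`; here the `(weights, bias)` pair plays that role.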
stochasticLogisticRegression
The `stochasticLogisticRegression` aggregate function implements the stochastic gradient descent method for the binary classification problem. Use `evalMLMethod` to predict on new data.
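The difference from the linear case is the sigmoid link and log loss. A conceptual sketch (again not ClickHouse's implementation; names and hyperparameters are illustrative):

```python
import math

def sgd_logistic_fit(rows, learning_rate=0.1, epochs=200):
    """rows: list of (label, features) with label in {0, 1}."""
    n_features = len(rows[0][1])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for label, features in rows:
            z = bias + sum(w * x for w, x in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = p - label                  # gradient of the log loss w.r.t. z
            for i, x in enumerate(features):
                weights[i] -= learning_rate * err * x
            bias -= learning_rate * err
    return weights, bias

def predict_proba(weights, bias, features):
    """Probability of class 1 for a new row."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Separable toy data: class 1 iff x > 0.
train = [(0, [-2.0]), (0, [-1.0]), (1, [1.0]), (1, [2.0])]
w, b = sgd_logistic_fit(train)
```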
naiveBayesClassifier
Classifies input text using a Naive Bayes model with n-grams and Laplace smoothing. The model must be configured in ClickHouse before use.

Syntax

naiveBayesClassifier(model_name, input_text)

Arguments

- `model_name` — Name of the pre-configured model. String. The model must be defined in ClickHouse's configuration files (see below).
- `input_text` — Text to classify. String. Input is processed exactly as provided (case/punctuation preserved).
Returned value

- Predicted class ID as an unsigned integer. UInt32. Class IDs correspond to categories defined during model construction. For example, 0 might represent English while 1 could indicate French; class meanings depend on your training data.
Implementation Details
Algorithm

Uses the Naive Bayes classification algorithm: class scores are computed from n-gram probabilities, with Laplace smoothing to handle n-grams that were not seen during training.

Key Features

- Supports n-grams of any size
- Three tokenization modes:
  - `byte`: Operates on raw bytes. Each byte is one token.
  - `codepoint`: Operates on Unicode scalar values decoded from UTF-8. Each codepoint is one token.
  - `token`: Splits on runs of Unicode whitespace (regex `\s+`). Tokens are substrings of non-whitespace; punctuation is part of the token if adjacent (e.g., "you?" is one token).
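The three modes, and the Laplace-smoothed probabilities the algorithm section mentions, can be sketched in Python. This is an illustration of the rules above, not ClickHouse source code:

```python
import math
import re

def tokenize(text: str, mode: str):
    """Split text into tokens according to the three documented modes."""
    if mode == "byte":
        return list(text.encode("utf-8"))      # each byte is one token
    if mode == "codepoint":
        return list(text)                      # each Unicode scalar value is one token
    if mode == "token":
        return re.split(r"\s+", text.strip())  # runs of whitespace separate tokens
    raise ValueError(f"unknown mode: {mode}")

def laplace_log_prob(count: int, class_total: int, vocab_size: int, alpha: float):
    """Smoothed log-probability of one n-gram given a class.

    With alpha > 0, unseen n-grams (count == 0) still get nonzero probability.
    """
    return math.log((count + alpha) / (class_total + alpha * vocab_size))
```

Note how `token` mode keeps punctuation attached: `tokenize("Are you?", "token")` yields `["Are", "you?"]`, while a two-byte UTF-8 character is two tokens in `byte` mode but one in `codepoint` mode.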
Model Configuration
You can find sample source code for creating a Naive Bayes model for language detection here. Additionally, sample models and their associated config files are available here.

A naive Bayes model configuration in ClickHouse uses the following parameters:

| Parameter | Description | Example | Default |
|---|---|---|---|
| name | Unique model identifier | language_detection | Required |
| path | Full path to model binary | /etc/clickhouse-server/config.d/language_detection.bin | Required |
| mode | Tokenization method: `byte` (byte sequences), `codepoint` (Unicode characters), or `token` (word tokens) | token | Required |
| n | N-gram size; in token mode, 1 = single words, 2 = word pairs, 3 = word triplets | 2 | Required |
| alpha | Laplace smoothing factor used during classification to address n-grams that do not appear in the model | 0.5 | 1.0 |
| priors | Class probabilities (% of the documents belonging to a class) | 60% class 0, 40% class 1 | Equal distribution |
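Putting the parameters from the table together, a server-side config file could be sketched as below. This is a hypothetical sketch assembled from the parameter table; the actual element names and nesting may differ, so consult the sample config files referenced above.

```xml
<!-- Hypothetical sketch; element names are assumptions based on the parameter table. -->
<clickhouse>
    <naive_bayes_models>
        <model>
            <name>language_detection</name>
            <path>/etc/clickhouse-server/config.d/language_detection.bin</path>
            <mode>token</mode>
            <n>2</n>
            <alpha>0.5</alpha>
            <priors>
                <prior>0.6</prior>  <!-- class 0 -->
                <prior>0.4</prior>  <!-- class 1 -->
            </priors>
        </model>
    </naive_bayes_models>
</clickhouse>
```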
For example, with n=1 and token mode, the model stores counts of single words; with n=3 and codepoint mode, it stores counts of codepoint triplets.
Each record in the serialized model file consists of:

- 4-byte `class_id` (UInt, little-endian)
- 4-byte `n-gram bytes length` (UInt, little-endian)
- Raw `n-gram` bytes
- 4-byte `count` (UInt, little-endian)
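The record layout above can be exercised with Python's `struct` module. This is a sketch of the documented byte layout (assuming "UInt" means a 4-byte unsigned integer), not ClickHouse's serializer:

```python
import io
import struct

def write_record(out, class_id: int, ngram: bytes, count: int):
    """Serialize one record: class_id, n-gram length, raw n-gram bytes, count
    (all integers as 4-byte little-endian unsigned, per the layout above)."""
    out.write(struct.pack("<I", class_id))
    out.write(struct.pack("<I", len(ngram)))
    out.write(ngram)
    out.write(struct.pack("<I", count))

def read_records(data: bytes):
    """Parse a sequence of records back out of a serialized model blob."""
    pos, records = 0, []
    while pos < len(data):
        class_id, length = struct.unpack_from("<II", data, pos)
        pos += 8
        ngram = data[pos:pos + length]
        pos += length
        (count,) = struct.unpack_from("<I", data, pos)
        pos += 4
        records.append((class_id, ngram, count))
    return records

# Round-trip two illustrative records.
buf = io.BytesIO()
write_record(buf, 0, "<s> ClickHouse".encode("utf-8"), 42)
write_record(buf, 1, "vite </s>".encode("utf-8"), 7)
records = read_records(buf.getvalue())
```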
The n-grams stored in the model must be generated consistently with the configured mode and n. The following steps outline the preprocessing:
1. Add boundary markers at the start and end of each document based on tokenization mode:

   - Byte: `0x01` (start), `0xFF` (end)
   - Codepoint: `U+10FFFE` (start), `U+10FFFF` (end)
   - Token: `<s>` (start), `</s>` (end)

   (n - 1) boundary tokens are added at both the beginning and the end of the document.

2. Example for n=3 in token mode:

   - Document: "ClickHouse is fast"
   - Processed as: `<s> <s> ClickHouse is fast </s> </s>`
   - Generated trigrams: `<s> <s> ClickHouse`, `<s> ClickHouse is`, `ClickHouse is fast`, `is fast </s>`, `fast </s> </s>`
For byte and codepoint modes, it may be convenient to first tokenize the document (into a list of bytes for byte mode, or a list of codepoints for codepoint mode), then append n - 1 start tokens at the beginning and n - 1 end tokens at the end, and finally generate the n-grams and write them to the serialized file.
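The preprocessing steps above can be sketched as a single padding-and-windowing function (an illustration, not ClickHouse source):

```python
import re

def ngrams(tokens, n, start, end):
    """Pad with (n - 1) boundary markers on each side, then slide a window of n."""
    padded = [start] * (n - 1) + list(tokens) + [end] * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Token mode, n=3: reproduces the documented "ClickHouse is fast" example.
doc_tokens = re.split(r"\s+", "ClickHouse is fast".strip())
trigrams = [" ".join(g) for g in ngrams(doc_tokens, 3, "<s>", "</s>")]

# Byte mode, n=2: boundary markers are the bytes 0x01 (start) and 0xFF (end).
byte_bigrams = ngrams(list("hi".encode("utf-8")), 2, 0x01, 0xFF)
```

For codepoint mode the same function applies, with `list(text)` as the tokens and `U+10FFFE` / `U+10FFFF` as the markers.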