Exact and approximate vector search
The problem of finding the N closest points in a multi-dimensional (vector) space for a given point is known as nearest neighbor search or, in short, vector search. Two general approaches exist for solving vector search:
- Exact vector search calculates the distance between the given point and all points in the vector space. This ensures the best possible accuracy, i.e. the returned points are guaranteed to be the actual nearest neighbors. Since the vector space is explored exhaustively, exact vector search can be too slow for real-world use.
- Approximate vector search refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between result accuracy and search time.
In ClickHouse, vectors are stored in columns of array type, e.g. Array(Float64), Array(Float32), or Array(BFloat16). In a vector search query:
- The reference vector is a constant array, given as a common table expression.
- <DistanceFunction> computes the distance between the reference point and all stored points. Any of the available distance functions can be used for that.
- <N> specifies how many neighbors should be returned.
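These pieces fit together into the following idiomatic query shape (a sketch with placeholders, not runnable as-is):

```sql
WITH [<reference vector>] AS reference_vector
SELECT [...]
FROM table
ORDER BY <DistanceFunction>(vectors, reference_vector)
LIMIT <N>
```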
Exact vector search
An exact vector search can be performed using the above SELECT query as is. The runtime of such queries is generally proportional to the number of stored vectors and their dimension, i.e. the number of array elements. Also, since ClickHouse performs a brute-force scan of all vectors, the runtime also depends on the number of threads used by the query (see setting max_threads).
Example
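A minimal sketch of an exact search; the table and the data are made up for illustration:

```sql
CREATE TABLE tab(id Int32, vec Array(Float32)) ENGINE = MergeTree ORDER BY id;

INSERT INTO tab VALUES (0, [1.0, 0.0]), (1, [1.1, 0.0]), (2, [1.2, 0.0]);

WITH [1.0, 0.0]::Array(Float32) AS reference_vector
SELECT id
FROM tab
ORDER BY L2Distance(vec, reference_vector)
LIMIT 2;
```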
Approximate vector search
Vector Similarity Indexes
ClickHouse provides a special "vector similarity" index to perform approximate vector search.
Vector similarity indexes are available in ClickHouse version 25.8 and higher.
If you run into problems, kindly open an issue in the ClickHouse repository.
Creating a Vector Similarity Index
A vector similarity index can be created on a new table, or added to an existing table with an ALTER TABLE statement. The ALTER TABLE statement only causes the index to be built for future new data inserted into the table.
To build the index for existing data as well, you need to materialize it:
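A sketch of all three statements (table, column, and index names are illustrative; the third index argument is the vector dimension):

```sql
-- create the index together with a new table:
CREATE TABLE tab
(
    id Int32,
    vec Array(Float32),
    INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 2)
)
ENGINE = MergeTree
ORDER BY id;

-- or add it to an existing table (applies only to newly inserted data):
ALTER TABLE tab ADD INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 2);

-- build the index for existing data as well:
ALTER TABLE tab MATERIALIZE INDEX idx;
```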
<distance_function> must be one of:
- L2Distance, the Euclidean distance, representing the length of a line between two points in Euclidean space,
- cosineDistance, the cosine distance, representing the angle between two non-zero vectors, or
- dotProduct, the dot product (inner product), representing the sum of element-wise products of two vectors. Equivalent to cosineDistance on normalized data.
L2Distance is usually the best choice; otherwise, cosineDistance is recommended to compensate for scale.
For distance functions L2Distance and cosineDistance, a smaller value means a higher similarity, whereas for dotProduct, a higher value means a higher similarity. As a result, vector indexes built with L2Distance and cosineDistance can only be used by SELECT [...] ORDER BY [...] ASC queries (ASC is the default for ORDER BY), whereas vector indexes built for dotProduct can only be used by SELECT [...] ORDER BY [...] DESC queries.
<dimensions> specifies the array cardinality (number of elements) in the underlying column.
If ClickHouse finds an array with a different cardinality during index creation, the index is discarded and an error is returned.
The optional GRANULARITY parameter <N> refers to the size of the index granules (see here).
Unlike regular skip indexes, which use a default index granularity of 1, vector similarity indexes use 100 million as default index granularity.
This value makes sure that only a few indexes are built internally, even for large parts.
We recommend changing the index granularity only for advanced users who understand the implications of what they are doing (see below).
Vector similarity indexes are generic in the sense that they can accommodate different approximate search methods.
The actually used method is specified by parameter <type>.
As of now, the only available method is HNSW (academic paper), a popular and state-of-the-art technique for approximate vector search based on hierarchical proximity graphs.
If HNSW is used as type, users may optionally specify further HNSW-specific parameters:
- <quantization> controls the quantization of the vectors in the proximity graph. Possible values are f64, f32, f16, bf16, i8, or b1. The default value is bf16. Note that this parameter does not affect the representation of the vectors in the underlying column.
- <hnsw_max_connections_per_layer> controls the number of neighbors per graph node, also known as HNSW hyperparameter M. The default value is 32. Value 0 means using the default value.
- <hnsw_candidate_list_size_for_construction> controls the size of the dynamic candidate list during construction of the HNSW graph, also known as HNSW hyperparameter ef_construction. The default value is 128. Value 0 means using the default value.
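For illustration, an index definition that sets all parameters explicitly might look like this (a sketch; the column name and parameter values are assumptions, and the parameter order - type, distance function, dimensions, quantization, M, ef_construction - follows the list above):

```sql
INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 1536, 'bf16', 64, 256)
```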
The following restrictions apply:
- Vector similarity indexes can only be built on columns of type Array(Float32), Array(Float64), or Array(BFloat16). Arrays of nullable and low-cardinality floats such as Array(Nullable(Float32)) and Array(LowCardinality(Float32)) are not allowed.
- Vector similarity indexes must be built on single columns.
- Vector similarity indexes may be built on calculated expressions (e.g., INDEX index_name arraySort(vectors) TYPE vector_similarity([...])) but such indexes cannot be used for approximate neighbor search later on.
- Vector similarity indexes require that all arrays in the underlying column have <dimension>-many elements - this is checked during index creation. To detect violations of this requirement as early as possible, users can add a constraint for the vector column, e.g., CONSTRAINT same_length CHECK length(vectors) = 256.
- Likewise, array values in the underlying column must not be empty ([]) or have a default value (also []).
Using a Vector Similarity Index
To use vector similarity indexes, setting compatibility has to be '' (the default value), or '25.1' or newer. Additionally, the size of the dynamic candidate list during search (HNSW hyperparameter ef_search) can be adjusted per query via setting hnsw_candidate_list_size_for_search (SELECT [...] SETTINGS hnsw_candidate_list_size_for_search = <value>).
The setting's default value of 256 works well in the majority of use cases.
Higher setting values mean better accuracy at the cost of slower performance.
If the query can use a vector similarity index, ClickHouse checks that the LIMIT <N> provided in SELECT queries is within reasonable bounds.
More specifically, an error is returned if <N> is bigger than the value of setting max_limit_for_vector_search_queries with default value 100.
Too large LIMIT values can slow down searches and usually indicate a usage error.
To check if a SELECT query uses a vector similarity index, you can prefix the query with EXPLAIN indexes = 1.
As an example, the output of such a query contains the index type Skip and the name and type of the vector index (in the example, idx and vector_similarity).
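A sketch of such a query against a hypothetical table tab, with abridged, illustrative output:

```sql
EXPLAIN indexes = 1
WITH [0.5, 1.5]::Array(Float32) AS reference_vector
SELECT id
FROM tab
ORDER BY L2Distance(vec, reference_vector)
LIMIT 1;

-- abridged, illustrative output:
--   Skip
--     Name: idx
--     Description: vector_similarity GRANULARITY 100000000
--     Parts: 1/1
--     Granules: 2/4
```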
In this case, the vector similarity index dropped two of four granules, i.e. 50% of the data.
The more granules can be dropped, the more effective index usage becomes.
Post-filtering and Pre-filtering
Users may optionally specify a WHERE clause with additional filter conditions for the SELECT query.
ClickHouse will evaluate these filter conditions using a post-filtering or a pre-filtering strategy.
In short, both strategies determine the order in which the filters are evaluated:
- Post-filtering means that the vector similarity index is evaluated first, and afterwards ClickHouse evaluates the additional filter(s) specified in the WHERE clause.
- Pre-filtering means that the filter evaluation order is the other way round.
Both strategies have trade-offs:
- Post-filtering has the general problem that it may return fewer than the number of rows requested in the LIMIT <N> clause. This situation happens when one or more result rows returned by the vector similarity index fail to satisfy the additional filters.
- Pre-filtering is generally an unsolved problem. Certain specialized vector databases provide pre-filtering algorithms, but most relational databases (including ClickHouse) will fall back to exact neighbor search, i.e., a brute-force scan without index.
As an example, assume that the table is partitioned by year and the following query is run:
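A sketch of such a query (table and column names are assumptions):

```sql
WITH [0.5, 1.5]::Array(Float32) AS reference_vector
SELECT id
FROM tab
WHERE year = 2025
ORDER BY L2Distance(vec, reference_vector)
LIMIT 10;
```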
- If the filter condition eliminates at least one row within a part, ClickHouse will fall back to pre-filtering for the "surviving" ranges within the part.
- If the filter condition eliminates no rows within a part, ClickHouse will perform post-filtering for the part.
Alternatively, setting vector_search_filter_strategy (default: auto, which implements the above heuristics) may be set to prefilter.
This is useful to force pre-filtering in cases where the additional filter conditions are extremely selective.
As an example, the following query may benefit from pre-filtering:
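A sketch, assuming a books table with a price column and a book embedding column (names are illustrative):

```sql
WITH [0.5, 1.5]::Array(Float32) AS reference_vector
SELECT bookid, author, title
FROM books
WHERE price < 2.0
ORDER BY cosineDistance(book_vector, reference_vector)
LIMIT 10;
```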
With pre-filtering enabled (by appending SETTINGS vector_search_filter_strategy = 'prefilter' to the query), ClickHouse first finds all books with a price of less than 2 dollars and then executes a brute-force vector search over the found books.
As an alternative approach to resolve the above issue, setting vector_search_index_fetch_multiplier (default: 1.0, maximum: 1000.0) may be configured to a value > 1.0 (for example, 2.0).
The number of nearest neighbors fetched from the vector index is multiplied by the setting value, and the additional filter is then applied on those rows to return LIMIT-many rows.
As an example, we can query again but with multiplier 3.0:
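A sketch of the same books query with the multiplier applied (names are illustrative):

```sql
WITH [0.5, 1.5]::Array(Float32) AS reference_vector
SELECT bookid, author, title
FROM books
WHERE price < 2.0
ORDER BY cosineDistance(book_vector, reference_vector)
LIMIT 10
SETTINGS vector_search_index_fetch_multiplier = 3.0;
```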
vector_search_index_fetch_multiplier can mitigate the problem, but in extreme cases (a very selective WHERE condition), it is still possible that fewer than the N requested rows are returned.
Rescoring
Skip indexes in ClickHouse generally filter at the granule level, i.e. a lookup in a skip index (internally) returns a list of potentially matching granules, which reduces the amount of data read in the subsequent scan.
This works well for skip indexes in general but in the case of vector similarity indexes, it creates a “granularity mismatch”.
In more detail, the vector similarity index determines the row numbers of the N most similar vectors for a given reference vector, but it then needs to extrapolate these row numbers to granule numbers.
ClickHouse will then load these granules from disk, and repeat the distance calculation for all vectors in these granules.
This step is called rescoring, and while it can theoretically improve accuracy - remember that the vector similarity index returns only an approximate result - it is obviously not optimal in terms of performance.
ClickHouse therefore provides an optimization which disables rescoring and returns the most similar vectors and their distances directly from the index.
The optimization is enabled by default, see setting vector_search_with_rescoring.
The way it works at a high level is that ClickHouse makes the most similar vectors and their distances available as a virtual column _distances.
To see this, run a vector search query with EXPLAIN header = 1:
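A sketch (table and column names are illustrative); with the optimization active, the header of the read step in the output includes the virtual _distances column:

```sql
EXPLAIN header = 1
WITH [0.5, 1.5]::Array(Float32) AS reference_vector
SELECT id
FROM tab
ORDER BY L2Distance(vec, reference_vector)
LIMIT 3;
```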
A query run without rescoring (vector_search_with_rescoring = 0) and with parallel replicas enabled may fall back to rescoring.
Performance tuning
Tuning compression
In virtually all use cases, the vectors in the underlying column are dense and do not compress well. As a result, compression slows down inserts into and reads from the vector column. We therefore recommend disabling compression. To do that, specify CODEC(NONE) for the vector column like this:
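A sketch (table, column, and index names are illustrative):

```sql
CREATE TABLE tab
(
    id Int32,
    vec Array(Float32) CODEC(NONE),
    INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 2)
)
ENGINE = MergeTree
ORDER BY id;
```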
Tuning index caching
ClickHouse caches vector similarity indexes in memory. Log messages in the server log (system.text_log) indicate that a vector similarity index is being loaded. If such messages appear repeatedly for different vector search queries, this indicates that the cache size is too low.
The vector similarity index cache stores vector index granules.
If individual vector index granules are bigger than the cache size, they will not be cached.
Therefore, please make sure to calculate the vector index size (based on the formula in “Estimating storage and memory consumption” or system.data_skipping_indices) and size the cache correspondingly.
| Quantization | Name | Storage per dimension |
|---|---|---|
| f32 | Single precision | 4 bytes |
| f16 | Half precision | 2 bytes |
| bf16 (default) | Half precision (brain float) | 2 bytes |
| i8 | Quarter precision | 1 byte |
| b1 | Binary | 1 bit |
The most precise representation of the vectors in the index is full single precision (f32). However, on most datasets, half-precision brain float quantization (bf16) results in a negligible precision loss; therefore, vector similarity indexes use this quantization technique by default.
Quarter precision (i8) and binary (b1) quantization cause appreciable precision loss in vector searches. We recommend both quantizations only if the size of the vector similarity index is significantly larger than the available DRAM size.
In this case, we also suggest enabling rescoring (vector_search_index_fetch_multiplier, vector_search_with_rescoring) to improve accuracy.
Binary quantization is only recommended for 1) normalized embeddings (i.e. vector length = 1, OpenAI models are usually normalized), and 2) if the cosine distance is used as distance function.
Binary quantization internally uses the Hamming distance to construct and search the proximity graph.
The rescoring step uses the original full-precision vectors stored in the table to identify the nearest neighbors via cosine distance.
Tuning data transfer
The reference vector in a vector search query is provided by the user and generally retrieved by making a call to a Large Language Model (LLM).
Typical Python code which runs a vector search in ClickHouse might look like this
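A sketch of the naive pattern (the table and column names are assumptions, and the embedding is placeholder data); note how the vector ends up interpolated into the SQL text:

```python
# Placeholder embedding standing in for an LLM response; real embeddings
# from e.g. OpenAI models have 1536 or 3072 dimensions.
search_v = [i / 1536 for i in range(1536)]

# The vector is rendered into the query text, so the server receives one
# very long SQL string (table and column names are made up).
query = f"SELECT id FROM tab ORDER BY L2Distance(vec, {search_v}) LIMIT 10"

print(len(query))  # tens of thousands of characters
```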
The reference vector (search_v in the above snippet) can have a very large dimension.
For example, OpenAI provides models that generate embeddings vectors with 1536 or even 3072 dimensions.
In the above code, the ClickHouse Python driver substitutes the embedding vector with a human-readable string and subsequently sends the SELECT query entirely as a string.
Assuming the embedding vector consists of 1536 single-precision floating point values, the sent string reaches a length of 20 kB.
This causes high CPU usage for tokenizing, parsing, and performing thousands of string-to-float conversions.
Also, significant space is required in the ClickHouse server log file, causing bloat in system.query_log as well.
Note that most LLM models return an embedding vector as a list or NumPy array of native floats.
We therefore recommend that Python applications bind the reference vector parameter in binary form, using the following style:
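A sketch using clickhouse-connect (the official Python driver); the host, table, and column names are assumptions, and the query uses ClickHouse's server-side query parameters ({name:type}) so the vector is transferred as data rather than as SQL text:

```python
def search_neighbors(search_v):
    # clickhouse-connect sends 'parameters' separately from the query text,
    # so the embedding is not rendered into a giant SQL string.
    # Requires a running ClickHouse server; names here are illustrative.
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")
    result = client.query(
        "SELECT id FROM tab "
        "ORDER BY L2Distance(vec, {search_v:Array(Float32)}) LIMIT 10",
        parameters={"search_v": search_v},
    )
    return result.result_rows
```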
This keeps the query text short, avoids the string parsing overhead, and prevents bloat in the server log and system.query_log.
Administration and monitoring
The on-disk size of vector similarity indexes can be obtained from system.data_skipping_indices.
Differences to regular skipping indexes
Like all regular skipping indexes, vector similarity indexes are constructed over granules, and each indexed block consists of GRANULARITY = [N]-many granules ([N] = 1 by default for normal skipping indexes).
For example, if the primary index granularity of the table is 8192 (setting index_granularity = 8192) and GRANULARITY = 2, then each indexed block will contain 16384 rows.
However, data structures and algorithms for approximate neighbor search are inherently row-oriented.
They store a compact representation of a set of rows and also return rows for vector search queries.
This causes some rather unintuitive differences in the way vector similarity indexes behave compared to normal skipping indexes.
When a user defines a vector similarity index on a column, ClickHouse internally creates a vector similarity “sub-index” for each index block.
The sub-index is “local” in the sense that it only knows about the rows of its containing index block.
In the previous example, and assuming that a column has 65536 rows, we obtain four index blocks (each spanning two granules) and a vector similarity sub-index for each index block.
A sub-index is theoretically able to return the rows with the N closest points within its index block directly.
However, since ClickHouse loads data from disk to memory at the granularity of granules, sub-indexes extrapolate matching rows to granule granularity.
This is different from regular skipping indexes which skip data at the granularity of index blocks.
The GRANULARITY parameter determines how many vector similarity sub-indexes are created.
Bigger GRANULARITY values mean fewer but larger vector similarity sub-indexes, up to the point where a column (or a column’s data part) has only a single sub-index.
In that case, the sub-index has a “global” view of all column rows and can directly return all granules of the column (part) with relevant rows (there are at most LIMIT [N]-many such granules).
In a second step, ClickHouse will load these granules and identify the actually best rows by performing a brute-force distance calculation over all rows of the granules.
With a small GRANULARITY value, each of the sub-indexes returns up to LIMIT N-many granules.
As a result, more granules need to be loaded and post-filtered.
Note that the search accuracy is equally good in both cases; only the processing performance differs.
It is generally recommended to use a large GRANULARITY for vector similarity indexes and to fall back to smaller GRANULARITY values only in case of problems like excessive memory consumption of the vector similarity structures.
If no GRANULARITY was specified for vector similarity indexes, the default value is 100 million.
Example
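A sketch of a full example (the table, data, and names are illustrative):

```sql
CREATE TABLE tab
(
    id Int32,
    vec Array(Float32),
    INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance', 2)
)
ENGINE = MergeTree
ORDER BY id;

INSERT INTO tab
SELECT number, [toFloat32(number), toFloat32(number * 2)]
FROM numbers(65536);

WITH [100.0, 200.0]::Array(Float32) AS reference_vector
SELECT id
FROM tab
ORDER BY L2Distance(vec, reference_vector)
LIMIT 3;
```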
Queries:
Quantized Bit (QBit)
One common approach to speed up exact vector search is to use a lower-precision float data type. For example, if vectors are stored as Array(BFloat16) instead of Array(Float32), the data size is reduced by half, and query runtimes are expected to decrease proportionally.
This method is known as quantization. While it speeds up computation, it may reduce result accuracy despite performing an exhaustive scan of all vectors.
With traditional quantization, we lose precision both during search and when storing the data. In the example above, we would store BFloat16 instead of Float32, meaning we can never perform a more accurate search later, even if desired. One alternative approach is to store two copies of the data: quantized and full-precision. While this works, it requires redundant storage. Consider a scenario where we have Float64 as original data and want to run searches with different precision (16-bit, 32-bit, or full 64-bit). We would need to store three separate copies of the data.
ClickHouse offers the Quantized Bit (QBit) data type that addresses these limitations by:
- Storing the original full-precision data.
- Allowing quantization precision to be specified at query time.
To declare a column of QBit type, use the following syntax:
- element_type - the type of each vector element. Supported types are BFloat16, Float32, and Float64.
- dimension - the number of elements in each vector.
Creating a QBit Table and Adding Data
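A sketch (the table, dimension, and data are illustrative):

```sql
CREATE TABLE words
(
    word String,
    vec QBit(Float64, 8)
)
ENGINE = MergeTree
ORDER BY word;

INSERT INTO words VALUES
    ('lemon',  [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
    ('orange', [1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1, 8.1]);
```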
Vector Search with QBit
Let’s find the nearest neighbors to a vector representing the word ‘lemon’ using L2 distance. The third parameter in the distance function specifies the precision in bits - higher values provide more accuracy but require more computation.
You can find all available distance functions for QBit here.
Full precision search (64-bit):
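A sketch, assuming an illustrative words table with a QBit(Float64, 8) column vec and the L2DistanceTransposed function from the QBit distance function reference linked above; the third argument selects full 64-bit precision:

```sql
SELECT word
FROM words
ORDER BY L2DistanceTransposed(vec, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]::Array(Float64), 64)
LIMIT 2;
```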
Performance Considerations
The performance benefit of QBit comes from reduced I/O operations, as less data needs to be read from storage when using lower precision. Moreover, when the QBit contains Float32 data and the precision parameter is 16 or below, there are additional benefits from reduced computation. The precision parameter directly controls the trade-off between accuracy and speed:
- Higher precision (closer to the original data width): More accurate results, slower queries
- Lower precision: Faster queries with approximate results, reduced memory usage