Documentation Index
Fetch the complete documentation index at: https://private-7c7dfe99-page-updates.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Designing your schema
While schema inference can be used to establish an initial schema for JSON data and query JSON data files in place, e.g., in S3, you should aim to establish an optimized versioned schema for your data. We discuss the recommended approach for modeling JSON structures below.Static vs dynamic JSON
The principal task on defining a schema for JSON is to determine the appropriate type for each key’s value. We recommended users apply the following rules recursively on each key in the JSON hierarchy to determine the appropriate type for each key.- Primitive types - If the key’s value is a primitive type, irrespective of whether it is part of a sub-object or on the root, ensure you select its type according to general schema design best practices and type optimization rules. Arrays of primitives, such as
phone_numbersbelow, can be modeled asArray(<type>)e.g.,Array(String). - Static vs dynamic - If the key’s value is a complex object i.e. either an object or an array of objects, establish whether it is subject to change. Objects that rarely have new keys, where the addition of a new key can be predicted and handled with a schema change via
ALTER TABLE ADD COLUMN, can be considered static. This includes objects where only a subset of the keys may be provided on some JSON documents. Objects where new keys are added frequently and/or aren’t predictable should be considered dynamic. The exception here is structures with hundreds or thousands of sub keys which can be considered dynamic for convenience purposes.
- The root keys
name,username,email,websitecan be represented as typeString. The columnphone_numbersis an Array primitive of typeArray(String), withdobandidtypeDateandUInt32respectively. - New keys won’t be added to the
addressobject (only new address objects), and it can thus be considered static. If we recurse, all of the sub-columns can be considered primitives (and typeString) exceptgeo. This is also a static structure with twoFloat32columns,latandlon. - The
tagscolumn is dynamic. We assume new arbitrary tags can be added to this object of any type and structure. - The
companyobject is static and will always contain at most the 3 keys specified. The subkeysnameandcatchPhraseare of typeString. The keylabelsis dynamic. We assume new arbitrary tags can be added to this object. Values will always be key-value pairs of type string.
Structures with hundreds or thousands of static keys can be considered dynamic, as it is rarely realistic to statically declare the columns for these. However, where possible skip paths which aren’t needed to save both storage and inference overhead.
Handling static structures
We recommend static structures are handled using named tuples i.e.Tuple. Arrays of objects can be held using arrays of tuples i.e. Array(Tuple). Within tuples themselves, columns and their respective types should be defined using the same rules. This can result in nested Tuples to represent nested objects as shown below.
To illustrate this, we use the earlier JSON person example, omitting the dynamic objects:
company column is defined as a Tuple(catchPhrase String, name String). The address key uses an Array(Tuple), with a nested Tuple to represent the geo column.
JSON can be inserted into this table in its current structure:
address.street column is returned as an Array. To query a specific object inside an array by position, the array offset should be specified after the column name. For example, to access the street from the first address:
24.12:
Handling default values
Even if JSON objects are structured, they’re often sparse with only a subset of the known keys provided. Fortunately, theTuple type doesn’t require all columns in the JSON payload. If not provided, default values will be used.
Consider our earlier people table and the following sparse JSON, missing the keys suite, geo, phone_numbers, and catchPhrase.
Differentiating empty and nullIf you need to differentiate between a value being empty and not provided, the Nullable type can be used. This should be avoided unless absolutely required, as it will negatively impact storage and query performance on these columns.
Handling new columns
While a structured approach is simplest when the JSON keys are static, this approach can still be used if the changes to the schema can be planned, i.e., new keys are known in advance, and the schema can be modified accordingly. Note that ClickHouse will, by default, ignore JSON keys that are provided in the payload and aren’t present in the schema. Consider the following modified JSON payload with the addition of anickname key:
nickname key ignored:
ALTER TABLE ADD COLUMN command. A default can be specified via the DEFAULT clause, which will be used if it isn’t specified during the subsequent inserts. Rows for which this value isn’t present (as they were inserted prior to its creation) will also return this default value. If no DEFAULT value is specified, the default value for the type will be used.
For example:
Handling semi-structured/dynamic structures
If JSON data is semi-structured where keys can be dynamically added and/or have multiple types, theJSON type is recommended.
More specifically, use the JSON type when your data:
- Has unpredictable keys that can change over time.
- Contains values with varying types (e.g., a path might sometimes contain a string, sometimes a number).
- Requires schema flexibility where strict typing isn’t viable.
- You have hundreds or even thousands of paths which are static but simply not realistic to declare explicitly. This tends to be a rare.
company.labels object was determined to be dynamic.
Let’s suppose that company.labels contains arbitrary keys. Additionally, the type for any key in this structure may not be consistent between rows. For example:
company.labels column between objects, with respect to keys and types, we have several options to model this data:
- Single JSON column - represents the entire schema as a single
JSONcolumn, allowing all structures to be dynamic beneath this. - Targeted JSON column - only use the
JSONtype for thecompany.labelscolumn, retaining the structured schema used above for all other columns.
- Data validation – enforcing a strict schema avoids the risk of column explosion, outside of specific structures.
- Avoids risk of column explosion - Although the JSON type scales to potentially thousands of columns, where subcolumns are stored as dedicated columns, this can lead to a column file explosion where an excessive number of column files are created that impacts performance. To mitigate this, the underlying Dynamic type used by JSON offers a
max_dynamic_pathsparameter, which limits the number of unique paths stored as separate column files. Once the threshold is reached, additional paths are stored in a shared column file using a compact encoded format, maintaining performance and storage efficiency while supporting flexible data ingestion. Accessing this shared column file is, however, not as performant. Note, however, that the JSON column can be used with type hints. “Hinted” columns will deliver the same performance as dedicated columns. - Simpler introspection of paths and types - Although the JSON type supports introspection functions to determine the types and paths that have been inferred, static structures can be simpler to explore e.g. with
DESCRIBE.
Single JSON column
This approach is useful for prototyping and data engineering tasks. For production, try useJSON only for dynamic sub structures where necessary.
Performance considerationsA single JSON column can be optimized by skipping (not storing) JSON paths that aren’t required and by using type hints. Type hints allow the user to explicitly define the type for a sub-column, thereby skipping inference and indirection processing at query time. This can be used to deliver the same performance as if an explicit schema was used. See “Using type hints and skipping paths” for further details.
We provide a type hint for the
username column in the JSON definition as we use it in the ordering/primary key. This helps ClickHouse know this column won’t be null and ensures it knows which username sub-column to use (there may be multiple for each type, so this is ambiguous otherwise).JSONAsObject format:
. notation e.g.
NULL.
Additionally, a separate sub column is created for paths with the same type. For example, a subcolumn exists for company.labels.type of both String and Array(Nullable(String)). While both will be returned where possible, we can target specific sub-columns using .: syntax:
^ is required. This is a design choice to avoid reading a high number of columns - unless explicitly requested. Objects accessed without ^ will return NULL as shown below:
Targeted JSON column
While useful in prototyping and data engineering challenges, we recommend using an explicit schema in production where possible. Our previous example can be modeled with a singleJSON column for the company.labels column.
JSONEachRow format:
company.labels column.
Using type hints and skipping paths
Type hints allow us to specify the type for a path and its sub-column, preventing unnecessary type inference. Consider the following example where we specify the types for the JSON keysdissolved, employees, and founded within the JSON column company.labels
SKIP and SKIP REGEXP parameters in order to minimize storage and avoid unnecessary inference on unneeded paths. For example, suppose we use a single JSON column for the above data. We can skip the address and company paths:
Optimizing performance with type hints
Type hints offer more than just a way to avoid unnecessary type inference - they eliminate storage and processing indirection entirely, as well as allowing optimal primitive types to be specified. JSON paths with type hints are always stored just like traditional columns, bypassing the need for discriminator columns or dynamic resolution during query time. This means that with well-defined type hints, nested JSON keys achieve the same performance and efficiency as if they were modeled as top-level columns from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline.Configuring dynamic paths
ClickHouse stores each JSON path as a subcolumn in a true columnar layout, enabling the same performance benefits seen with traditional columns—such as compression, SIMD-accelerated processing, and minimal disk I/O. Each unique path and type combination in your JSON data can become its own column file on disk. For example, when two JSON paths are inserted with differing types, ClickHouse stores the values of each concrete type in distinct sub-columns. These sub-columns can be accessed independently, minimizing unnecessary I/O. Note that when querying a column with multiple types, its values are still returned as a single columnar response. Additionally, by leveraging offsets, ClickHouse ensures that these sub-columns remain dense, with no default values stored for absent JSON paths. This approach maximizes compression and further reduces I/O. However, in scenarios with high-cardinality or highly variable JSON structures—such as telemetry pipelines, logs, or machine-learning feature stores - this behavior can lead to an explosion of column files. Each new unique JSON path results in a new column file, and each type variant under that path results in an additional column file. While this is optimal for read performance, it introduces operational challenges: file descriptor exhaustion, increased memory usage, and slower merges due to a high number of small files. To mitigate this, ClickHouse introduces the concept of an overflow subcolumn: once the number of distinct JSON paths exceeds a threshold, additional paths are stored in a single shared file using a compact encoded format. This file is still queryable but doesn’t benefit from the same performance characteristics as dedicated subcolumns. This threshold is controlled by themax_dynamic_paths parameter in the JSON type declaration.
SKIP parameters to restrict what’s stored.
For users curious about the implementation of this new column type, we recommend reading our detailed blog post “A New Powerful JSON Data Type for ClickHouse”.