Enabling Feature Flags
To enable feature flags on the cluster, you can edit the clickhousecluster resource directly and update/add the values under spec.featureFlags. Note that changing some feature flags may initiate a rolling restart of the cluster.
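A minimal sketch of the relevant CR fragment (the two flags shown are the ones required for backups below; other flags follow the same pattern, and the surrounding fields are assumptions):

```yaml
spec:
  featureFlags:
    userInitiatedBackupsEnabled: true
    enableUseEnvironmentalCredentialsByDefault: true
```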
Backups on ClickHouse Private
Prerequisites
- An S3 bucket created in the same region as your cluster
- The IAM role tied to the Kubernetes service account must have read, write, and list access to this bucket
- Two feature flags must be enabled (set to true) on your cluster: userInitiatedBackupsEnabled and enableUseEnvironmentalCredentialsByDefault
Performing a Backup
From one of the server pods, you can run the following command to issue a backup job in the background (a sketch follows the variable list):
- $REGION is the S3 region, e.g. us-west-2
- $S3_BUCKET is the name of the S3 bucket created to hold the backups
- $CLUSTER_S3_PREFIX is the S3 key prefix of the cluster (should be distinct per cluster), set when the cluster was created. You can retrieve this from the clickhousecluster resource of your cluster under spec.s3.keyPrefix. It should be something like ch-s3-{uuid}.
- $BACKUP_ID is a unique identifier for the backup being taken. You can use any UUID for this value.
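A minimal sketch of such a backup statement, assuming the standard ClickHouse BACKUP ... TO S3 syntax and that credentials are picked up from the environment (the exact statement in your environment may differ):

```sql
-- Issued via clickhouse-client from one of the server pods; ASYNC runs the
-- backup in the background, and the id lets you track it in system.backups.
BACKUP ALL
TO S3('https://$S3_BUCKET.s3.$REGION.amazonaws.com/$CLUSTER_S3_PREFIX/$BACKUP_ID')
SETTINGS id = '$BACKUP_ID'
ASYNC;
```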
Backup Status
Once the backup command above is issued, it will run in the background asynchronously. You can check the status by querying the system.backups table for the id you provided. It’s important to verify that the status field is not in some error state. If it is, the backup has failed and will need to be reissued once the underlying error is resolved. Querying this table will also tell you when the backup has successfully completed.
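A sketch of the status check, keyed on the id supplied at backup time:

```sql
SELECT id, status, error, start_time, end_time
FROM system.backups
WHERE id = '$BACKUP_ID';
```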
Incremental Backups
You can also perform incremental backups, where a previous backup is used as a starting point to avoid recopying the same data on each backup. You can read more about it here: https://clickhouse.com/docs/operations/backup#take-an-incremental-backup.
Restoring from a Backup
To restore from a backup, run the command below in the cluster that you want to restore into. $RESTORE_ID is any unique identifier you want to give this restore operation.
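A minimal sketch mirroring the backup statement above (the same hedges apply):

```sql
RESTORE ALL
FROM S3('https://$S3_BUCKET.s3.$REGION.amazonaws.com/$CLUSTER_S3_PREFIX/$BACKUP_ID')
SETTINGS id = '$RESTORE_ID'
ASYNC;
```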
Operator
Overview
The clickhouse-operator is responsible for the provisioning and reconciliation of the registered clickhousecluster custom resources and the cluster lifecycle, including:
- deploying and terminating server and keeper components
- controlling cluster state (running/stopping/etc.)
- processing backup requests
- cleaning PVCs
- horizontally scaling the cluster
Architecture
Instance Lifecycle

MultiStatefulSet (aka MultiSTS)
MultiStatefulSet is a feature of the clickhouse-operator which enables the ClickHouse server pods to run with one StatefulSet being the owner of only one ClickHouse server pod (aka a replica). This differs from SingleSTS, where one StatefulSet owns all replicas.
SingleSTS
This is how a single StatefulSet looks:

ReplicaStateMap - How multiple statefulsets are tracked
MultiSTS replicas cannot rely on ordinals as a deterministic way to understand a pod’s age / lifecycle. Because we need a way to track the state of each StatefulSet, we store this inside a map as part of the CR’s status (a sample replicaStateMap is sketched after this list).
- The state can be Pending, Ready, Stopped, or Condemned.
- The ReplicaStateMap also marks one pod at any given time as the pod for which IsBackupPod will be true. Predictably, this pod is where backups will run.
- A backup replica is never marked as Condemned
- Once a replica’s name gets added to the map, the operator will ensure it goes from Pending to Ready state.
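A minimal sketch of what a replicaStateMap in the CR status might look like (replica names and the exact schema are illustrative, not the authoritative format):

```yaml
status:
  replicaStateMap:
    c-default-xx-01-server-0: Ready
    c-default-xx-01-server-1: Ready
    c-default-xx-01-server-2: Pending
```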
Parallel vs Rolling Reconciliation
The operator will do Parallel Pod Management if the only change in the StatefulSet spec has to do with changing the replica count. If the StatefulSet specs require something other than resizing the replica count, we no longer rely on parallel pod management: a second upgrade loop kicks in. The upgrade loop will look at the PDB (maxUnavailable) and start reconciling StatefulSets one by one, ensuring we never exceed the disruption budget.
Horizontal Scaling in MultiSTS mode
Scaling out in MultiSTS mode is simple. As soon as a new replica name gets added to the ReplicaStateMap, the operator will ensure that a StatefulSet gets created and reconciled. Subsequent reconcile loops will ensure that the newly created replica is up to date. Scaling in is considerably more involved: a replica which is scaling in might still be receiving traffic, so we follow multiple steps to ensure the replica scales in safely.
Condemned Replicas
We introduce the concept of a Condemned Replica. When the actual replica count has exceeded the desired count, it means we need to scale in. The replicas that get marked for deletion are condemned replicas. We change their state from Ready to Condemned inside the ReplicaStateMap. This ensures that further down the line, we remember which replicas we need to safely delete and remove from the map.
Scale-In
Now that we understand condemned replicas, here is the flow of scaling in:
- Remove the Topology Key (so this replica is no longer part of our TopologySpreadConstraint’s skew calculations).
- Make sure all StatefulSets are Ready. If not, we will not scale in (and continue to re-queue).
- Execute SYSTEM SYNC REPLICA table LIGHTWEIGHT for each table in each replicated database. Note: this command is executed on the replica marked for backups (since that replica is never condemned). See the SQL sketch after this list.
- Delete the condemned statefulsets, and remove them from the ReplicaStateMap.
- Execute SYSTEM DROP REPLICA $replicaName on Ready replicas to remove their information from Keeper.
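A hedged SQL sketch of the coordination steps above, assuming a replicated database db with a table t and a condemned replica named $replicaName:

```sql
-- Step 3: run on the backup replica (never condemned) for each replicated
-- table, so the condemned replica's data is fully replicated elsewhere.
SYSTEM SYNC REPLICA db.t LIGHTWEIGHT;

-- Step 5: after the condemned StatefulSets are deleted, run on a Ready
-- replica to remove the condemned replica's metadata from Keeper.
SYSTEM DROP REPLICA '$replicaName';
```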
Key Metrics
last_cluster_reconcile
- Gauge metric of the last time the app (CR name, e.g. c-default-xx-01) was reconciled
- Use this metric to determine if reconciles are occurring regularly
- Example alertmanager alert definition
controller_runtime_reconcile_errors_total
- Counter metric of the total number of reconciliation errors per controller
- Use this metric in conjunction with controller_runtime_reconcile_total to determine the error rate of reconciliation
- Example alertmanager alert definition
Common Issues
Changes were made to the CR but they aren’t being applied
If the ClickhouseCluster CR is not reconciled for a long time, it probably means one of the following:
- The operator is crashlooping. Check the operator log.
- ClickHouse pods are crashlooping. Check the keeper and server pods to find the reason why it’s happening.
- The CR has the clickhouse.com/skip-reconcile annotation.
Drop a server replica
Use this procedure if you want to remove a replica from the cluster without scaling in the cluster.
Add skip-reconcile
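A sketch of setting the annotation (CR name and namespace are placeholders):

```bash
kubectl annotate clickhousecluster <cr-name> -n <namespace> \
  clickhouse.com/skip-reconcile="true"
```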
The operator log should then contain a line like: Skip ClickhouseCluster reconcile req ... because it has clickhouse.com/skip-reconcile annotation
Remove From Replica State Map
We are going to remove the replica from the replica-state-map; it is tracked in the CR’s status.
Delete Statefulset
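A sketch covering this step and the previous one (names are placeholders; editing the status subresource is an assumption about where the map lives, and requires a recent kubectl):

```bash
# Remove the replica's entry from the replicaStateMap in the CR status.
kubectl edit clickhousecluster <cr-name> -n <namespace> --subresource=status

# Delete the StatefulSet that owns the replica (MultiSTS: one STS per replica).
kubectl delete statefulset <replica-name> -n <namespace>
```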
Remove skip-reconcile
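A sketch of removing the annotation (the trailing dash removes it):

```bash
kubectl annotate clickhousecluster <cr-name> -n <namespace> clickhouse.com/skip-reconcile-
```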
Verify
Log in to your ClickHouse cluster, and ensure the replica has been removed. If any leases this replica holds have not expired, the operator will retry removal. It should be cleaned up within 5 minutes.
Server pod is hanging during termination
Why is the CR not in a healthy “Running” state?
Use kubectl to check the Instance statuses.
ClickHouse Server
Overview
The clickhouse-server component is the main ClickHouse process that ingests, queries, stores, and processes data.
Key Metrics
The Grafana ClickHouse mixin provides access to many ClickHouse metrics in a prebuilt dashboard. Note that there is an existing prometheus.io/* set of annotations on the ClickHouse server pods. These will expose some metrics, but will not give you the ClickHouse_CustomMetrics_* metrics defined below. You should plan on setting up the :8123/metrics Prometheus endpoint as a scrape target on each of the server pods via a PodMonitor or equivalent. This endpoint requires authentication and should be authenticated with a dedicated user with read-only privileges.
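A hedged sketch of such a PodMonitor (labels, port name, and the credentials Secret are assumptions specific to your environment):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: clickhouse-server
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: clickhouse-server      # assumption: match your server pod labels
  podMetricsEndpoints:
    - port: http                  # assumption: the pod port name for :8123
      path: /metrics
      basicAuth:                  # the endpoint requires authentication
        username:
          name: clickhouse-metrics-user   # assumed Secret with the read-only user
          key: username
        password:
          name: clickhouse-metrics-user
          key: password
```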
ClickHouse_CustomMetric_NumberOfBrokenDetachedParts
- Gauge metric indicating the number of broken detached parts.
- Example alertmanager alert definition
ClickHouse_CustomMetric_LostPartCount
- Gauge metric indicating the number of lost parts, which indicates data loss. False positives are possible.
- Example alertmanager alert definition
ClickHouseErrorMetric_*
- Counter metric indicating the number of errors of the given error type.
- Example ClickHouseErrorMetric_CANNOT_WRITE_TO_FILE_DESCRIPTOR alertmanager alert definition
- Example ClickHouseErrorMetric_CHECKSUM_DOESNT_MATCH alertmanager alert definition
- Example ClickHouseErrorMetric_CORRUPTED_DATA alertmanager alert definition
- Example ClickHouseErrorMetric_LOGICAL_ERROR alertmanager alert definition
- Example ClickHouseErrorMetric_NOT_ENOUGH_DISK_SPACE alertmanager alert definition
- Example ClickHouseErrorMetric_POTENTIALLY_BROKEN_DATA_PART alertmanager alert definition
- Example ClickHouseErrorMetric_REPLICA_ALREADY_EXISTS alertmanager alert definition
ClickHouseMetrics_IsServerShuttingDown
- Gauge metric indicating if the ClickHouse server is shutting down
- Example alertmanager alert definition
ClickHouse_CustomMetric_TableReadOnlyDurationSeconds
- Timing gauge indicating how long a table has been in READONLY mode.
- Example alertmanager alert definition
Common Issues
Crashlooping Server Pods
Check the ClickHouse server pod logs. They should explain why the process is crashing. If it’s a result of something like memory pressure and Kubernetes is terminating the pod, check the Kubernetes events for more information.
Check CH Metrics with SQL
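Hedged sketches of the two checks described below, using the standard system.replication_queue schema:

```sql
-- Replication queue size per table (alert if greater than 100 for any table).
SELECT database, table, count() AS queue_size
FROM system.replication_queue
GROUP BY database, table
ORDER BY queue_size DESC;

-- Oldest replication queue entry per table (alert if older than 1 day).
SELECT database, table, min(create_time) AS oldest_entry
FROM system.replication_queue
GROUP BY database, table
ORDER BY oldest_entry ASC;
```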
Replication Queue Size per Table
Trigger: if this number is bigger than 100 for any table, we have to be alerted.
Replication Queue Oldest Entry per Table
Trigger: if this number is older than 1 day, we have to be alerted.
ClickHouse Keeper
Overview
The clickhouse-keeper component is a ZooKeeper-compatible distributed service that manages the distributed coordination between clickhouse-server replicas and is responsible for storing the metadata of the ClickHouse data. A PodMonitor for :8001/metrics should be created for keeper if you wish to capture metrics from the keeper pods.
Key Metrics
TODO
Common Issues
High ZNode Count
TODO
Alerting
In general, standard alerts should be set for things like crashlooping pods, unschedulable pods, and other infrastructure-related issues that may be particular to your environment. Below are examples of recommended alerting on the various components mentioned above.
Operator
Operator not reconciling Alert
Note that this alert can fire if you add the clickhouse.com/skip-reconcile annotation to your CRs as described here.
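A hedged sketch of such an alert, assuming last_cluster_reconcile is a unix-timestamp gauge (rule name, threshold, and labels are assumptions):

```yaml
- alert: ClickHouseOperatorNotReconciling
  # Fire if an app has not been reconciled for more than 15 minutes.
  expr: time() - last_cluster_reconcile > 900
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "ClickhouseCluster {{ $labels.app }} has not been reconciled recently"
```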
Operator reconciliation error Alert
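A hedged sketch (the error-rate threshold and window are assumptions):

```yaml
- alert: ClickHouseOperatorReconcileErrors
  # Error rate of reconciles per controller over the last 10 minutes.
  expr: |
    sum by (controller) (rate(controller_runtime_reconcile_errors_total[10m]))
      /
    sum by (controller) (rate(controller_runtime_reconcile_total[10m]))
      > 0.1
  for: 10m
  labels:
    severity: warning
```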
ClickHouse Server
Broken Detached Parts
On object-storage-backed disks (s3disk or s3diskWithCache), having some small number of broken detached parts may not always indicate an incident, because we may create files for parts but not have time to write to them during hard restarts. Hence, the 100 threshold.
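A hedged sketch of the corresponding alert (rule name and duration are assumptions; the 100 threshold comes from the note above):

```yaml
- alert: ClickHouseBrokenDetachedParts
  expr: ClickHouse_CustomMetric_NumberOfBrokenDetachedParts > 100
  for: 15m
  labels:
    severity: warning
```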
Mitigation:
First, wait some time to see if ClickHouseDataLoss has been triggered. If ClickHouseDataLoss has been triggered, proceed with investigating and mitigating it instead and, once fixed, verify that there are no more broken detached parts. Otherwise, reach out to ClickHouse support.
Data Loss
Understanding what data parts are lost
The alert for data loss uses lost_part_count in the system.replicas table. To understand how many parts were lost and in which tables, you can use the query sketched below. In the logs, lost parts are reported with the message Part * is lost forever.
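A hedged sketch of the query, using lost_part_count from system.replicas:

```sql
-- Tables that have lost parts, with the number of parts lost per table.
SELECT database, table, lost_part_count
FROM system.replicas
WHERE lost_part_count > 0
ORDER BY lost_part_count DESC;
```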
If you’re investigating a possible data loss that happened a long time ago, you should also look for logs like Dropping table with non-zero lost_part_count equal to .
Finding logs related to lost parts
There are several options for where you can check logs:
- The /var/log/clickhouse-server/ directory on the pod contains archives with the most recent logs. You can use zgrep to look up log messages. If there is a lot of activity on the ClickHouse server, log files may rotate very fast.
- The system.text_log table. The TTL for this table on the cloud is 30 days. You may use the SQL query sketched below:
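A hedged sketch (replace the hypothetical <part_name> placeholder with the lost part's name):

```sql
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE message LIKE '%<part_name>%'
ORDER BY event_time;
```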
Understanding the history of the lost parts
After collecting the list of lost parts, the next step is to understand what happened to these parts. Pick any data part from the list and find all logs related to it, up to the message Part * is lost forever. Note: all log messages after the part is lost forever are irrelevant (so if you see that the part was finally found on some replica, it’s actually an empty part that was created to replace the lost one).
Check if the part should have been dropped anyway (in that case there is a high chance of a false positive):
- Check if the table has TTL, and check if the lost part should have been dropped anyway due to TTL.
- Check system.query_log if there were TRUNCATE or DROP PARTITION queries that should have dropped the lost parts.
- If the part was detached as broken - try to figure out why it was broken.
- If you see The specified key does not exist, you should search for all logs with the blob name, and find when it was removed and why. Also, check log messages about zero-copy locks.
If all part is lost forever errors on the instance happened in the same table around the same time, they very likely have the same cause. If not, pick a part from another group, and repeat (it might be lost for a different reason).
Cannot Write to File Descriptor
Checksum Doesn’t Match
Corrupted Data
Logical Errors
Not Enough Space
Broken Parts Detected on Select
Replica Already Exists
ClickHouse server Stuck Shutting Down
Table Replicas Read Only
Incident Runbooks
Data loss/corruption. ClickHouseBrokenPartDetectedOnSelect
Reason
ClickHouseBrokenPartDetectedOnSelect is triggered when an SMT data part read fails with a (probably) non-retriable error.
The alert is triggered when the POTENTIALLY_BROKEN_DATA_PART exception is thrown.
Mitigation
Examine the logs for the POTENTIALLY_BROKEN_DATA_PART exception. If the alert has been triggered, it must be there. If it is not there for some reason, you may also check system.errors.
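A sketch of checking system.errors for this error:

```sql
SELECT name, value, last_error_time, last_error_message
FROM system.errors
WHERE name = 'POTENTIALLY_BROKEN_DATA_PART';
```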
It should then be apparent what exactly went wrong from the exception message and the stack trace.
SMT. ClickHouseTableReplicasReadOnly
Reason
ClickHouseTableReplicasReadOnly is triggered if a table has been in read-only for at least one hour.
This now excludes tables in *_broken_replicated_tables and *_broken_tables databases.
It could be a DROP gone badly.
Mitigation
Check if there are still read-only tables anywhere in the cluster (a query sketch follows). The table may have gone through StorageSharedMergeTree::shutdown but for some reason kept the storage object and did not destruct it. To confirm / investigate the reason for this, you can search the text logs with the logger name set to the table name.
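A hedged sketch of such a check, assuming the cluster is named default:

```sql
-- Read-only table replicas across all replicas of the cluster.
SELECT hostName() AS host, database, table
FROM clusterAllReplicas('default', system.replicas)
WHERE is_readonly;
```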
Sometimes the problem is trivial and can be fixed with a simple replica restart.
First, try running SYSTEM RESTART REPLICA for the affected tables. You can get the table names from the query mentioned above.
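For example, assuming an affected table db.t:

```sql
SYSTEM RESTART REPLICA db.t;
```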
SMT. ClickHouseReplicaAlreadyExists
Reason
ClickHouseReplicaAlreadyExists is triggered whenever an exception with the REPLICA_ALREADY_EXISTS error occurs on the instance.
Such exceptions occur when we fail to create a replica of a replicated table (SMT or RMT) because an existing replica is already associated with the path.
Mitigation
This is unlikely to be caused by a user error. In the past, a user could try to create two tables using the same ZooKeeper path, but now we prohibit such behaviour using the database_replicated_allow_explicit_uuid setting.
This is likely to be a bug in the Replicated database or Shared Catalog.
Misc. ClickHouseCannotWriteToFileDescriptor
Reason
ClickHouseCannotWriteToFileDescriptor is triggered when an exception with the CANNOT_WRITE_TO_FILE_DESCRIPTOR error occurs on the instance.
The exception is thrown when there is not enough space on the cache disk for a new cache entry or for external data processing (e.g., external aggregation, external joins).
There may be a misconfiguration issue where a disk was created with less space than was requested in the CR config.
Mitigation
Important: There is a known issue with tracking the cache disk usage if the join_algorithm = 'partial_merge' query setting is specified. So check on this first.
To confirm if the issue is in misconfiguration or in disk usage tracking, do the following:
- Run kubectl exec -n <namespace> -it <pod> -- /bin/bash to connect to the pod.
- Run df -h to see the cache disk size.
- Run select path, max_size from system.filesystem_cache_settings to see the required cache disk size. Note that we normally have different caches (e.g., s3diskWithCache, diskPlainRewritableForSystemTablesWithCache) sharing the same path (i.e., /mnt/clickhouse-cache/sharedS3DiskCache).
- If the actual disk size is smaller, the issue is in misconfiguration. Reach out to the Data Plane Operator team.
- Otherwise, it’s likely a bug in tracking the cache disk usage. To investigate this, you can try searching through system.filesystem_cache.