If you’re using ClickHouse Cloud on Google Cloud, this page doesn’t apply, as your services will already be using Google Cloud Storage. If you’re looking to SELECT or INSERT data from GCS, please see the gcs table function.

GCS backed MergeTree
Creating a disk
To utilize a GCS bucket as a disk, we must first declare it within the ClickHouse configuration in a file under conf.d. An example of a GCS disk declaration is shown below. This configuration includes multiple sections to configure the GCS “disk”, the cache, and the policy that is specified in DDL queries when tables are to be created on the GCS disk. Each of these is described below.
Storage configuration > disks > gcs
This part of the configuration specifies that:
- Batch deletes aren’t to be performed. GCS doesn’t currently support batch deletes, so autodetection is disabled to suppress error messages.
- The type of the disk is s3 because the S3 API is in use.
- The endpoint as provided by GCS
- The service account HMAC key and secret
- The metadata path on the local disk
Storage configuration > disks > cache
The example configuration below enables a 10Gi memory cache for the disk gcs.
Storage configuration > policies > gcs_main
Storage configuration policies allow choosing where data is stored. The policy shown below allows data to be stored on the disk gcs by specifying the policy gcs_main. For example, CREATE TABLE ... SETTINGS storage_policy='gcs_main'.
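A minimal sketch of what the complete declaration under conf.d might look like, combining the three sections described above. The endpoint, HMAC credentials, local paths, and the cache disk name gcs_cache are placeholders or assumptions to adapt to your environment:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <gcs>
                <!-- Batch deletes are disabled because GCS does not support them -->
                <support_batch_delete>false</support_batch_delete>
                <!-- type is s3 because the S3-compatible XML API is used to reach GCS -->
                <type>s3</type>
                <!-- Endpoint as provided by GCS: bucket name plus an optional folder -->
                <endpoint>https://storage.googleapis.com/BUCKET_NAME/FOLDER/</endpoint>
                <access_key_id>SERVICE_ACCOUNT_HMAC_KEY</access_key_id>
                <secret_access_key>SERVICE_ACCOUNT_HMAC_SECRET</secret_access_key>
                <!-- Metadata path on the local disk -->
                <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
            </gcs>
            <gcs_cache>
                <!-- 10Gi cache in front of the gcs disk -->
                <type>cache</type>
                <disk>gcs</disk>
                <path>/var/lib/clickhouse/disks/gcs_cache/</path>
                <max_size>10Gi</max_size>
            </gcs_cache>
        </disks>
        <policies>
            <gcs_main>
                <volumes>
                    <main>
                        <disk>gcs</disk>
                    </main>
                </volumes>
            </gcs_main>
        </policies>
    </storage_configuration>
</clickhouse>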
Creating a table
Assuming you have configured your disk to use a bucket with write access, you should be able to create a table such as in the example below. For purposes of brevity, we use a subset of the NYC taxi columns and stream data directly to the GCS-backed table:
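A sketch of what that might look like. The table name trips_gcs, the trimmed column list, and the source URL are illustrative placeholders rather than the exact example from this page:

```sql
CREATE TABLE trips_gcs
(
    trip_id            UInt32,
    pickup_date        Date,
    pickup_datetime    DateTime,
    dropoff_datetime   DateTime,
    pickup_longitude   Float64,
    pickup_latitude    Float64,
    dropoff_longitude  Float64,
    dropoff_latitude   Float64,
    passenger_count    UInt8,
    trip_distance      Float64,
    tip_amount         Float32,
    total_amount       Float32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime
-- The policy name must match the policy declared in the storage configuration
SETTINGS storage_policy = 'gcs_main';

-- Stream data straight into the GCS-backed table; replace the URL with any publicly
-- readable NYC taxi extract whose columns match the table definition above
INSERT INTO trips_gcs
SELECT *
FROM s3('https://example-datasets.s3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames')
LIMIT 1000000;
```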
Handling replication

Replication with GCS disks can be accomplished by using the ReplicatedMergeTree table engine. See the replicating a single shard across two GCP regions using GCS guide for details.
Learn more
The Cloud Storage XML API is interoperable with some tools and libraries that work with services such as Amazon Simple Storage Service (Amazon S3). For further information on tuning threads, see Optimizing for Performance.

Using Google Cloud Storage (GCS)
Plan the deployment
This tutorial describes a replicated ClickHouse deployment running in Google Cloud and using Google Cloud Storage (GCS) as the ClickHouse storage disk “type”. In the tutorial, you will deploy ClickHouse server nodes in Google Compute Engine VMs, each with an associated GCS bucket for storage. Replication is coordinated by a set of ClickHouse Keeper nodes, also deployed as VMs. Sample requirements for high availability:

- Two ClickHouse server nodes, in two GCP regions
- Two GCS buckets, deployed in the same regions as the two ClickHouse server nodes
- Three ClickHouse Keeper nodes, two of which are deployed in the same regions as the ClickHouse server nodes. The third can be in the same region as one of the first two Keeper nodes, but in a different availability zone.
Prepare virtual machines
Deploy five VMs in three regions:

| Region | ClickHouse Server | Bucket | ClickHouse Keeper |
|---|---|---|---|
| 1 | chnode1 | bucket_regionname | keepernode1 |
| 2 | chnode2 | bucket_regionname | keepernode2 |
| 3 * | | | keepernode3 |

* This can be a different availability zone in the same region as 1 or 2.
Deploy ClickHouse
Deploy ClickHouse on two hosts; in the sample configurations these are named chnode1 and chnode2.

Place chnode1 in one GCP region and chnode2 in a second. In this guide us-east1 and us-east4 are used for the Compute Engine VMs, and also for the GCS buckets.
Don’t start clickhouse server until after it is configured. Just install it.

Deploy ClickHouse Keeper
Deploy ClickHouse Keeper on three hosts; in the sample configurations these are named keepernode1, keepernode2, and keepernode3. keepernode1 can be deployed in the same region as chnode1, keepernode2 with chnode2, and keepernode3 in either region, but in a different availability zone from the ClickHouse node in that region.
Refer to the installation instructions when performing the deployment steps on the ClickHouse Keeper nodes.
Create two buckets
The two ClickHouse servers will be located in different regions for high availability. Each will have a GCS bucket in the same region. In Cloud Storage > Buckets choose CREATE BUCKET. For this tutorial two buckets are created, one in each of us-east1 and us-east4. The buckets are single region, standard storage class, and not public. When prompted, enable public access prevention. Don’t create folders; they will be created when ClickHouse writes to the storage.
If you need step-by-step instructions to create buckets and an HMAC key, then expand Create GCS buckets and an HMAC key and follow along:
Create GCS buckets and an HMAC key
Create one bucket in each region:
- ch_bucket_us_east1
- ch_bucket_us_east4
Generate an access key
Create a service account HMAC key and secret
Open Cloud Storage > Settings > Interoperability and either choose an existing Access key, or CREATE A KEY FOR A SERVICE ACCOUNT. This guide covers the path for creating a new key for a new service account.

Add a new service account
If this is a project with no existing service account, CREATE NEW ACCOUNT. There are three steps to creating the service account; in the first step, give the account a meaningful name, ID, and description. In the Interoperability settings dialog, the IAM role Storage Object Admin is recommended; select that role in step two. Step three is optional and not used in this guide. You may allow users to have these privileges based on your policies. The service account HMAC key will be displayed. Save this information, as it will be used in the ClickHouse configuration.

Configure ClickHouse Keeper
All of the ClickHouse Keeper nodes have the same configuration file except for the server_id line. Modify the file with the hostnames for your ClickHouse Keeper servers, and on each of the servers set the server_id to match the appropriate server entry in the raft_configuration. Since this example has server_id set to 3, the matching entry is the server with id 3 in the raft_configuration.
- Edit the file with your hostnames, and make sure that they resolve from the ClickHouse server nodes and the Keeper nodes
- Copy the file into place (/etc/clickhouse-keeper/keeper_config.xml) on each of the Keeper servers
- Edit the server_id on each machine, based on its entry number in the raft_configuration
/etc/clickhouse-keeper/keeper_config.xml
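A minimal sketch of what this file might look like on keepernode3. The hostnames are placeholders, and the log paths, ports, and timeout values are typical defaults rather than values from the original page:

```xml
<clickhouse>
    <logger>
        <level>trace</level>
        <log>/var/log/clickhouse-keeper/clickhouse-keeper.log</log>
        <errorlog>/var/log/clickhouse-keeper/clickhouse-keeper.err.log</errorlog>
    </logger>
    <listen_host>0.0.0.0</listen_host>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <!-- server_id must match this host's entry in raft_configuration; this file is for keepernode3 -->
        <server_id>3</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>warning</raft_logs_level>
        </coordination_settings>
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>keepernode1.example.internal</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>keepernode2.example.internal</hostname>
                <port>9234</port>
            </server>
            <!-- The entry matching server_id 3 above -->
            <server>
                <id>3</id>
                <hostname>keepernode3.example.internal</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
```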
Configure ClickHouse server
Best practice: Some of the steps in this guide will ask you to place a configuration file in /etc/clickhouse-server/config.d/. This is the default location on Linux systems for configuration override files. When you put these files into that directory, ClickHouse will merge the content with the default configuration. By placing these files in the config.d directory, you will avoid losing your configuration during an upgrade.

Networking
By default, ClickHouse listens on the loopback interface; in a replicated setup, networking between the machines is necessary. Listen on all interfaces:

/etc/clickhouse-server/config.d/network.xml
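A sketch of that override; 0.0.0.0 binds all IPv4 interfaces, and you can restrict it to a specific address if your environment requires it:

```xml
<clickhouse>
    <!-- Listen on all IPv4 interfaces so the other replica and the Keeper nodes can reach this server -->
    <listen_host>0.0.0.0</listen_host>
</clickhouse>
```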
Remote ClickHouse Keeper servers
Replication is coordinated by ClickHouse Keeper. This configuration file identifies the ClickHouse Keeper nodes by hostname and port number.
- Edit the hostnames to match your Keeper hosts
/etc/clickhouse-server/config.d/use-keeper.xml
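A sketch, assuming the Keeper TCP port 9181 from the Keeper configuration sketch above; the hostnames are placeholders:

```xml
<clickhouse>
    <zookeeper>
        <node index="1">
            <host>keepernode1.example.internal</host>
            <port>9181</port>
        </node>
        <node index="2">
            <host>keepernode2.example.internal</host>
            <port>9181</port>
        </node>
        <node index="3">
            <host>keepernode3.example.internal</host>
            <port>9181</port>
        </node>
    </zookeeper>
</clickhouse>
```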
Remote ClickHouse servers
This file configures the hostname and port of each ClickHouse server in the cluster. The default configuration file contains sample cluster definitions; in order to show only the clusters that are completely configured, the tag replace="true" is added to the remote_servers entry so that when this configuration is merged with the default it replaces the remote_servers section instead of adding to it.
- Edit the file with your hostnames, and make sure that they resolve from the ClickHouse server nodes
/etc/clickhouse-server/config.d/remote-servers.xml
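A sketch; the cluster name cluster_1S_2R (one shard, two replicas) and the hostnames are assumptions to substitute with your own:

```xml
<clickhouse>
    <remote_servers replace="true">
        <cluster_1S_2R>
            <shard>
                <replica>
                    <host>chnode1.example.internal</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>chnode2.example.internal</host>
                    <port>9000</port>
                </replica>
            </shard>
        </cluster_1S_2R>
    </remote_servers>
</clickhouse>
```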
Replica identification
This file configures settings related to the ClickHouse Keeper path, specifically the macros used to identify which replica the data is part of. On one server the replica should be specified as replica_1, and on the other server replica_2. The names can be changed; based on our example of one replica being stored in South Carolina and the other in Northern Virginia, the values could be carolina and virginia. Just make sure that they’re different on each machine.
/etc/clickhouse-server/config.d/macros.xml
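A sketch for chnode1, assuming the cluster name from the remote-servers sketch above; on chnode2, set replica to replica_2:

```xml
<clickhouse>
    <macros>
        <cluster>cluster_1S_2R</cluster>
        <shard>1</shard>
        <!-- Use replica_2 on the other server -->
        <replica>replica_1</replica>
    </macros>
</clickhouse>
```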
Storage in GCS
ClickHouse storage configuration includes disks and policies. The disk being configured below is named gcs and is of type s3. The type is s3 because ClickHouse accesses the GCS bucket as if it were an AWS S3 bucket. Two copies of this configuration will be needed, one for each of the ClickHouse server nodes.
These substitutions should be made in the configuration below.

These substitutions differ between the two ClickHouse server nodes:
- REPLICA 1 BUCKET should be set to the name of the bucket in the same region as the server
- REPLICA 1 FOLDER should be changed to replica_1 on one of the servers, and replica_2 on the other

These substitutions are common to both nodes:
- The access_key_id should be set to the HMAC key generated earlier
- The secret_access_key should be set to the HMAC secret generated earlier
/etc/clickhouse-server/config.d/storage.xml
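A sketch of the per-node storage configuration. The placeholder endpoint and credentials correspond to the substitutions listed above; the cache disk (named cache, 10Gi, matching the verification section later) and the local paths are assumptions:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <gcs>
                <support_batch_delete>false</support_batch_delete>
                <type>s3</type>
                <!-- REPLICA 1 BUCKET and REPLICA 1 FOLDER as described above -->
                <endpoint>https://storage.googleapis.com/REPLICA_1_BUCKET/REPLICA_1_FOLDER/</endpoint>
                <access_key_id>SERVICE_ACCOUNT_HMAC_KEY</access_key_id>
                <secret_access_key>SERVICE_ACCOUNT_HMAC_SECRET</secret_access_key>
                <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
            </gcs>
            <cache>
                <!-- Local cache in front of the gcs disk -->
                <type>cache</type>
                <disk>gcs</disk>
                <path>/var/lib/clickhouse/disks/gcs_cache/</path>
                <max_size>10Gi</max_size>
            </cache>
        </disks>
        <policies>
            <gcs_main>
                <volumes>
                    <main>
                        <disk>gcs</disk>
                    </main>
                </volumes>
            </gcs_main>
        </policies>
    </storage_configuration>
</clickhouse>
```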
Start ClickHouse Keeper
Use the commands for your operating system, for example:
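On a systemd-based distribution, enabling and starting the standalone Keeper service might look like this:

```bash
sudo systemctl enable clickhouse-keeper
sudo systemctl start clickhouse-keeper
sudo systemctl status clickhouse-keeper
```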
Check ClickHouse Keeper status

Send commands to ClickHouse Keeper with netcat. For example, mntr returns the state of the ClickHouse Keeper cluster. If you run the command on each of the Keeper nodes, you will see that one is a leader and the other two are followers:
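For example, assuming the Keeper TCP port 9181 from the configuration sketch above:

```bash
echo mntr | nc localhost 9181
```

The zk_server_state field in the output reports whether the node is a leader or a follower.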
Start ClickHouse server
On chnode1 and chnode2 run:
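For example, with systemd:

```bash
sudo systemctl enable clickhouse-server
sudo systemctl start clickhouse-server
sudo systemctl status clickhouse-server
```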
Verification
Verify disk configuration
system.disks should contain records for each disk:
- default
- gcs
- cache
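For example, a query along these lines on either node should list all three:

```sql
SELECT name, path, free_space, total_space
FROM system.disks;
```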
Verify that tables created on the cluster are created on both nodes
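A sketch of such a check, assuming the cluster name cluster_1S_2R and the macros from the sketches above; the table name trips and its columns are illustrative:

```sql
-- Run on chnode1; ON CLUSTER propagates the DDL to chnode2 as well
CREATE TABLE trips ON CLUSTER 'cluster_1S_2R'
(
    trip_id         UInt32,
    pickup_date     Date,
    pickup_datetime DateTime,
    total_amount    Float32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/trips', '{replica}')
ORDER BY pickup_datetime
SETTINGS storage_policy = 'gcs_main';

-- Then confirm on each node that the table exists
SHOW CREATE TABLE trips;
```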
Verify that data can be inserted
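For example, insert a row on one node and count it from the other (the values are illustrative):

```sql
-- On chnode1
INSERT INTO trips VALUES (1, '2024-01-01', '2024-01-01 00:01:00', 12.50);

-- On chnode2, once replication has caught up
SELECT count() FROM trips;
```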
Verify that the storage policy gcs_main is used for the table.
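For example:

```sql
SELECT engine, data_paths, storage_policy
FROM system.tables
WHERE name = 'trips'
FORMAT Vertical;
```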
Verify in Google Cloud console
Looking at the buckets, you will see that a folder was created in each bucket with the name that was used in the storage.xml configuration file. Expand the folders and you will see many files, representing the data partitions.