Integrating Apache Spark with ClickHouse

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. There are two main ways to connect Apache Spark and ClickHouse:

Spark Connector - The Spark connector implements the DataSourceV2 API and provides its own catalog management. It is currently the recommended way to integrate ClickHouse and Spark.
Spark JDBC - Integrates Spark and ClickHouse through Spark's JDBC data source.

Both solutions have been tested and work with the Java, Scala, PySpark, and Spark SQL APIs.
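To make the two options concrete, here is a minimal PySpark-oriented sketch of the settings each one typically needs. The helper names (`clickhouse_catalog_conf`, `clickhouse_jdbc_options`, `apply_conf`) are hypothetical and introduced only for illustration. The catalog class `com.clickhouse.spark.ClickHouseCatalog`, the `spark.sql.catalog.<name>.*` keys, the driver class `com.clickhouse.jdbc.ClickHouseDriver`, and the HTTP port 8123 are assumptions based on recent connector and driver releases, so check them against the versions you deploy.

```python
from typing import Dict


def clickhouse_catalog_conf(
    host: str,
    http_port: int = 8123,
    user: str = "default",
    password: str = "",
    database: str = "default",
    catalog: str = "clickhouse",
) -> Dict[str, str]:
    """Build Spark conf entries that register a ClickHouse catalog.

    The key names and the catalog class are assumptions; verify them
    against the connector version on your classpath.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "com.clickhouse.spark.ClickHouseCatalog",
        f"{prefix}.host": host,
        f"{prefix}.protocol": "http",
        f"{prefix}.http_port": str(http_port),
        f"{prefix}.user": user,
        f"{prefix}.password": password,
        f"{prefix}.database": database,
    }


def clickhouse_jdbc_options(
    host: str,
    table: str,
    http_port: int = 8123,
    database: str = "default",
    user: str = "default",
    password: str = "",
) -> Dict[str, str]:
    """Build options for Spark's generic JDBC data source.

    url, driver, dbtable, user and password are standard Spark JDBC
    options; the driver class name is an assumption.
    """
    return {
        "url": f"jdbc:clickhouse://{host}:{http_port}/{database}",
        "driver": "com.clickhouse.jdbc.ClickHouseDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Apply each entry with builder.config(key, value).

    Accepts any object exposing a chainable config(key, value) method,
    such as pyspark.sql.SparkSession.builder.
    """
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed and the connector or JDBC driver JARs on the classpath, the connector route would look like `apply_conf(SparkSession.builder, clickhouse_catalog_conf("localhost")).getOrCreate()` followed by a Spark SQL query against `clickhouse.<database>.<table>`. The JDBC route would look like `spark.read.format("jdbc").options(**clickhouse_jdbc_options("localhost", "my_table")).load()`, where `my_table` is a placeholder table name.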

Spark Runtime Environments

Standard Spark runtimes

The Spark Connector works out of the box in environments that closely follow the upstream Apache Spark runtime, such as Amazon EMR or Kubernetes-based Spark deployments.

Managed Spark platforms

Platforms such as AWS Glue and Databricks introduce additional abstractions and environment-specific behavior. While the core integration remains the same, they may require dedicated configuration and setup steps. See the respective documentation pages for details.
