The machine learning data layer
You’ve probably heard the lore that 80% of a machine learning practitioner’s time is spent cleaning data. Whether or not this myth holds true, data is at the heart of the machine learning problem from start to finish. Whether you’re building RAG pipelines, fine-tuning, training your own model, or evaluating model performance, data is the root of each problem.
Managing data can be tricky, and as a byproduct, the space has experienced a proliferation of tools that are designed to boost productivity by solving a specific slice of a machine learning data problem. Oftentimes, this takes shape as a layer of abstraction around a more general-purpose solution, with an opinionated interface that, on the surface, makes it easier to apply to the specific subproblem at hand. In effect, this trades the flexibility of a general-purpose solution for ease of use and simplicity on a specific task.
There are several drawbacks to this approach. A cascading suite of specialized tools, products, and services, in contrast with a general-purpose solution coupled with supporting application code, presents the risk of greater architectural complexity and data costs than necessary. It’s easy to accidentally find yourself with an endless list of tools and services, each used for just a single step. There are two common dimensions to these risks:
- Learning, maintenance, and switching costs
- Data duplication and transfer costs
Data exploration
After defining the machine learning problem, goals, and success criteria, a common first step is to explore the relevant data that will be used for model training and evaluation. During this step, data is analyzed to understand its characteristics, distributions, and relationships. This process of evaluation and understanding is an iterative one, often resulting in a series of ad-hoc queries being executed across datasets, where query responsiveness is critical (along with other factors such as cost-efficiency and accuracy).
As companies store increasing amounts of data to leverage for machine learning purposes, examining the data you have becomes harder, because analytics and evaluation queries often become tediously or prohibitively slow at scale with traditional data systems. Some large vendors charge significantly more to bring down query times, and discourage ad-hoc evaluation by charging per query or per byte scanned. Engineers may resort to pulling subsets of data down to their local machines as a compromise for these limitations.
ClickHouse, on the other hand, is a real-time data warehouse, so users benefit from industry-leading query speeds for analytical computations. Further, ClickHouse delivers high performance from the start, and doesn’t gate critical query-accelerating features behind higher pricing tiers. ClickHouse can also query data directly from object storage or data lakes, with support for common formats such as Iceberg, Delta Lake, and Hudi. This means that no matter where your data lives, ClickHouse can serve as a unifying access and computation layer for your machine learning workloads.
ClickHouse also has an extensive suite of pre-built statistical and aggregation functions that scale over petabytes of data, making it easy to write and maintain simple SQL that executes complex computations.
With support for the most granular precision data types and codecs, you don’t need to worry about reducing the granularity of your data. While you can transform data directly in ClickHouse or prior to insertion using SQL queries, ClickHouse can also be used in programming environments such as Python via chDB. chDB exposes an embedded ClickHouse as a Python module, which can be used to transform and manipulate large data frames within notebooks. Data engineers can therefore perform transformation work client-side, with results potentially materialized as feature tables in a centralized ClickHouse instance.
Data preparation and feature extraction
Data is then prepared: cleaned, transformed, and used to extract the features by which the model will be trained and evaluated. This component is sometimes called a feature generation or extraction pipeline, and is another slice of the machine learning data layer where new tools are often introduced. MLOps players like Neptune and Hopsworks provide examples of the host of different data transformation products that are used to orchestrate pipelines like these. However, because they’re separate tools from the database they’re operating on, they can be brittle, and can cause disruptions that need to be manually rectified.
In contrast, data transformations are easily accomplished directly in ClickHouse through materialized views. These are automatically triggered when new data is inserted into ClickHouse source tables and are used to extract, transform, and modify data as it arrives - eliminating the need to build and monitor bespoke pipelines yourself. When these transformations require aggregations over a complete dataset that may not fit into memory, leveraging ClickHouse means you don’t have to retrofit this step to work with data frames on your local machine. For datasets that are more convenient to evaluate locally, clickhouse-local is a great alternative, along with chDB; both allow you to leverage ClickHouse with standard Python data libraries like Pandas.
Training and evaluation
At this point, features will have been split into training, validation, and test sets. These datasets are versioned, and then used by their respective stages. It is common in this phase of the pipeline to introduce yet another specialized tool to the machine learning data layer - the feature store. A feature store is most commonly a layer of abstraction around a database that provides convenience features specific to managing data for model training, inference, and evaluation. Examples of these convenience features include versioning, access management, and automatically translating the definition of features to SQL statements.
For feature stores, ClickHouse can act as a:
- Data source - With the ability to query or ingest data in over 70 different file formats, including data lake formats such as Iceberg and Delta Lake, ClickHouse makes an ideal long-term store for holding or querying data. By separating storage and compute using object storage, ClickHouse Cloud additionally allows data to be held indefinitely - with compute scaled down or made completely idle to minimize costs. Flexible codecs, coupled with column-oriented storage and ordering of data on disk, maximize compression rates, thus minimizing the required storage. You can easily combine ClickHouse with data lakes, with built-in functions to query data in place on object storage.
- Transformation engine - SQL provides a natural means of declaring data transformations. When extended with ClickHouse’s analytical and statistical functions, these transformations become succinct and optimized. As well as applying to ClickHouse tables, in cases where ClickHouse is used as a data store, table functions allow SQL queries to be written against data stored in formats such as Parquet, on disk or in object storage, or even in other data stores such as Postgres and MySQL. A fully parallelized query execution engine, combined with a column-oriented storage format, allows ClickHouse to perform aggregations over petabytes of data in seconds - unlike transformations on in-memory data frames, users aren’t memory-bound. Furthermore, materialized views allow data to be transformed at insert time, thus shifting compute from query time to data load time. These views can exploit the same range of analytical and statistical functions ideal for data analysis and summarization. Should any of ClickHouse’s existing analytical functions be insufficient, or should custom libraries need to be integrated, you can also utilize User Defined Functions (UDFs).
Offline feature store
An offline feature store is used for model training. This generally means that the features themselves are produced through batch-process data transformation pipelines (as described in the above section), and there are typically no strict latency requirements on the availability of those features. Because ClickHouse can read data from multiple sources and apply transformations via SQL queries, the results of these queries can also be persisted in ClickHouse via INSERT INTO SELECT statements.
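The batch pattern described above, computing per-entity features with a query and persisting the result with INSERT INTO ... SELECT, can be sketched as follows. This is only an illustration of the statement shape: it uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere, not ClickHouse itself, and the `events` and `user_features` tables and their columns are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in engine here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, amount REAL);
    CREATE TABLE user_features (
        user_id INTEGER PRIMARY KEY,
        purchase_count INTEGER,
        avg_amount REAL
    );
""")

# Raw source rows that the feature pipeline will aggregate.
conn.executemany(
    "INSERT INTO events (user_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0)],
)

# Batch feature computation: aggregate per entity and persist the result
# in a single INSERT INTO ... SELECT statement.
conn.execute("""
    INSERT INTO user_features (user_id, purchase_count, avg_amount)
    SELECT user_id, COUNT(*), AVG(amount)
    FROM events
    GROUP BY user_id
""")

features = conn.execute(
    "SELECT user_id, purchase_count, avg_amount "
    "FROM user_features ORDER BY user_id"
).fetchall()
print(features)
```

Grouping by the entity ID and writing one row per entity is the part that carries over; engine-specific details such as table engines and data types are deliberately left out of this sketch.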
With transformations often grouped by an entity ID and returning a number of columns as results, ClickHouse’s schema inference can automatically detect the required types from these results and produce an appropriate table schema to store them.
Functions for generating random numbers and statistical sampling allow data to be efficiently iterated and scaled at millions of rows per second for feeding to model training pipelines.
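One common way to combine sampling with the training, validation, and test split mentioned earlier is to hash each entity ID into a bucket, so that the same entity always lands in the same set across runs. The sketch below shows the idea in plain Python. The `assign_split` helper, the 80/10/10 proportions, and the ID format are assumptions made for illustration; inside a database the same idea would normally be expressed with a hash function in SQL.

```python
import hashlib


def assign_split(entity_id: str, train_pct: int = 80, validation_pct: int = 10) -> str:
    """Deterministically map an entity ID to 'train', 'validation', or 'test'."""
    # Hash the ID and reduce it to a bucket in the range [0, 100).
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + validation_pct:
        return "validation"
    return "test"


# Hypothetical entity IDs; real IDs would come from the feature table.
ids = [f"user-{i}" for i in range(1000)]
splits = {entity_id: assign_split(entity_id) for entity_id in ids}
```

Because the assignment depends only on the ID, re-running the pipeline on new data does not move existing entities between sets, which helps avoid leakage between training and evaluation data.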
Often, features are represented in tables with a timestamp indicating the value for an entity and feature at a specific point in time.
Training pipelines often need the state of features at specific points in time and in groups. ClickHouse’s sparse indices allow fast filtering of data to satisfy point-in-time queries and feature selection filters. While other technologies such as Spark, Redshift, and BigQuery rely on slower stateful windowed approaches to identify the state of features at a specific point in time, ClickHouse supports the ASOF (as-of-this-time) LEFT JOIN clause and the argMax function.
In addition to simplifying syntax, this approach is highly performant on large datasets through the use of a sort and merge algorithm.
This allows feature groups to be built quickly, reducing data preparation time prior to training.
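To make the point-in-time semantics concrete, the sketch below shows in plain Python what an as-of lookup returns: for each (entity, timestamp) pair, the most recent feature value recorded at or before that timestamp. It illustrates the semantics only and is not how ClickHouse implements the join. The data, the `feature_as_of` helper, and the use of `None` for "no earlier value" are assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical feature history: entity -> [(timestamp, value)], sorted by timestamp.
feature_history = {
    "user-1": [(100, 0.2), (200, 0.5), (300, 0.9)],
    "user-2": [(150, 1.0)],
}


def feature_as_of(entity_id, ts):
    """Return the latest feature value with timestamp <= ts, or None if there is none."""
    history = feature_history.get(entity_id, [])
    timestamps = [t for t, _ in history]
    # Index of the first entry strictly after ts; the entry before it is the match.
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


# Hypothetical training labels, each observed at a specific point in time.
labels = [("user-1", 250), ("user-1", 50), ("user-2", 150)]
training_rows = [(entity, ts, feature_as_of(entity, ts)) for entity, ts in labels]
print(training_rows)
# [('user-1', 250, 0.5), ('user-1', 50, None), ('user-2', 150, 1.0)]
```

Taking only values at or before the label timestamp is what prevents future information from leaking into the training rows.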
Online feature store
Online feature stores are used to store the latest version of features used for inference and are applied in real time. This means that these features need to be calculated with minimal latency, as they’re used as part of a real-time machine learning service. As a real-time analytics database, ClickHouse can serve highly concurrent query workloads at low latency. While this typically requires data to be denormalized, this aligns with the storage of feature groups used at both training and inference time. Importantly, ClickHouse is able to deliver this query performance while being subject to high write workloads thanks to its log-structured merge tree. These properties are required in an online store to keep features up-to-date. Since the features are already available within the offline store, they can easily be materialized to new tables within either the same ClickHouse cluster or a different instance via existing capabilities, e.g. remoteSecure.
Integrations with Kafka, through either an exactly-once Kafka Connect offering or via ClickPipes in ClickHouse Cloud, also make consuming data from streaming sources simple and reliable.
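The "latest version of features" that an online store serves is, in effect, the value at the maximum timestamp for each entity, which is the kind of result the argMax function mentioned earlier computes. A minimal plain-Python sketch of that semantics follows. The rows and the `latest_features` helper are hypothetical, and this is not a description of ClickHouse internals.

```python
def latest_features(rows):
    """Keep, for each entity, the value with the greatest timestamp."""
    latest = {}
    for entity_id, ts, value in rows:
        # Replace the stored entry only when this row is newer.
        if entity_id not in latest or ts > latest[entity_id][0]:
            latest[entity_id] = (ts, value)
    return {entity_id: value for entity_id, (_, value) in latest.items()}


# Hypothetical feature updates, arriving out of order as they might from a stream.
rows = [
    ("user-1", 100, 0.2),
    ("user-2", 150, 1.0),
    ("user-1", 300, 0.9),
    ("user-1", 200, 0.5),
]
online = latest_features(rows)
print(online)
# {'user-1': 0.9, 'user-2': 1.0}
```

Because the result depends only on timestamps and not on arrival order, late or out-of-order updates do not overwrite newer values.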
Many modern systems require both offline and online stores, and it is easy to jump to the conclusion that two specialized feature stores are required here.
However, this introduces the additional complexity of keeping both of these stores in sync, which of course also includes the cost of replicating data between them.
A real-time data warehouse like ClickHouse is a single system that can power both offline and online feature management.
ClickHouse efficiently processes streaming and historical data, and has the scale, performance, and concurrency needed to be relied upon when serving features for real-time inference and offline training.
In considering the tradeoffs between using a feature store product in this stage versus leveraging a real-time data warehouse directly, it’s worth emphasizing that convenience features such as versioning can be achieved through age-old database paradigms such as table or schema design.
Other functionality, such as converting feature definitions to SQL statements, may provide greater flexibility as part of the application or business logic, rather than existing in an opinionated layer of abstraction.