ClickHouse history
ClickHouse was initially developed to power Yandex.Metrica, the second-largest web analytics platform in the world, and continues to be its core component. With more than 13 trillion records in the database and more than 20 billion events daily, ClickHouse allows generating custom reports on the fly directly from non-aggregated data. This article briefly covers the goals of ClickHouse in the early stages of its development.
Yandex.Metrica builds customized reports on the fly based on hits and sessions, with arbitrary segments defined by the user. Doing so often requires building complex aggregates, such as the number of unique users, with new data for building reports arriving in real time. As of April 2014, Yandex.Metrica was tracking about 12 billion events (page views and clicks) daily. All these events needed to be stored in order to build custom reports. A single query may have required scanning millions of rows within a few hundred milliseconds, or hundreds of millions of rows in just a few seconds.
Usage in Yandex.Metrica and other Yandex services
ClickHouse serves multiple purposes in Yandex.Metrica. Its main task is to build reports in online mode using non-aggregated data. It uses a cluster of 374 servers, which store over 20.3 trillion rows in the database. The volume of compressed data is about 2 PB, without accounting for duplicates and replicas. The volume of uncompressed data (in TSV format) would be approximately 17 PB. ClickHouse also plays a key role in the following processes:
- Storing data for Session Replay from Yandex.Metrica.
- Processing intermediate data.
- Building global reports with Analytics.
- Running queries for debugging the Yandex.Metrica engine.
- Analyzing logs from the API and the user interface.
Aggregated and non-aggregated data
There is a widespread opinion that to calculate statistics effectively, you must aggregate data, since this reduces the volume of data. However, data aggregation comes with a lot of limitations (see the sketch after this list for a worked illustration of the cardinality and combinatorial points):
- You must have a pre-defined list of required reports.
- The user can’t make custom reports.
- When aggregating over a large number of distinct keys, the data volume is barely reduced, so aggregation is useless.
- For a large number of reports, there are too many aggregation variations (combinatorial explosion).
- When aggregating keys with high cardinality (such as URLs), the volume of data isn’t reduced by much (less than twofold).
- For this reason, the volume of data with aggregation might grow instead of shrink.
- Users don’t view all the reports we generate for them. A large portion of those calculations are useless.
- The logical integrity of the data may be violated for various aggregations.
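To make the cardinality and combinatorial points concrete, here is a minimal Python sketch. It is not from the original article: the event counts, dimension names, and cardinalities are invented assumptions chosen only for illustration. It shows why pre-aggregating by a high-cardinality key such as URL barely reduces the row count, and how the number of candidate pre-aggregates explodes with the number of dimensions:

```python
# Illustrative sketch (not from the article). All counts and dimension
# names below are invented assumptions.
import itertools
import random

random.seed(0)

N_EVENTS = 1_000_000  # assumed number of raw page-view events

# Low-cardinality key: ~100 distinct countries.
countries = [f"country_{i}" for i in range(100)]
# High-cardinality key: ~800,000 distinct URLs.
urls = [f"https://example.com/page/{i}" for i in range(800_000)]

events = [(random.choice(countries), random.choice(urls))
          for _ in range(N_EVENTS)]

# Pre-aggregating by country collapses a million rows into ~100 rows,
# but pre-aggregating by URL keeps most of the original volume: the
# row count shrinks by less than a factor of two.
print("raw rows:       ", N_EVENTS)
print("rows by country:", len({c for c, _ in events}))  # ~100
print("rows by URL:    ", len({u for _, u in events}))  # ~570,000

# Combinatorial explosion: if users may slice by any subset of D
# dimensions, there are 2**D candidate aggregates to precompute.
dims = ("country", "url", "browser", "os", "referrer", "device", "date")
variants = sum(1 for r in range(len(dims) + 1)
               for _ in itertools.combinations(dims, r))
print(f"{len(dims)} dimensions -> {variants} aggregation variants")  # 2**7 = 128
```

Under these assumptions, the URL aggregate still holds roughly 570,000 of the original million rows, a reduction of less than twofold, while just seven sliceable dimensions already yield 2^7 = 128 candidate aggregates to maintain.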