ClickHouse is renowned for its blazing-fast OLAP (Online Analytical Processing) capabilities. While many know the default MergeTree engine, ClickHouse provides specialized variants optimized for particular workloads. Among these, AggregatingMergeTree stands out for scenarios involving pre-aggregated data. This post explores the AggregatingMergeTree engine in depth, explains how it works, where it shines, and how it compares with other engines like ReplacingMergeTree.
What Is AggregatingMergeTree?
AggregatingMergeTree is a table engine in ClickHouse designed for efficiently storing and merging aggregated data. Unlike regular MergeTree tables that store raw, unprocessed data, AggregatingMergeTree stores states of aggregate functions (such as sum, avg, and count) and merges them periodically during background merges.
This makes it particularly powerful for pre-aggregated OLAP workloads, where the same queries with GROUP BY and aggregations are run repeatedly. You can use AggregatingMergeTree tables for incremental data aggregation, including for aggregated materialized views. It is appropriate to use AggregatingMergeTree when it reduces the number of rows by orders of magnitude.
How Does It Work Internally?
ClickHouse replaces all rows with the same primary key (or more accurately, with the same sorting key) with a single row (within a single data part) that stores a combination of states of aggregate functions.
The engine processes all columns with the AggregateFunction and SimpleAggregateFunction types.
The SimpleAggregateFunction data type stores the current value of an aggregate function, not its full intermediate state as the AggregateFunction type does. For example, sum can be a SimpleAggregateFunction: we don't need to store all the inputs to get the final sum, because the running total alone is sufficient. When a new value is inserted, the running total is updated and only that is stored. This is not the case with aggregate functions such as avg, which must store both a sum state and a count state so that the average can be calculated.
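As an illustration of what a full AggregateFunction state looks like in practice, the sketch below (with hypothetical table and column names) writes avg states with the -State combinator and reads them back with -Merge:

```sql
-- Hypothetical table: avg_price holds the (sum, count) state of avg.
CREATE TABLE daily_price_agg
(
    day Date,
    avg_price AggregateFunction(avg, Float64)
)
ENGINE = AggregatingMergeTree()
ORDER BY day;

-- Writing: avgState() produces the intermediate state, not a number.
-- 'prices' is an assumed source table with day and price columns.
INSERT INTO daily_price_agg
SELECT day, avgState(price)
FROM prices
GROUP BY day;

-- Reading: avgMerge() combines the states and finalizes the average.
SELECT day, avgMerge(avg_price) AS avg_price
FROM daily_price_agg
GROUP BY day;
```

A SimpleAggregateFunction column, by contrast, needs no combinators: you insert and read plain values, and the engine keeps only the latest combined value.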
The SimpleAggregateFunction type has better performance than the AggregateFunction type for the same aggregate functions.
This optimization can be applied to functions for which the following property holds:
the result of applying a function f to a row set S1 UNION ALL S2 can be obtained by applying f to the parts of the row set separately, and then applying f again to the results: f(S1 UNION ALL S2) = f(f(S1) UNION ALL f(S2)).
This property guarantees that partial aggregation results are enough to compute the combined one, so we do not have to store or process any extra data. For example, the min and max functions require no extra steps to compute the final result from the intermediate results, whereas avg requires keeping track of a sum and a count, which are divided to produce the average in a final merge step that combines the intermediate states.
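The property can be checked directly in SQL. In this sketch (with illustrative values), max over per-part maxima equals max over the whole set, while naively averaging per-part averages gives a different answer:

```sql
-- max is mergeable: the max of per-part maxima equals the overall max.
SELECT max(x)
FROM
(
    SELECT max(x) AS x FROM (SELECT arrayJoin([1, 5, 3]) AS x)
    UNION ALL
    SELECT max(x) AS x FROM (SELECT arrayJoin([4, 2]) AS x)
);
-- returns 5, the same as max over all five values

-- avg is not mergeable this way: avg([1, 5, 3]) = 3 and avg([4]) = 4,
-- but avg(3, 4) = 3.5, while avg([1, 5, 3, 4]) = 3.25.
-- This is why avg must carry (sum, count) as its intermediate state.
```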
Working Example
Let’s consider a common aggregation scenario: you want to analyze daily user activity across a website. Normally, you might store raw clickstream data and run a query like:
SELECT user_id, count() FROM events GROUP BY user_id;
This works well — until your table grows to billions of rows.
With AggregatingMergeTree, you store the intermediate aggregation state and retrieve the aggregated metrics blazingly fast.
Example Table Definition:
A. Raw Table:
The raw table stores data streamed from a source, say, a Kafka server. Data may be generated every second, every 5 seconds, every 10 seconds, and so on. Data may also reach the raw table with some latency due to a sluggish network.
CREATE TABLE user_activity_raw
(
    user_id UInt64,
    session_id UInt64,
    timestamp DateTime('UTC')
)
ENGINE = MergeTree()
ORDER BY user_id;
B. State Table:
The state table stores the state required for the desired aggregation over the keys defined by ORDER BY in the DDL. In this case, the page-visit count for a particular user keeps getting updated as data is streamed in. Since we are using SimpleAggregateFunction columns in the following definition, each data insertion updates the fields num_visits and last_visit. For example, say at time t0 the total number of visits for a user Ui was n0, and after a small interval Δt an additional batch of data indicated that the number of visits increased by Δn; then at time t0 + Δt the stored count becomes n0 + Δn, updated in the state table automatically without any manual calculation. The materialized view, which processes the additional data, is what pushes this change into the state table.
CREATE TABLE user_activity_agg
(
    user_id UInt64,
    num_visits SimpleAggregateFunction(sum, UInt64),
    last_visit SimpleAggregateFunction(max, DateTime('UTC'))
)
ENGINE = AggregatingMergeTree()
ORDER BY user_id;
C. Materialized View:
A materialized view operates only on the incremental block of data being inserted into the raw table: a copy of each inserted block is also processed by the view. In this example, if the inserted block corresponds to the interval Δt, the materialized view executes its defined query ONLY on this block and NOT on the entire table.
CREATE MATERIALIZED VIEW user_activity_mv TO user_activity_agg AS
SELECT
    user_id,
    count(session_id) AS num_visits,
    max(timestamp) AS last_visit
FROM user_activity_raw
GROUP BY user_id;
To query the aggregated data: because the view aggregates each inserted block independently and background merges run at unpredictable times, the state table may hold several partial rows per key. This is why we must run the same aggregation query over the state table itself; it forces the data to be aggregated over the desired keys, even where parts have not yet been merged.
SELECT
    a.user_id,
    sum(a.num_visits) AS num_visits,
    max(a.last_visit) AS last_visit
FROM user_activity_agg AS a
GROUP BY a.user_id;
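To make this concrete, here is a hypothetical walkthrough with illustrative sample values; each INSERT simulates a separate streamed block:

```sql
-- First block for user 1 (two sessions).
INSERT INTO user_activity_raw VALUES
    (1, 101, '2024-01-01 10:00:00'),
    (1, 102, '2024-01-01 11:00:00');

-- A later block for the same user arrives separately.
INSERT INTO user_activity_raw VALUES
    (1, 103, '2024-01-01 12:00:00');

-- The aggregating SELECT over user_activity_agg now reports
-- num_visits = 3 and last_visit = '2024-01-01 12:00:00' for user 1,
-- whether or not the partial rows written by the materialized view
-- have been merged in the background yet.
```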
Idempotency: Comparison with ReplacingMergeTree
Now, let's say you need idempotency in your data while keeping the pipeline robust enough to handle delays in data transfer from a remote site to the cloud (due to, say, a poor network). To account for this unavoidable latency, suppose we design the pipeline to aggregate data over the last n minutes and store the aggregated data for final consumption by an app. To prevent possible data duplication in the final aggregated table and maintain idempotency, ClickHouse provides two powerful engines: ReplacingMergeTree and AggregatingMergeTree.
ReplacingMergeTree removes duplicate entries with the same sorting key value (the ORDER BY table section, not PRIMARY KEY).
Data deduplication occurs only during a merge. Merging occurs in the background at an unknown time, so you can't plan for it. Some of the data may remain unprocessed. Although you can run an unscheduled merge using the OPTIMIZE query, do not count on using it, because the OPTIMIZE query will read and write a large amount of data.
Thus, ReplacingMergeTree is suitable for clearing out duplicate data in the background in order to save space, but it does not guarantee the absence of duplicates.
But what's wrong with ReplacingMergeTree? The ReplacingMergeTree engine identifies duplicate rows at merge time, using the values of the ORDER BY columns (specified when the table is created) as a unique identifier, and retains only the highest version. This, however, offers eventual correctness only: it does not guarantee rows will be deduplicated, and you should not rely on it. Queries can therefore produce incorrect answers, because updated and deleted rows are still visible at query time.
To obtain correct answers, users need to complement background merges with query-time deduplication and deletion removal. This can be achieved with the FINAL modifier. In practice, when there is a high throughput of data into the table, it may contain millions of duplicate records at any moment, and because deduplication happens only at merge time, it can take a long time for the data to be deduplicated.
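For comparison, a ReplacingMergeTree read that must be correct at query time has to use FINAL; the table below is an assumed ReplacingMergeTree counterpart of the earlier example:

```sql
-- FINAL forces deduplication at read time, at extra query cost:
-- ClickHouse merges the matching parts on the fly before returning rows.
SELECT user_id, last_visit
FROM user_activity_replacing FINAL
WHERE user_id = 1;
```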
From our experience with both engines on large datasets, we have observed that this is exactly where AggregatingMergeTree (with SimpleAggregateFunction) scores over ReplacingMergeTree for deduplication.