
Load Test GlassFlow for ClickHouse: Real-Time Deduplication at Scale

Benchmarking GlassFlow: Fast, reliable deduplication at 20M events

By Ashish Bagri, Co-founder & CTO of GlassFlow · 06/06/2025, 08:07

TL;DR

  • We tested GlassFlow on a real-world deduplication pipeline with Kafka and ClickHouse.
  • It stayed stable with up to ~55,000 records/sec published to Kafka and processed 9,000+ records/sec on a MacBook Pro, with average latency of around 0.12 ms.
  • No crashes, no message loss, no out-of-order events. Even with 20M records and 12 concurrent publishers, the pipeline remained robust.
  • Want to try it yourself? The full test setup is open source (https://github.com/glassflow/clickhouse-etl-loadtest), and setup instructions are in the docs (https://docs.glassflow.dev/load-test/setup).

Why this test?

ClickHouse is incredible at fast analytics. But when building real-time pipelines from Kafka to ClickHouse, many teams run into the same issues: analytics results are incorrect or too delayed to support real-time use cases.

The root cause? Duplicate data and slow joins, often introduced by retries, offset reprocessing, or downstream enrichment. These problems affect both correctness and performance.

That’s why we built GlassFlow: a real-time streaming ETL engine designed to process Kafka streams before data hits ClickHouse.

After launching the product, we often received the question, “How does it perform at high loads?”

With this post, we want to give a clear and reproducible answer. This article walks through what we tested, how we set it up, and what we found when testing deduplication with GlassFlow.

What is GlassFlow?

[Diagram: What is GlassFlow]

GlassFlow is an open-source streaming ETL service developed specifically for ClickHouse. It is a real-time stream processing solution designed to simplify data pipeline creation and management between Kafka and ClickHouse. It supports:

  • Real-time deduplication (configurable window, event ID based)
  • Stream joins between topics
  • Exactly-once semantics
  • Native ClickHouse sink with efficient batching and buffering

GlassFlow handles the hard parts: state, ordering, retries and batching.

You can read more about GlassFlow in our previous Hacker News post: https://news.ycombinator.com/item?id=43953722

Test Assumptions

Before we dive in, here’s what you should know about how we ran the test.

Data Used: Simulating a Real-World Use Case

For this benchmark, we use synthetic data that simulates a real-world use case: logging user events in an application.

Each record represents an event triggered by a user, similar to what you'd see in analytics or activity tracking systems.

Here's the schema:

| Field | Type | Description |
|---|---|---|
| event_id | UUID (v4) | Unique ID for the event |
| user_id | UUID (v4) | Unique ID for the user |
| name | String | Full name of the user |
| email | String | User’s email address |
| created_at | Datetime (%Y-%m-%d %H:%M:%S) | Timestamp of when the event occurred |

This structure helps simulate insert-heavy workloads and time-based queries—perfect for testing how GlassFlow performs with ClickHouse in a realistic, high-volume setting.
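To make the schema concrete, here is a minimal sketch of how such synthetic events could be generated, including occasional re-emission of a recent event to mimic the ~10% duplication rate used later in the test. This is only an illustration, not the generator used in the load-test repo, and the name/email values are placeholders.

```python
import random
import uuid
from datetime import datetime

def make_event():
    """Build one synthetic user event matching the benchmark schema."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": str(uuid.uuid4()),
        "name": "Jane Doe",                      # placeholder full name
        "email": "jane.doe@example.com",         # placeholder email address
        "created_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
    }

def event_stream(total_records, duplication_rate=0.1):
    """Yield events, re-emitting a recent event with probability `duplication_rate`."""
    recent = []
    for _ in range(total_records):
        if recent and random.random() < duplication_rate:
            yield random.choice(recent)           # duplicate: same event_id as before
        else:
            event = make_event()
            recent = (recent + [event])[-1000:]   # keep a small pool of duplicate candidates
            yield event

# Example: peek at the first three events of a 5M-record stream
for _, evt in zip(range(3), event_stream(5_000_000)):
    print(evt)
```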

Infrastructure Setup

[Diagram: Infrastructure setup]

For this benchmark, we ran the load test locally using Docker to simulate the entire data pipeline. The setup included:

  • Kafka: Running in a Docker container to handle event streaming.
  • ClickHouse: Also containerized, serving as the storage layer.
  • GlassFlow ETL: Deployed in Docker, responsible for processing messages from Kafka and writing them to ClickHouse.

While the setup supports running against cloud-hosted Kafka and ClickHouse, we chose to keep everything local to maintain control over the environment and ensure consistent test conditions.

Each test run automatically creates the necessary Kafka topics and ClickHouse tables before starting, and cleans them up afterward. This keeps the environment clean between runs and ensures reproducible results.
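The load-test repo automates this lifecycle for you. Purely as a sketch of what the per-run setup and teardown could look like against local Docker defaults (the topic name, table name, partition count, and table engine below are assumptions for illustration):

```python
# Illustrative setup/teardown for one run, assuming Kafka on localhost:9092 and
# ClickHouse HTTP on localhost:8123. Names and schema details are hypothetical.
from kafka.admin import KafkaAdminClient, NewTopic   # pip install kafka-python
import clickhouse_connect                            # pip install clickhouse-connect

TOPIC = "user_events_load_test"
TABLE = "user_events"

kafka_admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
ch = clickhouse_connect.get_client(host="localhost", port=8123)

def setup():
    # Fresh Kafka topic and ClickHouse table before the run starts.
    kafka_admin.create_topics([NewTopic(name=TOPIC, num_partitions=4, replication_factor=1)])
    ch.command(f"""
        CREATE TABLE IF NOT EXISTS {TABLE} (
            event_id   UUID,
            user_id    UUID,
            name       String,
            email      String,
            created_at DateTime
        ) ENGINE = MergeTree ORDER BY (created_at, event_id)
    """)

def teardown():
    # Remove everything afterwards so the next run starts from a clean slate.
    kafka_admin.delete_topics([TOPIC])
    ch.command(f"DROP TABLE IF EXISTS {TABLE}")
```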

Resources Used for Testing

The load tests were conducted on a MacBook Pro with the following specifications:

| Specification | Details |
|---|---|
| Model Name | MacBook Pro |
| Model Identifier | Mac14,5 |
| Model Number | MPHG3D/A |
| Chip | Apple M2 Max |
| Total Number of Cores | 12 (8 performance and 4 efficiency) |
| Memory | 32 GB |

Additional Assumptions

Furthermore, to push the implementation to its limits, we did the following:

  1. We used incoming data with a fixed amount of duplication (10%, to be exact) that needed to be deduplicated.
  2. We performed incremental tests with growing data volume at each step (starting from 5 million records and working our way up to 20 million records).
  3. We also varied several parameters to see how they impact overall performance.

So, let’s start with the actual test.

Running the Actual Load Test

We created a load test repo so you can run this benchmark yourself in minutes (https://github.com/glassflow/clickhouse-etl-loadtest). Using this, we ran a series of local load tests that mimicked a real-time streaming setup. The goal was simple: push a steady stream of user event data through a Kafka → GlassFlow → ClickHouse pipeline and observe how well it performs with meaningful data transformations applied along the way.

Pipeline Configuration

[Diagram: Pipeline configuration]

The setup followed a typical streaming architecture:

  • Kafka handled the event stream, fed by synthetic user activity.
  • GlassFlow processed the stream in real time, applying transformations before passing it downstream.
  • ClickHouse served as the destination where all processed data was written and later queried.

Each test run spun up its own Kafka topics and ClickHouse tables automatically. Everything was cleaned up once the run was complete, leaving no leftover state. This kept the environment fresh and the results reliable.

Transformations Applied

[Diagram: GlassFlow between Kafka and ClickHouse]

As discussed in the previous section, to make the test more realistic, we applied a deduplication transformation using the event_id field. The goal was to simulate a scenario where events could be sent more than once due to retries or upstream glitches. The deduplication logic looked for repeated events within an 8-hour window and dropped the duplicates before they hit ClickHouse.

No complex joins or filters were applied in this run, keeping the focus on how well GlassFlow could handle high event volumes and real-time processing with exactly-once semantics.
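Conceptually, event_id-based deduplication over a time window comes down to remembering which IDs were seen recently and dropping repeats. The sketch below shows that idea in plain Python; it is not GlassFlow’s actual implementation, which also has to handle durable state, ordering, retries, and exactly-once delivery to the sink.

```python
from datetime import datetime, timedelta

class WindowedDeduplicator:
    """Drop events whose event_id has already been seen within the time window."""

    def __init__(self, window=timedelta(hours=8)):
        self.window = window
        self.seen = {}  # event_id -> timestamp of first occurrence

    def accept(self, event):
        ts = datetime.strptime(event["created_at"], "%Y-%m-%d %H:%M:%S")
        # Evict entries older than the window so state stays bounded.
        # (A production version would use a more efficient expiring structure.)
        cutoff = ts - self.window
        self.seen = {eid: t for eid, t in self.seen.items() if t >= cutoff}
        if event["event_id"] in self.seen:
            return False  # duplicate within the window: drop it
        self.seen[event["event_id"]] = ts
        return True       # first occurrence: forward downstream

# Tiny usage example
sample = [
    {"event_id": "a", "created_at": "2025-06-06 08:00:00"},
    {"event_id": "a", "created_at": "2025-06-06 08:00:05"},  # duplicate, dropped
    {"event_id": "b", "created_at": "2025-06-06 08:00:10"},
]
dedup = WindowedDeduplicator()
print([e["event_id"] for e in sample if dedup.accept(e)])  # -> ['a', 'b']
```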

Monitoring and Observability Setup

Throughout the test, we kept a close eye on key performance metrics:

  • Throughput — Events processed per second, from Kafka to ClickHouse.
  • Latency — Time taken from ingestion to storage.
  • Kafka Lag — How far behind the processor was from the latest Kafka event.
  • CPU & Memory Usage — For each component in the pipeline.

These were visualized using pre-built Grafana dashboards that gave a live view into system behavior, which was especially useful for spotting bottlenecks and confirming whether back pressure or resource constraints were kicking in.
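For reference, Kafka lag at the offset level is simply the gap between the latest offset in the topic and the offset the consumer group has committed. A minimal way to check that yourself could look like the sketch below; the topic and consumer-group names are placeholders, and note that the results table later reports lag converted to seconds rather than messages.

```python
# Minimal consumer-lag check against a local Kafka broker (pip install kafka-python).
# Topic and consumer group names are illustrative placeholders.
from kafka import KafkaConsumer, TopicPartition

TOPIC = "user_events_load_test"
GROUP = "glassflow-etl"

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         group_id=GROUP,
                         enable_auto_commit=False)
partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)   # latest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0      # last offset the group has committed
    total_lag += end_offsets[tp] - committed

print(f"Total consumer lag for group '{GROUP}': {total_lag} messages")
```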


Test Execution

We ran multiple test iterations, each processing between 5 and 20 million records, with parallelism levels ranging from 2 to 12 workers. Around 10% of the events were duplicates, which exercised the deduplication mechanism effectively. Additionally, we set up various configurable parameters that allowed us to test the limits of GlassFlow:

| Parameter | Required/Optional | Description | Example Range/Values | Default |
|---|---|---|---|---|
| num_processes | Required | Number of parallel processes | 1-N (step: 1) | - |
| total_records | Required | Total number of records to generate | 5,000,000-20,000,000 (step: 500,000) | - |
| duplication_rate | Optional | Rate of duplicate records | 0.1 (10% duplicates) | 0.1 |
| deduplication_window | Optional | Time window for deduplication | ["1h", "4h"] | "8h" |
| max_batch_size | Optional | Max batch size for the sink | [5000] | 5000 |
| max_delay_time | Optional | Max delay time for the sink | ["10s"] | "10s" |

For each parameter, you can either define a fixed value or go a step further and specify a range, so that multiple combinations of the test run with the configured values.
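As a rough illustration of such a parameter grid (the exact file format and keys are documented in the load-test repo; this Python dict only mirrors the table above), a run configuration could look like this:

```python
# Illustrative parameter grid; the real config format lives in the
# clickhouse-etl-loadtest repo and may differ in syntax.
load_test_params = {
    "num_processes": [2, 4, 6, 8, 10, 12],                      # parallel Kafka publishers
    "total_records": range(5_000_000, 20_000_001, 5_000_000),   # 5M, 10M, 15M, 20M
    "duplication_rate": 0.1,                                    # 10% duplicate events
    "deduplication_window": "8h",                               # dedup window used in this benchmark
    "max_batch_size": 5000,                                     # ClickHouse sink batch size
    "max_delay_time": "10s",                                    # max sink flush delay
}

# Every combination of the list/range values above defines one test variant.
from itertools import product
variants = list(product(load_test_params["num_processes"],
                        load_test_params["total_records"]))
print(f"{len(variants)} variants")  # 6 x 4 = 24 variants
```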

Each test ran until all records were processed, and the pipeline drained completely. By the end, we had a clear picture of how throughput and latency scaled with load—and how stable the system remained under pressure.
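Once a run had drained, a simple sanity check is to compare the number of rows ClickHouse received with the number of distinct event_ids; if deduplication worked, the two should match. A minimal version of that check (the table name is an assumption for illustration) could be:

```python
# Post-run sanity check: every row in ClickHouse should have a unique event_id.
# The table name is hypothetical; adjust it to whatever the pipeline writes to.
import clickhouse_connect

ch = clickhouse_connect.get_client(host="localhost", port=8123)
row_count, unique_ids = ch.query(
    "SELECT count(), uniqExact(event_id) FROM user_events"
).result_rows[0]

print(f"rows={row_count}, unique event_ids={unique_ids}")
assert row_count == unique_ids, "duplicates made it into ClickHouse"
```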

With the setup complete, let’s look at the results.

It’s Result Time!

We ran this benchmark using the same GlassFlow pipeline across all test runs, varying the parameters as shown above. Here is the GlassFlow pipeline configuration we used:

| Parameter | Value |
|---|---|
| Duplication Rate | 0.1 |
| Deduplication Window | 8h |
| Max Delay Time | 10s |
| Max Batch Size (GlassFlow Sink - ClickHouse) | 5000 |

Now, as discussed above, we looked at specific performance metrics to gauge how GlassFlow performs. Across all our tests, both CPU and memory usage on the Mac remained stable, even during extended test runs.

So, here are the results that we obtained:

| Variant ID | #Records (millions) | #Kafka Publishers (num_processes) | Source RPS in Kafka (records/s) | GlassFlow RPS (records/s) | Average Latency (ms) | Lag (sec) |
|---|---|---|---|---|---|---|
| load_9fb6b2c9 | 5.0 | 2 | 8705 | 8547 | 0.117 | 10.1 |
| load_0b8b8a70 | 10.0 | 2 | 8773 | 8653 | 0.1156 | 15.04 |
| load_a7e0c0df | 15.0 | 2 | 8804 | 8748 | 0.1143 | 10.04 |
| load_bd0fdf39 | 20.0 | 2 | 8737 | 8556 | 0.1169 | 47.74 |
| load_1542aa3b | 5.0 | 4 | 17679 | 9189 | 0.1088 | 260.55 |
| load_a85a4c42 | 10.0 | 4 | 17738 | 9429 | 0.1061 | 495.97 |
| load_5efd111b | 15.0 | 4 | 17679 | 9341 | 0.1071 | 756.49 |
| load_23da167d | 20.0 | 4 | 17534 | 9377 | 0.1066 | 991.77 |
| load_883b39a0 | 5.0 | 6 | 25995 | 8869 | 0.1128 | 370.57 |
| load_b083f89f | 10.0 | 6 | 26226 | 9148 | 0.1093 | 710.97 |
| load_462558f4 | 15.0 | 6 | 26328 | 9191 | 0.1088 | 1061.44 |
| load_254adf29 | 20.0 | 6 | 26010 | 8391 | 0.1192 | 1613.62 |
| load_0c3fdefc | 5.0 | 8 | 34384 | 8895 | 0.1124 | 415.78 |
| load_3942530b | 10.0 | 8 | 33779 | 8747 | 0.1143 | 846.26 |
| load_d2c1783c | 15.0 | 8 | 34409 | 9067 | 0.1103 | 1217.37 |
| load_febf151f | 20.0 | 8 | 35135 | 9121 | 0.1096 | 1622.75 |
| load_993c0bc5 | 5.0 | 10 | 40256 | 8757 | 0.1142 | 445.76 |
| load_022e44e5 | 10.0 | 10 | 38715 | 8687 | 0.1151 | 891.8 |
| load_0adbae83 | 15.0 | 10 | 39820 | 8694 | 0.115 | 1347.66 |
| load_77d67ac7 | 20.0 | 10 | 40458 | 8401 | 0.119 | 1885.24 |
| load_af120520 | 5.0 | 12 | 37691 | 8068 | 0.124 | 485.95 |
| load_c9424931 | 10.0 | 12 | 45743 | 8610 | 0.1161 | 941.66 |
| load_ee837ca6 | 15.0 | 12 | 45539 | 8605 | 0.1162 | 1412.48 |
| load_ac40b143 | 20.0 | 12 | 49005 | 8878 | 0.1126 | 1843.61 |
| load_675d04f3 | 5.0 | 12 | 40382 | 8467 | 0.1181 | 465.66 |
| load_28956d50 | 10.0 | 12 | 55829 | 8018 | 0.1247 | 1066.62 |

💡 Note: The last two tests (load_675d04f3 and load_28956d50) use a higher records-per-second value to see how it impacts performance.

Well, before we analyze these results, let’s take a look at a few visualizations we created to get a better idea of how GlassFlow actually performed:

[Charts: load test result visualizations]

After running a series of sustained load tests, the results gave a clear picture of how GlassFlow behaves under pressure—and the performance was impressive across the board. Here's what stood out:

  1. Throughout the test, the system remained rock-solid—even when pushing up to 55,000 records per second into Kafka. There were no crashes, memory leaks, or failures. GlassFlow handled deduplication flawlessly, consistently filtering out repeated events without missing a beat. No message loss or disordering was observed, which speaks volumes about the reliability of the pipeline.
  2. GlassFlow’s processing rate remained stable under varying loads. In the current setup (running inside a Docker container on a local machine), the system consistently processed over 9,000 records per second.

However, this peak appears to be more a reflection of available system resources (CPU and memory) than a limitation of GlassFlow itself. With more powerful hardware or a scaled-out deployment (a cloud deployment, for instance), this ceiling could likely be pushed higher.

  3. Lag in the pipeline, measured as the time difference between event ingestion into Kafka and its appearance in ClickHouse, was closely tied to two factors:

  • Ingestion Rate: Higher Kafka ingestion RPS naturally led to higher lag, especially when it exceeded the 9,000 RPS GlassFlow could sustain.
  • Volume of Data: For a fixed RPS, increasing the total number of events extended the lag over time, which was expected as the buffer filled up.

In other words, once Kafka was producing faster than GlassFlow could consume, the lag started to climb. This is normal in streaming systems and highlights where autoscaling or distributed processing would come into play in a production setup.
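That relationship is easy to sanity-check against the results table. Assuming the reported lag is roughly the extra time the last events wait after the producers have finished, the final lag should be about total_records divided by the consume rate minus total_records divided by the produce rate. A quick check for variant load_23da167d:

```python
# Back-of-envelope check of the final lag for variant load_23da167d, assuming
# lag ~= time for GlassFlow to drain all records minus time Kafka took to produce them.
total_records = 20_000_000
produce_rps = 17_534   # Source RPS in Kafka
consume_rps = 9_377    # GlassFlow RPS

estimated_lag = total_records / consume_rps - total_records / produce_rps
print(f"estimated lag ~ {estimated_lag:.0f} s")  # ~992 s, close to the measured 991.77 s
```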

So, to summarize the above interpretations, here are my final takeaways:

  • GlassFlow remained stable and consistent under high event rates.
  • Processing throughput maxed out at ~9K RPS, limited by local machine resources.
  • Processing latency remained extremely low (around 0.12 ms). Even at peak load and max event volume (20M records), latency didn’t spike.
  • Lag increased proportionally with ingestion rates and event volume—no surprises, but a great signal for where scaling would help.

Hence, it’s fair to say that these results give us a lot of confidence in using GlassFlow for real-time event pipelines, especially when paired with a scalable backend like ClickHouse.

Conclusion

The test above shows that GlassFlow is a great fit for real-time stream processing with ClickHouse and that it integrates seamlessly with Kafka. Deduplication does not compromise performance, making GlassFlow suitable for correctness-critical analytics use cases.

Now, it’s time for you to get your hands dirty and create your own tests using our load test repository. Here is the link to the repo again for your reference: https://github.com/glassflow/clickhouse-etl-loadtest.
