If you’ve ever tried to build data pipelines that need to process data in real-time from Kafka to ClickHouse, there is a good chance that you have experienced the same challenges as many teams do: Your analytics results are incorrect, or they are too slow to serve the real-time use case.
The reason? Data duplications and slow joins.
Duplicates in this case are usually full rows: the same record, all fields included, received multiple times. In other scenarios, duplicates share a key field (e.g., user_id) but differ in other fields; these partial duplicates are harder to detect and clean without context.
Typically, the most common reasons for duplicates happening are:
Retries (producer or connector): Producers or connectors (like Kafka Connect) sometimes retry tasks after failure (e.g., network failures). In this case, they may resend the exact same message, resulting in duplicates.
Consumer offset management: If a consumer doesn't commit offsets correctly (e.g., after a crash), it may re-read and re-process old events.
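To make the retry case concrete, here is a minimal kafka-python sketch; the broker address, topic name, and payload are placeholders, and the retry loop is intentionally naive:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python
from kafka.errors import KafkaTimeoutError

# Placeholder broker, topic, and payload.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"order_id": "o-1001", "amount": 42.5}

# Naive retry loop: if the broker accepted an earlier attempt but the ack was
# lost (e.g., a network blip), the resend lands the same event a second time.
for attempt in range(3):
    try:
        producer.send("orders", value=event).get(timeout=5)
        break
    except KafkaTimeoutError:
        continue  # at-least-once delivery: this retry may produce a duplicate

producer.flush()
```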
Let’s walk through how pairing GlassFlow with Altinity.Cloud makes real-time streaming simpler, cleaner and faster.
Why Real-Time Streaming Is Such a Challenge
Modern pipelines often pull event data from multiple systems, such as CRMs, order systems, and clickstreams, and ingest it into Kafka. Once duplicates land in Kafka, they won’t be resolved there: Kafka’s “at-least-once” guarantee ensures delivery, not uniqueness.
ClickHouse is an incredibly fast database for real-time analytics, and it includes built-in features to help manage duplicate data.
The most commonly used solution for deduplication is ReplacingMergeTree (RMT). It’s a powerful engine that can deduplicate rows during background merges, making it suitable for many workloads. However, those background merges can lag behind ingestion rates for high-throughput streaming scenarios, leading to temporary inconsistencies in query results. One option to solve the merging time is to use the FINAL keyword when selecting. It ensures clean results at read time but can slow down queries on large datasets. ClickHouse also supports insert-time deduplication, but it’s limited to a short window and isn’t designed for long-term, stateful deduplication across streaming retries.
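As a reference point, a minimal ReplacingMergeTree setup and a FINAL read might look like the following sketch, using the clickhouse-connect Python client; the table, columns, and connection details are illustrative:

```python
import clickhouse_connect  # pip install clickhouse-connect

# Illustrative local connection; point this at your own cluster.
client = clickhouse_connect.get_client(host="localhost", port=8123)

# ReplacingMergeTree keeps one row per sorting key, but only after background merges.
client.command("""
CREATE TABLE IF NOT EXISTS orders_rmt
(
    order_id String,
    amount   Float64,
    ts       DateTime
)
ENGINE = ReplacingMergeTree(ts)
ORDER BY order_id
""")

# FINAL applies merge semantics at read time: correct results, but slower on large tables.
rows = client.query("SELECT order_id, amount FROM orders_rmt FINAL").result_rows
```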
In cases where clean data must be available immediately after ingestion, upstream deduplication is often the more scalable choice.
ClickHouse also supports JOINs, but they can introduce significant challenges, particularly in real-time streaming pipelines. JOIN operations often consume substantial memory and may degrade performance when working with larger datasets, especially if the right-hand table is large. Materialized views also do not automatically update when changes occur in the joined table, which can lead to inconsistencies. In many cases, JOINs are used to clean or denormalize incoming data before analysis, but doing this inside ClickHouse can be cumbersome. For that reason, it’s often better to perform such enrichment upstream before inserting data into ClickHouse, keeping queries fast and tables simpler.
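For illustration, this is the kind of query-time enrichment JOIN that upstream processing lets you avoid; the orders and users tables here are hypothetical:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# Enriching at read time means paying for the JOIN (and holding the right-hand
# table in memory) on every query instead of once at ingestion.
rows = client.query("""
SELECT o.order_id, o.amount, u.name, u.country
FROM orders AS o
LEFT JOIN users AS u ON o.user_id = u.user_id
""").result_rows
```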
Another option could be to integrate with stream processing solutions like Apache Flink, but far too often, the operational burden of setup and maintenance holds teams back. Some teams approach the duplication challenges with self-built Go services and accept the effort that it takes to build, manage and maintain those services. That's where GlassFlow comes in.
Introducing GlassFlow
GlassFlow shifts the burden of deduplication and joins upstream, processing Kafka streams before the data hits ClickHouse. This way, you ingest only clean streams into ClickHouse and reduce the load on it.
The main components include streaming deduplication, temporal joins and optimized connectors for Kafka and ClickHouse.
Streaming deduplication: You define the deduplication key and a time window (up to 7 days), and GlassFlow performs the checks in real time so duplicates never reach ClickHouse. Deduplication works on the key fields you define, not the entire row. GlassFlow guarantees that only the first event with a given key is forwarded to ClickHouse; subsequent duplicates are rejected. (A conceptual sketch of this windowed check follows these component descriptions below.)
Temporal Stream Joins: With a few config inputs, you can join two Kafka streams on the fly. You set the join key, choose a time window (up to 7 days), map the fields and tables, and you're good to go.
Built-in Kafka source connector: There is no need to build custom consumers or manage polling logic. Users point GlassFlow at their Kafka cluster, and it auto-subscribes to the topics they define. Payloads are parsed as JSON by default, so you get structured data immediately. Under the hood, we chose NATS to keep the connector lightweight and low-latency.
ClickHouse sink: Data gets pushed into ClickHouse through a native connector optimized for performance. You can tweak batch sizes and flush intervals to match your throughput needs. It handles retries automatically, so you don't lose data on transient failures.
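To make the windowed deduplication idea from above concrete, here is a conceptual Python sketch. It is not GlassFlow's implementation; it only illustrates keeping first-seen keys for a TTL window and rejecting repeats:

```python
import time


class WindowDeduplicator:
    """Illustrative only: key-based dedup within a time window, not GlassFlow's code."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen: dict[str, float] = {}  # key -> first-seen timestamp

    def is_first(self, key: str, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        # Evict keys whose window has expired.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if key in self.seen:
            return False          # duplicate inside the window: reject
        self.seen[key] = now      # first occurrence: forward downstream
        return True


dedup = WindowDeduplicator(ttl_seconds=7 * 24 * 3600)  # up to a 7-day window
assert dedup.is_first("order-1") is True   # first time: keep
assert dedup.is_first("order-1") is False  # repeat: drop
```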
GlassFlow is open-source and accessible at https://github.com/glassflow/clickhouse-etl
Why Altinity.Cloud Completes the Picture
Altinity.Cloud is a fully managed service for ClickHouse® that eliminates infrastructure concerns. It handles scaling, monitoring, backups, and security, allowing your team to focus on data instead of DevOps. The Altinity support team partners with you to optimize schemas, tune performance, and provide 24/7 guidance so you get the full power of open-source ClickHouse.
When combined with GlassFlow, it becomes a zero-ops, end-to-end stack for real-time analytics. GlassFlow ensures only deduplicated and enriched data is ingested, while Altinity.Cloud keeps your ClickHouse instance running smoothly at scale. Together, they simplify streaming architecture and accelerate time to insight.
A Look at the Architecture
Imagine a setup where data flows from Kafka, gets processed and cleaned by GlassFlow, and lands in Altinity.Cloud fully ready for analytics. Here’s what that looks like:
GlassFlow + Altinity.Cloud Tutorial
This demo shows how to build end-to-end Kafka to ClickHouse streaming pipelines using GlassFlow and Altinity.Cloud, focusing on deduplication and stream joins.
What you’ll learn:
- How to connect Kafka and ClickHouse using GlassFlow
- How to configure a streaming deduplication + join pipeline
- How to generate and verify data inside ClickHouse via Altinity
Video walkthrough: https://youtu.be/9_Tr8qdG1-I?feature=shared
A repo to follow the tutorial step-by-step can be found here:
https://github.com/glassflow/clickhouse-etl/tree/main/demos/providers/altinity
Step 1: Set Up Tables in ClickHouse
Create two tables in your ClickHouse cluster:
- One to store incoming orders with deduplication enabled.
- Another to store enriched order data after joining it with user information.
You’ll define how records are deduplicated and what fields are needed in each table.
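As a sketch of what this could look like via the clickhouse-connect client: the connection details, table names, columns, and the choice of ReplacingMergeTree as a safety net are assumptions; the demo repository defines the exact schema.

```python
import clickhouse_connect

# Placeholder endpoint and credentials; use your Altinity.Cloud connection details.
client = clickhouse_connect.get_client(
    host="your-cluster.altinity.cloud", port=8443,
    username="default", password="***", secure=True,
)

# Orders table: ReplacingMergeTree deduplicates on the sorting key during merges.
client.command("""
CREATE TABLE IF NOT EXISTS orders
(
    order_id   String,
    user_id    String,
    amount     Float64,
    created_at DateTime
)
ENGINE = ReplacingMergeTree
ORDER BY order_id
""")

# Enriched table: orders combined with user attributes by the join pipeline.
client.command("""
CREATE TABLE IF NOT EXISTS orders_enriched
(
    order_id     String,
    user_id      String,
    amount       Float64,
    user_name    String,
    user_country String,
    created_at   DateTime
)
ENGINE = MergeTree
ORDER BY order_id
""")
```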
Step 2: Generate Sample Data and Send to Kafka
Use a data generator to create:
- A large batch of fake order events with intentional duplicates.
- A separate stream of user data for enrichment.
Publish both datasets to Kafka topics named `orders` and `users`.
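A minimal generator along these lines might look as follows; the demo repository ships its own generator, so the field names, volumes, and duplicate ratio here are assumptions:

```python
import json
import random
import uuid

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A small pool of users for the enrichment stream.
users = [{"user_id": f"u-{i}", "name": f"user{i}", "country": "DE"} for i in range(100)]
for u in users:
    producer.send("users", value=u)

# Fake orders, with roughly 20% of them intentionally sent twice.
for _ in range(10_000):
    order = {
        "order_id": str(uuid.uuid4()),
        "user_id": random.choice(users)["user_id"],
        "amount": round(random.uniform(5, 500), 2),
    }
    producer.send("orders", value=order)
    if random.random() < 0.2:
        producer.send("orders", value=order)  # duplicate on purpose

producer.flush()
```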
Step 3: Start Kafka and GlassFlow Locally
Run the Docker Compose file provided in the demo repository to spin up Kafka, GlassFlow, and dependencies on your machine. Open the GlassFlow user interface in your browser.
Step 4: Create a Deduplication Pipeline
In the GlassFlow UI:
- Choose the `orders` Kafka topic as input.
- Select a key for deduplication (such as `order_id`) and a time window.
- Provide your ClickHouse connection details.
- Map the Kafka fields to the appropriate ClickHouse table columns.
- Launch the pipeline to start streaming deduplicated data into ClickHouse.
Step 5: Verify Deduplication
Send events to the Kafka `orders` topic. Once the pipeline is running, confirm in ClickHouse that no duplicate order IDs are present in the final table.
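One way to check, assuming the illustrative schema from the Step 1 sketch:

```python
import clickhouse_connect

# Placeholder connection details, as in Step 1.
client = clickhouse_connect.get_client(
    host="your-cluster.altinity.cloud", port=8443,
    username="default", password="***", secure=True,
)

# Any rows returned here would be order_ids stored more than once.
dupes = client.query("""
SELECT order_id, count() AS copies
FROM orders
GROUP BY order_id
HAVING copies > 1
""").result_rows

print(f"duplicate order_ids: {len(dupes)}")  # expect 0 with the pipeline running
```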
Step 6: Create a Stream Join Pipeline
Now, set up a second pipeline in GlassFlow:
- Select `orders` as the left Kafka stream and `users` as the right stream.
- Define the field to join on (typically `user_id`) and a suitable time window.
- Map fields from both topics to your destination table in ClickHouse.
- Deploy the pipeline.
Step 7: Generate Matching Events
Use the data generator again to create order and user events that share matching user IDs. This ensures the join will succeed.
Step 8: Check the Final Output
Look in the enriched table in ClickHouse. You should see order records enriched with user information like name and country, showing the result of a successful stream join.
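A quick spot-check, again assuming the illustrative schema from Step 1:

```python
import clickhouse_connect

# Placeholder connection details, as in Step 1.
client = clickhouse_connect.get_client(
    host="your-cluster.altinity.cloud", port=8443,
    username="default", password="***", secure=True,
)

# Enriched orders should now carry the user attributes filled in by the join pipeline.
rows = client.query("""
SELECT order_id, amount, user_name, user_country
FROM orders_enriched
LIMIT 10
""").result_rows

for row in rows:
    print(row)
```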
Now you have an end-to-end streaming ETL with cleaned data landing in your Altinity.Cloud ClickHouse cluster.
If you want to try the tutorial yourself, you can take a look at our demo https://github.com/glassflow/clickhouse-etl/tree/main/demos
Streaming Deduplication and Joins, Easier Than Ever
GlassFlow and Altinity.Cloud together offer an end-to-end, low-ops path to real-time analytics with Kafka and ClickHouse. Kafka data is ingested through a built-in connector, and GlassFlow handles stream deduplication and joins upstream, ensuring only clean, enriched data reaches your ClickHouse instance, reducing compute load and improving query performance. Combined with Altinity.Cloud’s fully managed, production-grade ClickHouse environment, you get the performance ClickHouse is known for without managing the infrastructure. It’s a modern stack that’s open-source, easy to set up, and built to deliver fast time-to-value for streaming use cases.
Try It Out
You can explore GlassFlow on GitHub at github.com/glassflow/clickhouse-etl, and sign up for Altinity.Cloud to start running ClickHouse with zero ops.
Together, they make Kafka-to-ClickHouse streaming pipelines simpler, faster, and a lot more maintainable.