Dedupe in real time without ReplacingMergeTree

ReplacingMergeTree (RMT) slows down your ClickHouse and relies on a background merge process you cannot control. GlassFlow deduplicates your data before it is ingested into ClickHouse.

More control

With GlassFlow your data is deduplicated immediately, so your query results are correct without any delay.

Less Load

Drop duplicates and reduce the data volume in your ClickHouse, making your system faster and cheaper to run.

Clean Data

By deduplicating before ingestion, you ensure that only clean data reaches your ClickHouse.

Comparison

See in detail how GlassFlow performs compared to the alternative solutions, ClickHouse ReplacingMergeTree and a custom Go service, across these criteria:

- Deduplication
- Immediate results
- Late event management
- Stateful store included
- Quick to start
- Reduced load for ClickHouse
- Low maintenance effort
- Open source

Limits of ReplacingMergeTree in ClickHouse

RMT is very popular among ClickHouse users, but it has certain limitations. The merge process cannot be controlled, and it slows down the system while merges run. Querying with FINAL brings its own challenges. Learn more about all of them in our blog article.
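
To make the trade-off concrete, here is a minimal sketch using the clickhouse-connect Python client against a local ClickHouse instance; the table, columns, and values are made up for illustration:

```python
import clickhouse_connect

# Assumes a local ClickHouse server with default credentials.
client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS events (
        user_id String,
        value   UInt64,
        version UInt64
    )
    ENGINE = ReplacingMergeTree(version)
    ORDER BY user_id
""")

# Insert the same key twice; both rows coexist until a background merge runs.
client.insert("events", [("u1", 10, 1), ("u1", 10, 2)],
              column_names=["user_id", "value", "version"])

# Without FINAL, the count may still include the duplicate.
print(client.query("SELECT count() FROM events").result_rows)

# FINAL forces deduplication at query time: correct results, slower reads.
print(client.query("SELECT count() FROM events FINAL").result_rows)
```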

How does it work?

7-day deduplication checks

Our system automatically detects and rejects duplicate records within a window of up to 7 days (configurable), keeping your data clean and preventing unnecessary storage use. You can define specific fields as deduplication keys, ensuring only unique data is accepted; any duplicates are identified and rejected in real time. With a one-click setup, it's easy to launch fully deduplicated data pipelines with zero manual overhead and minimal lag (<0.12 ms per record).
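
A minimal sketch of the idea (not GlassFlow's implementation; the key fields and window length are illustrative):

```python
import time

class DedupWindow:
    """Tracks first-seen timestamps per key and rejects repeats inside the window."""

    def __init__(self, window_seconds: float = 7 * 24 * 3600):
        self.window = window_seconds
        self.seen: dict[str, float] = {}  # key -> first-seen timestamp

    def accept(self, record: dict, key_fields: list[str]) -> bool:
        key = "|".join(str(record[f]) for f in key_fields)
        now = time.time()
        first_seen = self.seen.get(key)
        if first_seen is not None and now - first_seen < self.window:
            return False  # duplicate within the window: reject
        self.seen[key] = now
        return True

dedup = DedupWindow()
print(dedup.accept({"user_id": "u1", "event": "click"}, ["user_id", "event"]))  # True
print(dedup.accept({"user_id": "u1", "event": "click"}, ["user_id", "event"]))  # False
```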

Stateful store built-in

GlassFlow’s built-in stateful store maintains context across streaming events, enabling advanced use cases like deduplication, joins, and aggregations. The state is fully managed and persists automatically without needing external databases or extra infrastructure. With support for keyed state and time-based windows, you can build reliable, real-time pipelines that go far beyond simple transformations.
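
As a generic illustration of keyed state with time-based windows (not GlassFlow's API), here is a counter that aggregates events per key in fixed tumbling windows:

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Counts events per key in fixed (tumbling) time windows."""

    def __init__(self, window_seconds: int = 60):
        self.window = window_seconds
        self.state: dict[tuple[str, int], int] = defaultdict(int)  # (key, window_start) -> count

    def update(self, key: str, event_time: float) -> int:
        window_start = int(event_time) // self.window * self.window
        self.state[(key, window_start)] += 1
        return self.state[(key, window_start)]

counter = TumblingWindowCounter(window_seconds=60)
print(counter.update("user-1", 1_700_000_000))  # 1
print(counter.update("user-1", 1_700_000_030))  # 2: same 60 s window
print(counter.update("user-1", 1_700_000_065))  # 1: next window
```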

Managed Kafka and ClickHouse connector

The integration uses a native ClickHouse connection for top performance and reliability. You can tune batch sizes and wait times to optimize throughput, with built-in retries for handling errors. It includes automatic schema detection and management, plus full support for JSON data types, making it easy to work with complex, nested data.
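
To give a sense of the tunables described above, here is a sketch with hypothetical field names; this is not GlassFlow's actual configuration schema:

```python
# Hypothetical sink settings, one per tunable mentioned above.
clickhouse_sink = {
    "host": "clickhouse.internal",   # placeholder host
    "port": 9000,                    # native protocol port
    "table": "events",
    "max_batch_size": 10_000,        # flush after this many rows...
    "max_delay_ms": 500,             # ...or after this wait, whichever comes first
    "max_retries": 5,                # retry failed inserts
    "auto_schema": True,             # detect and manage the target schema
    "json_columns": ["payload"],     # columns using the ClickHouse JSON type
}
```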

Frequently asked questions

Feel free to contact us if you have any questions after reviewing our FAQs.

Do you have a demo?

We have prepared several demo setups that you can run yourself locally or in the cloud. You can find them here.

How is GlassFlow’s deduplication different from ClickHouse’s ReplacingMergeTree?

ReplacingMergeTree (RMT) performs deduplication via background merges, which can delay accurate query results unless you force merges with FINAL—which can significantly impact read performance. GlassFlow moves deduplication upstream, before data is written to ClickHouse, ensuring real-time correctness and reducing load on ClickHouse.

How does GlassFlow’s deduplication work?

GlassFlow’s deduplication is powered by NATS JetStream and uses a user-defined key (e.g. user_id) and a time window (e.g. 1 hour) to identify duplicates. When multiple events with the same key arrive within the configured time window, only the first event is written to ClickHouse. Any subsequent events with the same key during that window are discarded. This mechanism ensures that only unique events are persisted, avoiding duplicates caused by retries or upstream noise.
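
For intuition, JetStream natively supports idempotent publishes: messages carrying the same Nats-Msg-Id header within a stream's duplicate window are stored only once. A minimal sketch with the nats-py client (the stream, subject, and key format are illustrative, not necessarily how GlassFlow wires it up):

```python
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    await js.add_stream(name="events", subjects=["events.>"])
    # JetStream tracks Nats-Msg-Id values within the stream's duplicate
    # window (2 minutes by default, configurable per stream).
    for _ in range(2):
        await js.publish("events.click", b'{"user_id": "u1"}',
                         headers={"Nats-Msg-Id": "u1:click:2024-01-01T00:00:00Z"})
    # Only the first publish is stored; the second is flagged as a duplicate.
    await nc.close()

asyncio.run(main())
```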

Why do duplicates happen in Kafka pipelines at all?

Duplicate events in Kafka can occur for several reasons, including producer retries, network issues, or consumer reprocessing after failures. For example, if a producer doesn’t receive an acknowledgment, it may retry sending the same event—even if Kafka already received and stored it. Similarly, consumers might reprocess events after a crash or restart if offsets weren’t committed properly. These duplicates become a problem when writing to systems like ClickHouse, which are optimized for fast analytical queries but don’t handle event deduplication natively. Without a deduplication layer, the same event could be stored multiple times, inflating metrics, skewing analysis, and consuming unnecessary storage.
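
The consumer side of the problem in a nutshell: with at-least-once processing, a crash between handling a message and committing its offset means the message is handled again after restart. A sketch with the confluent-kafka Python client (broker, group, and topic names are placeholders, and write_to_clickhouse is a hypothetical sink call):

```python
from confluent_kafka import Consumer

def write_to_clickhouse(payload: bytes) -> None:
    """Hypothetical sink call; stands in for a real batched insert."""
    ...

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-loader",
    "enable.auto.commit": False,   # commit manually, after processing
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    write_to_clickhouse(msg.value())
    # A crash HERE, before the commit below, means the same message is
    # redelivered after restart and written again: a duplicate.
    consumer.commit(msg)
```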

What happens during failures? Can you lose or duplicate data?

GlassFlow uses NATS JetStream as a buffer. Kafka offsets are only committed after successful ingestion into NATS, and then data is deduplicated and written to ClickHouse. We batch inserts using the ClickHouse native protocol. If the system crashes after acknowledging Kafka but before inserting into ClickHouse, that batch is lost. We’re actively improving recovery guarantees to address this gap.
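
As a hedged sketch of that ordering (not GlassFlow's actual code; publish_to_jetstream is a hypothetical helper):

```python
from confluent_kafka import Consumer, Message

def publish_to_jetstream(payload: bytes) -> None:
    """Hypothetical helper: buffer the event in NATS JetStream,
    raising on failure so the offset commit below never runs early."""
    ...

def forward(consumer: Consumer, msg: Message) -> None:
    publish_to_jetstream(msg.value())  # 1. buffer the event first
    consumer.commit(msg)               # 2. commit the Kafka offset only now
    # Deduplication and batched native-protocol inserts into ClickHouse
    # happen downstream, off the JetStream buffer; a crash after this
    # commit but before the insert is the gap described above.
```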

What is the load that GlassFlow can handle?

We have created a load test for a local setup. You can find the setup and the results here.

How do I self-host GlassFlow?

We have several hosting options. You can find them here.

Cleaned Kafka Streams for ClickHouse

Clean Data. No maintenance. Less load for ClickHouse.