Dedupe in real time without ReplacingMergeTree

ReplacingMergeTree (RMT) slows down your ClickHouse and relies on a background merge process you cannot control. GlassFlow deduplicates your data before it is ingested into ClickHouse.

More control

With GlassFlow your data is deduplicated immediately, so your query results are correct without any delay.

Less Load

Drop duplicates and reduce the data volume in your ClickHouse, making your system faster and cheaper to run.

Clean Data

By deduplicating before ingestion, you ensure that only clean data reaches your ClickHouse.

Comparison

See in detail how GlassFlow performs compared to the alternative solutions, ClickHouse ReplacingMergeTree and a custom Go service, across these criteria:

- Deduplication
- Immediate results
- Late event management
- Stateful store included
- Quick to start
- Reduced load for ClickHouse
- Low maintenance effort
- Open source

Limits of ReplacingMergeTree in ClickHouse

RMT is very popular among ClickHouse users, but it has certain limitations. The merge process cannot be controlled, and it slows down the system while merges run. Querying with FINAL brings its own challenges. Learn more about all of them in our blog article.
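
To make the trade-off concrete, here is a minimal sketch using the clickhouse-connect Python client against a local ClickHouse instance; the table, columns, and values are made up for illustration:

```python
import clickhouse_connect

# Assumes a local ClickHouse server with default credentials.
client = clickhouse_connect.get_client(host="localhost")

client.command("""
    CREATE TABLE IF NOT EXISTS events (
        user_id String,
        value   UInt64,
        version UInt64
    )
    ENGINE = ReplacingMergeTree(version)
    ORDER BY user_id
""")

# Insert the same key twice; both rows coexist until a background merge runs.
client.insert("events", [("u1", 10, 1), ("u1", 10, 2)],
              column_names=["user_id", "value", "version"])

# Without FINAL, the count may still include the duplicate.
print(client.query("SELECT count() FROM events").result_rows)

# FINAL forces deduplication at query time: correct results, slower reads.
print(client.query("SELECT count() FROM events FINAL").result_rows)
```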

How does it work?

7-day deduplication checks

Our system automatically detects and rejects duplicate records within a window of up to 7 days (configurable), keeping your data clean and preventing unnecessary storage use. You can define specific fields as deduplication keys, ensuring only unique data is accepted; any duplicates are identified and rejected in real time. With a one-click setup, it's easy to launch fully deduplicated data pipelines with zero manual overhead and minimal lag (<0.12 ms per record).
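
A minimal sketch of the idea (not GlassFlow's implementation; the key fields and window length are illustrative):

```python
import time

class DedupWindow:
    """Tracks first-seen timestamps per key and rejects repeats inside the window."""

    def __init__(self, window_seconds: float = 7 * 24 * 3600):
        self.window = window_seconds
        self.seen: dict[str, float] = {}  # key -> first-seen timestamp

    def accept(self, record: dict, key_fields: list[str]) -> bool:
        key = "|".join(str(record[f]) for f in key_fields)
        now = time.time()
        first_seen = self.seen.get(key)
        if first_seen is not None and now - first_seen < self.window:
            return False  # duplicate within the window: reject
        self.seen[key] = now
        return True

dedup = DedupWindow()
print(dedup.accept({"user_id": "u1", "event": "click"}, ["user_id", "event"]))  # True
print(dedup.accept({"user_id": "u1", "event": "click"}, ["user_id", "event"]))  # False
```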

Stateful store built-in

GlassFlow’s built-in stateful store maintains context across streaming events, enabling advanced use cases like deduplication, joins, and aggregations. The state is fully managed and persists automatically without needing external databases or extra infrastructure. With support for keyed state and time-based windows, you can build reliable, real-time pipelines that go far beyond simple transformations.
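
As a generic illustration of keyed state with time-based windows (not GlassFlow's API), here is a counter that aggregates events per key in fixed tumbling windows:

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Counts events per key in fixed (tumbling) time windows."""

    def __init__(self, window_seconds: int = 60):
        self.window = window_seconds
        self.state: dict[tuple[str, int], int] = defaultdict(int)  # (key, window_start) -> count

    def update(self, key: str, event_time: float) -> int:
        window_start = int(event_time) // self.window * self.window
        self.state[(key, window_start)] += 1
        return self.state[(key, window_start)]

counter = TumblingWindowCounter(window_seconds=60)
print(counter.update("user-1", 1_700_000_000))  # 1
print(counter.update("user-1", 1_700_000_030))  # 2: same 60 s window
print(counter.update("user-1", 1_700_000_065))  # 1: next window
```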

Managed Kafka and ClickHouse connector

The integration uses a native ClickHouse connection for top performance and reliability. You can tune batch sizes and wait times to optimize throughput, with built-in retries for handling errors. It includes automatic schema detection and management, plus full support for JSON data types, making it easy to work with complex, nested data.
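
To give a sense of the tunables described above, here is a sketch with hypothetical field names; this is not GlassFlow's actual configuration schema:

```python
# Hypothetical sink settings, one per tunable mentioned above.
clickhouse_sink = {
    "host": "clickhouse.internal",   # placeholder host
    "port": 9000,                    # native protocol port
    "table": "events",
    "max_batch_size": 10_000,        # flush after this many rows...
    "max_delay_ms": 500,             # ...or after this wait, whichever comes first
    "max_retries": 5,                # retry failed inserts
    "auto_schema": True,             # detect and manage the target schema
    "json_columns": ["payload"],     # columns using the ClickHouse JSON type
}
```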

Frequently asked questions

Feel free to contact us if you have any questions after reviewing our FAQs.

Do you have a demo?

We have prepared several demo setups that you can run yourself locally or in the cloud. You can find them here.

How is GlassFlow’s deduplication different from ClickHouse’s ReplacingMergeTree?

ReplacingMergeTree (RMT) performs deduplication via background merges, which can delay accurate query results unless you force merges with FINAL—which can significantly impact read performance. GlassFlow moves deduplication upstream, before data is written to ClickHouse, ensuring real-time correctness and reducing load on ClickHouse.

How does GlassFlow’s deduplication work?

GlassFlow’s deduplication is powered by NATS JetStream and uses a user-defined key (e.g. user_id) and a time window (e.g. 1 hour) to identify duplicates. When multiple events with the same key arrive within the configured time window, only the first event is written to ClickHouse. Any subsequent events with the same key during that window are discarded. This mechanism ensures that only unique events are persisted, avoiding duplicates caused by retries or upstream noise.
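
For intuition, JetStream natively supports idempotent publishes: messages carrying the same Nats-Msg-Id header within a stream's duplicate window are stored only once. A minimal sketch with the nats-py client (the stream, subject, and key format are illustrative, not necessarily how GlassFlow wires it up):

```python
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    await js.add_stream(name="events", subjects=["events.>"])
    # JetStream tracks Nats-Msg-Id values within the stream's duplicate
    # window (2 minutes by default, configurable per stream).
    for _ in range(2):
        await js.publish("events.click", b'{"user_id": "u1"}',
                         headers={"Nats-Msg-Id": "u1:click:2024-01-01T00:00:00Z"})
    # Only the first publish is stored; the second is flagged as a duplicate.
    await nc.close()

asyncio.run(main())
```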

Why do duplicates happen in Kafka pipelines at all?

Duplicate events in Kafka can occur for several reasons, including producer retries, network issues, or consumer reprocessing after failures. For example, if a producer doesn’t receive an acknowledgment, it may retry sending the same event—even if Kafka already received and stored it. Similarly, consumers might reprocess events after a crash or restart if offsets weren’t committed properly. These duplicates become a problem when writing to systems like ClickHouse, which are optimized for fast analytical queries but don’t handle event deduplication natively. Without a deduplication layer, the same event could be stored multiple times, inflating metrics, skewing analysis, and consuming unnecessary storage.
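
The consumer side of the problem in a nutshell: with at-least-once processing, a crash between handling a message and committing its offset means the message is handled again after restart. A sketch with the confluent-kafka Python client (broker, group, and topic names are placeholders, and write_to_clickhouse is a hypothetical sink call):

```python
from confluent_kafka import Consumer

def write_to_clickhouse(payload: bytes) -> None:
    """Hypothetical sink call; stands in for a real batched insert."""
    ...

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "analytics-loader",
    "enable.auto.commit": False,   # commit manually, after processing
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    write_to_clickhouse(msg.value())
    # A crash HERE, before the commit below, means the same message is
    # redelivered after restart and written again: a duplicate.
    consumer.commit(msg)
```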

What happens during failures? Can you lose or duplicate data?

GlassFlow uses NATS JetStream as a buffer. Kafka offsets are only committed after successful ingestion into NATS, and then data is deduplicated and written to ClickHouse. We batch inserts using the ClickHouse native protocol. If the system crashes after acknowledging Kafka but before inserting into ClickHouse, that batch is lost. We’re actively improving recovery guarantees to address this gap.
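
As a hedged sketch of that ordering (not GlassFlow's actual code; publish_to_jetstream is a hypothetical helper):

```python
from confluent_kafka import Consumer, Message

def publish_to_jetstream(payload: bytes) -> None:
    """Hypothetical helper: buffer the event in NATS JetStream,
    raising on failure so the offset commit below never runs early."""
    ...

def forward(consumer: Consumer, msg: Message) -> None:
    publish_to_jetstream(msg.value())  # 1. buffer the event first
    consumer.commit(msg)               # 2. commit the Kafka offset only now
    # Deduplication and batched native-protocol inserts into ClickHouse
    # happen downstream, off the JetStream buffer; a crash after this
    # commit but before the insert is the gap described above.
```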

What is the load that GlassFlow can handle?

We have created a load test for a local setup. You can find the setup and the results here.

How do I self-host GlassFlow?

We have several hosting options. You can find them here.

Cleaned Kafka Streams for ClickHouse

Clean Data. No maintenance. Less load for ClickHouse.