Deduplicate in real time without ReplacingMergeTree
ReplacingMergeTree (RMT) slows down ClickHouse and relies on a background merge process that you cannot control. GlassFlow deduplicates your data before it is ingested into ClickHouse.

More Control
With GlassFlow, your data is deduplicated immediately, so your query results are correct without delay.
Less Load
Drop duplicates before they reach ClickHouse and reduce its data volume. That makes your system faster and cheaper to run.
Clean Data
By deduplicating before ingestion, you ensure that only clean data reaches ClickHouse.
Comparison
See in detail how GlassFlow performs compared to alternative solutions

How does it work?
7-day deduplication checks
GlassFlow automatically detects and rejects duplicate records in real time within a configurable window of up to 7 days, keeping your data clean and preventing unnecessary storage use. You can define specific fields as deduplication keys so that only unique records are accepted. With a one-click setup, you can launch fully deduplicated data pipelines with no manual overhead and minimal lag (<0.12 ms per record).


Stateful store built-in
GlassFlow’s built-in stateful store maintains context across streaming events, enabling advanced use cases like deduplication, joins, and aggregations. The state is fully managed and persists automatically without needing external databases or extra infrastructure. With support for keyed state and time-based windows, you can build reliable, real-time pipelines that go far beyond simple transformations.
Managed Kafka and ClickHouse Connector
The integration uses a native ClickHouse connection for top performance and reliability. You can tune batch sizes and wait times to optimize throughput, with built-in retries for handling errors. It includes automatic schema detection and management, plus full support for JSON data types, making it easy to work with complex, nested data.
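To illustrate what tuning batch size and wait time means in practice, here is a minimal Python sketch of a size-or-time batching policy. It is not GlassFlow's connector code: the `Batcher` class, its parameters (`max_size`, `max_wait_seconds`), and the `flush` callback are hypothetical names chosen for this example, and retries and error handling are left out.

```python
import time
from typing import Any, Callable, List


class Batcher:
    """Buffer rows and flush when the batch is full or has waited long enough.

    Hypothetical illustration only; not GlassFlow's implementation.
    """

    def __init__(
        self,
        max_size: int,
        max_wait_seconds: float,
        flush: Callable[[List[Any]], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_size = max_size
        self.max_wait_seconds = max_wait_seconds
        self._flush = flush      # e.g. a function that inserts rows into a sink
        self._clock = clock      # injectable so the policy can be tested
        self._rows: List[Any] = []
        self._opened_at = 0.0    # time the first row of the current batch arrived

    def add(self, row: Any) -> None:
        """Add one row; flush immediately if the batch became due."""
        if not self._rows:
            self._opened_at = self._clock()
        self._rows.append(row)
        self._flush_if_due()

    def tick(self) -> None:
        """Call periodically so a small batch is not held forever."""
        self._flush_if_due()

    def _flush_if_due(self) -> None:
        if not self._rows:
            return
        full = len(self._rows) >= self.max_size
        waited = self._clock() - self._opened_at >= self.max_wait_seconds
        if full or waited:
            rows, self._rows = self._rows, []
            self._flush(rows)
```

A larger `max_size` generally favors throughput, because each insert carries more rows, while a smaller `max_wait_seconds` bounds how long a row can sit in the buffer before it is written.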

Frequently asked questions
Feel free to contact us if you have any questions after reviewing our FAQs.
We have prepared several demo setups that you can run yourself locally or in the cloud. You can find them here.
ReplacingMergeTree (RMT) performs deduplication via background merges, so queries can return duplicates until a merge has run. You can get accurate results earlier by querying with FINAL, which merges rows at read time, but this can significantly impact read performance. GlassFlow moves deduplication upstream, before data is written to ClickHouse, ensuring real-time correctness and reducing load on ClickHouse.
GlassFlow’s deduplication is powered by NATS JetStream and is based on a user-defined key (e.g. user_id) and a time window (e.g. 1 hour) to identify duplicates. When multiple events with the same key arrive within the configured time window, only the first event is written to ClickHouse. Any subsequent events with the same key during that window are discarded. This mechanism ensures that only unique events are persisted, avoiding duplicates caused by retries or upstream noise.
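The key-and-window rule described above can be sketched in a few lines of Python. This is an in-memory illustration only, not GlassFlow's NATS JetStream-based implementation: the `WindowDeduplicator` class and its `accept` method are hypothetical names, and the sketch assumes that timestamps arrive in non-decreasing order.

```python
from collections import OrderedDict


class WindowDeduplicator:
    """Keep the first event per key within a time window; drop later ones.

    Hypothetical in-memory illustration; assumes non-decreasing timestamps.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        # key -> timestamp of the first accepted event for that key.
        # Insertion order equals time order, so the oldest entry is first.
        self._first_seen: "OrderedDict[str, float]" = OrderedDict()

    def _evict_expired(self, now: float) -> None:
        """Forget keys whose window has fully elapsed."""
        while self._first_seen:
            _, seen_at = next(iter(self._first_seen.items()))
            if now - seen_at < self.window_seconds:
                break
            self._first_seen.popitem(last=False)

    def accept(self, key: str, now: float) -> bool:
        """Return True if the event should be written, False if it is a duplicate."""
        self._evict_expired(now)
        if key in self._first_seen:
            return False
        self._first_seen[key] = now
        return True


dedup = WindowDeduplicator(window_seconds=3600)  # 1-hour window
events = [("user-1", 0), ("user-1", 10), ("user-2", 20), ("user-1", 3600)]
written = [(key, ts) for key, ts in events if dedup.accept(key, ts)]
```

Here the second `user-1` event falls inside the window and is discarded, while the one at `3600` arrives after the window has elapsed and is accepted as a new first event.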
Duplicate events in Kafka can occur for several reasons, including producer retries, network issues, or consumer reprocessing after failures. For example, if a producer doesn’t receive an acknowledgment, it may resend the same event even though Kafka already received and stored it. Similarly, consumers might reprocess events after a crash or restart if offsets weren’t committed properly.
These duplicates become a problem when writing to systems like ClickHouse, which are optimized for fast analytical queries but do not enforce unique keys when rows are written. Without a deduplication layer, the same event could be stored multiple times, inflating metrics, skewing analysis, and consuming unnecessary storage.
GlassFlow uses NATS JetStream as a buffer. Kafka offsets are committed only after successful ingestion into NATS; the data is then deduplicated and written to ClickHouse in batched inserts over the ClickHouse native protocol. If the system crashes after committing the Kafka offset but before inserting into ClickHouse, that batch is lost. We’re actively improving recovery guarantees to address this gap.
We have created a load test for a local setup. You can find the setup and the results here.
We have several hosting options. You can find them here.
We are working on a managed service. To receive updates, submit your email address via this form.
