Migrating from Snowflake to ClickHouse: Step-by-Step Guide

Written by Armend Avdijaj, 04/08/2025

The trade-off between cost and latency in cloud data warehousing is becoming a serious issue for many engineering teams. Workloads naturally grow, user expectations for real-time feedback keep rising, and pricing is harder to predict than it used to be; the result is surprise bills and performance bottlenecks. This is precisely why, for teams building high-throughput, low-latency analytical applications, ClickHouse has emerged as a compelling alternative. It's open-source, built for speed, and a joy to use, so what's not to like?

This article is a playbook for building a migration pipeline from Snowflake to ClickHouse. It provides clear, step-by-step examples covering common use cases, from the initial assessment and data extraction to performance optimization and the final cut-over.

So without further ado, let's start.

When (and Why) to Consider Migrating

The decision to migrate a data warehouse is never taken lightly, but it's often forced by growing pains in cost, performance, architecture, and maintenance. We can break these drivers down into the following points:

  • Cost & Performance
  • Architectural & Control Factors
  • Strategic Alignment

Cost & Performance Pressures

Snowflake's pay-for-compute model works well until it doesn't. The moment your workload becomes spiky or your queries are user-facing, you're suddenly caught in a dilemma:

  • To handle sudden query bursts, you can rely on auto-scaling, which burns through credits at an alarming rate.
  • To deliver low-latency dashboards, you're forced to keep a virtual warehouse running 24/7, paying for idle time just to avoid the multi-second "resume" delay.

ClickHouse sidesteps this entirely. Its shared-nothing architecture on local storage means there is no "wake-up" time. Queries are executed immediately against data on disk, making it structurally better suited for applications where every millisecond of latency impacts the user experience.

Architectural & Control Factors

The shift to an open-source tool like ClickHouse is often a deliberate move to reclaim engineering autonomy. On a managed platform, the underlying infrastructure and its optimizations are a black box. With ClickHouse, you get direct control over the machine. You can provision nodes with specific CPU-to-RAM ratios, choose high-throughput NVMe storage, and tune kernel settings. You have direct access to table engine parameters, letting you configure MergeTree settings for your exact workload, adjust merge behavior, and pick your own compression codecs (LZ4 for speed, ZSTD for size). This level of control allows you to solve performance problems at the root, rather than just throwing more credits at them.

Strategic Alignment

Choosing ClickHouse is also a strategic decision to align your data infrastructure with your operational strengths. If your team is already skilled in managing Linux-based systems and deploying applications on Kubernetes, ClickHouse fits directly into your existing tooling and knowledge base. It’s not an isolated, proprietary system but a component that integrates cleanly with the open-source streaming ecosystem, connecting to sources like Apache Kafka and processors like GlassFlow without friction.

Key Differences Between Snowflake and ClickHouse

Before any data is moved, you have to internalize that Snowflake and ClickHouse are built on opposing design principles. This isn't just a syntax change; it's a paradigm shift. Every choice, from data layout on disk to query execution, is different, and understanding these differences is non-negotiable for a successful migration.

  1. Storage & Compute Model: Decoupled cloud object storage vs. local disks & replicas
  2. Data Layout: Micro-partitions vs. MergeTree parts & primary-key ordering
  3. Indexing & Pruning: Metadata pruning vs. skip indexes, projections
  4. Concurrency & Elasticity: Virtual warehouses vs. shard-based horizontal scaling
  5. Security & Governance: Built-in RBAC & masking vs. self-managed policies

Below is a more comprehensive feature comparison matrix for both services:

Feature | Snowflake | ClickHouse
Model | Decoupled Storage & Compute | Coupled Storage & Compute (Shared-Nothing)
Primary Use Case | Enterprise Cloud Data Warehousing | Real-Time, High-Throughput Analytics
Data Storage | Cloud Object Storage (S3, GCS, Azure Blob) | Local Disks (NVMe/SSD) or Object Storage
Primary Index | Metadata on Micro-Partitions | Sparse Primary Index based on ORDER BY key
Concurrency | Elastic Virtual Warehouses | Horizontal Scaling via Sharding and Replication
Data Types | Rich SQL types, first-class VARIANT for JSON | Rich SQL types, requires explicit parsing of JSON
Transactions | Multi-statement ACID transactions | Atomic INSERT statements, limited transaction support
Source Model | Closed-Source, Managed Service | Open-Source, Self-Hosted or Cloud-Managed
Query Latency | Seconds to Minutes (includes warehouse spin-up) | Milliseconds to Seconds
Cost Model | Credit-based (Compute + Storage) | Hardware/Instance-based (Compute + Storage)

Figure 01: Feature-by-Feature Comparison Matrix

Migrating the Data

We can think of the migration itself as a five-step engineering project. Approaching it with this structure ensures that data is not only moved correctly but is also reshaped to take full advantage of ClickHouse's architecture from day one:

1. Inventory & Assessment

The first job is a thorough reconnaissance of your Snowflake environment. This means programmatically listing every database, schema, table, and especially any views or pipes that contain hidden logic. The two critical items to flag are reliance on Time Travel, which has no direct equivalent in ClickHouse and will require a new data modeling approach (like an event log table), and the use of VARIANT columns. Every VARIANT field must be analyzed to define an explicit, flattened schema in ClickHouse, as it does not have a native "schemaless" type.
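
A couple of metadata queries go a long way here. The sketch below runs against a hypothetical ANALYTICS database (adjust names to your environment): it lists tables and views by size and flags every VARIANT column that will need an explicit, flattened schema.

```sql
-- Inventory of tables and views in one database, largest first
SELECT table_schema, table_name, table_type, row_count, bytes
FROM ANALYTICS.INFORMATION_SCHEMA.TABLES
ORDER BY bytes DESC NULLS LAST;

-- Every VARIANT column that will need an explicit schema in ClickHouse
SELECT table_schema, table_name, column_name
FROM ANALYTICS.INFORMATION_SCHEMA.COLUMNS
WHERE data_type = 'VARIANT';

-- Pipes often hide ingestion logic that must be re-created downstream
SHOW PIPES IN DATABASE ANALYTICS;
```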

2. Extract from Snowflake

The standard and most effective way to get bulk data out of Snowflake is via the COPY INTO @stage command, writing to an external cloud stage like S3 as Parquet files. Parquet is a good intermediary format because it's columnar, compressed, and universally understood. The extraction strategy typically involves a one-time, full-table unload for the historical data, followed by a separate, ongoing process to capture incremental updates that occur during the transition period. For the incremental updates, you can use an orchestrator like Airflow, dbt Cloud, or Prefect to schedule regular extracts, or implement a simple cron job that runs COPY INTO commands with appropriate date filters (such as WHERE updated_at > '2024-01-01') to capture only the newly added or modified records since the last extraction. The frequency of these incremental runs depends on your data freshness requirements and can range from hourly to daily intervals:
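
As a minimal sketch, the full unload and an incremental run might look like this (the stage, database, and table names are placeholders):

```sql
-- One-time historical unload to an external S3 stage as Parquet
COPY INTO @migration_stage/events/full/
FROM analytics.public.events
FILE_FORMAT = (TYPE = PARQUET)
MAX_FILE_SIZE = 268435456;   -- ~256 MB files parallelize nicely downstream

-- Scheduled incremental unload, filtered on a watermark column
COPY INTO @migration_stage/events/incremental/
FROM (
    SELECT * FROM analytics.public.events
    WHERE updated_at > '2024-01-01'
)
FILE_FORMAT = (TYPE = PARQUET)
MAX_FILE_SIZE = 268435456;
```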

3. Transform for ClickHouse

This is the most critical stage for future performance. Data that is perfectly structured for Snowflake's micro-partitioning is not optimized for ClickHouse's MergeTree engine. The objective here is twofold: flatten any nested JSON data into a clean, columnar structure, and, most importantly, physically sort the Parquet files according to the columns that will form your ClickHouse table's ORDER BY key. If your table will be ORDER BY (event_date, user_id), your Parquet files must be sorted by event_date, then user_id. This pre-sorting makes ingestion far cheaper, because the data already arrives in the order it needs to be on disk and ClickHouse spends little effort re-sorting it.

A Python script using libraries like pandas and pyarrow is a great tool for this job:
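
Here's a minimal sketch, assuming a raw JSON column named payload and a target ORDER BY of (event_date, user_id); adapt the names to your own schema:

```python
import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read one unloaded Parquet file from the Snowflake stage
df = pd.read_parquet("events_raw.parquet")

# Flatten the raw JSON column (assumed here to be a JSON-encoded string
# named "payload") into proper columns
payload = pd.json_normalize(df["payload"].map(json.loads))
df = pd.concat(
    [df.drop(columns=["payload"]), payload.add_prefix("payload_")],
    axis=1,
)

# Pre-sort to match the target table's ORDER BY (event_date, user_id)
df = df.sort_values(["event_date", "user_id"])

# Write a sorted, ZSTD-compressed Parquet file, ready for the s3() load step
pq.write_table(
    pa.Table.from_pandas(df, preserve_index=False),
    "events_sorted.parquet",
    compression="zstd",
)
```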

4. Load into ClickHouse

With properly prepared Parquet files, ingestion is refreshingly direct. You first define your target table in ClickHouse, ensuring the column types and ORDER BY clause exactly match the structure of your transformed data. Then, you use the s3 table function to execute a streaming INSERT...SELECT query that reads directly from your cloud storage bucket. This method is highly efficient and avoids moving data through an intermediate server.
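
A sketch of what this can look like, with placeholder column names, bucket path, and credentials:

```sql
-- Target table: column types and ORDER BY must match the sorted Parquet files
CREATE TABLE events
(
    event_date   Date,
    user_id      UInt64,
    event_type   LowCardinality(String),
    payload_page String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Streaming INSERT ... SELECT straight from the bucket via the s3 table function
INSERT INTO events
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/migration/events/*.parquet',
    '<aws_access_key_id>', '<aws_secret_access_key>',
    'Parquet'
);
```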

5. Validate & Reconcile

Never assume a migration was perfect; validation is a vital final step. You must programmatically verify data integrity between the source and the destination: run checksums or aggregate hashes (groupBitXor in ClickHouse) on numeric columns, compare min/max values on timestamps, and execute a suite of identical, business-critical aggregation queries on both systems to confirm that the results match. A sketch of the ClickHouse side follows the checklist below.

  • Row Counts: Simple COUNT(*) on both ends.
  • Checksums: Run checksums or hashes on key columns to check for data integrity.
  • Sampled Queries: Run a few representative analytical queries on both systems and compare the results.
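
On the ClickHouse side, a minimal reconciliation query could look like the following (table and column names are placeholders; the Snowflake counterpart would use COUNT(*), MIN/MAX, and BITXOR_AGG so the numbers can be compared directly):

```sql
SELECT
    count()              AS row_count,
    min(event_date)      AS min_event_date,
    max(event_date)      AS max_event_date,
    groupBitXor(user_id) AS user_id_xor   -- compare against BITXOR_AGG(user_id)
FROM events;
```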

Diagram 01: End-to-End Batch Migration Flow

Optimizing for Performance

With the data migrated and validated, the focus now shifts to getting the performance you came for. This requires moving beyond the defaults and making deliberate design choices tailored to your specific query patterns.

Table Design & Partitioning

The MergeTree engine variant and the ORDER BY clause are your most powerful tuning levers. For example, use ReplacingMergeTree if you need to handle late-arriving duplicates based on a unique key, or SummingMergeTree to have ClickHouse automatically roll up metrics during its background merge processes. The ORDER BY key is your primary index; it should be composed of the low-cardinality columns used most frequently in your queries' WHERE clauses (e.g., (event_type, region, toStartOfDay(timestamp))). The PARTITION BY clause is for coarse-grained data management, typically by month (toYYYYMM(timestamp)), which allows for extremely fast partition-level operations like DROP or DETACH for archiving.
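
For instance, a table that collapses late-arriving duplicates while keeping the ordering and partitioning discussed above might be sketched like this (all names are illustrative):

```sql
CREATE TABLE user_events
(
    timestamp  DateTime,
    event_type LowCardinality(String),
    region     LowCardinality(String),
    user_id    UInt64,
    version    UInt64   -- highest version wins when duplicates are merged
)
ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(timestamp)
-- Rows sharing the full ORDER BY key are deduplicated during background
-- merges; deduplication is eventual, so use FINAL for exact reads.
ORDER BY (event_type, region, toStartOfDay(timestamp), user_id);
```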

Compression & Storage

Storage and compression choices directly impact I/O performance. The default LZ4 compression is fast, but ZSTD often yields significantly better compression ratios, which means less data to read from disk. You can apply column-level codecs to compress low-cardinality strings with one algorithm and high-cardinality numeric data with another. For hardware, nothing beats local NVMe SSDs for raw I/O throughput. For very large datasets, ClickHouse's tiered storage allows you to keep the most recent, "hot" partitions on NVMe while automatically moving older, "cold" partitions to cheaper object storage like S3.
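
A sketch of column-level codecs (names and codec choices are illustrative; measure against your own data):

```sql
CREATE TABLE app_logs
(
    timestamp DateTime               CODEC(Delta, ZSTD(3)),  -- delta-encode near-monotonic timestamps
    level     LowCardinality(String) CODEC(ZSTD(1)),
    message   String                 CODEC(ZSTD(3)),         -- favor ratio over speed
    trace_id  UInt64                 CODEC(LZ4)              -- favor speed on hot lookups
)
ENGINE = MergeTree
ORDER BY (level, timestamp);
```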

Query Tuning

ClickHouse provides several powerful features to accelerate queries without changing the raw data. Projections are a key tool; they are essentially indexed and aggregated subsets of your data, managed by the table itself. If you frequently query SELECT page, count() GROUP BY page, creating a projection for that query will make it orders of magnitude faster. For columns with a small, fixed set of values (e.g., country codes, status enums), the LowCardinality data type creates a dictionary encoding that dramatically reduces the memory footprint and speeds up any GROUP BY or WHERE operations involving that column.
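
Continuing the SELECT page, count() example from the text, a projection could be added like this (page_views is a placeholder table):

```sql
-- Define a pre-aggregated projection, then build it for existing parts
ALTER TABLE page_views
    ADD PROJECTION page_totals
    (
        SELECT page, count()
        GROUP BY page
    );

ALTER TABLE page_views MATERIALIZE PROJECTION page_totals;

-- Queries of this shape can now be served from the projection
SELECT page, count() FROM page_views GROUP BY page;
```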

GlassFlow for Preprocessing: Sort, Deduplicate, Stream

The batch-oriented workflow is ideal for the initial historical data load, but it's not a solution for continuous, real-time ingestion. For live data from sources like Kafka, a dedicated stream processing layer is required. A tool like GlassFlow is built precisely for this purpose, acting as an intermediary that can consume a raw stream, apply stateful transformations like time-windowed deduplication, sort records for optimal ingestion, and then write clean, ordered batches into a ClickHouse sink. This automates the entire "transform and load" pipeline for live data.

Diagram 02: Real-Time Pipeline with GlassFlow

Monitoring & Maintenance

Once operational, you need visibility into the cluster's health, which is provided by the extensive system tables. You should actively monitor system.merges to understand the write amplification and compaction load on your cluster. The system.parts and system.columns tables give you a precise view of your data's physical storage, including the size and row count of each data part and the on-disk size of each column. This data is essential for diagnosing performance issues and tuning background task thresholds to match your workload.
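
Two checks worth keeping handy, as a sketch:

```sql
-- Merges currently in flight and their progress
SELECT table, elapsed, progress, rows_read, memory_usage
FROM system.merges;

-- On-disk size and part count per table
SELECT
    table,
    count()                                 AS parts,
    sum(rows)                               AS rows,
    formatReadableSize(sum(bytes_on_disk))  AS size_on_disk
FROM system.parts
WHERE active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
```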

Cut-Over Strategy

The final phase is the production cut-over. This must be a carefully planned operation to avoid any disruption to downstream applications. A gradual, controlled transition is always the best approach. This way you can isolate issues quickly and roll back if required.

A robust cut-over strategy involves implementing dual-writes, where your application logic sends new data to both Snowflake and ClickHouse simultaneously. This creates a parallel system for validation. You can then run "shadow queries" against ClickHouse, mirroring your production read traffic to compare results and benchmark performance under real-world load. Once you have full confidence, you can use a DNS or load balancer to begin routing a small percentage of live user traffic to the ClickHouse backend, gradually increasing the percentage until it handles 100% of the load. A well-documented rollback plan, detailing the exact steps to revert traffic to Snowflake, is a non-negotiable safety measure.

Diagram 03: Example Migration Timeline & Checkpoints

Adoption Stories

Looking for more practical examples? ClickHouse has a comprehensive documentation section where you can see which companies are adopting ClickHouse, which services they're migrating from, how they're performing these migrations, and why they're adopting ClickHouse in the first place. We selected a subset of these success stories that are also well documented, so you can see how a successful migration plays out in real-world production environments:

  • Cloudflare: This is the canonical story. When you need to analyze millions of events per second and give users instant feedback, you don't use a general-purpose warehouse. Cloudflare's detailed account of building their analytics on ClickHouse is required reading and effectively wrote the playbook for high-throughput observability.
  • Sentry: When their search infrastructure started to buckle, Sentry re-architected their core search product, Snuba, on top of ClickHouse to solve their latency and scale issues at the root, bulletproofing the real-time error discovery feature their users depend on.
  • Vantage: Their migration story directly mirrors the pain points of cloud data warehouses. Bogged down by the cost and sluggishness of alternatives like Redshift and Postgres, Vantage moved to ClickHouse for the exact reasons outlined in this article: to regain control and deliver a faster, more cost-effective product.
  • Attentive: They titled their migration story "Confoundingly Fast" for a reason. For a marketing platform that relies on real-time segmentation and analytics, speed is an actual requirement. Their move to ClickHouse was a strategic decision that directly powered their product's capabilities.
  • Didi: Elasticsearch wasn't cutting it for Didi's firehose of petabyte-scale log data. The ride-sharing company migrated its entire log storage system to ClickHouse, documenting massive improvements in performance and a sharp reduction in operational costs.
  • Plausible Analytics: Plausible didn't migrate to ClickHouse; they built their entire business on it from day one. Their open-source, privacy-first analytics platform is living proof that you can build a lean, fast, and user-facing service without the exorbitant price tag of a traditional cloud data warehouse.
  • PostHog: PostHog migrated from Postgres to ClickHouse to handle their growing analytics workload. They deployed ClickHouse in parallel and used feature flags to switch queries over one by one.
  • LangChain/LangSmith: Chose ClickHouse to power the observability features of LangSmith. LangSmith now stores run results in ClickHouse instead of Postgres to improve performance and scalability.
  • Tekion: Adopted ClickHouse Cloud for application performance and metrics monitoring after their previous search provider became expensive and difficult to maintain. The team noted that "ClickHouse has proved to be a game-changer, propelling us towards greater efficiency and effectiveness in managing our data infrastructure."

Final Thoughts

Migrating from Snowflake to ClickHouse can be a move toward a more performant, cost-effective, and controllable analytics stack, depending on your use case. A successful outcome is based on a deep understanding of the architectural differences and a methodical approach to the migration, with a particular focus on pre-sorting data to match your ClickHouse table design. Our recommendation is to start small. Pilot the migration on a single, well-understood workload. Use it as an opportunity to iterate on your schema design, tune your queries, and build operational confidence. And for new real-time pipelines, always consider tools like GlassFlow to manage the complexities of stream processing from the start. It will simplify your process and save you and your team a lot of headaches.
