Docs

Pricing

About

Blog

Contact

Start building

ClickHouse

Migrating from Redshift to ClickHouse

Migrating from Redshift to ClickHouse: Step-by-Step Guide

Written by

Armend Avdijaj

Sep 26, 2025

1. Introduction

Amazon Redshift is a common choice for enterprise data warehousing, serving traditional business intelligence and reporting needs well. However, modern data applications require more: real-time analytics, sub-second queries on high-volume streams, and predictable costs that don't blow up with growth. In these areas, Redshift can become limited.

Scaling a Redshift cluster can introduce operational headaches and cost uncertainty; RA3 nodes create complexity with their opaque hot-cold data tiering, which can cause sudden performance drops. The Serverless model aims for simplicity but its RPU-based pricing makes forecasting costs for spiky workloads difficult. Other features like Concurrency Scaling often act as expensive band-aids, masking underlying capacity problems while adding their own latency spikes. For teams building interactive dashboards, real-time APIs, or observability platforms, these issues are major roadblocks.

If you're a data engineer, you probably already know ClickHouse; an open-source columnar database built to solve these exact problems. Its architecture is particularly focused on query speed, operational transparency, and cost-efficiency.

This article presents a practical migration playbook for engineers moving from Redshift to ClickHouse. We'll provide example executable code, Redshift-specific architectural advice, and a clear direction for data modeling, extraction, loading, and optimization.

Before we start, check the table below; it will give you common pain-points and how ClickHouse addresses these. If some or all of this sounds familiar, then a migration to ClickHouse might be worth considering:

Redshift Symptom	Why It Happens	ClickHouse Fit	Migration Note
High query latency for aggregations (>5s)	MPP architecture is for broad scans, not point lookups. AQUA helps but has limits.	Primary key index provides a sparse index for rapid data pruning and sub-second response.	Rethink `SORTKEY` as the ClickHouse `ORDER BY` key.
Unpredictable, rising costs	Coupled storage/compute (RA3) or opaque RPU billing (Serverless). Concurrency Scaling adds cost.	Decoupled storage (optional S3) and compute. Granular control over compression and hardware.	Plan storage strategy (local NVMe vs. object storage) based on cost/performance needs.
Struggles with CDC / real-time updates	`MERGE` and `VACUUM` operations are heavy, slow, and block concurrent operations.	`ReplacingMergeTree` and `CollapsingMergeTree` engines handle updates efficiently in the background.	Use a streaming layer like Kafka with a tool like GlassFlow for efficient, deduplicated inserts.
Limited operational control	Redshift is a "black box". Tuning is restricted to WLM, sort/dist keys, and node types.	Full control over partitioning, indexing, compression codecs, storage layout, and query execution.	This control requires expertise. Start with sensible defaults and tune based on monitoring.

Table 01: “Is Migration Warranted?” Quick Checklist

2. When (and Why) to Consider Migrating (Redshift-Specific)

Deciding to migrate goes far beyond simple performance benchmarks. The choice must align your database technology with your workload, cost model, and operational philosophy.

Cost & Elasticity

Redshift's scaling model can create unpredictable costs and performance. For example, RA3 nodes tier data between fast SSDs and slower S3, but this process is opaque and can cause sudden query slowdowns when data is unexpectedly "cold." The Serverless option tries to simplify this but links costs to Redshift Processing Units (RPUs), which makes budgeting for irregular analytical traffic difficult. Even features designed to help, like Concurrency Scaling, can mask capacity issues by adding expensive, temporary clusters that introduce their own "cold start" latency. These factors eventually combine to create an environment where performance and cost are hard to control, a significant problem for teams building responsive applications.

Latency & Workload Fit

Redshift's MPP architecture is designed for large, parallel table scans, making it effective for BI reporting. It is not built for the low-latency aggregations required by user-facing analytics. Joins on high-cardinality columns are also a common source of poor performance.

This architectural mismatch is especially painful for real-time workloads. Change Data Capture (CDC) pipelines that require frequent updates or deletions can be used, but in reality are a poor fit. The MERGE command is a heavy, multi-step operation, and tables with many deletes require frequent, blocking VACUUM operations to reclaim space and maintain performance. This is simply too slow and disruptive for a system that needs to reflect changes in seconds.

Operational Control

Redshift is a managed service that hides many of its internal operations. As an engineer, your tuning options are limited to workload management (WLM) queues, sort keys, distribution styles, and instance types. You cannot control how data is compressed, how indexes work at a low level, how storage is physically laid out, etc. ClickHouse gives you direct control over all these aspects, allowing for deep optimization tailored to your specific queries.

3. Key Differences: Redshift vs. ClickHouse

Migrating from Redshift requires a mental shift. The two systems approach data storage, layout, and querying with fundamentally different philosophies. Understanding these distinctions is a must to designing an effective ClickHouse schema:

Storage/Compute: In Redshift, the architecture is a tightly coupled MPP cluster where compute and storage are logically linked. Scaling one often requires scaling the other. ClickHouse separates these concerns. Data can live on local NVMe for maximum performance or in an object store like S3 for cost savings. Shards and replicas provide computational scaling independent of where the data is stored.
Data Layout: Redshift defines data layout with a DISTKEY (how data is distributed across nodes) and a SORTKEY (how data is physically ordered on disk). These choices are rigid and impact the entire cluster. In ClickHouse, the ORDER BY key (also called the primary key) physically sorts data within each partition. This is the main tool for query optimization, as it allows ClickHouse to prune massive amounts of data based on the WHERE clause.
Indexing/Pruning: Redshift uses zone maps, which store the min/max value for each 1MB block of data. They help prune blocks but are not very granular. ClickHouse uses a sparse primary key index. It stores marks for every Nth row (defined by index_granularity), allowing it to quickly find the start of a relevant data range without reading the full dataset. It also supports secondary data skipping indexes and projections for further optimization.
Semi-Structured Data: Redshift uses the SUPER data type and the PartiQL language to handle JSON. This is flexible but can be slow. ClickHouse prefers an explicit, schema-first approach. You ingest JSON using the JSONEachRow format, which parses fields into typed columns on write. This makes subsequent queries on that data much faster.

Area	Redshift Behavior	ClickHouse Behavior	Migration Implication
Primary Goal	BI & Enterprise Warehousing	Real-time Analytics & OLAP	Shift from broad scans to low-latency, filtered aggregations.
Data Distribution	`DISTKEY` across cluster nodes.	Sharding data across independent nodes.	`DISTKEY` for joins becomes a sharding key for distributed queries.
Data Sorting	`SORTKEY` for range scans.	`ORDER BY` (Primary Key) for sparse indexing and data pruning.	`SORTKEY` columns are the best candidates for the ClickHouse `ORDER BY`.
Updates/Deletes	Heavy `MERGE`/`VACUUM` process.	`ALTER ... UPDATE/DELETE` or specialized `MergeTree` engines.	Use `ReplacingMergeTree` for CDC workloads instead of slow `MERGE` statements.
Semi-structured	`SUPER` type with PartiQL.	`JSONEachRow` on ingest, `Nested` types, `Map` columns.	Flatten `SUPER` data into typed columns during the ETL process.
Scaling	Add nodes (RA3) or RPUs (Serverless). Concurrency Scaling for reads.	Add replicas (reads) or shards (writes/distributed processing).	ClickHouse offers more granular, independent scaling for reads and writes.

Table 02: Feature Comparison Matrix

4. Data Model & Type Mapping (From Redshift to ClickHouse)

Translating a Redshift schema to ClickHouse requires rethinking how data is physically organized to match your query patterns.

Most standard data types like INTEGER, BIGINT, VARCHAR, and TIMESTAMP have direct equivalents. However, pay attention to DECIMAL precision and TIMESTAMP time zones; ClickHouse's DateTime64 with an explicit timezone is recommended. Redshift-specific column encodings like AZ64 become irrelevant. Instead, you will use ClickHouse's column-level compression codecs like ZSTD and LZ4.

The biggest structural changes come from handling Redshift's proprietary features; the SUPER type must be flattened into structured columns before loading into ClickHouse. This can be done in Redshift using PartiQL before the UNLOAD step or during an ETL transformation. The goal is to convert schemaless JSON into a well-defined set of typed columns for better query performance.

Most importantly, the DISTKEY and SORTKEY concepts must be completely re-evaluated:

The columns in your Redshift SORTKEY are the strongest candidates for the ORDER BY clause in your ClickHouse table. This is the most critical decision for ClickHouse performance.
Your DISTKEY is often a high-cardinality ID used for joins. In a clustered ClickHouse setup, this would become your sharding key.

To illustrate, below is a common Redshift table using these features:

CREATE TABLE public.events (
    event_id BIGINT IDENTITY(1,1),
    event_timestamp TIMESTAMP NOT NULL,
    user_id VARCHAR(256),
    event_type VARCHAR(128) ENCODE AZ64,
    properties SUPER
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (event_type, event_timestamp)

Code Block 01: Redshift Sample Schema (DIST/SORT/SUPER)

Now, let's translate that into an idiomatic ClickHouse schema. Notice how the SUPER field is flattened, and the SORTKEY directly informs the ORDER BY clause.

-- This DDL is for a single node. For a cluster, add ON CLUSTER my_cluster.
CREATE TABLE default.events (
    -- IDENTITY column becomes a standard integer
    event_id UInt64,
    event_timestamp DateTime64(3, 'UTC'),
    user_id String,
    event_type LowCardinality(String),
    -- Flattened SUPER properties
    prop_source String,
    prop_value Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_timestamp)
ORDER BY (event_type, event_timestamp, user_id); -- SORTKEY becomes ORDER BY

Code Block 02: ClickHouse Equivalent DDL

5. Migrating the Data

This section outlines a practical, batch-oriented migration path from Redshift to ClickHouse using S3 as an intermediary. We'll break this down into 5 steps:

Inventory & Assessment (Redshift-Specific)
Extract from Redshift (Batch)
Transform for ClickHouse
Load into ClickHouse
Validate & Reconcile

5.1 Inventory & Assessment (Redshift-Specific)

Before exporting data, you need a complete inventory of your Redshift objects. The svv_ and pg_ system views are the primary tools for this discovery phase. You should catalog all schemas, tables, and views, paying close attention to view dependencies, as Redshift's late-binding views can hide logic that needs to be recreated.

The main goals here are as follows:

Analyze Table Structure: Document the DISTKEY, SORTKEY, and column types for each table. Check for IDENTITY columns, as these will need a new strategy like using generateUUIDv4().
Assess SUPER and Spectrum Usage: Identify all tables using the SUPER type, as they require a flattening step. Identify any external tables using Redshift Spectrum; if their source is already S3, you can ingest that data directly into ClickHouse and bypass Redshift.
Define the ClickHouse Ordering Strategy: This is your most critical task. Analyze your most frequent Redshift queries. The columns used in WHERE clauses and GROUP BYs are the best candidates for your new ORDER BY key in ClickHouse.

The following example SQL queries can help you gather this information from your Redshift cluster:

-- Get table structure, dist/sort keys, and encodings
SELECT
    schemaname,
    tablename,
    "column",
    "type",
    encoding,
    distkey,
    sortkey
FROM pg_catalog.svv_columns
WHERE schemaname = 'public'
ORDER BY schemaname, tablename, sortkey, "column";

-- Find views and their definitions to recreate logic
SELECT
    schemaname,
    viewname,
    definition
FROM pg_catalog.pg_views
WHERE schemaname = 'public';

-- Check for SUPER column usage
SELECT
    table_schema,
    table_name,
    column_name
FROM svv_all_columns
WHERE data_type = 'super'

Code Block 03: Redshift Inventory SQL

Figure 01: Inventory to Extraction to CH Schema Map

5.2 Extract from Redshift (Batch)

The most efficient way to get large datasets out of Redshift is the UNLOAD command, which writes data in parallel to S3. Parquet is the ideal format for this; it's columnar, preserves data types well, and is highly optimized for ClickHouse ingestion.

When unloading, use the PARTITION BY option to create a directory structure in S3 (e.g., s3://bucket/table/year=2023/). This helps manage the data flow and can align with ClickHouse's partitioning. For this to work, your Redshift cluster needs an IAM role with permission to write to the target S3 bucket.

Below is an UNLOAD command that exports our sample table to partitioned Parquet files.

-- WARNING: CLEANPATH will delete existing files in the S3 path
UNLOAD ('SELECT event_id, event_timestamp, user_id, event_type, properties FROM public.events')
TO 's3://your-migration-bucket/events/'
IAM_ROLE 'arn:aws:iam::ACCOUNT_ID:role/RedshiftS3AccessRole'
FORMAT AS PARQUET
PARTITION BY (event_type)
CLEANPATH; -- This deletes existing files!

Code Block 04: Redshift UNLOAD to Parquet (Partitioned)

After running the command, you should validate that the files were created using the AWS CLI:

# Validate partitioned directories and files were created in S3
aws s3 ls s3://your-migration-bucket/events/ --recursive

Code Block 05: AWS CLI Validation

5.3 Transform for ClickHouse

This optional step is for cleaning and reshaping data before it enters ClickHouse. It is essential if you are dealing with SUPER columns or if you want to pre-sort your data for maximum ingestion performance.

If your tables contain SUPER data, you can use PartiQL syntax directly in the UNLOAD command's SELECT statement to extract nested fields into standard columns. This is often the easiest way to handle flattening, as it avoids an extra processing step.

The following example modifies the SELECT query from the UNLOAD command to perform this flattening on the fly:

-- Use this SELECT inside your UNLOAD command to flatten the 'properties' column
SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type,
    p.prop_source::varchar, -- Extracts 'prop_source' from SUPER
    p.prop_value::float     -- Extracts 'prop_value' from SUPER
FROM public.events e,

Code Block 06: PartiQL Flattening of SUPER

For more complex transformations, you might perform a post-processing step on the Parquet files in S3. The most impactful optimization is to sort the data within each file to match the ORDER BY key of the target ClickHouse table. This pre-sorting drastically reduces the work ClickHouse does during its initial merge, leading to faster data availability and lower resource consumption. A simple script using a library like PyArrow can handle this, but for very large datasets, a distributed framework like Spark is more appropriate.

# Conceptual example using PyArrow to read, sort, and write Parquet files
import pyarrow.parquet as pq
import pyarrow as pa
from pyarrow import dataset as ds

# Assumes files are accessible (e.g., via s3fs)
source_dataset = ds.dataset('s3://your-migration-bucket/events/', format='parquet')

# Define the sort key to match the ClickHouse table's ORDER BY clause
sort_keys = [('event_type', 'ascending'), ('event_timestamp', 'ascending')]

# This operation is memory-intensive. For very large datasets, use
# a distributed framework like Spark or Dask.
table = source_dataset.to_table()
sorted_table = table.sort_by(sort_keys)

pq.write_table(sorted_table, 'path/to/sorted_output.parquet')

Code Block 07: Python: Parquet Post-Processing

5.4 Load into ClickHouse

With prepared data in S3, loading it into ClickHouse is simple. First, create the target table, which will typically use the MergeTree engine or its replicated variant, ReplicatedMergeTree, for clustered environments.

Then, you can ingest the Parquet files using the s3() table function. This function allows ClickHouse to read files directly and in parallel from an S3 bucket, making it highly efficient for bulk loading.

Below is the DDL for our events table in a clustered setup, followed by the INSERT statement to load the data from S3:

-- Create the table on all nodes in the 'my_cluster' cluster
CREATE TABLE default.events ON CLUSTER my_cluster (
    event_id UInt64,
    event_timestamp DateTime64(3, 'UTC'),
    user_id String,
    event_type LowCardinality(String),
    prop_source String,
    prop_value Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_timestamp)
ORDER BY (event_type, event_timestamp, user_id);

-- Note: {shard} and {replica} are macros automatically expanded by ClickHouse
-- based on the server configuration for each node

Code Block 08: ClickHouse DDL (Cluster-Aware)

Now, use an INSERT SELECT query with the s3() function. The file globbing pattern (*.parquet) allows it to read all Parquet files in the specified path.

-- Run this on one node; the data will be distributed across the cluster
INSERT INTO default.events
SELECT *
FROM s3(
    '<https://your-migration-bucket.s3.amazonaws.com/events/sorted_*.parquet>',
    'Parquet'  -- Relies on IAM role or environment variables
)

Code Block 09: ClickHouse Ingest from S3 Parquet

Figure 02: End-to-End Batch Path: S3 to CH

5.5 Validate & Reconcile

After loading the data, the last step is verifying its integrity. Do not skip this step. A thorough validation process ensures no data was lost or corrupted during migration.

Depending on the size of your database, this can be daunting; start with simple checks like row counts. Then, move to more detailed comparisons. Check the MIN and MAX values on key columns, especially timestamps, to ensure the full data range was migrated. For numerical columns, calculate sums, averages or more detailed statistical measures on both systems and compare the results. Finally, run a few representative business queries such as daily active users—on both Redshift and ClickHouse to confirm the results are identical.

The following queries provide a general template for these validation checks.

-- Run on both Redshift and ClickHouse and compare results
-- 1. Row Count
SELECT count(*) FROM events;

-- 2. Time Range Check
SELECT min(event_timestamp), max(event_timestamp) FROM events;

-- 3. Checksum on a numeric column (if applicable)
SELECT sum(CAST(event_id AS BIGINT)) FROM events; -- Redshift
SELECT sum(event_id) FROM events; -- ClickHouse

-- 4. Sample Aggregation
SELECT event_type, count(*)
FROM events
WHERE event_timestamp >= '2023-01-01' AND event_timestamp < '2023-02-01'
GROUP BY event_type
ORDER BY

Code Block 10: Validation SQL (Both Sides)

Use the following table for a blueprint on which checklists to include:

Check	Redshift SQL	ClickHouse SQL	Purpose
Row Count	`SELECT COUNT(*) FROM tbl`	`SELECT COUNT(*) FROM tbl`	Verify no data loss.
Column MIN/MAX	`SELECT MIN(col), MAX(col) FROM tbl`	`SELECT min(col), max(col) FROM tbl`	Confirm data range integrity.
NULL Counts	`SELECT COUNT(*) WHERE col IS NULL`	`SELECT count(*) WHERE isNull(col)`	Check for discrepancies in NULL handling.
Distinct Count	`SELECT COUNT(DISTINCT col) FROM tbl`	`SELECT uniq(col) FROM tbl`	Validate cardinality.
Business Query	Your BI query	Equivalent query in CH SQL	Ensure end-user results match.

Table 03: Validation Checklist

6. Near-Real-Time / Streaming Option

For many use cases, a batch migration is just the first step. The goal is often to build a real-time data pipeline. A common pattern for this is to perform a one-time historical backfill from Redshift, then switch over to a streaming source like Kafka or Kinesis for ongoing data.

In this architecture, new and updated rows are streamed through a processing layer that handles tasks like deduplication, efficient batching, and schema management before inserting the data into ClickHouse. This approach avoids the performance hits from running frequent, small inserts directly against the database. Tools like GlassFlow are designed for this exact purpose, acting as a specialized streaming ETL layer between Kafka and ClickHouse. This architecture ensures idempotency and high throughput, which is critical for production systems.

Figure 03: Real-Time Pipeline Diagram

A tool like GlassFlow can be configured with a simple file to manage this stream, ensuring data is correctly processed before it lands in ClickHouse.

{
  "pipeline_id": "redshift-delta-stream-to-ch",
  "source": {
    "type": "kafka",
    "connection_params": {
      "brokers": ["kafka:9092"],
      "protocol": "PLAINTEXT",
      "mechanism": "PLAIN",
      "username": "",
      "password": "",
      "skip_auth": true
    },
    "topics": [{
      "name": "redshift_cdc_events",
      "consumer_group_initial_offset": "earliest",
      "schema": {
        "type": "json",
        "fields": [
          {"name": "event_id", "type": "string"},
          {"name": "timestamp", "type": "string"},
          {"name": "data", "type": "string"}
        ]
      },
      "deduplication": {
        "enabled": true,
        "id_field": "event_id",
        "id_field_type": "string",
        "time_window": "24h"
      }
    }]
  },
  "sink": {
    "type": "clickhouse",
    "host": "clickhouse-server",
    "port": 9000,
    "database": "default",
    "table": "events",
    "username": "default",
    "password": "",
    "secure": false,
    "max_batch_size": 10000,
    "max_delay_time": "1s",
    "table_mapping": [
      {
        "source_id": "redshift_cdc_events",
        "field_name": "event_id",
        "column_name": "event_id",
        "column_type": "String"
      },
      {
        "source_id": "redshift_cdc_events",
        "field_name": "timestamp",
        "column_name": "timestamp",
        "column_type": "DateTime"
      },
      {
        "source_id": "redshift_cdc_events",
        "field_name": "data",
        "column_name": "data",
        "column_type": "String"
      }
    ]
  }
}

Code Block 11: GlassFlow JSON: Sort + Dedupe + CH Sink

7. Optimizing for ClickHouse (Post-Load)

Once your data is in ClickHouse, you now need to optimize. ClickHouse offers a wide array of tools to fine-tune performance.

7.1 Table Design, Projections, and Materialized Views

Your initial table design is the most important factor, but you can add optimizations later. Projections are a powerful feature that pre-calculates and stores aggregations alongside your raw data. They act like a materialized index, dramatically speeding up common GROUP BY queries.

If you need to maintain aggregated data in a separate table, Materialized Views can be used. They are triggered by inserts into a source table and automatically update an aggregate target table. For CDC workloads, consider using specialized table engines like ReplacingMergeTree (which keeps only the latest version of a row based on a key) or SummingMergeTree (which automatically sums columns for rows with the same primary key).

In this example, we create a projection to accelerate queries that group by event_type:

-- Add a projection to the existing events table
ALTER TABLE default.events
ADD PROJECTION p_events_by_type (
    SELECT
        event_type,
        count()
    GROUP BY event_type
);

-- Example of a materialized view for daily user counts
CREATE MATERIALIZED VIEW default.daily_active_users_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type)
AS SELECT
    toDate(event_timestamp) AS event_date,
    event_type,
    uniqState(user_id) AS unique_users_state
FROM default.events
GROUP BY event_date, event_type;

-- How to query the materialized view
SELECT
    event_date,
    event_type,
    uniqMerge(unique_users_state) as unique_users_count
FROM default.daily_active_users_mv
GROUP BY event_date,

Code Block 12: Projections & Materialized Views

7.2 Compression, Storage, and TTL

ClickHouse gives you full control over data compression and lifecycle. The default LZ4 codec is fast, but switching to ZSTD can significantly improve compression ratios at the cost of some CPU. You can set this at a column or table level.

Time-to-Live (TTL) expressions are highly relevant for managing data retention, especially for logs or event data. You can set rules to automatically delete old partitions or move them to a slower, cheaper storage tier (like S3) after a certain period.

7.3 Query Tuning & Quotas

To prevent runaway queries and ensure fair resource allocation, use ClickHouse's settings and quotas system. You can create profiles for different user roles that limit max memory usage, execution time, and threads. For exploratory queries on very large tables, encourage the use of SAMPLE to get fast, approximate results.

Here is how you might define a profile for BI users that prevents them from running overly expensive queries.

-- Create a settings profile for analysts
CREATE SETTINGS PROFILE bi_users_profile
    SETTINGS
        max_memory_usage = '10G',
        max_execution_time = 300, -- 5 minutes
        join_algorithm = 'auto';

-- Create a role and assign the profile to it
CREATE ROLE bi_analyst;
GRANT SELECT ON default.* TO bi_analyst;
ALTER USER some_user DEFAULT ROLE bi_analyst;
ALTER USER

Code Block 13: Query Settings & Quotas

7.4 Monitoring

A production ClickHouse cluster requires monitoring. The system database is full of useful tables for observing the health and performance of your cluster. Key tables to watch include:

system.parts: To monitor the number and size of data parts for each table. Too many small parts can indicate merge processes are falling behind.
system.replication_queue: To check for replication lag between nodes.
system.merges: To observe the background merge activity.
system.query_log: To analyze query performance and identify slow or failing queries.

A simple health check can query these tables to alert on potential issues.

-- Check for replication lag (should be close to 0)
SELECT host, database, table, is_leader, is_readonly, total_replicas, active_replicas
FROM system.replicas WHERE is_readonly OR active_replicas < total_replicas;

-- Check for too many parts in a table (a sign of ingestion problems)
SELECT database, table, count() AS num_parts
FROM system.parts
WHERE active
GROUP BY database, table
HAVING num_parts > 300

Code Block 14: Health Checks

8. Redshift-Specific Gotchas

Migrating from Redshift involves a few specific challenges that are important to plan for:

Redshift Quirk	Problem	ClickHouse Strategy
IDENTITY Columns	No native auto-increment feature in ClickHouse.	Use `generateUUIDv4()` for new primary keys. For existing data, migrate the values as-is.
Late-Binding Views	Dependencies are not validated on creation, can hide broken logic.	Manually inspect view definitions (`pg_views`) and recreate them in ClickHouse with explicit schemas.
`SUPER` Arrays	Arrays within JSON must be unnested.	Use `CROSS JOIN UNNEST` in Redshift's `UNLOAD` query or an equivalent "explode" function in your ETL tool.
Transaction Blocks	Redshift uses standard `BEGIN`/`COMMIT`.	ClickHouse has very limited transaction support. Design ingestion to be atomic and idempotent.
Spectrum Bypass	External tables using Redshift Spectrum query S3 directly.	Investigate the original S3 data source. Often more efficient to ingest directly into ClickHouse.

Table 04: Redshift Quirk to CH Strategy Map

9. Security & Access Control

Replicating Redshift's security model in ClickHouse requires mapping AWS-centric concepts to ClickHouse's internal RBAC system.

Authentication: Where Redshift relies on IAM users and roles, ClickHouse has its own user management system. You can configure it to use LDAP or Kerberos for centralized authentication.
Authorization: Redshift's GRANT/REVOKE commands on schemas and tables map directly to ClickHouse's GRANT/REVOKE system. Create roles in ClickHouse that correspond to your IAM policies to manage permissions at scale.
Encryption: Both systems support encryption at rest and in transit. In ClickHouse, you configure TLS for client connections and can enable disk-level encryption. For sensitive credentials, avoid storing them in SQL scripts; use the server-side configuration or a secrets manager.
Row-Level Security: Redshift uses Virtual Private Database (VPD) policies attached to users. ClickHouse has a similar concept called Row Policies, which allows you to define filter conditions that are automatically applied to SELECT queries for specific roles.

10. Cut-Over Strategy

A successful migration requires a careful cut-over plan to minimize downtime and risk. A phased approach is almost always the best choice.

Backfill and Dual-Write: Perform the initial batch load from Redshift to ClickHouse. At the same time, set up a dual-write process where new data is written to both Redshift and ClickHouse. For streaming workloads, this means your Kafka pipeline will sink to both databases.
Shadowing and Validation: Once the systems are in sync, redirect a portion of your read traffic (or run queries in parallel) to ClickHouse. This "shadowing" phase is critical for comparing query results, performance, and dashboards to ensure correctness without impacting users.
Cut-Over: When you are confident that ClickHouse is correct and performant, you can perform the final cut-over. This typically involves updating application connection strings or changing a DNS record to point to the new ClickHouse cluster. Keep the Redshift cluster running and receiving data for a short period to allow for a quick rollback if needed.
Decommission: After a successful cut-over period (e.g., one week), you can stop writing to Redshift and eventually decommission the cluster.

Figure 04: Timeline: Backfill to Validate to Cutover

11. Data Modeling Considerations: From Redshift to ClickHouse

As we've already discussed, a critical aspect in migrations is the difference in data modeling approaches between Redshift and ClickHouse; Redshift typically follows traditional data warehousing patterns with normalized star or snowflake schemas, and ClickHouse often performs better with denormalized, wide tables. Understanding when and how to adjust your data model is crucial for a successful migration.

Traditional Redshift Modeling vs. ClickHouse Approach

Redshift inherits many conventions from traditional data warehousing:

Star Schema: Fact tables surrounded by dimension tables, optimized for flexibility and storage efficiency.
Snowflake Schema: Further normalized star schema with dimension tables split into sub-dimensions.
Slowly Changing Dimensions (SCD): Type 2 SCDs track historical changes with effective dates.

In ClickHouse, these patterns are not the same, due to its columnar storage and query execution model:

Wide Denormalized Tables: ClickHouse is great with fewer, wider tables that minimize JOINs.
Materialized Views for Aggregations: Pre-computed aggregations replace many dimension lookups.
Dictionaries for Lookups: Efficient key-value lookups replace small dimension tables.

When to Denormalize

Consider denormalization in ClickHouse when:

Your queries frequently JOIN the same tables with stable relationships.
Dimension data changes infrequently (e.g., product categories, geographic regions).
Query latency is critical and must be sub-second.
You can afford increased storage for better query performance.
Your data pipeline can handle the complexity of maintaining denormalized data.

Below is an example transformation from Redshift star schema to ClickHouse denormalized model:

-- Redshift: Traditional star schema with JOINs
SELECT
    d.date,
    p.product_name,
    c.customer_segment,
    SUM(f.sales_amount) as total_sales
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY 1, 2, 3;

-- ClickHouse: Denormalized wide table
SELECT
    sale_date,
    product_name,
    customer_segment,
    sum(sales_amount) as total_sales
FROM sales_denormalized
GROUP BY sale_date, product_name,

When to Keep Normalized

Maintain normalized structures when:

Dimension data changes frequently (e.g., customer profiles, inventory status).
Data consistency is crucial.
Storage costs are a main concern.
You need flexibility for ad-hoc queries across many dimensions.
Update complexity would be too much for your data pipeline.

Hybrid Approach

Often, the optimal solution combines both approaches:

Core Fact Table: Keep the main fact table with high-cardinality IDs
Denormalized Projections: Use ClickHouse Projections to create denormalized views
Dictionaries for Lookups: Replace small, stable dimension tables with ClickHouse Dictionaries
Materialized Views: Pre-aggregate common query patterns

Below is an example hybrid implementation:

-- Main fact table (normalized)
CREATE TABLE sales_fact (
    sale_id UInt64,
    sale_date Date,
    product_id UInt32,
    customer_id UInt64,
    amount Decimal(10,2)
) ENGINE = MergeTree()
ORDER BY (sale_date, product_id);

-- Dictionary for product lookups (replaces dim_product)
CREATE DICTIONARY products_dict (
    product_id UInt32,
    product_name String,
    category String
)
PRIMARY KEY product_id
SOURCE(CLICKHOUSE(TABLE 'dim_products'))
LIFETIME(MIN 3600 MAX 7200)
LAYOUT(HASHED());

-- Materialized view for common aggregations
CREATE MATERIALIZED VIEW sales_by_category_mv
ENGINE = SummingMergeTree()
ORDER BY (sale_date, category)
AS SELECT
    sale_date,
    dictGet('products_dict', 'category', product_id) as category,
    sum(amount) as total_amount
FROM sales_fact
GROUP BY sale_date,

Migration Strategy for Data Models

When migrating from Redshift to ClickHouse:

Start Conservative: Begin with a structure similar to your Redshift schema
Identify Hot Paths: Monitor query patterns to identify frequently accessed JOIN paths
Incrementally Denormalize: Create denormalized tables or projections for high-frequency queries
Measure and Iterate: Compare performance and storage trade-offs
Maintain Flexibility: Keep source data in a flexible format for future remodeling

Remember that ClickHouse's powerful compression often mitigates the storage overhead of denormalization. A denormalized table in ClickHouse might use less storage than a normalized structure in Redshift due to superior compression algorithms.

The key here is to think of your data differently, and not try to bring the same thought process to a new paradigm. Adjust to the new paradigm, suffer less.

12. Final Thoughts

Redshift is a great tool; a capable data warehouse for traditional BI. However, it can struggle with the demands of real-time, high-throughput analytical applications. Its scaling model can lead to unpredictable costs and its architecture is not optimized for the sub-second latencies these workloads require.

ClickHouse on the other hand, offers a powerful alternative, giving engineers precise control over performance, storage, and cost. Its design is explicitly for speed on large-scale analytical queries.

We know first hand how a migration is a significant undertaking that requires careful planning, especially around data modeling. The key is to shift your thinking from Redshift's DISTKEY/SORTKEY model to ClickHouse's PARTITION BY/ORDER BY model. This will make it way easier for you and your team to understand the gains you'll be getting with ClickHouse.

Try it now