Guide to Build an Observability Stack with OpenTelemetry and Kafka

Build a complete observability pipeline from telemetry to insights.

Written by

Armend Avdijaj

Jan 7, 2026

Introduction

In today's world of distributed systems and microservices, understanding what's happening inside your applications has become critical. When a user reports a slow page load, or worse, when your revenue-generating API starts failing at 3 AM, you need answers fast. Traditional monitoring—checking if servers are up or down—isn't enough anymore.

Observability is the practice of instrumenting your systems to collect telemetry data (logs, metrics, and traces) that reveals not just what is broken, but why it's broken. It's the difference between knowing your checkout service is down versus understanding that it's down because a database connection pool is exhausted due to a retry storm caused by a downstream payment API timeout. This level of insight is what separates companies that can resolve incidents in minutes from those that struggle for hours.

However, building a robust observability pipeline is challenging. You need to collect billions of events, process them in real-time, store them efficiently for analysis, and present them in a way that helps teams debug issues quickly. This tutorial shows you how to build exactly that—a production-grade observability stack that handles the complete journey from data collection to visualization, with proper ETL pipeline management, data deduplication, and schema transformation built in.

What You'll Build

Welcome! 👋 In this hands-on tutorial, you'll build a complete observability stack from scratch. By the end, you'll have a running system that collects, processes, and visualizes logs, metrics, and traces—just like the systems powering modern cloud-native applications.

Imagine you're building the observability infrastructure for a growing startup. You need a system that can:

  • Collect telemetry data (logs, metrics, traces) from your applications

  • Process and store millions of events per hour

  • Provide a beautiful UI for your team to search and analyze data

  • Scale horizontally as your traffic grows

Here's what we're building together:

Figure 1. High-level architecture overview

The Stack:

  • 🔭 OpenTelemetry Collector - Industry-standard telemetry collection

  • 📨 Apache Kafka - Reliable message streaming

  • 🔄 Glassflow - ETL pipeline orchestration with built-in deduplication and transformations

  • 🗄️ ClickHouse - Lightning-fast columnar database

  • 📊 HyperDX - Modern observability UI

Prerequisites

Before we start, make sure you have these tools installed:

System Requirements:

  • 💻 Minimum: 4 CPU cores, 8GB RAM

  • 🚀 Recommended: 6 CPU cores, 12GB RAM

  • 💾 Storage: ~15GB free space

💡 Tip: If you're on a Mac or Windows machine, adjust Docker Desktop's resource limits to meet these requirements in Preferences → Resources.

Understanding the Architecture

Let's break down what each component does and why it's needed.

The Complete Data Flow

Figure 2. Sequence Diagram for Data Flow

Component Responsibilities

1. OpenTelemetry Collector (The Gateway)

Think of this as your reception desk. It:

  • Receives telemetry from applications via OTLP protocol (industry standard)

  • Performs initial processing (filtering, sampling, batching)

  • Routes data to appropriate destinations

Why we need it: Decouples your apps from your storage backend. Apps just need to speak OTLP.
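
To make "apps just need to speak OTLP" concrete, here is a minimal sketch of sending a single log record over OTLP/HTTP with curl once the stack from this tutorial is running. It assumes the deployed collector service exposes the standard OTLP/HTTP port (4318); adjust if your collector is configured differently.

# Assumption: the collector service exposes OTLP/HTTP on port 4318
kubectl port-forward -n otel svc/opentelemetry-collector 4318:4318 &
sleep 2   # give the port-forward a moment to establish

# Send one log record using the standard OTLP/JSON encoding
curl -X POST http://localhost:4318/v1/logs \
  -H "Content-Type: application/json" \
  -d '{
    "resourceLogs": [{
      "resource": {"attributes": [{"key": "service.name", "value": {"stringValue": "curl-demo"}}]},
      "scopeLogs": [{
        "logRecords": [{
          "severityText": "INFO",
          "body": {"stringValue": "hello from curl"}
        }]
      }]
    }]
  }'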

2. Apache Kafka (The Message Highway)

A distributed messaging system that:

  • Buffers data between collection and storage

  • Handles traffic spikes without data loss

  • Enables multiple consumers to process the same data

Why we need it: Provides resilience and enables scaling components independently.
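
Once the stack is deployed later in this tutorial, you can peek at the topics the collector publishes to. A hedged sketch, assuming the kafka-controller StatefulSet used in this tutorial and that kafka-topics.sh is on the broker's PATH:

# List topics from inside the Kafka broker pod
kubectl exec -n kafka kafka-controller-0 -- \
  kafka-topics.sh --bootstrap-server localhost:9092 --list
# Expect topics such as otel-logs (and the other telemetry topics) once data is flowing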

3. Glassflow (The Smart Pipeline)

This is where the magic happens! Glassflow:

  • Manages multiple ETL pipelines as declarative JSON configs

  • Deduplicates traces (no more duplicate spans!)

  • Transforms and validates data schemas

  • Handles retries and error recovery automatically

  • Provides backpressure management

Why we need it: Simplifies complex data pipeline operations that would otherwise require custom code.

4. ClickHouse (The Fast Database)

A columnar database optimized for:

  • Time-series data (perfect for telemetry)

  • High-speed inserts (millions of rows/second)

  • Fast analytical queries

Why we need it: Traditional databases can't handle the write throughput and query patterns of observability data.
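
For a taste of the analytical queries ClickHouse handles well, here is a sketch you can run once the stack is up; the table and column names match the ones used later in this tutorial.

# Errors vs. total log volume per service
kubectl exec -n hyperdx deploy/hyperdx-hdx-oss-v2-clickhouse -- clickhouse-client -q "
  SELECT ServiceName,
         countIf(SeverityText = 'ERROR') AS errors,
         count() AS total
  FROM otel_logs
  GROUP BY ServiceName
  ORDER BY errors DESC"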

5. HyperDX (The Beautiful UI)

A web application that:

  • Provides search and filtering

  • Visualizes traces as flame graphs

  • Creates dashboards and alerts

Why we need it: Raw database queries are powerful but not user-friendly.

Setting Up Your Environment

Let's get your local Kubernetes cluster running!

Step 1: Create Your Kubernetes Cluster

We'll use kind (Kubernetes in Docker) to create a local cluster:

kind create cluster --name observability-demo

What's happening:

  • Kind creates a Kubernetes cluster inside Docker containers

  • This takes about 1-2 minutes, depending on the machine you're running it on.

  • You'll see output showing nodes being created

Verify it worked:
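
One way to check, assuming kind's default context naming (kind-<cluster-name>):

kubectl cluster-info --context kind-observability-demo
kubectl get nodes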

You should see output showing your cluster is running at https://127.0.0.1:...

💡 What is kind? It's Kubernetes running in Docker containers. Each "node" is actually a Docker container, making it perfect for local development.

Step 2: Clone the Repository

If you haven't already, get the demo code:

git clone https://github.com/glassflow/clickhouse-etl.git
cd clickhouse-etl

Deploying the Stack

Now for the exciting part: let's deploy everything! To make things even simpler, all you need to do is run one command.

The One-Command Deploy

We've automated the entire deployment process. Run:

make

What happens under the hood:

Figure 3. Deployment Workflow

What Gets Created

Once the deployment is complete, you should see the following output in your terminal.

Now, let's explore the namespaces we just created:

# View all pods across namespaces
make

Running this will show the following output:

Figure 4. Deployment Status Check

All pods should show Running or Completed. If any are Pending or CrashLoopBackOff, see the Troubleshooting section.
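
If you prefer raw kubectl over the make target, these equivalents are handy (they assume your kubeconfig points at the kind cluster):

# View all pods across namespaces
kubectl get pods -A

# List only pods that are neither Running nor Succeeded (ideally returns nothing)
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded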

Understanding the Data Flow

Now that everything is running, let's trace a single log entry through the entire pipeline.

Step-by-Step Journey of a Log Entry

Figure 5. Journey of a Single Log Entry!

The Glassflow Pipeline Configuration

Let's look at how the logs pipeline is configured. The configuration is a simple, declarative JSON file that's easy to read:

{
  "version": "v2",
  "pipeline_id": "logs-pipeline",
  "name": "Logs Pipeline",
  "source": {
    "type": "kafka",
    "topics": [
      {
        "name": "otel-logs",
        "consumer_group_initial_offset": "latest"
      }
    ]
  },
  "sink": {
    "type": "clickhouse",
    "table": "otel_logs",
    "max_batch_size": 1000,
    "max_delay_time": "30s"
  },
  "schema": {
    "fields": [
      {"name": "Timestamp", "type": "string", "column_type": "DateTime64"},
      {"name": "SeverityText", "type": "string", "column_type": "LowCardinality(String)"},
      {"name": "Body", "type": "string", "column_type": "String"}
      // ... more fields
    ]
  }
}

Key features of this configuration:

  • Declarative - No code needed

  • Schema mapping - Automatic type conversion

  • Batching - Efficient writes (up to 1000 records or 30 seconds)

  • Monitoring - Built-in metrics and health checks

Exploring Your Data

Let's actually see the data flowing through the system.

Access the HyperDX UI

Open a new terminal and run:

make

This creates a port-forward to HyperDX. Now open your browser to:

🌐 http://localhost:3000
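
Under the hood, the make target boils down to a port-forward along these lines (the exact HyperDX service name may differ; check with kubectl get svc -n hyperdx):

kubectl port-forward -n hyperdx svc/<hyperdx-service> 3000:3000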

You should be able to see all the telemetry events coming into ClickHouse on the UI as shown below.

Figure 6. Checking your Logs on the UI!

Neat, right?

Access the Glassflow UI

Now, to see how Glassflow is doing with the pipelines, run the following command in another terminal window:

make

Then, open your browser to:

🌐 http://localhost:8080

Here, navigate to the Pipelines section to see the four pipelines created for this tutorial.

Figure 7. Checking the Active GlassFlow Pipelines

Let's take a closer look at the Metrics Sum Pipeline by clicking on it. The detail view shows the number of entries in the destination table, details about the source topic, the schema mapping, and more.

Figure 8. Pipeline Details at a Glance

Query ClickHouse Directly

Want to see the raw data? Let's query ClickHouse:

# Connect to ClickHouse
kubectl exec -it -n hyperdx deploy/hyperdx-hdx-oss-v2-clickhouse -- clickhouse-client

-- Count logs
SELECT count() FROM otel_logs;

-- View recent logs
SELECT
    Timestamp,
    SeverityText,
    ServiceName,
    Body
FROM otel_logs
ORDER BY Timestamp DESC
LIMIT 10;

-- Count traces
SELECT count() FROM otel_traces;

-- View trace spans
SELECT
    Timestamp,
    TraceId,
    SpanName,
    Duration,
    ServiceName
FROM otel_traces
ORDER BY Timestamp DESC
LIMIT 10;

What you're seeing:

  • Thousands of events being written every minute

  • Real-time data from the telemetry generators

  • Properly structured and indexed data
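
To put a number on "thousands of events every minute", you can bucket recent log volume by minute (same connection as above, here using the non-interactive -q flag):

kubectl exec -n hyperdx deploy/hyperdx-hdx-oss-v2-clickhouse -- clickhouse-client -q "
  SELECT toStartOfMinute(Timestamp) AS minute, count() AS events
  FROM otel_logs
  WHERE Timestamp > now() - INTERVAL 10 MINUTE
  GROUP BY minute
  ORDER BY minute"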

How Glassflow Enhances the Pipeline

You might be wondering: "Why not just write from Kafka directly to ClickHouse?"

Great question! Let's explore what Glassflow provides.

1. Deduplication (Critical for Traces!)

Distributed tracing can generate duplicate spans. Glassflow handles this automatically:

{
  "deduplication": {
    "enabled": true,
    "id_field": "TraceId",
    "time_window": "10m"
  }
}

How it works:

Figure 9. How Deduplication Works with GlassFlow?

Without this, you'd see duplicate traces in your UI! 🐛
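
Given the id_field and time window above, a quick spot-check is to look for TraceIds that still map to more than one row inside the deduplication window; ideally this returns nothing. A sketch using the table and columns from earlier in this tutorial:

kubectl exec -n hyperdx deploy/hyperdx-hdx-oss-v2-clickhouse -- clickhouse-client -q "
  SELECT TraceId, count() AS rows_per_trace
  FROM otel_traces
  WHERE Timestamp > now() - INTERVAL 10 MINUTE
  GROUP BY TraceId
  HAVING rows_per_trace > 1
  LIMIT 10"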

2. Schema Management with Ease

As your application evolves, field types in your Kafka topics can drift from the column types in your ClickHouse tables. Glassflow handles this gracefully with a simple configuration, and also lets you change the mapping through the UI:

{
  "name": "StatusCode",
  "type": "string",        // Source type
  "column_type": "UInt16"  // ClickHouse type
}

It looks something like this:

Figure 10. GlassFlow Schema Management (Configurable)
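
To double-check the mapping on the database side, you can describe the destination table and compare the column types with your pipeline schema (same ClickHouse pod as before):

kubectl exec -n hyperdx deploy/hyperdx-hdx-oss-v2-clickhouse -- clickhouse-client -q "DESCRIBE TABLE otel_logs"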

3. Batching & Backpressure

Glassflow optimizes writes to ClickHouse:

{
  "max_batch_size": 1000,
  "max_delay_time": "30s"
}

The batching strategy:

Figure 11. Efficient Writes to ClickHouse with Batching

This reduces database load while maintaining low latency.
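
One way to see batching at work is to look at how many parts ClickHouse has created for the telemetry tables; batched inserts produce fewer, larger parts. A sketch using ClickHouse's built-in system.parts table:

kubectl exec -n hyperdx deploy/hyperdx-hdx-oss-v2-clickhouse -- clickhouse-client -q "
  SELECT table, count() AS active_parts, sum(rows) AS total_rows
  FROM system.parts
  WHERE active AND table IN ('otel_logs', 'otel_traces')
  GROUP BY table"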

4. Error Handling & Retries

Network issues? Schema errors? Glassflow has you covered:

  • Automatic retries with exponential backoff

  • Dead letter queue for problematic records

  • Detailed error logging

5. Monitoring & Observability

Glassflow exposes metrics about itself that give you a clear picture of how the pipelines are actually performing. It tracks metrics such as:

  • Events processed per second

  • Consumer lag

  • Error rates

  • Pipeline health status

To access these metrics, scrape Glassflow’s Prometheus-compatible /metrics endpoint: {release-name}-otel-collector.{namespace}.svc.cluster.local:9090/metrics

These metrics can be visualized by connecting the endpoint to Prometheus and a dashboarding tool such as Grafana.
For more details, refer to the official documentation.
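
For a quick look without setting up Prometheus, you can curl the endpoint from a throwaway pod inside the cluster (substitute your actual release name and namespace for the placeholders):

kubectl run metrics-check --rm -i --restart=Never --image=curlimages/curl -- \
  -s http://{release-name}-otel-collector.{namespace}.svc.cluster.local:9090/metrics | head -n 20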

Troubleshooting

Running into issues? Here are some common problems you might encounter, along with quick fixes to get you back up and running.

1. Problem: Pods Stuck in Pending

Symptom: kubectl get pods -A shows pods in Pending state

Diagnosis:

kubectl describe pod <pod-name> -n <namespace>

Common causes:

  • Insufficient resources - Increase Docker Desktop memory to 12GB

  • Storage issues - Check kubectl get pv for persistent volumes

Solution:

# Delete and recreate cluster with more resources
kind delete cluster --name observability-demo
kind create cluster --name observability-demo

2. Problem: HyperDX Shows No Data

Symptom: UI loads but shows empty dashboards

Diagnosis:

# Check if pipelines are running
kubectl get pods -n glassflow

# Check ClickHouse has data
kubectl exec -n hyperdx deploy/hyperdx-hdx-oss-v2-clickhouse -- \
  clickhouse-client -q "SELECT count() FROM otel_logs"

Common causes:

  • Pipelines not deployed correctly

  • Telemetry generators not running

Solution:

# Redeploy pipelines
make deploy-pipelines

# Restart telemetry generators
kubectl delete -f k8s/telemetry/
kubectl apply -f k8s/telemetry/

3. Problem: Out of Memory Errors

Symptom: Pods getting OOMKilled

Diagnosis:

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Last State"

Solution:
Reduce telemetry generation rate. Edit the telemetry files:

# Edit each telemetry generator
kubectl edit job telemetrygen-logs -n otel

# Change: --rate=50
# To:     --rate=10

Next Steps

Congratulations! 🎉 You've built a production-ready observability pipeline! Now, let's explore how you can take this even further and integrate it into your existing workflow seamlessly.

Experiment Further

  1. Add Custom Transformations

    • Edit pipeline JSON files in glassflow-pipelines/

    • Add field filtering or enrichment

    • Redeploy with make deploy-pipelines

  2. Scale the System

    • Update the replicas value in your Glassflow pipeline configuration
      (see the pipeline JSON reference)

    • Ensure the Kafka topic consumed by the pipeline has at least as many partitions as replicas (a hedged partition example follows the scaling command below)

    • Each pipeline ingestor replica reads from one or more partitions, allowing horizontal scaling through Kafka partitioning

    • For example: 3 pipeline replicas → the Kafka topic should have ≥ 3 partitions

# Scale Kafka
kubectl scale statefulset kafka-controller -n kafka --replicas=3
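
To match partitions to replicas, here is a hedged sketch of adding partitions to one of the telemetry topics (pod name and tool availability depend on how Kafka was installed):

# Give the otel-logs topic 3 partitions to match 3 pipeline replicas
kubectl exec -n kafka kafka-controller-0 -- \
  kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic otel-logs --partitions 3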

  3. Monitor Resource Usage

    kubectl top nodes
    kubectl top pods -A
  4. Connect Your Real Application

    • Use the OpenTelemetry SDK in your app

    • Point to: opentelemetry-collector.otel.svc.cluster.local:4317

    • Remove telemetry generators
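
As a minimal sketch, the standard OpenTelemetry SDK environment variables are usually enough to point an instrumented app at the collector (the service name below is just an example):

export OTEL_EXPORTER_OTLP_ENDPOINT="http://opentelemetry-collector.otel.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="checkout-service"   # hypothetical service name
# ...then start your application with your language's OpenTelemetry SDK or auto-instrumentation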


Clean Up

When you're done experimenting:

# Delete the entire stack
make delete-stack

# Delete the kind cluster
kind delete cluster --name observability-demo

Summary

You've learned how to:

  • ✅ Deploy a complete observability stack on Kubernetes

  • ✅ Understand OpenTelemetry data collection

  • ✅ Use Kafka for reliable message streaming

  • ✅ Configure Glassflow ETL pipelines declaratively

  • ✅ Store telemetry in ClickHouse for fast queries

  • ✅ Visualize data in a modern UI

The Key Takeaway

Glassflow eliminates the complexity of building custom ETL code for your observability pipeline. Instead of managing Kafka consumers, handling retries, and implementing deduplication logic, you define pipelines as simple JSON configurations.

This lets you focus on understanding your systems, not maintaining data pipelines.

Happy observing! 🔭
