
How to Handle Semi-Structured Data in Your Log Pipelines

Turn messy logs into analytics-ready data with GlassFlow & ClickHouse.

Written by Armend Avdijaj, 14/08/2025

In the world of modern distributed systems, logs are the lifeblood of observability. But this lifeblood is often messy. It arrives as a torrent of semi-structured data—a mix of predictable fields and chaotic JSON blobs. For any developer or DevOps engineer tasked with building a resilient log pipeline, the core challenge is clear: how do you tame this flexible, unpredictable stream into a structured, queryable asset without creating brittle, unmaintainable systems? This article provides a modern playbook, moving beyond traditional ETL to show how a 'schema-on-transform' approach using GlassFlow and ClickHouse can turn log chaos into analytical clarity.

Understanding Semi-Structured Data in Logs

Semi-structured data occupies a middle ground between the rigid, tabular world of structured data found in relational databases and the complete formlessness of unstructured data like images, audio, or video files. While it does not conform to a strict, predefined schema, semi-structured data contains organizational markers, such as tags, metadata, or a discernible hierarchy, that make it inherently searchable and easier to process than purely unstructured content.

Common formats like JSON (JavaScript Object Notation), XML (eXtensible Markup Language), and YAML are examples of this data type. They are self-describing, using key-value pairs and nested structures to create a flexible yet organized format. The key characteristic is the flexible schema: a schema is present, but it is not strictly enforced, which allows for variations from one record to the next.

Application logs are a perfect example of semi-structured data. A typical log entry, such as one from an Apache web server, illustrates this duality perfectly. Consider the following example:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

This single line contains highly structured elements: an IP address (127.0.0.1), a timestamp ([10/Oct/2000:13:55:36 -0700]), an HTTP status code (200), and a response size (2326). However, it also contains free-form text like the user identifier (frank) and the request string ("GET /apache_pb.gif HTTP/1.0"). In modern applications, this free-form portion is often replaced with a complex, nested JSON object containing rich contextual information. It is this combination of predictable structure and unpredictable content that makes logs both incredibly valuable for analysis and notoriously difficult to handle in a pipeline.
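The structured portion of such a line can be pulled out mechanically with a regular expression. The following sketch (field names are my own, for illustration) parses the Common Log Format entry above into a dictionary:

```python
import re

# Common Log Format: ip - user [timestamp] "request" status size
CLF_RE = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_clf(line):
    """Split an Apache Common Log Format line into its structured fields.
    A sketch for illustration; real access logs come in variants (combined
    format, missing fields) that need extra handling."""
    match = CLF_RE.match(line)
    return match.groupdict() if match else None
```

Note how the regex captures the predictable fields cleanly, while the free-form request string is simply kept as one opaque value, which is exactly the duality described above.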

Core challenges

The flexibility that makes semi-structured logs so useful for developers is what makes them so tricky for downstream data pipelines. Any robust log management strategy must confront four critical challenges. These are not minor inconveniences; they are systemic risks that can lead to data corruption, pipeline failures, inaccurate analytics, and an erosion of trust in the data platform.

Challenge 1: Schema Drift

Modern software development is agile and iterative. Applications evolve, and as they do, their log formats inevitably change. A developer might add a new field to a JSON log to capture more context, rename an existing field for clarity, or change a data type to accommodate new information. This phenomenon, known as "schema drift" or "schema evolution," is a constant threat to data pipelines. A pipeline built to parse a specific log structure will break the moment that structure changes. Downstream consumers, from dashboards to machine learning models, that expect a fixed schema will either fail or, worse, start processing incomplete data. In today's fast-paced development cycles, schema drift is not an edge case; it is an operational certainty.

Challenge 2: Polymorphic Data

Polymorphism is a more subtle variant of schema drift. It occurs when the same field can contain different data types across different log entries. A powerful example is an address attribute in a log. Initially, it might be a simple string, like "123 Main St." After a feature update, the same address attribute might evolve to become a complex object or an array of objects to store multiple addresses with structured fields like street, city, and zip_code. JSON, being schema-less, places no restrictions on such changes. This creates a nightmare for any data store or processing engine that relies on strongly-typed columns. If the pipeline is not explicitly designed to detect and handle this polymorphism, it can lead to massive, silent data loss as the system attempts to fit an object into a string column or vice-versa.

Challenge 3: Nested Data

Modern application logs, especially those from microservices architectures, are often rich with hierarchical data. They frequently contain deeply nested JSON objects and arrays to capture the full context of an event. While this is great for debugging, it presents a significant challenge for analytics. Most analytical databases perform best with flat, tabular data. The process of "flattening" these hierarchical structures into rows and columns is fraught with complexity. Engineers must write convoluted queries or scripts to unpack nested elements, correctly reference paths through multiple levels of objects and arrays, and find ways to preserve the relationships between parent and child elements (akin to foreign keys in a relational model). This flattening logic is not only difficult to write and test but can also be computationally expensive, creating performance bottlenecks in the pipeline.

Challenge 4: The Silent-Failure Problem

Perhaps the most dangerous challenge is the way many data systems handle errors during flattening. When a query or script attempts to access a path in a JSON object that does not exist for a particular record (due to schema drift or the field being optional), many systems do not throw an error or warning. Instead, they silently return a NULL value. This behavior is catastrophic. It means your data pipeline can appear perfectly healthy, with no failed jobs or alerts, while it is actively corrupting your data. An aggregation like COUNT(field) will produce a lower number, and AVG(field) will be skewed, but you will have no indication that the underlying data is incomplete. This silent data loss can go undetected for weeks or months, leading to flawed business insights, inaccurate reports, and a fundamental breakdown of trust in the data.
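The failure mode is easy to reproduce in a few lines of Python: a missing field quietly yields None instead of raising an error, and the aggregate silently shifts.

```python
records = [
    {"event": "purchase", "amount": 30.0},
    {"event": "purchase", "amount": 90.0},
    {"event": "purchase"},  # schema drift: 'amount' is missing here
]

# dict.get silently yields None for the missing field -- no error, no warning.
amounts = [r.get("amount") for r in records]
present = [a for a in amounts if a is not None]

# The average quietly ignores the incomplete record, skewing the metric.
average = sum(present) / len(present)  # 60.0, computed over only 2 of 3 records
```

Nothing in this snippet fails, yet the average is computed over an incomplete sample, which is precisely why silent NULLs are so dangerous at pipeline scale.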

The root of these challenges lies in a fundamental tension. The schema flexibility that developers value at the point of data creation, allowing them to quickly and easily log rich, contextual information, becomes a significant liability at the point of data consumption, where reliability, structure, and performance are paramount. When a developer changes a log format to solve an immediate application-level problem, they may unknowingly break the downstream data pipeline. The silent failure problem ensures this breakage goes unnoticed, leading to corrupted metrics and, eventually, a loss of confidence from business stakeholders who can no longer trust the data. A successful architecture must resolve this organizational and technical tension, not just patch over the symptoms.

When to Impose Structure

At the heart of any data pipeline design lies a fundamental architectural decision: when to impose structure on the data. Historically, this has been a binary choice between two competing philosophies: schema-on-write and schema-on-read. Understanding these two classical paradigms is essential to appreciating the more advanced, hybrid approach that modern tools enable.

The difference between these two models boils down to a simple question: when is the data checked against its schema?

Schema-on-Write (The Traditionalist)

In the schema-on-write model, a strict, predefined schema is defined before any data is loaded. Every piece of incoming data is validated against this schema, transformed, and cleaned to ensure it conforms to the required structure. Only then is it written to the target database or data warehouse. This is the long-standing approach of traditional relational database management systems (RDBMS).

  • Pros: This approach yields exceptionally high query performance because the data is already clean, structured, and optimized for analysis. Data integrity is also very high, as validation rules are enforced upfront, preventing "garbage" data from entering the system.
  • Cons: The primary drawback is its inflexibility. If a log arrives with a new, unexpected format, it will be rejected by the pipeline, leading to data loss or requiring manual intervention. This model demands significant upfront planning and is ill-suited for the rapidly evolving data formats common in modern applications.

Schema-on-Read (The Modernist)

The schema-on-read model takes the opposite approach. Raw, messy, semi-structured data is loaded into a storage system, typically a data lake, "as-is" without any upfront transformation or validation. The schema is defined and applied only at the moment the data is queried or read for analysis. This paradigm is favored by many big data and analytics platforms that need to ingest massive volumes of diverse data quickly.

  • Pros: This model is extremely flexible and agile. It can easily accommodate new and evolving data sources without requiring changes to the ingestion pipeline. Ingestion is also very fast, as the system simply dumps the raw data into storage.
  • Cons: The flexibility comes at a cost to query performance. Every query must bear the computational burden of parsing, cleaning, and structuring the data on the fly, which can be slow and resource-intensive. This approach is also prone to "garbage in, garbage out" issues, as there is no upfront validation to ensure data quality. It effectively shifts the burden of data preparation from the data engineer to the data analyst or scientist who is consuming the data.

The following table summarizes the critical trade-offs between these two foundational architectures.

| Feature | Schema-on-Write | Schema-on-Read |
| --- | --- | --- |
| Core Principle | Structure data before loading | Load data raw, structure it on query |
| Flexibility | Low / Rigid | High / Flexible |
| Ingestion Speed | Slower (due to validation/transform) | Faster (loads data as-is) |
| Query Performance | High (data is pre-structured) | Lower (parsing happens at query time) |
| Data Integrity | High (enforced at write time) | Lower (risk of inconsistent data) |
| Typical Use Case | Relational Databases, Data Warehouses | Data Lakes, Big Data Analytics |
| Who Bears the Burden? | The Data Producer / ETL Developer | The Data Consumer / Analyst |

A Better Way: Schema-on-Transform

The binary choice between schema-on-write and schema-on-read forces a difficult compromise between ingestion flexibility and query performance. However, a modern architecture using a dedicated stream processor like GlassFlow in front of an analytical database like ClickHouse enables a third, more advanced pattern: Schema-on-Transform. This hybrid approach decouples the data producer from the data consumer by placing the responsibility for creating structure onto a dedicated, intermediate transformation layer.

This model elegantly resolves the tension between development agility and analytical reliability.

  1. Like a schema-on-read system, the pipeline can ingest raw, semi-structured logs from the source (e.g., Kafka) without imposing rigid constraints on the developers who produce them. This preserves the flexibility that development teams need.
  2. GlassFlow then acts as this intermediate "transform" layer. It intercepts the data stream in-flight and applies a schema, performs complex transformations, enriches the data with external context, and validates its structure. This step is where the chaos of semi-structured data is tamed.
  3. The data that is ultimately loaded into the destination, ClickHouse, is clean, structured, and analytics-ready. This provides the data consumer with all the benefits of a schema-on-write system: high data integrity and blazing-fast query performance.

This pattern is not schema-on-read, because the structure is applied before the data is queried by the end-user. It is not traditional schema-on-write, because the original data source is not constrained by a rigid schema. By creating this intermediate processing stage, the schema-on-transform pattern offers the best of both worlds: flexible ingestion and performant, reliable analytics. It provides a powerful mental model for designing resilient, modern log pipelines.

The Modern Toolkit: GlassFlow + ClickHouse

Building a pipeline based on the schema-on-transform pattern requires a specific set of tools: a powerful stream processor for the transformation layer and a high-performance database for the analytical layer. GlassFlow and ClickHouse are excellent choices for these respective roles, forming a potent combination for real-time log analytics.

GlassFlow: The Python-Native Streaming ETL

GlassFlow is a real-time, serverless stream processing platform engineered to simplify the creation and management of event-driven data pipelines. Its core purpose is to enable developers to ingest, transform, and observe data streams at scale. A primary use case is processing data from message queues like Apache Kafka and loading it into analytical databases such as ClickHouse.

What sets GlassFlow apart is its Python-native approach. Instead of requiring developers to learn a complex, specialized Domain-Specific Language (DSL) or a new ecosystem like Java/Scala, GlassFlow empowers them to write transformation logic using standard Python. This is a significant advantage, as it allows teams to leverage their existing programming skills and the vast, mature ecosystem of Python libraries for data manipulation, machine learning, and API integration directly within their data pipeline.

ClickHouse: The High-Performance Analytics Engine

ClickHouse is an open-source, high-performance SQL database management system (DBMS) designed specifically for Online Analytical Processing (OLAP). Its remarkable speed is primarily due to its column-oriented storage architecture.

In a traditional row-oriented database, all the values for a single row are stored together on disk. In a column-oriented (or columnar) database like ClickHouse, all the values for a single column are stored together. This has two profound benefits for analytics. First, when a query only needs to access a few columns from a table with hundreds of columns (a very common pattern in log analysis), the database only needs to read the data for those specific columns, drastically reducing disk I/O. Second, storing similar data types together allows for extremely high compression ratios, further reducing storage costs and I/O requirements.

Transforming Streams with GlassFlow

In the schema-on-transform model, the stream processor is the critical juncture where raw, unpredictable data is refined into a clean, structured asset. GlassFlow provides a suite of powerful features designed to handle the most common and difficult challenges of log data processing upstream, before the data ever reaches the analytical database.

Solving Duplicates: Upstream Deduplication

Distributed systems, particularly those using messaging queues like Kafka, often operate with an "at-least-once" delivery guarantee. This ensures that messages are never lost, but it comes at the cost of potentially delivering the same message more than once, resulting in duplicate records in the log stream.

While ClickHouse offers mechanisms like the ReplacingMergeTree table engine to handle duplicates, this process occurs during asynchronous background merges. This means that for a period of time, duplicates can exist in the table, leading to temporary inconsistencies in query results. Forcing deduplication at query time with the FINAL keyword guarantees clean results but can significantly slow down queries on large datasets.

GlassFlow solves this problem more elegantly by shifting deduplication upstream. It provides a streaming deduplication feature that processes the Kafka stream in real time. Users can define a deduplication key (e.g., a unique event_id in the log) and a time window (up to 7 days). GlassFlow then keeps track of the keys it has seen within that window and ensures that only the first event with a given key is forwarded to ClickHouse. All subsequent duplicates are simply discarded, guaranteeing that only clean, unique data is ingested and reducing the computational load on ClickHouse. To learn more, see our previous articles on deduplication.
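Conceptually, windowed deduplication boils down to remembering which keys have been seen recently. The following Python sketch illustrates the idea only; GlassFlow's actual implementation persists its state and survives restarts, which this toy version does not.

```python
import time

class StreamDeduplicator:
    """Minimal sketch of windowed, key-based deduplication."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.seen = {}  # dedup key -> first-seen timestamp

    def accept(self, key, now=None):
        """Return True for the first event with `key` inside the window,
        False for any duplicate arriving while the key is still remembered."""
        now = time.time() if now is None else now
        # Evict expired keys so memory stays bounded.
        self.seen = {k: t for k, t in self.seen.items() if now - t <= self.window}
        if key in self.seen:
            return False
        self.seen[key] = now
        return True
```

In a pipeline, only events for which accept() returns True would be forwarded to the sink; everything else is dropped before it can reach ClickHouse.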

Adding Context: Real-Time Enrichment with Temporal Joins

Raw logs often lack the full context needed for meaningful analysis. A log entry might contain a user_id, but an analyst will want to see the user's name, subscription plan, or geographic region. To add this information, the log stream must be enriched by joining it with other data sources, such as a users table or stream.

Performing these JOIN operations in a high-performance OLAP database like ClickHouse can be challenging. Joining large tables in real time can be slow and consume substantial memory and CPU resources, degrading overall query performance.

GlassFlow offloads this work from the database by providing Temporal Stream Joins. This feature allows for the joining of two real-time Kafka streams on the fly. For example, a stream of order logs can be joined with a stream of user profile updates. The join is configured with a common key (e.g., user_id) and a time window to handle events that may arrive out of order. GlassFlow performs this enrichment within the pipeline, so the data that lands in ClickHouse is already a complete, contextualized record. This keeps the ClickHouse schema simpler and the analytical queries fast and efficient, as they no longer need to perform complex joins.
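The mechanics of a temporal join can be sketched in plain Python: buffer one stream by key, evict entries older than the window, and emit a merged record whenever the other stream matches. This in-memory, batch-sorted toy version is for intuition only and is not how GlassFlow implements the feature.

```python
from collections import deque

def temporal_join(left_events, right_events, key, window_seconds):
    """Pair events from two streams that share `key` and occurred within
    `window_seconds` of each other. Events are dicts with the join key
    and a numeric 'ts' (epoch seconds)."""
    buffer = {}   # join key -> deque of right-side events
    joined = []
    tagged = [(e["ts"], "L", e) for e in left_events] + \
             [(e["ts"], "R", e) for e in right_events]
    # Sort by time; on ties, process right-side events first so they are buffered.
    for ts, side, event in sorted(tagged, key=lambda t: (t[0], t[1] == "L")):
        k = event[key]
        if side == "R":
            buffer.setdefault(k, deque()).append(event)
        else:
            q = buffer.get(k, deque())
            # Drop right-side events that fell out of the window.
            while q and ts - q[0]["ts"] > window_seconds:
                q.popleft()
            for match in q:
                joined.append({**match, **event})
    return joined
```

The essential ideas, keying, buffering, and window-based eviction, are the same ones the time-window setting in a temporal stream join controls.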

Custom Logic: The Power of Python Transformations

While deduplication and joins are powerful built-in features, the true flexibility of GlassFlow lies in its use of Python for custom transformations. This is not merely a matter of convenience; it is a strategic capability that elevates the platform from a simple ETL tool to a highly adaptable data processing engine. It allows developers to inject any logic that can be expressed in Python directly into the data stream, addressing the "long tail" of complex, business-specific requirements that are often difficult or impossible to handle with declarative-only tools.

This programmatic control is the ultimate solution to the challenges of semi-structured data. If a log format is non-standard, a developer can write a custom Python parser. If a log contains sensitive Personally Identifiable Information (PII), they can use a library to redact it. If enrichment requires data from an internal microservice, they can make a REST API call. This unlocks nearly unlimited potential for data manipulation.

The core of this feature is the handler function, which every transformation must implement. This function receives the incoming data record, processes it, and returns the transformed record.

Here are several practical examples of what can be achieved with Python transformations:

Parsing Non-Standard Formats

Imagine a legacy application that produces logs in a custom, non-JSON format, like key-value pairs separated by semicolons. A simple Python function can parse this into a structured dictionary.
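A sketch of such a transformer, written as a handler function. The exact signature GlassFlow expects may differ slightly from this illustration, so treat it as a template rather than a drop-in.

```python
def handler(data, log=None):
    """Parse a legacy 'key=value;key=value' log line into a structured dict.
    Pairs without an '=' are skipped rather than failing the whole record."""
    line = data.decode("utf-8") if isinstance(data, bytes) else str(data)
    record = {}
    for pair in line.strip().split(";"):
        if "=" not in pair:
            continue
        key, _, value = pair.partition("=")
        record[key.strip()] = value.strip()
    return record
```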

Data Enrichment via API Call

Expanding on the concept of enrichment, a Python transformer can call any external or internal API to add context. This example enriches a log containing an IP address with geolocation data.
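A sketch of an IP-geolocation enricher. The geo.internal endpoint is a hypothetical placeholder for whatever provider you use; the lookup function is injectable so it can be swapped out or stubbed in tests, and failures are swallowed so a flaky lookup never breaks the pipeline.

```python
import json
import urllib.request

def _http_geo_lookup(ip):
    # Hypothetical internal geo service; replace with your real provider.
    url = f"http://geo.internal/lookup/{ip}"
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp)

def enrich_with_geo(record, geo_lookup=None):
    """Add geo_country/geo_city to a log record that carries an 'ip' field."""
    ip = record.get("ip")
    if not ip:
        return record
    lookup = geo_lookup or _http_geo_lookup
    try:
        geo = lookup(ip)
        record["geo_country"] = geo.get("country")
        record["geo_city"] = geo.get("city")
    except Exception:
        # Enrichment is best-effort: never fail the record on a lookup error.
        record["geo_country"] = None
        record["geo_city"] = None
    return record
```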

Personally Identifiable Information Redaction

Data privacy and compliance are paramount. A Python transformer can be used to inspect log data and redact or hash sensitive information before it is stored, ensuring compliance with regulations like GDPR or CCPA.
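A sketch using only the standard library: regular expressions redact email addresses and card-like digit runs from the message text, and a one-way hash pseudonymizes an (assumed) user_email field. The field names and patterns are illustrative; adapt them to your own logs and compliance requirements.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(record):
    """Return a copy of the record with PII removed or pseudonymized."""
    out = dict(record)
    if "message" in out:
        msg = EMAIL_RE.sub("[EMAIL REDACTED]", str(out["message"]))
        msg = CARD_RE.sub("[CARD REDACTED]", msg)
        out["message"] = msg
    if "user_email" in out:
        # Hashing keeps the value joinable across records without storing raw PII.
        out["user_email"] = hashlib.sha256(
            out["user_email"].encode("utf-8")
        ).hexdigest()
    return out
```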

These examples demonstrate how the Python-native approach provides the ultimate escape hatch, giving developers the power to solve any data transformation challenge with the tools and libraries they already know.

A Step-by-Step Guide: Building Your Resilient Log Pipeline

This section synthesizes the concepts of schema-on-transform, upstream processing, and hybrid schemas into a practical, step-by-step guide for building a resilient log pipeline using GlassFlow and ClickHouse.

Prerequisites and Setup

To follow this guide, a local development environment with Docker is required. The necessary components are:

  • A running Apache Kafka instance.
  • A running ClickHouse instance.
  • A local installation of the GlassFlow ETL tool or a GlassFlow Cloud account.

The open-source glassflow/clickhouse-etl repository provides a docker-compose.yml file that simplifies this setup, allowing you to launch local instances of Kafka, ClickHouse, and the GlassFlow UI and backend with a single command.

  1. Clone the repository: git clone https://github.com/glassflow/clickhouse-etl.git
  2. Change into the project directory: cd clickhouse-etl
  3. Start the services: docker compose up -d

This will make the ClickHouse native port available at localhost:9000, Kafka at localhost:9092, and the GlassFlow web interface at http://localhost:8080.

Step 1: Define the ClickHouse Schema

First, connect to your local ClickHouse instance and create the target table. This table will use the hybrid schema design pattern, with dedicated columns for hot-path fields and a JSON column for all other dynamic attributes.
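One possible definition is sketched below. The column names are illustrative and must line up with the fields your transformer emits; note also that on older ClickHouse releases the JSON column type is experimental and may need to be enabled before this statement will run.

```sql
CREATE TABLE logs
(
    event_id     String,
    timestamp    DateTime64(3),
    service_name LowCardinality(String),
    level        LowCardinality(String),
    message      String,
    attributes   JSON
)
ENGINE = MergeTree
ORDER BY (service_name, level, timestamp);
```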

This schema is optimized for common log queries. Ordering by service_name, level, and timestamp will make filtering on these dimensions extremely fast. The event_id will be used for upstream deduplication.

Step 2: Configure the GlassFlow Pipeline

Navigate to the GlassFlow UI (http://localhost:8080) and create a new pipeline. The configuration involves setting up a source, a series of transformations, and a sink.

Source Configuration

  • Select Kafka as the data source.
  • Provide the connection details for your local Kafka instance (e.g., Broker: kafka:9092).
  • Specify the input topic name, for example, raw_logs.

Transformation Configuration

This is where the raw data is refined. Multiple transformation steps can be chained together.

  1. Add a Deduplication Transformer:
    • Select the Deduplication processor.
    • Set the Deduplication Key to event_id. This tells GlassFlow to discard any messages that have the same event_id as one seen previously.
    • Set a Time Window, for example, 1 day, to define how long GlassFlow should remember seen keys.
  2. Add a Python Transformer:
    • Select the Python processor. This is where the core parsing and structuring logic will reside.
    • Paste the following Python code into the editor. This function assumes the incoming log from Kafka is a JSON string containing a mix of known and unknown fields. It will parse this string, extract the "hot-path" fields into the top level, and bundle the rest into the attributes field.
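A minimal version of that transformer is sketched below. The handler signature and the list of hot-path fields are assumptions for this guide and must match the table schema from Step 1; everything the parser does not recognize ends up in attributes.

```python
import json

HOT_PATH_FIELDS = ("event_id", "timestamp", "service_name", "level", "message")

def handler(data, log=None):
    """Parse a raw JSON log string, promote hot-path fields to the top level,
    and bundle every remaining (dynamic) field into 'attributes'."""
    record = json.loads(data) if isinstance(data, (str, bytes)) else dict(data)
    out = {field: record.pop(field, None) for field in HOT_PATH_FIELDS}
    out["attributes"] = record  # whatever is left is dynamic
    return out
```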

Sink Configuration

  • Select ClickHouse as the data sink.
  • Provide the connection details for your local ClickHouse instance (e.g., Host: clickhouse, Port: 9000, Database: default).
  • Specify the target table: logs.
  • Map the output fields from the Python transformer to the corresponding columns in the logs table. The UI will allow you to draw connections from timestamp to timestamp, level to level, attributes to attributes, and so on.
  • Deploy the pipeline.

Step 3: Generate and Process Data

With the pipeline running, the final step is to produce some sample data. The following Python script uses the kafka-python library to send semi-structured log messages to the raw_logs Kafka topic. For more advanced data generation, a tool like glassgen could be used.
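A possible producer script is sketched below. It deliberately sends one event_id twice so the deduplication step has something to discard; the Kafka-specific part runs only when the script is executed directly.

```python
import json
import random
import time
import uuid

def make_log(duplicate_of=None):
    """Build a semi-structured log record; pass an event_id to simulate a duplicate."""
    return {
        "event_id": duplicate_of or str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "service_name": random.choice(["api", "auth", "billing"]),
        "level": random.choice(["INFO", "WARN", "ERROR"]),
        "message": "request handled",
        # Dynamic fields that should land in the `attributes` column.
        "user_id": f"u{random.randint(1, 100)}",
        "request_id": str(uuid.uuid4()),
        "duration_ms": random.randint(1, 500),
    }

if __name__ == "__main__":
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for _ in range(50):
        producer.send("raw_logs", make_log())
    # Send the same event twice to exercise deduplication.
    dup_id = str(uuid.uuid4())
    producer.send("raw_logs", make_log(duplicate_of=dup_id))
    producer.send("raw_logs", make_log(duplicate_of=dup_id))
    producer.flush()
```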

Step 4: Verify and Analyze

After running the producer script, the data will flow through Kafka, be processed by GlassFlow, and loaded into ClickHouse. Connect to ClickHouse and run a query to verify the results.

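For example, a query that pulls back the most recent rows (column names follow the Step 1 schema):

```sql
SELECT timestamp, service_name, level, message, attributes
FROM logs
ORDER BY timestamp DESC
LIMIT 10;
```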

The output should show clean, structured data with the hot-path fields in their own columns and the remaining dynamic fields (user_id, request_id, duration_ms) nested within the attributes JSON column.

Next, verify that deduplication worked by checking for records with the known duplicate event_id.
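One way to check, assuming event_id is stored as a plain column, is to look for any event_id that appears more than once:

```sql
SELECT event_id, count() AS copies
FROM logs
GROUP BY event_id
HAVING copies > 1;
```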

This query should return zero rows, confirming that the upstream deduplication was successful. Finally, run an analytical query to see the value of the processed data.
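For instance, errors and average latency per service, reading duration_ms out of the JSON attributes column. The subcolumn access syntax shown here applies to recent ClickHouse versions with the JSON type; older releases may require JSONExtract functions instead.

```sql
SELECT
    service_name,
    countIf(level = 'ERROR') AS error_count,
    avg(toFloat64OrNull(toString(attributes.duration_ms))) AS avg_duration_ms
FROM logs
GROUP BY service_name
ORDER BY error_count DESC;
```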

This demonstrates a fully functional, end-to-end resilient log pipeline that successfully tames the chaos of semi-structured data.

Conclusion

The journey from a raw, unpredictable stream of semi-structured logs to a clean, queryable analytical asset is fraught with challenges. Traditional architectural choices have long forced a difficult compromise: the rigidity of schema-on-write, which stifles development agility, versus the performance pitfalls of schema-on-read, which burdens data consumers. This analysis has charted a course for a more modern and effective approach.

By adopting a schema-on-transform architecture, organizations can resolve this fundamental tension. This pattern provides the flexibility that development teams require at the source while delivering the structure and reliability that analytics teams demand at the destination. It achieves this by introducing a dedicated, intelligent processing layer between data production and consumption.

The combination of GlassFlow and ClickHouse delivers a powerful, modern stack. It significantly improves data quality by catching errors early, increases developer velocity by abstracting away pipeline complexity, and enables high-performance analytics on massive datasets. This architecture transforms your logs from a reactive troubleshooting tool into a proactive, strategic asset.

Try it Yourself. Ready to stop wrestling with complex stream processing frameworks and start building? Because GlassFlow is Python-native, you can get a robust, production-ready pipeline running in minutes, not weeks. Explore their official GitHub repository to see more examples and follow their "getting started" guide to deploy your first pipeline today. Take control of your data stream and discover how simple real-time transformation can be.

Ultimately, this architectural pattern represents a significant evolution in data pipeline design. It moves away from brittle, monolithic scripts and toward a more resilient, decoupled system. By embracing this approach, teams can transform their log data from a purely reactive troubleshooting tool into a proactive, strategic source of business and operational intelligence, unlocking its true value at scale.
