
Kafka Isn’t a Database, But We Gave It a Query Engine Anyway

Simon Chaffetz + Sami Tabet
Software Engineers
March 20, 2026
HN Disclosure: WarpStream sells a drop-in replacement for Apache Kafka built directly on-top of object storage.

Making Life Even Easier for Our Customers

WarpStream’s BYOC architecture gives users the data governance benefits of self-hosting with the convenience of a fully managed service. Events are a new observability tool that take this convenience one big step further.

When we initially designed WarpStream, our intention was for customers to lean on their existing observability tools to introspect WarpStream: scrape metrics and logs from your WarpStream Agents and pipe them into your own observability stack. This approach is thorough and keeps your WarpStream telemetry consolidated with the rest. It works well, but over time we noticed two limitations.

  1. Not all our users have modern observability tooling. Some may take structured logging with search and analytics for granted, while others don't have centralized logs at all. For various reasons, some teams have to tail their Agents’ logs in the terminal.
  2. High-cardinality data is expensive. Customers with modern observability systems often use third-party platforms that charge a fortune for high-cardinality metrics. We restrict the Agent's telemetry data to keep these costs low for our customers, and in turn these restrictions reduce visibility into the Agent.

We realized that to truly make WarpStream the easiest-to-operate streaming platform on the planet, we needed a way to:

  1. Emit high cardinality structured data.
  2. Query the data efficiently and visualize results easily.
  3. All while keeping the data inside the customer's environment.

Events address all three of these issues by storing high-cardinality event data, i.e. logs, in WarpStream as Kafka topics, and making these topics queryable via a lightweight query engine.

Events also solves an immediate problem facing two of our more recent products: Managed Data Pipelines and Tableflow, which both empower customers to automate complex data processing all from the WarpStream Console. These products are great, but without a built-in observability feature like Events, customers who want to introspect one of these processes have to fall back on an external tool, and switching away from the WarpStream Console adds friction to their troubleshooting workflows.

We considered deploying an open-source observability stack alongside each WarpStream cluster, but that would undermine one of WarpStream's core strengths: no additional infrastructure dependencies. WarpStream is cheap and easy to operate precisely because it's built on object storage with no extra systems to manage. Adding a sidecar database or log aggregation pipeline would add operational burden and cost.

So we decided to build it directly into WarpStream. WarpStream already has a robust storage engine, so the only missing piece was a small, native query engine. Luckily, many WarpStream engineers helped build Husky at Datadog so we know a little something about building query engines!

This post will have plenty of technical details, but let's dive into the experience of using Events first.

An Intuitive Addition to Your Tool Belt

Events is a built-in observability layer capturing structured operational data from your WarpStream Agents so you can search and visualize it directly in the Console. No external infrastructure is required: the Events data is simply stored as Kafka topics using the same BYOC architecture as WarpStream’s original streaming product. Here's a quick example of how you might use it.

Concrete Example: Debugging Iceberg Ingestion in Just a Couple Clicks

Suppose you've just configured a WarpStream Tableflow cluster to replicate an 'Orders' Kafka topic into an Iceberg table, with an AWS Glue catalog integration so your analytics team can query the Iceberg table's data from Athena (AWS's serverless SQL query engine). A few hours in, you check the WarpStream Console and everything looks healthy. Kafka records are being ingested and the Iceberg table is growing. But when your analysts open Athena, the table isn't there.

You navigate to the Tableflow cluster in the Console and scroll down to the Events Explorer at the bottom of the table's detail view. You search for errors: <span class="codeinline">data.log_level == "error"</span>.

Alongside the healthy ingestion events, a parallel stream of <span class="codeinline">aws_glue_table_management_job_failed</span> events appears, one every few minutes since ingestion started. You expand one of the events. The payload includes the table name, the Glue database, and the error message:

"AccessDeniedException: User is not authorized to perform glue:CreateTable on resource"

The IAM role attached to your Agents has the right S3 permissions for object storage, which is why ingestion is working, but is missing the Glue permissions needed to register the table in the catalog. You update the IAM policy, and within minutes the errors are replaced by an <span class="codeinline">aws_glue_table_created</span> event. Your analysts refresh Athena and the table appears.
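In this scenario, the fix is an IAM policy statement that grants the Agents' role the missing Glue actions. A sketch of what that statement might look like follows; <span class="codeinline">glue:CreateTable</span> comes straight from the error message above, while the other actions and the wildcard resource are illustrative, so consult the WarpStream and AWS Glue documentation for the exact, least-privilege set your setup needs:

```json
{
  "Effect": "Allow",
  "Action": [
    "glue:CreateTable",
    "glue:UpdateTable",
    "glue:GetTable",
    "glue:GetDatabase"
  ],
  "Resource": "*"
}
```

In production you would scope <span class="codeinline">Resource</span> down to the specific Glue catalog, database, and table ARNs rather than using a wildcard.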

The data was safe in object storage the entire time and the Iceberg table was healthy; only the catalog registration was failing. Without Events, you would have seen a working pipeline on one side and an empty Athena catalog on the other, with no indication of what was wrong in between. The event payload pointed you directly to the missing permission.

Using the Events Explorer

Events are consumable through the Kafka protocol like any other topic, but raw <span class="codeinline">kafka-console-consumer</span> output isn't the most pleasant debugging experience. The Events Explorer in the WarpStream Console provides a purpose-built interface for exploring, filtering, and aggregating your events.

The top of the Explorer has four inputs:

  1. Event type: which logs to search (Agent logs, ACL logs, Pipeline logs, or all).
  2. Time range: quick presets like 15m, 1h, 6h, or custom durations. Absolute date ranges are also supported. The search window cannot exceed 24 hours.
  3. Filter: conditions on event fields using a straightforward query language.
  4. Sort order: newest first, oldest first, or unsorted.

Results come back as expandable event cards showing the timestamp, type, and message. Expand a card to see the full structured JSON payload. Following the CloudEvents format, application data lives under <span class="codeinline">data.*</span>.
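An expanded event payload might look roughly like the following. The field values and the <span class="codeinline">agent_log</span> type name are illustrative, not WarpStream's actual schema; the point is the CloudEvents envelope with application data nested under <span class="codeinline">data.*</span>:

```json
{
  "specversion": "1.0",
  "id": "example-event-id",
  "source": "example-agent",
  "type": "agent_log",
  "time": "2026-03-20T14:05:32Z",
  "data": {
    "log_level": "error",
    "message": "example error message"
  }
}
```

This is why filters in the Explorer address application fields as <span class="codeinline">data.log_level</span>, <span class="codeinline">data.kafka_principal</span>, and so on.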

The timeline charts event volume over time, making it easy to spot patterns: an error spike after a deploy, periodic ACL denials, or a gradual uptick in pipeline failures. You can group by any field in each event's payload. Group Agent logs by <span class="codeinline">data.log_level</span> to see how the error-to-warning ratio shifts over time, or group ACL events by <span class="codeinline">data.kafka_principal</span> to see which service accounts generate the most denials.

You'll also find Events Explorer widgets embedded in the ACLs and Pipelines tabs. These provide tightly scoped views relevant to the current context. For example, the ACLs widget pre-filters for ACL events, and the Pipelines widget only shows events generated by the current pipeline.

A Light Footprint

Your Storage, Your Data

WarpStream Agents run in your cloud account and read/write directly to your object storage. Events fit right into this model. Event data is stored as internal Kafka topics, subject to the same encryption, retention policies, and access controls (ACLs) as any other topic. Importantly, Events queries run in the Agents directly, just like Kafka Produce and Fetch requests, so you don’t have to pay for large volumes of data to be egressed from your VPC.

Query results do pass through the control plane temporarily so they can be rendered in the WarpStream Console, but they aren’t persisted anywhere. In addition, the Events topics themselves are hard-coded to only contain operational metadata such as Agent logs, request types, ACL decisions, and Agent diagnostics. They never contain your topics' actual messages or raw data.

Cost Impact

Events contribute to your storage and API costs just like any other topic data persisted in your object storage bucket, but we've specifically tuned Events to be cheap. For moderately sized clusters, the expected impact is less than a few dollars per month.

If cost is a concern, e.g. for very high-throughput clusters, you can selectively disable event types you don't need, for example keeping ACL logs and turning off Agent logs. You can also reduce each event type's retention period below the 6-hour default.

But How Does It Work? The Query Engine

To put it bluntly, we bolted a query engine onto a distributed log that stores data in a row-oriented format. The storage layer is not columnar, which means we're never going to win any benchmark competitions. But that's okay. Our Events product doesn't need to be the fastest on the market. It just needs to support infinite cardinality and be fast enough, cheap enough, and easy enough to use that it makes life easier for our customers. And that’s what we built.

Lifecycle of a Query

When you submit a query through the Events Explorer, it gets routed to one of your Agents as a query job. The Agent then:

  1. Parses the query into an Abstract Syntax Tree (AST).
  2. Compiles the AST into a logical plan (filter, project, aggregate, sort, limit nodes).
  3. Physically plans the query by resolving topic metadata: fetching partition counts, start/end offsets, and using topic metadata to narrow the offset ranges based on the time filter.
  4. Splits the offset ranges into tasks. Each task covers a contiguous range of offsets for a single partition.
  5. Schedules tasks for execution in stages, with results flowing between stages via output buffers.
  6. Executes tasks using our in-house vectorized query engine.

For now, a single Agent executes the entire query, though the architecture is designed to distribute tasks across multiple Agents in the future.
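Step 4 above, carving each partition's offset range into independently schedulable tasks, can be sketched as follows. The type and function names are illustrative, not WarpStream's actual internals:

```go
package main

import "fmt"

// Task covers a contiguous range of offsets for a single partition.
type Task struct {
	Partition   int32
	StartOffset int64 // inclusive
	EndOffset   int64 // exclusive
}

// splitIntoTasks carves a partition's [start, end) offset range into
// fixed-size tasks that the scheduler can execute (and prune) independently.
func splitIntoTasks(partition int32, start, end, taskSize int64) []Task {
	var tasks []Task
	for s := start; s < end; s += taskSize {
		e := s + taskSize
		if e > end {
			e = end
		}
		tasks = append(tasks, Task{Partition: partition, StartOffset: s, EndOffset: e})
	}
	return tasks
}

func main() {
	// A partition spanning offsets 1000..1250, split into tasks of 100 offsets.
	for _, t := range splitIntoTasks(0, 1000, 1250, 100) {
		fmt.Printf("partition=%d offsets=[%d,%d)\n", t.Partition, t.StartOffset, t.EndOffset)
	}
}
```

Because each task is a self-contained unit of work, the scheduler is free to run them in any order, which is what makes the pruning described below effective.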

Pruning Timestamps

The core challenge is that queries are scoped to a time range, but data in Kafka is organized by offset, not timestamp. While WarpStream supports <span class="codeinline">ListOffsets</span> for coarse time-to-offset translation, the index is approximate (for performance reasons), and queries over small time windows like "the last hour" can still end up scanning much more data than necessary.

The query engine addresses this with progressive timestamp pruning. As tasks complete, the engine records the actual timestamp ranges observed at each offset range. We call these timestamp watermarks. These watermarks are then used to skip pending tasks whose offset ranges fall entirely outside the query's time filter.

The pruning works in both directions:

  • Lower offsets: If a completed task at offset 200 has a minimum timestamp of 1:30 AM, and the query filters for 2:00 AM–4:00 AM, then all tasks with offsets below 200 can be safely skipped: their timestamps can only be earlier.
  • Higher offsets: Similarly, if a completed task shows timestamps already past the query's end time, all tasks at higher offsets can be skipped.
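The two pruning directions can be sketched as a single predicate. This is a simplified model of the watermark bookkeeping (WarpStream's actual implementation tracks watermarks per offset range); it relies on the fact that offsets are append-ordered, so timestamps are roughly monotonic in offset:

```go
package main

import "fmt"

// watermark records the actual timestamps observed while scanning a
// completed task's offset range (timestamps in unix millis).
type watermark struct {
	startOffset  int64
	minTimestamp int64
	maxTimestamp int64
}

// canSkip reports whether a pending task starting at taskOffset can be
// skipped, given a completed task's watermark and the query's time filter.
func canSkip(taskOffset int64, w watermark, queryStart, queryEnd int64) bool {
	// Lower offsets: records before the watermark can only be older, so if
	// the watermark's minimum is already before the query window, skip.
	if taskOffset < w.startOffset && w.minTimestamp < queryStart {
		return true
	}
	// Higher offsets: records after the watermark can only be newer, so if
	// the watermark's maximum is already past the query window, skip.
	if taskOffset > w.startOffset && w.maxTimestamp > queryEnd {
		return true
	}
	return false
}

func main() {
	// A completed task at offset 200 observed timestamps 90..95; the query
	// filters for 120..240, so everything below offset 200 is skippable.
	w := watermark{startOffset: 200, minTimestamp: 90, maxTimestamp: 95}
	fmt.Println(canSkip(100, w, 120, 240)) // true: can only be earlier
	fmt.Println(canSkip(300, w, 120, 240)) // false: could still be in the window
}
```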

To maximize pruning effectiveness, tasks are not scheduled sequentially. Instead, the scheduler uses a golden-ratio-based spacing strategy (similar to Kronecker sequences and golden-section search) to spread early tasks across the offset space, sampling from the middle first and then filling in gaps. This maximizes the chances that the first few completed tasks produce watermarks that eliminate large swaths of remaining work.
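One way to implement this spacing, sketched here as an assumption about the general technique rather than WarpStream's exact scheduler: step through the unit interval by the inverse golden ratio, starting from the middle, and map each sample to a task index. The Kronecker sequence's low discrepancy guarantees every index is eventually hit:

```go
package main

import (
	"fmt"
	"math"
)

// goldenOrder returns a scheduling order for n tasks that samples the
// middle of the offset space first, then spreads subsequent picks with
// golden-ratio spacing so early completions produce watermarks covering
// the whole range.
func goldenOrder(n int) []int {
	const invPhi = 0.6180339887498949 // 1/phi
	order := make([]int, 0, n)
	seen := make([]bool, n)
	x := 0.5 // start sampling from the middle
	for len(order) < n {
		idx := int(x * float64(n))
		if idx >= n {
			idx = n - 1
		}
		if !seen[idx] {
			seen[idx] = true
			order = append(order, idx)
		}
		x += invPhi
		x -= math.Floor(x) // keep only the fractional part
	}
	return order
}

func main() {
	// The first few picks land far apart instead of clustering at one end.
	fmt.Println(goldenOrder(10))
}
```

Compare this with scanning offsets 0, 1, 2, … in order: sequential scheduling only ever learns about one end of the offset space, so the watermarks it produces can prune in only one direction.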

On a typical narrow time-range query, this pruning eliminates the majority of scheduled tasks and allows us to avoid scanning all the stored data.

The Direct Kafka Client

The query engine reads data using the Kafka protocol, fetching records at specific offset ranges just like a consumer would. But in the normal Kafka path, data flows through a chain of processing: the Agent handling the fetch reads from blob storage (or cache), decompresses the data, wraps it in Kafka fetch responses, compresses it for network transfer, and sends it to the requesting Agent, which then decompresses it again (learn about how we handle distributed file caches in this blog post). Even when the query is running on an Agent that is capable of serving the fetch request directly, this involves real network I/O and redundant compression cycles.

The query engine short-circuits this with a direct in-memory client. It connects to itself using Go's <span class="codeinline">net.Pipe()</span>, creating an in-memory bidirectional pipe that looks like a network connection to both ends but never hits the network stack. On top of that, the direct client signals via its client ID that compression should be disabled, eliminating the compress-decompress round entirely. Additionally, this ensures that the Events feature always works, even when the cluster is secured with advanced authentication schemes like mTLS.

These two optimizations, in-memory transport and disabled compression, more than doubled the data read throughput of the query engine in our benchmarks. Is it faster than a purpose-built observability solution? Absolutely not, but it's cheap, easy to use, adds zero additional dependencies, and is integrated natively into the product.

Query Fairness and Protecting Ingestion

Events is designed as an occasional debugging tool, not a primary observability system. To make sure queries never impact live Kafka workloads, several safeguards are in place:

  • Memory limits: Configurable caps on how much memory a single query can consume.
  • Concurrency control: A semaphore in the control plane limits the maximum number of concurrent queries to 2 per cluster, regardless of the number of Agents. This is intentionally conservative for now and will be relaxed as the system matures.
  • Scan limits: Restrictions on the amount of data scanned from Kafka per query, to minimize pressure on Agents handling fetch requests.
  • Query-only Agents: It's possible to restrict some Agents to query workloads (see the documentation here).
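A non-blocking semaphore like the control-plane concurrency limit can be sketched with a buffered channel, Go's idiomatic counting semaphore. This is a simplified model, not WarpStream's actual limiter:

```go
package main

import "fmt"

// querySemaphore caps the number of concurrent Events queries. A buffered
// channel of capacity max acts as a counting semaphore.
type querySemaphore chan struct{}

func newQuerySemaphore(max int) querySemaphore {
	return make(querySemaphore, max)
}

// TryAcquire returns false instead of blocking when the cluster is
// already at its concurrent-query limit, so a rejected query fails fast.
func (s querySemaphore) TryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot once a query finishes.
func (s querySemaphore) Release() { <-s }

func main() {
	sem := newQuerySemaphore(2)   // at most 2 concurrent queries per cluster
	fmt.Println(sem.TryAcquire()) // true
	fmt.Println(sem.TryAcquire()) // true
	fmt.Println(sem.TryAcquire()) // false: third query is rejected
	sem.Release()
	fmt.Println(sem.TryAcquire()) // true again after one query finishes
}
```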

More Optimizations

Beyond pruning and the direct client, the query engine applies several standard techniques:

  • Metadata-only evaluation: For queries that only need record metadata (e.g., counting events by timestamp), the engine skips decoding the record value entirely.
  • Early exit: For list-events and TopN queries, scanning stops as soon as enough results have been collected.
  • Adaptive fetch sizing: List-like queries use smaller fetches (to minimize over-reading), while aggregate queries use larger fetches (to maximize throughput).
  • Progressive results: For timeline queries, multiple sub-queries are scheduled to show results progressively for a more interactive UI.

Data Streams and Future Plans

Events launches with three data streams:

  1. Agent logs: structured logs from every Agent in the cluster, regardless of role. Filter by log level, search for specific error messages, or correlate Agent behavior with a timestamp.
  2. ACL events: every authorization decision, including denials. Captures the principal, resource, operation, and reason. Useful for rolling out ACL changes, managing multi-tenant clusters, and auditing shadow ACL decisions.
  3. Pipeline events: execution logs from WarpStream Managed Data Pipelines. These help you understand why a pipeline is failing and make the Tableflow product much easier to operate, since you can see processing feedback directly in the Console without context-switching to an external logging system.

We plan to add new data streams over time as we identify more areas where built-in observability can make our customers' lives easier.

Audit Logs

The same infrastructure that powers Events also drives WarpStream's Audit Logs feature. Audit Logs track control plane and data plane actions (Console requests, API calls, Kafka authentication/authorization events, and Schema Registry operations) using the same CloudEvents format. They are queryable through the Events Explorer with the same query language and benefit from the same query engine optimizations.

The only difference is that in the audit logs product, the WarpStream control plane hosts the Kafka topics and query engine because many audit log events are not tied to any specific cluster.

Getting Started

Events are available now for all WarpStream clusters. To enable them, you'll need to upgrade your Agents to version v770 or later. Once your Agents are updated:

  1. Go to your cluster's Events tab in the Console.
  2. Turn on Global Events Status.
  3. Optionally toggle individual event types on or off.
  4. Wait a few minutes for your Agents to capture their first events.
  5. Scroll down to the Events Explorer and run a search.

The global status and individual event types can be toggled on and off via the API or the Terraform provider.

Event topics have a default retention of 6 hours. Retention and partition count are configurable per event type. Find these details and more in the Events Explorer’s public documentation page.

Get started with WarpStream today and get $400 in credits that never expire. No credit card is required to start.