
Robinhood Swaps Kafka for WarpStream to Tame Logging Workloads and Costs

Robinhood + Jason Lauritzen
WarpStream Customer
December 9, 2025
HN Disclosure: WarpStream sells a drop-in replacement for Apache Kafka built directly on top of object storage.
By switching from Kafka to WarpStream for its logging workloads, Robinhood saved 45%. WarpStream's auto-scaling keeps clusters right-sized at all times, and features like Agent Groups eliminate noisy-neighbor issues and the need for complex networking setups like PrivateLink and VPC peering.

Robinhood is a financial services company that offers electronic trading of stocks and cryptocurrency, automated portfolio management and investing, and more. With over 14 million monthly active users and more than 10 terabytes of data processed per day, its data scale and needs are massive.

Robinhood software engineers Ethan Chen and Renan Rueda presented a talk at Current New Orleans 2025 (see the appendix for slides, a video of their talk, and before-and-after cost-reduction charts) about their transition from Kafka to WarpStream for their logging needs, which we’ve reproduced below.

Why Robinhood Picked WarpStream for Its Logging Workload

Logs at Robinhood fall into two categories: application-related logs and observability pipelines, both powered by Vector. Prior to WarpStream, these logs were produced to and consumed from Kafka.

The decision to migrate was driven by the highly cyclical nature of Robinhood's platform activity, which is directly tied to U.S. stock market hours: market hours consistently bring higher workloads, external factors vary the load throughout the day, sudden spikes are not unusual, and nights and weekends are usually low-traffic.

Traditional Kafka cloud deployments that rely on provisioned storage like EBS volumes can't scale up and down automatically between low- and high-traffic periods, leading to substantial compute waste (EC2 instances must stay provisioned to host the EBS volumes) as well as storage waste.

“If we have something that is elastic, it would save us a big amount of money by scaling down when we don’t have that much traffic,” said Rueda.

WarpStream's diskless architecture, built directly on S3-compatible object storage, combined with its ability to auto-scale made it a perfect fit for these logging workloads. But what about latency?

“Logging is a perfect candidate,” noted Chen. “Latency is not super sensitive.”

Architecture and Migration

The logging system's complexity necessitated a phased migration to ensure minimal disruption, no duplicate logs, and no impact on the log-viewing experience.

Before WarpStream, the logging setup was:

  1. Logs were produced to Kafka from the Vector daemonset. 
  2. Vector consumed the Kafka logs.
  3. Vector shipped logs to the logging service.
  4. The logging application used Kafka as the backend.
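
In code terms, the consume-and-ship hop (steps 2 and 3) looks roughly like the sketch below. Robinhood does this with Vector, not custom code; the broker address, topic, and logging-service endpoint here are hypothetical stand-ins for illustration only.

```python
# A minimal sketch of steps 2-3 above: consume log records from Kafka and
# forward them to the logging service. Robinhood uses Vector for this; the
# broker address, topic, and endpoint below are hypothetical.
import json

import requests
from confluent_kafka import Consumer

LOGGING_SERVICE_URL = "http://logging-service.internal/ingest"  # hypothetical

consumer = Consumer({
    "bootstrap.servers": "kafka-logging:9092",  # hypothetical broker address
    "group.id": "log-shipper",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["app-logs"])  # hypothetical topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # Ship each record downstream to the logging service (step 3).
        requests.post(LOGGING_SERVICE_URL, json=json.loads(msg.value()))
finally:
    consumer.close()
```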

To migrate, the Robinhood team broke the monolithic Kafka cluster into two WarpStream clusters, one for the logging service and one for the Vector daemonset, and split the migration into two corresponding phases.

For the logging service migration, Robinhood's Kafka setup was "all or nothing": nothing could be moved over bit by bit, so the cutover had to happen all at once. To keep disruption to a few minutes at most, they:

  1. Temporarily shut off Vector ingestion.
  2. Buffered logs in Kafka.
  3. Waited until the logging application finished processing the queue (see the lag-check sketch below).
  4. Performed the quick switchover to WarpStream.
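
Step 3 hinges on knowing when the logging application's consumer group has fully drained the buffered queue. Below is a rough sketch of such a lag check using the confluent-kafka client; the broker address, group ID, topic, and partition count are all hypothetical, not Robinhood's actual tooling.

```python
# Sketch of step 3: poll until the logging app's consumer group has drained
# the buffered queue, then cut over. All names here are hypothetical.
import time

from confluent_kafka import Consumer, TopicPartition

# Configure with the logging app's group ID so committed() reads its offsets.
consumer = Consumer({
    "bootstrap.servers": "kafka-logging:9092",  # hypothetical
    "group.id": "logging-app",                  # hypothetical
})

def total_lag(topic: str, num_partitions: int) -> int:
    parts = [TopicPartition(topic, p) for p in range(num_partitions)]
    lag = 0
    for tp in consumer.committed(parts, timeout=10):
        _, high = consumer.get_watermark_offsets(tp, timeout=10)
        if tp.offset >= 0:  # a negative offset means nothing committed yet
            lag += max(0, high - tp.offset)
    return lag

while total_lag("app-logs", num_partitions=12) > 0:
    time.sleep(5)
print("Queue drained; safe to switch over to WarpStream.")
```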

For the Vector log shipping, the migration was more gradual and involved two steps:

  1. They temporarily duplicated their Vector consumers, so one set consumed from Kafka and the other from WarpStream.
  2. Then they gradually pointed the log producers at WarpStream and turned off Kafka, as sketched below.
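
Because WarpStream is Kafka-protocol-compatible, the duplicated consumers can run identical code against both clusters, differing only in the bootstrap address. A minimal sketch of that dual-consumer phase, with hypothetical addresses, group, and topic:

```python
# Sketch of the dual-consumer phase: the same shipper runs against both
# clusters while producers are gradually repointed from Kafka to WarpStream,
# so no logs are dropped during the cutover. Names are hypothetical.
import threading

from confluent_kafka import Consumer

def forward(record: bytes) -> None:
    # Stand-in for shipping the log record to the logging service.
    print(record)

def ship_from(bootstrap: str) -> None:
    consumer = Consumer({
        "bootstrap.servers": bootstrap,
        "group.id": "log-shipper",        # hypothetical group
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["app-logs"])      # hypothetical topic
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        forward(msg.value())

# One shipper per cluster (both addresses hypothetical).
for bootstrap in ("kafka-logging:9092", "warpstream-agents:9092"):
    threading.Thread(target=ship_from, args=(bootstrap,), daemon=True).start()

threading.Event().wait()  # keep the main thread alive
```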

The resulting logging architecture, shown below, gives Robinhood more flexibility.

Deploying WarpStream

Below, you can see how Robinhood set up its WarpStream cluster.

The team designed their deployment to maximize isolation, configuration flexibility, and efficient multi-account operation by using Agent Groups. This allowed them to:

  • Assign particular clients to specific groups, which isolated noisy neighbors from one another and eliminated concerns about resource contention.
  • Apply different configurations as needed, e.g., enable TLS for one group, but plaintext for another.

This architecture also unlocked another major win: it simplified multi-account infrastructure. Robinhood granted permission to read from and write to a central WarpStream S3 bucket, then placed its Agent Groups in different VPCs. An application talks to one Agent Group to ship logs to S3, and another Agent Group serves the consumers, eliminating the need for complex inter-VPC networking like VPC peering or AWS PrivateLink, as sketched below.
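
Here is a minimal client-side sketch of that pattern. The per-VPC Agent Group endpoints are hypothetical; since WarpStream Agents speak the Kafka protocol, stock Kafka clients are assumed to work unchanged.

```python
# Sketch of the Agent Group pattern: a producer in VPC A writes through its
# local Agent Group, a consumer in VPC B reads through a different one, and
# the data only meets in the shared S3 bucket. Endpoints are hypothetical.
from confluent_kafka import Consumer, Producer

# VPC A: the application ships logs via its own Agent Group's endpoint.
producer = Producer({"bootstrap.servers": "agents-group-a.vpc-a.internal:9092"})
producer.produce("app-logs", b'{"level":"info","msg":"order placed"}')
producer.flush()

# VPC B: the logging pipeline consumes via a separate Agent Group's endpoint;
# no VPC peering or PrivateLink is needed between the two VPCs.
consumer = Consumer({
    "bootstrap.servers": "agents-group-b.vpc-b.internal:9092",
    "group.id": "log-shipper",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["app-logs"])
msg = consumer.poll(timeout=10.0)
if msg is not None and not msg.error():
    print(msg.value())
consumer.close()
```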

Configuring WarpStream

WarpStream is optimized for reduced costs and simplified operations out of the box. Every deployment of WarpStream can be further tuned based on business needs.

WarpStream’s standard instance recommendation is one core per 4 GiB of RAM, which Robinhood followed. They also leveraged:

  • Horizontal pod auto-scaling (HPA). This auto-scaling policy was critical for handling their cyclical traffic. It allowed fast scale-ups to absorb sudden traffic spikes (like when the market opens) and slow, graceful scale-downs that prevented latency spikes by giving clients enough time to move away from terminating Agents.
  • AZ-aware scaling. To match capacity to where workloads needed it, they created three Kubernetes Deployments (one per AZ), each with its own HPA, and made them AZ-aware. This let each zone's capacity scale independently based on its specific traffic load.
  • Customized batch settings. They chose larger batch sizes, which resulted in fewer S3 requests and significant S3 API savings. The latency increase was minimal (see the before-and-after chart and the tuning sketch below): average produce latency rose from 0.2 to 0.45 seconds, an acceptable trade-off for logging.
Robinhood’s average produce latency before and after batch tuning (in seconds).
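
As a rough illustration of that batching trade-off (the talk does not share Robinhood's exact settings, and their tuning may well have been on the WarpStream Agents rather than in clients), here is a sketch using standard Kafka producer batching options with confluent-kafka. All values and endpoints are hypothetical.

```python
# Illustrative larger-batch producer tuning (librdkafka settings). Values are
# examples, not Robinhood's configuration: the trade-off is fewer, larger
# writes (and fewer S3 API calls on the WarpStream side) in exchange for a
# few hundred milliseconds of extra produce latency.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "warpstream-agents:9092",  # hypothetical endpoint
    "linger.ms": 250,           # wait longer so batches fill before sending
    "batch.size": 1_048_576,    # allow up to ~1 MiB per partition batch
    "compression.type": "lz4",  # shrink batches further on the wire
})

for i in range(10_000):
    producer.produce("app-logs", f'{{"seq":{i}}}'.encode())
    producer.poll(0)  # serve delivery callbacks, keep the queue draining
producer.flush()
```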

Pros of Migrating and Cost Savings

Compared to their prior Kafka-powered logging setup, WarpStream massively simplified operations by:

  • Simplifying storage. Using S3 provides automatic data replication, lower storage costs than EBS, and virtually unlimited capacity, eliminating the need to constantly increase EBS volumes.
  • Eliminating Kafka control plane maintenance. Since the WarpStream control plane is managed by WarpStream, this operations item was completely eliminated.
  • Increasing stability. WarpStream removed the burden of dealing with URPs (under-replicated partitions), since replication is handled by S3 automatically.
  • Reducing on-call burden. Less time is spent keeping services healthy.
  • Accelerating automation. New clusters can be created in a matter of hours.

And how did that translate into networking, compute, and storage efficiency, and into cost savings versus Kafka? Overall, WarpStream saved Robinhood 45% compared to Kafka. The savings came from eliminating inter-AZ networking fees entirely, reducing compute costs by 36%, and reducing storage costs by 13%.

Appendix

You can grab a PDF copy of the slides from Robinhood's presentation by clicking here. Below, you'll find a video version of the presentation:

Robinhood's inter-AZ, storage, and compute costs before and after WarpStream.
Get started with WarpStream today and get $400 in credits that never expire. No credit card is required to start.