The WarpStream team receives lots of questions about our architecture, pricing, unique features, and other aspects of WarpStream. We created this page to serve as an up-to-date repository of frequently asked questions.
This blog is split into sections so you can browse by where we received each question, then read the question and its answer.
You can reach out to us directly on platforms like Reddit, but if you have questions, you can also use the Contact Us page or click the AI chatbot icon in the bottom-right corner of your screen to get answers in real time. The chatbot is an LLM trained on our website, docs, other materials, our public GitHub, and questions we've received.
WarpStream co-founders and engineers Richard Artoul and Ryan Worl did a Reddit AMA (Ask Me Anything) about WarpStream and software engineering. You can go here to check out the original post. We’ve reproduced the questions and answers below.
Question: What storage backend is the control plane using (if you can answer), and what was the hardest problem you've faced so far with it? (Bonus points if you share how you solved it.)
Answer: The control plane is built on top of cloud-native databases, so it's DynamoDB in AWS, Spanner in GCP, and Cosmos DB in Azure. It can also run on top of any database with a SQL interface. We don't really depend on much functionality from the underlying datastore, so it's easy to add new backends for it!
Question: What enabled WarpStream to be stateless? Are things kept in memory for optimization?
Answer: We were able to make it completely stateless for two reasons, I think:
We also do lots of batching and caching in memory, as you would expect. This blog post explains some of that.
Question: More of a concern than a question: I have a SaaS where 80% of our data can go directly to S3, and latency in terms of seconds isn't a problem. The other 20% needs to be as fast as "traditional" Kafka. Does WarpStream have a solution to cover 100% of my use case? Also, what about secondary offerings like Flink or a TableFlow-like capability?
Answer: Yes! WarpStream latency is very tunable; it's basically a cost vs. performance knob. See our docs. This won't allow you to get latency as low as an extremely well-tuned open-source (OSS) Kafka cluster, but it will get you pretty close to the performance of most real-world deployments. We don't have any plans for a BYOC Flink offering right now, but you can buy Confluent Platform Flink and use that with WarpStream. WarpStream Tableflow is coming very soon though ;)
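For a flavor of the client-side half of that knob, here's a minimal sketch using the confluent-kafka Python client (the bootstrap address is hypothetical, and the Agent-side batching settings are what the docs above cover):

```python
from confluent_kafka import Producer

# Hypothetical bootstrap address for a WarpStream Agent deployment.
# Larger linger/batch values trade a bit of latency for fewer, larger
# object storage writes (cheaper); smaller values do the opposite.
producer = Producer({
    "bootstrap.servers": "warpstream-agent.internal:9092",
    "linger.ms": 100,         # wait up to 100ms to build bigger batches
    "batch.size": 1_000_000,  # ~1 MB batches amortize per-request overhead
    "acks": "all",
})

producer.produce("clickstream", value=b"example-event")
producer.flush()
```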
Question: Are you still maintaining Bento despite selling Benthos? Are there any new features coming to Bento?
Answer: Yes! It's very actively maintained at https://github.com/warpstreamlabs/bento/commits/main. Our designer also did a slick rebrand recently (you can check out the website here). We add new features all the time, like new inputs/outputs and rate-limiting functionality, and we just added a new "strict mode". It's a pretty mature product already that does what it says on the box, so we're mostly just optimizing and integrating with new technologies.
Question: What are the architectural differences between WarpStream and your previous design, Husky, at Datadog?
Answer: The biggest difference is probably that Husky relied directly on FoundationDB for all metadata operations, whereas WarpStream has a custom metadata store that's built on top of other cloud native data stores. We did this for two reasons:
Question: WarpStream is built to be compatible with the Kafka protocol, that's evolved over many years; many parts of it are un/under-documented. What part of adhering to the protocol would you get rid of to improve WarpStream if you could? (Gunnar Morling's suggestion to get rid of partitions is one possibility that comes to mind.)
Answer: The original consumer group protocol implementation is pretty complicated and finicky, and can only effectively be ported by transcribing the Java implementation line by line. That said, I think the consumer group v2 implementation that was just released in Apache Kafka 4.0 should be a big step forward for the community.
Question: It’s a great concept. How would you compare your approach to Bufstream? For which scenarios might either be the better option?
Answer: WarpStream sells a managed BYOC solution where the vast majority of the monitoring, scaling, and operational stuff is handled by the WarpStream team. Bufstream (effectively) sells enterprise licenses for fully self-hosted software where the customer is responsible for all day-to-day operations, including monitoring and scaling the embedded control plane and consensus mechanisms. They're just very different products.
Also, Bufstream has published almost no information about their architecture and design, so it's hard to compare technically.
That being said, WarpStream is used in production at dozens of companies, many at multi-GiB/s and some at tens of GiB/s scale. The same cannot be said of Bufstream.
Question: What’s next on the roadmap?
Answer: Lots of stuff! Just off the top of my head:
We also launched a bunch of really cool "boring" but very powerful stuff in the last few weeks, like the ability to restore deleted topics and our new Diagnostics feature that automatically analyzes your cluster in the background to proactively surface inefficiencies or problems.
Question: I've read that Protobuf validation is coming soon. Do you plan to support Google CEL to implement data quality rules as well?
Answer: We're open to it, I think it's technically on our roadmap, but we've only seen demand from one customer so far. We'll prioritize it if it crops up more!
Question: What's the worst on-call situation you've faced at WarpStream?
Answer: Our control plane is also completely stateless, like the Agents (it just depends on cloud native data stores like DynamoDB and Spanner), so our on-call rotations aren't too bad. That said, we did have some really annoying on-call shifts early on in the company's life. I remember one incident we had before we GA'd the product where control plane nodes would lock up randomly, and so we had to manually monitor the logs and kill nodes for like 14 hours until we root-caused the issue (deadlock in a telemetry library).
Question: Since all of the data is already on S3, you already have a compaction process, and you also have stateless transforms, how hard is it to build an index on top of the S3 data and query it via Trino?
Answer: Yeah, we're adding support for exactly this by materializing Iceberg and Delta Lake tables that you'll be able to query natively with whatever query engine you want!
Question: Which parts of the Kafka API were more difficult to implement? What do you think about "Kafka Transactions are Broken"? Will you go through Jepsen for validation?
Answer: I think the hardest part to get right was actually the consumer group implementation. Transactions and idempotent producer were actually fairly straightforward thanks to our architecture.
Regarding that Jepsen report: the bugs surfaced there were fairly ... crude, in my opinion. None of them would have made it past the chaos fault injector in the integration tests we run in our regular CI environment. We run over 50,000 tests on every commit, using librdkafka, franz-go, sarama, and the Java client, to ensure we catch client differences before our customers do.
Inside of Antithesis, we run test workloads similar in design to the ones Jepsen uses for Kafka. This is a more powerful approach because it runs against our system continuously, whereas a Jepsen report reflects a single point in time.
Question: If we leave the price issue aside, what cases does your technology cover better than classic Kafka? I mean business scenarios. It seems that it is not very well suited for messaging, but perhaps for streaming data analytics? That is, what scenarios can I cover better if price is not a concern?
Answer: Observability, telemetry, analytics, feeding data lakes, security, and IoT are our bread-and-butter workloads, but a good number of our customers use us for traditional messaging use cases as well. Usually they just want a simpler and more elastic solution.
Question: Before DataDog, where did you both learn about programming and distributed systems? Did you perhaps have exposure to computer science at an early age in high school?
Answer: Ryan: I took a few computer science (CS) classes in college, but my degree was actually in Business Management. That said, I've been writing software and building apps since high school, and I released a few apps on the Apple App Store back then. I even co-founded a laundry pickup and delivery service for college students in New York when I was 22 (hint: it didn't go well).
Richie: I actually majored in biochemistry / pharmacology in college. I graduated top of my class, but then decided I hated hospitals and didn't want to follow in my father's footsteps and go to medical school. I got a job out of college that I really hated for a few months, then quit and went to a coding bootcamp, and the rest is kind of history. I wrote a whole blog post about how I transitioned from a coding bootcamp grad to a distributed systems expert, which you can read here.
Question: Do you have any fears of AWS offering anything within MSK to compete with WarpStream?
Answer: Not really. Obviously, competing with MSK is hard because they have an unfair advantage being the cloud provider themselves. For example, they just don't bill themselves for inter-zone networking between MSK brokers because why not?
MSK has had more than 6 years to build a better product, and they haven't done very much (still no auto partition rebalancing; until a few months ago, the answer to downscaling a cluster was "delete it and make a new one"; no automated scaling; etc.). Express brokers were the first thing they released in a long time where it felt like they were actually trying again, but if you dig under the covers a bit, it's really just fancy tiered storage, and all the existing product issues remain.
MSK is always the elephant in the room, but they have a really big business because they exist as a first-party solution: customers can start using MSK without requesting budget, going through security/legal/compliance approvals, talking to procurement, etc. But I don't worry about them from a technology perspective, and we see a lot of MSK-to-WarpStream migrations for cost, usability, and reliability reasons.
Question: WarpStream abstracts away traditional Kafka brokers by pushing durability to S3-compatible object storage and maintaining a stateless broker model. This is a powerful simplification, but also raises questions about tradeoffs.
How do you handle consistency and ordering guarantees, especially in the presence of S3's eventual consistency model (especially around overwrite and list operations)? Have you built additional coordination mechanisms (like distributed consensus) to mitigate that, and if not, how do you prevent tail-latency spikes or ghost reads during high-concurrency workloads?
Lastly, just curious if you've benchmarked tail latency in scenarios involving aggressive topic partition fan-outs and reader/consumer churn ... how does the system scale and retain low-latency guarantees when metadata/state must be reconstructed frequently from object storage?
Answer: The WarpStream control plane / metadata service provides remote consensus for the cluster, basically, and S3 is not eventually consistent (although it would be fine even if it were).
We have some public benchmarks that I think would answer your other question. Check the "one final workload" section; we're the only vendor that publishes a benchmark for this type of extremely pathological workload at all.
Question: Given S3’s eventual consistency and lack of atomic operations, how does WarpStream guarantee linearizable writes and exactly-once semantics without introducing a consensus layer or broker state, especially under concurrent producer writes and high consumer fan-out?
Answer: S3 is no longer eventually consistent and hasn't been for years now. Moreover, WarpStream only relies on a guarantee that has been available in S3 for well over a decade: read-after-write consistency on newly written keys. That gives us durability and high throughput/concurrency on the data plane. The rest of the protocol is accomplished with the control plane running on top of a distributed database that itself implements consensus somewhere; maintaining that database is the cloud provider's responsibility.
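As a concrete illustration of the one guarantee we rely on (bucket and key names below are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Write a brand-new key. Read-after-write consistency on new keys means
# the subsequent read is guaranteed to observe this object.
s3.put_object(Bucket="my-warpstream-bucket", Key="segments/0001", Body=b"batch-bytes")

obj = s3.get_object(Bucket="my-warpstream-bucket", Key="segments/0001")
assert obj["Body"].read() == b"batch-bytes"
```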
Question: If WarpStream’s pricing is one of its core value props today, do you guarantee current pricing for existing customers? How do you plan to handle future pricing changes without breaking customer trust or usage patterns?
Answer: We offer customers pricing guarantees through long-term contracts for committed usage like basically every other infrastructure software vendor. Technically, we have the right to increase prices for existing customers who don't have commits / contracts with us in place, but we've never done that in the history of the company.
Question: Does WarpStream have any hotspots for writing due to partition placement?
Answer: WarpStream doesn't have a concept of partition leaders, so any Agent can read and write data for any partition. That means in theory you could write many GiB/s of traffic into a single partition spread across dozens of Agents if you wanted, and that would actually be more efficient for us on the control plane side!
Realistically, the only way to hotspot a WarpStream cluster is to have a single Kafka client produce or consume more data than a single WarpStream Agent can process, which is almost impossible unless you run really small Agent nodes.
We guarantee ordering within a partition just like Kafka, even when a partition is being written to concurrently by two different Agents.
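For example, with any standard Kafka client this just works; here's a minimal sketch using the confluent-kafka Python client (topic, key, and bootstrap address are hypothetical):

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "warpstream-agent.internal:9092"})

# All records with key "user-42" hash to the same partition, so consumers
# read them back in exactly this order, even if each produce request was
# handled by a different Agent.
for event in (b"signup", b"login", b"purchase"):
    producer.produce("user-events", key=b"user-42", value=event)

producer.flush()
```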
Question: If WarpStream is stateless and relies entirely on object storage, what happens during a prolonged S3 outage or API throttling event? Do you have any fallback path, or is all ingestion and consumption completely blocked?
Answer: S3 is in the critical path for reading and writing data, which means that WarpStream would be unavailable during an S3 outage. This is an extremely rare event, and the amount of cumulative S3 downtime we've observed in all of the AWS regions where customers are running WarpStream since GA is very small.
Question: If everything is on S3 and the metadata is on DynamoDB, does this mean WarpStream can run on a single server? How much throughput does that get you?
Answer: Yeah, technically you can run on a single server, although realistically you would probably want at least two for high availability in case that one server randomly dies. In terms of throughput, it depends on how big the server is.
This section compiles questions and answers from our Reddit posts. To see our posts and comments, you can follow our official Reddit account, warpstream_official.
Question: This [S3] is only relevant for AWS and multi-zone clusters. I think Kafka with HDDs in a single zone will be much cheaper.
Answer: Even if someone was using something like EBS in a single zone, it would be $0.08/GiB per month vs. $0.02/GiB per month for S3, but yes, 1-AZ Kafka is going to be cheaper than 3-AZ Kafka.
Question: What? Where are you getting that insane number [$0.053 for every GiB of data that you stream through your Kafka cluster in the best case scenario] from? If you're using Apache Kafka, then not even that. That's a lot of money [the cost of storing a GiB of data in S3 for a month is only $0.0214] if you're dealing with real volumes. But it's clear that you're not dealing with high-demand streaming, considering you're suggesting S3 as backing storage. This is not capable of real-time streaming.
Answer: A separate blog we did on cloud disk costs may be helpful as it dives into the types of numbers you referenced.
If you look at something like EBS for storage, it's $0.08/GiB per month. Factor in triple replication, and that goes to $0.24/GiB per month. Add in a buffer to run the system with 50% of storage free for disasters and changes in workload volume, and you're doubling to $0.48/GiB. If follower fetch is not enabled, that can then climb to $0.53/GiB per month.
Obviously, you can lower your replication factor, use SSDs instead of EBS, etc., to reduce those costs, but it's going to be difficult to get as cheap as S3 at $0.02/GiB per month.
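To make that arithmetic concrete, here it is as a few lines of Python:

```python
# Back-of-the-envelope math for EBS-backed Kafka storage, per usable GiB/month.
ebs_per_gib = 0.08              # EBS list price, $/GiB-month
replicated = ebs_per_gib * 3    # triple replication -> ~$0.24
with_headroom = replicated * 2  # keep 50% free for disasters/growth -> ~$0.48
# Without fetch-from-follower, inter-AZ consumer traffic pushes the effective
# cost to roughly $0.53/GiB-month, vs. ~$0.02/GiB-month for S3.
print(f"${with_headroom:.2f} per GiB-month")  # $0.48 per GiB-month
```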
By using S3 for both storage and replication, WarpStream gets rid of those interzone networking fees entirely, and there's no local data to replicate.
As far as "high demand streaming" and not being "capable of real-time streaming", it comes down to your latency requirements. We have a public-facing benchmark dashboard and can get sub-one-second P99 producer latency and sub-two-second end-to-end latency. That can further be reduced via S3 Express One Zone. We also have customers that split workloads, e.g., more relaxed latency workloads go through WarpStream, and more strict latency workloads go through self-hosted Kafka or Confluent.
If you check out our website, you'll see we have many companies (like Grafana Labs, Goldsky, Zomato, and PostHog) using WarpStream in production. We don't do that deceptive SaaS tactic of slapping logos on our site for anyone who signs up for a free account; someone has to be a paying customer using us in production.
Question: But with MSK, you don't pay for inter-AZ traffic. With Express brokers, the disk isn't managed on them, which makes it much more efficient, say when rebalancing. Also, over-provisioning isn't an issue with MSK if you choose serverless brokers. How do you actually save costs compared to MSK? Are you able to give simple, fact-based examples comparing your pricing to MSK's for a specific use case? Thanks :)
Answer: For MSK, you pay for client --> broker traffic. There's a Q&A available here where someone asks about data in/out charges.
Express brokers charge you for write throughput directly in exchange for optimizing the storage cost, and the storage cost is higher than S3's, e.g., the price per GB per month for primary storage is $0.10 with Express, whereas S3 is $0.02.
The broker compute is also marked up compared to the base EC2 cost. For example, on the Express section of the MSK pricing page, an m7g.4xlarge is $3.264/hour, whereas it's only $0.6528/hour on the EC2 pricing page, a 5x markup.
You can use our pricing calculator to compare WarpStream to MSK, MSK Express, and Serverless. It lets you look at a bunch of factors like write throughput, number of partitions, retention, whether fetch from follower is enabled, etc.
Here's an example of a WarpStream customer that switched from MSK and saved 83%.
Question: Pretty cool and congrats! I often work in greenfield projects, where neither I nor my team members have a tendency to choose a new-in-town solution; we'd rather look at existing and battle-tested solutions, such as AWS MSK or Azure Event Hubs with the Kafka protocol. How would you convince me to go through the extra effort of deployment, management, and learning curve to use WarpStream? It's not that I don't want to use this, but often it's not given the chance, as you have to justify choices in ADRs to managers, POs, and PMs.
Answer: Yep, totally get the whole thing where older vendors or things like the Gartner "Magic Quadrant" come into play.
As far as the effort of deployment, management, and learning curve: we'd make the case that it's actually easier to use WarpStream than something like MSK, Azure Event Hubs, or OSS Kafka. We're Kafka-compatible and use a stateless architecture built on top of object storage, so you don't have to worry about EBS or local disk management, partition or broker rebalancing, hot spots, VPC peering, over-provisioning, auto-scaling issues (or the lack of zero-ops auto-scaling), etc.
We've actually had companies that have never run Kafka but were running something like Pub/Sub and transitioned over in two weeks or less. We even have folks reach out on Reddit posts like this to share their experience:
"We've reduced our Pub/Sub costs by around 95% by migrating from GCP PubSub to Warpstream BYOC this month, so far so good.
The infra part is simple as hell, the architecture is cool, you can use a single cross-project bucket to make queue replications and stuff. I do recommend the service, and I'll use it in the future as the only sane Kafka implementation."
Because of that stateless architecture and everything living in your cloud, there are no inter-AZ fees, storage costs go down 24x (or more), and you're secure by default as you're only responsible for the compute (no need to worry about cross-account IAM access or privileges needed by WarpStream).
We're well battle-tested. Big companies like Zomato, Character.AI, Cursor, Grafana Labs, Goldsky, and PostHog use us in production.
Here's a case study that Character.AI wrote on their own engineering blog about switching to and using WarpStream. We're backed by Confluent, too.
Question: Say that I'm running in Azure, and I want to try out WarpStream. What would I run it on? Currently I still have to convince the company I work for to start using Kubernetes, so imagine that that's out of the picture. What would you advise?
Answer: You can stay in Azure; that's the point of Bring Your Own Cloud. WarpStream works with S3-compatible object storage as well as Azure Blob Storage, which is what you'd use in Azure.
We host the control plane, which just stores metadata. You host the compute (producers, Agents, and consumers) and the object storage. To get up and running, you deploy our stateless Agents, which have four requirements that you can find here.
If you want to try our demo and sandbox environment, all you have to do is run a couple of commands in your terminal. The sketch below shows the gist; the exact, current commands are in our docs:
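```
# Illustrative only; check the docs for the current install command.
curl https://console.warpstream.com/install.sh | bash
warpstream demo
```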
To run WarpStream in Azure without Kubernetes, there are two options.
Option No. 1: Run the Docker image directly on an Azure VM, e.g. (a sketch; the flags below are illustrative placeholders):
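```
# Sketch only: the flags and values below are hypothetical placeholders;
# see our deployment docs for the real configuration reference.
docker run public.ecr.aws/warpstream-labs/warpstream_agent:latest \
  agent \
  -bucketURL "azblob://my-warpstream-container" \
  -agentKey "$WARPSTREAM_AGENT_KEY" \
  -defaultVirtualClusterID "$WARPSTREAM_VIRTUAL_CLUSTER_ID"
```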
Option No. 2: Run our WarpStream binary directly on an Azure VM and set up a systemd service unit to run it, along the lines of the hypothetical sketch below (the flags are placeholders; see our docs for the real configuration):
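```
# Hypothetical unit file (e.g., /etc/systemd/system/warpstream-agent.service);
# the ExecStart flags are placeholders, see our deployment docs.
[Unit]
Description=WarpStream Agent
After=network-online.target

[Service]
ExecStart=/usr/local/bin/warpstream agent -bucketURL "azblob://my-warpstream-container" -agentKey "..."
Restart=always

[Install]
WantedBy=multi-user.target
```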
Our deployment docs (linked above) cover both options in more detail.
Question: How would this compare to AWS Kinesis in functionality and cost? (Aside from the Kafka compatibility.)
Answer: First, let me address cost. You can use our public-facing pricing calculator to compare WarpStream vs. Kinesis. For example, if we assume something like a 1 GiB/s write throughput, 4,096 partitions, and 7-day retention, WarpStream would be 86% cheaper per month than Kinesis.
As far as functionality, it really comes down to what you're currently doing and how you need to evolve in the future. Kinesis can be thought of as a data firehose (one of its components even has "Firehose" in its name). It's great for getting data in and out, but you lose some control over your streaming infrastructure. Also, retention isn't as configurable as WarpStream's.
WarpStream, being Kafka compatible, is going to give you much more control over your streaming infrastructure while reducing your costs. For example, WarpStream has its core product, BYOC, but it is also a complete data-streaming solution, as it has Managed Data Pipelines (ETL and stream processing), Diagnostics, Orbit (Kafka migration and replication), and Data Governance (schema registry, validation, and linking).
Question: Nice results and even better title [referring to our "Kafka is dead, long live Kafka" blog]. Curious that S3 worked for your latency requirements; I thought maybe DynamoDB or a faster object store would have been better, given its availability guarantees.
Can Agents be monitored using the same tools that are used to monitor Kafka brokers, or is there a need for different monitoring? Nice to see this development in the Kafka world. Keep going!
Answer: While there are some cases where really low end-to-end latency is needed, often we find that it's not the case and a little give in latency is OK.
To your point: We do support DynamoDB. Also, we've supported S3 Express One Zone for some time, and given its price drops, it will be an even more attractive option for those who want to use it with WarpStream.
Yes, we have an API that allows you to pull metrics to do monitoring, so you can use your preferred tool(s). See more info on that here.
We also recently released WarpStream Diagnostics, which is another level of monitoring. Instead of simply exposing metrics, it monitors your clusters for potential issues and suggests ways to fix them.
Question: Hi, our team uses Event Hubs, Azure's data streaming platform, and we are exploring alternatives for cost reasons. Have you guys published anything comparing the overall costs we would incur by using your solution instead of other managed data streaming platforms?
A study of latency vs. number of messages/data volume would also be helpful. I understand that you guys are claiming that your stateless Agents can effortlessly scale horizontally and handle new load; do you have anything that goes into more detail about how you did that, especially the part where you managed to achieve low-latency streaming on top of a relatively higher-latency object store?
We also have a hard requirement of ordering messages by partition key; we cannot tolerate out-of-order data in any case. Currently, we have dedicated consumers for each partition. In your "Kafka is dead..." blog post, you talked about moving past partitions as a low-level concept and offering a higher-level abstraction instead. Have you published anything elaborating on that?
Answer: Yes, we have cost comparisons versus other streaming platforms, like this 83% cost reduction from switching to WarpStream from MSK. We also have a downloadable executive summary that goes deeper into cost comparisons.
Our benchmarks and TCO blog cover a lot about latency. Plus, we have this dedicated docs page about tuning for lower latency.
While WarpStream abstracts some traditional partitioning concepts, it maintains the same per-partition ordering guarantees as Kafka. This ensures that messages with the same partition key are processed in order.
WarpStream supports idempotent produce requests, allowing clients to produce the same batch of data multiple times while ensuring it's only appended once. This enhances data consistency and reliability.
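In client terms, that's the standard Kafka idempotence setting; here's a minimal sketch with the confluent-kafka Python client (topic and bootstrap address are hypothetical):

```python
from confluent_kafka import Producer

# enable.idempotence makes the client attach producer IDs and sequence
# numbers to each batch, so a retried send of the same batch is appended
# exactly once on the cluster side.
producer = Producer({
    "bootstrap.servers": "warpstream-agent.internal:9092",
    "enable.idempotence": True,
})

producer.produce("orders", key=b"order-1001", value=b"created")
producer.flush()  # safe to retry on errors; duplicates are discarded
```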
Question: Why not just use Redis [or] PubSub, which uses no disk?
Answer: It depends on your use case; we wouldn’t recommend over-engineering a data stack if a simpler solution works.
If sub-millisecond latency is an absolute requirement (though in practice, near-real-time, under ~1 second, is often sufficient) and you don’t need message persistence, replayability, consumer groups, or extremely high throughput, then Redis [or] Pub/Sub can work well. Just be mindful of inter-AZ traffic costs if running across multiple availability zones.
However, if low latency isn’t the only priority and you need message durability, replayability, consumer groups, or large-scale event streaming, then a Kafka-compatible platform like WarpStream makes much more sense. It enables event-driven architectures, ETL, and stream processing without requiring extra infrastructure for persistence.
Redis is purely in-memory and does not persist messages or allow replay: if a subscriber is offline, it misses the message entirely.
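A quick sketch of why, using redis-py (the channel name is hypothetical):

```python
import redis

r = redis.Redis()

# Redis pub/sub is fire-and-forget: this message reaches only subscribers
# connected at this exact moment. There is no log to replay later.
r.publish("notifications", "cache-invalidated")
```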
Google Cloud Pub/Sub, on the other hand, does persist messages and supports replayability for up to 7–365 days.
If you meant Google Cloud Pub/Sub, then its persistence and replay make it more comparable to Kafka-like systems. But if you're comparing Redis vs. WarpStream, Redis's lack of durability and replayability makes it better suited for real-time notifications than for long-lived event-driven architectures.
Question: Interesting. How do you plan to defeat that nasty P99 latency issue? Is it solvable with the current architecture?
Answer: We trade off super-low latency in favor of reduced costs and operational simplicity. You can check out our benchmark blog here. Even under heavy workloads, we can keep the end-to-end P99 latency under 1 second and average around 400ms. This accounts for the full WarpStream write and read path, including commits to WarpStream’s metadata store and acknowledgements back to the clients.
If you leverage S3 Express One Zone and our low-latency clusters, we can get P99 down to around 50ms, but that comes with higher costs. You can learn more about how WarpStream manages S3 Express One Zone via our official blog post and in our docs.
The question to always ask is, "How low does P99 really need to be?" For most situations, our standard setup using S3 should be fast enough, but, as noted above, if folks are willing to pay a little more for S3 Express One Zone, we can reduce latency a lot more.