Tiered Storage Won’t Fix Kafka

Apr 28, 2024
Richie

Tiered Storage Seems Like a Good Idea

Tiered storage is a hot topic in the world of data streaming systems, and for good reason. Cloud disks are (really) expensive, object storage is cheap, and in most cases, live consumers are just reading the most recently written data. Paying for expensive cloud disks to store historical data isn’t cost-effective, so historical data should be moved (tiered) to object storage. On paper, it makes all the sense in the world.

But first, what exactly does tiered storage mean in the context of a streaming system? The basic idea is to only persist recent data to disk, and asynchronously move historical data to object storage where it can rest cheaply.

[Figure: tiered storage in action.]
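For concreteness, here’s roughly what opting a topic into tiered storage looks like with Apache Kafka’s KIP-405 implementation (early access as of Kafka 3.6). This is a minimal sketch: the topic name and retention values are illustrative, and the broker must separately be started with remote.log.storage.system.enable=true and a RemoteStorageManager plugin configured.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTieredTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // Topic-level tiered storage settings from KIP-405.
                NewTopic topic = new NewTopic("events", 6, (short) 3).configs(Map.of(
                    "remote.storage.enable", "true",  // tier this topic's segments
                    "local.retention.ms", "3600000",  // keep only ~1 hour on local disk
                    "retention.ms", "2592000000"      // total retention: 30 days
                ));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }

Everything older than local.retention.ms is eligible to be tiered and deleted locally; everything younger stays on the brokers’ disks.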

This optimization is particularly useful in scenarios where the primary scaling dimension for Kafka is storage, for example when you need long topic retention. In those scenarios, tiered storage can (sometimes dramatically) reduce the number of required EBS volumes, which will (sometimes dramatically) reduce your costs.
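To put rough numbers on that, consider a hypothetical workload writing 100 MiB/s with 7-day retention: that’s about 60 TiB of data, or roughly 175 TiB across three replicas. At gp3’s ~$0.08/GiB-month, that’s over $14,000/month in EBS storage alone. Tier everything but the last few hours to S3 at roughly $0.02/GiB-month, and the same retention costs on the order of $1,500/month.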

In theory, tiered storage should also ease operational burden by reducing the amount of data stored on each broker’s local disk, which (among other benefits) should make scaling the cluster in and out faster. Some particularly enthusiastic users are even optimistic that taken to its logical conclusion, tiered storage could turn Kafka into a modern data lake.

The excitement around the addition of tiered storage in Apache Kafka is palpable, most likely because there has been very little innovation in the Kafka space over the last few years, while users remain frustrated with the status quo in terms of cost, complexity, and operational burden. Could tiered storage be the silver bullet that everyone’s been waiting for?

Unfortunately, the answer seems to be no. I have talked to hundreds of Kafka (and other streaming systems) users in the last year, many at the largest and most advanced engineering organizations in the world, and I have yet to find a single happy and satisfied tiered storage user.

What I did find was many people who evaluated it, realized it wouldn’t reduce their costs very much, and moved on. I also found some people who tried to adopt it, ran into a number of operational issues and friction points, and decided it wasn’t worth it. I even found a few people who managed to get into production with it, but they had a lot of extra gray hairs to show for it, and almost all of them were underwhelmed with the end result.

I realize that everything I'm saying runs contrary to popular sentiment in the industry, so let me explain how I got here.

The Tarpit

Tiered storage can reduce costs for some workloads, but it ends up doing so in a way that is penny wise and pound foolish. Specifically, the main issue with tiered storage is that it doesn’t address the two primary problems that people actually have with Kafka today:

  1. Complexity and Operational Burden
  2. Costs (specifically, inter-zone networking fees)

In fact, I’ll argue that it doesn’t just fail to solve these problems, it actively makes them worse.

Increased Complexity and Operational Burden

The first problem with tiered storage is that instead of making Kafka simpler and easier to deal with, it actually makes it more difficult and more complicated. Retrofitting an existing system like Apache Kafka to account for the fact that some portion of the data is stored in a completely different storage medium, with completely different cost and performance characteristics, is incredibly complex. The end result is a finicky and brittle system with a never-ending series of limitations, sharp edges, and gotchas.

For example, reasoning about performance becomes incredibly difficult. Consider a Kafka consumer that wants to read a topic from the beginning. In normal circumstances, fetching a batch of records from Kafka should only take tens or hundreds of milliseconds. But with tiered storage enabled, it can take tens of seconds or even minutes to read the first batch of records.
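To make the scenario concrete, here’s a minimal consumer sketch (the topic name is hypothetical) that replays from the earliest offset and times its first fetch:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ReplayFromBeginning {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-test");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // New group + earliest: start reading from the beginning of the log.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events"));
                long start = System.nanoTime();
                // With purely local storage, this first poll typically returns in
                // tens of milliseconds. With tiered storage, it can stall while
                // the broker pages whole log segments back in from object storage.
                ConsumerRecords<String, String> first = consumer.poll(Duration.ofMinutes(5));
                System.out.printf("first batch: %d records in %d ms%n",
                        first.count(), (System.nanoTime() - start) / 1_000_000);
            }
        }
    }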

The reason for this slowness is that many tiered storage solutions don’t support reading data incrementally from the object store. Instead, these implementations require that entire log segments be downloaded from the remote object store before any reads can be served. This is problematic because log segments can be hundreds of MiBs or even gigabytes in size. While these large segments are being downloaded, the consumer is completely blocked and unable to make any progress.

Some tiered storage implementations support incremental reads via a chunking mechanism, but the log segment chunks still have to be downloaded and cached on disk after they’re paged in from the object store. Writing the downloaded chunks to disk competes with the live workload for the limited amount of IOPS (and disk space) available on the Kafka brokers. This means that a single consumer, replaying a tiny amount of historical data, can easily disrupt all of the live consumers serving production traffic. This also means that while tiered storage does reduce the disk size on the brokers, there are limits to how small these disks can get. We need to leave enough headroom for both the live and historical workload. 
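A simplified sketch of that read path makes the problem obvious. This is not any particular implementation’s code, just the general shape: on a cache miss, serving a historical read turns into a local disk write.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical chunk cache of the kind described above: remote segment
    // chunks are written to local disk before a fetch can be served.
    public class ChunkCache {
        private final Path cacheDir;

        public ChunkCache(Path cacheDir) {
            this.cacheDir = cacheDir;
        }

        // Serve a chunk, paging it in from the object store on a cache miss.
        public byte[] read(String segmentId, long chunkIndex) throws IOException {
            Path cached = cacheDir.resolve(segmentId + "." + chunkIndex);
            if (Files.exists(cached)) {
                return Files.readAllBytes(cached); // cache hit: local disk read
            }
            byte[] chunk = fetchFromObjectStore(segmentId, chunkIndex);
            // Cache miss: the historical read consumes local disk IOPS (and
            // space), competing directly with the live workload.
            Files.write(cached, chunk);
            return chunk;
        }

        // Stand-in for a ranged GET against the object store.
        private byte[] fetchFromObjectStore(String segmentId, long chunkIndex) {
            return new byte[4 * 1024 * 1024]; // pretend chunks are 4 MiB
        }
    }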

Further compounding the issue, a cloud disk’s available IOPS is often tied to the amount of provisioned storage; AWS’s gp2 EBS volumes, for example, provide just 3 IOPS per provisioned GiB. But the whole point of tiered storage is to reduce the amount of disk storage that has to be provisioned in the first place!

Issues like this can be worked around with very careful capacity planning, benchmarking, and configuration tuning, but it is painful. Furthermore, the issues that I have described so far tend to lie dormant at first, and then rear their heads at the worst possible time: when something has gone wrong and historical data needs to be replayed without sacrificing the reliability of the live workload.

Needing to read historical data doesn’t just happen in emergencies either. In the last few months, I’ve run into three different organizations trying to move off of a tiered storage data streaming system, and they describe being stuck because they can’t read their data out of tiered storage without disrupting live traffic.

Finally, and perhaps worst of all, the complexity introduced by tiered storage is purely additive. On top of all of the operational problems you’re already suffering from:

  • Partition rebalancing
  • Finicky consensus mechanisms
  • Performance problems
  • Inelasticity
  • Hotspots
  • Broker imbalances
  • etc.

Tiered storage adds a whole new set of failure modes and operational tasks, and eliminates none of the existing ones.

No Reduction in Inter-Zone Networking

Operations aside, tiered storage also promises to reduce the total cost of Kafka. Unfortunately, it usually doesn’t live up to this promise either, and the reason might surprise you: cloud networking fees.

If you thought cloud disks were expensive, cloud networking will really make your eyes water. Inter-zone networking fees are the silent killer of data streaming workloads. To put this into concrete terms, every GiB of data written through an Apache Kafka cluster running in a standard HA three availability zone setup incurs about 5.3 cents in inter-zone networking fees. By comparison, storing a GiB of data in S3 for a month costs around 2 cents.

[Figure: data path for a standard HA Kafka cluster running in three availability zones.]
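Where does 5.3 cents come from? A back-of-the-envelope breakdown, assuming the typical $0.01/GiB charge on each side of an inter-zone transfer ($0.02/GiB total): with clients spread evenly across three zones, a producer’s write lands on a leader in a different zone two-thirds of the time (2/3 × $0.02 ≈ $0.013/GiB), and the leader then replicates every GiB to two followers in the other two zones (2 × $0.02 = $0.04/GiB). That’s roughly $0.053 per GiB written, before a single consumer has read a byte.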

So how does tiered storage help reduce inter-zone networking costs? Unfortunately, it doesn’t. At all. To ensure high availability and durability, all data that you write to a cluster with tiered storage enabled still has to be replicated to three disks in three different availability zones before being acknowledged and considered durable. Tiering does not happen until later. That’s a big problem because inter-zone networking is often responsible for 80%+ of a streaming workload's total cost.

This means that no matter how small the brokers’ local disks are, or how quickly the tiered storage implementation copies data to object storage, streaming systems that leverage a tiered storage architecture will usually cost almost exactly as much as a traditional Kafka cluster for any workload that is not storage-bound, even if they only store data on local disks for one minute.

It Doesn’t Enable New Use Cases

Okay, fine. So maybe tiered storage doesn’t save you as much money as you hoped, and the performance is unpredictable, but you can provision EBS volumes with dedicated IOPS, run some very careful gamedays, and verify that everything works as you expect. It’s a little more complex to deal with, but the benefits are well worth it, right? After all, tiered storage isn’t just about reducing costs; it’s also about all the new stuff it enables you to do.

For example, since all of the data is stored in object storage, in theory, you could also just run batch jobs/queries against that data directly, bypassing the brokers entirely. Voila, your data stream is now a data lake.

That would indeed be useful if it were possible, but in practice, it’s not. The data in the object store is being actively “managed” by the brokers (compacted, retention enforced, etc.), so querying the data files directly would result in anomalies like missing and duplicate records, because there is no snapshot isolation. And unless you figure out how to provide a single pane of glass spanning broker disks and object storage, you will only be able to query data that has fallen out of the hot set, which means you can only query stale data.

You can still query all of the data using the Kafka APIs, but in practice, you’ll be bottlenecked by the maximum throughput of however many brokers happen to be hosting the topic-partitions that you want to query. For example, if you only want to query a subset of the topics, and those topics are only hosted on a subset of your brokers, you may only be able to use a tiny fraction of your cluster’s total capacity to run the query even though your other brokers are all sitting idle.

In theory, when you need to scale your read capacity, you could temporarily mark some of the brokers as “read replicas” for specific topic-partitions such that they have all the metadata for the data tiered to object storage and can serve reads, but don’t participate in any leader elections, and you’d just need to make sure they don’t also replicate the data that hasn’t been tiered yet so you don’t waste too much disk space, and… well, this is all starting to sound like an awful lot of work.

Tiered Storage in Disguise

I’m not the only person in the streaming space who is disappointed with the false promises of tiered storage. Many users and experts in the industry are beginning to recognize that tiered storage isn’t the silver bullet they expected it to be. As a result, many systems and vendors are taking the path of least resistance: rebranding.

Look out for the following phrases:

  1. “Cloud First Storage” (tiered storage)
  2. “Mostly Stateless” (tiered storage)
  3. “Use disks as a cache in front of object storage” (tiered storage)

These terms are all just the same old tiered storage in disguise. I’ve written about this before, but it bears repeating: It’s much easier for existing players to make incremental changes, and then repackage and rebrand their existing products than it is for them to do the right thing for their users: rebuild from scratch for the cloud.

Zero Disks Would Be Better

Hopefully, by now you’re convinced that while tiered storage does have some benefits, namely reducing storage costs for some workloads, it’s not the radical improvement you were promised: it certainly won’t simplify operations or make your life any easier. Quite the opposite, actually; it will make your Kafka workloads more unpredictable and more difficult to manage.

I’ll conclude with this. Tiered storage is all about using fewer disks. But if you could rebuild streaming from the ground up for the cloud, you could achieve something a lot better than fewer disks – zero disks. The difference between some disks and zero disks is night and day. Zero disks, with everything running directly through object storage with no intermediary disks, would be better. Much better. But more on that in a future blog post.
