The Original Sin of Cloud Infrastructure

Mar 14, 2024
Richard Artoul

The origins of big data infrastructure

Many of today’s most highly adopted open source “big data” infrastructure projects – Cassandra, Kafka, Hadoop, and the like – follow a common story. A large company, startup or otherwise, faces a unique, high-scale infrastructure challenge that’s poorly supported by existing tools. It creates an internal solution for its specific needs, and then later (kindly) open sources it for the greater community to use. Now even smaller startups can benefit from the work and expertise of these seasoned engineering teams. Great, right?

Almost a decade later, I’d guess most developers would say that it’s complicated. Adopting big data systems created at other companies and later open sourced did not turn out to be the time-saving, efficiency-boosting gift many thought it would be. Unfortunately, while the big tech companies open sourced their code, they didn’t open source their expertise or tooling. Many early adopters were burned badly when they realized just how difficult these “big data” infrastructure technologies are to operate and scale. The issue might be even more fundamental, though. When developers adopt software under a radically different set of circumstances than it was designed for, can we really blame the software when things go wrong?

Early in my career at Uber, I was complaining to a new storage director (who had just joined us from Apple) about all the problems we were running into with Cassandra. “I have no idea why Uber has so much trouble with Cassandra,” he told me. “We ran 100x as many Cassandra nodes at Apple and it worked great.” I was skeptical, but chalked it up to a skill issue. Maybe we really just didn’t know what we were doing.

Years later I had an epiphany while talking to an engineer from the Apple Cassandra team: Apple runs a custom build of Cassandra that bears almost no resemblance to the open source Cassandra that Uber was running! The basic primitives were the same, but between the custom plugins, modules, tooling, and orchestration they had created, they might as well have been running a completely different database.

As these big data infrastructure projects grew in adoption, it quickly became obvious to everyone that the commercial opportunity around this infrastructure was massive. Today, there are many public companies built on this exact business model. You can trace three distinct phases in how this market played out, and in each of them, end users lose.

Phase 1: selling tooling, automation, and support

The earliest infrastructure companies in this market tried to address these problems – the sheer difficulty of running the kind of open source infrastructure software we’re talking about – by selling tooling, automation, and support for the software in question. These companies had great stories: technical founders who built and open sourced something innovative, then grew a company and ecosystem around it.

But the effect of this kind of monetization on the OSS itself wasn’t always positive. The creators of OSS trying to commercialize it themselves ended up creating a perverse set of incentives (that they probably never intended in the first place):

  • In many cases, development of the original OSS was severely degraded, because good tooling was gate-kept by the vendors who were also the primary sponsors of the projects…
  • While at the same time, these infrastructure companies were starting to make quite a bit of money off of how difficult the software was to use. 

These incentives put vendors in a tough spot: if you improve the OSS, you make less money off of it. In a sense, these companies had indirectly created a disease that they could profit nicely from by selling the cure.

Phase 2: selling infrastructure and managed services

Improved tooling and support helped make the infrastructure easier to run, but didn’t really do enough to chip away at the problem. And perhaps most importantly, the cloud continued to gain adoption. At this point most new startups were defaulting to AWS, and even larger, successful tech-first companies like Pinterest were running on public clouds.

The next logical step for vendors was to provide infrastructure “as a service” to their customers. This way users would not be exposed to any of the messy details of operating infrastructure technology at scale, and the vendor could have more control over the environment in which the software ran. Users would get a significantly better experience, and the vendor would take advantage of economies of scale to drive down costs. Infrastructure as a service would be both easier and cheaper. Everybody wins!

In practice, this did not happen. Many early adopters were shocked to discover that while they had been promised cost savings, in reality their costs actually increased 5 to 10x over their previously self-hosted solutions. It turns out that taking software designed for on-prem data centers and lifting and shifting it into the cloud results in bad unit economics at any scale, and paying a vendor to abstract the problem away from you only makes it worse.

This is the original sin of cloud infrastructure. In their rush to take advantage of this new market, the infrastructure companies never redesigned their software to actually take advantage of cloud primitives. And they can’t! Because even if they wanted to, doing so would slowly destroy the very businesses they had built. This is where the sin has us stuck: between infrastructure that’s too hard to run, and vendors that are too expensive to pay for. Worse, this new monetization strategy increased the vendor’s incentive to hamper the effectiveness of the open source infrastructure technologies even further.

Apache Kafka is practically the poster child for this problem. Kafka’s architecture made a lot of sense for the data center environments it was designed for, but its approach to replication makes it eye-poppingly expensive and difficult to run in cloud environments.

To make this concrete: transmitting a single GiB of data through Kafka in the cloud costs more than storing that GiB in S3 for two months, even in the best case scenario where you’ve perfectly configured your Kafka cluster. Adding networking ingress and egress fees between your cloud account and a vendor’s cloud account for all of your producers and consumers compounds the problem even further.
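To see roughly where that comparison comes from, here’s a back-of-the-envelope sketch. The prices are assumed AWS list prices (about $0.01/GiB in each direction for inter-zone traffic and about $0.023/GiB-month for S3 standard storage; both vary by region and over time), and the Kafka setup is the ideal one: zone-aligned producers and follower-fetching consumers, so the only cross-zone traffic is replication itself.

    # Assumed AWS list prices; these vary by region, so treat the exact
    # ratio as illustrative rather than authoritative.
    INTER_AZ_PER_GIB = 0.01 + 0.01   # egress + ingress per cross-zone copy
    S3_PER_GIB_MONTH = 0.023         # S3 Standard storage

    # With replication factor 3 and a perfectly zone-aligned cluster, each
    # GiB produced is still copied to two other zones.
    kafka_networking_per_gib = 2 * INTER_AZ_PER_GIB  # ~$0.040/GiB
    s3_two_months_per_gib = 2 * S3_PER_GIB_MONTH     # ~$0.046/GiB

    print(f"Kafka cross-zone replication: ~${kafka_networking_per_gib:.3f}/GiB")
    print(f"Two months of S3 storage:     ~${s3_two_months_per_gib:.3f}/GiB")

Even in this ideal case, replication alone costs about as much as two months of S3 storage, and every GiB of cross-zone produce or consume traffic that a real, imperfectly tuned cluster generates pushes the ratio higher.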

Phase 3: “lift and shift” BYOC

More recently, vendors have come up with a third approach: Bring Your Own Cloud (BYOC) deployments. The idea here is solid: split the software stack into two components, a data plane and a control plane. Run the data plane in the customer’s cloud account, and the control plane in the vendor’s. In theory, this should provide the best of both worlds: the low costs of self-hosting, with the low overhead of a fully managed SaaS.

This is a particularly appealing approach for Kafka vendors because Apache Kafka is inherently networking heavy, and as we just noted above, cloud networking is expensive. Keeping all of the networking inside the customer’s cloud account has the potential for massive cost savings for everyone involved.

But existing BYOC products usually don’t work well in practice for two reasons. First, the vendors never actually redesigned their software to create a real data plane / control plane split. Instead, they punch a gaping hole between their cloud account and the customer’s cloud account, deploy a giant constellation of containers and custom controllers into a special Kubernetes cluster, and then hire a team of skilled engineers to remotely manage all of this infrastructure. This works (for a while), but ultimately the unit economics are not sustainable for the vendor, which means this business model has an expiration date.

Second, while existing BYOC Kafka solutions do eliminate the networking fees between the customer’s cloud account and the vendor’s cloud account, they do nothing to address the egregious inter-zone networking and storage fees incurred within the customer’s cloud account. These costs can only be addressed by redesigning all the software from first principles for the cloud (like I said, the original sin). 

In summary, BYOC (done wrong) is just a new way to try to shove on-prem shaped software into a cloud-shaped hole. The only difference is that instead of trying to lift and shift on-prem software into the vendor’s cloud account, vendors are now trying to lift and shift on-prem software into the customer’s cloud account. Better, but still missing the point.

Infrastructure should be purpose-built for the cloud

For a long time now, the software industry has been stuck between a rock and a hard place for critical infrastructure. Companies either had to spend years training teams and writing custom tooling to maintain open source infrastructure software at scale, or go broke paying a vendor to do it for them. As a result, systems like Apache Kafka are only used when they can provide an overwhelming amount of business value. This has to change!

We need companies out there building this critical infrastructure from the ground up, designed for the cloud, and with real world, “normal” company use cases in mind. We need the systems that power our businesses to be resilient, transparent, and trivial to manage. We need them to be created for the cloud, instead of the cloud being an afterthought, and we need to rethink the habit of blindly adopting open source data infrastructure that wasn’t designed for what we need. Maybe you could call this “purpose-built infrastructure” – infra that’s designed with purpose for real world use cases, not infra that’s retrofitted onto the use case you happen to have.

This is the promise of BYOC done right – it’s how great things can be when infrastructure is truly designed for the cloud. If a vendor is going to run software in the customer’s cloud account, that software has to be trivial to manage. Trivial as in “completely stateless and almost impossible to mess up”, trivial as in “if you accidentally delete the entire Kubernetes cluster running the software, you won’t lose any data”, trivial as in “scaling in or out just means adding or removing containers”. Not trivial as in “we wrote a really sophisticated Kubernetes operator to manage this extremely stateful software that thinks it’s running in a datacenter”. In other words, not all BYOC implementations are created equal.

Our product, WarpStream, is one of many new infrastructure projects that fit this bill. Building on our experience creating systems like M3DB at Uber and Husky at Datadog, we designed it purposefully for the cloud, and the experience that it enables is a meaningful step up from what teams are used to. Yes: we’re a vendor talking our book. Yes, we will make money off of how difficult it is to stream data reliably at scale. But our customers – and customers of a new wave of startups like us – will actually reap the benefits of what they pay for. WarpStream just works, is cheaper than even self-hosting Apache Kafka (let alone paying a vendor), and can be deployed and maintained by a single engineer irrespective of scale.

When we were designing WarpStream, we knew that ultimately it would have to compete with self-hosted open source Apache Kafka if we were going to truly disrupt the streaming space. That observation led to two inevitable design choices:

  1. WarpStream had to be designed from the ground up around cloud unit economics because the efficiency bar was incredibly high. The total cost of ownership for running WarpStream had to be cheaper than self-hosting Apache Kafka even after taking into account our control plane / licensing fees.
  2. WarpStream had to be BYOC-first. There is simply no other model that makes sense for “big data” and networking heavy mission-critical infrastructure like Kafka.

We didn’t fork Apache Kafka and try to lift and shift it into the cloud. We evaluated every design decision from first principles and created a completely new implementation that takes advantage of everything modern cloud environments have to offer. For example:

  1. WarpStream uses object storage as the primary and only storage in the system. As a result, it has zero local disks anywhere in the stack. This means that instead of having to become experts in Kafka’s replication and durability guarantees, every WarpStream user gets to delegate that problem to the AWS S3 team instead. Either the data was written to object storage and appropriately replicated and available, or it wasn’t. There is no in-between.
  2. WarpStream never manually replicates data between availability zones. This eliminates the inter-zone networking fees we discussed earlier and is the primary reason that WarpStream can reduce Kafka costs by 4-10x even compared to self-hosting.
  3. WarpStream has an extremely hard split between the data plane and the control plane. This reduces costs by ensuring the amount of networking traffic between the customer’s cloud account and WarpStream cloud is minimal, while also ensuring privacy and data sovereignty. This split also makes it possible for WarpStream to provide a “managed SaaS” experience within the customer’s environment by moving as much of the tricky control plane, metadata, and consensus logic into our cloud account while keeping the entire data plane in the customer’s cloud account.

And that’s just the tip of the iceberg. Every WarpStream design decision was evaluated from first principles to create a system that is cloud native and BYOC-first. This inevitably led to a radically different architecture than anything else out there.

When we initially announced the developer preview of WarpStream, we were completely blown away by the reaction. We received messages from companies of all shapes and sizes that had been struggling with the cost and operational burden of Apache Kafka for years and couldn’t wait to use WarpStream, but were waiting for our first GA release. Since then, we’ve been working closely with our early adopters and design partners to make WarpStream the most cost effective and easiest to operate Kafka implementation, as well as one of the most robust and well-tested.

I use the phrase “Kafka implementation” very intentionally. Names like Postgres and Kafka used to refer to systems with a single implementation; today, they increasingly name protocols with many competing implementations. This is good! The one thing we all want to avoid is 100 new vendors with 100 new pieces of infrastructure, none of them interoperable or compatible. Users will win in a future where major use cases like data streaming have a standard protocol, and vendors compete on implementation. Users should be able to switch between vendors without months of migration effort – and these kinds of incentives will lead to what we all want the most: the best possible infrastructure for everyone.

With that in mind, today we’re excited to announce that both the BYOC and Serverless flavors of our Apache Kafka protocol compatible product are now GA. We’re also excited to announce that we raised 20 million dollars from Amplify Partners and Greylock Partners to fund our vision for cloud native, BYOC-first infrastructure. If you want to learn more, contact us or book a demo!
