Deterministic Simulation Testing for Our Entire SaaS

Mar 12, 2024
Richard Artoul

Deterministic Simulation Testing

Deterministic simulation testing is fast becoming the gold standard for how mission-critical software is tested. The approach was first popularized by the FoundationDB team, who spent 18 months building a deterministic simulation framework for their database before ever letting it write or read data from an actual physical disk. The results speak for themselves: FoundationDB is widely considered to be one of the most robust and well-tested distributed databases, so much so that Kyle Kingsbury (of Jepsen fame) declined to test it because its deterministic simulator already stress-tested FoundationDB more than the Jepsen framework ever could.

The WarpStream team utilized FoundationDB heavily at Datadog when we built Husky, Datadog’s columnar storage engine for event data. Over the course of our careers, our team has operated (and been on-call for) almost every database on the market: M3DB, etcd, ZooKeeper, Cassandra, Elasticsearch, Redis, MongoDB, MySQL, Postgres, Apache Kafka and more. In our experience, FoundationDB stands in a league of its own in terms of correctness and reliability due to its early investment in deterministic simulation testing.

A more recent example of a system that leverages this approach to testing is TigerBeetle, a financial transactions database that uses deterministic simulation testing to make itself one of the most robust financial OLTP databases available today.

When we were designing WarpStream, we knew that it wouldn’t be enough to just replace Apache Kafka with something cheaper and easier to operate. Kafka is the beating heart of many companies’ most critical infrastructure, and if we were to stand any chance of convincing those organizations to adopt WarpStream, we’d have to compress 12+ years of production hardening into a much shorter time frame. We accelerated this process with our architectural decision to rely on object storage as the only storage in the system, bypassing many of the tricky problems of ensuring data durability, availability, and replication at scale. Still, the fact that WarpStream leverages object storage is only a small part of ensuring the correctness of the overall system.

Antithesis

When we first heard about Antithesis, we could hardly contain our excitement. Antithesis has created the holy grail for testing distributed systems: a bespoke hypervisor, built by the same people who created FoundationDB, that deterministically simulates an entire set of Docker containers and injects faults. For a group of gray-haired distributed systems engineers, seeing Antithesis in action felt like a tribe of cavemen stumbling upon a post-industrial revolution society. As we spoke more with the Antithesis team, an idea began to crystallize: we could use Antithesis to deterministically simulate not only WarpStream, but our entire SaaS!

WarpStream was built differently from most traditional database products. It was designed from day one with a true data plane / control plane split. There are two primary components to WarpStream. First, the Agents (data plane) act as “thick proxies” and expose the Kafka protocol to clients. The Agents also take care of all communication with object storage, layering in batching and caching to improve performance and keep costs low.


Second is the WarpStream control plane which has two major components:

  1. The metadata store that tracks cluster metadata and performs remote consensus.
  2. Our SaaS software that manages different tenants’ metadata stores, API keys, users, accounts, etc.

The metadata store only has two dependencies:

  1. Any cloud KV store 
  2. Object storage

The SaaS software adds one additional dependency: a traditional SQL database for managing users, organizations, API keys, etc. Looking at WarpStream’s minimal dependencies, we thought: why not test the entire customer experience, from initial signup to running Kafka workloads?

We created a docker-compose file that contains the following components:

  1. Several WarpStream Agents
  2. Several WarpStream Control Plane nodes
  3. Several Apache Kafka clients
  4. A KV store
  5. Postgres
  6. An object store (localstack)

With the help of the Antithesis team, we wrote a test workload that started all of those services, signed up for a WarpStream account, created a virtual cluster, and then began producing and consuming data. The workload was carefully structured so that we could assert on a variety of different important properties that WarpStream must maintain at all times.

The test workload consists of multiple producers that are each assigned a unique ID and write records to a small set of topics. These producers synchronously write a few small JSON records that contain the producer’s ID, a counter (a monotonic sequence number for that producer), and a few other properties. We repeat the same fields in the record’s key, in its value, and in a header to ensure we never accidentally shuffle them around. The consumer side of the workload polls all the topics and all the partitions and asserts that:

  1. The topic and partition the record was consumed from matches the topic and partition the record was produced to.
  2. The offsets for each record in each partition are monotonic.
  3. The sequence numbers for each producer are monotonic, i.e. if we group the records by <Topic, Partition, ProducerID> the sequence number encoded in the record is monotonically increasing.

The consumers store all of the records from each polling iteration and can assert that a record seen at offset X in a previous poll still exists in future polls. This ensures that WarpStream doesn't lose or reorder data as, for example, background compaction reorganizes the cluster’s data for more efficient access.
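
To make those assertions concrete, here is a minimal sketch of how the consumer side of the workload might express the first three checks in Go. Every name and type here (Record, Checker, etc.) is our own invention for illustration; this is not WarpStream’s actual test code, and it omits the cross-poll persistence check described above.

```go
package workload

import "fmt"

// Record mirrors the JSON payload described above: the producer's ID, its
// per-producer sequence number, and the topic/partition it was produced to.
// Field names and types are illustrative only.
type Record struct {
	ProducerID string `json:"producer_id"`
	SeqNum     int64  `json:"seq_num"`
	Topic      string `json:"topic"`
	Partition  int32  `json:"partition"`
}

// producerKey groups records by <Topic, Partition, ProducerID> for assertion 3.
type producerKey struct {
	topic      string
	partition  int32
	producerID string
}

type partitionKey struct {
	topic     string
	partition int32
}

// Checker applies assertions 1-3 to the records of a single poll iteration,
// in the order they were consumed.
type Checker struct {
	lastOffset map[partitionKey]int64 // last offset seen per topic/partition
	lastSeq    map[producerKey]int64  // last sequence number seen per producer
}

func NewChecker() *Checker {
	return &Checker{
		lastOffset: make(map[partitionKey]int64),
		lastSeq:    make(map[producerKey]int64),
	}
}

// Check verifies a single consumed record. consumedTopic, consumedPartition,
// and offset come from the fetch response; r is decoded from the record's
// value (and should match the copies in the key and header).
func (c *Checker) Check(consumedTopic string, consumedPartition int32, offset int64, r Record) error {
	// Assertion 1: the record was consumed from the topic/partition it was produced to.
	if r.Topic != consumedTopic || r.Partition != consumedPartition {
		return fmt.Errorf("record produced to %s/%d was consumed from %s/%d",
			r.Topic, r.Partition, consumedTopic, consumedPartition)
	}

	// Assertion 2: offsets within each partition are monotonically increasing.
	pk := partitionKey{consumedTopic, consumedPartition}
	if prev, ok := c.lastOffset[pk]; ok && offset <= prev {
		return fmt.Errorf("non-monotonic offset %d after %d in %s/%d",
			offset, prev, consumedTopic, consumedPartition)
	}
	c.lastOffset[pk] = offset

	// Assertion 3: sequence numbers are monotonic per <Topic, Partition, ProducerID>.
	sk := producerKey{consumedTopic, consumedPartition, r.ProducerID}
	if prev, ok := c.lastSeq[sk]; ok && r.SeqNum <= prev {
		return fmt.Errorf("non-monotonic sequence number %d after %d for producer %s in %s/%d",
			r.SeqNum, prev, r.ProducerID, consumedTopic, consumedPartition)
	}
	c.lastSeq[sk] = r.SeqNum
	return nil
}
```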

These assertions address many of the classes of bugs found in previous Jepsen tests of Apache Kafka and other Kafka-compatible systems. For example, prior Jepsen tests have caught bugs like:

  1. Loss of previously acknowledged writes. If a write was acknowledged but failed to appear in the output for that topic-partition in the future, assertion 1 or 3 above would fire.
  2. Violation of producer idempotency (i.e., producing duplicate records without the producer itself crashing or restarting). Antithesis automatically tests our Idempotent Producer implementation by disrupting the network between the producer client and the Agent, or between the Agent and the control plane, which triggers internal retries inside the Kafka client. A duplicate would cause the sequence number from a producer to stay the same or decrease, causing assertion 3 above to fire.
  3. Records appearing in different topic-partitions than they were originally written to. This is addressed by assertions 1 or 3 above.

What’s the big deal?

At this point you might be scratching your head a little bit and wondering: “What’s the big deal here? Isn’t this just a really fancy integration test?!” Yes and no. Before we started using Antithesis, WarpStream already had a pretty robust set of stress tests that we called our “correctness tests”.

These tests do essentially everything we just described, but in a regular CI environment. Our correctness tests even inject faults all over the WarpStream stack using a custom chaos injection library that we wrote. These tests are incredibly powerful, and they caught a lot of bugs. We would go so far as to say that investing deeply in those correctness tests is one of the main reasons we were able to develop WarpStream as efficiently as we did.

Just like our existing correctness tests, the Antithesis hypervisor automatically injects faults, latency, thread hangs, and restarts into the workload. However, unlike our correctness tests, the Antithesis hypervisor is really smart and automatically fuzzes the system under test in an intelligent way.

Antithesis automatically instruments your software to measure code coverage and build statistics about the execution frequency of each code path. This enables Antithesis to detect “interesting” behavior in the test (such as infrequent code paths getting exercised, or rare log messages being emitted).

When Antithesis detects interesting or rare behavior, it immediately snapshots the state of the entire system before exploring various different execution branches. This means that Antithesis is much better at triggering rare or unlikely behavior in WarpStream than our existing correctness tests were.

Also, since Antithesis runs the entire software stack in a deterministic simulator, it can actually run the simulation faster than wall clock time. Similar to FoundationDB, WarpStream makes heavy use of timers and batching to improve performance. Anytime a WarpStream Goroutine does the equivalent of time.Sleep(), the Antithesis hypervisor doesn’t actually have to wait. On top of that, the Antithesis hypervisor explores code branches concurrently. All of this adds up in a meaningful way, such that Antithesis can cost-effectively compress years of stress testing into a much shorter time frame.

It’s hard to over-emphasize just how transformative this technology is for building distributed systems. For all intents and purposes, it really does feel like a time traveler arrived from 20 years in the future and gave us their state-of-the-art software testing technology. Of course, it’s not actually magic. Antithesis is the result of dozens of the smartest software engineers, statisticians, and machine learning experts pouring their hearts and souls into the problem of software testing for 5 years straight. But to us mere mortals, it does feel a lot like magic.

We found some bugs

Let’s look at a few example runs that Antithesis generated for us.

Antithesis ran the WarpStream workload for 6 wall clock hours, during which it simulated 280 hours of application time. It took about 160 “application hours” for Antithesis to “stall” and stop discovering new “behaviors” in the WarpStream workload. This means that running the tests for longer than 160 hours has diminishing returns, and that we should instead invest in making the test itself more sophisticated if we want to exercise the codebase more. Great feedback for us!

But think about that for a moment: even after 140 hours of injecting faults, randomizing thread execution, and automatically detecting that something interesting or rare had happened and branching to investigate further, Antithesis was still “discovering” new behaviors in WarpStream. We could hire 100 distributed systems engineers and make them write integration tests for an entire year, and they probably wouldn’t be able to trigger all the interesting states and behaviors that a single Antithesis run covered in 6 hours of wall clock time.

As just one example of how powerful this is, on the first day we started using Antithesis it caught a data race in our metrics instrumentation library that had been present since the first month of the project.

Our correctness tests had run in our regular CI workflows for literally tens of thousands of hours by then, with the Go race detector enabled, and never once caught this bug. Antithesis caught this bug in its first 233 seconds of execution.

A data race in the instrumentation library isn’t that exciting, though. What about an extremely rare data loss bug that is the result of both a network failure and a race condition? That’s more exciting!

To minimize the number of S3 PUTs that WarpStream users have to pay for, the Agents buffer Kafka Produce requests from many different clients in-memory for ~250ms before combining the batches of data into a single file and flushing it to object storage.

In some scenarios, like if write throughput is high, there will be multiple outstanding files being flushed to object storage concurrently. Once flushing the files succeeds, committing the metadata for the flushed files to the control plane can be batched to reduce networking overhead. This is implemented using a background Goroutine that periodically scans the list of “flushed but not yet committed” files.
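
As a rough sketch of that pattern, the Agent-side bookkeeping can be thought of as a list of “flushed but not yet committed” object keys that a background Goroutine drains in batches. The names and structure below are our own invention for illustration, not WarpStream’s actual implementation:

```go
package agent

import (
	"context"
	"sync"
	"time"
)

// committer tracks files that were successfully flushed to object storage but
// whose metadata has not yet been committed to the control plane.
type committer struct {
	mu      sync.Mutex
	pending []string // object keys of flushed-but-not-yet-committed files
}

// markFlushed must only be called after the object storage PUT has succeeded.
// The regression described below effectively let a failed flush reach this
// point for a very brief window of time.
func (c *committer) markFlushed(objectKey string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pending = append(c.pending, objectKey)
}

// run periodically drains the pending list and commits the metadata for all of
// the flushed files to the control plane in a single batched request.
func (c *committer) run(ctx context.Context, commitBatch func([]string) error) {
	ticker := time.NewTicker(5 * time.Millisecond) // poll interval from the text
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Take ownership of the current batch under the lock so that files
			// flushed while we're committing are picked up on the next tick.
			c.mu.Lock()
			batch := c.pending
			c.pending = nil
			c.mu.Unlock()
			if len(batch) == 0 {
				continue
			}
			if err := commitBatch(batch); err != nil {
				// Commit failed: put the batch back so the next tick retries it.
				c.mu.Lock()
				c.pending = append(batch, c.pending...)
				c.mu.Unlock()
			}
		}
	}
}
```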

While refactoring the Agent to add speculative retries for flushing files to object storage, we subtly broke the error handling on this path so that, for a very brief window of time, a file which failed to flush would be considered successful and ready to commit to the control plane metadata store. In program order (i.e., the linear flow of the code, ignoring concurrency), the window in which the background Goroutine that commits metadata could observe the failed file as “successful” would be nearly impossible to hit: the Goroutine only polls for successful files every five milliseconds, and the time between the two state transitions would, in the common case, be less than a microsecond!

This bug is the manifestation of two unlikely events: a file failing to flush and a specific thread interleaving that should be extremely rare in practice. Despite how unlikely these events are to occur together, on a long enough time-scale, this bug would have resulted in data loss at some point.

Instead, thanks to Antithesis’ powerful fuzzer and fault injector, this rare combination of events happened roughly once per wall clock hour of testing. We’d been running a build with this bug in our staging environment and had not encountered it even once, let alone once per hour; if we had, we would have noticed immediately when a subsequent background compaction failed due to the missing file in object storage. We’ve since fixed the regression so that the invalid, temporary state transition cannot occur.

Why not Jepsen?

The obvious question you might be asking yourself at this point is: Why use Antithesis instead of a traditional Jepsen test? It’s a good question, and one we asked ourselves before embarking on our journey with Antithesis.

We’re big fans of Jepsen and have consumed almost every published report. However, after speaking with the Antithesis team and spending a few months integrating with their platform, we feel strongly that deterministic simulation testing with tools like Antithesis is a much more robust and sustainable path forward for the industry. Specifically, we think Antithesis’ approach is better than Jepsen’s for a few reasons:

  1. The Antithesis technology is more robust, and much more likely to catch bugs than the Jepsen harness. There is simply no other equivalent (that we’re aware of) to Antithesis’ custom hypervisor, and its ability to automatically instrument distributed systems for code coverage and effectively “hunt” for bugs. Yes, the Jepsen framework will inject faults into a running environment in an effort to trigger bugs and edge-case behavior, but this approach is crude in comparison to what Antithesis does.
  2. Antithesis integrates natively into how our engineers are used to working. The entire test setup is expressed using standard docker-compose files and Docker images, and Antithesis tests are kicked off using GitHub Actions that push WarpStream images to Antithesis’ Docker registry. When we add new functionality, all our engineers have to do is modify the Antithesis workload and kick off an automated CI job. The entire experience and workflow lives right next to our existing codebase, CI, and workflows. Extra bonus: none of our engineers had to learn Clojure.
  3. Antithesis testing is designed to be a continuous process with accompanying professional services that help you grow and adapt the tests as the scope of your product increases. That means our users get the confidence that every WarpStream release is actively tested with Antithesis, unlike a traditional Jepsen test where the engagement is short-lived and usually only covers a “snapshot” of a system at a static point in time.
  4. Finally, it would not have been practical to continuously test our entire SaaS platform with Jepsen in the same way that we do with Antithesis. While that may seem like overkill, we think it’s a pretty important point. For example, consider the fact that almost every cloud infrastructure provider has a routing or proxy layer that is responsible for routing customer requests to the correct set of infrastructure that hosts the customer’s resources. A small data race or caching bug in that routing layer could result in exposing one customer’s resources to a different customer. These multi-tenant SaaS layers are never tested in traditional Jepsen testing, but with Antithesis it was actually easier to include these layers in our testing than to specifically exclude them.

We’re just getting started with Antithesis! Over the coming months we plan to work with the Antithesis team to expand our testing footprint to cover additional functionality like:

  1. Multi-region deployments of our SaaS platform.
  2. Multi-role Agent Clusters.
  3. Injecting and detecting data corruption at the storage and file cache layer using checksums.
  4. And much more!

If you’d like to learn more about WarpStream, please contact us, or join our Slack!
