Each Kafka broker tops out at a few thousand partitions before metadata overhead, leader elections, and controller bottlenecks degrade the cluster. On AWS MSK, the recommended limit is 6,000 partitions per broker including replication. Leviathan needs 900,000+ partitions for parallelism across 300,000 GPUs. With RF=3, that means 2.7 million partition-replicas. At 6,000 per broker, you need 450 brokers just to hold the partition metadata, before a single byte flows. And each broker you add makes the next problem worse.
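The broker math above is worth making explicit. A quick back-of-the-envelope check, using the 6,000-partitions-per-broker MSK guidance cited above:

```python
# Back-of-the-envelope broker count for Leviathan's partition load.
partitions = 900_000        # target partitions for parallelism across 300,000 GPUs
replication_factor = 3      # RF=3
per_broker_limit = 6_000    # AWS MSK guidance, replicas included

replicas = partitions * replication_factor      # total partition-replicas
brokers_needed = replicas // per_broker_limit   # brokers just to hold metadata

print(replicas, brokers_needed)  # 2700000 450
```

450 brokers before a single byte of data flows, and every one of them adds to the controller and rebalancing load described next.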
Any broker scaling event triggers a rebalancing storm. At 900,000+ partitions, storms last 60 to 90 minutes. 300,000 GPUs idle the entire time. Adding more brokers makes the next storm worse.
Scaling brokers means hours of data shuffling. A 100TB crawl dump can’t wait for a rebalance to finish. Every hour of pipeline delay is another $600K in GPUs doing nothing.
Fan-out scales linearly with consumer count. Five monitoring systems reading telemetry from the same topics means 5x the disk I/O. The broker’s physical IOPS becomes the ceiling. When alerting falls four hours behind, that’s $2.4M in wasted compute before anyone notices the training run diverged.
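The fan-out and alerting-lag numbers tie together. Read amplification is linear in consumer count, and the dollar figure follows from the $600K-per-idle-hour rate above:

```python
# Fan-out read amplification and the cost of alerting lag, made explicit.
consumers = 5                  # monitoring systems reading the same topics
io_amplification = consumers   # classic Kafka: each consumer re-reads from broker disk

gpu_burn_per_hour = 600_000    # $/hour of idle GPUs, from the pipeline-delay figure
alerting_lag_hours = 4
wasted = gpu_burn_per_hour * alerting_lag_hours

print(io_amplification, f"${wasted:,}")  # 5 $2,400,000
```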
Separate clusters per region plus replication tooling means offset drift, cascading failures, and operational complexity that scales faster than the cluster itself. A 3 AM failure in one region blocks the global pipeline, which idles GPUs 8,000 miles away.


Partitions are metadata pointers into object storage, not on-disk state tied to a broker. No per-broker limit. Customers are currently running 40,000+ partitions on a single WarpStream cluster. Leviathan's 900,000 partitions are just rows in the control plane.
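Conceptually, a partition stops being broker-local log segments and becomes a row of metadata pointing at object-store keys. The schema below is invented for illustration, not WarpStream's actual data model:

```python
# Illustrative only: a partition as a control-plane row pointing into object storage.
# Field names and values are hypothetical, not WarpStream's real schema.
partition_row = {
    "topic": "crawl-dumps",
    "partition": 41_337,
    "latest_offset": 982_114_551,
    "segments": [
        # each segment is just an object-store key plus a base offset;
        # no broker owns this data, so there is nothing to rebalance
        {"object_key": "s3://bucket/cluster/seg-000123", "base_offset": 982_000_000},
    ],
}

print(partition_row["partition"])  # 41337
```

Scaling to 900,000 partitions is then a question of how many rows the control plane can hold, not how many disks the brokers have.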
Agents are stateless. No data lives on the agent, so there's nothing to rebalance. Scale from 100 to 500 agents in minutes via Kubernetes HPA. Scale back down when the burst is over. Zero coordination, zero data movement.
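Because agents hold no state, that scaling policy is just an ordinary Kubernetes HPA. A minimal sketch, where the Deployment name, replica bounds, and CPU target are illustrative assumptions rather than WarpStream-documented defaults:

```yaml
# Hypothetical HPA for a stateless agent Deployment; names and targets are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: warpstream-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: warpstream-agent
  minReplicas: 100
  maxReplicas: 500
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Scaling up adds pods; scaling down removes them. No partition reassignment, no data copy, because the data was never on the agents to begin with.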
Five independent agent groups: ingestion, processing, data serving, telemetry, and evaluation. During a crawl dump burst, the ingestion group triples while the data serving group feeding 300,000 GPUs doesn't flinch. Same topics, same offsets, zero interference.
Object storage chunks cached in agent memory, shared across all consumer groups. Five consumers or fifteen, same I/O cost. No disk multiplication. Alerting never falls behind because a sixth monitoring system got added.
One logical cluster spanning multiple regions. No MirrorMaker, no offset drift, no separate clusters to manage. 99.999% uptime SLA. Zero data loss (RPO=0). Ripcord mode keeps writing even when the control plane is down. A 3 AM failure in one region doesn't block anything.