Cassandra's Architecture Handles 1M+ Writes/Second Because of These Design Decisions

The Slack message from our DBA was three words: "Check the dashboards." Our user table had grown to 400 million rows, and writes were timing out. The fix wasn't more hardware—it was understanding why Cassandra's token ring had placed 60% of our hot partitions on the same three nodes.
TL;DR: Cassandra's architecture trades consistency guarantees for write throughput that scales linearly with hardware. The token ring distributes data without a coordinator bottleneck, gossip eliminates single points of failure, and the new CEP-21 metadata service finally solves the schema disagreement problem that's haunted operations teams for years. As of December 2025, version 5.1 brings transformative changes that address long-standing operational pain points.
Why this matters: Your e-commerce platform's Black Friday traffic just 10x'd. Similar to teams at Netflix and Discord, you're watching write latency climb as a single database coordinator melts down. This is the exact failure mode Cassandra's masterless architecture prevents—every node accepts writes for its token range, eliminating the coordinator bottleneck that kills traditional databases under load.
Series Navigation
Post 1 of 7 in the Apache Cassandra Exploration Series
This post covers: Core architecture, token rings, gossip protocol, CEP-21 metadata service
Prerequisites: None—this is the starting point
Next: Storage Engine & Compaction covers TrieMemtable, BTI format, and compaction strategies
Related: Distributed Systems expands on gossip and consistency levels
The question isn't whether you need a distributed database—it's whether you understand why Cassandra's specific design decisions enable the workloads that break other systems.
Cassandra isn't just "eventually consistent"—it's a carefully engineered trade-off that enables 400,000+ operations per second on modest hardware according to Apache Cassandra benchmarks.
Cassandra Synthesizes Amazon Dynamo's Distribution with Google Bigtable's Data Model
Cassandra was initially designed at Facebook using a Staged Event-Driven Architecture (SEDA) to address limitations in existing database systems. Facebook open-sourced Cassandra in 2008, and it became an Apache top-level project in 2010. The project combines:
- Amazon Dynamo: Distributed storage, replication techniques, and eventual consistency (Dynamo paper, 2007)
- Google Bigtable: Data and storage engine model (Bigtable paper, 2006)
For your architecture decisions, this means: Cassandra excels at write-heavy, geographically distributed workloads where availability matters more than strong consistency.
Core Design Objectives
- Full Multi-Primary Replication: Every node accepts reads and writes for its data—no primary/replica bottleneck
- Global Availability at Low Latency: Designed for worldwide deployment with local access patterns
- Horizontal Scalability: Linear throughput increases with additional hardware
- Online Load Balancing: Cluster topology changes without downtime
- Flexible Schema: Schema modifications without application downtime
- Commodity Hardware: Runs efficiently on standard, affordable infrastructure
Consistent Hashing Eliminates Coordinator Bottlenecks
Cassandra uses consistent hashing to distribute data across nodes in a cluster. Each piece of data is assigned to a node based on a partition key, which is hashed to produce a token. This approach means adding or removing nodes only requires redistributing a fraction of the data—not reshuffling everything.
Key Implementation Details (org.apache.cassandra.dht.Murmur3Partitioner):
- Default partitioner is Murmur3Partitioner, offering ~10% better performance than RandomPartitioner
- Token space divided into ranges, with each node responsible for one or more ranges
- Virtual nodes (vnodes) allow dynamic token allocation, improving load distribution
- Default num_tokens set to 16 for a balance between performance and flexibility
For your cluster sizing, this means: With vnodes, adding a node automatically rebalances data. No manual token assignment needed.
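The ring lookup itself is simple enough to sketch. Here a sorted map stands in for the token ring, with plain long tokens as illustrative stand-ins for Murmur3 hashes:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal token-ring sketch. Names and hash values are illustrative;
// Cassandra's real ring uses Murmur3 tokens over the full long range.
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Each node claims several tokens on the ring (vnodes).
    public void addNode(String node, long... tokens) {
        for (long t : tokens) ring.put(t, node);
    }

    // The owner of a key is the node holding the first token >= hash(key),
    // wrapping around to the smallest token. Assumes a non-empty ring.
    public String ownerOf(long keyHash) {
        SortedMap<Long, String> tail = ring.tailMap(keyHash);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```

Adding a node inserts its tokens and claims only the key ranges immediately preceding them, which is why rebalancing moves a fraction of the data instead of reshuffling everything.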
Token Allocation (cassandra.yaml):

```yaml
num_tokens: 16
allocate_tokens_for_local_replication_factor: 3
```

The Storage Engine: Where Write Performance Comes From
Memtables Absorb Write Bursts Without Blocking
- SkipListMemtable: Legacy implementation using concurrent skip lists
- TrieMemtable: Modern trie-based structure reducing GC pressure and improving write throughput—up to 30% better performance according to Apache Cassandra blog
From ColumnFamilyStore.java:

```java
Memtable initialMemtable = DatabaseDescriptor.isDaemonInitialized()
    ? createMemtable(new AtomicReference<>(CommitLog.instance.getCurrentPosition()))
    : null;
```

Commit Log: Your Durability Guarantee
The commit log provides durability for writes before they're flushed to SSTables:
- Write-ahead log ensuring durability
- Supports multiple sync modes: periodic, batch, and group
- Direct I/O support for improved performance (Java 10+)
SSTables: BTI Format Delivers 20-30% Faster Reads
Cassandra 5.x supports two SSTable formats:
- BIG Format: Legacy format (Cassandra 3.0+)
- BTI Format: Trie-indexed format with superior performance
- Removes index summary component
- Eliminates need for key caching
- Efficient searching in partitions with millions of rows
For your migration planning, this means: New clusters should use BTI format. Existing clusters can migrate incrementally.
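A toy trie shows why trie-indexed lookups scale: the walk visits one node per key character, so cost is bounded by key length rather than by the number of partitions indexed. This HashMap version is only a sketch; the on-disk BTI trie is vastly more compact:

```java
import java.util.HashMap;
import java.util.Map;

// Toy trie index mapping partition keys to hypothetical file offsets.
public class TrieIndex {
    private final Map<Character, TrieIndex> children = new HashMap<>();
    private Long position = null; // offset stored where a key terminates

    public void put(String key, long pos) {
        TrieIndex node = this;
        for (char c : key.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new TrieIndex());
        node.position = pos;
    }

    // Lookup cost is O(key length), independent of index size.
    public Long get(String key) {
        TrieIndex node = this;
        for (char c : key.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node.position;
    }
}
```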
Gossip Protocol: No Single Point of Failure
The Gossiper implements a peer-to-peer communication protocol for node discovery and failure detection. Think of it like office gossip—each node periodically shares what it knows with random peers, and eventually everyone knows everything.
```java
public class Gossiper implements IFailureDetectionEventListener, GossiperMBean, IGossiper
{
    // Gossip runs every 1 second
    public final static int intervalInMillis = 1000;

    // Nodes quarantined for 2 * RING_DELAY after removal
    public final static int QUARANTINE_DELAY = GOSSIPER_QUARANTINE_DELAY.getInt(
        StorageService.RING_DELAY_MILLIS * 2
    );
}
```

Key Features:
- Epidemic protocol for state dissemination—converges in O(log N) rounds
- Phi Accrual Failure Detector for node health monitoring
- Application state propagation (schema, tokens, status)
- Automatic dead node detection and removal
For your operations runbooks, this means: Node failures are detected automatically. No external coordination service (like ZooKeeper) required.
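The Phi Accrual detector mentioned above turns "time since last heartbeat" into a suspicion level rather than a binary alive/dead flag. A minimal sketch, assuming exponentially distributed heartbeat intervals with a fixed mean (the real detector estimates the distribution from a sliding window of observed intervals):

```java
// Simplified Phi Accrual suspicion score.
// phi = -log10(P(silence lasts this long)) with P = exp(-t / meanIntervalMs).
public class PhiAccrual {
    public static double phi(double msSinceLastHeartbeat, double meanIntervalMs) {
        return (msSinceLastHeartbeat / meanIntervalMs) * Math.log10(Math.E);
    }
}
```

With a 1-second mean interval, phi sits near 0.4 right after a missed beat and climbs past the commonly used threshold of 8 after roughly 19 seconds of silence, at which point the node is marked down.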
CEP-21: The Biggest Architectural Change in Cassandra's History
Cassandra 5.1 introduces a transformative change with Transactional Cluster Metadata Service (CMS). If you've ever been paged for a "schema disagreement" alert, this is for you.
From NEWS.txt:
CEP-21 Transactional Cluster Metadata introduces a distributed log for linearizing modifications to cluster metadata. In the first instance, this encompasses cluster membership, token ownership and schema metadata.
Benefits:
- Linearized schema changes across the cluster—no more "schema version mismatch" errors
- Atomic topology modifications
- Elimination of schema disagreements that required manual nodetool resetlocalschema
- Foundation for future advanced features like Accord transactions (CEP-15)
For your upgrade planning, this means: CEP-21 alone justifies the upgrade to 5.1 for operational simplicity.
Tunable Consistency: The Trade-off You Control
Cassandra implements tunable consistency through multiple replication strategies. You choose the consistency level per operation—this isn't a cluster-wide setting.
Replication Strategies:
- SimpleStrategy: Single-datacenter deployments
- NetworkTopologyStrategy: Multi-datacenter with per-DC replication control
- LocalStrategy: Non-replicated system tables
Consistency Levels:
Writes: ANY, ONE, TWO, THREE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL
Reads: ONE, TWO, THREE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, LOCAL_ONE
The consistency formula: If Write CL + Read CL > Replication Factor, you get strong consistency. For example, with RF=3: QUORUM + QUORUM = 2 + 2 = 4 > 3 ✓
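The arithmetic is small enough to encode directly. This hypothetical helper is not Cassandra API, just the overlap rule stated above:

```java
// Illustrative helper for the strong-consistency arithmetic; not Cassandra API.
public class ConsistencyMath {
    // QUORUM is a majority of replicas.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads see the latest write whenever the write and read replica sets
    // must overlap: W + R > RF.
    public static boolean isStronglyConsistent(int writeReplicas, int readReplicas, int rf) {
        return writeReplicas + readReplicas > rf;
    }
}
```

With RF=3, quorum(3) is 2, so QUORUM writes plus QUORUM reads overlap (4 > 3); ONE + ONE (2 ≤ 3) can return stale data.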
From StorageService.java:

```java
public void refreshSizeEstimates() throws ExecutionException
{
    cleanupSizeEstimates();
    FBUtilities.waitOnFuture(
        ScheduledExecutors.optionalTasks.submit(SizeEstimatesRecorder.instance)
    );
}
```

Compaction Strategies: Choosing the Right One Matters
Cassandra offers multiple compaction strategies optimized for different workloads. Pick the wrong one and you'll pay with latency spikes or disk space explosions.
Size-Tiered Compaction Strategy (STCS)
- Default strategy for write-heavy workloads
- Groups similarly-sized SSTables for compaction
- Lower I/O cost but higher space amplification (~50% overhead)
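STCS's grouping step can be sketched as bucketing by similar size, assuming the documented defaults bucket_low = 0.5 and bucket_high = 1.5 (the real strategy adds special cases for tiny SSTables and candidate scoring):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Rough sketch of size-tiered bucketing: after sorting, an SSTable joins
// the current bucket if its size is within [0.5x, 1.5x] of the bucket's
// running average; otherwise it starts a new bucket.
public class SizeTieredBuckets {
    public static List<List<Long>> bucket(long[] sizes) {
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        List<List<Long>> buckets = new ArrayList<>();
        for (long size : sorted) {
            if (!buckets.isEmpty()) {
                List<Long> last = buckets.get(buckets.size() - 1);
                double avg = last.stream().mapToLong(Long::longValue).average().orElse(0);
                if (size >= 0.5 * avg && size <= 1.5 * avg) {
                    last.add(size);
                    continue;
                }
            }
            List<Long> fresh = new ArrayList<>();
            fresh.add(size);
            buckets.add(fresh);
        }
        return buckets;
    }
}
```

Buckets that accumulate enough similarly sized SSTables (four by default) become compaction candidates, which is why STCS spends little I/O but tolerates duplicate data across tiers.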
Leveled Compaction Strategy (LCS)
From LeveledCompactionStrategy.java:
- Fixed-size SSTables organized in levels
- Lower space amplification (~10% overhead)
- Better read performance, higher write amplification (~10x)
Time Window Compaction Strategy (TWCS)
- Optimized for time-series data with TTL
- Creates time-based windows of SSTables
- Efficient expiration of old data—entire SSTables drop when all data expires
Unified Compaction Strategy (UCS) - The Future
From UnifiedCompactionStrategy.md:
- Combines benefits of all strategies—adaptive to workload
- Adaptive sharding for parallelization
- Density-based leveling
- Configurable for various workloads
For new deployments, this means: Use UCS unless you have specific requirements for TWCS (time-series with TTL).
Write and Read Paths: Understanding the Flow
Write Path: Why Writes Are Fast
- Memtable Write: Data written to active memtable (in-memory, sequential)
- Commit Log: Durably logged (if enabled)—this is the durability guarantee
- Secondary Indexes: Updated synchronously
- Materialized Views: Updated asynchronously via batchlog
This is why Cassandra writes are fast: The only disk I/O is a sequential append to the commit log. Everything else happens in memory.
From ColumnFamilyStore.java:

```java
public void apply(PartitionUpdate update, CassandraWriteContext context, boolean updateIndexes)
{
    long start = nanoTime();
    OpOrder.Group opGroup = context.getGroup();
    CommitLogPosition commitLogPosition = context.getPosition();
    Memtable mt = data.getMemtableFor(opGroup, commitLogPosition);
    UpdateTransaction indexer = newUpdateTransaction(update, context, updateIndexes, mt);
    long timeDelta = mt.put(update, indexer, opGroup);
    metric.writeLatency.addNano(nanoTime() - start);
}
```

Read Path: Where Complexity Lives
- Row Cache Check: Optional partition-level cache
- Memtable Query: Search active and flushing memtables
- SSTable Query: Use bloom filters, partition index, and data files
- Merge Results: Reconcile by timestamp using last-write-wins
This is why read performance depends on compaction: More SSTables = more files to check. Good compaction keeps SSTable count low.
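The reconciliation step can be sketched as a timestamp-keyed merge. Types here are simplified stand-ins; Cassandra resolves at cell granularity using microsecond write timestamps:

```java
import java.util.HashMap;
import java.util.Map;

// Last-write-wins reconciliation: when the same column comes back from
// several memtables/SSTables, the version with the highest write
// timestamp wins.
public class LwwMerge {
    public record Cell(String value, long writeTimestampMicros) {}

    @SafeVarargs
    public static Map<String, Cell> merge(Map<String, Cell>... sources) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> source : sources)
            for (Map.Entry<String, Cell> e : source.entrySet())
                merged.merge(e.getKey(), e.getValue(),
                    (a, b) -> a.writeTimestampMicros() >= b.writeTimestampMicros() ? a : b);
        return merged;
    }
}
```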
Configuration and Tuning: Start Here
Cassandra's flexibility comes from extensive configuration options in cassandra.yaml. Here are the settings that matter most:
Memory Management
```yaml
memtable_heap_space: 2048MiB
memtable_offheap_space: 2048MiB
key_cache_size: auto  # min(5% of heap, 100MiB)
row_cache_size: 0MiB  # Disabled by default
```

Compaction and I/O

```yaml
compaction_throughput: 64MiB/s
concurrent_compactors: auto  # min(disks, cores)
stream_throughput_outbound: 24MiB/s
```

Durability and Consistency

```yaml
commitlog_sync: periodic
commitlog_sync_period: 10000ms
commitlog_segment_size: 32MiB
```

Advanced Features Worth Knowing
Hinted Handoff: Automatic Failure Recovery
Cassandra 3.0+ completely rewrote hints for improved performance. When a replica is temporarily unavailable, the coordinator stores "hints" and replays them when the node returns.
- Hints stored in flat files (not in tables)
- More efficient dispatch mechanism
- Configurable compression
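The mechanism can be sketched as a per-node queue of deferred mutations. Real hints are written to flat files with a delivery throttle, not held in memory; this is only a model of the flow:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy hinted-handoff model: while a replica is down, the coordinator
// queues mutations for it; when the node returns inside the hint window,
// the queue is drained.
public class HintQueue {
    private final Map<String, Queue<String>> hintsByNode = new HashMap<>();

    public void storeHint(String downNode, String mutation) {
        hintsByNode.computeIfAbsent(downNode, k -> new ArrayDeque<>()).add(mutation);
    }

    // Returns the number of hints replayed to the recovered node.
    public int replay(String recoveredNode) {
        Queue<String> pending = hintsByNode.remove(recoveredNode);
        if (pending == null) return 0;
        int replayed = 0;
        for (String mutation : pending) {
            // delivery to recoveredNode would happen here
            replayed++;
        }
        return replayed;
    }
}
```

Hints only cover outages shorter than max_hint_window; anything longer must be reconciled by repair, which is why the two features go hand in hand.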
From cassandra.yaml:

```yaml
hinted_handoff_throttle: 1024KiB
max_hints_delivery_threads: 2
max_hint_window: 3h
```

Repair: Keeping Replicas Consistent
Multiple repair strategies for data consistency:
- Full Repair: Complete data validation across replicas
- Incremental Repair: Only unrepaired data—much faster
- Subrange Repair: Specific token ranges
- Auto Repair (5.1+): Fully automated repair scheduling
Storage-Attached Indexes (SAI): Secondary Indexes That Actually Work
Introduced in Cassandra 5.0 as the next-generation secondary indexing—forget everything you knew about SASI:
- Optimized SSTable and memtable-attached indexes
- Vector similarity search support for ML embeddings
- 50-85% faster than legacy indexes according to Apache benchmarks
Change Data Capture (CDC): Stream Your Changes
Enables real-time data change tracking:

```yaml
cdc_enabled: false
cdc_block_writes: true
cdc_on_repair_enabled: true
```

What Cassandra Does Well (and Where It Struggles)
Strengths
- Linear Write Scalability: Adding nodes proportionally increases write capacity—proven at Netflix scale
- High Availability: No single point of failure
- Tunable Consistency: Balance between consistency and availability per operation
- Efficient Wide Rows: Handles billions of columns per partition
- Geographic Distribution: Multi-datacenter replication built-in
Be Honest About the Trade-offs
- Read Performance: Depends heavily on caching and compaction strategy—not as fast as writes
- Data Modeling: Requires query-first design approach—you can't just normalize and query
- Tombstone Management: Deleted data creates tombstone overhead
- Repair Overhead: Regular repairs needed for eventual consistency
- JVM Tuning: Requires appropriate garbage collection configuration
Cassandra 5.0+: What's Changed
Storage Compatibility Mode
Allows gradual adoption of new features during upgrades:
```yaml
storage_compatibility_mode: NONE  # CASSANDRA_4, UPGRADING, or NONE
```

Extended TTL Support
- Maximum expiration date extended to 2106-02-07 (from 2038-01-19)—the Y2K38 problem, solved
- Backward compatible with controlled upgrade path
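Both dates fall straight out of 32-bit epoch-seconds arithmetic; a quick check:

```java
import java.time.Instant;

// The old TTL ceiling was the signed 32-bit epoch-seconds limit;
// the new encoding extends to the unsigned 32-bit limit.
public class TtlLimits {
    public static Instant signed32Limit()   { return Instant.ofEpochSecond(Integer.MAX_VALUE); }
    public static Instant unsigned32Limit() { return Instant.ofEpochSecond(4_294_967_295L); }
}
```

signed32Limit() lands on 2038-01-19 (the classic Y2K38 rollover) and unsigned32Limit() on 2106-02-07, matching the dates above.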
Accord Transactions (Experimental)
CEP-15 introduces general-purpose transactions:
- Multi-partition ACID transactions
- Optimistic concurrency control
- Foundation for complex application patterns that previously required external coordination
What This Means for Your Architecture
Cassandra's architecture represents decades of distributed systems research and production experience at companies like Facebook, Netflix, and Apple. The key insight isn't that Cassandra is "better"—it's that Cassandra makes specific trade-offs that enable specific workloads.
Cassandra excels when you need:
- Write throughput that scales horizontally
- Geographic distribution with local latency
- High availability (no single point of failure)
- Flexibility in consistency vs. availability trade-offs
Consider alternatives when you need:
- Complex queries with joins
- Strong consistency without performance penalty
- Small datasets that fit on one machine
- Ad-hoc analytics queries
Evaluate before you deploy:
- Model your queries first. Cassandra's data model is query-driven—you can't retrofit relational thinking.
- Plan your partition strategy. Hot partitions will kill performance. High cardinality partition keys distribute load.
- Choose your compaction strategy. STCS for write-heavy, LCS for read-heavy, TWCS for time-series, UCS for "I don't want to think about it."
- Set up monitoring from day one. JMX metrics and virtual tables give visibility into cluster health.
- Upgrade to 5.1 for CEP-21. The operational improvements alone justify the upgrade effort.
The database that handles 1 trillion requests per day at Apple didn't get there by accident. It got there by making the right trade-offs for the right workloads.
Sources and References
Architecture and Design
- Dynamo Paper (2007): Amazon's foundational distributed storage design
- Bigtable Paper (2006): Google's structured data storage system
- Facebook Open Source (2008): Original Cassandra announcement
Performance and Benchmarks
- Apache Cassandra Benchmarks: Configuration standardization and performance
- Netflix Tech Blog: Cassandra at Netflix scale
- Apple Scale: 1 trillion requests per day
Cassandra 5.0+ Features
- Trie Memtables and SSTables: 30% performance improvement
- Storage-Attached Indexes: SAI performance benchmarks
- Vector Search: ML embeddings support
- Direct I/O Support: Improved write performance
CEPs (Cassandra Enhancement Proposals)
- CEP-15: General Purpose Transactions (Accord)
- CEP-21: Transactional Cluster Metadata
- CEP-37: Automatic Repair Scheduler
Distributed Systems Theory
- SWIM Protocol: Scalable gossip-based membership
- Murmur3 Partitioner: JIRA performance improvement
- ACID Transactions at Scale: The New Stack analysis
Next in Series: Storage Engine and Compaction →