Cassandra's Architecture Handles 1M+ Writes/Second Because of These Design Decisions

The Slack message from our DBA was three words: "Check the dashboards." Our user table had grown to 400 million rows, and writes were timing out. The fix wasn't more hardware—it was understanding why Cassandra's token ring had placed 60% of our hot partitions on the same three nodes.
TL;DR: Cassandra's architecture trades consistency guarantees for write throughput that scales linearly with hardware. The token ring distributes data without a coordinator bottleneck, gossip eliminates single points of failure, and the new CEP-21 metadata service finally solves the schema disagreement problem that's haunted operations teams for years. As of December 2025, version 5.1 brings transformative changes that address long-standing operational pain points.
Why this matters: Your e-commerce platform's Black Friday traffic just 10x'd. Similar to teams at Netflix and Discord, you're watching write latency climb as a single database coordinator melts down. This is the exact failure mode Cassandra's masterless architecture prevents—every node accepts writes for its token range, eliminating the coordinator bottleneck that kills traditional databases under load.
Series Navigation
Post 1 of 7 in the Apache Cassandra Exploration Series
This post covers: Core architecture, token rings, gossip protocol, CEP-21 metadata service
Prerequisites: None—this is the starting point
Next: Storage Engine & Compaction covers TrieMemtable, BTI format, and compaction strategies
Related: Distributed Systems expands on gossip and consistency levels
The question isn't whether you need a distributed database—it's whether you understand why Cassandra's specific design decisions enable the workloads that break other systems.
Cassandra isn't just "eventually consistent"—it's a carefully engineered trade-off that enables 400,000+ operations per second on modest hardware according to Apache Cassandra benchmarks.
Cassandra Synthesizes Amazon Dynamo's Distribution with Google Bigtable's Data Model
Cassandra was initially designed at Facebook using a Staged Event-Driven Architecture (SEDA) to address limitations in existing database systems. Facebook open-sourced Cassandra in 2008, and it became an Apache top-level project in 2010. The project combines:
- Amazon Dynamo: Distributed storage, replication techniques, and eventual consistency (Dynamo paper, 2007)
- Google Bigtable: Data and storage engine model (Bigtable paper, 2006)
For your architecture decisions, this means: Cassandra excels at write-heavy, geographically distributed workloads where availability matters more than strong consistency.
Core Design Objectives
- Full Multi-Primary Replication: Every node accepts reads and writes for its data—no primary/replica bottleneck
- Global Availability at Low Latency: Designed for worldwide deployment with local access patterns
- Horizontal Scalability: Linear throughput increases with additional hardware
- Online Load Balancing: Cluster topology changes without downtime
- Flexible Schema: Schema modifications without application downtime
- Commodity Hardware: Runs efficiently on standard, affordable infrastructure
Consistent Hashing Eliminates Coordinator Bottlenecks
Cassandra uses consistent hashing to distribute data across nodes in a cluster. Each piece of data is assigned to a node based on a partition key, which is hashed to produce a token. This approach means adding or removing nodes only requires redistributing a fraction of the data—not reshuffling everything.
Key Implementation Details (org.apache.cassandra.dht.Murmur3Partitioner):
- Default partitioner is Murmur3Partitioner, offering ~10% better performance than RandomPartitioner
- Token space divided into ranges, with each node responsible for one or more ranges
- Virtual nodes (vnodes) allow dynamic token allocation, improving load distribution
- Default num_tokens set to 16 for a balance between performance and flexibility
For your cluster sizing, this means: With vnodes, adding a node automatically rebalances data. No manual token assignment needed.
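The ring lookup itself is simple enough to sketch. Here a sorted map stands in for the token ring, with plain long tokens as illustrative stand-ins for Murmur3 hashes:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal token-ring sketch. Names and hash values are illustrative;
// Cassandra's real ring uses Murmur3 tokens over the full long range.
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Each node claims several tokens on the ring (vnodes).
    public void addNode(String node, long... tokens) {
        for (long t : tokens) ring.put(t, node);
    }

    // The owner of a key is the node holding the first token >= hash(key),
    // wrapping around to the smallest token. Assumes a non-empty ring.
    public String ownerOf(long keyHash) {
        SortedMap<Long, String> tail = ring.tailMap(keyHash);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }
}
```

Adding a node inserts its tokens and claims only the key ranges immediately preceding them, which is why rebalancing moves a fraction of the data instead of reshuffling everything.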
Token Allocation (cassandra.yaml):

```yaml
num_tokens: 16
allocate_tokens_for_local_replication_factor: 3
```

The Storage Engine: Where Write Performance Comes From
Memtables Absorb Write Bursts Without Blocking
- SkipListMemtable: Legacy implementation using concurrent skip lists
- TrieMemtable: Modern trie-based structure reducing GC pressure and improving write throughput—up to 30% better performance according to Apache Cassandra blog
From ColumnFamilyStore.java:

```java
Memtable initialMemtable = DatabaseDescriptor.isDaemonInitialized()
    ? createMemtable(new AtomicReference<>(CommitLog.instance.getCurrentPosition()))
    : null;
```

Commit Log: Your Durability Guarantee
The commit log provides durability for writes before they're flushed to SSTables:
- Write-ahead log ensuring durability
- Supports multiple sync modes: periodic, batch, and group
- Direct I/O support for improved performance (Java 10+)
SSTables: BTI Format Delivers 20-30% Faster Reads
Cassandra 5.x supports two SSTable formats:
- BIG Format: Legacy format (Cassandra 3.0+)
- BTI Format: Trie-indexed format with superior performance
- Removes index summary component
- Eliminates need for key caching
- Efficient searching in partitions with millions of rows
For your migration planning, this means: New clusters should use BTI format. Existing clusters can migrate incrementally.
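A toy trie shows why trie-indexed lookups scale: the walk visits one node per key character, so cost is bounded by key length rather than by the number of partitions indexed. This HashMap version is only a sketch; the on-disk BTI trie is vastly more compact:

```java
import java.util.HashMap;
import java.util.Map;

// Toy trie index mapping partition keys to hypothetical file offsets.
public class TrieIndex {
    private final Map<Character, TrieIndex> children = new HashMap<>();
    private Long position = null; // offset stored where a key terminates

    public void put(String key, long pos) {
        TrieIndex node = this;
        for (char c : key.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new TrieIndex());
        node.position = pos;
    }

    // Lookup cost is O(key length), independent of index size.
    public Long get(String key) {
        TrieIndex node = this;
        for (char c : key.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node.position;
    }
}
```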
Gossip Protocol: No Single Point of Failure
The Gossiper implements a peer-to-peer communication protocol for node discovery and failure detection. Think of it like office gossip—each node periodically shares what it knows with random peers, and eventually everyone knows everything.
```java
public class Gossiper implements IFailureDetectionEventListener, GossiperMBean, IGossiper
{
    // Gossip runs every 1 second
    public final static int intervalInMillis = 1000;

    // Nodes quarantined for 2 * RING_DELAY after removal
    public final static int QUARANTINE_DELAY = GOSSIPER_QUARANTINE_DELAY.getInt(
        StorageService.RING_DELAY_MILLIS * 2
    );
}
```

Key Features:
- Epidemic protocol for state dissemination—converges in O(log N) rounds
- Phi Accrual Failure Detector for node health monitoring
- Application state propagation (schema, tokens, status)
- Automatic dead node detection and removal
For your operations runbooks, this means: Node failures are detected automatically. No external coordination service (like ZooKeeper) required.
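The Phi Accrual detector mentioned above turns "time since last heartbeat" into a suspicion level rather than a binary alive/dead flag. A minimal sketch, assuming exponentially distributed heartbeat intervals with a fixed mean (the real detector estimates the distribution from a sliding window of observed intervals):

```java
// Simplified Phi Accrual suspicion score.
// phi = -log10(P(silence lasts this long)) with P = exp(-t / meanIntervalMs).
public class PhiAccrual {
    public static double phi(double msSinceLastHeartbeat, double meanIntervalMs) {
        return (msSinceLastHeartbeat / meanIntervalMs) * Math.log10(Math.E);
    }
}
```

With a 1-second mean interval, phi sits near 0.4 right after a missed beat and climbs past the commonly used threshold of 8 after roughly 19 seconds of silence, at which point the node is marked down.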
CEP-21: The Biggest Architectural Change in Cassandra's History
Cassandra 5.1 introduces a transformative change with Transactional Cluster Metadata Service (CMS). If you've ever been paged for a "schema disagreement" alert, this is for you.
From NEWS.txt:
CEP-21 Transactional Cluster Metadata introduces a distributed log for linearizing modifications to cluster metadata. In the first instance, this encompasses cluster membership, token ownership and schema metadata.
Benefits:
- Linearized schema changes across the cluster—no more "schema version mismatch" errors
- Atomic topology modifications
- Elimination of schema disagreements that required manual nodetool resetlocalschema
- Foundation for future advanced features like Accord transactions (CEP-15)
For your upgrade planning, this means: CEP-21 alone justifies the upgrade to 5.1 for operational simplicity.
Tunable Consistency: The Trade-off You Control
Cassandra implements tunable consistency through multiple replication strategies. You choose the consistency level per operation—this isn't a cluster-wide setting.
Replication Strategies:
- SimpleStrategy: Single-datacenter deployments
- NetworkTopologyStrategy: Multi-datacenter with per-DC replication control
- LocalStrategy: Non-replicated system tables
Consistency Levels:
Writes: ANY, ONE, TWO, THREE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL
Reads: ONE, TWO, THREE, QUORUM, LOCAL_QUORUM, EACH_QUORUM, ALL, LOCAL_ONE
The consistency formula: If Write CL + Read CL > Replication Factor, you get strong consistency. For example, with RF=3: QUORUM + QUORUM = 2 + 2 = 4 > 3 ✓
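The arithmetic is small enough to encode directly. This hypothetical helper is not Cassandra API, just the overlap rule stated above:

```java
// Illustrative helper for the strong-consistency arithmetic; not Cassandra API.
public class ConsistencyMath {
    // QUORUM is a majority of replicas.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads see the latest write whenever the write and read replica sets
    // must overlap: W + R > RF.
    public static boolean isStronglyConsistent(int writeReplicas, int readReplicas, int rf) {
        return writeReplicas + readReplicas > rf;
    }
}
```

With RF=3, quorum(3) is 2, so QUORUM writes plus QUORUM reads overlap (4 > 3); ONE + ONE (2 ≤ 3) can return stale data.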
From StorageService.java:

```java
public void refreshSizeEstimates() throws ExecutionException
{
    cleanupSizeEstimates();
    FBUtilities.waitOnFuture(
        ScheduledExecutors.optionalTasks.submit(SizeEstimatesRecorder.instance)
    );
}
```

Compaction Strategies: Choosing the Right One Matters
Cassandra offers multiple compaction strategies optimized for different workloads. Pick the wrong one and you'll pay with latency spikes or disk space explosions.
Size-Tiered Compaction Strategy (STCS)
- Default strategy for write-heavy workloads
- Groups similarly-sized SSTables for compaction
- Lower I/O cost but higher space amplification (~50% overhead)
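STCS's grouping step can be sketched as bucketing by similar size, assuming the documented defaults bucket_low = 0.5 and bucket_high = 1.5 (the real strategy adds special cases for tiny SSTables and candidate scoring):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Rough sketch of size-tiered bucketing: after sorting, an SSTable joins
// the current bucket if its size is within [0.5x, 1.5x] of the bucket's
// running average; otherwise it starts a new bucket.
public class SizeTieredBuckets {
    public static List<List<Long>> bucket(long[] sizes) {
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        List<List<Long>> buckets = new ArrayList<>();
        for (long size : sorted) {
            if (!buckets.isEmpty()) {
                List<Long> last = buckets.get(buckets.size() - 1);
                double avg = last.stream().mapToLong(Long::longValue).average().orElse(0);
                if (size >= 0.5 * avg && size <= 1.5 * avg) {
                    last.add(size);
                    continue;
                }
            }
            List<Long> fresh = new ArrayList<>();
            fresh.add(size);
            buckets.add(fresh);
        }
        return buckets;
    }
}
```

Buckets that accumulate enough similarly sized SSTables (four by default) become compaction candidates, which is why STCS spends little I/O but tolerates duplicate data across tiers.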
Leveled Compaction Strategy (LCS)
From LeveledCompactionStrategy.java:
- Fixed-size SSTables organized in levels
- Lower space amplification (~10% overhead)
- Better read performance, higher write amplification (~10x)
Time Window Compaction Strategy (TWCS)
- Optimized for time-series data with TTL
- Creates time-based windows of SSTables
- Efficient expiration of old data—entire SSTables drop when all data expires
Unified Compaction Strategy (UCS) - The Future
From UnifiedCompactionStrategy.md:
- Combines benefits of all strategies—adaptive to workload
- Adaptive sharding for parallelization
- Density-based leveling
- Configurable for various workloads
For new deployments, this means: Use UCS unless you have specific requirements for TWCS (time-series with TTL).
Write and Read Paths: Understanding the Flow
Write Path: Why Writes Are Fast
- Memtable Write: Data written to active memtable (in-memory, sequential)
- Commit Log: Durably logged (if enabled)—this is the durability guarantee
- Secondary Indexes: Updated synchronously
- Materialized Views: Updated asynchronously via batchlog
This is why Cassandra writes are fast: The only disk I/O is a sequential append to the commit log. Everything else happens in memory.
From ColumnFamilyStore.java:

```java
public void apply(PartitionUpdate update, CassandraWriteContext context, boolean updateIndexes)
{
    long start = nanoTime();
    OpOrder.Group opGroup = context.getGroup();
    CommitLogPosition commitLogPosition = context.getPosition();
    Memtable mt = data.getMemtableFor(opGroup, commitLogPosition);
    UpdateTransaction indexer = newUpdateTransaction(update, context, updateIndexes, mt);
    long timeDelta = mt.put(update, indexer, opGroup);
    metric.writeLatency.addNano(nanoTime() - start);
}
```

Read Path: Where Complexity Lives
- Row Cache Check: Optional partition-level cache
- Memtable Query: Search active and flushing memtables
- SSTable Query: Use bloom filters, partition index, and data files
- Merge Results: Reconcile by timestamp using last-write-wins
This is why read performance depends on compaction: More SSTables = more files to check. Good compaction keeps SSTable count low.
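The reconciliation step can be sketched as a timestamp-keyed merge. Types here are simplified stand-ins; Cassandra resolves at cell granularity using microsecond write timestamps:

```java
import java.util.HashMap;
import java.util.Map;

// Last-write-wins reconciliation: when the same column comes back from
// several memtables/SSTables, the version with the highest write
// timestamp wins.
public class LwwMerge {
    public record Cell(String value, long writeTimestampMicros) {}

    @SafeVarargs
    public static Map<String, Cell> merge(Map<String, Cell>... sources) {
        Map<String, Cell> merged = new HashMap<>();
        for (Map<String, Cell> source : sources)
            for (Map.Entry<String, Cell> e : source.entrySet())
                merged.merge(e.getKey(), e.getValue(),
                    (a, b) -> a.writeTimestampMicros() >= b.writeTimestampMicros() ? a : b);
        return merged;
    }
}
```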
Configuration and Tuning: Start Here
Cassandra's flexibility comes from extensive configuration options in cassandra.yaml. Here are the settings that matter most:
Memory Management
```yaml
memtable_heap_space: 2048MiB
memtable_offheap_space: 2048MiB
key_cache_size: auto  # min(5% of heap, 100MiB)
row_cache_size: 0MiB  # Disabled by default
```

Compaction and I/O

```yaml
compaction_throughput: 64MiB/s
concurrent_compactors: auto  # min(disks, cores)
stream_throughput_outbound: 24MiB/s
```

Durability and Consistency

```yaml
commitlog_sync: periodic
commitlog_sync_period: 10000ms
commitlog_segment_size: 32MiB
```

Advanced Features Worth Knowing
Hinted Handoff: Automatic Failure Recovery
Cassandra 3.0+ completely rewrote hints for improved performance. When a replica is temporarily unavailable, the coordinator stores "hints" and replays them when the node returns.
- Hints stored in flat files (not in tables)
- More efficient dispatch mechanism
- Configurable compression
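The mechanism can be sketched as a per-node queue of deferred mutations. Real hints are written to flat files with a delivery throttle, not held in memory; this is only a model of the flow:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy hinted-handoff model: while a replica is down, the coordinator
// queues mutations for it; when the node returns inside the hint window,
// the queue is drained.
public class HintQueue {
    private final Map<String, Queue<String>> hintsByNode = new HashMap<>();

    public void storeHint(String downNode, String mutation) {
        hintsByNode.computeIfAbsent(downNode, k -> new ArrayDeque<>()).add(mutation);
    }

    // Returns the number of hints replayed to the recovered node.
    public int replay(String recoveredNode) {
        Queue<String> pending = hintsByNode.remove(recoveredNode);
        if (pending == null) return 0;
        int replayed = 0;
        for (String mutation : pending) {
            // delivery to recoveredNode would happen here
            replayed++;
        }
        return replayed;
    }
}
```

Hints only cover outages shorter than max_hint_window; anything longer must be reconciled by repair, which is why the two features go hand in hand.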
From cassandra.yaml:

```yaml
hinted_handoff_throttle: 1024KiB
max_hints_delivery_threads: 2
max_hint_window: 3h
```

Repair: Keeping Replicas Consistent
Multiple repair strategies for data consistency:
- Full Repair: Complete data validation across replicas
- Incremental Repair: Only unrepaired data—much faster
- Subrange Repair: Specific token ranges
- Auto Repair (5.1+): Fully automated repair scheduling
Storage-Attached Indexes (SAI): Secondary Indexes That Actually Work
Introduced in Cassandra 5.0 as the next-generation secondary indexing—forget everything you knew about SASI:
- Optimized SSTable and memtable-attached indexes
- Vector similarity search support for ML embeddings
- 50-85% faster than legacy indexes according to Apache benchmarks
Change Data Capture (CDC): Stream Your Changes
Enables real-time data change tracking:

```yaml
cdc_enabled: false
cdc_block_writes: true
cdc_on_repair_enabled: true
```

What Cassandra Does Well (and Where It Struggles)
Strengths
- Linear Write Scalability: Adding nodes proportionally increases write capacity—proven at Netflix scale
- High Availability: No single point of failure
- Tunable Consistency: Balance between consistency and availability per operation
- Efficient Wide Rows: Handles billions of columns per partition
- Geographic Distribution: Multi-datacenter replication built-in
Be Honest About the Trade-offs
- Read Performance: Depends heavily on caching and compaction strategy—not as fast as writes
- Data Modeling: Requires query-first design approach—you can't just normalize and query
- Tombstone Management: Deleted data creates tombstone overhead
- Repair Overhead: Regular repairs needed for eventual consistency
- JVM Tuning: Requires appropriate garbage collection configuration
Cassandra 5.0+: What's Changed
Storage Compatibility Mode
Allows gradual adoption of new features during upgrades:
```yaml
storage_compatibility_mode: NONE  # CASSANDRA_4, UPGRADING, or NONE
```

Extended TTL Support
- Maximum expiration date extended to 2106-02-07 (from 2038-01-19)—the Y2K38 problem, solved
- Backward compatible with controlled upgrade path
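Both dates fall straight out of 32-bit epoch-seconds arithmetic; a quick check:

```java
import java.time.Instant;

// The old TTL ceiling was the signed 32-bit epoch-seconds limit;
// the new encoding extends to the unsigned 32-bit limit.
public class TtlLimits {
    public static Instant signed32Limit()   { return Instant.ofEpochSecond(Integer.MAX_VALUE); }
    public static Instant unsigned32Limit() { return Instant.ofEpochSecond(4_294_967_295L); }
}
```

signed32Limit() lands on 2038-01-19 (the classic Y2K38 rollover) and unsigned32Limit() on 2106-02-07, matching the dates above.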
Accord Transactions (Experimental)
CEP-15 introduces general-purpose transactions:
- Multi-partition ACID transactions
- Optimistic concurrency control
- Foundation for complex application patterns that previously required external coordination
What This Means for Your Architecture
Cassandra's architecture represents decades of distributed systems research and production experience at companies like Facebook, Netflix, and Apple. The key insight isn't that Cassandra is "better"—it's that Cassandra makes specific trade-offs that enable specific workloads.
Cassandra excels when you need:
- Write throughput that scales horizontally
- Geographic distribution with local latency
- High availability (no single point of failure)
- Flexibility in consistency vs. availability trade-offs
Consider alternatives when you need:
- Complex queries with joins
- Strong consistency without performance penalty
- Small datasets that fit on one machine
- Ad-hoc analytics queries
Evaluate before you deploy:
- Model your queries first. Cassandra's data model is query-driven—you can't retrofit relational thinking.
- Plan your partition strategy. Hot partitions will kill performance. High cardinality partition keys distribute load.
- Choose your compaction strategy. STCS for write-heavy, LCS for read-heavy, TWCS for time-series, UCS for "I don't want to think about it."
- Set up monitoring from day one. JMX metrics and virtual tables give visibility into cluster health.
- Upgrade to 5.1 for CEP-21. The operational improvements alone justify the upgrade effort.
The database that handles 1 trillion requests per day at Apple didn't get there by accident. It got there by making the right trade-offs for the right workloads.
Sources and References
Architecture and Design
- Dynamo Paper (2007): Amazon's foundational distributed storage design
- Bigtable Paper (2006): Google's structured data storage system
- Facebook Open Source (2008): Original Cassandra announcement
Performance and Benchmarks
- Apache Cassandra Benchmarks: Configuration standardization and performance
- Netflix Tech Blog: Cassandra at Netflix scale
- Apple Scale: 1 trillion requests per day
Cassandra 5.0+ Features
- Trie Memtables and SSTables: 30% performance improvement
- Storage-Attached Indexes: SAI performance benchmarks
- Vector Search: ML embeddings support
- Direct I/O Support: Improved write performance
CEPs (Cassandra Enhancement Proposals)
- CEP-15: General Purpose Transactions (Accord)
- CEP-21: Transactional Cluster Metadata
- CEP-37: Automatic Repair Scheduler
Distributed Systems Theory
- SWIM Protocol: Scalable gossip-based membership
- Murmur3 Partitioner: JIRA performance improvement
- ACID Transactions at Scale: The New Stack analysis
Next in Series: Storage Engine and Compaction →