Your Compaction Strategy Choice Will Cost You 10x in Write Amplification (Or Save It)

The moment you realize your 50% disk space headroom isn't a safety margin—it's STCS's tax on your infrastructure budget.
TL;DR: Compaction strategy choice impacts write amplification by 5-10x, disk usage by 10-50%, and read latency by 3-5x. The new Unified Compaction Strategy (UCS) in Cassandra 5.0 adapts to workload changes automatically, eliminating the "which strategy should I pick?" problem. TrieMemtable reduces GC pauses by 43% compared to SkipListMemtable. As of December 2025, there's no reason to use the old strategies on new deployments.
The production incident that taught me this: your time-series cluster is running fine—until someone runs an ad-hoc analytics query that triggers a tombstone storm. Like the teams at Discord who have documented their Cassandra scaling challenges, you discover that 100,000 tombstones per read request will cause every query to time out. The root cause? The wrong compaction strategy for the workload.
Series Navigation
Post 2 of 7 in the Apache Cassandra Exploration Series
This post covers: Memtables (TrieMemtable vs SkipList), SSTable formats (BTI vs BIG), compaction strategies (STCS, LCS, TWCS, UCS)
Prerequisites: Architecture Overview—understand token rings and gossip first
Next: Distributed Systems covers consistency levels and replication strategies
Related: Performance Benchmarks shows real numbers for each compaction strategy
Understanding compaction isn't optional—it's the difference between a cluster that purrs and one that pages you at 3 AM.
The storage engine is where Cassandra's performance magic happens, and compaction is where it goes wrong if you're not paying attention.
Memtables: TrieMemtable Cuts GC Pauses by 43%
SkipListMemtable vs. TrieMemtable: The Numbers Matter
Cassandra uses memtables as an in-memory write buffer before data is persisted to disk. Version 5.0 introduced a pluggable memtable API, and the choice matters more than you'd think.
SkipListMemtable (Legacy):
- Concurrent skip list data structure
- Thread-safe without locks
- Higher GC pressure with frequent writes
- Well-tested, stable implementation
TrieMemtable (Modern - Recommended):
From cassandra.yaml:
memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
  default:
    inherits: skiplist  # Conservative default
TrieMemtable advantages (Apache Cassandra 5.0 announcement):
- Off-heap metadata storage reduces GC pauses by 43% in benchmarks
- Sharded single-writer architecture eliminates contention
- 9% higher write throughput
- 25% better memory efficiency
- More predictable P99 latencies
For your production clusters, this means: Switch to TrieMemtable. The performance gains are free, and the reduced GC pressure alone justifies the change.
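You can also opt in table by table instead of changing the cluster-wide default. A minimal sketch using the DataStax Java driver—`my_keyspace.my_table` is a placeholder, and it assumes the `trie` configuration from the cassandra.yaml snippet above exists:
import com.datastax.oss.driver.api.core.CqlSession;

public class SwitchToTrieMemtable {
    public static void main(String[] args) {
        // Connects to a local node with the driver's defaults.
        try (CqlSession session = CqlSession.builder().build()) {
            // Per-table memtable selection (Cassandra 5.0); 'trie' must be
            // a configuration declared in cassandra.yaml as shown above.
            session.execute("ALTER TABLE my_keyspace.my_table WITH memtable = 'trie'");
        }
    }
}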
Implementation Note from ColumnFamilyStore.java:
public Memtable createMemtable(AtomicReference<CommitLogPosition> commitLogUpperBound)
{
return memtableFactory.create(commitLogUpperBound, metadata, this);
}
Commit Log: Your Durability Guarantee
The commit log provides durability for writes before they're flushed to SSTables. This is the only synchronous disk I/O on the write path—and it's sequential, which is why SSDs aren't mandatory for Cassandra (though they help).
Sync Modes:
- Periodic (default): Fsync every 10 seconds—fast but up to 10s of data loss on crash
- Batch: Fsync before acknowledging write—slowest but zero data loss
- Group: Block for configurable period between fsyncs—middle ground
Direct I/O Support (5.0+) from Apache Cassandra blog:
- Available when commit log is uncompressed and unencrypted
- Reduces memory mapping overhead
- Minimizes page cache pollution
- Better for high-throughput workloads
From cassandra.yaml:
commitlog_disk_access_mode: legacy # legacy, mmap, direct, standard
For your durability requirements, this means: Match sync mode to your tolerance for data loss. Financial transactions need batch; analytics can use periodic.
BTI SSTable Format: 28% Faster Reads, 28% Smaller Indexes
BIG Format (Legacy) vs. BTI Format (5.0+)
BIG Format:
- Used since Cassandra 3.0
- Partition index with index summary
- Bloom filters for existence checks
- Key cache for index summary entries
- Column index for wide partitions
BTI Format (Trie-Indexed - 5.0+):
The BTI format represents a fundamental improvement (Apache Cassandra 5.0 Features):
Key Innovations:
- Trie-based partition index (no index summary needed)
- Eliminates key cache requirement—no warm-up time
- More efficient for partitions with millions of rows
- Smaller on-disk footprint
- Faster point queries
Performance Impact (from Apache Cassandra benchmarks):
- 20-30% smaller index size
- 28% faster partition lookups
- No warm-up time (no key cache to populate)
- Better cache-line utilization
For your migration planning, this means: New clusters should default to BTI. Existing clusters can migrate by rewriting SSTables during compaction.
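If you don't want to wait for natural compaction to rewrite everything, you can trigger the rewrite the same way `nodetool upgradesstables -a` does—over JMX. A hedged sketch: it assumes unauthenticated JMX on the default port 7199 and the three-argument `upgradeSSTables` overload on StorageService; keyspace and table names are placeholders:
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RewriteSSTables {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
            // excludeCurrentVersion=false rewrites every SSTable, so existing
            // files are rewritten in the currently configured format (e.g. BTI).
            mbs.invoke(storageService, "upgradeSSTables",
                    new Object[] { "my_keyspace", false, new String[] { "my_table" } },
                    new String[] { "java.lang.String", "boolean", "[Ljava.lang.String;" });
        }
    }
}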
Index Granularity Configuration:
column_index_size: 4KiB # BIG default: 64KiB, BTI default: 16KiB
Compaction: Where Strategy Choice Makes or Breaks Performance
Compaction is the process of merging SSTables to reclaim space, remove deleted data, and improve read performance. Pick wrong, and you'll pay with latency spikes, disk space explosions, or write amplification that burns through your SSDs.
Size-Tiered (STCS): The 50% Disk Tax
Algorithm from SizeTieredCompactionStrategy.java:
public static List<SSTableReader> mostInterestingBucket(List<List<SSTableReader>> buckets,
int minThreshold,
int maxThreshold)
{
// Buckets grouped by size similarity
// Most interesting = largest average hotness
// Hotness = read rate per byte
}
The numbers that matter:
- Best for: Insert-heavy workloads, time-series data (without TTL)
- Space amplification: ~50% (worst case: 100%)—you need 2x the data size in free disk
- Read amplification: O(log N) SSTables—reads slow as data grows
- Write amplification: ~2-3x—lowest of all strategies
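To make the bucketing concrete, here is a toy version of the size-tiering idea—not the actual Cassandra code, just an illustration of how bucket_low/bucket_high group similarly sized SSTables and min_threshold gates compaction:
import java.util.*;

// Toy STCS bucketing: SSTables whose sizes fall within
// [bucketLow * avg, bucketHigh * avg] of a bucket's running average are
// grouped; buckets with at least minThreshold members become candidates.
public class StcsBucketingSketch {
    static List<List<Long>> bucket(List<Long> sizes, double bucketLow,
                                   double bucketHigh, int minThreshold) {
        Collections.sort(sizes);
        List<List<Long>> buckets = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        double avg = 0;
        for (long size : sizes) {
            if (!current.isEmpty() && (size < bucketLow * avg || size > bucketHigh * avg)) {
                buckets.add(current);           // size jumped out of range: close bucket
                current = new ArrayList<>();
            }
            current.add(size);
            avg = current.stream().mapToLong(Long::longValue).average().orElse(0);
        }
        if (!current.isEmpty())
            buckets.add(current);
        buckets.removeIf(b -> b.size() < minThreshold); // too few SSTables to compact
        return buckets;
    }

    public static void main(String[] args) {
        // Four ~100 MiB SSTables form a compactable bucket; the lone 1 GiB one doesn't.
        List<Long> sizesInMiB = new ArrayList<>(List.of(96L, 100L, 104L, 110L, 1024L));
        System.out.println(bucket(sizesInMiB, 0.5, 1.5, 4)); // => [[96, 100, 104, 110]]
    }
}
Running it groups the four ~100 MiB SSTables into one candidate bucket while leaving the lone 1 GiB SSTable alone—exactly the behavior that lets STCS defer work until enough similar files accumulate.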
Configuration:
compaction:
  class: SizeTieredCompactionStrategy
  options:
    min_threshold: 4   # Minimum SSTables to compact
    max_threshold: 32  # Maximum SSTables per compaction
    bucket_high: 1.5   # Size similarity factor
    bucket_low: 0.5
For your capacity planning, this means: Provision 2x your expected data size in disk space. STCS will use it during compaction spikes.
Leveled (LCS): Predictable Reads, 10x Write Cost
From LeveledCompactionStrategy.java:
Level Organization:
- L0: Newly flushed SSTables (may overlap)
- L1-LN: Fixed-size, non-overlapping SSTables
- Each level is 10x the size of the previous (configurable via fanout_size)
Algorithm Highlights:
public class LeveledManifest
{
// Maximum bytes for level = fanout^level * max_sstable_size
public long maxBytesForLevel(int level, long maxSSTableSizeInBytes)
{
return level == 0 ? 4 * maxSSTableSizeInBytes
: (long) Math.pow(levelFanoutSize, level) * maxSSTableSizeInBytes;
}
}
The numbers that matter:
- Best for: Read-heavy workloads with bounded dataset size
- Space amplification: ~10%—much better than STCS
- Read amplification: 1 SSTable per level (typically 1-2 SSTables total)
- Write amplification: ~10x—data is rewritten at each level promotion
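To see where that figure comes from, run the level arithmetic from maxBytesForLevel with the defaults in the configuration below—a quick, hedged back-of-envelope sketch (fanout 10, 160 MiB SSTables):
// Quick arithmetic for LCS level capacities with the defaults below
// (fanout 10, 160 MiB SSTables) — mirrors maxBytesForLevel above.
public class LcsLevelSizes {
    public static void main(String[] args) {
        long maxSSTableBytes = 160L * 1024 * 1024; // 160 MiB
        int fanout = 10;
        for (int level = 1; level <= 4; level++) {
            long capacity = (long) Math.pow(fanout, level) * maxSSTableBytes;
            System.out.printf("L%d: %d GiB%n", level, capacity >> 30);
        }
        // L1: 1 GiB, L2: 15 GiB, L3: 156 GiB, L4: 1562 GiB (rounded down)
    }
}
The deeper the levels, the more times each surviving byte is rewritten on its way down—which is where the ~10x figure above comes from.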
Configuration:
compaction:
  class: LeveledCompactionStrategy
  options:
    sstable_size_in_mb: 160  # Default 160MB (raised from the early 5MB default)
    fanout_size: 10          # Level size multiplier
For your write-heavy workloads, this means: LCS will burn through SSD endurance 3-5x faster than STCS. Check your SSD DWPD ratings.
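A quick way to act on that warning: convert your per-node ingest rate and the strategy's write amplification into required drive-writes-per-day. The numbers below are hypothetical placeholders:
// Back-of-envelope SSD endurance check under write amplification.
public class SsdEndurance {
    public static void main(String[] args) {
        double ingestTBPerDay = 1.0;  // logical writes per node (placeholder)
        double writeAmp = 10.0;       // e.g. LCS at ~10x
        double driveTB = 4.0;         // installed SSD capacity per node
        double dwpdNeeded = ingestTBPerDay * writeAmp / driveTB;
        System.out.printf("Required DWPD: %.1f%n", dwpdNeeded); // 2.5
    }
}
If your drives are rated below the computed DWPD, either the strategy or the hardware has to change.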
Time Window (TWCS): The Time-Series Specialist
Optimized for time-series data with TTL. If you're storing metrics, logs, or events with expiration, this is your strategy.
From TimeWindowCompactionStrategy.java:
public static Pair<Long,Long> getWindowBoundsInMillis(TimeUnit windowTimeUnit,
int windowTimeSize,
long timestampInMillis)
{
// Creates time-based buckets for SSTables
// Allows entire window drops when all data expires
}
The numbers that matter:
- Best for: Time-series with TTL, append-only workloads
- Groups SSTables by time window—entire windows drop when all data expires
- Minimal compaction overhead for aged data
- ~20% space amplification
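The heart of the windowing logic is just flooring a write timestamp to its window start, so every SSTable flushed in the same window groups together. A hedged sketch of that idea (not the actual getWindowBoundsInMillis implementation):
import java.util.concurrent.TimeUnit;

// Sketch of TWCS-style window bucketing: floor a write timestamp to the
// start of its window so SSTables from the same window group together.
public class TwcsWindowSketch {
    static long windowStartMillis(TimeUnit unit, int size, long timestampMillis) {
        long windowMillis = unit.toMillis(size);
        return (timestampMillis / windowMillis) * windowMillis; // floor to window
    }

    public static void main(String[] args) {
        long ts = 1_700_000_000_000L; // some write timestamp
        System.out.println(windowStartMillis(TimeUnit.DAYS, 1, ts));
        // All writes from the same UTC day land in the same bucket,
        // so the whole bucket can be dropped once every row's TTL expires.
    }
}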
Configuration:
compaction:
  class: TimeWindowCompactionStrategy
  options:
    compaction_window_unit: DAYS
    compaction_window_size: 1
For your time-series data, this means: Match window size to your TTL. If data expires after 7 days, use 1-day windows. Entire SSTables drop—no compaction needed.
Unified (UCS): The "I Don't Want to Think About It" Strategy
Introduced in Cassandra 5.0, UCS represents a major evolution: it adapts to workload changes automatically—no more "should I use STCS or LCS?" debates.
From UnifiedCompactionStrategy.md:
Core Concepts
1. Size-Based Levels:
- Similar to LCS but with adaptive sizing
- Levels sized by survival ratio, not fixed fanout
- Better handles variable write patterns
2. Density Leveling:
- Compacts based on data density (bytes per token)
- Prevents hot spots from blocking compaction
- More uniform SSTable distribution
3. Adaptive Sharding:
- Basic Sharding: Splits output by token ranges
- Full Sharding: Parallel compaction of independent shards
4. Output Parallelization (5.1+):
From UnifiedCompactionStrategy.java:
private List<AbstractCompactionTask> createParallelCompactionTasks(
LifecycleTransaction transaction,
long gcBefore)
{
// Splits compaction into per-shard tasks
// Dramatically reduces compaction duration
// Particularly beneficial for major compactions
}
Configuration:
compaction:
  class: UnifiedCompactionStrategy
  options:
    scaling_parameters: "T4"        # Threshold and fanout
    target_sstable_size: "1GiB"
    base_shard_count: 4
    parallelize_output_shards: true # Enable parallel compaction
Scaling Parameters Cheat Sheet:
- T2: threshold=2, fanout=2 (minimal overhead, more SSTables)
- T4: threshold=4, fanout=4 (balanced, default for most workloads)
- L8: threshold=2, fanout=8 (LCS-like, better read latency)
- N: threshold=30, fanout=2 (STCS-like, best for pure writes)
For new deployments, this means: Use UCS with T4. It's the "I don't know my access patterns yet" safe choice.
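As a starting point, here is a hedged driver sketch that creates a table with exactly that setting—keyspace, table, and schema are placeholders:
import com.datastax.oss.driver.api.core.CqlSession;

public class CreateWithUcs {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // UCS with the balanced T4 scaling parameter recommended above.
            session.execute("""
                CREATE TABLE IF NOT EXISTS my_keyspace.events (
                    id uuid PRIMARY KEY,
                    payload text
                ) WITH compaction = {
                    'class': 'UnifiedCompactionStrategy',
                    'scaling_parameters': 'T4'
                }""");
        }
    }
}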
Compaction Internals: What Happens During Compaction
The Compaction Manager
The CompactionManager orchestrates all compaction activity. Understanding its thread pools helps diagnose compaction issues:
public class CompactionManager implements CompactionManagerMBean
{
// Thread pools for different compaction types
private final CompactionExecutor executor;
private final ValidationExecutor validationExecutor;
private final ViewBuildExecutor viewBuildExecutor;
// Rate limiting for I/O control
private final RateLimiter rateLimiter;
}
Compaction Types You'll Encounter:
- Background: Normal ongoing compactions—you want these running constantly
- Major: User-initiated full compaction—avoid in production if possible
- Validation: Merkle tree building for repair—CPU intensive
- Anticompaction: Post-repair SSTable segregation
- Cleanup: Remove out-of-range data after topology changes
- Scrub: Fix corrupted SSTables—last resort
- Upgrade: Rewrite to newer SSTable format
The Compaction Iterator: Where Merging Happens
The CompactionIterator efficiently merges multiple SSTables:
public class CompactionIterator extends CompactionInfo.Holder
implements UnfilteredPartitionIterator
{
// Merges N SSTables, purging tombstones and expired data
// Tracks merge statistics for monitoring
// Supports cancellation for operational flexibility
}
Optimization Techniques the Iterator Uses:
- Zero-copy merging: Direct buffer manipulation, no serialization
- Lazy deserialization: Only parse data when needed
- Bloom filter short-circuits: Skip SSTables that can't contain the key
- Tombstone purging: Remove obsolete deletion markers when safe
Tombstones: The Hidden Performance Killer
Every DELETE creates a tombstone. Every null column creates a tombstone. They accumulate until compaction removes them—and they can't be removed until gc_grace_seconds passes and all replicas have the tombstone.
Tombstone Lifecycle:
- DELETE creates tombstone with timestamp
- Tombstone preserved for gc_grace_seconds (default: 10 days)
- After grace period + repair completion, tombstone eligible for removal
- Removed during compaction if all replicas confirmed to have seen it
Protection Mechanisms from cassandra.yaml:
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
For your data model, this means: Design to minimize deletes. Use TTL instead of DELETE when possible. If you must delete, ensure regular repairs run.
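In practice that often means writing data with a TTL up front rather than deleting it later. A hedged driver sketch with placeholder names:
import com.datastax.oss.driver.api.core.CqlSession;

public class TtlInsteadOfDelete {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // The row expires after 7 days (604800s); no explicit DELETE,
            // so no application-issued tombstone writes.
            session.execute(
                "INSERT INTO my_keyspace.events (id, payload) " +
                "VALUES (uuid(), 'sensor-reading') USING TTL 604800");
        }
    }
}
Expired cells still pass through compaction, but paired with TWCS, whole windows of expired data drop without ever being rewritten.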
Advanced Purging (5.0+):
# Only purge tombstones from repaired data
compaction:
  options:
    only_purge_repaired_tombstones: true
Performance Optimization: Practical Tuning
Flush Optimization
From ColumnFamilyStore.java:
private final class Flush implements Runnable
{
final OpOrder.Barrier writeBarrier;
final Map<ColumnFamilyStore, Memtable> memtables;
// Coordinates memtable switch across base table and indexes
// Ensures atomic commit log position tracking
// Parallelizes flush across data directories
}
Flush Triggers:
- Memtable size threshold reached
- Time-based expiration (memtable_flush_period)
- Commit log pressure (segment full)
- Manual nodetool flush
Compaction Throttling: Don't Compete with Application I/O
Dynamic Rate Limiting:
protected void compactionRateLimiterAcquire(RateLimiter limiter,
long bytesScanned,
long lastBytesScanned,
double compressionRatio)
{
double bytesToThrottle = (bytesScanned - lastBytesScanned) * compressionRatio;
while (bytesToThrottle >= 1024)
{
limiter.acquire(1024);
bytesToThrottle -= 1024;
}
}
For your SLA requirements, this means: If read latency spikes during compaction, increase throttling. If pending compactions grow, decrease it.
SSTable Preemptive Opening: Smooth Transitions
Enables reading from SSTables before compaction completes—reduces the "cliff" effect when old SSTables are replaced:
sstable_preemptive_open_interval: 50MiB
Benefits:
- Smoother transition between old and new SSTables
- Reduced page cache churn—hot data stays warm
- Maintains "hot" data accessibility during compaction
Strategy Selection: A Decision Guide
Decision Matrix
| Workload Pattern | Recommended Strategy | Why |
|---|---|---|
| Heavy writes, few reads | STCS or UCS(N) | Lowest write amplification |
| Read-heavy, bounded dataset | LCS or UCS(L8) | Predictable read performance |
| Time-series with TTL | TWCS | Efficient expiration, entire SSTables drop |
| Mixed workload, unsure | UCS(T4) | Adapts to changing patterns |
| High update rate | UCS with density leveling | Handles overwritten data efficiently |
Migration Between Strategies
STCS → LCS (Warning: I/O intensive):
-- All SSTables will be reorganized into levels—expect a disk I/O spike
ALTER TABLE keyspace.table
WITH compaction = {'class': 'LeveledCompactionStrategy'};
Any → UCS (Smoother transition):
-- UCS adapts to the existing SSTable distribution—less disruptive
ALTER TABLE keyspace.table
WITH compaction = {'class': 'UnifiedCompactionStrategy',
                   'scaling_parameters': 'T4'};
Advanced Features: When You Need More Control
Garbage Collection Compaction (4.0+)
Proactively remove deleted data by consulting overlapping SSTables—useful after bulk deletes:
nodetool garbagecollect -g CELL keyspace table
Granularity Levels:
- ROW: Discard fully deleted rows only
- CELL: Discard individual deleted/overwritten cells—more aggressive
- NONE: Standard compaction behavior
User-Defined Compaction
Force compaction of specific SSTables—useful for targeted cleanup or maintenance windows:
nodetool compact --user-defined <sstable1-Data.db> <sstable2-Data.db>
Subrange Compaction
Compact specific token ranges—essential for partial maintenance:
nodetool compact -st <start_token> -et <end_token> keyspace table
Major Compaction Parallelization (UCS, 5.1+)
Cassandra 5.1 introduced parallel major compaction for UCS:
# Control parallelism for major compactions
nodetool compact --jobs 4 keyspace table
Compaction Monitoring: Early Warning Systems
Key Metrics That Predict Problems
JMX MBeans:
org.apache.cassandra.metrics:type=Compaction
- PendingTasks: Estimated remaining compactions—watch for sustained growth
- CompletedTasks: Total compactions completed
- BytesCompacted: Total data processed
- TotalCompactionsCompleted: Historical count
nodetool Commands:
# Current compaction status—run this during incidents
nodetool compactionstats
# Compaction history—useful for capacity planning
nodetool compactionhistory
# Per-table compaction parameters—verify configuration (in cqlsh)
DESCRIBE TABLE keyspace.table
For your monitoring stack, this means: Alert on PendingTasks > 100 sustained for 30+ minutes. Track BytesCompacted/sec as a health indicator.
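If you'd rather poll the metric directly than parse nodetool output, the pending-compactions gauge is a single JMX read. A minimal hedged sketch assuming unauthenticated JMX on the default port 7199:
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionWatch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName pending = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
            // Gauges expose their reading as the "Value" attribute.
            Object value = mbs.getAttribute(pending, "Value");
            System.out.println("Pending compactions: " + value);
            // Alert if this stays above ~100 for 30+ minutes (see above).
        }
    }
}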
Symptoms That Indicate Compaction Problems
Too Many Pending Compactions:
- Root causes: Insufficient concurrent_compactors, low compaction_throughput, write rate exceeding compaction capacity
- Consequence: Read latency increases as queries hit more SSTables
- Fix: Increase throughput, add nodes, or reduce write rate temporarily
Compaction Stalls:
- Root causes: Disk space exhaustion, JVM GC pressure, large partition processing
- Consequence: Pending count grows indefinitely, cluster destabilizes
- Fix: Free disk space, tune JVM, consider partition splitting
Configuration Best Practices
Match Your Strategy to Data Access Patterns
-- Time-series data with TTL: TWCS drops entire SSTables efficiently
CREATE TABLE metrics.data_points (
sensor_id uuid,
timestamp timestamp,
value double,
PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'HOURS',
'compaction_window_size': 1};
-- User profiles with random reads: UCS with leveling for predictable latency
CREATE TABLE users.profiles (
user_id uuid PRIMARY KEY,
name text,
email text
) WITH compaction = {'class': 'UnifiedCompactionStrategy',
'scaling_parameters': 'L8'};
Reserve Disk Space for Compaction
Compaction requires temporary space—plan for it:
- STCS: Up to 50% of data size (worst case: 2 SSTables merge)
- LCS: ~10% of level being compacted
- TWCS: One window's worth
- UCS: Configurable via max_space_overhead, typically 20-30%
For your capacity planning, this means: Never run disks above 50% utilization with STCS. Monitor df -h alongside pending compactions.
Tune for Your Storage Hardware
SSD-Based Clusters (most modern deployments):
concurrent_compactors: <num_cores>
compaction_throughput: 0 # No throttling—SSDs handle concurrent I/O
disk_optimization_strategy: ssd
HDD-Based Clusters (legacy or cost-optimized):
concurrent_compactors: <num_disks>
compaction_throughput: 64MiB/s # Throttle to avoid I/O contention
disk_optimization_strategy: spinning
Default to UCS for New Deployments
The Unified Compaction Strategy should be your default for:
- New clusters running Cassandra 5.0+
- Mixed or uncertain workloads
- Clusters with evolving access patterns
- Teams that want operational simplicity over fine-tuned control
Troubleshooting: When Things Go Wrong
Read Latency Increasing Despite Stable Write Rate
Diagnosis:
nodetool tablestats keyspace.table | grep "SSTable count"
Root Cause Analysis:
- High SSTable count → compaction falling behind → increase compaction_throughput
- Normal SSTable count but slow → partition cache misses → check hot partition distribution
- SSTable count low but reads still slow → consider BTI format migration (28% faster reads)
Disk Space Filling During Compaction
Diagnosis:
nodetool compactionstats
df -h /var/lib/cassandra/data
Fixes:
- Reduce compaction_throughput temporarily to slow consumption
- Run nodetool cleanup if recent topology changes freed token ranges
- Migrate from STCS to LCS/UCS for lower space amplification
- Shorten gc_grace_seconds—only after ensuring all replicas run regular repairs (see the sketch below)
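Shortening the grace period is a per-table CQL change. A hedged driver sketch—the three-day value and table name are placeholders, and this is only safe if repairs reliably complete well inside that window:
import com.datastax.oss.driver.api.core.CqlSession;

public class ShortenGcGrace {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // 259200s = 3 days; must exceed your repair cadence with margin,
            // or deleted data can resurrect on under-repaired replicas.
            session.execute(
                "ALTER TABLE my_keyspace.events WITH gc_grace_seconds = 259200");
        }
    }
}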
Write Amplification Causing I/O Saturation
Symptoms: High disk I/O despite moderate write load
Strategy-Specific Fixes:
- STCS: Expected behavior—consider if LCS/UCS better fits workload
- LCS: If update-heavy, LCS may not be ideal—evaluate UCS or TWCS
- UCS: Tune scaling_parameters for lower fanout (T→L transition)
What's Coming: Future Directions
Compaction Improvements on the Roadmap
- Continuous Compaction: Background compaction without explicit tasks
- ML-Driven Strategy Selection: Automatic strategy optimization based on workload analysis
- Cross-Node Compaction Coordination: Reduce cluster-wide compaction impact
- Tiered Storage Support: Automatic migration to cold storage tiers
Your Compaction Strategy Matters More Than You Think
The choice you make today—STCS, LCS, TWCS, or UCS—will determine your operational burden for years. Discord learned this the hard way with tombstone accumulation. Netflix optimized their way to handling petabytes.
Start here:
- Audit your current tables. Run nodetool tablestats and identify tables with high SSTable counts or pending compactions
- Evaluate UCS migration. If you're on Cassandra 5.0+, test UCS with scaling_parameters=T4 on a non-critical table
- Enable BTI format. The 28% read improvement requires no application changes—just a configuration update
- Set up monitoring. Track PendingTasks, BytesCompacted/sec, and disk utilization as leading indicators
The storage engine is where Cassandra's trade-offs become real. Understand the write path, respect tombstone lifecycle, and choose compaction strategies based on data, not defaults.
Next in Series: Distributed Systems Deep Dive →
Sources and References
Apache Cassandra Documentation and Blogs
- Apache Cassandra 5.0 Improvements Blog — TrieMemtable 43% GC reduction, 9% throughput improvement
- BTI Format Blog Post — 28% faster reads with Trie-based index format
- Direct I/O Support — Bypassing OS page cache for predictable performance
- Unified Compaction Strategy — The future of Cassandra compaction
Source Code References
- TrieMemtable.java — Memory-efficient memtable implementation
- CompactionStrategy.java — Base compaction strategy interface
- UnifiedCompactionStrategy.java — UCS implementation
- CompactionIterator.java — SSTable merging logic
- ColumnFamilyStore.java — Flush coordination
- cassandra.yaml — Tombstone threshold configuration
Industry Case Studies
- Discord: How Discord Stores Trillions of Messages — Tombstone accumulation lessons
- Netflix: Scaling Time Series Data Storage — Large-scale Cassandra operations