José David Baena


Your Compaction Strategy Choice Will Cost You 10x in Write Amplification (Or Save It)


The moment you realize your 50% disk space headroom isn't a safety margin—it's STCS's tax on your infrastructure budget.

TL;DR: Compaction strategy choice impacts write amplification by 5-10x, disk usage by 10-50%, and read latency by 3-5x. The new Unified Compaction Strategy (UCS) in Cassandra 5.0 adapts to workload changes automatically, eliminating the "which strategy should I pick?" problem. TrieMemtable reduces GC pauses by 43% compared to SkipListMemtable. As of December 2025, there's no reason to use the old strategies on new deployments.

The production incident that taught me this: your time-series cluster is running fine—until an ad-hoc analytics query triggers a tombstone storm. Like the teams at Discord who have documented their Cassandra scaling challenges, you discover that a read touching 100,000 tombstones trips the tombstone failure threshold and aborts—every affected query fails. The root cause? The wrong compaction strategy for the workload.

Series Navigation

Post 2 of 7 in the Apache Cassandra Exploration Series

This post covers: Memtables (TrieMemtable vs SkipList), SSTable formats (BTI vs BIG), compaction strategies (STCS, LCS, TWCS, UCS)

Prerequisites: Architecture Overview—understand token rings and gossip first

Next: Distributed Systems covers consistency levels and replication strategies

Related: Performance Benchmarks shows real numbers for each compaction strategy

Understanding compaction isn't optional—it's the difference between a cluster that purrs and one that pages you at 3 AM.

The storage engine is where Cassandra's performance magic happens, and compaction is where it goes wrong if you're not paying attention.

Read path: Memtable → L0 → L1 → L2 → ... (SSTables checked in order until the row is assembled).
Write path: Write → Memtable (sorted in-memory buffer) → Flush to an L0 SSTable → Compact down through the levels.

Write amplification is the ratio of bytes written to disk versus bytes written by the application; tracking flushes, compactions, and SSTable counts together is how you spot a strategy that is falling behind.

Memtables: TrieMemtable Cuts GC Pauses by 43%

SkipListMemtable vs. TrieMemtable: The Numbers Matter

Cassandra uses memtables as an in-memory write buffer before data is persisted to disk. Version 5.0 introduced a pluggable memtable API, and the choice matters more than you'd think.

SkipListMemtable (Legacy):

  • Concurrent skip list data structure
  • Thread-safe without locks
  • Higher GC pressure with frequent writes
  • Well-tested, stable implementation

TrieMemtable (Modern - Recommended):

From cassandra.yaml:

memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
    default:
      inherits: skiplist  # Conservative default

TrieMemtable advantages (Apache Cassandra 5.0 announcement):

  • Trie-based layout with lower per-partition memory overhead than a skip list
  • Sharded internally, so concurrent writes contend less on a single structure
  • Substantially less garbage generated under write load—the ~43% GC-pause reduction cited above

For your production clusters, this means: Switch to TrieMemtable. The performance gains are free, and the reduced GC pressure alone justifies the change.
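
A minimal sketch of what that switch looks like, reusing the 5.0 configuration layout shown above (the per-table option at the end is an assumption you should verify against your build):

memtable:
  configurations:
    skiplist:
      class_name: SkipListMemtable
    trie:
      class_name: TrieMemtable
    default:
      inherits: trie  # Make TrieMemtable the cluster-wide default

Individual tables can also opt in explicitly—for example ALTER TABLE my_keyspace.hot_table WITH memtable = 'trie';—if you prefer to roll the change out table by table.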

Implementation Note from ColumnFamilyStore.java:

public Memtable createMemtable(AtomicReference<CommitLogPosition> commitLogUpperBound)
{
    return memtableFactory.create(commitLogUpperBound, metadata, this);
}

Commit Log: Your Durability Guarantee

The commit log provides durability for writes before they're flushed to SSTables. This is the only synchronous disk I/O on the write path—and it's sequential, which is why SSDs aren't mandatory for Cassandra (though they help).

Sync Modes:

  1. Periodic (default): Fsync every 10 seconds—fast but up to 10s of data loss on crash
  2. Batch: Fsync before acknowledging write—slowest but zero data loss
  3. Group: Block for configurable period between fsyncs—middle ground
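
A hedged cassandra.yaml sketch of the three modes—key names vary slightly by version (older releases use the *_in_ms variants), so verify against your cluster before copying:

# Periodic (default): acknowledge before fsync
commitlog_sync: periodic
commitlog_sync_period: 10000ms

# Batch: fsync before acknowledging each write
# commitlog_sync: batch

# Group: amortize fsyncs across a small window of writes
# commitlog_sync: group
# commitlog_sync_group_window: 15ms

Only one mode is active at a time; the commented blocks show the alternatives.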

Direct I/O Support (5.0+) from Apache Cassandra blog:

  • Available when commit log is uncompressed and unencrypted
  • Reduces memory mapping overhead
  • Minimizes page cache pollution
  • Better for high-throughput workloads

From cassandra.yaml:

commitlog_disk_access_mode: legacy  # legacy, mmap, direct, standard

For your durability requirements, this means: Match sync mode to your tolerance for data loss. Financial transactions need batch; analytics can use periodic.

BTI SSTable Format: 28% Faster Reads, 28% Smaller Indexes

BIG Format (Legacy) vs. BTI Format (5.0+)

BIG Format:

  • Used since Cassandra 3.0
  • Partition index with index summary
  • Bloom filters for existence checks
  • Key cache for index summary entries
  • Column index for wide partitions

BTI Format (Trie-Indexed - 5.0+):

The BTI format represents a fundamental improvement (Apache Cassandra 5.0 Features):

Key Innovations:

  • Trie-based partition index (no index summary needed)
  • Eliminates key cache requirement—no warm-up time
  • More efficient for partitions with millions of rows
  • Smaller on-disk footprint
  • Faster point queries

Performance Impact (from Apache Cassandra benchmarks):

  • 20-30% smaller index size
  • 28% faster partition lookups
  • No warm-up time (no key cache to populate)
  • Better cache-line utilization

For your migration planning, this means: New clusters should default to BTI. Existing clusters can migrate by rewriting SSTables during compaction.
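
As a sketch, selecting the format in 5.0 looks roughly like this—verify the exact key against your cassandra.yaml, since this is the layout I'd expect rather than a guarantee:

sstable:
  selected_format: bti  # 'big' remains the default for compatibility

Newly flushed and newly compacted SSTables then use BTI; running nodetool upgradesstables -a keyspace table forces a rewrite of the existing ones.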

Index Granularity Configuration:

column_index_size: 4KiB  # BIG default: 64KiB, BTI default: 16KiB

Compaction: Where Strategy Choice Makes or Breaks Performance

Compaction is the process of merging SSTables to reclaim space, remove deleted data, and improve read performance. Pick wrong, and you'll pay with latency spikes, disk space explosions, or write amplification that burns through your SSDs.


Size-Tiered (STCS): The 50% Disk Tax

Algorithm from SizeTieredCompactionStrategy.java:

public static List<SSTableReader> mostInterestingBucket(List<List<SSTableReader>> buckets, 
                                                        int minThreshold, 
                                                        int maxThreshold)
{
    // Buckets grouped by size similarity
    // Most interesting = largest average hotness
    // Hotness = read rate per byte
}

The numbers that matter:

  • Best for: Insert-heavy workloads, time-series data (without TTL)
  • Space amplification: ~50% (worst case: 100%)—you need 2x the data size in free disk
  • Read amplification: O(log N) SSTables—reads slow as data grows
  • Write amplification: ~2-3x—lowest of all strategies

Configuration:

compaction:
  class: SizeTieredCompactionStrategy
  options:
    min_threshold: 4      # Minimum SSTables to compact
    max_threshold: 32     # Maximum SSTables per compaction
    bucket_high: 1.5      # Size similarity factor
    bucket_low: 0.5

For your capacity planning, this means: Provision 2x your expected data size in disk space. STCS will use it during compaction spikes.
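
For reference, the same options applied per table in CQL (keyspace and table names here are placeholders):

ALTER TABLE my_keyspace.events
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'min_threshold': 4,
                     'max_threshold': 32};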

Leveled (LCS): Predictable Reads, 10x Write Cost

From LeveledCompactionStrategy.java:

Level Organization:

  • L0: Newly flushed SSTables (may overlap)
  • L1-LN: Fixed-size, non-overlapping SSTables
  • Each level is 10x the size of the previous (configurable via fanout_size)

Algorithm Highlights:

public class LeveledManifest
{
    // Maximum bytes for level = fanout^level * max_sstable_size
    public long maxBytesForLevel(int level, long maxSSTableSizeInBytes)
    {
        return level == 0 ? 4 * maxSSTableSizeInBytes 
                          : (long) Math.pow(levelFanoutSize, level) * maxSSTableSizeInBytes;
    }
}

The numbers that matter:

  • Best for: Read-heavy workloads with bounded dataset size
  • Space amplification: ~10%—much better than STCS
  • Read amplification: 1 SSTable per level (typically 1-2 SSTables total)
  • Write amplification: ~10x—data is rewritten at each level promotion

Configuration:

compaction:
  class: LeveledCompactionStrategy
  options:
    sstable_size_in_mb: 160  # Current default (raised long ago from the original 5MB)
    fanout_size: 10          # Level size multiplier

For your write-heavy workloads, this means: LCS will burn through SSD endurance 3-5x faster than STCS. Check your SSD DWPD ratings.
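
To make the fanout concrete: with the defaults above (160 MiB SSTables, fanout 10), maxBytesForLevel works out to roughly 640 MiB for L0, 1.6 GiB for L1, 16 GiB for L2, 160 GiB for L3, and 1.6 TiB for L4—so a terabyte-scale table spans four to five levels, and each newly written byte is rewritten roughly once per level it passes through.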

Time Window (TWCS): The Time-Series Specialist

Optimized for time-series data with TTL. If you're storing metrics, logs, or events with expiration, this is your strategy.

From TimeWindowCompactionStrategy.java:

public static Pair<Long,Long> getWindowBoundsInMillis(TimeUnit windowTimeUnit, 
                                                       int windowTimeSize, 
                                                       long timestampInMillis)
{
    // Creates time-based buckets for SSTables
    // Allows entire window drops when all data expires
}

The numbers that matter:

  • Best for: Time-series with TTL, append-only workloads
  • Groups SSTables by time window—entire windows drop when all data expires
  • Minimal compaction overhead for aged data
  • ~20% space amplification

Configuration:

compaction:
  class: TimeWindowCompactionStrategy
  options:
    compaction_window_unit: DAYS
    compaction_window_size: 1

For your time-series data, this means: Match window size to your TTL. If data expires after 7 days, use 1-day windows. Entire SSTables drop—no compaction needed.
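
A minimal sketch tying TTL to window size (keyspace, table, and column names are illustrative): with a 7-day TTL and 1-day windows, roughly eight windows are live at any moment, and fully expired windows are simply dropped:

CREATE TABLE metrics.sensor_readings (
    sensor_id uuid,
    ts timestamp,
    value double,
    PRIMARY KEY (sensor_id, ts)
) WITH default_time_to_live = 604800  -- 7 days
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};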

Unified (UCS): The "I Don't Want to Think About It" Strategy

UCS, introduced in Cassandra 5.0, is a single strategy that subsumes the older ones: its scaling parameters let it behave like STCS, like LCS, or anywhere in between, and it adapts as the workload changes—no more "should I use STCS or LCS?" debates.

From UnifiedCompactionStrategy.md:

Core Concepts

1. Size-Based Levels:

  • Similar to LCS but with adaptive sizing
  • Levels sized by survival ratio, not fixed fanout
  • Better handles variable write patterns

2. Density Leveling:

  • Compacts based on data density (bytes per token)
  • Prevents hot spots from blocking compaction
  • More uniform SSTable distribution

3. Adaptive Sharding:

Basic Sharding: Splits output by token ranges
Full Sharding: Parallel compaction of independent shards

4. Output Parallelization (5.1+):

From UnifiedCompactionStrategy.java:

private List<AbstractCompactionTask> createParallelCompactionTasks(
    LifecycleTransaction transaction, 
    long gcBefore)
{
    // Splits compaction into per-shard tasks
    // Dramatically reduces compaction duration
    // Particularly beneficial for major compactions
}

Configuration:

compaction:
  class: UnifiedCompactionStrategy
  options:
    scaling_parameters: "T4"  # Threshold and fanout
    target_sstable_size: "1GiB"
    base_shard_count: 4
    parallelize_output_shards: true  # Enable parallel compaction

Scaling Parameters Cheat Sheet:

  • N: threshold=2, fanout=2—the neutral middle ground between tiered and leveled (equivalent to T2/L2)
  • T4: threshold=4, fanout=4—balanced, a sensible default for most workloads
  • T8 and higher: increasingly STCS-like—lowest write amplification, best for pure writes
  • L8: threshold=2, fanout=8—LCS-like, better read latency for read-heavy tables

For new deployments, this means: Use UCS with T4. It's the "I don't know my access patterns yet" safe choice.

Compaction Internals: What Happens During Compaction

The Compaction Manager

The CompactionManager orchestrates all compaction activity. Understanding its thread pools helps diagnose compaction issues:

public class CompactionManager implements CompactionManagerMBean
{
    // Thread pools for different compaction types
    private final CompactionExecutor executor;
    private final ValidationExecutor validationExecutor;
    private final ViewBuildExecutor viewBuildExecutor;
    
    // Rate limiting for I/O control
    private final RateLimiter rateLimiter;
}

Compaction Types You'll Encounter:

  1. Background: Normal ongoing compactions—you want these running constantly
  2. Major: User-initiated full compaction—avoid in production if possible
  3. Validation: Merkle tree building for repair—CPU intensive
  4. Anticompaction: Post-repair SSTable segregation
  5. Cleanup: Remove out-of-range data after topology changes
  6. Scrub: Fix corrupted SSTables—last resort
  7. Upgrade: Rewrite to newer SSTable format

The Compaction Iterator: Where Merging Happens

The CompactionIterator efficiently merges multiple SSTables:

public class CompactionIterator extends CompactionInfo.Holder 
    implements UnfilteredPartitionIterator
{
    // Merges N SSTables, purging tombstones and expired data
    // Tracks merge statistics for monitoring
    // Supports cancellation for operational flexibility
}

Optimization Techniques the Iterator Uses:

  • Zero-copy merging: Direct buffer manipulation, no serialization
  • Lazy deserialization: Only parse data when needed
  • Bloom filter short-circuits: Skip SSTables that can't contain the key
  • Tombstone purging: Remove obsolete deletion markers when safe

Tombstones: The Hidden Performance Killer

Every DELETE creates a tombstone. Every null column creates a tombstone. They accumulate until compaction removes them—and they can't be removed until gc_grace_seconds passes and all replicas have the tombstone.

Tombstone Lifecycle:

  1. DELETE creates tombstone with timestamp
  2. Tombstone preserved for gc_grace_seconds (default: 10 days)
  3. After grace period + repair completion, tombstone eligible for removal
  4. Removed during compaction if all replicas confirmed to have seen it

Protection Mechanisms from cassandra.yaml:

tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000

For your data model, this means: Design to minimize deletes. Use TTL instead of DELETE when possible. If you must delete, ensure regular repairs run.
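
For example, an expiring write (illustrative keyspace, table, and columns) replaces the delete-later pattern—expired cells still become tombstones, but they live alongside their data and age out predictably:

INSERT INTO sessions.user_sessions (session_id, user_id, created_at)
VALUES (uuid(), uuid(), toTimestamp(now()))
USING TTL 86400;  -- expires after 24 hours instead of requiring a later DELETE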

Advanced Purging (5.0+):

# Only purge tombstones from repaired data
compaction:
  options:
    only_purge_repaired_tombstones: true

Performance Optimization: Practical Tuning

Flush Optimization

From ColumnFamilyStore.java:

private final class Flush implements Runnable
{
    final OpOrder.Barrier writeBarrier;
    final Map<ColumnFamilyStore, Memtable> memtables;
    
    // Coordinates memtable switch across base table and indexes
    // Ensures atomic commit log position tracking
    // Parallelizes flush across data directories
}

Flush Triggers:

  • Memtable size threshold reached
  • Time-based expiration (memtable_flush_period)
  • Commit log pressure (segment full)
  • Manual nodetool flush

Compaction Throttling: Don't Compete with Application I/O

Dynamic Rate Limiting:

protected void compactionRateLimiterAcquire(RateLimiter limiter, 
                                            long bytesScanned,
                                            long lastBytesScanned, 
                                            double compressionRatio)
{
    double bytesToThrottle = (bytesScanned - lastBytesScanned) * compressionRatio;
    while (bytesToThrottle >= 1024)
    {
        limiter.acquire(1024);
        bytesToThrottle -= 1024;
    }
}

For your SLA requirements, this means: If read latency spikes during compaction, increase throttling. If pending compactions grow, decrease it.
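
Both knobs can be adjusted at runtime without a restart (value in MB/s; 0 disables throttling):

# Inspect and adjust this node's compaction rate limit on the fly
nodetool getcompactionthroughput
nodetool setcompactionthroughput 32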

SSTable Preemptive Opening: Smooth Transitions

Enables reading from SSTables before compaction completes—reduces the "cliff" effect when old SSTables are replaced:

sstable_preemptive_open_interval: 50MiB

Benefits:

  • Smoother transition between old and new SSTables
  • Reduced page cache churn—hot data stays warm
  • Maintains "hot" data accessibility during compaction

Strategy Selection: A Decision Guide

Decision Matrix

Workload Pattern | Recommended Strategy | Why
Heavy writes, few reads | STCS or UCS with a high tiered setting (e.g., T8) | Lowest write amplification
Read-heavy, bounded dataset | LCS or UCS (L8) | Predictable read performance
Time-series with TTL | TWCS | Efficient expiration—entire SSTables drop
Mixed workload, unsure | UCS (T4) | Adapts to changing patterns
High update rate | UCS with density leveling | Handles overwritten data efficiently

Migration Between Strategies

STCS → LCS (Warning: I/O intensive):

-- All SSTables will be reorganized into levels—expect a sustained disk I/O spike
ALTER TABLE <keyspace>.<table>
  WITH compaction = {'class': 'LeveledCompactionStrategy'};

Any → UCS (Smoother transition):

-- UCS adapts to the existing SSTable distribution—less disruptive
ALTER TABLE <keyspace>.<table>
  WITH compaction = {'class': 'UnifiedCompactionStrategy',
                     'scaling_parameters': 'T4'};

ALTER TABLE applies the change cluster-wide. For a node-local trial first, the table's CompactionParametersJson JMX attribute can override the strategy on a single node before you commit the schema change.

Advanced Features: When You Need More Control

Garbage Collection Compaction (4.0+)

Proactively remove deleted data by consulting overlapping SSTables—useful after bulk deletes:

nodetool garbagecollect -g CELL keyspace table

Granularity Levels:

  • ROW: Discard fully deleted rows only
  • CELL: Discard individual deleted/overwritten cells—more aggressive
  • NONE: Standard compaction behavior

User-Defined Compaction

Force compaction of specific SSTables—useful for targeted cleanup or maintenance windows:

nodetool compact --user-defined /path/to/sstable1-Data.db /path/to/sstable2-Data.db

Subrange Compaction

Compact specific token ranges—essential for partial maintenance:

nodetool compact -st <start_token> -et <end_token> keyspace table

Major Compaction Parallelization (UCS, 5.1+)

Cassandra 5.1 introduced parallel major compaction for UCS:

# Control parallelism for major compactions
nodetool compact --jobs 4 keyspace table

Compaction Monitoring: Early Warning Systems

Key Metrics That Predict Problems

JMX MBeans:

org.apache.cassandra.metrics:type=Compaction
  - PendingTasks: Estimated remaining compactions—watch for sustained growth
  - CompletedTasks: Total compactions completed
  - BytesCompacted: Total data processed
  - TotalCompactionsCompleted: Historical count

nodetool Commands:

# Current compaction status—run this during incidents
nodetool compactionstats
 
# Compaction history—useful for capacity planning
nodetool compactionhistory
 
# Per-table compaction thresholds—verify configuration
nodetool getcompactionthreshold keyspace table

For your monitoring stack, this means: Alert on PendingTasks > 100 sustained for 30+ minutes. Track BytesCompacted/sec as a health indicator.
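
A rough shell check you could wire into an alerting cron—the awk pattern assumes the default "pending tasks: N" line in compactionstats output, so adjust it if your version formats the summary differently:

#!/usr/bin/env bash
# Warn when pending compactions stay above the alerting threshold
pending=$(nodetool compactionstats | awk '/pending tasks/ {print $3; exit}')
if [ "${pending:-0}" -gt 100 ]; then
  echo "WARNING: ${pending} pending compactions"
fi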

Symptoms That Indicate Compaction Problems

Too Many Pending Compactions:

  • Root causes: Insufficient concurrent_compactors, low compaction_throughput, write rate exceeding compaction capacity
  • Consequence: Read latency increases as queries hit more SSTables
  • Fix: Increase throughput, add nodes, or reduce write rate temporarily

Compaction Stalls:

  • Root causes: Disk space exhaustion, JVM GC pressure, large partition processing
  • Consequence: Pending count grows indefinitely, cluster destabilizes
  • Fix: Free disk space, tune JVM, consider partition splitting

Configuration Best Practices

Match Your Strategy to Data Access Patterns

-- Time-series data with TTL: TWCS drops entire SSTables efficiently
CREATE TABLE metrics.data_points (
    sensor_id uuid,
    timestamp timestamp,
    value double,
    PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'HOURS',
                     'compaction_window_size': 1};
 
-- User profiles with random reads: UCS with leveling for predictable latency
CREATE TABLE users.profiles (
    user_id uuid PRIMARY KEY,
    name text,
    email text
) WITH compaction = {'class': 'UnifiedCompactionStrategy',
                     'scaling_parameters': 'L8'};

Reserve Disk Space for Compaction

Compaction requires temporary space—plan for it:

  • STCS: Up to 100% of the table's size in the worst case (a full merge into one SSTable); plan for ~50% routinely
  • LCS: ~10% of level being compacted
  • TWCS: One window's worth
  • UCS: Configurable via max_space_overhead, typically 20-30%

For your capacity planning, this means: Never run disks above 50% utilization with STCS. Monitor df -h alongside pending compactions.

Tune for Your Storage Hardware

SSD-Based Clusters (most modern deployments):

concurrent_compactors: <num_cores>
compaction_throughput: 0  # No throttling—SSDs handle concurrent I/O
disk_optimization_strategy: ssd

HDD-Based Clusters (legacy or cost-optimized):

concurrent_compactors: <num_disks>
compaction_throughput: 64MiB/s  # Throttle to avoid I/O contention
disk_optimization_strategy: spinning

Default to UCS for New Deployments

The Unified Compaction Strategy should be your default for:

  • New clusters running Cassandra 5.0+
  • Mixed or uncertain workloads
  • Clusters with evolving access patterns
  • Teams that want operational simplicity over fine-tuned control

Troubleshooting: When Things Go Wrong

Read Latency Increasing Despite Stable Write Rate

Diagnosis:

nodetool tablestats keyspace.table | grep "SSTable count"

Root Cause Analysis:

  • High SSTable count → compaction falling behind → increase compaction_throughput
  • Normal SSTable count but slow → partition cache misses → check hot partition distribution
  • SSTable count low but reads still slow → consider BTI format migration (28% faster reads)

Disk Space Filling During Compaction

Diagnosis:

nodetool compactionstats
df -h /var/lib/cassandra/data

Fixes:

  1. Reduce compaction_throughput temporarily to slow consumption
  2. Run nodetool cleanup if recent topology changes freed token ranges
  3. Migrate from STCS to LCS/UCS for lower space amplification
  4. Shorten gc_grace_seconds (only after ensuring all replicas run regular repairs)

Write Amplification Causing I/O Saturation

Symptoms: High disk I/O despite moderate write load

Strategy-Specific Fixes:

  • STCS: Expected behavior—consider if LCS/UCS better fits workload
  • LCS: If update-heavy, LCS may not be ideal—evaluate UCS or TWCS
  • UCS: Shift scaling_parameters toward more tiered behavior (L→T, or a higher T value) to trade some read amplification for less rewriting

What's Coming: Future Directions

Compaction Improvements on the Roadmap

  1. Continuous Compaction: Background compaction without explicit tasks
  2. ML-Driven Strategy Selection: Automatic strategy optimization based on workload analysis
  3. Cross-Node Compaction Coordination: Reduce cluster-wide compaction impact
  4. Tiered Storage Support: Automatic migration to cold storage tiers

Your Compaction Strategy Matters More Than You Think

The choice you make today—STCS, LCS, TWCS, or UCS—will determine your operational burden for years. Discord learned this the hard way with tombstone accumulation. Netflix optimized their way to handling petabytes.

Start here:

  1. Audit your current tables. Run nodetool tablestats and identify tables with high SSTable counts or pending compactions
  2. Evaluate UCS migration. If you're on Cassandra 5.0+, test UCS with scaling_parameters=T4 on a non-critical table
  3. Enable BTI format. The 28% read improvement requires no application changes—just a configuration update
  4. Set up monitoring. Track PendingTasks, BytesCompacted/sec, and disk utilization as leading indicators

The storage engine is where Cassandra's trade-offs become real. Understand the write path, respect tombstone lifecycle, and choose compaction strategies based on data, not defaults.


Next in Series: Distributed Systems Deep Dive →


Sources and References

Apache Cassandra Documentation and Blogs

Source Code References

Industry Case Studies