Care and Feeding of a Cassandra Cluster

Leveled Compaction

The Algorithm

Leveled compaction assigns each SSTable a level. A partition will only exist in one SSTable per level.

Animation 1: Here’s our first SSTable flushed to disk. It has five partitions, each identified by their token key.

Animation 2: Since this is a newly flushed SSTable, leveled compaction considers this SSTable to be in level zero (L0). Generally, any SSTable residing in level zero is immediately compacted down to lower levels. Since there are no other SSTables to compact with, this table remains in L0 for the time being.

Animation 3: Here’s a second flushed SSTable. Leveled compaction will now compact the two L0 SSTables together.

Animation 4: Leveled compaction always compacts down to the level immediately below the current one. In this case, that’s level 1 (L1).

Animation 5: As with all compaction, we combine partitions from the two source SSTables into a new SSTable. In these illustrations, we assume no updates or deletes, so each new partition’s size will be the sum of the two source partitions.

Animation 6: Here we combine the two 12 token key partitions into a single partition in a new SSTable.

Animation 7: Leveled compaction maintains a max SSTable size. This is tunable. The default value is 160MB.

Animation 8: We continue adding source partitions to the new SSTable until the new SSTable meets or exceeds the max SSTable size.

Animation 9: We maxed our size, so we create a new SSTable.

Animation 10: Partition 52 is rather large and takes up the whole of our new SSTable.

Animation 11: So we make a new SSTable.

Animation 12: Combine 74

Animation 13: And also 88

Animation 14: We delete the old L0 SSTables.

Animation 15: Each level has a max size. In practice, each level’s max size is 10x that of the previous level, with L0 treated as 160MB for this calculation. (However, the actual size of L0 depends on the size of the SSTables flushed to it, which is determined by the factors that cause the MemTable to flush to disk.)

Animation 16: In our example and to save screen real-estate, we will use a multiplier of two instead of ten, but the algorithm still works the same.

Animation 17: You can see that L1 is too large, so leveled compaction further compacts SSTables from L1 to L2.

Animation 18: Since there are no existing SSTables in L2, compacting L1 to L2 is simply a matter of moving an existing L1 SSTable to L2. Leveled compaction considers the SSTables in turn when selecting which SSTable to compact to the next level.

Animation 19: L1 is still too large.

Animation 20: So we compact another SSTable down to L2.

Animation 21: (Cleaning up the visuals.)

Animation 22: L2 has a max size.

Animation 23: L2’s max size is twice the size of L1. (Remember, the real multiplier is ten, not two.)

Animation 24: So now our levels are in a consistent state. Leveled compaction is complete.

Animation 25: A new SSTable flushes to disk into L0. Leveled compaction will immediately compact this new SSTable with the SSTables in L1.

Critical: The way leveled compaction chooses which SSTables to compact together is simply a matter of overlapping token keys. For example, this new SSTable covers token ranges 8 to 88 inclusive. The single SSTable in L1 covers the token range 74 to 88 inclusive. Since these two ranges overlap, leveled compaction will combine these two SSTables together.

Animation 26: So let’s compact these two SSTables together.

Animation 27: Partitions 8, 28, and 52 don’t have any pairs in the SSTable on the right, so they compact as is.

Animation 28: We hit the max SSTable size, so time to write a new one.

Animation 29: 74 is the lowest key between the two SSTables, so leveled compaction writes it next.

Animation 30: Finally the two 88 partitions from both SSTables combine to make a new partition.

Animation 31: Oh no, L1 is too big again. We must continue compacting down.

Important: Can you determine which SSTables will compact next?

Answer: Since both SSTables are new to the compaction party in L1, leveled compaction picks the first to compact down. Leveled compaction then determines which SSTables in L2 overlap with this SSTable. This SSTable has a token range of 8 to 52 inclusive. This range overlaps the first two SSTables in L2, each having a token range of 12 to 28 inclusive and 52 to 52 inclusive respectively.

Animation 32: Let’s compact!

Animation 33: We first compact 8 since it’s the lowest token between all three tables.

Animation 34: Then 12

Animation 35: Then 28. Notice there are two SSTables containing a partition of token 28. If all three SSTables contained 28, we would combine all three instead.

Animation 36: That new SSTable is full.

Animation 37: We combine the two 52 partitions into one huge partition. Again, they take up the entire SSTable.

Animation 38: We are now in a consistent state where no level is too large.

Animation 39: And the process continues as Cassandra flushes SSTables to L0.

Actual Implementation

  • We used a multiplier of two for our example
  • Leveled compaction uses a multiplier of 10 per level
  • SSTable max size is 160MB (sstable_size_in_mb; see the example below)
  • SSTables may slightly exceed this size to ensure the last partition written is complete
  • Our example data model had extremely large partitions
  • The more granular your partitions, the closer to 160MB the SSTables will be
  • Hence, SSTable sizes are more uniform
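
For reference, here is a minimal sketch of enabling leveled compaction on a table. The killrvideo.videos table name is just an example; 'class' and 'sstable_size_in_mb' are the standard CQL compaction subproperties, with the value shown at its default.

-- Example table name; sstable_size_in_mb shown at its default (160MB)
ALTER TABLE killrvideo.videos
WITH compaction = {'class': 'LeveledCompactionStrategy',
                   'sstable_size_in_mb': 160};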

Reads

  • Leveled compaction is best for read-heavy workloads

    • Occasional writes but high reads

  • Each partition resides in only one SSTable per level (max)
  • Generally, reads are handled by just a few SSTables

    • Partitions group together in a handful of levels as they compact down
    • 90% of the data resides in the lowest level (due to 10x rule)
    • Unless the lowest level is not yet full

Example:

  • L1: 1,600MB (1.6GB)
  • L2: 16,000MB (16GB)
  • L3: 160,000MB (160GB)
  • L4: 1,600,000MB (1.6TB)
  • L1 + L2 + L3 = 177,600MB (~177.6GB)
  • 177,600MB / 1,777,600MB total ≈ 10%

We ignore L0 here because L0’s size depends on factors that determine when a MemTable flushes to disk. When compaction is complete, L0 will always be empty because leveled compaction immediately compacts L0 down.

As the number of levels grows, this value approaches 10%.

Disk Usage

  • In general, an SSTable in one level overlaps roughly 10 SSTables in the level below
  • Therefore, a compaction requires free space of about 11x the SSTable max size
  • One for the SSTable in the higher level
  • 10 for the overlapped SSTables in the next level
  • Leveled compaction wastes less disk space
  • Obsolete records compact out quickly

    • A single partition’s records group as they compact down
    • Updated records merge with older records due to this grouping

Disadvantages

  • IO intensive
  • Compacts many more SSTables at once than size tiered compaction
  • Compacts more frequently than size tiered
  • Cannot ingest data at very high insert speeds

Lagging Behind

  • Leveled compaction switches to size-tiered compaction at level 0 when compaction is behind
  • Creates larger L0 SSTables to compact with lower levels
  • It is more efficient to compact a larger L0 SSTable down to lower levels
  • Reduces the number of SSTables required for a read

Size Tiered Compaction

Consider a Perfect World

In this scenario we will consider a write-only workload with no updates or deletes. All values written are unique, so compacting SSTables together merely means combining their data without eliminating any old data.

Animation 1: Over time, we gain four 100MB SSTable files, each flushed from a MemTable. We will use nice clean numbers in this example to keep the math easy.

Animation 2: The compactor’s duty is to combine these SSTables into a single SSTable. In this example, the compactor is a size tiered compactor.

Animation 3: The compactor combines all values from the four source 100MB SSTables making a new 400MB SSTable.

Animation 4: The old SSTables are no longer necessary, and the compactor deletes them.

Animation 5 + 6: The scenario plays out again, and now we have two 400MB SSTables.

Animation 7: This occurs two more times until we have four SSTables (min_threshold).

Animation 8: Since we have four 400MB SSTables, the compactor combines them into a new 1600MB SSTable.

Animation 9: The compactor deletes the unneeded 400MB files.

Animation 10: The entire scenario plays out again making a second 1600MB SSTable file. Eventually, when there are four 1600MB SSTable files, the compactor will compact them together as well into a single 6.4GB SSTable file.

Worst Case Scenario

Requires 50% Hard Drive Space

In the absolute worst case scenario with size tiered compaction, you must keep 50% of your disk free, since size tiered compaction copies (and merges) the old, smaller SSTables into a new SSTable before it can delete the originals.

Unpredictable Reads

Partition Data Can Possibly be Scattered Amongst Several SSTables

One issue/side effect with size tiered compaction is that, in the worst case, a single partition read requires streaming data from all SSTables. For example, let’s consider a basic scenario of pulling records of users living in Texas.

Animation 1: Here are four new SSTables, each containing a record of a user in Texas. Of course, there would be other records in the SSTable file (including more Texas users), but we will keep the example simple.

Animation 2: These four SSTables compact together into a new SSTable.

Animation 3: Here are another four (unique) users, all in Texas. At this point, to read the entire Texas partition, we will have to read data from five SSTable files.

Animation 4: These four compact together, leaving two SSTable files in total. The situation is a bit better for a read because we have only two SSTables to seek/stream from.

Animation 5: We continue handling writes of users in Texas, and size tiered compaction continues to operate making larger and larger tiers. However, now notice to read the entire Texas partition, we have to read/seek nine SSTable files! The problem propagates as we acquire more tiers/SSTables.

Stale Data

Stale Records in Larger SSTables Take Unnecessary Space

Although visually this example looks like the last, notice that each partition contains an update for a single user. These users (for whatever contrived example reason you like to come up with) constantly change their names, causing updates.

For example, user #1 Jim has changed names several times (Jeb, Joe, Jan, Jef, and Jon). Users #2 and #3 also change their names often. Since the latest names are in the highest tier, all the data in the lower tiers is no longer needed. It unnecessarily takes up hard drive space. However, there are not four SSTables in any tier, so compaction cannot eliminate the older data. Also, as data climbs into higher tiers, those tiers compact less often, which means very old stale data may stick around for a while.

Realistic Scenario

SSTable sizes will vary

In a real world scenario, compaction drops values due to tombstones and updates. So compacted SSTables will vary in size. Here we examine how size tiered compaction determines what the tiers are and how tables group into those tiers. This image shows several SSTables of varying size. Let’s examine size tiered compaction’s algorithm in dividing these SSTables into different tiers.

Animation 1: [Original graphic fades.]

Animation 2: Size tiered compaction considers each SSTable one at a time in no particular order. Here we randomly chose to start with a 100MB SSTable (well, somewhat random; 100MB is an easy number to do some math on, as you’ll see shortly).

Animation 3: Since this 100MB SSTable is the first one, we make a new tier.

Animation 4: We place the 100MB SSTable into this new tier.

Animation 5: From there, we calculate the average size of all the SSTables in the tier. Since this tier has a single 100MB SSTable, the average size is 100MB (denoted in yellow). We also calculate the minimum and maximum size another SSTable must be to be placed in this tier. The minimum size is 50% of the average (50MB, denoted on the left). The maximum size is 150% of the average (150MB, denoted on the right). You can tune the minimum and maximum percentages by setting bucket_low and bucket_high respectively. The term 'bucket' is synonymous with 'tier'.

Animation 6: Now we consider the 1GB SSTable file. Since 1GB is larger than 150MB…​

Animation 7: We create another bucket (tier).

Animation 8: The 1GB SSTable file goes into that new bucket.

Animation 9: And we calculate the average, min, and max for that bucket as well.

Animation 10: Here’s a 490MB SSTable file.

Animation 11: 490MB doesn’t fall into the range of any existing tiers, so we make a new bucket.

Animation 12: We place the 490MB SSTable file into the new tier and calculate the average, min, and max as well.

Animation 13: Now we consider a small 20MB SSTable. It doesn’t qualify for any of the existing buckets.

Animation 14: Size tiered compaction creates a new bucket to hold all small SSTables. Size tiered compaction doesn’t maintain a low or high threshold for this bucket, as that would be too fine grained for small SSTables, so all small SSTables are placed in this one bucket. You can tune the top value of this bucket (default 50MB) by setting min_sstable_size.

Animation 15: Here’s a 300MB SSTable.

Animation 16: 300MB falls between the min and max value of an existing bucket, so we place it in that bucket.

Animation 17: After doing so, we recalculate the average for that bucket and update the low and high thresholds for that bucket as well.

Animation 18: Here’s a 140MB SSTable file.

Animation 19: It falls into the range of an existing bucket as well.

Animation 20: And we update the average, min, and max for that bucket too.

Animation 21: This process of separating SSTables into buckets and updating the average, min, and max value for each bucket continues until all SSTables are sorted into a bucket.

Things to Note

  • Groups similarly sized SSTables together
  • Tiers with fewer than min_threshold (four by default) SSTables are not considered for compaction (see the configuration example after this list)
  • The smaller the SSTables, the narrower the gap between a bucket’s minimum and maximum size
  • SSTables qualifying for more than one tier are distributed randomly amongst those buckets
  • Buckets with more than max_threshold SSTables are trimmed to just max_threshold SSTables

    • 32 by default
    • Coldest SSTables dropped
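
As a sketch of how these thresholds are configured, the statement below sets the options discussed above to their default values. The killrvideo.users table is just an example name; min_sstable_size is expressed in bytes (52428800 ≈ 50MB).

-- Example table; all option values shown are the defaults
ALTER TABLE killrvideo.users
WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                   'bucket_low': 0.5,
                   'bucket_high': 1.5,
                   'min_sstable_size': 52428800,
                   'min_threshold': 4,
                   'max_threshold': 32};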

Hotness

  • Size tiered compaction chooses the hottest tier first to compact
  • SSTable hotness determined by number of reads per second per partition key

Similar Sized Tables

  • Similar sized SSTables compact together better
  • SSTables of similar size will have a fair amount of overlap
  • Minimizes write amplification (rewriting large amounts of data simply to copy it)
  • Ex: Compacting a 1MB file with a 1TB file (not ideal)

Compacting a 1MB SSTable with a 1TB SSTable requires rewriting most of the 1TB file’s contents. There may be up to 1MB of overlap (thus dropping up to 1MB of data between the two files), but the vast majority of the 1TB file is rewritten unchanged.

Write amplification occurs when a large amount of data (on disk or in RAM) must be rewritten in order to update a small portion of it, because the update requires rewriting all of the data to a new location. Compaction is one such scenario: to merge the values of one or more SSTables, we must rewrite the source SSTables to new locations. Thus our writes are "amplified" by data that had to be rewritten but wasn’t actually updated.

Concurrency

  • Cassandra compacts several tiers concurrently
  • concurrent_compactors

    • Defaults to the smaller of the number of disks or the number of cores, with a minimum of 2 and a maximum of 8 per CPU core

  • SSTables currently being compacted are not considered for new tiers

Triggering a Compaction

  • Compaction starts every time a MemTable flushes to an SSTable

    • MemTable too large, commit log too large, or manual flush

  • Or when the cluster streams SSTable segments to the node

    • Bootstrap, rebuild, repair

  • Compaction continues until there are no more tiers containing at least min_threshold SSTables
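
For example, a manual flush can be triggered with nodetool, which hands compaction a freshly written SSTable to consider (the keyspace and table names are placeholders):

$ nodetool flush killrvideo users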

Tombstones

  • If there are no eligible buckets, size tiered compaction compacts a single SSTable by itself
  • This eliminates expired tombstones
  • The ratio of expired tombstones must be above 20% (tombstone_threshold)
  • The largest qualifying SSTable is chosen first
  • The SSTable must be at least one day old before it is considered

    • tombstone_compaction_interval

  • A tombstone is only dropped if it does not overlap live records in other SSTables (see the example below)
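
A sketch of the related tunables, shown at their default values (the killrvideo.comments table name is hypothetical). tombstone_compaction_interval is in seconds (86400 = one day), and setting unchecked_tombstone_compaction to true lets the single-SSTable compaction run without this pre-check.

-- Example table; values shown are the defaults
ALTER TABLE killrvideo.comments
WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                   'tombstone_threshold': 0.2,
                   'tombstone_compaction_interval': 86400,
                   'unchecked_tombstone_compaction': false};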

Size Tiered Compaction

  • As with everything, there are trade-offs to using size tiered compaction
  • Size tiered compaction is the default
  • Absorbs high-volume, write-heavy workloads by deferring compaction as long as possible
  • Other compaction strategies don’t handle ingesting data as well as size tiered
  • compaction_throughput_mb_per_sec controls the compaction IO load on a node (see the example below)
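
Besides cassandra.yaml, the throughput can be adjusted at runtime with nodetool; the value below is just an example, in MB per second:

$ nodetool setcompactionthroughput 64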

Major Compactions

  • You can issue a major compaction via nodetool (see the example below)
  • Compacts all SSTables into a single SSTable
  • The new monolithic SSTable will qualify for the largest tier
  • Future updates/deletes will fall into smaller tiers
  • Obsolete data in the largest tier will still hog a lot of disk space
  • It takes a long time for changes to propagate up to the large tier
  • Major compactions are not recommended
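
For reference, a major compaction looks like this (the keyspace and table names are placeholders; omitting them compacts every table):

$ nodetool compact killrvideo users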

Time Windowed Compaction

Built for Time Series Data

Time windowed compaction simply combines all records in each time window into a single SSTable. This works especially well when you are TTLing your data, since removing the expired records is simply a matter of deleting the entire SSTable file.

Here we will examine KillrVideo user activity (viewing videos, pausing, playback, etc.) for one afternoon/evening. We will set our time windows to hourly.

Animation 1: We will first look at the time of 3:00pm to 4:00pm.

Animation 2: Within the active window, time windowed compaction reuses size tiered compaction. In fact, you can configure size tiered compaction options when you set up time windowed compaction. Here we have three SSTables.

Animation 3: When the window completes (4:00pm), time windowed compaction combines all SSTables in that window into a single SSTable.

Animation 4: Let’s look at the next time window.

Animation 5: Here we have a bit more data because viewership is going up. Notice size tiered compacts the four yellow similarly sized SSTables into a new (green) SSTable.

Animation 6: 5:00pm rolls around, and time windowed compaction combines all those SSTables together.

Animation 7: Let’s look at the 5:00pm window.

Animation 8: Even more data this time.

Animation 9: 6:00pm rolls around; combine all the data.

Animation 10: And so on…​

Animation 11: And so forth…​

Notice the evening is a busy time for user activity on the KillrVideo website.

Time Window Details

  • An SSTable spanning two windows simply falls into the later of the two windows
  • It is good practice to aim for roughly 50 SSTables (max) on disk

    • ~20 for the active window
    • ~30 for all past windows combined

  • For example, one month of data would use a window of one day

Tuning

Simply set the window size

  • compaction_window_unit

    • minutes
    • hours
    • days

  • compaction_window_size

    • Number of units in each window

  • Ex: 15 days, 10 minutes, 20 hours, etc. (see the example below)
  • expired_sstable_check_frequency_seconds determines how often to check for fully expired (tombstoned) SSTables
  • Good to tune when using a TTL
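
As a sketch, the statement below configures the one-day window mentioned earlier. The killrvideo.user_activity table name is hypothetical; the compaction options are the standard TimeWindowCompactionStrategy subproperties, and the optional default_time_to_live (30 days here) lets whole windows expire together.

-- Example table; one-day windows with an optional 30-day TTL
ALTER TABLE killrvideo.user_activity
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'DAYS',
                   'compaction_window_size': 1}
AND default_time_to_live = 2592000;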

Repair

What is Repair?

  • Think of repair as synchronizing replicas
  • Repair ensures that all replicas have identical copies of a given partition
  • Repair occurs:

    • If necessary, when detected by reads (e.g. CL=QUORUM)
    • Randomly with non-quorum reads (table properties read_repair_chance or dclocal_read_repair_chance)
    • Manually using nodetool repair

Why is Repair Necessary?

  • Nodes may go down for a period of time and miss writes

    • Especially if down for more than max_hint_window_in_ms

  • Nodes may become overloaded and drop writes

How Does Repair Work?

1) Nodes build Merkle trees from partitions to represent how current the data values are

2) Nodes exchange the Merkle trees

3) Nodes compare the Merkle trees to identify specific values that need synchronization

4) Nodes exchange data values and update their data

What is a Merkle Tree?

  • A binary tree of hash values
  • The leaves of the tree represent hashes of the values in the partition
  • Each non-leaf tree node is a hash of its children’s hash values
  • When two tree nodes’ hashes are the same, their sub-trees are the same

When to Perform a Repair?

  • If a node has been down for a while
  • On a regular basis:

    • Once every gc_grace_seconds
    • Make sure the repair can complete within the gc_grace_seconds window
    • Schedule for lower utilization periods

Is Repair a Lot of Work for the Node?

  • A full repair can be a lot of work
  • But there are ways to mitigate the work:

    • Incremental repair
    • Primary range repair
    • Sub-range repair

What is Incremental Repair?

  • An improvement to Cassandra that keeps track of previously repaired data
  • Only calculates Merkle trees for data that hasn’t previously been repaired
  • This keeps the repair process performant and lightweight

What is a Primary Range Repair?

  • The primary range is the set of tokens the node is assigned
  • Repairing only the node’s primary range ensures that data is synchronized for that range
  • Repairing only the primary range on each node avoids redundant repairs, since each token range is then repaired exactly once across the cluster

What is Sub-Range Repair?

  • Repairs can consume significant resources depending on how much data is under consideration
  • Targeting sub-ranges of the table will reduce the amount of work done by a single repair

How do I Use Repair?

nodetool <options> repair
  • Options (see the examples below):

    • --dc <dc_name> identify data centers to repair
    • --et <end_token> used when repairing a subrange
    • --local repairs only in the local data center
    • --par (parallel repair)
    • --pr (partitioner-range) repairs only the node’s primary range
    • --st <start_token> used when repairing a subrange
    • -- <keyspace> <table> restricts the repair to a keyspace and table(s)
    • --inc (incremental) do an incremental repair
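
A couple of example invocations (the keyspace and table names are placeholders, and exact flag spellings vary between Cassandra versions; the short forms are shown here):

# Repair only this node's primary range for one table
$ nodetool repair -pr killrvideo users

# Repair a sub-range of the ring for a keyspace
$ nodetool repair -st <start_token> -et <end_token> killrvideo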

sstablesplit

When would we need sstablesplit?

  • You did a nodetool compact (major compaction)
  • Maybe you used SizeTieredCompactionStrategy for a major compaction
  • This would result in an excessively large SSTable
  • It is a good idea to split the table because it won’t get compacted again until the next huge compaction
  • Using size tiered compaction, we may have accumulated some really large files over time
  • You may find yourself with a 200GB file that you need to split up
  • It’s an anti-compaction, in a way

Usage

Firstly, Cassandra must be stopped to use this tool:

$ sudo service cassandra stop
  • If you do this while Cassandra is online, it will be bad!!
$ sstablesplit [options] <filename> [<filename>]*

Example:

$ sstablesplit -s 40 /var/lib/cassandra/data/data/killrvideo/users/*
  • Take all the SSTable files for the killrvideo users table and split them into SSTables of at most 40MB each

Multi Datacenter

Nodes, racks and data centers

A cluster of nodes can be logically grouped as racks and data centers

  • Node— the virtual or physical host of a single Cassandra instance
  • Rack— a logical grouping of physically related nodes
  • Data Center— a logical grouping of a set of racks
  • Enables geographically aware read and write request routing
  • Each node belongs to one rack in one data center
  • The identity of each node’s rack and data center may be configured in its conf/cassandra-rackdc.properties file (see the example below)
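
As a sketch, cassandra-rackdc.properties for a node in the first data center might look like this (the dc and rack names are examples; they must match the names used in your keyspace replication settings):

# conf/cassandra-rackdc.properties (read by GossipingPropertyFileSnitch)
dc=DC1
rack=RACK1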

Adding a second data center

  • Ensures continuous availability of your data and application
  • Live backup
  • Improved performance
  • Analytics

How clusters operate between data centers

  • A data center is a grouping of nodes configured together for replication purposes
  • Data replicates across data centers automatically and transparently
  • Consistency level can be specified at LOCAL level for read/write operations
  • Consistency level can also be specified as EACH_QUORUM (a quorum in each data center)

What if one data center goes down?

  • Failure of a data center will likely go unnoticed by the application
  • If a node or nodes fail, they will stop communicating via gossip
  • Recovery can be accomplished with a rolling repair of all nodes in the failed data center

Implementing a multiple data center cluster

  • Use the NetworkTopologyStrategy rather than SimpleStrategy
  • Use LOCAL_* consistency level for read/write operations to limit latency
  • If possible, define one rack for the entire cluster
  • Specify the snitch

After starting Cassandra on the new nodes:

  • Change the keyspace properties to specify the desired replication factor for the new data center
  • For example, set strategy options to DC1:2, DC2:2:
ALTER KEYSPACE killrvideo
WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 2};
  • Run nodetool rebuild specifying the existing data center on all nodes in the new data center:
$ nodetool rebuild -- name_of_existing_data_center
  • Otherwise, requests to the new data center with LOCAL_ONE or ONE consistency levels may fail if the existing data centers are not completely in-sync.

Exercise 5—​Stand-up a Second DC