Backup/Restore and Moving Data

CQL COPY

  • Imports and exports delimited data to and from Cassandra
COPY table_name ( column, ... )
FROM ( 'file_name1', 'file_name2', ... | STDIN )
WITH option = 'value' AND ...

COPY table_name ( column, ... )
TO ( 'file_name1', 'file_name2', ... | STDOUT )
WITH option = 'value' AND ...
  • COPY accepts one or more comma-separated file names as its argument

COPY is a cqlsh command that imports and exports delimited data to and from Cassandra.

COPY FROM a delimited file

Why would we want to do this?

  • In a production database, inserting columns and column values programmatically is more practical than using cqlsh
  • But often testing queries using this SQL-like shell is very convenient
  • A delimited file is useful if several records need inserting
  • While not strictly an INSERT command, it is a common method for inserting data

Some rules:

  • Cassandra expects every row in the delimited input to contain the same number of columns
  • The number of columns in the delimited input is the same as the number of columns in the Cassandra table
  • Empty data for a column is interpreted as a NULL value by default
  • COPY FROM is intended for importing small datasets (a few million rows or less) into Cassandra
  • For importing larger datasets, use the Cassandra bulk loader

Procedure

  • Locate your delimited file and check options to use

    ---
    lastname|firstname|email|created_date
    Jones|Bob|bob@example.com|2016-02-03
    Walsh|Gary|gary@example.com|2016-02-04
    Nash|Sue|sue@example.com|2016-02-23
    ---

  • To insert the data, use the COPY command with the delimited file:
cqlsh> COPY killrvideo.users FROM 'users.csv' WITH DELIMITER='|' AND HEADER=TRUE

What do some of these options mean?

Just some examples

  • DELIMITER - Set the character that separates fields in the file (default is comma (,))
  • HEADER - Set true to indicate that first row of the file is a header (default is false)
  • CHUNKSIZE - Set the size of chunks passed to worker processes (default value is 1000)
  • SKIPROWS - The number of rows to skip (default value is 0)
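To make these options concrete, here is a small Python sketch of how a loader might apply them to a delimited file. This is an illustration, not cqlsh's actual implementation; `load_rows` is a made-up function, and the exact interaction of HEADER and SKIPROWS is simplified.

```python
import csv
import io

def load_rows(text, delimiter=',', header=False, skiprows=0, chunksize=1000):
    """Parse delimited text the way the COPY FROM options describe:
    skip rows, optionally drop a header line, and yield fixed-size
    chunks of rows (the chunks handed to worker processes)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)[skiprows:]
    if header and rows:
        rows = rows[1:]  # first remaining row holds column names, not data
    for i in range(0, len(rows), chunksize):
        yield rows[i:i + chunksize]

data = ("lastname|firstname|email|created_date\n"
        "Jones|Bob|bob@example.com|2016-02-03\n")
chunks = list(load_rows(data, delimiter='|', header=True))
```

With HEADER=TRUE the header line is dropped, leaving a single data row in a single chunk.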


COPY TO a delimited file

  • Assuming you have this user table in CQL:
cqlsh> SELECT * FROM killrvideo.users;

 lastname | firstname | email           | created_date
----------+-----------+-----------------+--------------
    Jones |       Bob | bob@example.com |   2016-02-03

  • After inserting data into the table, you can copy the data to a delimited file in another order by specifying the column names in parentheses after the table name:
COPY users (firstname, lastname, created_date, email) TO 'users.csv'

Specifying the source or destination files

  • Specify the source file of the delimited input or the destination file of the delimited output by a file path.
  • You may also use the STDIN or STDOUT keywords to import from standard input and export to standard output.
  • When using stdin, signal the end of the delimited data with a backslash and period ("\.") on a separate line.
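The "\." terminator can be illustrated with a short Python sketch of a consumer reading delimited rows from a stream. The function name is hypothetical; only the sentinel convention comes from the COPY documentation above.

```python
import io

def read_until_terminator(stream):
    """Collect delimited lines from a stream until the '\\.' line,
    the end-of-data marker COPY FROM STDIN expects."""
    rows = []
    for line in stream:
        if line.strip() == r'\.':
            break  # backslash-period on its own line ends the input
        rows.append(line.rstrip('\n'))
    return rows

stdin = io.StringIO("Jones|Bob|bob@example.com|2016-02-03\n\\.\nignored\n")
rows = read_until_terminator(stdin)
```

Anything after the sentinel line is never consumed as data.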

sstabledump

Allows you to see the raw data of an SSTable in a text format.

  • Dumps the contents of the specified SSTable in JSON format
  • Useful when you are curious about internals such as timestamps
  • The only way to actually see the inner workings of an SSTable
  • You may wish to flush the table to disk (using nodetool flush) before dumping its contents

Cassandra 3.0 sstabledump output

Internal Representation Format

$ sstabledump data/data/killrvideo/users-b75e90f0226811e6a172cfc782e39035/ma-1-big-Data.db -d

[14c532ac-f5ae-479a-9d0a-36604732e01d]@0 Row[info=[ts=1464175401324444] ]:  |
[created_date=2016-05-11 00:00-0230 ts=1464175401324444], [email=jane@example.com ts=1464175401324444], [firstname=Jane ts=1464175401324444], [lastname=Doe ts=1464175401324444]

sstableloader (Cassandra bulk loader)

sstableloader

  • Provides the ability to:

    • Bulk load external data into a cluster
    • Load pre-existing SSTables into

      • an existing cluster or new cluster
      • a cluster with the same number of nodes or a different number of nodes
      • a cluster with a different replication strategy or partitioner

  • Very flexible
  • The sstableloader streams a set of SSTable data files to a live cluster; it does not simply copy the set of SSTables to every node, but transfers the relevant part of the data to each node, conforming to the replication strategy of the cluster
  • The table into which the data is loaded does not need to be empty
  • Tables that were repaired in a different cluster are unrepaired after being loaded


Prerequisites

  • Since sstableloader uses Cassandra gossip to figure out the topology of the cluster, make sure of the following:

    • The cassandra.yaml is in the classpath and properly configured
    • At least one node in the cluster is configured as seed
    • The following properties are properly configured in cassandra.yaml for the cluster that you are importing into:

      • cluster_name
      • listen_address
      • storage_port
      • rpc_address
      • rpc_port


When using sstableloader to load external data, you must first generate SSTables.

If using DataStax Enterprise, you can use Sqoop to migrate external data to Cassandra.

How does sstableloader work?

  • When it runs, it reads every SSTable and streams that data back into the cluster
  • It repartitions the data, using the cluster's partitioner and snitches so that each row lands in the correct place
  • Great for moving between non-like clusters, as it can re-stream and repartition the data (going from a 30-node to a 40-node cluster, for example)
  • Example of usage: To bulk load the files, specify the path to killrvideo/users/ in the target cluster:
$ sstableloader -d 110.82.155.1 /var/lib/cassandra/data/killrvideo/users/

Spark

Use Spark to load data into Cassandra

Spark provides convenient functionality for loading large external datasets into Cassandra tables in parallel

Ingesting files in CSV, TSV, JSON, XML, and other formats

  • If file is stored in Cassandra File System

    • File blocks are ingested concurrently (64 MB per block)
    • May still need to re-partition to optimize

  • If file is stored in a local file system, to achieve a desired level of parallelism

    • Use Spark to repartition the dataset or
    • Use Spark’s partitionBy to pre-partition the dataset by a Cassandra table partition key

case class User(userid: java.util.UUID,
                firstname: String,
                lastname: String,
                email: String,
                created_date: java.util.Date)

val users =
  sc.textFile("cfs:.../users.csv")
    .repartition(2 * sc.defaultParallelism)
    .map(line => line.split(",") match {
      case Array(id, firstname, lastname, email, created_date) =>
        User(java.util.UUID.fromString(id),
             firstname,
             lastname,
             email,
             // "yyyy-MM-dd": uppercase MM is month-of-year (mm is minutes)
             new java.text.SimpleDateFormat("yyyy-MM-dd").parse(created_date))
    })

users.saveToCassandra("killrvideo", "users")
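The pre-partitioning idea above can be illustrated with a small hash-bucketing sketch in Python. This uses the built-in `hash` purely as a stand-in for Cassandra's Murmur3 partitioner; the point is only that rows sharing a partition key end up in the same bucket.

```python
def bucket_by_partition_key(rows, key_index, num_partitions):
    """Group rows into num_partitions buckets by hashing the
    partition-key column, so rows belonging to the same Cassandra
    partition travel together and land on the same replicas."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        # illustrative hash only; Cassandra uses Murmur3 token ranges
        b = hash(row[key_index]) % num_partitions
        buckets[b].append(row)
    return buckets

rows = [("u1", "Jones"), ("u2", "Walsh"), ("u1", "Nash")]
buckets = bucket_by_partition_key(rows, key_index=0, num_partitions=4)
```

Both "u1" rows are guaranteed to share a bucket, which is the property a write path can exploit to batch rows per partition.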

Loading a CSV file into a Cassandra table with validation

val beforeCount = sc.cassandraTable("killrvideo", "users").count

val users =
  sc.textFile("file:///home/student/users.csv")
    .repartition(2 * sc.defaultParallelism)
    .cache // the RDD is used in two actions

val loadCount = users.count

users.map(...).saveToCassandra("killrvideo", "users")

val afterCount = sc.cassandraTable("killrvideo", "users").count

if (loadCount - (afterCount - beforeCount) > 0)
  println("Errors or upserts - further validation required")

Performing Backup and Recovery

Why should we backup our data?

"There are two types of users: those who have lost data and those who are about to lose data."

  • Programmatic or accidental deletion or overwriting of data
  • For a single node failure, recovery can come from a live replica
  • So that we may recover from catastrophic data center failure

Why snapshots

  • We don’t do backup like traditional databases where we copy out all the data
  • It’s a distributed system; every server or node has a portion of the data
  • SSTables are immutable, which is great! Makes them easy to back up.
  • Snapshots create hard links on the file system as opposed to copying data

    • This is DIFFERENT from copying actual data files offline (takes less disk space)

  • Therefore very fast! (Not a lot of data in flight)
  • Represents the state of the data files at a particular point in time
  • Can consist of a single table, single keyspace, or multiple keyspaces
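The hard-link behavior is easy to demonstrate. The sketch below uses Python's `os.link` with made-up file names to show why a snapshot is nearly free on disk and why it survives even after the original SSTable is removed (as compaction eventually does).

```python
import os
import tempfile

# stand-in for a table data directory containing one immutable SSTable
data_dir = tempfile.mkdtemp()
sstable = os.path.join(data_dir, "ma-1-big-Data.db")
with open(sstable, "w") as f:
    f.write("immutable sstable contents")

# a snapshot directory holds hard links, not copies of the data
snap_dir = os.path.join(data_dir, "snapshots", "backup1")
os.makedirs(snap_dir)
link = os.path.join(snap_dir, "ma-1-big-Data.db")
os.link(sstable, link)  # hard link: no data is copied

# both names point at the same inode, so the snapshot costs ~no disk space
same_inode = os.stat(sstable).st_ino == os.stat(link).st_ino

# even after the original file is deleted (e.g. by compaction),
# the snapshot's link keeps the data alive
os.remove(sstable)
with open(link) as f:
    snapshot_survives = f.read() == "immutable sstable contents"
```

This also explains why old snapshots must be cleared manually: as long as the links exist, the underlying SSTable files can never be reclaimed.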

What is a snapshot?

  • Represents the state of the data files at a particular point in time
  • Snapshot directory is created (this has pointers)
  • Then you can either leave it there or copy it offline to an NFS mount or copy to S3 etc.


How do incremental backups work?

  • Incremental backups create a hard link to every SSTable upon flush

    • User must manually delete them after creating a new snapshot

  • Incremental backups are disabled by default

    • Configured in the cassandra.yaml file by setting incremental_backups to true

  • Need a snapshot before taking incremental backups
  • Snapshot information is stored in a snapshots directory under each table directory

    • The snapshot need only be stored offsite once

  • Incremental backups are all stored in a backups directory under the keyspace data directory

    • Enables storing incremental backups offsite more easily

  • Both snapshot and incrementals are needed to restore data
  • Incremental backup files are not automatically cleared

    • Clear when a new snapshot is created

Where are snapshots stored?

  • Snapshots and incremental backups are stored on each Cassandra node

    • Handy, if a simple restore is needed
    • Not so good if there is a hardware failure

  • Commonly, files are copied to an off-node location

    • The open source program tablesnap is useful for backing up to S3
    • Scripts can be used to automate backing up files to another machine

      • cron + bash script, rsync, etc

Auto snapshot

  • CRITICAL safety factor!
  • A configuration setting in cassandra.yaml that controls whether a snapshot is taken of the data before a table is truncated, or before a table or keyspace is dropped
  • STRONGLY advise using the default setting of true

How do we use nodetool to create snapshots?

bin/nodetool [options] snapshot [-cf <table>] [-t <tag>] -- <keyspace>
  • We can specify to take a snapshot of:

    • one or more keyspaces
    • a specific table to back up

  • Flags:

    • -h [host | IP address]
    • -p port
    • -pw password
    • -u username

  • Use a parallel ssh tool to snapshot the entire cluster

We can use the nodetool snapshot command to take a snapshot, specifying a table or one or more keyspaces.

Bear in mind that all keyspaces will be snapshotted if one is not specified.

We can also use the usual nodetool options like host and port.

There is also a parallel ssh tool to snapshot an entire cluster.

How do we remove snapshots?

nodetool clearsnapshot
  • The nodetool clearsnapshot command removes snapshots
  • Takes the same options as other nodetool commands
  • Specify the snapshot name and keyspace
  • Not specifying a snapshot name removes all snapshots
  • Remember to remove old snapshots before taking new ones; previous snapshots are not automatically deleted
  • To clear snapshots on all nodes at once, use a parallel ssh utility

How do we restore snapshots?

"You get 1 point for a backup, you get 99 for a restore."

  • Most common method is to delete the current data files and copy the snapshot and incremental files to the appropriate data directories

    • If using incremental backups, copy the contents of the backups directory to each table directory
    • Table schema must already be present in order to use this method
    • Restart and repair the node after the file copying is done

  • Another method is to use the sstableloader

    • Great if you’re loading it into a different size cluster
    • Must be careful about its use as it can add significant load to cluster while loading
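The file-copy restore method amounts to the loop sketched below. The paths and helper name are made up for illustration; a real restore also requires the table schema to exist beforehand, followed by a restart and repair of the node.

```python
import os
import shutil
import tempfile

def restore_from_snapshot(table_dir, snapshot_name):
    """Delete the current data files, then copy the snapshot's files
    back into the table data directory."""
    snap_dir = os.path.join(table_dir, "snapshots", snapshot_name)
    for name in os.listdir(table_dir):
        path = os.path.join(table_dir, name)
        if os.path.isfile(path):
            os.remove(path)  # clear current data files only
    for name in os.listdir(snap_dir):
        shutil.copy(os.path.join(snap_dir, name), table_dir)

# demo with a throwaway directory standing in for a table data directory
table_dir = tempfile.mkdtemp()
snap = os.path.join(table_dir, "snapshots", "backup1")
os.makedirs(snap)
with open(os.path.join(snap, "ma-1-big-Data.db"), "w") as f:
    f.write("snapshot data")
with open(os.path.join(table_dir, "ma-2-big-Data.db"), "w") as f:
    f.write("live data")

restore_from_snapshot(table_dir, "backup1")
```

After the call, the live data file is gone and the snapshot's SSTable is back in place; with incremental backups, the contents of the backups directory would be copied in the same way.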


Performing cluster-wide backup and restore

  • OpsCenter
  • SSH programs

    • pssh
    • clusterssh is another tool that can be used to make changes on multiple servers at the same time

  • Honorable mention: tablesnap and tablerestore

    • For Cassandra backup to AWS S3

  • Recovery

Exercise 6: Backup/Restore