Cassandra Snapshot and Restore

Enable incremental backup

Set incremental_backups to true in the cassandra.yaml file and restart the node.
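The configuration change above can be scripted. The following is a minimal sketch, not an official tool: enable_incremental_backups is a hypothetical helper name, and it assumes the setting, if present, sits uncommented at the start of a line in cassandra.yaml. It keeps a .bak copy of the original file; restarting the node is still up to you.

```shell
# Hypothetical helper: set incremental_backups to true in a cassandra.yaml file.
# Usage: enable_incremental_backups conf/cassandra.yaml
enable_incremental_backups() {
    yaml="$1"
    if [ ! -f "$yaml" ]; then
        echo "no such file: $yaml" >&2
        return 1
    fi
    # Keep a copy of the original file next to it.
    cp "$yaml" "$yaml.bak" || return 1
    if grep -q '^incremental_backups:' "$yaml.bak"; then
        # The setting exists: rewrite its value.
        sed 's/^incremental_backups:.*/incremental_backups: true/' "$yaml.bak" > "$yaml"
    else
        # The setting is absent: append it.
        echo 'incremental_backups: true' >> "$yaml"
    fi
}
```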

cqlsh> create KEYSPACE ks1 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> use ks1;
cqlsh:ks1> create table user(email text primary key, name text, age int);
cqlsh:ks1> insert into user (email, name, age) values('a@b.com', 'ab', 23);

Take a full data backup of a node using the snapshot command

[training@localhost apache-cassandra-3.10]$ bin/nodetool snapshot ks1
Requested creating snapshot(s) for [ks1] with snapshot name [1498071498398] and options {skipFlush=false}
Snapshot directory: 1498071498398
[training@localhost apache-cassandra-3.10]$ ls -ltr data/data/ks1/user-64c3fcb056b311e79c839104c563abca/snapshots/
total 4
drwxrwxr-x 2 training training 4096 Jun 22 00:28 1498071498398
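When no name is given, the snapshot is named after its creation time in milliseconds since the Unix epoch, which helps when you later pick a snapshot by date. A small sketch for decoding such a name follows; snapshot_time is a hypothetical helper, and the -d @seconds form assumes GNU date.

```shell
# Hypothetical helper: print the UTC creation time encoded in a default
# snapshot name (milliseconds since the Unix epoch). Assumes GNU date.
# Usage: snapshot_time 1498071498398
snapshot_time() {
    ms="$1"
    # Drop the millisecond part and let date format the remaining seconds.
    date -u -d "@$((ms / 1000))" '+%Y-%m-%d %H:%M:%S UTC'
}
```

For the snapshot above, snapshot_time 1498071498398 prints 2017-06-21 18:58:18 UTC.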

Delete all snapshots

[training@localhost apache-cassandra-3.10]$ bin/nodetool clearsnapshot 
Requested clearing snapshot(s) for [all keyspaces]

View the table directory to confirm that the snapshots folder is gone

[training@localhost apache-cassandra-3.10]$ ls -ltr data/data/ks1/user-64c3fcb056b311e79c839104c563abca/
total 40
drwxrwxr-x 2 training training 4096 Jun 22 00:26 backups
-rw-rw-r-- 1 training training   16 Jun 22 00:28 mc-1-big-Filter.db
-rw-rw-r-- 1 training training   11 Jun 22 00:28 mc-1-big-Index.db
-rw-rw-r-- 1 training training   65 Jun 22 00:28 mc-1-big-Summary.db
-rw-rw-r-- 1 training training   42 Jun 22 00:28 mc-1-big-Data.db
-rw-rw-r-- 1 training training   10 Jun 22 00:28 mc-1-big-Digest.crc32
-rw-rw-r-- 1 training training   43 Jun 22 00:28 mc-1-big-CompressionInfo.db
-rw-rw-r-- 1 training training 4651 Jun 22 00:28 mc-1-big-Statistics.db
-rw-rw-r-- 1 training training   92 Jun 22 00:28 mc-1-big-TOC.txt

Make a change. Shortly, you will revert this change in the database by restoring the old snapshot.

cqlsh:ks1> insert into user (email, name, age) values('a@b.com', 'ab.new', 23);

Truncate the table that you want to restore.

cqlsh:ks1> truncate user;

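Copy the appropriate snapshot files (pick the snapshot by its date and time) from the table's snapshots/<snapshot name> directory, and any incremental backup files you need from its backups directory, into the table directory on the Linux file system.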
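The copy step can be sketched as a small shell function. This is an illustration under assumptions, not a standard tool: restore_table_files is a hypothetical name, it handles only the snapshot (incremental backup files would be copied the same way from the backups directory), and it skips manifest.json and schema.cql because they are not SSTable components.

```shell
# Hypothetical helper: copy the SSTable files of one snapshot back into
# the live table directory.
# Usage: restore_table_files <table_dir> <snapshot_name>
restore_table_files() {
    table_dir="$1"
    snapshot="$2"
    src="$table_dir/snapshots/$snapshot"
    if [ ! -d "$src" ]; then
        echo "snapshot not found: $src" >&2
        return 1
    fi
    for f in "$src"/*; do
        case "$(basename "$f")" in
            # Snapshot metadata, not SSTable components.
            manifest.json|schema.cql) continue ;;
        esac
        if [ -f "$f" ]; then
            # -p preserves timestamps and permissions.
            cp -p "$f" "$table_dir/" || return 1
        fi
    done
    return 0
}
```

With the layout shown earlier, a call would look like restore_table_files data/data/ks1/user-64c3fcb056b311e79c839104c563abca 1498071498398.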

Recreate the schema if necessary. You can find the schema in the schema.cql file in the snapshot directory.

Refresh the node using the nodetool refresh command, passing the keyspace and table names. You have to do this on each node.

[training@localhost apache-cassandra-3.10]$ bin/nodetool refresh ks1 user

Alternatively, you can run sstableloader

[training@localhost apache-cassandra-3.10]$ bin/sstableloader -d <contact point, ip> <snapshot dir of SStable>

Steps

1. Create a keyspace with RF=1, create a table movies in it, and load data into the table

2. Record the number of rows and verify that there is a record with movieId = 1

3. Take a snapshot on all nodes of one datacenter

4. Delete the record with movieId = 1 to simulate a destructive change. The goal is to restore the database to its previous state.

5. Truncate the table to ensure all existing records are wiped clean. Note that all nodes in the cluster must be up during truncation.

6. Restore the snapshot you want by copying the SSTables from the snapshots directory into the table directory.

7. Run the nodetool refresh command

$ nodetool -h 127.0.0.1 -p 7199 refresh demo movies  

The steps above restore a table without shutting down the nodes.

Load a table from SSTables using sstableloader

If you are restoring SSTables from one cluster to another cluster that has a different network topology, you can use sstableloader to load the data from the SSTables.

Copy the SSTables that you want to load into the data directory of the table.

$ cp data/data/demo/movies*/snapshots/1510828229153/* data/data/demo/movies*/

Run sstableloader

$ bin/sstableloader -d 127.0.0.1 --username cassandra --password cassandra data/data/demo/movies-d92bf7a0cab811e7878601b6d23bab63/
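sstableloader takes the target keyspace and table from the last two components of the directory path it is given. An alternative to copying files into the live data directory is therefore to stage them in a separate <keyspace>/<table> directory. A minimal sketch, assuming a staging location of your choice; stage_for_sstableloader is a hypothetical helper name.

```shell
# Hypothetical helper: copy a snapshot's SSTable files into
# <staging>/<keyspace>/<table> and print that directory.
# Usage: stage_for_sstableloader <snapshot_dir> <keyspace> <table> <staging_dir>
stage_for_sstableloader() {
    snapshot_dir="$1"
    dest="$4/$2/$3"
    mkdir -p "$dest" || return 1
    for f in "$snapshot_dir"/*; do
        case "$(basename "$f")" in
            # Snapshot metadata, not SSTable components.
            manifest.json|schema.cql) continue ;;
        esac
        if [ -f "$f" ]; then
            cp -p "$f" "$dest/" || return 1
        fi
    done
    # The printed path is what you would pass to sstableloader.
    echo "$dest"
}
```

The printed directory can then be given to bin/sstableloader in place of the table's data directory.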

Key difference between the nodetool refresh and sstableloader approaches

The two approaches to restoring a snapshot are similar, but there is one fundamental difference. The nodetool refresh approach assumes that the data in the SSTables belongs to the current node, so it applies when you want to restore a snapshot on the running node that took it. The sstableloader approach, on the other hand, can stream data from the SSTables to any node in the cluster. This makes it the suitable approach when the tokens have been redistributed after the cluster was resized.