Reinstating AWS Ephemeral Data

Ephemeral storage on AWS is a wonderful thing, right up until the point that you need to stop and start the machine and keep your data intact. Instance store data survives a simple reboot, but as soon as the instance is stopped (for example, to change its instance type) or terminated, poof! Your data is gone forever. Volatile storage has always been an issue, from the earliest days of computing: a double-edged sword of speed versus persistence. The obvious solution is to copy your volatile data onto your primary volume, but how can we easily reinstate that data when our primary disk may only be a few gigabytes in size?

Volatile data stores

One of my recent work projects had exactly this problem: multiple Cassandra nodes on ephemeral storage whose data had to survive the stop/start needed for an instance size increase. "But surely you can just re-sync your data from its siblings? After all, that's the whole POINT of distributed database systems like Cassandra!" I hear you utter in contempt. Well, as with most production-sensitive things, the truth is that it isn't quick or easy with the volume of data I was working with, lamentably. In my case, re-synchronising would have taken more than eighteen hours per node: way too long for a nine-hour maintenance window.

The problem...

The data is huge; we can't take the unit offline until we're absolutely sure we have a copy of the data, which means a multi-stage backup process; we have a limited maintenance window; and we need something reliable and reproducible that we can test on a duplicate cluster.
Thinking hats on, fellow geeks...

So, a cunning method was floated, and I built a proof of concept on the test cluster that copied the data off and reconstructed it (relatively) quickly after a power cycle of the AWS instance.

The Rsync & extra volume approach

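Using the volume management in AWS, I attached a separate high-IOPS SSD volume to each live instance. I used a provisioned IOPS volume, but any SSD-backed EBS volume will do the job; only the copy speed changes. The volume has to be in the same Availability Zone as the instance and large enough to hold everything on the ephemeral store.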
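If you would rather script the volume creation than click through the console, something like the following could work. It is a sketch, assuming AWS CLI v2 with credentials already configured. The function name is hypothetical, gp3 is an assumed default type (the original setup used a provisioned-IOPS volume, which would also need an --iops value), and /dev/sdf is an assumed device name that Xen-based instances typically expose as /dev/xvdf.

```shell
#!/usr/bin/env bash
# Sketch: create an EBS volume in the instance's Availability Zone and attach it.
# Assumptions: AWS CLI v2 with credentials; gp3 as the default volume type;
# /dev/sdf as the requested device name. Hypothetical helper name.

create_and_attach_backup_volume() {
  local az="${1:-}" size_gib="${2:-}" instance_id="${3:-}" vol_type="${4:-gp3}"
  if [[ -z "$az" || -z "$instance_id" || ! "$size_gib" =~ ^[0-9]+$ ]]; then
    echo "usage: create_and_attach_backup_volume AZ SIZE_GIB INSTANCE_ID [TYPE]" >&2
    return 2
  fi

  local vol_id
  # --query/--output text make create-volume print just the new volume ID
  vol_id=$(aws ec2 create-volume \
    --availability-zone "$az" \
    --size "$size_gib" \
    --volume-type "$vol_type" \
    --query VolumeId --output text) || return 1

  # Block until the new volume can be attached
  aws ec2 wait volume-available --volume-ids "$vol_id" || return 1

  aws ec2 attach-volume \
    --volume-id "$vol_id" \
    --instance-id "$instance_id" \
    --device /dev/sdf >/dev/null || return 1

  echo "$vol_id"
}
```

A reasonable size to pass is the used space reported by df -h on the ephemeral mount, plus some headroom for growth between the first and final passes.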
Once it's attached, we just need to make the volume usable on the live, running unit with a few simple commands:

lsblk

This lists the attached block devices; in our case the new volume showed up as /dev/xvdf (the AWS default). Depending on your server's drive mapping, your new drive may appear under a different xvd{letter}, or as an NVMe device such as /dev/nvme1n1 on newer Nitro-based instances.

file -s /dev/xvdf

It's always best to check that there is no filesystem on the newly mapped volume, just in case you mapped the wrong unit to your live node. If the output simply says "data", there's no filesystem there.
Once you're sure you have the right device, you can create the filesystem (ext4 here, but any filesystem your distribution supports will do) and mount your new drive:

mkfs -t ext4 /dev/xvdf
mkdir /syncbackup
mount /dev/xvdf /syncbackup
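The check, format and mount steps can be wrapped in one guarded function, so that a mistyped device letter cannot wipe a disk that is already in use. This is a sketch: the function name is hypothetical, it assumes it is run as root, and ext4 is an illustrative choice. The -L flag makes file follow a device name that happens to be a symlink.

```shell
#!/usr/bin/env bash
# Sketch: format and mount the backup volume only if it has no filesystem yet.
# Assumptions: run as root; ext4 chosen for illustration. Hypothetical helper name.

prepare_backup_volume() {
  local dev="${1:-}" mnt="${2:-}"
  if [[ -z "$dev" || -z "$mnt" ]]; then
    echo "usage: prepare_backup_volume DEVICE MOUNTPOINT" >&2
    return 2
  fi
  # Refuse anything that is not a block device (typo, wrong letter, etc.)
  if [[ ! -b "$dev" ]]; then
    echo "$dev is not a block device" >&2
    return 1
  fi
  # 'file -s' reports plain "data" when no filesystem is recognised;
  # anything else means the device may hold data, so stop here.
  local kind
  kind=$(file -b -L -s "$dev")
  if [[ "$kind" != "data" ]]; then
    echo "$dev already looks like: $kind (refusing to format)" >&2
    return 1
  fi
  mkfs -t ext4 "$dev" || return 1
  mkdir -p "$mnt" || return 1
  mount "$dev" "$mnt"
}
```

Usage would be prepare_backup_volume /dev/xvdf /syncbackup, substituting whatever device name lsblk showed.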

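Once the drive is mounted, you can use rsync to copy from your ephemeral storage across to the new drive. The first pass is best run live, with low-impact CPU and bandwidth settings; a second pass then runs once you have shut down all of the processing on the unit. Because rsync only transfers files that have changed, the second pass should be much quicker than the first, and adding --delete to it removes files that have disappeared since the first pass (such as SSTables that Cassandra has compacted away). Make sure every task on the server has finished and anything pending has been drained before that final synchronisation; on Cassandra, nodetool drain flushes memtables to disk and stops the node listening for connections.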

Low-impact CPU settings can be achieved with the Linux nice command. Give rsync a higher niceness than the other tasks on the box (a higher niceness means a lower scheduling priority) so that it doesn't impede their processing. In my example, I set a nice value of 19, the lowest priority available, so that rsync would not absorb too much vital CPU time.

Bandwidth limiting on the copy itself can be achieved with rsync's --bwlimit option. Without a suffix, the value is in units of 1024 bytes per second (KiB/s), so --bwlimit=10000 caps the transfer at roughly 10 MB/s. This helps to reduce the risk of overloading disk IO.
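To put the throttle in perspective, the arithmetic can be written as a small shell function that converts a data size and a --bwlimit value into wall-clock hours. The function name and the sizes are illustrative assumptions, not figures from the cluster described here.

```shell
#!/usr/bin/env bash
# Worked arithmetic: wall-clock time for one rsync pass at a given --bwlimit.
# A bare --bwlimit value is read by rsync as units of 1024 bytes/s (KiB/s).
# Sizes below are illustrative, not measurements. Hypothetical helper name.

# estimate_copy_hours SIZE_GIB RATE_KIBPS -> whole hours, rounded up
estimate_copy_hours() {
  local size_gib="$1" rate_kibps="$2"
  if (( rate_kibps <= 0 )); then
    echo "rate must be positive" >&2
    return 2
  fi
  local size_kib=$(( size_gib * 1024 * 1024 ))
  # Round up at each step so the estimate never undershoots
  local secs=$(( (size_kib + rate_kibps - 1) / rate_kibps ))
  echo $(( (secs + 3599) / 3600 ))
}

estimate_copy_hours 100 10000   # prints 3  (100 GiB at --bwlimit=10000)
estimate_copy_hours 1024 10000  # prints 30 (1 TiB at --bwlimit=10000)
```

At that cap each terabyte takes well over a day, which is why the throttled pass has to start long before the maintenance window; the unthrottled final pass only has to move what changed since.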

If you know that your task will take time to run, it's best to run it in the background or inside a tool like tmux, so that you can disconnect your Linux session and come back later without the process dying when your connection drops.

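nice -n 19 rsync --bwlimit=10000 -av --hard-links --progress /path/to/ephemeral/mount/ /syncbackup/

Note the trailing slash on the source path: it tells rsync to copy the contents of the directory rather than the directory itself, so the data lands directly in /syncbackup/ and the reverse copy later puts it back in the same place.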
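Both passes can be sketched as a pair of shell functions. This assumes root (so rsync -a can preserve ownership) and an already-mounted backup volume; the function names are hypothetical. Three things go beyond the commands shown here: a check that the destination really is a mount point (so a missing mount cannot fill the small root disk), --delete on the final pass, and --exclude=/lost+found so the backup filesystem's own lost+found directory is left alone.

```shell
#!/usr/bin/env bash
# Sketch of the two-pass copy: a throttled live pass, then an unthrottled
# final pass after processing has stopped. Hypothetical helper names.

# Normalise a path to end in exactly one "/" so rsync copies the contents.
with_trailing_slash() {
  printf '%s\n' "${1%/}/"
}

# Refuse to write into a plain directory on the root disk.
require_mount() {
  mountpoint -q "$1" || { echo "$1 is not a mount point" >&2; return 1; }
}

# Pass 1: lowest CPU priority plus a bandwidth cap in KiB/s (default 10000).
first_pass() {
  [[ -n "${1:-}" && -n "${2:-}" ]] || { echo "usage: first_pass SRC DST [KIBPS]" >&2; return 2; }
  local src dst bwlimit="${3:-10000}"
  src=$(with_trailing_slash "$1")
  dst=$(with_trailing_slash "$2")
  require_mount "$dst" || return 1
  nice -n 19 rsync -av --hard-links --bwlimit="$bwlimit" \
    --exclude=/lost+found --progress "$src" "$dst"
}

# Pass 2: no throttling; --delete removes files that vanished from the
# source since pass 1 (excluded paths are not deleted).
final_pass() {
  [[ -n "${1:-}" && -n "${2:-}" ]] || { echo "usage: final_pass SRC DST" >&2; return 2; }
  local src dst
  src=$(with_trailing_slash "$1")
  dst=$(with_trailing_slash "$2")
  require_mount "$dst" || return 1
  rsync -av --hard-links --delete --exclude=/lost+found --progress "$src" "$dst"
}
```

On a Cassandra node the sequence might be: first_pass /path/to/ephemeral/mount /syncbackup while live, then nodetool drain and stop the Cassandra service, then final_pass /path/to/ephemeral/mount /syncbackup.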

So here we have the unit starting its copy process. Once it has completed in its entirety, including the second pass, you can restart your server and carry out the maintenance you need. It might be wise to take a snapshot of your backup volume before you restart, just to make sure you definitely have a copy of your data if something goes wrong!
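A snapshot can be started from the instance itself with the AWS CLI. This is a sketch assuming AWS CLI v2 with credentials; the function name and description text are hypothetical. An EBS snapshot is point-in-time as of the moment it is started and completes in the background, so the script does not wait for it; running sync first flushes any buffered writes to the volume.

```shell
#!/usr/bin/env bash
# Sketch: snapshot the backup volume before the restart.
# Assumptions: AWS CLI v2 with credentials; VOL_ID is the backup volume's ID.
# Hypothetical helper name.

snapshot_backup_volume() {
  local vol_id="${1:-}"
  if [[ "$vol_id" != vol-* ]]; then
    echo "usage: snapshot_backup_volume vol-XXXXXXXX" >&2
    return 2
  fi
  # Flush filesystem buffers so the snapshot sees everything rsync wrote
  sync
  # Prints the new snapshot ID; the snapshot continues in the background
  aws ec2 create-snapshot \
    --volume-id "$vol_id" \
    --description "pre-maintenance copy of ephemeral data" \
    --query SnapshotId --output text
}
```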

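Upon bringing the unit back online, you can remount your backup volume with the usual mount command (EBS volumes stay attached through a stop/start). The ephemeral volume, on the other hand, comes back empty, so it may need a new filesystem and mounting at its usual path before you copy anything back into it: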

mount /dev/xvdf /syncbackup

Now the process runs in reverse. You will almost certainly want the fastest possible transfer back, so remove the nice and --bwlimit settings and swap your source and destination paths, thus:

rsync -av --hard-links --progress /syncbackup/ /path/to/ephemeral/mount
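The restore can be guarded in the same way. This sketch assumes root and that the empty ephemeral volume has already been formatted (if needed) and mounted; the function name is hypothetical. The mount-point checks matter most here: if the ephemeral volume is not mounted, rsync would happily copy everything into an empty directory on the few-gigabyte root disk.

```shell
#!/usr/bin/env bash
# Sketch of the restore step with safety checks. Hypothetical helper name.

restore_from_backup() {
  if [[ -z "${1:-}" || -z "${2:-}" ]]; then
    echo "usage: restore_from_backup BACKUP_MOUNT EPHEMERAL_MOUNT" >&2
    return 2
  fi
  local backup="${1%/}" target="${2%/}" m
  # Both ends must be real mounts, not plain directories on the root disk
  for m in "$backup" "$target"; do
    mountpoint -q "$m" || { echo "$m is not a mount point" >&2; return 1; }
  done
  # Full speed on the way back: no nice, no --bwlimit
  rsync -av --hard-links --exclude=/lost+found --progress "$backup/" "$target/"
}
```

Usage would be restore_from_backup /syncbackup /path/to/ephemeral/mount.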

Once you're happy your system is back online, you can unmount your backup drive, detach it from your instance and delete it if it's no longer required.
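The clean-up can also be scripted with the AWS CLI. This is a sketch assuming AWS CLI v2 with credentials; the function name is hypothetical. A volume has to return to the "available" state after detaching before it can be deleted, hence the wait.

```shell
#!/usr/bin/env bash
# Sketch: unmount, detach and delete the temporary backup volume.
# Assumptions: AWS CLI v2 with credentials; VOL_ID is the backup volume's ID.
# Hypothetical helper name.

remove_backup_volume() {
  local mnt="${1:-}" vol_id="${2:-}"
  if [[ -z "$mnt" || "$vol_id" != vol-* ]]; then
    echo "usage: remove_backup_volume MOUNTPOINT vol-XXXXXXXX" >&2
    return 2
  fi
  umount "$mnt" || return 1
  aws ec2 detach-volume --volume-id "$vol_id" >/dev/null || return 1
  # Deletion only succeeds once the volume is detached and "available"
  aws ec2 wait volume-available --volume-ids "$vol_id" || return 1
  aws ec2 delete-volume --volume-id "$vol_id"
}
```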

So, there we go - an old-school new-school way to make the volatile slightly less volatile...

Happy hacking!
