In this post I'm going to discuss the backup solution I use for my personal computer: duplicity, pushing the files up to Amazon S3.

In the past I've had good experiences with CrashPlan on Linux, but I'm hoping that over time S3 will save me money.

Secure backups

To ensure our backup is secure, we'll make use of GPG. Perhaps you've met GPG before in the context of sending secure messages. Typically if someone wants to send a secure message, say a mail that contains some SSH keys, they'd turn to GPG to get the job done. GPG relies on a security concept known as public key encryption. The information is encrypted with a public key (available for the world to see, hence the name), but the so-called private key, capable of decrypting the message, is only distributed to trusted parties. Since we have one key for encryption and another for decryption, this type of encryption is also known as asymmetric encryption, as opposed to traditional symmetric forms of encryption where the same key is used for both.
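
As a quick illustration of the asymmetric idea (the filename and address below are just placeholders): anyone holding the public key can encrypt, but only the private key holder can decrypt.

gpg --encrypt --recipient someone@somewhere.com notes.txt   # writes notes.txt.gpg
gpg --decrypt notes.txt.gpg > notes.txt                     # needs the private key (and passphrase)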

Setting up GPG

On Ubuntu, just type:

sudo apt-get install gnupg

Now we need to generate the key pair:

gpg --gen-key

The defaults are fine (RSA, 4096-bit, no expiry); then enter your name and email, and a passphrase (make a note of this). At this point you'll need to do some work to generate the required entropy (install some software, do some SSH-ing, or generate some random numbers with rng-tools).
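
If key generation stalls waiting for entropy, one (slightly lazy) option is to have rngd feed the kernel pool from /dev/urandom; treat this as a convenience trick rather than best practice for high-value keys:

sudo apt-get install rng-tools
sudo rngd -r /dev/urandom   # tops up the entropy pool so --gen-key can finish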

Revocation certificate

If you ever get hacked, or you lose your key, you will need a way to invalidate your key pair. For this purpose, you should create a revocation certificate straight away and copy and paste it to some safe place (preferably a different computer, or a printout):

 gpg --gen-revoke someone@somewhere.com
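
If you'd rather write the certificate straight to a file than copy it from the terminal, gpg can do that too (revoke.asc is just an example filename):

gpg --output revoke.asc --gen-revoke someone@somewhere.com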

Most of the usual uses of gpg are beyond the scope of a backup tutorial, but you can read more here.

Backing up your gpg keys

There is little point in having a fancy GPG-encrypted cloud backup if, when you drop your laptop in a lake, you lose the keys to decrypt it! Some people like to print the GPG keys or store them on a USB drive, but for me that is not practical. The best thing (and maybe it's not so great?) that I could think of was to make a small file volume with TrueCrypt, store the keys in there, and upload it to Google Drive.

First export your keys with:

gpg --export-secret-keys > secret-backup.gpg

and move secret-backup.gpg to the TrueCrypt volume, before uploading it to Google Drive.
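
While you're at it, it doesn't hurt to drop the public keys and the trust database into the same TrueCrypt volume (the filenames below are just examples); importing the ownertrust file later saves re-trusting the key by hand:

gpg --export --armor > public-backup.asc          # public keys
gpg --export-ownertrust > ownertrust-backup.txt   # trust settings (restore with gpg --import-ownertrust)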

In the event of losing your laptop, download the TrueCrypt volume, mount it (you'll need to remember the password or have it stored elsewhere!), then restore the keys to gpg with

gpg --import secret-backup.gpg

You'll finally need to tell gpg that you trust this key. Open it in the interactive key editor (substituting your own key ID for 126877DC):

gpg --edit-key 126877DC

then at the prompt enter trust, choose 5 (ultimate trust), confirm, and finish with quit.

Things should be good to go again.

Creating your S3 bucket

After you've created your S3 bucket (if you don't choose US Standard as the region, the S3 URL may be different from the one used below, so double check), you'll probably want to create a dedicated user who only has access to the backup bucket and nothing else. First go to "Identity & Access Management", then "Users", then "Create New User". Enter the user name, e.g. "duplicity", making sure you tick the option to generate access keys, and download the keys (this is the only chance you'll have to do so). Make a note of the "User ARN"; we'll need it next.
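
If you prefer the terminal, a rough sketch of the same steps with the AWS CLI (assuming it's installed and configured with an admin profile) looks like this; the console route above achieves exactly the same thing:

aws iam create-user --user-name duplicity        # the "Arn" field in the output is the User ARN
aws iam create-access-key --user-name duplicity  # note the SecretAccessKey; it's only shown once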

Now return to your bucket and click Properties > Permissions, then edit the bucket policy so that it reads:

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": "<YOUR_USER_ARN>"
            },
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::<YOUR_BUCKET_NAME>",
                "arn:aws:s3:::<YOUR_BUCKET_NAME>/*"
            ]
        }
    ]
}

This gives your user full permissions on this bucket only.
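
A quick sanity check with the new user's keys (assuming s3cmd is configured with them, as I use later for checking the bucket size) is to confirm that you can list the backup bucket but nothing else:

s3cmd ls s3://<YOUR_BUCKET_NAME>/   # should list the (empty) bucket
s3cmd ls s3://some-other-bucket/    # should be denied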

The backup script

I put this script in /etc/cron.monthly/, as I only want monthly backups, but you could put it in the daily or weekly equivalents. Make sure you remember to give it the proper permissions (chmod +x ...), and don't give the filename a suffix like .sh: run-parts skips files with a dot in the name.
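
A quick way to confirm cron will actually pick it up is to ask run-parts (the tool Debian/Ubuntu use to execute these directories) what it would run; the filename below is just an example:

sudo chmod +x /etc/cron.monthly/duplicity-backup   # example name: no dots, executable
run-parts --test /etc/cron.monthly                 # lists the scripts that would be run

Here's the script itself: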

#!/bin/bash
# Check s3 bucket size with
# s3cmd du -H  s3://<YOUR_BUCKET>


# Secrets file should be owned by root and 700 perms
. /root/.dup.secrets  # Recall '.' is a synonym for 'source', which executes the script
export AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY
export PASSPHRASE

# Your GPG key 
GPG_KEY="<YOUR_GPG_KEY>"

# Set up some variables for logging
LOGFILE=$(mktemp)  # makes a temp file in /tmp
HOST=`hostname`
DATE=`date +%Y-%m-%d`
MAILADDR="YOU@SOMEWHERE.COM"
TODAY=$(/bin/date +%d%m%Y)

# The S3 destination followed by bucket name
# DEST=file:///path/to/some/dir/instead
DEST="s3://s3.amazonaws.com/<YOUR_BUCKET>/"

# Test if duplicity already running
is_running=$(ps -ef | grep duplicity  | grep python | wc -l)

# How long to keep backups for
# OLDER_THAN="1M"
BK_KEEP_FULL="2"  # How many full+incremental cycles to keep
BK_FULL_FREQ="6M" # Create a new full backup every...

COMMON_OPTS="--asynchronous-upload --numeric-owner --full-if-older-than=$BK_FULL_FREQ --encrypt-key=${GPG_KEY} --sign-key=${GPG_KEY}"

if [ $is_running -eq 0 ]; then
    # if dup not already running.

    # Trace function for logging, don't change this
    trace () {
            stamp=`/bin/date +%Y-%m-%d_%H:%M:%S`
            echo "$stamp: $*" >> ${LOGFILE}
    }

    # Take a snapshot of all packages installed too, and put it in /root
    dpkg --get-selections | grep -v deinstall > /root/packages_installed.txt

    # The source of your backup
    SOURCE=/

    trace "Backup for local filesystem started"

    trace "... removing old backups"

    # duplicity remove-older-than ${OLDER_THAN} ${DEST} >> ${LOGFILE} 2>&1
    $(which duplicity) remove-all-but-n-full $BK_KEEP_FULL --force $DEST >> ${LOGFILE} 2>&1

    trace "... backing up filesystem"

    # If the last full backup is not older than BK_FULL_FREQ, this does an incremental
    # backup, i.e. just files that are new or changed; you can use the --time flag to
    # restore whichever version you need
    # A \ at the end of a line just means the command continues on the next line

    # Note full-if-older-than makes full backup if last *full* backup older than time
    # if no gpg use: --no-encryption in COMMON_OPTS instead
    $(which duplicity) ${COMMON_OPTS} \
        --exclude-globbing-filelist=/root/backup.exclude \
        ${SOURCE} \
        ${DEST} >> ${LOGFILE} 2>&1

    # Add status to log
    trace "------------------------------------"
    trace "The status of the collection is:"
    duplicity collection-status $DEST >> $LOGFILE

    trace "Backup for local filesystem complete"
    trace "------------------------------------"

    # Send me a local email with the log, to check in Thunderbird
    sendmail $MAILADDR < $LOGFILE

    # Reset the ENV variables. Don't need them sitting around
    unset AWS_ACCESS_KEY_ID
    unset AWS_SECRET_ACCESS_KEY
    unset PASSPHRASE
fi

Credit should go to a variety of blog posts for the end product, as it's a hodgepodge of snippets I gathered from a few places.

Firstly, you should store your AWS keys and GPG passphrase in a file with strict permissions in root's home directory:

# /root/.dup.secrets
# chmod 700 /root/.dup.secrets
AWS_ACCESS_KEY_ID="YOUR_ID"
AWS_SECRET_ACCESS_KEY="YOUR_KEY"
PASSPHRASE="YOUR_GPG_PASSPHRASE"

The backup script will execute this file using the . operator, synonymous with source. In this case it just sets some variables, which our backup script immediately exports.

Set your GPG key. If you've forgotten what it is, then (as root) type gpg --list-keys:

pub   2048R/BC956C98 2015-07-15
uid                  Joe Bloggs
sub   2048R/3DD81C72 2015-07-15

(It's the string of numbers and letters on the pub line, just after 2048R/.)

I use the handy $(mktemp) to create a temporary scratchpad file for the log, since I'll have it emailed to me, and I'm not interested in keeping lots of logs for my local backup. I then set some more convenience variables like DATE, which will come in handy in the logs (N.B. when creating a cron script, it's important to use the full path to things, as cron doesn't always have the same PATH).

My script sets the number of full backups (i.e. not incremental differences, but brand new full backups) to keep at only 2 at any given time, and tells duplicity to make a new full backup every six months.

In addition to GPG, in the common options we set:

  1. Asynchronous upload: duplicity can upload a volume while, at the same time, preparing the next volume for upload (so in theory this makes for faster backups)
  2. numeric-owner: use numeric UIDs/GIDs rather than user names. Recommended when you want to use your backup to restore from a live CD if your computer dies.

The script first checks that duplicity is not already running (by counting processes of that name). If it isn't, we're off.

The script has a handy "trace" function, which gives you a nice timestamped log line in the file by just writing "trace hello world ...".

As I've said, I run this monthly, so I get an incremental backup once per month (not as diligent as some I know, but I want to keep things cheap). After 6 months a full backup is created, so then I have two full backup chains in parallel. After 12 months, another full backup set is begun and the very first backup set is removed, leaving me with 2 backup sets at any given time.

What files to backup?

My policy is to set the source as everything (/), then exclude lots of things using a globbing file list (--exclude-globbing-filelist=/root/backup.exclude). The exclusion file includes the following (I use the format /dir/* so that I still get the parent dir as a placeholder, but not its contents); the raw exclude file itself is sketched after these lists.

Common (if not mandatory) system dir exclusions

- /proc/*   # virtual filesystem, don't backup
- /dev/*   # dynamically created at boot, futile to backup
- /sys/*   # virtual filesystem, don't backup
- /media/*   # removable media, obviously don't backup
- /run/*   # volatile runtime data, don't backup
- /lost+found/*   # file recovery leftovers, don't need
- /boot/efi/*   # mount point for a small 200 MB EFI partition, don't need
- /mnt/*   # external devices like USB drives may mount here, don't backup
- /tmp/*   # ephemeral files, futile to backup
- /root/.cache/*   # definitely don't want cache
- /root/.dup.secrets  # pointless locking the key in the chest
- /var/cache/*   # don't want cache

Some more stringent system exclusions

I toyed with keeping these so I could easily restore in one shot from a live CD, but in the end the space they took was just too big, and I decided instead to keep a snapshot of the packages installed (along with the config files in /etc and other non-excluded root dirs) and reinstall from the repos.

- /var/lib/*
- /bin/*
- /lib/*
- /lib32/*
- /lib64/*
- /opt/*
- /sbin/*
- /usr/*

Home exclusions

- /home/otheruser/*    # some other user I don't want to backup
- /home/.ecryptfs/*   #  encrypted home
  # junk
- /home/lee/.pulse/*
- /home/lee/.pulse-cookie
- /home/lee/.cache/*
- /home/lee/.gksu.lock
- /home/lee/.adobe/*
- /home/lee/.macromedia/*
- /home/lee/.recently-used
- /home/lee/.dbus/*
- /home/lee/Downloads/*  # don't want to end up backing up stuff I downloaded
- /home/lee/Dropbox/*
- /home/lee/.ecryptfs/*
- /home/lee/gPodder/*
- /home/lee/.local/share/Trash/*
- /home/lee/.Trash
- /home/lee/.local/share/gvfs-metadata/*
- /home/lee/.gvfs/*
- /home/lee/.Private/*
- /home/lee/.vagrant.d/*
- /home/lee/.thumbnails


  # These are too big for this kind of backup and mostly static
  # I back them up with a separate script that just increments monthly
  # and only does full destroy/create backups every year or two!
- /home/lee/Desktop/*  # personal pref
- /home/lee/Pictures/*
- /home/lee/Documents/*
- /home/lee/StaticBackup/*  # things that never change but I want stored
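
For reference, the raw /root/backup.exclude file is simply one glob pattern per line, without the annotations above; a short excerpt:

/proc/*
/sys/*
/dev/*
/tmp/*
/home/lee/.cache/*
/home/lee/Downloads/*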

Restoring the backup

(Script completely stolen from here)

#!/bin/bash
# Export some ENV variables
. /root/.dup.secrets 
export AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY
export PASSPHRASE

# The S3 destination followed by bucket name
DEST="s3://s3.amazonaws.com/<YOUR_BUCKET>/"

# Ensure user has entered cmd line args required
if [ $# -lt 3 ]; then echo "Usage $0 <date> <file> <restore-to>"; exit; fi

$(which duplicity) \
    --restore-time $1 \
    --file-to-restore $2 \
    ${DEST} $3

# Reset the ENV variables. Don't need them sitting around
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY
unset PASSPHRASE

Typically files are timestamped like duplicity-full.20150716T030320Z.vol11.difftar.gpg, which means 2015-07-16 (20150716) at 03:03:20 (the T separates date and time) in UTC/GMT (that's the Z).

Perhaps you'll want to first list all the files in the backup to see what you want to restore:

duplicity list-current-files s3://s3.amazonaws.com/<YOUR_BUCKET>/ >> list_of_files.txt

You could feed in a time for the listing too, with the --time flag. Remember, however, that you need to export the various keys if running this from the command line.
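
As a hypothetical example, restoring a single file as it was on 16 July 2015 would look like this (assuming you saved the restore script above as restore.sh; the path is relative to the backup root, i.e. no leading slash, and the destination must not already exist):

./restore.sh 2015-07-16 home/lee/.bashrc /tmp/bashrc-restored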

Verifying the backup

The duplicity verify command is best explained by the docs:

verify [--compare-data] [--time <time>] [--file-to-restore <rel_path>] <url> <local_path>

Restore backup contents temporarily file by file and compare against the local path's contents. duplicity will exit with a non-zero error level if any files are different. On verbosity level info (4) or higher, a message for each file that has changed will be logged.
The --file-to-restore option restricts verify to that file or folder. The --time option allows to select a backup to verify against. The --compare-data option enables data comparison.

To this end we can write the short script:

#!/bin/bash
# Export some ENV variables 
. /root/.dup.secrets 
export AWS_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY
export PASSPHRASE

# The S3 destination followed by bucket name
DEST="s3://s3.amazonaws.com/<YOUR_BUCKET>/"

# The source of your backup
SOURCE=/

# -v4 sets verbosity level so we see differing files
duplicity verify -v4 ${DEST} ${SOURCE} \
    --exclude-globbing-filelist=/root/backup.exclude

# Reset the ENV variables. Don't need them sitting around
unset AWS_ACCESS_KEY_ID
unset AWS_SECRET_ACCESS_KEY
unset PASSPHRASE

Policy for static and humongous directories

I excluded some dirs in my backup script like ~/Pictures and ~/Documents. It's not that I don't want to back up the contents of these dirs, it's just that they are huge (over 50 GB combined), and the contents are fairly unchanging. If I had unlimited, super fast bandwidth (and didn't care about costs), I'd just throw them into the above backup instead of excluding them, but in reality I think it makes little sense to have duplicity destroy the remote data and create full backups from scratch every few months.

The way I have decided to back up these dirs is to have a folder ~/StaticBackup (which again is excluded from the first backup). In here I manually archive (tar.bz2) the contents of dirs like ~/Pictures (leaving the ~/Pictures dir just as a dumping ground for recent, well... pictures). Then I have a second duplicity script, exactly the same as the above, but which only does incremental backups: there is no forcing of full backups and no deletion of old full backups, just one primary chain that increments on a monthly cron. After the first big push, the monthly incremental backups are small (perhaps I should change it to just have a reaaaally long cycle, like a full backup every 2 years, I don't know...).
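
The archiving step itself is nothing fancy, just plain tar; something like the following (the archive name is only an example):

tar -cjf ~/StaticBackup/pictures-2015.tar.bz2 -C ~ Pictures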

I'll talk about some other tips and tricks regarding duplicity in the follow-up post.
