Motivation

I wanted to ensure any data I put into my ARM k3s cluster is backed up to prevent data loss.

I no longer recommend Duplicacy. Read my article on restic backups on TrueNAS instead.

Backup Contenders

I took a look at CrashPlan, restic and Duplicacy.

  • CrashPlan - Even though they have a decent Linux client, I eliminated CrashPlan because:

    • They’ve already abandoned the home market. I currently use their CrashPlan for Small Business account for my Mac, and I suspect they’ll eventually abandon that market too, since accounts with a small number of licenses aren’t worth their time either.
    • They bill per-machine instead of by the amount of storage used.
  • restic - Open source, which is great, but I ended up eliminating it because its deduplication wasn’t as strong as Duplicacy’s. It also seemed a little awkward to prune snapshots when I experimented with it.

  • Duplicacy - I chose Duplicacy because it:

    • Supports cross-source deduplication
    • Works well with B2
    • Runs on Linux, Windows and Mac
    • Allows multiple source directories to be backed up simultaneously to the same B2 bucket.
    • Continues backing up where it left off after being interrupted and restarted, so you don’t have to start the backup over from scratch and re-upload everything.

It didn’t hurt that I know several people using it with large amounts of data who are happy with it.

On to the Backups

I made a Docker image, thoth-duplicacy, which installs duplicacy and duplicacy-util on top of Debian buster-slim, along with some helper scripts to make using it more convenient.

The image is published on Docker Hub, with both an Intel and an ARM7 version - the most current builds are tagged unixorn/thoth-duplicacy:armv7l and unixorn/thoth-duplicacy:x86_64.
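If you want to pull the image directly, grab the tag that matches your CPU architecture:

docker pull unixorn/thoth-duplicacy:armv7l   # Raspberry Pi, Odroid, and other ARM7 boards
docker pull unixorn/thoth-duplicacy:x86_64   # Intel/AMD machines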

Usage

For simplicity, I’m running my backups as kubernetes cron jobs. This allows me to easily run backups of multiple directory trees at once, and the kubernetes scheduler will automagically spread them around the cluster to the least loaded nodes.

Pre-requisites

Create a Kubernetes Namespace

I like my cluster neat and organized, with different services in their own namespaces, so I created a backups namespace by running kubectl create namespace backups.

Set up a B2 Storage Bucket for Duplicacy

Create a B2 Bucket and App Key

  1. Create a bucket in B2, and only use this bucket for duplicacy backups. Do this first so that when you create the app key, you can easily restrict its access to only this bucket using the dropdown menu.
  2. Create an app key in B2 that you only use with Duplicacy. Definitely do not use the root account’s credentials. When you create it, specify that it’s only allowed to use your backups bucket. Make sure to copy the app key information when you create it - it will only be displayed once. If you prefer doing this from the command line, see the sketch after this list.
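If you’d rather skip the B2 web console, the b2 CLI can do both steps. This is only a sketch - the bucket name, key name, and capability list are placeholders, and the exact subcommand syntax varies between b2 CLI versions:

# Authorize with your master account once, then create a private bucket
# that is only used for duplicacy backups
b2 authorize-account
b2 create-bucket your-backups-bucket allPrivate
# Create an app key restricted to that bucket. Copy the output; the
# application key is only displayed once.
b2 create-key --bucket your-backups-bucket duplicacy-backups listBuckets,listFiles,readFiles,writeFiles,deleteFiles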

Now you’re ready to initialize the bucket for the first directory you want to back up. The easiest way to do this is by running duplicacy inside the thoth-duplicacy container with docker-compose.

Set up thoth-duplicacy container

  1. git clone git@github.com:unixorn/thoth-duplicacy.git
  2. BACKUP_LOCATION=/that/first/directory docker-compose run thoth-duplicacy bash
  3. cd /data
  4. mkdir -p .duplicacy

Initialize the B2 Bucket

  1. Run duplicacy init -encrypt -storage-name b2 STORAGEID b2://yourbucket. STORAGEID cannot have spaces or any special characters besides - and _. duplicacy will prompt you for the B2 app ID, app key, and the encryption password for your backups. Store the password in your secure password manager - without it, you can’t restore any of your data. Annoyingly, you also have to set the password, B2 ID, and key again after initializing the bucket so that backups won’t prompt you for them.
  2. Set the B2 ID - duplicacy set -storage b2://yourbucket -key b2_id -value YOUR_APP_ID
  3. Set the B2 key - duplicacy set -storage b2://yourbucket -key b2_key -value YOUR_APP_KEY
  4. Set the password - duplicacy set -storage b2://yourbucket -key password -value YOURPASSWORD

You can now run backup-cronjob and watch the first backup grind.
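If you’d rather watch the first run interactively, you can also kick off a backup by hand from inside the same container. I’m assuming here that backup-cronjob is essentially a thin wrapper around duplicacy’s standard backup command, so from /data something like this is roughly equivalent:

cd /data
# -storage names the storage you configured above, -threads limits upload
# parallelism, and -stats prints a summary when the backup finishes
duplicacy backup -storage b2 -threads 3 -stats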

After I configured duplicacy for the first time, it was much less hassle to copy the .duplicacy/preferences JSON file into the .duplicacy directory of each new directory tree I wanted to back up and change the id key to a new unique value - don’t put spaces or any special characters in the id other than _ and -. You don’t have to change the storage key, and actually shouldn’t - sharing the same storage bucket is what allows duplicacy to deduplicate your files across multiple source directories, which keeps your storage bill down. The copy-and-edit step is sketched just after the example preferences file below.

Here’s an example preferences file -

[
    {
        "name": "b2",
        "id": "UNIQUE_ID_FOR_YOUR_DIRECTORY",
        "repository": "",
        "storage": "b2://your-backups-bucket",
        "encrypted": true,
        "no_backup": false,
        "no_restore": false,
        "no_save_password": false,
        "nobackup_file": "",
        "keys": {
            "b2_id": "ROLE_ACCOUNT_B2_ID",
            "b2_key": "ROLE_ACCOUNT_B2_KEY",
            "password": "SUPER_SECRET_ENCRYPTION_PASSWORD_FOR_YOUR_DATA"
        }
    }
]
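Adding another directory tree after the first one boils down to a copy and a one-line edit. The paths here are just examples - substitute your own trees:

# Copy the working preferences into the new tree's .duplicacy directory
mkdir -p /dfs/volumes/anotherdir/.duplicacy
cp /dfs/volumes/exampledir/.duplicacy/preferences /dfs/volumes/anotherdir/.duplicacy/preferences
# Then edit the "id" field in the copy to a new unique value (only letters,
# digits, - and _), and leave "storage" pointing at the same bucket.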

Backing up a Directory Tree

Here’s a sample job that backs up one of my directory trees. I’m using the backups namespace that I created earlier to keep things tidy - if you want to use the default namespace instead, delete the namespace entry in the metadata section.

Here are some things you’ll need to customize if you base a job on this example:

  • Change the namespace entry in the metadata section to match whichever namespace you decided to use.
  • I run this on Odroid HC2s and Raspberry Pis, which both use ARM CPUs. If you’re using x86, change the image entry to unixorn/thoth-duplicacy:x86_64 in the template spec section.
  • I work from home, so I want to restrict the number of upload threads so that running backups don’t burn all my upload bandwidth. Change DUPLICACY_BACKUP_THEAD_COUNT in the env section if you want more simultaneous threads. The Odroids only have 8 cores, but I had no issues running 12 threads other than gobbling up upstream bandwidth.
  • The B2_STORAGE_NAME environment variable is used by the backup-cronjob script to determine where to write the backup, so alter the value according to your setup.
  • I’m backing up a MooseFS distributed file system. I had already tagged all my chunk servers with kubectl label node NODENAME odroid=true (see the example just after this list), and I use a nodeSelector stanza in the backup cron jobs to restrict the backup to run only on one of the chunk servers where the data resides. The MooseFS data is distributed across all the chunk servers, and each chunk server in the cluster currently contains 33% of the files, so running the backup on a chunk server maximizes the amount of data that can be read locally instead of going across the network. Update or delete the nodeSelector clause to work with your environment.
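Labeling the chunk servers and checking which nodes will match the nodeSelector looks like this - NODENAME is whatever kubectl get nodes reports for each chunk server:

# Tag each chunk server so the backup jobs can be pinned to them
kubectl label node NODENAME odroid=true
# List the nodes that match the nodeSelector used in the cron jobs
kubectl get nodes -l odroid=true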

Once you’ve updated the file, install the cronjob with kubectl apply -f backup-example-directory-tree.yml.

You can download this from backup-example-directory-tree.yml instead of hassling with copy and paste.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: backup-exampledir
  namespace: backups
spec:
  schedule: "35 */2 * * *"
  # Ensure only one copy of the backup is running, even if it takes
  # so long to run that it is still running when the next cron slot
  # occurs
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup-exampledir
            # I'm running this on the odroids in my cluster, so I'm specifying
            # the ARM7 build
            image: unixorn/thoth-duplicacy:armv7l
            # Use the x86_64 tag if you're on Intel
            # image: unixorn/thoth-duplicacy:x86_64
            args:
            - /bin/sh
            - -c
            - /usr/local/bin/backup-cronjob

            volumeMounts:
              - name: data-volume
                mountPath: /data/

            env:
                # I want to restrict the number of threads used for uploads
                # so that duplicacy doesn't consume all my upload bandwidth.
                # I don't care if it makes my backups slower.
              - name: DUPLICACY_BACKUP_THEAD_COUNT
                value: "3"
                # backup-cronjob needs to know what defined storage to back up
                # files to.
              - name: B2_STORAGE_NAME
                value: "b2"

          restartPolicy: OnFailure

          # Keep it running on a chunkserver so that at least part of the
          # I/O is to local disk instead of across the network. Remove if
          # you don't care what node backups happen on.
          nodeSelector:
            odroid: "true"

          volumes:
            - name: data-volume
              hostPath:
                # This will be remapped to /data which is where duplicacy
                # expects to find the data it is backing up, and the .duplicacy
                # directory with its settings.
                path: /dfs/volumes/exampledir
                # this field is optional
                type: Directory

Pruning Snapshots

I don’t want to keep snapshots forever, so I made a kubernetes cron job to clean them up.

Briefly, you can specify multiple -keep X:Y arguments, where you keep one snapshot for every X days after the snapshots are older than Y days.

For example, in the purge-stale-duplicacy-snapshots.yml job below, I have it set with -keep 0:365 -keep 30:90 -keep 7:14 -keep 1:2, which means keep no snapshots more than 365 days old, keep one snapshot every 30 days for snapshots older than 90 days, keep one every seven days for snapshots older than fourteen days, and keep one per day for snapshots older than two days.

Warning: Notice that I specified the expiration rules starting with the longest threshold (365 days) and continuing in descending order. A minor annoyance with duplicacy is that the -keep clauses must be listed from the longest age threshold to the shortest; rules specified out of order are ignored, which could lead to more snapshots being purged than you intended. Run with -dry-run first so you can see whether all your rules are being applied as you expect.
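You can do that dry run by hand from inside a thoth-duplicacy container, using the same flags the cron job below uses:

cd /data
# Same retention rules as the job, but -dry-run only reports what would be
# deleted without actually removing anything
duplicacy prune -storage b2 -all -keep 0:365 -keep 30:90 -keep 7:14 -keep 1:2 -dry-run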

Before using this job definition, at a minimum you should set the namespace for the cron job, update the image if you’re running on x86, and update the -keep X:Y statements to correspond with your snapshot retention policy.

Once you’ve updated the configuration, install the cron job with kubectl apply -f purge-stale-duplicacy-snapshots.yml.

You can download this from purge-stale-duplicacy-snapshots.yml instead of hassling with copy and paste.

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: purge-stale-duplicacy-snapshots
  namespace: backups
spec:
  schedule: "48 */3 * * *"
  # Only allow one prune job to run at a time
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: purge-stale-duplicacy-snapshots
            # I'm running this on the odroids in my cluster, so I'm specifying
            # the ARM7 build
            image: unixorn/thoth-duplicacy:armv7l
            # Use the x86_64 tag if you're on Intel
            # image: unixorn/thoth-duplicacy:x86_64

            # Make sure we run inside /data so that duplicacy can find
            # the configuration directory.
            workingDir: /data

            # Remember that the -keep arguments must be listed from longest
            # time frame to shortest, otherwise the disordered ones will be
            # ignored, which could mean deleting snapshots you want to keep.
            #
            # I'm specifying to keep no snapshots more than 365 days old, keep
            # a single snapshot every 30 days for snapshots older than 90 days,
            # a single snapshot a week for snapshots older than 14 days, and
            # finally keep only a single snapshot per day for snapshots
            # older than 2 days.
            #
            # Also note that the duplicacy verb (prune) has to come before
            # any of the settings command line options.
            args:
            - duplicacy
            - prune
            - -storage
            - b2
            - -all
            - -keep 0:365
            - -keep 30:90
            - -keep 7:14
            - -keep 1:2
            - -exhaustive

            volumeMounts:
              - name: data-volume
                mountPath: /data/

            env:
              - name: DUPLICACY_BACKUP_THEAD_COUNT
                value: "3"
              - name: B2_STORAGE_NAME
                value: "b2"

          restartPolicy: OnFailure

          volumes:
            - name: data-volume
              hostPath:
                # This will be remapped to /data which is where duplicacy
                # expects to find the data it is backing up, and the .duplicacy
                # directory with its settings.
                path: /dfs/volumes/exampledir
                # this field is optional
                type: Directory

Restoring Files

Backups are useless if you can’t restore.

To restore, use docker-compose and the thoth-duplicacy repository. I only did my test restores with the command line; I haven’t bothered experimenting with the GUI from https://duplicacy.com.

  1. Use git clone git@github.com:unixorn/thoth-duplicacy.git if you didn’t keep the checkout when you initialized your B2 bucket.
  2. Make a directory to restore to, and a .duplicacy subdirectory for the configuration with mkdir -p /path/to/restore/.duplicacy. While you can restore in place over the live directory, I’m a bit too cautious to do that, especially if I’m doing a restore after having already lost files.
  3. Copy the preferences file from the directory tree you want to restore to /path/to/restore/.duplicacy.
  4. Start a container with BACKUP_LOCATION=/path/to/restore docker-compose run thoth-duplicacy bash

Now that you’re in a running thoth-duplicacy container with your restore directory mounted as /data, you can restore files. cd /data before running duplicacy commands so it can find its configuration.

  • You can look at the available snapshots with duplicacy list. It will list snapshot name, revision number, and the timestamp when each snapshot was created.

  • Once you know what snapshots are in the bucket, you can examine the files available in a specific snapshot with duplicacy list -files -r REVISION_NUMBER.

  • Now you can restore files - if you want to restore just the foo directory from revision 99, you’d run duplicacy restore -r 99 'foo/*'.
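Putting those steps together, a restore session inside the container looks like this - revision 99 and the foo directory are just examples:

cd /data
# Show the snapshot id, revision numbers, and creation times
duplicacy list
# List the files stored in a specific revision
duplicacy list -files -r 99
# Restore just the foo directory from revision 99
duplicacy restore -r 99 'foo/*'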