Motivation
I wanted to ensure any data I put into my ARM k3s cluster is backed up to prevent data loss.
I no longer recommend Duplicacy. Instead, read my article on restic backups on TrueNAS.
Backup Contenders
I took a look at CrashPlan, restic and Duplicacy.
- CrashPlan - Even though they have a decent Linux client, I eliminated CrashPlan because:
  - They’ve already abandoned the home market. I currently use their CrashPlan for Small Business account for my Mac, and I suspect they’ll abandon this market too because accounts with a small number of licenses also aren’t worth their time.
  - They bill per-machine instead of by the amount of storage used.
- restic - Open source, which is great, but I ended up eliminating it because its deduplication wasn’t as strong as Duplicacy’s. It also seemed a little awkward to prune snapshots when I experimented with it.
- Duplicacy - I chose Duplicacy because it:
  - Supports cross-source deduplication
  - Works well with B2
  - Runs on Linux, Windows and Mac
  - Allows multiple source directories to be backed up simultaneously to the same B2 bucket.
  - Continues backing up where it left off after being interrupted and restarted. This eliminates having to completely restart the backup and re-upload everything.
It didn’t hurt that I know several people using it with large amounts of data who are happy with it.
On to the Backups
I made a Docker image, thoth-duplicacy, which installs duplicacy and duplicacy-util on top of Debian buster-slim, along with some helper scripts to make using it more convenient.
The image is published on Docker Hub, with both an Intel and an ARM7 version - the most current builds are tagged unixorn/thoth-duplicacy:armv7l and unixorn/thoth-duplicacy:x86_64.
Usage
For simplicity, I’m running my backups as kubernetes cron jobs. This allows me to easily run backups of multiple directory trees at once, and the kubernetes scheduler will automagically spread them around the cluster to the least loaded nodes.
Pre-requisites
Create a Kubernetes Namespace
I like my cluster neat and organized, with different services in their own namespaces, so I created a backups namespace by running kubectl create namespace backups.
Set up a B2 Storage Bucket for Duplicacy
Create a B2 Bucket and App Key
- Create a bucket in B2. Only use this bucket for duplicacy backups. Do this first so that when you create the app key, you can easily restrict its access to only this bucket using the dropdown menu.
- Create an app key in B2 that you only use with Duplicacy. Definitely do not use the root account’s credentials. When you create it, specify that it’s only allowed to use your backups bucket. Make sure to copy the app key information when you create it - it will only be displayed once.
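If you’d rather script this than click through the B2 web console, Backblaze’s b2 command-line tool can do the same thing. This is just a sketch - the bucket name, key name, and capability list below are placeholders, and the exact subcommand names vary between b2 CLI releases, so check b2 --help against your installed version:

```bash
# Authorize the CLI with your master credentials (one time, on a trusted machine)
b2 authorize-account

# Create a private bucket that will only ever hold duplicacy backups
b2 create-bucket your-backups-bucket allPrivate

# Create an app key that can only touch that bucket.
# Copy the key ID and secret from the output - they're only shown once.
b2 create-key --bucket your-backups-bucket duplicacy-backups \
    listBuckets,listFiles,readFiles,writeFiles,deleteFiles
```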
Now you’re ready to initialize the bucket for the first directory you want to back up. The easiest way to do this is by running duplicacy inside the thoth-duplicacy container with docker-compose.
Set up thoth-duplicacy container
git clone git@github.com:unixorn/thoth-duplicacy.git
BACKUP_LOCATION=/that/first/directory docker-compose run thoth-duplicacy bash
cd /data
mkdir -p .duplicacy
Initialize the B2 Bucket.
duplicacy init -encrypt -storage-name b2 STORAGEID b2://yourbucket
STORAGEID cannot have spaces or any special characters besides - and _. duplicacy will prompt you for the B2 app ID, app key, and the encryption password for your backups. Store the password in your secure password manager - without it, you can’t restore any of your data. Annoyingly you have to also set the password, B2 id and key again after initializing the bucket so that backups won’t prompt you for them.
- Set the B2 ID - duplicacy set -storage b2://net-unixorn-blog-test -key b2_id -value YOUR_APP_ID
- Set the B2 key - duplicacy set -storage b2://net-unixorn-blog-test -key b2_key -value YOUR_APP_KEY
- Set the password - duplicacy set -storage b2://net-unixorn-blog-test -key password -value YOURPASSWORD
You can now run backup-cronjob and watch the first backup grind.
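If you’d rather kick the first backup off by hand instead of using the wrapper script, you can run duplicacy directly from the same container - a minimal sketch, assuming the storage name b2 from the init step; the thread count here is just an example:

```bash
cd /data

# Run the initial backup against the b2 storage. -stats prints upload
# statistics, and a small -threads value keeps it from saturating the uplink.
duplicacy backup -storage b2 -stats -threads 4
```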
After I configured duplicacy for the first time, it was much less hassle to copy the .duplicacy/preferences JSON file into each new directory tree’s .duplicacy directory and change the id key to a new unique one - don’t put spaces or any special characters in the id other than _ and -. You don’t have to change the storage key, and actually shouldn’t - sharing the same storage bucket is what allows duplicacy to deduplicate your files across multiple source directories, which keeps your storage bill down.
Here’s an example preferences file -
[
{
"name": "b2",
"id": "UNIQUE_ID_FOR_YOUR_DIRECTORY",
"repository": "",
"storage": "b2://your-backups-bucket",
"encrypted": true,
"no_backup": false,
"no_restore": false,
"no_save_password": false,
"nobackup_file": "",
"keys": {
"b2_id": "ROLE_ACCOUNT_B2_ID",
"b2_key": "ROLE_ACCOUNT_B2_KEY",
"password": "SUPER_SECRET_ENCRYPTION_PASSWORD_FOR_YOUR_DATA"
}
}
]
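To make that concrete, here’s roughly what adding a second directory tree looks like once you have one working preferences file. The paths and the new id below are hypothetical - substitute your own:

```bash
# Hypothetical new tree to back up
NEW_TREE=/dfs/volumes/anotherdir

# Reuse the working configuration from the first tree
mkdir -p "${NEW_TREE}/.duplicacy"
cp /dfs/volumes/exampledir/.duplicacy/preferences "${NEW_TREE}/.duplicacy/preferences"

# Give this tree its own unique snapshot id (letters, digits, - and _ only).
# Leave the storage setting alone so deduplication spans both trees.
sed -i 's/"id": "[^"]*"/"id": "anotherdir-backups"/' "${NEW_TREE}/.duplicacy/preferences"
```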
Backing up a Directory Tree
Here’s a sample job that backs up one of my directory trees. I’m using the backups namespace that I created earlier to keep things tidy - if you want to use the default namespace instead, delete the namespace entry in the metadata section.
Here are some things you’ll need to customize if you base a job on this example:
- Change the namespace entry in the metadata section to match whichever namespace you decided to use.
- I run this on Odroid HC2s and Raspberry Pis, which both use ARM CPUs. If you’re using x86, change the image entry in the template spec section to unixorn/thoth-duplicacy:x86_64.
- I work from home, so I want to restrict the number of upload threads so that running backups don’t burn all my upload bandwidth. Change DUPLICACY_BACKUP_THEAD_COUNT in the env section if you want more simultaneous threads. The odroids only have 8 cores, but I had no issues running 12 threads other than gobbling up upstream bandwidth.
- The B2_STORAGE_NAME environment variable is used by the backup-cronjob script to determine where to write the backup, so alter the value according to your setup.
- I’m backing up a MooseFS distributed file system. I had already tagged all my chunk servers with kubectl label node NODENAME odroid=true, and I use a nodeSelector stanza in the backup cron jobs to restrict the backup to run only on one of the chunk servers where the data resides. The MooseFS data is distributed across all the chunk servers, and each chunk server in the cluster currently contains 33% of the files, so running the backup on a chunk server maximizes the amount of data that can be read locally instead of going across the network. Update or delete the nodeSelector clause to work with your environment.
Once you’ve updated the file, install the cronjob with kubectl apply -f backup-example-directory-tree.yml.
You can download this from backup-example-directory-tree.yml instead of hassling with copy and paste.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: backup-exampledir
  namespace: backups
spec:
  schedule: "35 */2 * * *"
  # Ensure only one copy of the backup is running, even if it takes
  # so long to run that it is still running when the next cron slot
  # occurs
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup-exampledir
            # I'm running this on the odroids in my cluster, so I'm specifying
            # the ARM7 build
            image: unixorn/thoth-duplicacy:armv7l
            # Use the x86_64 tag if you're on Intel
            # image: unixorn/thoth-duplicacy:x86_64
            args:
            - /bin/sh
            - -c
            - /usr/local/bin/backup-cronjob
            volumeMounts:
            - name: data-volume
              mountPath: /data/
            env:
            # I want to restrict the number of threads used for uploads
            # so that duplicacy doesn't consume all my upload bandwidth.
            # I don't care if it makes my backups slower.
            - name: DUPLICACY_BACKUP_THEAD_COUNT
              value: "3"
            # backup-cronjob needs to know what defined storage to back up
            # files to.
            - name: B2_STORAGE_NAME
              value: "b2"
          restartPolicy: OnFailure
          # Keep it running on a chunkserver so that at least part of the
          # I/O is to local disk instead of across the network. Remove if
          # you don't care what node backups happen on.
          nodeSelector:
            odroid: "true"
          volumes:
          - name: data-volume
            hostPath:
              # This will be remapped to /data which is where duplicacy
              # expects to find the data it is backing up, and the .duplicacy
              # directory with its settings.
              path: /dfs/volumes/exampledir
              # this field is optional
              type: Directory
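Once the cron job is installed, it’s worth confirming kubernetes sees it and kicking off a run without waiting for the schedule. These kubectl commands assume the backups namespace and the job name from the example above; the manual job name is just something I made up:

```bash
# Confirm the cron job exists and see when it last ran
kubectl -n backups get cronjobs

# Trigger an immediate run instead of waiting for the schedule
kubectl -n backups create job --from=cronjob/backup-exampledir backup-exampledir-manual

# Watch the job's pod and tail the duplicacy output
kubectl -n backups get pods
kubectl -n backups logs -f job/backup-exampledir-manual
```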
Pruning Snapshots
I don’t want to keep snapshots forever, so I made a kubernetes cron job to clean them up.
Briefly, you can specify multiple -keep X:Y arguments, where you keep one snapshot for every X days once snapshots are older than Y days.
For example, in the purge-stale-duplicacy-snapshots.yml job below, I have it set with -keep 0:365 -keep 30:90 -keep 7:14 -keep 1:2, which means keep no snapshots more than 365 days old, for snapshots older than 90 days keep one every 30 days, for snapshots older than 14 days keep one every 7 days, and for snapshots older than 2 days keep one per day.
Warning: Notice that I specified the expiration rules starting with the longest (365 days) and continuing in descending age order - a minor annoyance with duplicacy is that you have to specify the -keep clauses starting with the longest age threshold and then specify the rules for shorter thresholds, or duplicacy will ignore the rules specified out of order, which could lead to more snapshots being purged than you would expect. Run with -dry-run first so you can see whether all your rules are being applied as you expect.
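Here’s what that dry run might look like from inside a thoth-duplicacy container, using the same retention rules as the cron job below - a sketch only, so substitute your own -keep rules:

```bash
cd /data

# -dry-run reports which revisions would be removed without deleting anything,
# so you can confirm the -keep rules behave the way you expect.
duplicacy prune -storage b2 -all -dry-run \
    -keep 0:365 -keep 30:90 -keep 7:14 -keep 1:2
```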
Before using this job definition, at a minimum you should set the namespace for the cron job, update the image if you’re running on x86, and update the -keep X:Y statements to correspond with your snapshot retention policy.
Once you’ve updated the configuration, install the cron job with kubectl apply -f purge-stale-duplicacy-snapshots.yml.
You can download this from purge-stale-duplicacy-snapshots.yml instead of hassling with copy and paste.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: purge-stale-duplicacy-snapshots
  namespace: backups
spec:
  schedule: "48 */3 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: purge-stale-duplicacy-snapshots
            # I'm running this on the odroids in my cluster, so I'm specifying
            # the ARM7 build
            image: unixorn/thoth-duplicacy:armv7l
            # Use the x86_64 tag if you're on Intel
            # image: unixorn/thoth-duplicacy:x86_64
            # Make sure we run inside /data so that duplicacy can find
            # the configuration directory.
            workingDir: /data
            # Remember that the -keep arguments must be listed from longest
            # time frame to shortest, otherwise the disordered ones will be
            # ignored, which could mean deleting snapshots you want to keep.
            #
            # I'm specifying to keep no snapshots more than 365 days old, keep
            # a single snapshot every 30 days for snapshots older than 90 days,
            # a single snapshot a week for snapshots older than 14 days, and
            # finally keep only a single snapshot per day for snapshots
            # older than 2 days.
            #
            # Also note that the duplicacy verb (prune) has to come before
            # any of the settings command line options.
            #
            # Each flag and its value are separate list items so that duplicacy
            # sees them as separate arguments.
            args:
            - duplicacy
            - prune
            - -storage
            - b2
            - -all
            - -keep
            - "0:365"
            - -keep
            - "30:90"
            - -keep
            - "7:14"
            - -keep
            - "1:2"
            - -exhaustive
            volumeMounts:
            - name: data-volume
              mountPath: /data/
            env:
            - name: DUPLICACY_BACKUP_THEAD_COUNT
              value: "3"
            - name: B2_STORAGE_NAME
              value: "b2"
          restartPolicy: OnFailure
          volumes:
          - name: data-volume
            hostPath:
              # This will be remapped to /data which is where duplicacy
              # expects to find the data it is backing up, and the .duplicacy
              # directory with its settings.
              path: /dfs/volumes/exampledir
              # this field is optional
              type: Directory
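As with the backup job, you can confirm the prune job is doing what you expect by looking at its most recent run. The job name suffix below is a placeholder - kubernetes generates one for each scheduled run:

```bash
# List the jobs the cron job has spawned
kubectl -n backups get jobs

# Read the prune output from the most recent run
# (replace the suffix with one from the list above)
kubectl -n backups logs job/purge-stale-duplicacy-snapshots-1590000000
```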
Restoring Files
Backups are useless if you can’t restore.
To restore, use docker-compose and the thoth-duplicacy repository. I only did my test restores with the command line; I haven’t bothered experimenting with the GUI from https://duplicacy.com.
- Use git clone git@github.com:unixorn/thoth-duplicacy.git if you didn’t keep the checkout when you initialized your B2 bucket.
- Make a directory to restore to, and a .duplicacy subdirectory for the configuration, with mkdir -p /path/to/restore/.duplicacy. While you can restore in place over the live directory, I’m a bit too cautious to do that, especially if I’m doing a restore after having already lost files.
- Copy the preferences file from the directory tree you want to restore to /path/to/restore/.duplicacy.
- Start a container with BACKUP_LOCATION=/path/to/restore docker-compose run thoth-duplicacy bash
Now that you’re in a running thoth-duplicacy container with your restore directory mounted as /data, you can restore files. cd /data before running duplicacy commands so it can find its configuration.
- You can look at the available snapshots with duplicacy list. It will list the snapshot name, revision number, and the timestamp when each snapshot was created.
- Once you know what snapshots are in the bucket, you can examine the files available in a specific snapshot with duplicacy list -files -r REVISION_NUMBER.
- Now you can restore files - if you want to restore just the foo directory from revision 99, you’d run duplicacy restore -r 99 'foo/*'.
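Putting those steps together, a restore session inside the container looks something like this - revision 99 and the foo directory are just the examples from above:

```bash
cd /data

# See which snapshot ids and revisions exist in the bucket
duplicacy list

# Inspect the files recorded in a particular revision
duplicacy list -files -r 99

# Restore the foo directory from revision 99 into the current tree (/data)
duplicacy restore -r 99 'foo/*'
```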