You are currently viewing all posts tagged with backups.

On E-Books

The Kindle Paperwhite has been my primary medium for consuming books since the beginning of 2014. E Ink is a great display technology that I wish was more wide spread, but beyond the fact that the Kindle (and I assume other e-readers) makes for a pleasant reading experience, the real value in electronic books is storage.

At its peak my physical collection was somewhere north of 200 books. As I mentioned years ago I took inspiration from Gary Snyder’s character in The Dharma Bums and stored my books in milk crates, which stack like a bookcase for normal use and kept the collection pre-boxed for moving. But that many books still take up space, and are still annoying to move. And in some regards they are fragile – redundant data storage is expensive in meatspace.

My digital library currently sits at 572 books and 13 gigabytes (the size skyrocketed after I began to archive a few comics). I could not justify that many physical books in my life. I still have a collection of dead trees, but I’m down to 3 milk crates. I store my digital library in git-annex, allowing me to redundantly replicate my collection across the globe, as well as keep copies in cold storage. I also burn yearly optical backups of the library to M-DISC. The library is managed with Calibre.

When I first bought the Kindle it required internet access to associate with my Amazon account. Ever since then, it has been in airplane mode. I spun up a temporary wireless network for the setup that I then deleted after the process was complete, ensuring that even if Amazon’s airplane mode was untrustworthy, the device would not be able to phone home. The advantages of giving the Kindle internet access seem minute, and are far outweighed by the disadvantage of having to trust Amazon.

If I purchase a book from Amazon, I select the “Download & Transfer via USB” option. This results in a crippled AZW file. I am under the radical delusion that I should own what I purchase, so I import that file into Calibre using the DeDRM_tools plugin. This strips any DRM, making the book ready to be consumed and archived. Books are transferred between my computer and the Kindle via USB, which Calibre makes simple.

When I acquire books through other channels, my preferred format is always EPUB: an open format that is simply a zip archive of HTML files. Calibre’s built-in conversion tools are quite good, giving me confidence that any e-book format I import into the library will be readable at any point in the future, but my preference is to store data in formats that are open, accessible, and understandable. The closer one gets to well-formatted plain text, the closer one gets to god.

While the Kindle excels at the linear reading of novels, I’ve also come to appreciate digital copies of reference books and technical manuals. Often the first reading of these types of books involves lots of flipping back and forth, which is easier in the dead tree variant, but after that first reading the searchability of the digital copy is far more useful for reference. The physical size of these types of books also makes them even more difficult to carry and store than other books, all but guaranteeing you won’t have access to them when you need to reference them. Digital books solve that problem.

I’m confident in my ability to securely store digital data. Whenever I import a book into my library, I know that I now have permanent access to that knowledge for the rest of my life, regardless of environmental disaster, the whims of publishing houses, or the size of my living quarters.

LUKS Header Backup

I’d neglected backup LUKS headers until Gwern’s data loss postmortem last year. After reading his post I dumped the headers of the drives I had accessible, but I never got around to performing the task on my less frequently accessed drives. Last month I had trouble mounting one of those drives. It turned out I was simply using the wrong passphrase, but the experience prompted me to make sure I had completed the header backup procedure for all drives.

I dump the header to memory using the procedure from the Arch wiki. This is probably unnecessary, but only takes a few extra steps. The header is stored in my password store, which is obsessively backed up.

$ sudo mkdir /mnt/tmp
$ sudo mount ramfs /mnt/tmp -t ramfs
$ sudo cryptsetup luksHeaderBackup /dev/sdc --header-backup-file /mnt/tmp/dump
$ sudo chown pigmonkey:pigmonkey /mnt/tmp/dump
$ pass insert -m crypt/luksheader/themisto < /mnt/tmp/dump
$ sudo umount /mnt/tmp
$ sudo rmdir /mnt/tmp

Borg Assimilation

For years the core of my backup strategy has been rsnapshot via cryptshot to various external drives for local backups, and Tarsnap for remote backups.

Tarsnap, however, can be slow. It tends to take somewhere between 15 to 20 minutes to create my dozen or so archives, even if little has changed since the last run. My impression is that this is simply due to the number of archives I have stored and the number of files I ask it to archive. Once it has decided what to do, the time spent transferring data is negligible. I run Tarsnap hourly. Twenty minutes out of every hour seems like a lot of time spent Tarsnapping.

I’ve eyed Borg for a while (and before that, Attic), but avoided using it due to the rapid development of its earlier days. While activity is nice, too many changes too close together do not create a reassuring image of a backup project. Borg seems to have stabilized now and has a large enough user base that I feel comfortable with it. About a month ago, I began using it to backup my laptop to rsync.net.

Initially I played with borgmatic to perform and maintain the backups. Unfortunately it seems to have issues with signal handling, which caused me to end up with annoying lock files left over from interrupted backups. Borg itself has good documentation and is easy to use, and I think it is useful to build familiarity with the program itself instead of only interacting with it through something else. So I did away with borgmatic and wrote a small bash script to handle my use case.

Creating the backups is simple enough. Borg disables compression by default, but after a little experimentation I found that LZ4 seemed to be a decent compromise between compression and performance.

Pruning backups is equally easy. I knew I wanted to match roughly what I had with Tarsnap: hourly backups for a day or so, daily backups for a week or so, then a month or two of weekly backups, and finally a year or so of monthly backups.

My only hesitation was in how to maintain the health of the backups. Borg provides the convenient borg check command, which is able to verify the consistency of both a repository and the archives themselves. Unsurprisingly, this is a slow process. I didn’t want to run it with my hourly backups. Daily, or perhaps even weekly, seemed more reasonable, but I did want to make sure that both checks were completed successfully with some frequency. Luckily this is just the problem that I wrote backitup to solve.

Because the consistency checks take a while and consume some resources, I thought it would also be a good idea to avoid performing them when I’m running on battery. Giving backitup the ability to detect if the machine is on battery or AC power was a simple hack. The script now features the -a switch to specify that the program should only be executed when on AC power.

My completed Borg wrapper is thus:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
#!/bin/sh
export BORG_PASSPHRASE='supers3cr3t'
export BORG_REPO='borg-rsync:borg/nous'
export BORG_REMOTE_PATH='borg1'

# Create backups
echo "Creating backups..."
borg create --verbose --stats --compression=lz4             \
    --exclude ~/projects/foo/bar/baz                        \
    --exclude ~/projects/xyz/bigfatbinaries                 \
    ::'{hostname}-{user}-{utcnow:%Y-%m-%dT%H:%M:%S}'        \
    ~/documents                                             \
    ~/projects                                              \
    ~/mail                                                  \
    # ...etc

# Prune backups
echo "Pruning backups..."
borg prune --verbose --list --prefix '{hostname}-{user}-'    \
    --keep-within=2d                                         \
    --keep-daily=14                                          \
    --keep-weekly=8                                          \
    --keep-monthly=12                                        \

# Check backups
echo "Checking repository..."
backitup -a                                             \
    -p 172800                                           \
    -l ~/.borg_check-repo.lastrun                       \
    -b "borg check --verbose --repository-only"         \
echo "Checking archives..."
backitup -a                                             \
    -p 259200                                           \
    -l ~/.borg_check-arch.lastrun                       \
    -b "borg check --verbose --archives-only --last 24" \

This is executed by a systemd service.

[Unit]
Description=Borg Backup

[Service]
Type=oneshot
ExecStart=/home/pigmonkey/bin/borgwrapper.sh

[Install]
WantedBy=multi-user.target

The service is called hourly by a systemd timer.

[Unit]
Description=Borg Backup Timer

[Timer]
Unit=borg.service
OnCalendar=hourly
Persistent=True

[Install]
WantedBy=timers.target

I don’t enable the timer directly, but add it to /usr/local/etc/trusted_units so that nmtrust activates it when I’m connected to trusted networks.

$ echo "borg.timer,user:pigmonkey" >> /usr/local/etc/trusted_units

I’ve been running this for about a month now and have been pleased with the results. It averages about 30 seconds to create the backups every hour, and another 30 seconds or so to prune the old ones. As with Tarsnap, deduplication is great.

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               19.87 GB             18.41 GB             10.21 MB
All archives:              836.02 GB            773.35 GB             19.32 GB
                       Unique chunks         Total chunks
Chunk index:                  371527             14704634
------------------------------------------------------------------------------

The most recent repository consistency check took about 30 minutes, but only runs every 172800 seconds, or once every other day. The most recent archive consistency check took about 40 minutes, but only runs every 259200 seconds, or once per 3 days. I’m not sure that those schedules are the best option for the consistency checks. I may tweak their frequencies, but because I know they will only be executed when I am on a trusted network and AC power, I’m less concerned about the length of time.

With Borg running hourly, I’ve reduced Tarsnap to run only once per day. Time will tell if Borg will slow as the number of stored archives increase, but for now running Borg hourly and Tarsnap daily seems like a great setup. Tarsnap and Borg both target the same files (with a few exceptions). Tarsnap runs in the AWS us-east-1 region. I’ve always kept my rsync.net account in their Zurich datacenter. This provides the kind of redundancy that lets me rest easy.

Contrary to what you might expect given the number of blog posts on the subject, I actually spend close to no time worrying about data loss in my day to day life, thanks to stuff like this. An ounce of prevention, and all that. (Maybe a few kilograms of prevention in my case.)

Automated Repository Tracking

I have confidence in my backup strategies for my own data, but until recently I had not considered backing up other people’s data.

Recently, the author of a repository that I tracked on GitHub deleted his account and disappeared from the information super highway. I had a local copy of the repository, but I had not pulled it for a month. A number of recent changes were lost to me. This inspired me to setup the system I now use to automatically update local copies of any code repositories that are useful or interesting to me.

I clone the repositories into ~/library/src and use myrepos to interact with them. I use myrepos for work and personal repositories as well, so to keep this stuff segregated I setup a separate config file and a shell alias to refer to it.

alias lmr='mr --config $HOME/library/src/myrepos.conf --directory=$HOME/library/src'

Now when I want to add a new repository, I clone it normally and register it with myrepos.

$ cd ~/library/src
$ git clone https://github.com/warner/magic-wormhole
$ cd magic-wormhole && lmr register

The ~/library/src/myrepos.conf file has a default section which states that no repository should be updated more than once every 24 hours.

[DEFAULT]
skip = [ "$1" = update ] && ! hours_since "$1" 24

Now I can ask myrepos to update all of my tracked repositories. If it sees that it has already updated a repository within 24 hours, myrepos will skip the repository.

$ lmr update

To automate this I create a systemd service.

[Unit]
Description=Update library repositories

[Service]
Type=oneshot
ExecStart=/usr/bin/mr --config %h/library/src/myrepos.conf -j5 update

[Install]
WantedBy=multi-user.target

And a systemd timer to run the service every hour.

[Unit]
Description=Update library repositories timer

[Timer]
Unit=library-repos.service
OnCalendar=hourly
Persistent=True

[Install]
WantedBy=timers.target

I don’t enable this timer directly, but instead add it to my trusted_units file so that nmtrust will enable it only when I am on a trusted network.

$ echo "library-repos.timer,user:pigmonkey" >> /usr/local/etc/trusted_units

If I’m curious to see what has been recently active, I can ls -ltr ~/library/src. I find this more useful than GitHub stars or similar bookmarking.

I currently track 120 repositories. This is only 3.3 GB, which means I can incorporate it into my normal backup strategies without being concerned about the extra space.

The internet can be fickle, but it will be difficult for me to loose a repository again.

Cold Storage

This past spring I mentioned my cold storage setup: a number of encrypted 2.5” drives in external enclosures, stored inside a Pelican 1200 case, secured with Abloy Protec2 321 locks. Offline, secure, and infrequently accessed storage is an important component of any strategy for resilient data. The ease with which this can be managed with git-annex only increases my infatuation with the software.

Data Data Data Data Data

I’ve been happy with the Seagate ST2000LM003 drives for this application. Unfortunately the enclosures I first purchased did not work out so well. I had two die within a few weeks. They’ve been replaced with the SIG JU-SA0Q12-S1. These claim to be compatible with drives up to 8TB (someday I’ll be able to buy 8TB 2.5” drives) and support USB 3.1. They’re also a bit thinner than the previous enclosures, so I can easily fit five in my box. The Seagate drives offer about 1.7 terabytes of usable space, giving this setup a total capacity of 8.5 terabytes.

Setting up git-annex to support this type of cold storage is fairly straightforward, but does necessitate some familiarity with how the program works. Personally, I prefer to do all my setup manually. I’m happy to let the assistant watch my repositories and manage them after the setup, and I’ll occasionally fire up the web app to see what the assistant daemon is doing, but I like the control and understanding provided by a manual setup. The power and flexibility of git-annex is deceptive. Using it solely through the simplified interface of the web app greatly limits what can be accomplished with it.

Encryption

Before even getting into git-annex, the drive should be encrypted with LUKS/dm-crypt. The need for this could be avoided by using something like gcrypt, but LUKS/dm-crypt is an ingrained habit and part of my workflow for all external drives. Assuming the drive is /dev/sdc, pass cryptsetup some sane defaults:

$ sudo cryptsetup --cipher aes-xts-plain64 --key-size 512 --hash sha512 luksFormat /dev/sdc

With the drive encrypted, it can then be opened and formatted. I’ll give the drive a human-friendly label of themisto.

$ sudo cryptsetup luksOpen /dev/sdc themisto_crypt
$ sudo mkfs.ext4 -L themisto /dev/mapper/themisto_crypt

At this point the drive is ready. I close it and then mount it with udiskie to make sure everything is working. How the drive is mounted doesn’t matter, but I like udiskie because it can integrate with my password manager to get the drive passphrase.

$ sudo cryptsetup luksClose /dev/mapper/themisto_crypt
$ udiskie-mount -r /dev/sdc

Git-Annex

With the encryption handled, the drive should now be mounted at /media/themisto. For the first few steps, we’ll basically follow the git-annex walkthrough. Let’s assume that we are setting up this drive to be a repository of the annex ~/video. The first step is to go to the drive, clone the repository, and initialize the annex. When initializing the annex I prepend the name of the remote with satellite :. My cold storage drives are all named after satellites, and doing this allows me to easily identify them when looking at a list of remotes.

$ cd /media/themisto
$ git clone ~/video
$ cd video
$ git annex init "satellite : themisto"

Disk Reserve

Whenever dealing with a repository that is bigger (or may become bigger) than the drive it is being stored on, it is important to set a disk reserve. This tells git-annex to always keep some free space around. I generally like to set this to 1 GB, which is way larger than it needs to be.

$ git config annex.diskreserve "1 gb"

Adding Remotes

I’ll then tell this new repository where the original repository is located. In this case I’ll refer to the original using the name of my computer, nous.

$ git remote add nous ~/video

If other remotes already exist, now is a good time to add them. These could be special remotes or normal ones. For this example, let’s say that we have already completed this whole process for another cold storage drive called sinope, and that we have an s3 remote creatively named s3.

$ git remote add sinope /media/sinope/video
$ export AWS_ACCESS_KEY_ID="..."
$ export AWS_SECRET_ACCESS_KEY="..."
$ git annex enableremote s3

Trust

Trust is a critical component of how git-annex works. Any new annex will default to being semi-trusted, which means that when running operations within the annex on the main computer – say, dropping a file – git-annex will want to confirm that themisto has the files that it is supposed to have. In the case of themisto being a USB drive that is rarely connected, this is not very useful. I tell git-annex to trust my cold storage drives, which means that if git-annex has a record of a certain file being on the drive, it will be satisfied with that. This increases the risk for potential data-loss, but for this application I feel it is appropriate.

$ git annex trust .

Preferred Content

The final step that needs to be taken on the new repository is to tell it what files it should want. This is done using preferred content. The standard groups that git-annex ships with cover most of the bases. Of interest for this application is the archive group, which wants all content except that which has already found its way to another archive. This is the behaviour I want, but I will duplicate it into a custom group called satellite. This keeps my cold storage drives as standalone things that do not influence any other remotes where I may want to use the default archive.

$ git annex groupwanted satellite "(not copies=satellite:1) or approxlackingcopies=1"
$ git annex group . satellite
$ git annex wanted . groupwanted

For other repositories, I may want to store the data on multiple cold storage drives. In that case I would create a redundantsatellite group that wants all content which is not already present in two other members of the group.

$ git annex groupwanted redundantsatellite "(not copies=redundantsatellite:2) or approxlackingcopies=1"
$ git annex group . redundantsatellite
$ git annex wanted . groupwanted

Syncing

With everything setup, the new repository is ready to sync and to start to ingest content from the remotes it knows about!

$ git annex sync --content

However, the original repository also needs to know about the new remote.

$ cd ~/video
$ git remote add themisto /media/themisto/video
$ git annex sync

The same is the case for any other previously existing repository, such as sinope.

Redundant File Storage

As I’ve mentioned previously, I store just about everything that matters in git-annex (the only exception is code, which is stored directly in regular git). One of git-annex’s many killer features is special remotes. They make tenable this whole “cloud storage” thing that we do now.

A special remote allows me to store my files with a large number of service providers. It makes this easy to do by abstracting away the particulars of the provider, allowing me to interact with all of them in the same way. It makes this safe to do by providing encryption. These factors encourage redundancy, reducing my reliance on any one provider.

Recently I began playing with rclone. Rclone is a program that supports file syncing for a handful of cloud storage providers. That’s semi-interesting by itself but, more significantly, there is a git-annex special remote wrapper. That means any of the providers supported by rclone can be used as a special remote. I looked through all of rclone’s supported providers and decided there were a few that I had no reason not to use.

Hubic

Hubic is a storage provider from OVH with a data center in France. Their pricing is attractive. I’d happily pay €50 per year for 10TB of storage. Unfortunately they limit connections to 10 Mbit/s. In my experience they ended up being even slower than this. Slow enough that I don’t want to give them money, but there’s still no reason not to take advantage of their free 25 GB plan.

After signing up, I setup a new remote in rclone.

$ rclone config
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> hubic-annex
Type of storage to configure.
Choose a number from below, or type in your own value
 1 / Amazon Drive
   \ "amazon cloud drive"
 2 / Amazon S3 (also Dreamhost, Ceph)
   \ "s3"
 3 / Backblaze B2
   \ "b2"
 4 / Dropbox
   \ "dropbox"
 5 / Google Cloud Storage (this is not Google Drive)
   \ "google cloud storage"
 6 / Google Drive
   \ "drive"
 7 / Hubic
   \ "hubic"
 8 / Local Disk
   \ "local"
 9 / Microsoft OneDrive
   \ "onedrive"
10 / Openstack Swift (Rackspace Cloud Files, Memset Memstore, OVH)
   \ "swift"
11 / Yandex Disk
   \ "yandex"
Storage> 7
Hubic Client Id - leave blank normally.
client_id> 
Hubic Client Secret - leave blank normally.
client_secret> 
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> y
If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...
Got code
--------------------
[remote]
client_id = 
client_secret = 
token = {"access_token":"XXXXXX"}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

With that setup, I went into my ~/documents annex and added the remote.

$ git annex initremote hubic type=external externaltype=rclone target=hubic-annex prefix=annex-documents chunk=50MiB encryption=shared rclone_layout=lower mac=HMACSHA512

I want git-annex to automatically send everything to Hubic, so I took advantage of standard groups and put the repository in the backup group.

$ git annex wanted hubic standard
$ git annex group hubic backup

Given Hubic’s slow speed, I don’t really want to download files from it unless I need to. This can be configured in git-annex by setting the cost of the remote. Local repositories default to 100 and remote repositories default to 200. I gave the Hubic remote a high cost so that it will only be used if no other remotes are available.

$ git config remote.hubic.annex-cost 500

If you would like to try Hubic, I have a referral code which gives us both an extra 5GB for free.

Backblaze B2

B2 is the cloud storage offering from backup company Backblaze. I don’t know anything about them, but at $0.005 per GB I like their pricing. A quick search of reviews shows that the main complaint about the service is that they offer no geographic redundancy, which is entirely irrelevant to me since I build my own redundancy with my half-dozen or so remotes per repository.

Signing up with Backblaze took a bit longer. They wanted a phone number for 2-factor authentication, I wanted to give them a credit card so that I could use more than the 10GB they offer for free, and I had to generate an application key to use with rclone. After that, the rclone setup was simple.

$ rclone config
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> b2-annex
Type of storage to configure.
Choose a number from below, or type in your own value
 1 / Amazon Drive
   \ "amazon cloud drive"
 2 / Amazon S3 (also Dreamhost, Ceph)
   \ "s3"
 3 / Backblaze B2
   \ "b2"
 4 / Dropbox
   \ "dropbox"
 5 / Google Cloud Storage (this is not Google Drive)
   \ "google cloud storage"
 6 / Google Drive
   \ "drive"
 7 / Hubic
   \ "hubic"
 8 / Local Disk
   \ "local"
 9 / Microsoft OneDrive
   \ "onedrive"
10 / Openstack Swift (Rackspace Cloud Files, Memset Memstore, OVH)
   \ "swift"
11 / Yandex Disk
   \ "yandex"
Storage> 3
Account ID
account> 123456789abc
Application Key
key> 0123456789abcdef0123456789abcdef0123456789
Endpoint for the service - leave blank normally.
endpoint> 
Remote config
--------------------
[remote]
account = 123456789abc
key = 0123456789abcdef0123456789abcdef0123456789
endpoint = 
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

With that, it was back to ~/documents to initialize the remote and send it all the things

$ git annex initremote b2 type=external externaltype=rclone target=b2-annex prefix=annex-documents chunk=50MiB encryption=shared rclone_layout=lower mac=HMACSHA512
$ git annex wanted b2 standard
$ git annex group b2 backup

While I did not measure the speed with B2, it feels as fast as my S3 or rsync.net remotes, so I didn’t bother setting the cost.

Google Drive

While I do not regularly use Google services for personal things, I do have a Google account for Android stuff. Google Drive offers 15 GB of storage for free and rclone supports it, so why not take advantage?

$ rclone config
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> gdrive-annex
Type of storage to configure.
Choose a number from below, or type in your own value
 1 / Amazon Drive
   \ "amazon cloud drive"
 2 / Amazon S3 (also Dreamhost, Ceph)
   \ "s3"
 3 / Backblaze B2
   \ "b2"
 4 / Dropbox
   \ "dropbox"
 5 / Google Cloud Storage (this is not Google Drive)
   \ "google cloud storage"
 6 / Google Drive
   \ "drive"
 7 / Hubic
   \ "hubic"
 8 / Local Disk
   \ "local"
 9 / Microsoft OneDrive
   \ "onedrive"
10 / Openstack Swift (Rackspace Cloud Files, Memset Memstore, OVH)
   \ "swift"
11 / Yandex Disk
   \ "yandex"
Storage> 6
Google Application Client Id - leave blank normally.
client_id> 
Google Application Client Secret - leave blank normally.
client_secret> 
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine or Y didn't work
y) Yes
n) No
y/n> y
If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...
Got code
--------------------
[remote]
client_id = 
client_secret = 
token = {"AccessToken":"xxxx.x.xxxxx_xxxxxxxxxxx_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","RefreshToken":"1/xxxxxxxxxxxxxxxx_xxxxxxxxxxxxxxxxxxxxxxxxxx","Expiry":"2014-03-16T13:57:58.955387075Z","Extra":null}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

And again, to ~/documents.

$ git annex initremote gdrive type=external externaltype=rclone target=gdrive-annex prefix=annex-documents chunk=50MiB encryption=shared rclone_layout=lower mac=HMACSHA512
$ git annex wanted gdrive standard
$ git annex group gdrive backup

Rinse and repeat the process for other annexes. Revel in having simple, secure, and redundant storage.

I celebrated World Backup Day by increasing the resiliency of data in my life.

Four encrypted 2TB hard drives, stored in a Pelican 1200, with Abloy Protec2 PL 321 padlocks as tamper-evident seals. Having everything that matters stored in git-annex makes projects like this simple: just clone the repositories, define the preferred content expressions, and watch the magic happen.

Cold Storage

Optical Backups of Photo Archives

I store my photos in git-annex. A full copy of the annex exists on my laptop and on an external drive. Encrypted copies of all of my photos are stored on Amazon S3 (which I pay for) and box.com (which provides 50GB for free) via git-annex special remotes. The photos are backed-up to an external drive daily with the rest of my laptop hard drive via backitup.sh and cryptshot. My entire laptop hard drive is also mirrored monthly to an external drive stored off-site.

(The majority of my photos are also on Flickr, but I don’t consider that a backup or even reliable storage.)

All of this is what I consider to be the bare minimum for any redundant data storage. Photos have special value, above the value that I assign to most other data. This value only increases with age. As such they require an additional backup method, but due to the size of my collection I want to avoid backup methods that involve paying for more online storage, such as Tarsnap.

I choose optical discs as the medium for my photo backups. This has the advantage of being read-only, which makes it more difficult for accidental deletions or corruption to propagate through the backup system. DVD-Rs have a capacity of 4.7 GBs and a cost of around $0.25 per disc. Their life expectancy varies, but 10-years seem to be a reasonable low estimate.

Preparation

I keep all of my photos in year-based directories. At the beginning of every year, the previous year’s directory is burned to a DVD.

Certain years contain few enough photos that the entire year can fit on a single DVD. More recent years have enough photos of a high enough resolution that they require multiple DVDs.

Archive

My first step is to build a compressed archive of each year. I choose tar and bzip2 compression for this because they’re simple and reliable.

1
2
$ cd ~/pictures
$ tar cjhf ~/tmp/pictures/2012.tar.bz 2012

If the archive is larger than 3.7 GB, it needs to be split into multiple files. The resulting files will be burned to different discs. The capacity of a DVD is 4.7 GB, but I place the upper file limit at 3.7 GB so that the DVD has a minimum of 20% of its capacity available. This will be filled with parity information later on for redundancy.

1
$ split -d -b 3700M 2012.tar.bz 2012.tar.bz.

Encrypt

Leaving unencrypted data around is bad form. The archive (or each of the files resulting from splitting the large archive) is next encrypted and signed with GnuPG.

1
2
$ gpg -eo 2012.tar.bz.gpg 2012.tar.bz
$ gpg -bo 2012.tar.bz.gpg.sig 2012.tar.bz.gpg

Imaging

The encrypted archive and the detached signature of the encrypted archive are what will be burned to the disc. (Or, in the case of a large archive, the encrypted splits of the full archive and the associated signatures will be burned to one disc per split/signature combonation.) Rather than burning them directly, an image is created first.

1
$ mkisofs -V "Photos: 2012 1/1" -r -o 2012.iso 2012.tar.bz.gpg 2012.tar.bz.gpg.sig

If the year has a split archive requiring multiple discs, I modify the sequence number in the volume label. For example, a year requiring 3 discs will have the label Photos: 2012 1/3.

Parity

When I began this project I knew that I wanted some sort of parity information for each disc so that I could potentially recover data from slightly damaged media. My initial idea was to use parchive via par2cmdline. Further research led me to dvdisaster which, despite being a GUI-only program, seemed more appropriate for this use case.

Both dvdisaster and parchive use the same Reed–Solomon error correction codes. Dvdidaster is aimed at optical media and has the ability to place the error correction data on the disc by augmenting the disc image, as well as storing the data separately. It can also scan media for errors and assist in judging when the media is in danger of becoming defective. This makes it an attractive option for long-term storage.

I use dvdisaster with the RS02 error correction method, which augments the image before burning. Depending on the size of the original image, this will result in the disc having anywhere from 20% to 200% redundancy.

Verify

After the image has been augmented, I mount it and verify the signature of the encrypted file on the disc against the local copy of the signature. I’ve never had the signatures not match, but performing this step makes me feel better.

1
2
3
$ sudo mount -o loop 2012.iso /mnt/disc
$ gpg --verify 2012.tar.bz.gpg.sig /mnt/disc/2012.tar.bz.gpg
$ sudo umount /mnt/disc

Burn

The final step is to burn the augmented image. I always burn discs at low speeds to diminish the chance of errors during the process.

1
$ cdrecord -v speed=4 dev=/dev/sr0 2012.iso

Similar to the optical backups of my password database, I burn two copies of each disc. One copy is stored off-site. This provides a reasonably level of assurance against any loss of my photos.