Git Annex Recovery

Occasionally I’ll come across some sort of corruption on one of my cold storage drives. This can typically be repaired in-place via git-annex-repair, but I usually take it as a sign that the hard drive itself is beginning to fail. I prefer to replace the drive. At the end of the process, I want the new drive to be mounted at the same location as the old one was, and I want the repository on the new drive to have the same UUID as the old one. This way the migration is invisible to all other copies of the repository.

To do this, I first prepare the new drive using whatever sort of LUKS encryption and formatting I want, and then mount it wherever the old drive was normally mounted. Call this path $good. The old drive I’ll mount to some other location. Call this path $bad.

Next I create a new clone of the repository on the new drive. Most recently I did this for my video repo, which lives at ~/library/video.

$ git clone ~/library/video $good/video

The .git/config file from the old drive will have the UUID of the annex and other configuration options, as well as any knowledge about other remotes. I copy that into the new repo.

$ cp $bad/video/.git/config $good/video/.git/config

The actual file contents are stored in the .git/annex/objects/ directory. I copy those over to the new drive.

$ mkdir $good/video/.git/annex
$ rsync -avhP --no-compress --info=progress2 $bad/video/.git/annex/objects $good/video/.git/annex/

Next I initialize the new annex. It will recognize the old config and existing objects that were copied over.

$ cd $good/video
$ git annex init
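
Since the whole point is for the new clone to keep the old UUID, it may be worth confirming that before going any further. A minimal sketch, assuming the UUID is read from the annex.uuid key in each repository's .git/config (which is where git-annex records it). The helper name same_annex_uuid is made up for illustration.

```shell
# Hypothetical helper: succeed only if two repositories report the same,
# non-empty annex.uuid in their .git/config files.
# Usage: same_annex_uuid OLD_REPO NEW_REPO
same_annex_uuid() {
    old_uuid=$(git config --file "$1/.git/config" --get annex.uuid) || return 1
    new_uuid=$(git config --file "$2/.git/config" --get annex.uuid) || return 1
    [ -n "$old_uuid" ] && [ "$old_uuid" = "$new_uuid" ]
}
```

For the video repo this would be called as same_annex_uuid $bad/video $good/video, with a non-zero exit status indicating that the two drives disagree.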

At this point I could be done. But if I suspect that there was corruption in one of the files in the .git/annex/objects directory that I copied over, I will next tell the annex to run a check on all its files. I’ll usually start this with --incremental in case I want to kill it before it completes and resume it later. I’ll provide some integer to --jobs depending on how many cores I want to devote to hashing and what I think is appropriate for the disk read and transfer speeds.

$ git annex fsck --incremental --jobs=N

If any of the files did fail, I’ll make sure one of the other remotes is available and then tell the new annex to get whatever it wants.

$ git annex get --auto

Finally, I would want to get rid of any of those corrupt objects that are now just wasting space.

$ git annex unused
$ git annex dropunused all

Optimizing Local Munitions

As previously mentioned, I use myrepos to keep local copies of useful code repositories. While working with backups yesterday I noticed that this directory had gotten quite large. I realized that in the 8 years that I’ve been using this system, I’ve never once run git gc in any of the repos.

Fortunately this is the sort of thing that myrepos makes simple – even providing it as an example on its homepage. I added two new lines to the [DEFAULT] section of my ~/library/src/myrepos.conf file: one telling it that it can run 3 parallel jobs, and one teaching it how to run git gc.

[DEFAULT]
skip = [ "$1" = update ] && ! hours_since "$1" 24
jobs = 3
git_gc = git gc "$@"

That allowed me to use my existing lmr alias to clean up all the git repositories. The software knows which repositories are git, and only attempts to run the command in those.

$ lmr gc

After completing this process – which burned through a lot of CPU – my ~/library/src directory dropped from 70 GB to 15 GB.

So that helped.

Wherein the Author Learns to Compact Borg Archives

I noticed that my Borg directory on The Cloud was 239 GB. This struck me as problematic, as I could see in my local logs that Borg itself reported the deduplicated size of all archives to be 86 GB.

A web search revealed borg compact, which apparently I have been meant to run manually since 2019. Oops. After compacting, the directory dropped from 239 GB to 81 GB.

My borg wrapper script now looks like this:

#!/bin/sh
. ~/.keys/borg.sh
export BORG_REPO='borg-rsync:borg/nous'
export BORG_REMOTE_PATH='borg1'

# Create backups
echo "Creating backups..."
borg create --verbose --stats --compression=lz4             \
    --exclude ~/projects/foo/bar/baz                        \
    --exclude ~/projects/xyz/bigfatbinaries                 \
    ::'{hostname}-{user}-{utcnow:%Y-%m-%dT%H:%M:%S}'        \
    ~/documents                                             \
    ~/projects                                              \
    ~/mail                                                  \
    # ...etc

# Prune
echo "Pruning backups..."
borg prune --verbose --list --glob-archives '{hostname}-{user}-*'   \
    --keep-within=1d                                                \
    --keep-daily=14                                                 \
    --keep-weekly=8                                                 \
    --keep-monthly=12                                               \

# Compact
echo "Compacting repository..."
backitup                                \
    -p 604800                           \
    -l ~/.borg_compact-repo.lastrun     \
    -b "borg compact --verbose"         \

# Check
echo "Checking repository..."
backitup -a                                                         \
    -p 172800                                                       \
    -l ~/.borg_check-repo.lastrun                                   \
    -b "borg check --verbose --repository-only --max-duration=1200" \

echo "Checking archives..."
backitup -a                                             \
    -p 259200                                           \
    -l ~/.borg_check-arch.lastrun                       \
    -b "borg check --verbose --archives-only --last 18" \

Other than the addition of a weekly compact, my setup is the same as it ever was.
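
The -p values in the script appear to be periods in seconds: 604800 is 7 days, 172800 is 2 days, and 259200 is 3 days. The general idea of gating a job on a last-run file can be sketched roughly as below. This is not backitup's actual implementation, only an illustration of the concept. The function names, and the choice to store an epoch timestamp inside the file, are assumptions.

```shell
# Hypothetical helpers illustrating period-gated jobs. Not backitup's code.

# Print the number of seconds in N days.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# Succeed if the job is due: the last-run file is missing, or the epoch
# timestamp stored in it is at least PERIOD seconds old.
# Usage: job_is_due LASTRUN_FILE PERIOD [NOW]
job_is_due() {
    lastrun_file=$1
    period=$2
    now=${3:-$(date +%s)}
    [ -f "$lastrun_file" ] || return 0
    last=$(cat "$lastrun_file")
    last=${last:-0}
    [ $(( now - last )) -ge "$period" ]
}

# Run a command only when it is due, recording the time if it succeeds.
# Usage: run_if_due LASTRUN_FILE PERIOD COMMAND [ARGS...]
run_if_due() {
    lastrun_file=$1
    period=$2
    shift 2
    if job_is_due "$lastrun_file" "$period"; then
        "$@" && date +%s > "$lastrun_file"
    fi
}
```

With these, a weekly job would look something like run_if_due ~/.example.lastrun "$(days_to_seconds 7)" some-command, where the file name and command are placeholders.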

I published my script for creating optical backups.

Optician archives a directory, optionally encrypts it, records the integrity of all the things, and burns it to disc. I created it last year after writing about the steps I took to create optical backups of financial archives. Since then I’ve used it to create my monthly password database backups, yearly e-book library backups, and this year’s annual financial backup.

New Year, New Drive

My first solid state drive was a Samsung 850 Pro 1TB purchased in 2015. Originally I installed it in my T430s. The following year it migrated to my new X260, where it has served admirably ever since. It still seems healthy, as best as I can tell. Some time ago I found a script for measuring the health of Samsung SSDs. It reports:

------------------------------
 SSD Status:   /dev/sda
------------------------------
 On time:      17,277 hr
------------------------------
 Data written:
           MB: 47,420,539.560
           GB: 46,309.120
           TB: 45.223
------------------------------
 Mean write rate:
        MB/hr: 2,744.720
------------------------------
 Drive health: 98 %
------------------------------

The 1 terabyte of storage has begun to feel tight over the past couple of years. I’m not sure where it all goes, but I regularly only have about 100GB free, which is not much of a buffer. I’ve had my eye on a Samsung 860 Evo 2TB as a replacement. Last November my price monitoring tool notified me of a significant price drop for this new drive, so I snatched one up. This weekend I finally got around to installing it.

The health script reports that my new drive is, in fact, both new and healthy:

------------------------------
 SSD Status:   /dev/sda
------------------------------
 On time:      17 hr
------------------------------
 Data written:
           MB: 872,835.635
           GB: 852.378
           TB: .832
------------------------------
 Mean write rate:
        MB/hr: 51,343.272
------------------------------
 Drive health: 100 %
------------------------------
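
The figures in both reports are consistent with one another: the mean write rate works out to the total data written divided by the powered-on hours, and the GB and TB rows are successive divisions by 1024. A quick sketch of that arithmetic with awk follows. The function names are made up for illustration, and this is not the health script itself.

```shell
# Hypothetical helpers reproducing the arithmetic behind the reports.

# Mean write rate in MB/hr, given total MB written and powered-on hours.
# Usage: mean_write_rate MB HOURS
mean_write_rate() {
    awk -v mb="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", mb / hours }'
}

# Convert MB to TB using binary (1024-based) steps.
# Usage: mb_to_tb MB
mb_to_tb() {
    awk -v mb="$1" 'BEGIN { printf "%.2f\n", mb / 1024 / 1024 }'
}
```

Feeding in the old drive's numbers, mean_write_rate 47420539.560 17277 gives roughly 2744.7 MB/hr, and mb_to_tb 47420539.560 gives about 45.22 TB.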

When migrating to a new drive, the simple solution is to just copy the complete contents of the old drive. I usually do not take this approach. Instead I prefer to imagine that the old drive is lost, and use the migration as an exercise to ensure that my excessive backup strategies and OS provisioning system are both fully operational. Successfully rebuilding my laptop like this, with a minimum expenditure of time and effort – and no data loss – makes me feel good about my backup and recovery tooling.

Optical Backups of Financial Archives

Every year I burn an optical archive of my financial documents, similar to how (and why) I create optical backups of photos. I schedule this financial archive for the spring, after the previous year’s taxes have been submitted and accepted. Taskwarrior solves the problem of remembering to complete the archive.

$ task add project:finance due:2019-04-30 recur:yearly wait:due-4weeks "burn optical financial archive with parity"

The archive includes two git-annex repositories.

The first is my ledger repository. Ledger is the double-entry accounting system I began using in 2012 to record the movement of every penny that crosses one of my bank accounts (small cash transactions, less than about $20, are usually-but-not-always exempt from being recorded). In addition to the plain-text ledger files, this repository also holds PDF or JPG images of receipts.

The second repository holds my tax information. Each tax year gets a ctmg container which contains any documents used to complete my tax returns, the returns themselves, and any notifications of those returns being accepted.

The yearly optical archive that I create holds the entirety of these two repositories – not just the information from the previous year – so really each disc only needs to have a shelf life of 12 months. Keeping the older discs around just provides redundancy for prior years.

Creating the Archive

The process of creating the archive is very similar to the process I outlined six years ago for the photo archives.

The two repositories, combined, are about 2GB (most of that is the directory of receipts from the ledger repository). I burn these to a 25GB BD-R disc, so file size is not a concern. I’ll tar them, but skip any compression, which would just add extra complexity for no gain.

$ mkdir ~/tmp/archive
$ cd ~/library
$ tar cvf ~/tmp/archive/ledger.tar ledger
$ tar cvf ~/tmp/archive/tax.tar tax

The ledger archive will get signed and encrypted with my PGP key. The contents of the tax repository are already encrypted, so I’ll skip encryption and just sign the archive. I like using detached signatures for this.

$ cd ~/tmp/archive
$ gpg -e -r peter@havenaut.net -o ledger.tar.gpg ledger.tar
$ gpg -bo ledger.tar.gpg.sig ledger.tar.gpg
$ gpg -bo tax.tar.sig tax.tar
$ rm ledger.tar

Previously, when creating optical photo archives, I used DVDisaster to create the disc image with parity. DVDisaster no longer exists. The code can still be found, and the program still works, but nobody is developing it and it doesn’t even have an official web presence. This makes me uncomfortable for a tool that is part of my long-term archiving plans. As a result, I’ve moved back to using Parchive for parity. Parchive also does not have much in the way of active development around it, but it is still maintained, has been around for a long period of time, is still used by a wide community, and will probably continue to exist as long as people share files on less-than-perfectly-reliable mediums.

As previously mentioned, I’m not worried about the storage space for these files, so I tell par2create to create PAR2 files with 30% redundancy. I suppose I could go even higher, but 30% seems like a good number. By default this process will be allowed to use 16MB of memory, which is cute, but RAM is cheap and I usually have enough to spare so I’ll give it permission to use up to 8GB.

$ par2create -r30 -m8000 recovery.par2 *

Next I’ll use hashdeep to generate message digests for all the files in the archive.

$ hashdeep * > hashes

At this point all the file processing is completed. I’ll put a blank disc in my burner (a Pioneer BDR-XD05B) and burn the directory using growisofs.

$ growisofs -Z /dev/sr0 -V "Finances 2019" -r *

Verification

The final step is to verify the disc. I have a few options on this front. These are the same steps I’d take years down the road if I actually needed to recover data from the archive.

I can use the previous hashes to find any files that do not match, which is a quick way to identify bit rot.

$ hashdeep -x -k hashes *.{gpg,tar,sig,par2}
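
The same idea can be tried out on a machine without hashdeep. The sketch below swaps in sha256sum from GNU coreutils (a stand-in, not the tool used above) to build a manifest and later report any files that no longer match it. The function names are made up for illustration.

```shell
# Sketch using sha256sum as a stand-in for hashdeep; names are hypothetical.

# Write a manifest of SHA-256 digests for the given files.
# Usage: make_manifest MANIFEST FILE...
make_manifest() {
    manifest=$1
    shift
    sha256sum -- "$@" > "$manifest"
}

# Print the names of files whose contents no longer match the manifest.
# Prints nothing when every file still matches.
# Usage: audit_manifest MANIFEST
audit_manifest() {
    sha256sum --check --quiet -- "$1" 2>/dev/null | sed -n 's/: FAILED.*$//p'
}
```

An empty result from audit_manifest means nothing has changed since the manifest was written. Any file name it prints is a candidate for repair from the PAR2 data.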

I can check the integrity of the PGP signatures.

$ gpg --verify ledger.tar.gpg{.sig,}
$ gpg --verify tax.tar{.sig,}

I can use the PAR2 files to verify the original data files.

$ par2 verify recovery.par2

Archiving Bookmarks

I signed up for Pinboard in 2014. It provides everything I need from a bookmarking service, which is mostly, you know, bookmarking. I pay for the archival account, meaning that Pinboard downloads a copy of everything I bookmark and provides me with full-text search. I find this useful and well worth the $25 yearly fee, but Pinboard’s archive is only part of the solution. I also need an offline copy of my bookmarks.

Pinboard provides an API that makes it easy to acquire a list of bookmarks. I have a small shell script which pulls down a JSON-formatted list of my bookmarks and adds the file to git-annex. This is controlled via a systemd service and timer, which wraps the script in backitup to ensure daily dumps. The systemd timer itself is controlled by nmtrust, so that it only runs when I am connected to a trusted network.
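
The general shape of such a dump script is roughly as follows. This is a simplified sketch rather than the actual script. It assumes Pinboard's v1 posts/all endpoint with the format=json and auth_token parameters. The function names, the bookmarks.json file name, and the commit message are placeholders.

```shell
# Simplified sketch; function names, file name, and message are placeholders.

# Build the URL for a full JSON export of bookmarks.
# Usage: pinboard_export_url USER:TOKEN
pinboard_export_url() {
    printf 'https://api.pinboard.in/v1/posts/all?format=json&auth_token=%s\n' "$1"
}

# Download the export into an annex and commit it if it changed.
# Runs in a subshell so the caller's working directory is untouched.
# Usage: dump_bookmarks ANNEX_DIR USER:TOKEN
dump_bookmarks() (
    cd "$1" || exit 1
    tmp=$(mktemp) || exit 1
    if ! curl --fail --silent --show-error --output "$tmp" \
            "$(pinboard_export_url "$2")"; then
        rm -f "$tmp"
        exit 1
    fi
    # Replace the previous dump rather than writing through an annexed symlink.
    mv -f "$tmp" bookmarks.json
    git annex add bookmarks.json || exit 1
    # Commit only when the new dump differs from the last one.
    git diff --cached --quiet || git commit --quiet -m "Update Pinboard bookmarks"
)
```

Downloading to a temporary file first means a failed request never clobbers the previous dump.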

This provides data portability, ensuring that I could import my tagged URLs to another bookmarking service if I ever found something better than Pinboard (unlikely, competing with Pinboard is futile). But I also want a locally archived copy of the pages themselves, which Pinboard does not offer through the API. I care very much about being able to work offline. The usefulness of a computer is directly proportional to the amount of data that is accessible without a network connection.

To address this I use bookmark-archiver, a Python script which reads URLs from a variety of input files, including Pinboard’s JSON dumps. It archives each URL via wget, generates a screenshot and PDF via headless Chromium, and submits the URL to the Internet Archive (with WARC hopefully on the way). It will then generate an HTML index page, allowing the archives to be easily browsed. When I want to browse the archive, I simply change into the directory and use python -m http.server to serve the bookmarks at localhost:8000. Once downloaded locally, the archives are of course backed up, via the usual suspects like borg and cryptshot.

The archiver is configured via environment variables. I configure my preferences and point the program at the Pinboard JSON dump in my annex via a shell script (creatively also named bookmark-archiver). This wrapper script is called by the previous script which dumps the JSON from Pinboard.

The result of all of this is that every day I get a fresh dump of all my bookmarks, each URL is archived locally in multiple formats, and the archive enters into my normal backup queue. Link rot may defeat the Supreme Court, but between this and my automated repository tracking I have a pretty good system for backing up useful pieces of other people’s data.

On E-Books

The Kindle Paperwhite has been my primary medium for consuming books since the beginning of 2014. E Ink is a great display technology that I wish was more widespread, but beyond the fact that the Kindle (and I assume other e-readers) makes for a pleasant reading experience, the real value in electronic books is storage.

At its peak my physical collection was somewhere north of 200 books. As I mentioned years ago I took inspiration from Gary Snyder’s character in The Dharma Bums and stored my books in milk crates, which stack like a bookcase for normal use and kept the collection pre-boxed for moving. But that many books still take up space, and are still annoying to move. And in some regards they are fragile – redundant data storage is expensive in meatspace.

My digital library currently sits at 572 books and 13 gigabytes (the size skyrocketed after I began to archive a few comics). I could not justify that many physical books in my life. I still have a collection of dead trees, but I’m down to 3 milk crates. I store my digital library in git-annex, allowing me to redundantly replicate my collection across the globe, as well as keep copies in cold storage. I also burn yearly optical backups of the library to M-DISC. The library is managed with Calibre.

When I first bought the Kindle it required internet access to associate with my Amazon account. Ever since then, it has been in airplane mode. I spun up a temporary wireless network for the setup that I then deleted after the process was complete, ensuring that even if Amazon’s airplane mode was untrustworthy, the device would not be able to phone home. The advantages of giving the Kindle internet access seem minute, and are far outweighed by the disadvantage of having to trust Amazon.

If I purchase a book from Amazon, I select the “Download & Transfer via USB” option. This results in a crippled AZW file. I am under the radical delusion that I should own what I purchase, so I import that file into Calibre using the DeDRM_tools plugin. This strips any DRM, making the book ready to be consumed and archived. Books are transferred between my computer and the Kindle via USB, which Calibre makes simple.

When I acquire books through other channels, my preferred format is always EPUB: an open format that is simply a zip archive of HTML files. Calibre’s built-in conversion tools are quite good, giving me confidence that any e-book format I import into the library will be readable at any point in the future, but my preference is to store data in formats that are open, accessible, and understandable. The closer one gets to well-formatted plain text, the closer one gets to god.

While the Kindle excels at the linear reading of novels, I’ve also come to appreciate digital copies of reference books and technical manuals. Often the first reading of these types of books involves lots of flipping back and forth, which is easier in the dead tree variant, but after that first reading the searchability of the digital copy is far more useful for reference. The physical size of these types of books also makes them even more difficult to carry and store than other books, all but guaranteeing you won’t have access to them when you need to reference them. Digital books solve that problem.

I’m confident in my ability to securely store digital data. Whenever I import a book into my library, I know that I now have permanent access to that knowledge for the rest of my life, regardless of environmental disaster, the whims of publishing houses, or the size of my living quarters.