
Cloning Backup Drives

Continuing with the theme of replacing drives, recently I decided to preemptively replace one of the external drives that I back up to via rsnapshot – or, more specifically, via cryptshot. The drive was functioning nominally, but its date of manufacture was 2014. That’s way too long to trust spinning rust.

rsnapshot implements deduplication via hard links. Were I to just rsync the contents of the old drive to the new drive without any special consideration for the links, it would dereference the links, copying them over as separate files. This would cause the size of the backups to balloon past the capacity of the drive. Rsync provides the --hard-links flag to address this, but I’ve heard some stories about this failing to act as expected when the source directory has a large number of hard links (for some unknown definition of “large”). I’ve been rsnapshotting since 2012 (after a pause sometime after 2006, apparently) and feel safe assuming that my rsnapshot repository does have a “large” number of hard links.
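
For reference, the rsync approach I decided against would look something like this (mount points hypothetical), with --hard-links telling rsync to recreate the links on the destination rather than copying duplicate files:

$ rsync -a --hard-links --info=progress2 /mnt/old/ /mnt/new/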

I also do not really care about syncing. The destination is completely empty. There’s no file comparison that needs to happen. I don’t need the ability to pause partway through the transfer and resume it later. Rsync is my default solution for pushing files around, but in this case it is not really needed. I only want to mirror the contents of the old drive onto the new drive, exactly as they exist on the old drive. So I avoided the problem altogether and just copied the partition via dd.

Both drives are encrypted with LUKS, so first I decrypt them. Importantly, I do not mount either decrypted partition. I don’t want to risk any modifications being made to either while the copy is ongoing.

$ sudo cryptsetup luksOpen /dev/sda old
$ sudo cryptsetup luksOpen /dev/sdb new

Then I copy the old partition to the new one.

$ sudo dd if=/dev/mapper/old of=/dev/mapper/new bs=32M status=progress

My new drive is the same size as my old drive, so after dd finished I was done. If the sizes differed, I would need to use resize2fs to resize the filesystem on the new drive.
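
If I did need to do that, say because the new mapping was larger, it would be something along these lines: a forced check, followed by growing the ext4 filesystem to fill the device.

$ sudo e2fsck -f /dev/mapper/new
$ sudo resize2fs /dev/mapper/new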

If I were replacing the old drive not just because it was old and I was ageist, but because I thought it may be corrupted, I would probably do this with GNU ddrescue rather than plain old dd. (Though, realistically, if that were the case I’d probably just copy the contents of my other rsnapshot target drive to the new drive, and replace the corrupt drive with that. Multiple backup mediums make life easier.)
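
Were I to go the ddrescue route, it would look roughly like the following (the mapfile name is arbitrary); the mapfile is what lets ddrescue resume and retry around bad sectors, and -f is needed to write to a block device:

$ sudo ddrescue -f /dev/mapper/old /dev/mapper/new rescue.map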

Git Annex Recovery

Occasionally I’ll come across some sort of corruption on one of my cold storage drives. This can typically be repaired in place via git-annex-repair, but I usually take it as a sign that the hard drive itself is beginning to fail. I prefer to replace the drive. At the end of the process, I want the new drive to be mounted at the same location as the old one was, and I want the repository on the new drive to have the same UUID as the old one. This way the migration is invisible to all other copies of the repository.
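
The in-place fix, for reference, is just the following, run from within the repository on the ailing drive:

$ git annex repair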

To do this, I first prepare the new drive using whatever sort of LUKS encryption and formatting I want, and then mount it at the same location as wherever the old drive was normally mounted to. Call this path $good. The old drive I’ll mount to some other location. Call this path $bad.
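
The preparation itself is nothing special. Roughly, with the device and mapper names below being stand-ins, it looks like:

$ sudo cryptsetup luksFormat /dev/sdc
$ sudo cryptsetup luksOpen /dev/sdc annex-new
$ sudo mkfs.ext4 /dev/mapper/annex-new
$ sudo mount /dev/mapper/annex-new $good
$ sudo cryptsetup luksOpen /dev/sdd annex-old
$ sudo mount /dev/mapper/annex-old $bad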

Next I create a new clone of the repository on the new drive. Most recently I did this for my video repo, which lives at ~/library/video.

$ git clone ~/library/video $good/video

The .git/config file from the old drive will have the UUID of the annex and other configuration options, as well as any knowledge about other remotes. I copy that into the new repo.

$ cp $bad/video/.git/config $good/video/.git/config

The actual file contents are stored in the .git/annex/objects/ directory. I copy those over to the new drive.

$ mkdir $good/video/.git/annex
$ rsync -avhP --no-compress --info=progress2 $bad/video/.git/annex/objects $good/video/.git/annex/

Next I initialize the new annex. It will recognize the old config and existing objects that were copied over.

$ cd $good/video
$ git annex init

At this point I could be done. But if I suspect that there was corruption in one of the files in the .git/annex/objects directory that I copied over, I will next tell the annex to run a check on all its files. I’ll usually start this with --incremental in case I want to kill it before it completes and resume it later. I’ll provide some integer to --jobs depending on how many cores I want to devote to hashing and what I think is appropriate for the disk read and transfer speeds.

$ git annex fsck --incremental --jobs=N

If any of the files did fail, I’ll make sure one of the other remotes is available and then tell the new annex to get whatever it wants.

$ git annex get --auto

Finally, I would want to get rid of any of those corrupt objects that are now just wasting space.

$ git annex unused
$ git annex dropunused all

Optimizing Local Munitions

As previously mentioned, I use myrepos to keep local copies of useful code repositories. While working with backups yesterday I noticed that this directory had gotten quite large. I realized that in the 8 years that I’ve been using this system, I’ve never once run git gc in any of the repos.

Fortunately this is the sort of thing that myrepos makes simple – even providing it as an example on its homepage. I added two new lines to the [DEFAULT] section of my ~/library/src/myrepos.conf file: one telling it that it can run 3 parallel jobs, and one teaching it how to run git gc.

[DEFAULT]
skip = [ "$1" = update ] && ! hours_since "$1" 24
jobs = 3
git_gc = git gc "$@"

That allowed me to use my existing lmr alias to clean up all the git repositories. The software knows which repositories are git, and only attempts to run the command in those.

$ lmr gc
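
For context, lmr is just an alias pointing mr at this directory and its config file; roughly something like:

alias lmr='mr -d ~/library/src -c ~/library/src/myrepos.conf'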

After completing this process – which burned through a lot of CPU – my ~/library/src directory dropped from 70 GB to 15 GB.

So that helped.

Wherein the Author Learns to Compact Borg Archives

I noticed that my Borg directory on The Cloud was 239 GB. This struck me as problematic, as I could see in my local logs that Borg itself reported the deduplicated size of all archives to be 86 GB.

A web search revealed borg compact, which apparently I have been meant to run manually since 2019. Oops. After compacting, the directory dropped from 239 GB to 81 GB.
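
Run by hand (with BORG_REPO and BORG_REMOTE_PATH exported as in the script below), the command is simply:

$ borg compact --verbose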

My borg wrapper script now looks like this:

#!/bin/sh
# Pull in Borg-related secrets (passphrase, etc.) from my keys directory.
. ~/.keys/borg.sh
export BORG_REPO='borg-rsync:borg/nous'
export BORG_REMOTE_PATH='borg1'

# Create backups
echo "Creating backups..."
borg create --verbose --stats --compression=lz4             \
    --exclude ~/projects/foo/bar/baz                        \
    --exclude ~/projects/xyz/bigfatbinaries                 \
    ::'{hostname}-{user}-{utcnow:%Y-%m-%dT%H:%M:%S}'        \
    ~/documents                                             \
    ~/projects                                              \
    ~/mail                                                  \
    # ...etc

# Prune
echo "Pruning backups..."
borg prune --verbose --list --glob-archives '{hostname}-{user}-*'   \
    --keep-within=1d                                                \
    --keep-daily=14                                                 \
    --keep-weekly=8                                                 \
    --keep-monthly=12                                               \

# Compact
echo "Compacting repository..."
backitup                                \
    -p 604800                           \
    -l ~/.borg_compact-repo.lastrun     \
    -b "borg compact --verbose"         \

# Check
echo "Checking repository..."
backitup -a                                                         \
    -p 172800                                                       \
    -l ~/.borg_check-repo.lastrun                                   \
    -b "borg check --verbose --repository-only --max-duration=1200" \

echo "Checking archives..."
backitup -a                                             \
    -p 259200                                           \
    -l ~/.borg_check-arch.lastrun                       \
    -b "borg check --verbose --archives-only --last 18" \

Other than the addition of a weekly compact, my setup is the same as it ever was.

Working with ACSM Files on Linux

I acquire books from various OverDrive instances. OverDrive provides an ACSM file, which is not a book, but instead an XML ticket meant to be exchanged for the actual book file – similar to requesting a book in meatspace by turning in a catalog card to a librarian. Adobe Digital Editions is used to perform this exchange. As one would expect from Adobe, this software does not support Linux.

Back in 2013 I set up a Windows 7 virtual machine with Adobe Digital Editions v2.0.1.78765, which I used exclusively for turning ACSM files into EPUB files. A few months ago I was finally able to retire that VM thanks to the discovery of libgourou, which is both a library and a suite of utilities for working with ACSM files.

To use it, I first register an anonymous account with Adobe.

$ adept_activate -a

Next I export the private key that the files will be encrypted to.

$ acsmdownloader --export-private-key

This key can then be imported into the DeDRM_tools plugin of Calibre.

Whenever I receive an ACSM file, I can just pass it to the acsmdownloader utility from libgourou.

$ acsmdownloader -f foobar.acsm

This spits out the EPUB, which may be imported into my standard Calibre library.

The Things I Do for Time

I am a believer in the sacred word as defined in ISO 8601, and the later revelations such as RFC 3339. Numerical dates should be formatted as YYYY-MM-DD. Hours should be written in 24-hour time. I will die on this hill.

Since time immemorial, this has been accomplished on Linux systems by setting LC_TIME to the en_DK locale. More specifically, the git history for glibc shows that en_DK was added (with ISO 8601 date formatting) by Ulrich Drepper on 1997-03-05.
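
In practice that meant a line like the following in /etc/locale.conf (or exported from a shell profile), assuming the en_DK.UTF-8 locale has been generated:

LC_TIME=en_DK.UTF-8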

A few years ago, this stopped working in Firefox. Instead Firefox started to think that numerical dates were supposed to be formatted as DD/MM/YYYY, which is at least as asinine as the typical American MM-DD-YYYY format. I finally got fed up with this and decided to investigate.

The best discussion of the issue is in Thunderbird bug 1426907. Here I learned that the problem is caused by Thunderbird (and by extension Firefox) no longer respecting glibc locales. Mozilla software simply takes the name of the system locale, ignores its definition, and looks up formatting in the Unicode CLDR. The CLDR has redefined en_DK to use DD/MM/YYYY [1].

The hack to address the problem was also documented in the Thunderbird bug report. The CLDR includes a definition for en_SE which uses YYYY-MM-DD [2] and 24-hour time. (It also separates the time from the date with a comma, which is weird, but Sweden is weird, so I’ll allow it.) There is no en_SE locale in glibc. But it can be created by linking to the en_DK locale. This new locale can then be used for LC_TIME.

$ sudo ln -s /usr/share/i18n/locales/en_DK /usr/share/i18n/locales/en_SE
$ echo 'en_SE.UTF-8 UTF-8' | sudo tee -a /etc/locale.gen
$ sudo locale-gen
$ sudo sed -i 's/^LC_TIME=.*/LC_TIME=en_SE.UTF-8/' /etc/locale.conf

Now anything that respects glibc locales will effectively use en_DK, albeit under a different name. Anything that uses CLDR will just see that it is supposed to use a locale named en_SE, which still results in sane formatting. Thus one can use HTML date input fields without going crazy.
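
To confirm that the glibc side took effect after logging back in, one can ask date for the locale’s preferred date representation, which should now come out in ISO 8601 order:

$ LC_TIME=en_SE.UTF-8 date +%x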

Notes

  1. The Unicode specification defines this pattern as "dd/MM/y", which is rather unintuitive, but worth including here for search engines.
  2. The Unicode specification defines this pattern as "y-MM-dd".

Redswitch

Redshift is a program that adjusts the color temperature of the screen based on time and location. It can automatically fetch one’s location via GeoClue. I’ve used it for years. It works most of the time. But, more often than I’d like, it fails to fetch my location from GeoClue. When this happens, I find GeoClue impossible to debug. Redshift does not cache location information, so when it fails to fetch my location the result is an eye-meltingly bright screen at night. To address this, I wrote a small shell script to avoid GeoClue entirely.

Redswitch fetches the current location via the Mozilla Location Service (using GeoClue’s API key, which may go away). The result is stored and compared against the previous location to determine if the device has moved. If a change in location is detected, Redshift is killed and relaunched with the new location (this will result in a noticeable flash, but there seems to be no alternative since Redshift cannot reload its settings while running). If Redshift is not running, it is launched. If no change in location is detected and Redshift is already running, nothing happens. Because the location information is stored, this can safely be used to launch Redshift when the machine is offline (or when the Mozilla Location Service API is down or rate-limited).
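
Redswitch itself is only a handful of lines of shell. A stripped-down sketch of the idea (not the script verbatim; the cache location, the jq parsing, and the redshift invocation are my own stand-ins) looks something like this:

#!/bin/sh
# Fetch coordinates from the Mozilla Location Service, fall back to the
# cached pair when offline, and only restart redshift when something changed.
cache="${XDG_CACHE_HOME:-$HOME/.cache}/redswitch.location"

# "geoclue" is the API key GeoClue itself uses; it may stop working someday.
new=$(curl -sf 'https://location.services.mozilla.com/v1/geolocate?key=geoclue' |
    jq -r '"\(.location.lat):\(.location.lng)"')
old=$(cat "$cache" 2>/dev/null)

# Offline, rate-limited, or otherwise empty-handed: reuse the last known location.
[ -n "$new" ] || new="$old"
[ -n "$new" ] || exit 1

if [ "$new" != "$old" ] || ! pgrep -x redshift > /dev/null; then
    echo "$new" > "$cache"
    pkill -x redshift
    redshift -l "$new" &
fi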

My laptop does not experience frequent, drastic changes in location. I find that having the script automatically execute once upon login is adequate for my needs. If you’re jetting around the world, you could periodically execute the script via cron or a systemd timer.
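
A crontab entry for that would be as simple as something like the following, with the path to the script being whatever you’ve chosen:

*/30 * * * * $HOME/bin/redswitch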

This solves all my problems with Redshift. I can go back to forgetting about its existence, which is my goal for software of this sort.

Searching Books

ripgrep-all is a small wrapper around ripgrep that adds support for additional file formats.

I discovered it while looking for a program that would allow me to search my e-book library without needing to open individual books and search their contents via Calibre. ripgrep-all accomplishes this by using Pandoc to convert files to plain text and then running ripgrep on the output. One of the numerous formats supported by Pandoc is EPUB, which is the format I use to store books.

Running Pandoc on every book in my library to extract its text can take some time, but ripgrep-all caches the extracted text so that subsequent runs are similar in speed to simply searching plain text – which, thanks to ripgrep, is blazing fast. It takes around two seconds to search 1,706 books.

$ time(rga -li 'pandemic' ~/library/books/ | wc -l)
33

real    0m1.225s
user    0m2.458s
sys     0m1.759s