
I celebrated World Backup Day by increasing the resiliency of data in my life.

Four encrypted 2TB hard drives, stored in a Pelican 1200, with Abloy Protec2 PL 321 padlocks as tamper-evident seals. Having everything that matters stored in git-annex makes projects like this simple: just clone the repositories, define the preferred content expressions, and watch the magic happen.

Cold Storage
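
For those curious what that looks like in practice, here is a minimal sketch of bringing up one of the drives; the paths, the repository description, and the preferred content expression are illustrative rather than my actual configuration.

$ git clone ~/library /mnt/coldstorage/library
$ cd /mnt/coldstorage/library
$ git annex init "cold storage 1"
$ git annex wanted . "include=*"
$ git annex sync --content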

Antisocial Activity Tracking

A GPS track provides a useful log of physical activities. Beyond simply recording a route, the series of coordinate and time mappings allows statistics like distance, speed, elevation, and time to be calculated. I recently decided that I wanted to start recording this information, but I was not interested in any of the plethora of social, cloud-based services that are hip these days. A simple GPX track gives me all the information I care about, and I don’t have a strong desire to share them with a third party provider or a social network.

Recording Tracks

The discovery of GPSLogger is what made me excited to start this project. A simple but powerful Android application, GPSLogger will log to a number of different formats and, when a track is complete, automatically distribute it. This can be done by uploading the file to a storage provider, emailing it, or posting it to a custom URL. It always logs in metric units but optionally displays in Imperial.

What makes GPSLogger really stand out are its performance features. It allows very fine-grained control over GPS use, which allows tracks to be recorded for extended periods of time (such as days) with a negligible impact on battery usage.

For activities like running, shorter hikes, and bicycle rides, I tend to err on the side of accuracy. I set GPSLogger to log a coordinate every 10 seconds, with a minimum distance of 5 meters between points and a minimum accuracy of 10 meters. It will try to get a fix for 120 seconds before timing out, and attempt to meet the accuracy requirement for 60 seconds before giving up.

For a longer day-hike, the time between points could be increased to something in the neighborhood of 60 seconds. For a multi-day backpacking trip, a setting of 10 minutes or more would still provide enough accuracy to make for a useful record of the route. I’ve found that being able to control these settings really opens up a lot of tracking possibilities that I would otherwise not consider for fear of battery drain.

GPSLogger

Storing Tracks

After a track has been recorded, I transfer it to my computer and store it with git-annex.

Everything in my home directory that is not a temporary file is stored either in git or git-annex. By keeping my tracks in an annex rather than directly in git, I can take advantage of git-annex’s powerful metadata support. GPSLogger automatically names tracks with a time stamp, but the annex for my tracks is also configured to automatically set the year and month when adding files.

$ cd ~/tracks
$ git config annex.genmetadata true

After moving a track into the annex, I’ll tag it with a custom activity field, with values like run, hike, or bike.

$ git annex metadata --set activity=bike 20150725110839.gpx

I also find it useful to tag tracks with a gross location value so that I can get an idea of where they were recorded without loading them on a map. Counties tend to work well for this.

$ git annex metadata --set county=sanfrancisco 20150725110839.gpx

Of course, a track may span multiple counties. This is easily handled by git-annex.

$ git annex metadata --set county+=marin 20150725110839.gpx

One could also use fields to store location values such as National Park, National Forest or Wilderness Area.
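
For example, a hypothetical park field (the field name and value here are mine for illustration):

$ git annex metadata --set park=yosemite 20150725110839.gpx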

Metadata Views

The reason for storing metadata is the ability to use metadata-driven views. This allows me to alter the directory structure of the annex based on the metadata. For instance, I can tell git-annex to show me all tracks grouped by year followed by activity.

$ git annex view "year=*" "activity=*"
$ tree -d
.
└── 2015
    ├── bike
    ├── hike
    └── run

Or, I could ask to see all the runs I went on this July.

$ git annex view year=2015 month=07 activity=run

I’ve found this to be a super powerful tool. It gives me the simplicity and flexibility of storing the tracks as plain-text on the filesystem, with some of the querying possibilities of a database. Its usefulness is only limited by the metadata stored.
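
Views can also be refined in place rather than rebuilt from scratch: vfilter narrows the current view, and vpop steps back to the previous one. A brief sketch:

$ git annex vfilter activity=run
$ git annex vpop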

Viewing Tracks

For simple statistics, I’ll use the gpxinfo command provided by gpxpy. This gives me the basics of time, distance and speed, which is generally all I care about for something like a weekly run.

$ gpxinfo 20150725110839.gpx
File: 20150725110839.gpx
    Length 2D: 6.081km
    Length 3D: 6.123km
    Moving time: 00:35:05
    Stopped time: n/a
    Max speed: 3.54m/s = 12.74km/h
    Total uphill: 96.50m
    Total downhill: 130.50m
    Started: 2015-07-25 18:08:45
    Ended: 2015-07-25 18:43:50
    Points: 188
    Avg distance between points: 32.35m

    Track #0, Segment #0
        Length 2D: 6.081km
        Length 3D: 6.123km
        Moving time: 00:35:05
        Stopped time: n/a
        Max speed: 3.54m/s = 12.74km/h
        Total uphill: 96.50m
        Total downhill: 130.50m
        Started: 2015-07-25 18:08:45
        Ended: 2015-07-25 18:43:50
        Points: 188
        Avg distance between points: 32.35m

For a more detailed inspection of the tracks, I opt for Viking. This allows me to load the tracks and view the route on an OpenStreetMap map (or any number of other map layers, such as USGS quads or Bing aerial photography). It includes all the detailed statistics you could care about extracting from a GPX track, including pretty charts of elevation, distance, time, and speed.

If I want to view the track on my phone before I’ve transferred it to my computer, I’ll load it in either BackCountry Navigator or OsmAnd, depending on what kind of map layers I am interested in seeing. For simply viewing the statistics of a track on the phone, I go with GPS Visualizer (by the same author as GPSLogger).

Optical Backups of Photo Archives

I store my photos in git-annex. A full copy of the annex exists on my laptop and on an external drive. Encrypted copies of all of my photos are stored on Amazon S3 (which I pay for) and box.com (which provides 50GB for free) via git-annex special remotes. The photos are backed up to an external drive daily with the rest of my laptop hard drive via backitup.sh and cryptshot. My entire laptop hard drive is also mirrored monthly to an external drive stored off-site.
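
Setting up one of those encrypted special remotes is simple. A sketch of the S3 case, assuming AWS credentials are already in the environment; the remote name is arbitrary and the key ID is a placeholder for your own GnuPG key:

$ git annex initremote cloud-s3 type=S3 encryption=hybrid keyid=YOURKEYID
$ git annex copy --to cloud-s3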

(The majority of my photos are also on Flickr, but I don’t consider that a backup or even reliable storage.)

All of this is what I consider to be the bare minimum for any redundant data storage. Photos have special value, above the value that I assign to most other data. This value only increases with age. As such, they require an additional backup method, but due to the size of my collection I want to avoid methods that involve paying for more online storage, such as Tarsnap.

I choose optical discs as the medium for my photo backups. They have the advantage of being read-only, which makes it more difficult for accidental deletions or corruption to propagate through the backup system. DVD-Rs have a capacity of 4.7 GB and cost around $0.25 per disc. Their life expectancy varies, but 10 years seems to be a reasonable low estimate.

Preparation

I keep all of my photos in year-based directories. At the beginning of every year, the previous year’s directory is burned to a DVD.

Certain years contain few enough photos that the entire year can fit on a single DVD. More recent years have enough photos of a high enough resolution that they require multiple DVDs.

Archive

My first step is to build a compressed archive of each year. I choose tar and bzip2 compression for this because they’re simple and reliable.

$ cd ~/pictures
$ tar cjhf ~/tmp/pictures/2012.tar.bz 2012

If the archive is larger than 3.7 GB, it needs to be split into multiple files. The resulting files will be burned to different discs. The capacity of a DVD is 4.7 GB, but I place the upper file limit at 3.7 GB so that the DVD has a minimum of 20% of its capacity available. This will be filled with parity information later on for redundancy.

$ split -d -b 3700M 2012.tar.bz 2012.tar.bz.
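
Restoring later is just a matter of concatenating the pieces back together. Assuming each piece has already been decrypted (see the next section):

$ cat 2012.tar.bz.?? > 2012.tar.bz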

Encrypt

Leaving unencrypted data around is bad form. The archive (or each of the files resulting from splitting the large archive) is next encrypted and signed with GnuPG.

$ gpg -eo 2012.tar.bz.gpg 2012.tar.bz
$ gpg -bo 2012.tar.bz.gpg.sig 2012.tar.bz.gpg
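
The restore path is the mirror image of this. Verifying, decrypting, and unpacking the archive looks something like:

$ gpg --verify 2012.tar.bz.gpg.sig 2012.tar.bz.gpg
$ gpg -o 2012.tar.bz -d 2012.tar.bz.gpg
$ tar xjf 2012.tar.bz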

Imaging

The encrypted archive and the detached signature of the encrypted archive are what will be burned to the disc. (Or, in the case of a large archive, the encrypted splits of the full archive and the associated signatures will be burned to one disc per split/signature combination.) Rather than burning them directly, an image is created first.

$ mkisofs -V "Photos: 2012 1/1" -r -o 2012.iso 2012.tar.bz.gpg 2012.tar.bz.gpg.sig

If the year has a split archive requiring multiple discs, I modify the sequence number in the volume label. For example, a year requiring 3 discs will have the label Photos: 2012 1/3.
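
Following the naming from the split step above, the image for the first of those three discs might be created like so (the split filenames and output name here are my own placeholders):

$ mkisofs -V "Photos: 2012 1/3" -r -o 2012-1.iso 2012.tar.bz.00.gpg 2012.tar.bz.00.gpg.sig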

Parity

When I began this project I knew that I wanted some sort of parity information for each disc so that I could potentially recover data from slightly damaged media. My initial idea was to use parchive via par2cmdline. Further research led me to dvdisaster, which, despite being primarily a GUI program, seemed more appropriate for this use case.

Both dvdisaster and parchive use the same Reed–Solomon error correction codes. Dvdisaster is aimed at optical media and has the ability to place the error correction data on the disc by augmenting the disc image, as well as to store the data separately. It can also scan media for errors and assist in judging when the media is in danger of becoming defective. This makes it an attractive option for long-term storage.

I use dvdisaster with the RS02 error correction method, which augments the image before burning. Depending on the size of the original image, this will result in the disc having anywhere from 20% to 200% redundancy.
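
Recent builds of dvdisaster can also be driven from the command line. Assuming such a build, augmenting the image with RS02 data might look like this, with -n setting the target medium size:

$ dvdisaster -i 2012.iso -mRS02 -n dvd -c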

Verify

After the image has been augmented, I mount it and verify the signature of the encrypted file on the disc against the local copy of the signature. I’ve never had the signatures not match, but performing this step makes me feel better.

$ sudo mount -o loop 2012.iso /mnt/disc
$ gpg --verify 2012.tar.bz.gpg.sig /mnt/disc/2012.tar.bz.gpg
$ sudo umount /mnt/disc

Burn

The final step is to burn the augmented image. I always burn discs at low speeds to diminish the chance of errors during the process.

$ cdrecord -v speed=4 dev=/dev/sr0 2012.iso

Similar to the optical backups of my password database, I burn two copies of each disc. One copy is stored off-site. This provides a reasonable level of assurance against any loss of my photos.

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

Terence Eden points out that censorship becomes more difficult as flash memory devices become smaller and gain greater capacity. Case in point: director Jafar Panahi smuggled This Is Not a Film out of Iran on a flash drive hidden in a cake. For me, the practicality of the sneakernet was revitalized after I began using git-annex earlier this year.