pig-monkey.com

Archiving Bookmarks

I signed-up for Pinboard in 2014. It provides everything I need from a bookmarking service, which is mostly, you know, bookmarking. I pay for the archival account, meaning that Pinboard downloads a copy of everything I bookmark and provides me with full-text search. I find this useful and well worth the $25 yearly fee, but Pinboard’s archive is only part of the solution. I also need an offline copy of my bookmarks.

Pinboard provides an API that makes it easy to acquire a list of bookmarks. I have a small shell script which pulls down a JSON-formatted list of my bookmarks and adds the file to git-annex. This is controlled via a systemd service and timer, which wraps the script in backitup to ensure daily dumps. The systemd timer itself is controlled by nmtrust, so that it only runs when I am connected to a trusted network.

This provides data portability, ensuring that I could import my tagged URLs to another bookmarking service if I ever found something better than Pinboard (unlikely, competing with Pinboard is futile). But I also want a locally archived copy of the pages themselves, which Pinboard does not offer through the API. I carry very much about being able to work offline. The usefulness of a computer is directly propertional to the amount of data that is accessible without a network connection.

To address this I use bookmark-archiver, a Python script which reads URLs from a variety of input files, including Pinboard’s JSON dumps. It archives each URL via wget, generates a screenshot and PDF via headless Chromium, and submits the URL to the Internet Archive (with WARC hopefully on the way). It will then generate an HTML index page, allowing the archives to be easily browsed. When I want to browse the archive, I simply change into the directory and use python -m http.server to serve the bookmarks at localhost:8000. Once downloaded locally, the archives are of course backed up, via the usual suspects like borg and cryptshot.

The archiver is configured via environment variables. I configure my preferences and point the program at the Pinboard JSON dump in my annex via a shell script (creatively also named bookmark-archiver). This wrapper script is called by the previous script which dumps the JSON from Pinboard.

The result of all of this is that every day I get a fresh dump of all my bookmarks, each URL is archived locally in multiple formats, and the archive enters into my normal backup queue. Link rot may defeat the Supreme Court, but between this and my automated repository tracking I have a pretty good system for backing up useful pieces of other people’s data.