Monitoring Legible News
I was sent a link to Legible News last November by someone who had read my post on the now-defunct Breaking News. Legible News is a website that simply scrapes headlines from Wikipedia’s Current Events once per day and presents them in a legible format. This seems like a simple thing, but is far beyond the capabilities of most news organizations today.
Legible News provides no update notification mechanism. I addressed this by plugging it into my urlwatch system. Initially this presented two problems: the email notification included the HTML markup, which I didn’t care about, and it included both the old and new content of every changed line – effectively sending me the news from today and yesterday.
The first problem was easily solved by using the html2text filter provided by urlwatch. This strips out all markup, which is what I thought I wanted. I ran this for a bit before deciding that I did want the output to contain links. What I really wanted was some sort of HTML-to-Markdown conversion.
I also realized that I did not want to be sent only the new lines; I wanted every line whenever anything changed. If the news yesterday included a section titled “Armed conflicts and attacks”, and the news today included a section with the same title, I wanted that in my output despite it not having changed.
I solved both of these problems using the diff_tool argument of urlwatch. This allows the user to pass in a special tool to replace the default use of diff to generate the notification output. The tool will be called with two arguments: the filename of the previously downloaded version of the URL and the filename of the current version. I wrote a simple script called html2markdown.sh which ignores the first argument and simply passes the second argument to Pandoc for formatting.
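A minimal sketch of what such a script might look like follows. It is a reconstruction under assumptions rather than the original script: the function name render_current is hypothetical, and the Pandoc options shown (HTML in, Markdown out) are simply the most direct way to do the conversion described above.

```shell
#!/bin/sh
# html2markdown.sh: sketch of a replacement for urlwatch's default diff.
# urlwatch invokes the tool as: html2markdown.sh OLD_FILE NEW_FILE

# render_current is a hypothetical helper name, not part of urlwatch.
render_current() {
    # $1 is the previously downloaded version; it is deliberately ignored.
    # $2 is the current version, which is converted in full to Markdown.
    pandoc --from=html --to=markdown "$2"
}

# Only run the conversion when both filenames were supplied.
if [ "$#" -ge 2 ]; then
    render_current "$@"
fi
```

Because the old file is never read, the output is always the complete current page, not only the lines that changed. Links survive because Pandoc converts them to Markdown links instead of stripping them.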
This script is used as the diff_tool in the urlwatch job definition.
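A job entry along these lines would wire the script in. This is a sketch, not the original definition: the URL is assumed to be the Legible News front page, and the script path is a placeholder to adjust for wherever html2markdown.sh actually lives.

```yaml
# Sketch of a urlwatch job; the URL and the path are assumptions.
name: "Legible News"
url: "https://legiblenews.com/"
# Run this tool instead of diff; urlwatch appends the old and new filenames.
diff_tool: "/path/to/html2markdown.sh"
```

No html2text filter is needed here, since the script receives the raw HTML and Pandoc handles the conversion.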
The result is the latest version of Legible News, nicely converted to Markdown, delivered to my inbox every day. The output would be even better if Legible News used semantic markup – specifically heading elements – but it is perfectly serviceable as is.
After I built this I discovered that somebody had created an RSS feed for Legible News using a service called Feed43.