manningtree

Fixing Crossoak

Category: TheShed
#blog #SysAdmin

C r o s s o a k is a photo blog that goes back to Lost Something in Cromer in May 2005. It's really a photo journal. Or a log of things illustrated by photos that's available on the web, a web log. It's been through a couple of iterations since starting out on Blogger with snaps from a Sony DSC-V1 processed in Picasa.

Sony DSC-V1: my first digital camera

For the longest time the core workflow was:

  1. Take photo
  2. Import to Adobe Lightroom
  3. Tweak photo
  4. Upload to Flickr
  5. Draft new post in WordPress
  6. Publish

That had a couple of downsides. First, it's quite manual. Second, it's hard to do when travelling light. This meant that posts for Crossoak tended to batch up, waiting some time for me to publish them.

The best camera is the one you have with you, in this case from an iPhone.

There's an adage that the best camera is the one you have with you. Around 2010 (for many reasons, not all of them photography related), the camera I had with me was often the one glued into the back of a mobile phone. That was okay for uploading pictures, since there was an embarrassment of riches for syncing photos from phones, but publishing and sharing in something like a blog post was still challenging. In the real, non-geek world that's why something like Instagram happens. Someone, somewhere, figures out how to solve a pain point that it turns out lots of other people also have. Turns out that included me too. So I had another workflow that went:

  1. Take photo
  2. Share on Instagram

But now I had posts on Crossoak and Instagram (/sadface) and I didn't really want to manually republish posts that were already on Instagram to Crossoak, because that makes even more work.

Enter IFTTT. IFTTT is a web service that lets you create recipes that combine actions from other web services. With IFTTT the Instagram workflow becomes:

  1. Take photo
  2. Share on Instagram
    1. Automatically!!!
    2. Check if Instagram post has the #blog tag, if it does then...
    3. Publish the Instagram post to Crossoak too

This worked really well, so well that the majority of the Crossoak posts over the last 12 months have been via Instagram.

That was until stuff started to break.

The problem was that posts published by IFTTT used Instagram links that changed, resulting in large parts of Crossoak experiencing broken image syndrome. Not a good look when you're a photo blog. Especially not when any text you include is frequently so cryptic as to cause confusion a) even when the image isn't broken and b) even to folks featured in the photo (who, presumably, have a clue what's going on).

Fortunately, there was a straightforward fix. When creating the IFTTT recipe to post from Instagram, I also created one to upload the same image to Flickr. This meant I had copies of the broken images (or all except one) on Flickr. Fixing was possible, but that was a lot of links. I was looking at all the time saved over the years by my clever hack to the publication workflow being eaten up by the cost of fixing. Douglas Coupland smiles.

Irony: actually fixing bit rot while supported by Coupland's book.

Fixing Bit Rot

Programmatically, an automated fix was relatively trivial. Iterate through the posts on Crossoak; identify posts published from Instagram; search Flickr for the corresponding photo; update the Crossoak post, replacing the Instagram image link with the corresponding link to Flickr. Simples.

First, iterate through posts using the python-wordpress-xmlrpc library:

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import GetPosts
endpoint = blog_url + '/xmlrpc.php'
wp = Client(endpoint, auth_user, auth_password)

offset = 0
increment = 20
while True:
    posts = wp.call(GetPosts({'number': increment, 'offset': offset}))
    if len(posts) == 0:
        break  # no more posts returned
    for post in posts:
        update_if_instagram(post)
    offset = offset + increment
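
The offset/increment loop above can also be wrapped in a generator, which keeps the paging logic separate from whatever gets done to each post. This is only a sketch: `iter_posts` is my own name, and `fetch_page` is a hypothetical callable standing in for the `wp.call(GetPosts(...))` request, so neither is part of python-wordpress-xmlrpc.

```python
def iter_posts(fetch_page, page_size=20):
    """Yield posts one at a time, requesting page_size posts per call.

    fetch_page is a hypothetical callable taking (number, offset) and
    returning a list; an empty list signals there are no more posts.
    """
    offset = 0
    while True:
        posts = fetch_page(page_size, offset)
        if not posts:
            break  # no more posts returned
        yield from posts
        offset += page_size


# Stand-in for the blog: 45 made-up post titles served in pages.
fake_blog = [f"post {n}" for n in range(45)]


def fake_fetch(number, offset):
    return fake_blog[offset:offset + number]


all_posts = list(iter_posts(fake_fetch))
```

With the real client, `fetch_page` could plausibly be a small lambda around `wp.call(GetPosts({'number': number, 'offset': offset}))`.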

To identify Instagram posts I considered looking for the Instagram tag (which the IFTTT recipe created) but instead I opted for searching the <img> tag src attribute for the magic text with Beautiful Soup:

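from bs4 import BeautifulSoup

magic_text = 'instagram'
content = post.content
soup = BeautifulSoup(content, 'lxml')
for img in soup.find_all('img'):
    img_src = img.get('src', '')  # some <img> tags may have no src
    if magic_text in img_src:
        pass  # this image is hosted on Instagram and needs a new src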
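
For trying the same check without installing anything, here is a rough equivalent that swaps Beautiful Soup for the standard library's `html.parser`. It is not what the script uses; the class and function names are mine, and the sample URLs are made up for illustration.

```python
from html.parser import HTMLParser


class InstagramImageFinder(HTMLParser):
    """Collect the src of every <img> whose URL contains the magic text."""

    def __init__(self, magic_text="instagram"):
        super().__init__()
        self.magic_text = magic_text
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; value may be None
        src = dict(attrs).get("src") or ""
        if self.magic_text in src:
            self.matches.append(src)


def instagram_image_srcs(html):
    finder = InstagramImageFinder()
    finder.feed(html)
    return finder.matches


# Made-up post body with one Instagram-hosted image and one hosted elsewhere.
sample = (
    '<p><img src="https://scontent.cdninstagram.com/example_n.jpg" alt="a">'
    '<img src="https://example.com/photo.jpg" alt="b"></p>'
)
found = instagram_image_srcs(sample)
```

This only finds the offending images. Rewriting the `src` in place is where Beautiful Soup earns its keep, because the parsed tree can be edited and serialised back to HTML.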

The tricky bit was finding the corresponding Flickr photos. Flickr has a lovely API (here's the API explorer for search) which the python-flickr-api library nicely wraps, so I can search with something like:

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    tags='instagram', 
    text=post.title)

There were two snags, however. First, the text attribute is a fuzzy search, and my Instagram-generated post titles are far from unique. This was mitigated by scoping the search to +/- a day of the WordPress post:

from datetime import timedelta

dmin = post.date - timedelta(days=1)
dmax = post.date + timedelta(days=1)

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    tags='instagram', 
    text=post.title, 
    min_upload_date=dmin.timestamp(), 
    max_upload_date=dmax.timestamp())
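
The date window is worth isolating, because `datetime.timestamp()` treats a naive datetime as local time. The sketch below pins a made-up post date to UTC so the result is the same on any machine. Whether `post.date` comes back naive or timezone-aware is something I'm assuming you would check for your own setup, and `upload_window` is just an illustrative helper name.

```python
from datetime import datetime, timedelta, timezone


def upload_window(post_date, days=1):
    """Return (min, max) Unix timestamps spanning +/- days around post_date."""
    dmin = post_date - timedelta(days=days)
    dmax = post_date + timedelta(days=days)
    return int(dmin.timestamp()), int(dmax.timestamp())


# Made-up post date, pinned to UTC so the result does not depend on the
# local timezone of the machine running the script.
post_date = datetime(2018, 7, 28, 15, 49, 56, tzinfo=timezone.utc)
window_start, window_end = upload_window(post_date)
```

The two values can then be passed as `min_upload_date` and `max_upload_date` in the search above.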

But a second problem was that Flickr wasn't returning everything I thought it should. In many cases I could manually browse to the right image, but the API wasn't returning it based on the text search. So I flipped the search logic and used the Flickr API to return all photos in the right time range and then let Python's string search find the match:

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    min_upload_date=dmin.timestamp(), 
    max_upload_date=dmax.timestamp())
for candidate_photo in flickr_photos:
    if candidate_photo.title.find(post.title) == 0: 
        doSomething()
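
The `find(...) == 0` test above is a prefix match, so the same idea can be written as a small function that returns the first candidate whose title starts with the post title. A sketch only: `find_flickr_match` is my name for it, and the candidates are stand-in objects with nothing but an `id` and a `title`, not real Flickr photo objects.

```python
from types import SimpleNamespace


def find_flickr_match(post_title, candidates):
    """Return the first candidate whose title starts with post_title, else None."""
    if not post_title:
        return None  # an empty title would otherwise match every candidate
    for candidate in candidates:
        title = candidate.title or ""
        if title.startswith(post_title):
            return candidate
    return None


# Stand-ins for Flickr photo objects; only a .title attribute is assumed.
candidates = [
    SimpleNamespace(id="1", title="Wasdale"),
    SimpleNamespace(id="2", title="Lost Something in Cromer #blog"),
]
match = find_flickr_match("Lost Something in Cromer", candidates)
missing = find_flickr_match("No such title", candidates)
```

Returning `None` for a miss makes it easy to collect the posts that still need attention after the first pass.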

All that was left was to call the Flickr getSizes API for the photo URL and update the WordPress post with the corrected src attribute:

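from wordpress_xmlrpc.methods.posts import EditPost

# Ask Flickr for the available sizes of the matched photo
photo_sizes = candidate_photo.getSizes()
# Point the <img> at the original-size image hosted on Flickr
img['src'] = photo_sizes['Original']['source']
newcontent = str(soup)
# Update post
post.content = newcontent
wp.call(EditPost(post.id, post))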

Crossoak fixed.

Update: 2019-06-22

Alright, it wasn't completely fixed. Earlier, someone said:

When creating the IFTTT recipe to post from Instagram, I also created one to upload the same image to Flickr. This meant I had copies of the broken images (or all except one) on Flickr.

Me.

That wasn't 100% accurate. There was also a set of Instagram photos that had been published to c r o s s o a k that didn't have corresponding images on Flickr. These were from a period between setting up posting to c r o s s o a k and adding a recipe to also archive to Flickr. Fortunately there's a solution.

Instagram has a feature to export your data. This includes copies of all the pictures you've posted on the service. For me, the data export was 190 MB of JSON data and JPEG photo files. Here's an example from media.json of an Instagram post I'd shared:

{
    "caption": "Settling down to watch Mama Mia 2, filmed on Vis, in the open air cinema in Komiža, Vis.\n\nFireworks scene in the film made all the more impressive by the shooting star overhead.",
    "taken_at": "2018-07-28T15:49:56",
    "path": "photos/201807/3b586a89bed412e9b2fc19383d130bb8.jpg"
},

where path is the relative location in the data export to a copy of the photo posted. In this case, the image is also on Flickr here:

Settling down to watch Mama Mia 2, filmed on Vis, in the open air cinema in Komiža, Vis. Fireworks scene in the film made all the more impressive by the shooting star overhead.

That's enough info to modify my fix_crossoak.py script to try and find the local copy in the Instagram export for photos that couldn't be matched to an existing photo on Flickr.

Here I load the media.json file and look for a match between the `caption` field and the title of the post with the missing photo. If there's a match, the script uploads the image to Flickr.

import json
import flickr_api

with open(instagram_data_path, encoding='utf-8') as json_file:
    data = json.load(json_file)

# loop through all the posts in the Instagram export...
for photo in data['photos']:
    if photo['caption'].startswith(post_title_to_match):
        image_path = instagram_photos_root + photo['path']
        # upload the local copy so the post can point at Flickr instead
        response = flickr_api.upload(photo_file=image_path, title=post_title_to_match)
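
The lookup can be tried on its own, without touching Flickr, by separating it from the upload. The sketch below assumes the export has the shape shown earlier (a top-level `photos` list with `caption` and `path` fields). `find_export_photo` and the `instagram_export` folder name are mine, and the caption is shortened for the example.

```python
import json
from pathlib import Path


def find_export_photo(media, post_title, photos_root):
    """Return the local path of the first export photo whose caption
    starts with post_title, or None when nothing matches."""
    if not post_title:
        return None  # an empty title would otherwise match every caption
    for photo in media.get("photos", []):
        caption = photo.get("caption") or ""
        if caption.startswith(post_title):
            return Path(photos_root) / photo["path"]
    return None


# A cut-down media.json in the shape of the export (caption shortened).
media = json.loads("""
{
  "photos": [
    {"caption": "Settling down to watch Mama Mia 2",
     "taken_at": "2018-07-28T15:49:56",
     "path": "photos/201807/3b586a89bed412e9b2fc19383d130bb8.jpg"}
  ]
}
""")
local_copy = find_export_photo(media, "Settling down", "instagram_export")
```

Building the path with `pathlib` avoids depending on whether the export root ends with a slash, which plain string concatenation is sensitive to.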

With a copy of the image on Flickr, the script can then update the c r o s s o a k post to fix the broken image link.

C r o s s o a k fixed. Again.

Afterword

This post includes an image of Douglas Coupland's Bit Rot serving as a leg for a standing desk-like contraption. I thought it ironic to be writing a post describing how to fix the effects of bit rot supported in part by a book with that title. When updating the post I noticed that the image was missing. I wondered why. On investigating, I saw something like the following Markdown used to insert that picture: ![Bit Rot](https://scontent.cdninstagram.com/vp/0d0f0175d05d369267152565126823936_n.jpg)

Yes. That's a reference to an image hosted by Instagram and yes, like the whole point of this post, that was broken. So I fixed that too. Meta-irony.
