Analysing Crossoak

Sun 04 November 2018 ⊕ Category: misc

A friend had a theory that photography, and therefore by extension c r o s s o a k, is a window into my mental wellbeing. So I did some digging, which I'm capturing here (and which is in all probability a much bigger insight into my head...). If you have a similar theory, you might think the conclusion tells you quite a lot. If you don't, it won't.

Getting the data

First off, let's grab some data from Wordpress, the blog platform that currently powers c r o s s o a k.

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import GetPosts
from bs4 import BeautifulSoup
import json

# BLOG_URL, USERNAME and PASSWORD are placeholders for your own site and login
wp = Client(BLOG_URL + '/xmlrpc.php', USERNAME, PASSWORD)

This gives us a client, wp, which we can use to query Wordpress. For example, this code:

post_data = {}
# get pages in batches of 20
offset = 0
increment = 20
while True:
    posts = wp.call(
        GetPosts({'number': increment, 'offset': offset})
    )
    if len(posts) == 0:
        break  # no more posts returned
    for post in posts:
        post_data[post.id] = {
            "title": post.title,
            "date": post.date.strftime("%Y%m%d"),
            "img_count": count_images(post),
        }
    offset = offset + increment

will get some basic data on all the posts in the blog. In the above, count_images uses BeautifulSoup to parse the post content and count img tags, like this:

def count_images(post):
    soup = BeautifulSoup(post.content, 'lxml')
    imgs = soup.findAll('img')
    return len(imgs)
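For a quick check of the counting logic on a snippet of HTML, without needing Wordpress or lxml, the same idea can be sketched with the standard library's html.parser instead of BeautifulSoup. The names ImgCounter and count_img_tags are made up for this illustration and aren't part of the original script.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> tags seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this
        if tag == 'img':
            self.count += 1


def count_img_tags(html):
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


sample = '<p>Wasdale</p><img src="a.jpg"><p><IMG src="b.jpg" /></p>'
print(count_img_tags(sample))
```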

So you don't have to keep querying the site, you can save the results to a local file, for example as JSON:

with open('wp_data.json', 'w') as outfile:
    json.dump(post_data, outfile)

which looks something like this:

{
    "4444": {
        "title": "Wandering Around Wasdale",
        "date": "20181027",
        "img_count": 5
    }
}

where 4444 is the post id which you can use to reference the post on the blog with https://www.aiddy.com/crossoak?p=4444.
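As a small sketch of using the saved data later: the snippet below parses the JSON and turns each post id into a link, using the ?p= URL pattern mentioned above. The helper name post_url is hypothetical, and the JSON is inlined as a string so the example runs without wp_data.json on disk.

```python
import json

BASE_URL = 'https://www.aiddy.com/crossoak'  # blog address from the text above


def post_url(post_id):
    """Build the link for a post id (hypothetical helper)."""
    return '{}?p={}'.format(BASE_URL, post_id)


# same shape as wp_data.json; note the ids come back as strings
raw = '{"4444": {"title": "Wandering Around Wasdale", "date": "20181027", "img_count": 5}}'
post_data = json.loads(raw)

for post_id, info in post_data.items():
    print(info['title'], post_url(post_id))
```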

Exploring the data

To analyse the content we'll use Pandas and Seaborn.

import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# load JSON file with our Wordpress data
with open('wp_data.json') as json_data:
    d = json.load(json_data)

# convert to a DataFrame
df = pd.DataFrame(d)
df = df.transpose()
df.reset_index(inplace=True)
df.rename(columns={'index': 'id'}, inplace=True)

We can then use this to visualize the distribution of the number of images per post.

sns.distplot(df.img_count, kde_kws={"bw": .5}, bins=15)
# use matplotlib's pyplot directly; the sns.plt alias is gone from newer seaborn releases
plt.xlim(0, None)
plt.xlabel('Number of images per post')

To delve a bit deeper, we'll augment the data, breaking out the date into year, month, day and weekday, and counting words and characters in the titles...

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.weekday
df['title_words'] = df['title'].map(lambda x: len(str(x).split()))
df['title_chars'] = df['title'].map(lambda x: len(x))
df['title_wordlength'] = df['title_chars'] - df['title_words']
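To make the derived columns concrete, here is a pandas-free sketch that computes the same fields for a single post using only datetime. The function name augment is an assumption for illustration; the sample is the "Last game" post from 28 October 2009. Note that weekday counts from Monday = 0, the same convention as pandas' dt.weekday, and that title_wordlength is simply characters minus words, as in the code above.

```python
from datetime import datetime


def augment(title, date_str):
    """Derive the date parts and title statistics for one post."""
    date = datetime.strptime(date_str, '%Y%m%d')
    words = len(title.split())
    chars = len(title)
    return {
        'year': date.year,
        'month': date.month,
        'day': date.day,
        'weekday': date.weekday(),  # Monday is 0, Sunday is 6
        'title_words': words,
        'title_chars': chars,
        'title_wordlength': chars - words,
    }


row = augment('Last game', '20091028')
print(row)
```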

...and calculate summaries using groupby for counts of posts and the total number of images:

# get summarized views
df_imgs = df.groupby([df.year, df.month]).img_count.sum()
df_posts = df.groupby([df.year, df.month]).id.count()
df_imgs = pd.DataFrame(df_imgs)
df_imgs.reset_index(inplace=True)
df_posts = pd.DataFrame(df_posts)
df_posts.reset_index(inplace=True)

df_summary = pd.merge(df_imgs, df_posts, on=['month', 'year'])
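To show what the groupby and merge are doing, here is a rough equivalent using only the standard library: one pass over the posts, keyed by (year, month), counting posts and summing images. The summarise function and the three sample posts are made up for illustration; only the field names follow the JSON above.

```python
from collections import defaultdict


def summarise(post_data):
    """Return {(year, month): {'posts': n, 'img_count': total}}."""
    summary = defaultdict(lambda: {'posts': 0, 'img_count': 0})
    for post in post_data.values():
        year = int(post['date'][:4])    # dates are stored as YYYYMMDD strings
        month = int(post['date'][4:6])
        summary[(year, month)]['posts'] += 1
        summary[(year, month)]['img_count'] += post['img_count']
    return dict(summary)


# hypothetical sample data in the same shape as wp_data.json
sample = {
    '1': {'title': 'a', 'date': '20181027', 'img_count': 5},
    '2': {'title': 'b', 'date': '20181003', 'img_count': 1},
    '3': {'title': 'c', 'date': '20180915', 'img_count': 2},
}
for key, totals in sorted(summarise(sample).items()):
    print(key, totals)
```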

Comparing the DataFrames for the original data imported from JSON in df:

	id	date	img_count	title	year	month	day	weekday	title_chars	title_words	title_wordlength
0	100	2009-10-28	1	Last game	2009	10	28	2	9	2	7
1	1000	2005-10-23	1	Alice's Jugs	2005	10	23	6	12	2	10
2	1001	2005-10-23	1	Padstow Sunrise	2005	10	23	6	15	2	13
3	1002	2005-10-23	1	Rock at Night	2005	10	23	6	13	3	10
4	1003	2005-10-23	1	Padstow Sunset	2005	10	23	6	14	2	12

and the view in df_summary:

	year	month	img_count	id
0	2005	5	50	50
1	2005	6	17	17
2	2005	7	8	8
3	2005	8	21	21
4	2005	9	7	7

Now we can dig into: posts & images per year...

# note: by default barplot shows the mean of the monthly values for each year
sns.barplot(x=df_summary['year'], y=df_summary['id'])
sns.barplot(x=df_summary['year'], y=df_summary['img_count'])

[Figures: Posts per year; Images per year]

And per month:

[Figures: Posts per month; Images per month]

And per weekday:

[Figures: Posts per weekday; Images per weekday]

Using the summary data we can also explore correlations between fields with a pair plot:

[Figure: Summary pairplot]

or look at the correlation between the number of posts and the number of images per month:

jp = sns.jointplot(x=df_summary['img_count'], y=df_summary['id'], kind='reg')
jp.set_axis_labels('total images in a month', 'number of posts in a month')
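The regression line in the joint plot can be complemented with a single number, the Pearson correlation coefficient. A minimal sketch, assuming nothing beyond the standard library (the pearson function is written out here rather than taken from any package), run on the five df_summary rows shown earlier:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# img_count and id (post count) for May to September 2005
img_count = [50, 17, 8, 21, 7]
posts = [50, 17, 8, 21, 7]
print(round(pearson(img_count, posts), 3))
```

In those early months the two counts are identical, so the coefficient comes out at 1. The later months, where the counts diverge, are what make the plot worth looking at.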

[Figure: Images versus posts per month]

We can also play around with the words used in titles:

[Figures: Words per month; Words per year; Words per weekday]

and also see whether I used longer or shorter words in post titles over the years c r o s s o a k has been published:

[Figure: Words per year]

Conclusions

So what have we learnt from the excursion into the data lurking behind c r o s s o a k?

2012 was a low year for both posts and images. The number of posts recovered slightly in 2016-2018 but the number of images did not (more posts have only one image on average).
April is when I post most.
I post on weekends and Fridays more than midweek.
I use the fewest words in post titles in May.
Over the years I've used more words and longer words in post titles.
This is the kind of thing you do on a dark autumn evening in the northern hemisphere. Apparently it's why Scandinavia has so many tech start-ups (relative to population).

Fixing Crossoak

Sun 01 July 2018 ⊕ Category: SysAdmin
#blog

[Image: Sony DSC-V1]

C r o s s o a k is a photo blog that goes back to Lost Something in Cromer in May 2005. It's really a photo journal. Or a log of things illustrated by photos that's available on the web, a web log. It's been through a couple of iterations since starting out on Blogger with snaps from a Sony DSC-V1 processed in Picasa.

For the longest time the core workflow was:

Take photo
Import to Adobe Lightroom
Tweak photo
Upload to Flickr
Draft new post in Wordpress
Publish

That had a couple of downsides. First, it's quite manual. Second, it's hard to do when travelling light. This meant that posts for Crossoak tended to batch up, waiting some time for me to publish them.

Best Camera

There's an adage that the best camera is the one you have with you. Around 2010 (for many reasons, not all of them photography related), the camera I had with me was often the one glued into the back of a mobile phone. That was okay for uploading pictures (there was an embarrassment of riches for sync'ing photos from phones), but publishing and sharing in something like a blog post was still challenging. In the real, non-geek world, that's why something like Instagram happens. Someone, somewhere, figures out how to solve a pain point that it turns out lots of other people also have. Turns out that included me too. So I had another workflow that went:

Take photo
Share on Instagram

But now I had posts on Crossoak and Instagram (/sadface), and I didn't really want to manually republish posts that were already on Instagram to Crossoak, because that makes even more work.

Enter IFTTT. IFTTT is a web service that lets you create recipes that combine actions from other web services. With IFTTT the Instagram workflow becomes:

Take photo
Share on Instagram
Automatically!!!
    Check if the Instagram post has the #blog tag, if it does then...
    Publish the Instagram post to Crossoak too

This worked really well, so well that the majority of the Crossoak posts over the last 12 months have been via Instagram.

That was until stuff started to break.

Bit Rot

The problem was that posts published by IFTTT used Instagram links that changed, leaving large parts of Crossoak with broken image syndrome. Not a good look when you're a photo blog. Especially not when any text you include is frequently so cryptic as to confuse even those who were featured in the accompanying photographs.

Fortunately, there was a straightforward fix. When creating the IFTTT recipe to post from Instagram, I also created one to upload the same image to Flickr. This meant I had copies of the broken images (or all except one) on Flickr. Fixing was possible, but that was a lot of links. I was looking at all the time saved over the years by my clever hack to the publication workflow being eaten up by the cost of fixing. Douglas Coupland smiles.

Fixing Bit Rot

Programmatically, an automated fix was relatively trivial. Iterate through the posts on Crossoak; identify posts published from Instagram; search Flickr for the corresponding photo; update the Crossoak post, replacing the Instagram link with the corresponding link to Flickr. Simples.

First, iterate through posts using the python-wordpress-xmlrpc library:

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import GetPosts
endpoint = blog_url + '/xmlrpc.php'
wp = Client(endpoint, auth_user, auth_password)

offset = 0
increment = 20
while True:
    posts = wp.call(GetPosts({'number': increment, 'offset': offset}))
    if len(posts) == 0:
        break  # no more posts returned
    for post in posts:
        update_if_instagram(post)
    offset = offset + increment

To identify Instagram posts I considered looking for the Instagram tag (which the IFTTT recipe created) but instead I opted for searching the <img> tag src attribute for the magic text with Beautiful Soup:

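from bs4 import BeautifulSoup

magic_text = 'instagram'

def update_if_instagram(post):
    soup = BeautifulSoup(post.content, 'lxml')
    for img in soup.findAll('img'):
        img_src = img.get('src', '')
        if magic_text in img_src:
            # Instagram-hosted image found: look up its Flickr copy (next steps)
            pass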

The tricky bit was finding the corresponding Flickr photos. Flickr has a lovely API (here's the API explorer for search) which the python-flickr-api library nicely wraps, so I can search with something like:

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    tags='instagram', 
    text=post.title)

There were two snags however. First, the text attribute is a fuzzy search, and my Instagram-generated post titles are far from unique. This was mitigated by scoping the search to +/- a day of the Wordpress post:

from datetime import timedelta

dmin = post.date - timedelta(days=1)
dmax = post.date + timedelta(days=1)

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    tags='instagram', 
    text=post.title, 
    min_upload_date=dmin.timestamp(), 
    max_upload_date=dmax.timestamp())
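One detail worth spelling out: the upload-date filters are passed as Unix timestamps (seconds since the epoch), so the window is just two numbers a day either side of the post. A minimal sketch, where upload_window is a hypothetical helper and the sample date is made up; it uses a timezone-aware datetime so the result doesn't depend on the local clock:

```python
from datetime import datetime, timedelta, timezone


def upload_window(post_date, days=1):
    """Return (min, max) Unix timestamps spanning +/- `days` around post_date."""
    dmin = post_date - timedelta(days=days)
    dmax = post_date + timedelta(days=days)
    return int(dmin.timestamp()), int(dmax.timestamp())


# made-up post date, in UTC
post_date = datetime(2018, 6, 30, 12, 0, tzinfo=timezone.utc)
lo, hi = upload_window(post_date)
print(lo, hi, hi - lo)
```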

But a second problem was that Flickr wasn't returning everything I thought it should. In many cases I could manually browse to the right image, but the API wasn't returning it based on the text search. So I flipped the search logic and used the Flickr API to return all photos in the right time range and then let Python's string search find the match:

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    min_upload_date=dmin.timestamp(), 
    max_upload_date=dmax.timestamp())
for candidate_photo in flickr_photos:
    if candidate_photo.title.find(post.title) == 0: 
        doSomething()
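The matching step can be tried out without touching Flickr at all. In this sketch FakePhoto is a stand-in for the photo objects the API returns (only the title attribute is assumed), and find_match is a hypothetical name for the loop above, using startswith, which is equivalent to find(...) == 0:

```python
from collections import namedtuple

# stand-in for a Flickr photo; only .title is used here
FakePhoto = namedtuple('FakePhoto', ['title'])


def find_match(candidates, post_title):
    """Return the first candidate whose title starts with post_title, else None."""
    for photo in candidates:
        if photo.title.startswith(post_title):
            return photo
    return None


photos = [FakePhoto('Sunset'), FakePhoto('Wasdale #blog'), FakePhoto('Wasdale again')]
match = find_match(photos, 'Wasdale')
print(match.title if match else 'no match')
```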

All that was left was to call the Flickr getSizes API for the photo URL and update the Wordpress post with the corrected src attribute:

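from wordpress_xmlrpc.methods.posts import EditPost

# candidate_photo is the matching Flickr photo found by the search above
photo_sizes = candidate_photo.getSizes()
img['src'] = photo_sizes['Original']['source']
newcontent = str(soup)
# Update post
post.content = newcontent
wp.call(EditPost(post.id, post))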

Crossoak fixed.

Image viewing on a Raspberry Pi

Mon 01 January 2018 ⊕ Category: Raspberry Pi
#tools #sysadmin

I wanted a quick and simple slideshow on a Raspberry Pi:

apt-get install fbi
fbi -a -t 5 -noverbose img/*.jpg

This creates a slideshow with auto image scaling, a five second delay and no status bar, using all the .jpg images in the img subfolder.

Encoding with FFMPEG

Fri 25 December 2015 ⊕ Category: tools
#video #encoding #tools

Wow. iOS is fussy about MPEG4 encoding. Stuff that worked fine as HTML5 video sources in Chrome and Safari on a Mac failed to load in various i-devices. In the end I re-encoded using FFMPEG and:

ffmpeg.exe -i input.mov -codec:v libx264 -profile:v main -preset slow -b:v 2000k -maxrate 400k -bufsize 800k -vf scale=-1:1080 -threads 0 -codec:a libvo_aacenc -b:a 192k output.mp4

Storage Pools

Sun 26 April 2015 ⊕ Category: Tools
#Windows #Storage #Backup

If you're not restoring you're not backing up

Windows Server has this neat feature: Storage Pools. In a nutshell, it separates the logical storage from the physical devices. I use it to make two physical hard drives appear as one logical disk. Anything saved to the pool is mirrored to both disks. In theory, this means that a failure of one physical drive won't lose any data, since a copy is available on the second.

Last week I had a drive failure. It wasn't either of the drives in the storage pool. Instead the system drive (a third drive hosting the OS) had failed.

I think it took me 40 minutes to be up and running enough to validate the data was okay.

Install replacement system disk
Reinstall Windows Server 2012 R2
Reconnect two physical disks hosting the storage pool
Trawl the interwebs for details of how to reattach the storage pool

Job done (except for the reboots and updates and reboots and updates thing...).

One trick: Windows Server doesn't automatically mount a newly attached pool on reboot. Here's the PowerShell rune to change that:

Get-VirtualDisk | Where-Object {$_.IsManualAttach -eq $True} | Set-VirtualDisk -IsManualAttach $False

Youtube-dl

Sun 30 November 2014 ⊕ Category: Tools

For those occasions where you don't have the bandwidth to watch something without buffering: youtube-dl.

On a Mac:

sudo pip install youtube-dl
brew install libav
youtube-dl <url>

If bandwidth is a real pain and you just want the audio...

youtube-dl --extract-audio --audio-format mp3 <url>

To update:

sudo pip install -U youtube-dl

A comparison of sorting algorithms

Mon 17 November 2014 ⊕ Category: Programming
#Sorting #algorithms

From an internal DL. A comparison of sorting algorithms on different types of data

Sorting algorithms

Or, through the medium of Hungarian folk dancing

.Net book recommendations

Wed 12 November 2014 ⊕ Category: Library
#.net #programming #books

Top 10 books every .Net developer should own

Secure Copy

Mon 10 November 2014 ⊕ Category: Tools
#Sysadmin #linux #mac

Simple scp from remote to me:

$ scp username@remotehost:file.txt /some/local/directory

Simple scp from me to remote:

$ scp file.txt username@remotehost:remote/directory

Here's a cheat sheet

TheShed