TheShed

Analysing Crossoak

Category: misc

A friend had a theory that photography, and therefore by extension c r o s s o a k is a window into my mental well being. So I did some digging which I'm capturing here (and which is in all probability a much bigger insight into my head...). If you have a similar theory, you might think the conclusion tells you quite a lot. If you don't it won't.

Getting the data

First off, let's grab some data from Wordpress, the blog platform that currently powers c r o s s o a k.

from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import GetPosts
from bs4 import BeautifulSoup
import json

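# BLOG_URL, AUTH_USER and AUTH_PASSWORD are placeholders for your own site and credentials
wp = Client(BLOG_URL + '/xmlrpc.php', AUTH_USER, AUTH_PASSWORD)

This gives us a client, wp, that we can use to query Wordpress. For example, this code: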

post_data = {}
# get pages in batches of 20
offset = 0
increment = 20
while True:
    posts = wp.call(
        GetPosts({'number': increment, 'offset': offset})
    )
    if len(posts) == 0:
        break  # no more posts returned
    for post in posts:
        post_data[post.id] = {"title": post.title, 
           "date": post.date.strftime("%Y%m%d"), 
           "img_count": count_images(post)}
    offset = offset + increment

will get some basic data on all the posts in the blog. In the above, count_images uses BeautifulSoup to parse the post content and count img tags, like this:

def count_images(post):
    soup = BeautifulSoup(post.content, 'lxml')
    imgs = soup.findAll('img')
    return len(imgs)
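The batching loop above can be tried out without a live blog. This is a minimal sketch, not the code used for the site: iter_batches and fake_fetch are hypothetical names, and fake_fetch stands in for the wp.call(GetPosts(...)) request.

```python
def iter_batches(fetch, increment=20):
    """Yield items from fetch(number, offset) until an empty batch comes back."""
    offset = 0
    while True:
        batch = fetch(increment, offset)
        if not batch:
            break  # no more items returned
        yield from batch
        offset += increment


# Hypothetical stand-in for the Wordpress call: 45 fake post ids
FAKE_POSTS = list(range(45))


def fake_fetch(number, offset):
    """Return one page of fake posts, like GetPosts with number and offset."""
    return FAKE_POSTS[offset:offset + number]


all_posts = list(iter_batches(fake_fetch))
print(len(all_posts))
```

The loop asks for pages of 20 until a page comes back empty, so 45 posts take four requests (20, 20, 5 and a final empty one).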

So you don't have to keep querying the site, you can save the data to a local file, for example as JSON:

with open('wp_data.json', 'w') as outfile:
    json.dump(post_data, outfile)

which looks something like this:

{
    "4444": {
        "title": "Wandering Around Wasdale",
        "date": "20181027",
        "img_count": 5
    }
}

where 4444 is the post id which you can use to reference the post on the blog with https://www.aiddy.com/crossoak?p=4444.
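As a small sketch of reading the saved data back, assuming the same shape as the sample above: post_url is a hypothetical helper, and the base URL is the one quoted in the text. One thing worth noticing is that JSON object keys are always strings, so the post id comes back as "4444" rather than a number.

```python
import json

# Same shape as the wp_data.json sample shown above
sample = (
    '{"4444": {"title": "Wandering Around Wasdale", '
    '"date": "20181027", "img_count": 5}}'
)
post_data = json.loads(sample)

BASE_URL = "https://www.aiddy.com/crossoak"  # taken from the text above


def post_url(post_id):
    """Build the link to a post from its id (hypothetical helper)."""
    return f"{BASE_URL}?p={post_id}"


for post_id, info in post_data.items():
    print(post_url(post_id), info["title"], info["img_count"])
```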

Exploring the data

To analyse the content we'll use Pandas and Seaborn.

import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# load JSON file with our Wordpress data
with open('wp_data.json') as json_data:
    d = json.load(json_data)

# convert to a DataFrame
df = pd.DataFrame(d)
df = df.transpose()
df.reset_index(inplace=True)
df.rename(columns={'index': 'id'}, inplace=True)

We can then use this to visualize the distribution of the number of images per post.

sns.distplot(df.img_count, kde_kws={"bw":.5}, bins=15)
plt.xlim(0, None)
plt.xlabel('Number of images per post')
Distribution of images per post

To delve a bit deeper, we'll augment the data, breaking out the date into year, month, day and weekday, and counting words and characters in the titles...

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.weekday
df['title_words'] = df['title'].map(lambda x: len(str(x).split()))
df['title_chars'] = df['title'].map(lambda x: len(x))
df['title_wordlength'] = df['title_chars'] - df['title_words']
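To make the derived columns concrete, here is a plain-Python sketch of the same calculations for a single post. The describe function is a hypothetical helper, not part of the analysis; the sample title and date are the "Last game" post that appears in the data.

```python
from datetime import datetime


def describe(title, date_str):
    """Compute the same derived fields as the DataFrame columns, for one post."""
    date = datetime.strptime(date_str, "%Y%m%d")
    words = len(title.split())
    chars = len(title)  # counts spaces as well as letters
    return {
        "year": date.year,
        "month": date.month,
        "day": date.day,
        "weekday": date.weekday(),  # Monday is 0, Sunday is 6
        "title_words": words,
        "title_chars": chars,
        "title_wordlength": chars - words,
    }


row = describe("Last game", "20091028")
print(row)
```

Note that title_wordlength is simply characters minus words, so it grows with the overall length of the title rather than being an average length per word.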

...and calculate summaries using groupby for counts of posts and the total number of images:

# get summarized views
df_imgs = df.groupby([df.year, df.month]).img_count.sum()
df_posts = df.groupby([df.year, df.month]).id.count()
df_imgs = pd.DataFrame(df_imgs)
df_imgs.reset_index(inplace=True)
df_posts = pd.DataFrame(df_posts)
df_posts.reset_index(inplace=True)

df_summary = pd.merge(df_imgs, df_posts, on=['month', 'year'])
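If groupby is unfamiliar, the summary amounts to the following, sketched here in plain Python rather than pandas. The rows are made-up sample values, and "posts" plays the role of the id count column.

```python
from collections import defaultdict

# Hypothetical sample rows: (year, month, img_count), one tuple per post
rows = [
    (2005, 5, 1),
    (2005, 5, 3),
    (2005, 6, 2),
    (2006, 5, 1),
]

# One bucket per (year, month), holding total images and number of posts
summary = defaultdict(lambda: {"img_count": 0, "posts": 0})
for year, month, img_count in rows:
    bucket = summary[(year, month)]
    bucket["img_count"] += img_count
    bucket["posts"] += 1

for (year, month), totals in sorted(summary.items()):
    print(year, month, totals["img_count"], totals["posts"])
```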

Comparing the DataFrames for the original data imported from JSON in df:

   id    date        img_count  title            year  month  day  weekday  title_chars  title_words  title_wordlength
0  100   2009-10-28  1          Last game        2009  10     28   2        9            2            7
1  1000  2005-10-23  1          Alice's Jugs     2005  10     23   6        12           2            10
2  1001  2005-10-23  1          Padstow Sunrise  2005  10     23   6        15           2            13
3  1002  2005-10-23  1          Rock at Night    2005  10     23   6        13           3            10
4  1003  2005-10-23  1          Padstow Sunset   2005  10     23   6        14           2            12

and the view in df_summary:

   year  month  img_count  id
0  2005  5      50         50
1  2005  6      17         17
2  2005  7      8          8
3  2005  8      21         21
4  2005  9      7          7

Now we can dig into: posts & images per year...

sns.barplot(x=df_summary['year'], y=df_summary['id'])
sns.barplot(x=df_summary['year'], y=df_summary['img_count'])

Posts per year Images per year

And per month:

Posts per month Images per month

And per weekday:

Posts per weekday Images per weekday

Using the summary data we can also explore correlations between fields with a pair plot:

Summary pairplot

or look at the correlation between the number of posts and the number of images per month:

jp = sns.jointplot(x=df_summary['img_count'], y=df_summary['id'], kind='reg')
jp.set_axis_labels('total images in a month', 'number of posts in a month')

Images versus posts per month
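The plot shows the relationship visually. One common way to put a single number on it is the Pearson correlation coefficient, sketched here by hand on the first five rows of df_summary shown earlier. The pearson function is an illustrative helper, not something from the original analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# First five rows of df_summary (May to September 2005)
img_count = [50, 17, 8, 21, 7]
posts = [50, 17, 8, 21, 7]

r = pearson(img_count, posts)
print(round(r, 3))
```

In those five months every post carried exactly one image, so the two columns are identical and the correlation is as high as it can be. Later months, where posts and images diverge, would pull it down.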

We can also play around with the words used in titles:

Words per month Words per year Words per weekday

and also see whether I used longer or shorter words in post titles over the years c r o s s o a k has been published:

Words per year

Conclusions

So what have we learnt from the excursion into the data lurking behind c r o s s o a k?

  1. 2012 was a low year for both posts and images posted. The number of posts recovered slightly in 2016-2018 but not the number of images (more posts have only one image on average).
  2. April is when I post most.
  3. I post on weekends and Fridays more than midweek.
  4. I use the fewest words in post titles in May.
  5. Over the years I've used more words and longer words in post titles.
  6. This is the kind of thing you do on a dark autumn evening in the northern hemisphere. Apparently it's why Scandinavia has so many tech start-ups (relative to population).

Fixing Crossoak

Category: SysAdmin
#blog

Sony DSC-V1

C r o s s o a k is a photo blog that goes back to Lost Something in Cromer in May 2005. It's really a photo journal. Or a log of things illustrated by photos that's available on the web, a web log. It's been through a couple of iterations since starting out on Blogger with snaps from a Sony DSC-V1 processed in Picasa.

For the longest time the core workflow was:

  1. Take photo
  2. Import to Adobe Lightroom
  3. Tweak photo
  4. Upload to Flickr
  5. Draft new post in Wordpress
  6. Publish

That had a couple of downsides. First, it's quite manual. Second, it's hard to do when travelling light. This meant that posts for Crossoak tended to batch up waiting for some time for me to publish.

Best Camera

There's an adage that the Best camera is the one you have with you. Around 2010 (for many reasons, not all of them photography related), the camera I had with me was often the one glued into the back of a mobile phone. That was okay for uploading pictures (there was an embarrassment of riches for sync'ing photos from phones), but publishing and sharing in something like a blog post was still challenging. In the real, non-geek, world that's why something like Instagram happens. Someone, somewhere, figures out how to solve a pain point that it turns out lots of other people also have. Turns out that included me too. So I had another workflow that went:

  1. Take photo
  2. Share on Instagram

But now I had posts on both Crossoak and Instagram (/sadface), and I didn't really want to manually republish posts that were already on Instagram to Crossoak because that makes even more work.

Enter IFTTT. IFTTT is a web service that lets you create recipes that combine actions from other web services. With IFTTT the Instagram workflow becomes:

  1. Take photo
  2. Share on Instagram
    1. Automatically!!!
    2. Check if Instagram post has the #blog tag, if it does then...
    3. Publish the instagram post to Crossoak too

This worked really well, so well that the majority of the Crossoak posts over the last 12 months have been via Instagram.

That was until stuff started to break.

Bit Rot

The problem was that posts published by IFTTT used Instagram links that changed, leaving large parts of Crossoak with broken image syndrome. Not a good look when you're a photo blog. Especially not when any text you include is frequently so cryptic as to cause confusion even with those that were featured in the accompanying photographs.

Fortunately, there was a straight-forward fix. When creating the IFTTT recipe to post from Instagram, I also created one to upload the same image to Flickr. This meant I had copies of the broken images (or all except one) on Flickr. Fixing was possible, but that was a lot of links. I was looking at all the time saved over the years in my clever hack to the publication workflow being eaten up by the cost of fixing. Douglas Coupland smiles.

Fixing Bit Rot

Programmatically, an automated fix was relatively trivial. Iterate through the posts on Crossoak; identify posts published from Instagram; search Flickr for the corresponding photo; update the Crossoak post, replacing Instagram with the corresponding link to Flickr. Simples.

First, iterate through posts using the python-wordpress-xmlrpc library:

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import GetPosts
# blog_url, auth_user and auth_password are placeholders for your own site and credentials
endpoint = blog_url + '/xmlrpc.php'
wp = Client(endpoint, auth_user, auth_password)

offset = 0
increment = 20
while True:
    posts = wp.call(GetPosts({'number': increment, 'offset': offset}))
    if len(posts) == 0:
        break  # no more posts returned
    for post in posts:
        update_if_instagram(post)
    offset = offset + increment

To identify Instagram posts I considered looking for the Instagram tag (which the IFTTT recipe created) but instead I opted for searching the <img> tag src attribute for the magic text with Beautiful Soup:

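from bs4 import BeautifulSoup

magic_text = 'instagram'
content = post.content
soup = BeautifulSoup(content, 'lxml')
for img in soup.findAll('img'):
    img_src = img.get('src', '')  # some img tags may have no src attribute
    if magic_text in img_src:
        pass  # this image came from Instagram and needs its link replaced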
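The same check can be tried without Beautiful Soup or a live post. This is a sketch using the standard library's html.parser in place of Beautiful Soup (a swapped-in technique, not what the fix used); ImgSrcCollector, instagram_sources and the example URLs are all hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def instagram_sources(html, magic_text="instagram"):
    """Return the img src values that contain the magic text."""
    parser = ImgSrcCollector()
    parser.feed(html)
    return [src for src in parser.sources if magic_text in src]


# Hypothetical post content with one Instagram image and one Flickr image
content = (
    '<p><img src="https://instagram.example.com/a.jpg">'
    '<img src="https://flickr.example.com/b.jpg"></p>'
)
print(instagram_sources(content))
```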

The tricky bit was finding the corresponding Flickr photos. Flickr has a lovely API (here's the API explorer for search) which the python-flickr-api library nicely wraps, so I can search with something like:

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    tags='instagram', 
    text=post.title)

There were two snags however. First, the text attribute is a fuzzy search, and my Instagram-generated post titles are far from unique. This was mitigated by scoping the search to +/- a day of the Wordpress post:

from datetime import timedelta

dmin = post.date - timedelta(days=1)
dmax = post.date + timedelta(days=1)

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    tags='instagram', 
    text=post.title, 
    min_upload_date=dmin.timestamp(), 
    max_upload_date=dmax.timestamp())
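The date window on its own looks like this. It is a sketch with a made-up post date; upload_window is a hypothetical helper. One caveat: calling .timestamp() on a naive datetime treats it as local time, so if post.date has no timezone the window can be off by a few hours, which the one-day margin either side should mostly absorb.

```python
from datetime import datetime, timedelta, timezone


def upload_window(post_date, days=1):
    """Return (min, max) Unix timestamps spanning +/- days around post_date."""
    delta = timedelta(days=days)
    return (post_date - delta).timestamp(), (post_date + delta).timestamp()


# Hypothetical post date, made timezone-aware so the result is unambiguous
post_date = datetime(2018, 10, 27, 12, 0, tzinfo=timezone.utc)
dmin, dmax = upload_window(post_date)
print(dmin, dmax)
```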

But a second problem was that Flickr wasn't returning everything I thought it should. In many cases I could manually browse to the right image, but the API wasn't returning it based on the text search. So I flipped the search logic and used the Flickr API to return all photos in the right time range and then let Python's string search find the match:

flickr_photos = flickr_api.Photo.search(user_id=user.id, 
    min_upload_date=dmin.timestamp(), 
    max_upload_date=dmax.timestamp())
for candidate_photo in flickr_photos:
    if candidate_photo.title.find(post.title) == 0: 
        doSomething()
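The matching step can be exercised with stand-in objects. In this sketch Photo is a hypothetical namedtuple, not the python-flickr-api class, and find_match is an illustrative helper. It uses startswith, which is equivalent to the find(...) == 0 test above.

```python
from collections import namedtuple

# Hypothetical stand-in for a Flickr photo object with a title attribute
Photo = namedtuple("Photo", ["id", "title"])


def find_match(candidates, post_title):
    """Return the first photo whose title starts with the post title, or None."""
    for photo in candidates:
        if photo.title.startswith(post_title):
            return photo
    return None


candidates = [
    Photo("1", "Something else"),
    Photo("2", "Wandering Around Wasdale #blog"),
]
match = find_match(candidates, "Wandering Around Wasdale")
print(match)
```

Returning None when nothing matches makes it easy to log the posts that still need fixing by hand.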

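All that was left was to call the Flickr GetSizes API for the photo URL and update the Wordpress post with the corrected attribute:

from wordpress_xmlrpc.methods.posts import EditPost

# posphoto is the matching Flickr photo found by the search above
photo_sizes = posphoto.getSizes()
img['src'] = photo_sizes['Original']['source']
# note: the lxml parser wraps the content in <html><body> tags, so str(soup) includes them
newcontent = str(soup)
# Update post
post.content = newcontent
wp.call(EditPost(post.id, post))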

Crossoak fixed.

Image viewing on a Raspberry Pi

Category: Raspberry Pi
#tools #sysadmin

I wanted a quick and simple slideshow on a Raspberry Pi:

apt-get install fbi
fbi -a -t 5 -noverbose img/*.jpg

This creates a slideshow with auto image scaling, a five second delay and no status bar, using all the .jpg images in the img subfolder.

Encoding with FFMPEG

Category: tools
#video #encoding #tools

Wow. iOS is fussy about MPEG4 encoding. Stuff that worked fine as HTML5 video sources in Chrome and Safari on a Mac failed to load in various i-devices. In the end I re-encoded using FFMPEG and:

ffmpeg.exe -i input.mov -codec:v libx264 -profile:v main -preset slow -b:v 2000k -maxrate 400k -bufsize 800k -vf scale=-1:1080 -threads 0 -codec:a libvo_aacenc -b:a 192k output.mp4

Storage Pools

Category: Tools
#Windows #Storage #Backup

If you're not restoring you're not backing up

Windows Server has this neat feature: Storage Pools. In a nutshell it separates the logical storage from physical devices. I use it to make two physical hard drives appear as one logical disk. Anything saved to the pool is mirrored to both disks. In theory, this means that a failure of one physical drive won't lose any data since a copy is available on the second.

Last week I had a drive failure. It wasn't either of the drives in the storage pool. Instead the system drive (a third drive hosting the OS) had failed.

I think it took me 40 minutes to be up and running enough to validate the data was okay.

  1. Install replacement system disk
  2. Reinstall Windows Server 2012 R2
  3. Reconnect two physical disks hosting the storage pool
  4. Trawl the interwebs for details of how to reattach the storage pool

Job done (except for the reboots and updates and reboots and updates thing...).

One trick: Windows Server doesn't automatically mount a newly attached pool on reboot. Here's the PowerShell rune to change that:

Get-VirtualDisk | Where-Object {$_.IsManualAttach -eq $True} | Set-VirtualDisk -IsManualAttach $False

More on MPD

Category: Tools

More jottings on MPD (previously on m a n n i n g t r e e)

Youtube-dl

Category: Tools

For those occasions where you don't have the bandwidth to watch something without buffering: youtube-dl.

On a Mac:

  1. sudo pip install youtube-dl
  2. brew install libav
  3. youtube-dl <url>

If bandwidth is a real pain and you just want the audio...

youtube-dl --extract-audio --audio-format mp3 <url>

To update:

sudo pip install -U youtube-dl

A comparison of sorting algorithms

Category: Programming
#Sorting #algorithms

From an internal DL. A comparison of sorting algorithms on different types of data

Sorting algorithms

Or, through the medium of Hungarian folk dancing

.Net book recommendations

Category: Library
#.net #programming #books

Top 10 books every .Net developer should own

Secure Copy

Category: Tools
#Sysadmin #linux #mac

Simple scp from remote to me:

$ scp username@remotehost:file.txt /some/local/directory

Simple scp from me to remote:

$ scp file.txt username@remotehost:remote/directory

Here's a cheat sheet