TheShed

Analysing Crossoak

Category: misc

A friend had a theory that photography, and therefore by extension c r o s s o a k, is a window into my mental well-being. So I did some digging, which I'm capturing here (and which is in all probability a much bigger insight into my head...). If you share that theory, the conclusions might tell you quite a lot. If you don't, they won't.

Getting the data

First off, let's grab some data from WordPress, the blog platform that currently powers c r o s s o a k.

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods.posts import GetPosts
from bs4 import BeautifulSoup
import json

# BLOG_URL, USERNAME and PASSWORD are placeholders for your own site's details
wp = Client(BLOG_URL + '/xmlrpc.php', USERNAME, PASSWORD)

This gets a reference to wp, which we can use to query WordPress. For example, this code:

post_data = {}
# fetch posts in batches of 20 until the server returns no more
offset = 0
increment = 20
while True:
    posts = wp.call(
        GetPosts({'number': increment, 'offset': offset})
    )
    if len(posts) == 0:
        break  # no more posts returned
    for post in posts:
        post_data[post.id] = {"title": post.title,
           "date": post.date.strftime("%Y%m%d"),
           "img_count": count_images(post)}
    offset += increment

will get some basic data on all the posts in the blog. In the above, count_images uses BeautifulSoup to parse each post's content and count its img tags:

def count_images(post):
    # parse the post's HTML and count the <img> tags
    soup = BeautifulSoup(post.content, 'lxml')
    return len(soup.find_all('img'))

So that you don't have to keep querying the site, you can save the data to a local file, for example by exporting it as JSON:

with open('wp_data.json', 'w') as outfile:
    json.dump(post_data, outfile)

which looks something like this:

{
    "4444": {
        "title": "Wandering Around Wasdale",
        "date": "20181027",
        "img_count": 5
    }
}

where 4444 is the post ID, which you can use to reference the post on the blog with https://www.aiddy.com/crossoak?p=4444.

Exploring the data

To analyse the content we'll use Pandas and Seaborn.

import json
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# load JSON file with our Wordpress data
with open('wp_data.json') as json_data:
    d = json.load(json_data)

# convert to a DataFrame, one row per post
df = pd.DataFrame(d)
df = df.transpose()
df.reset_index(inplace=True)
df.rename(columns={'index': 'id'}, inplace=True)
# the transposed dict gives object columns, so cast the image count back to int
df['img_count'] = df['img_count'].astype(int)

We can then visualize the distribution of the number of images per post:

sns.distplot(df.img_count, kde_kws={"bw": .5}, bins=15)
plt.xlim(0, None)
plt.xlabel('Number of images per post')
Figure: distribution of images per post

To delve a bit deeper, we'll augment the data, breaking the date out into year, month, day and weekday, and counting the words and characters in the titles...

df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.weekday
df['title_words'] = df['title'].map(lambda x: len(str(x).split()))
df['title_chars'] = df['title'].map(lambda x: len(str(x)))
# crude proxy for word length: title characters minus the number of words
df['title_wordlength'] = df['title_chars'] - df['title_words']

...and calculate summaries using groupby for counts of posts and the total number of images:

# get summarized views
df_imgs = df.groupby([df.year, df.month]).img_count.sum()
df_posts = df.groupby([df.year, df.month]).id.count()
df_imgs = pd.DataFrame(df_imgs)
df_imgs.reset_index(inplace=True)
df_posts = pd.DataFrame(df_posts)
df_posts.reset_index(inplace=True)

df_summary = pd.merge(df_imgs, df_posts, on=['month', 'year'])

Comparing the DataFrames, here's the original data imported from JSON in df:

| | id | date | img_count | title | year | month | day | weekday | title_chars | title_words | title_wordlength |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 100 | 2009-10-28 | 1 | Last game | 2009 | 10 | 28 | 2 | 9 | 2 | 7 |
| 1 | 1000 | 2005-10-23 | 1 | Alice's Jugs | 2005 | 10 | 23 | 6 | 12 | 2 | 10 |
| 2 | 1001 | 2005-10-23 | 1 | Padstow Sunrise | 2005 | 10 | 23 | 6 | 15 | 2 | 13 |
| 3 | 1002 | 2005-10-23 | 1 | Rock at Night | 2005 | 10 | 23 | 6 | 13 | 3 | 10 |
| 4 | 1003 | 2005-10-23 | 1 | Padstow Sunset | 2005 | 10 | 23 | 6 | 14 | 2 | 12 |

and the view in df_summary:

| | year | month | img_count | id |
| --- | --- | --- | --- | --- |
| 0 | 2005 | 5 | 50 | 50 |
| 1 | 2005 | 6 | 17 | 17 |
| 2 | 2005 | 7 | 8 | 8 |
| 3 | 2005 | 8 | 21 | 21 |
| 4 | 2005 | 9 | 7 | 7 |

Now we can dig into posts and images per year...

sns.barplot(x='year', y='id', data=df_summary)         # posts per year
plt.show()
sns.barplot(x='year', y='img_count', data=df_summary)  # images per year
plt.show()

Figures: posts per year; images per year

And per month:

Figures: posts per month; images per month

And per weekday:

Figures: posts per weekday; images per weekday
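
The code behind the per-month and per-weekday views isn't shown above; here's a minimal sketch of how they might be produced from df (the aggregation and column names are assumptions based on the earlier steps, not the original code):

# aggregate across all years by month
df_month = df.groupby('month').agg(
    posts=('id', 'count'),
    images=('img_count', 'sum')
).reset_index()

sns.barplot(x='month', y='posts', data=df_month)    # posts per month
plt.show()
sns.barplot(x='month', y='images', data=df_month)   # images per month
plt.show()

Swapping 'month' for 'weekday' in the groupby gives the per-weekday views.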

Using the summary data we can also explore correlations between fields with a pair plot:
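
The pair plot itself is a single call; a minimal sketch, assuming the df_summary columns shown above:

# pairwise scatter plots and distributions for the summary columns
sns.pairplot(df_summary[['year', 'month', 'img_count', 'id']])
plt.show()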

Figure: summary pairplot

or look at the correlation between the number of posts and the number of images per month:

jp = sns.jointplot(x='img_count', y='id', data=df_summary, kind='reg')
jp.set_axis_labels('total images in a month', 'number of posts in a month')

Figure: images versus posts per month

We can also play around with the words used in titles:

Figures: words per month; words per year; words per weekday

and also see whether I've used longer or shorter words in post titles over the years c r o s s o a k has been published:

Figure: words per year
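
The plotting code for the title-word views isn't shown; here's a sketch of how the per-year views might be built from the title_words and title_wordlength columns added earlier (a guess at the approach, not the original code):

# mean title word count and word-length proxy, per year
df_titles = df.groupby('year').agg(
    mean_words=('title_words', 'mean'),
    mean_wordlength=('title_wordlength', 'mean')
).reset_index()

sns.barplot(x='year', y='mean_words', data=df_titles)        # words per title, per year
plt.show()
sns.barplot(x='year', y='mean_wordlength', data=df_titles)   # word-length proxy, per year
plt.show()

Grouping by 'month' or 'weekday' instead gives the other views.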

Conclusions

So what have we learnt from this excursion into the data lurking behind c r o s s o a k?

  1. 2012 was a low year for both posts and images. The number of posts recovered slightly in 2016-2018, but the number of images didn't (on average, more posts now have only one image).
  2. April is when I post most.
  3. I post more on weekends and Fridays than midweek.
  4. I use the fewest words in post titles in May.
  5. Over the years I've used more words, and longer words, in post titles.
  6. This is the kind of thing you do on a dark autumn evening in the northern hemisphere. Apparently it's why Scandinavia has so many tech start-ups (relative to population).

BT Broadband Update

Category: Misc
#Network #BT #Broadband

Update: 29/10/2014

Same person from @BTCare: sorry to hear it's still slower than before. I checked the speed while on the call (~4.4Mbps down). She wanted to check something and called back 10 minutes later with:

Tests show that there's a mismatch on the supplier side. I'll escalate, normally 48hrs turnaround, but I'll check tomorrow

So, we wait some more.

Update: 24/10/2014

Lovely engineer. The line is okay (but the master socket was replaced with one with an integrated filter). The cabinet port is also okay; the feed from the DSLAM in the fibre cabinet is giving the rated 40Mbps, but this falls to 4-ish by the time it reaches us. The engineer can't check the fibre cabinet though ("that's another bit of BT, BT Wholesale").

Now we need to wait 10 days for the line to stabilise. Then, assuming no change, we'll need to choose whether to go back to ADSL or possibly try the 80Mbps feed from the DSLAM, because "that might give you *-ish by the time it reaches you".

Update: 16/10/2014

A nice person from @BTCare called. They predicted 9Mbps, so they will send an engineer to check.

Updated: 06/10/2014 08:08:57

Original: 03/10/2014 07:35:40

BT are upgrading the broadband today (Friday). Or at least they've scheduled an engineer to. Let's use their tools for data gathering: the BT Wholesale speed test.

Data

| Test | Download (Mbps) | Upload (Mbps) | Latency (ms) | When |
| :--- | ---: | ---: | ---: | :---: |
| Before | | | | |
| Dirty studio | 6.02 | 0.31 | 30.38 | Friday Day |
| | 5.98 | 0.67 | 36.25 | |
| | 6.08 | 0.77 | 44.38 | |
| Dirty HH4 | 6.24 | 0.72 | 38.5 | |
| | 6.02 | 0.71 | 35.25 | |
| | 6.18 | 0.69 | 34.63 | |
| Clean HH4 | 6.01 | 0.60 | 0.0 | |
| | 6.24 | 0.72 | 39.0 | |
| | 6.20 | 0.74 | 48.75 | |
| After (BT faster broadband upgrade) | | | | |
| Dirty HH5 | 4.31 | 0.95 | 53.13 | Friday |
| | 4.47 | 1.03 | 117.25 | Saturday |
| | 4.23 | 1.05 | 92.25 | Monday 6am |
| Clean HH5 | 4.22 | 0.56 | 60.88 | Monday 9am |
| | 4.24 | 0.55 | 45.25 | |
| | 4.24 | 0.64 | 36.63 | |

Observations

Notes

test

Category: misc

Adding Search to Pelican

Category: misc

To add search, try this. You'll find additional details over on the Tipue site.
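
The detail is on the Tipue site, but for context, here's a minimal sketch of what the pelicanconf.py side typically looks like, assuming the tipue_search plugin from the pelican-plugins repository and a theme that ships a search page template (setting names follow Pelican's documentation; adjust paths to your setup):

# pelicanconf.py -- sketch of a Tipue Search setup
PLUGIN_PATHS = ['plugins']            # wherever the pelican-plugins checkout lives
PLUGINS = ['tipue_search']            # builds the search index when the site is generated
DIRECT_TEMPLATES = ['index', 'categories', 'archives', 'search']   # add the search page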

Update: 20140927

It works, but make sure you don't have conflicts with different versions of jQuery. Oh, the joys of mixing-and-matching frameworks.

First post

Category: Misc
#First

So, here it is:

print("Hello, World!");

That okay with you?

Second Post

Category: misc

I've invented a time machine. Sounds cool? It is!