Question: does having a Kindle mean I buy more books?

A bit of fun for a Friday. I bought a Kindle 3 when they were launched in the UK in August 2010. Over the past 9 months, my impression is that I’ve been buying more books than I used to, and that they’re mostly Kindle books. I have a second prediction, which is that I read more than I used to, but we won’t cover that today.

To check whether my intuition was correct, I decided to take a look at my Amazon order history. I do buy books from other places, but they’re a minority and tend to be photography books, a category which I’m excluding from this analysis as they’re not the kind of book I’d buy on a Kindle. All other types of book are included, even cookery and programming books.

The first task was to grab my order history. US customers have it easy – Amazon.com have a reporting facility that lets you download all your orders by year. Alas, this doesn’t currently work for the UK site, and there isn’t an API, so I resorted to scraping my order history using Python. I’ll cover this in more detail in a later post, but let’s just say that Mechanize and BeautifulSoup are awesome for doing this kind of thing – Mechanize pretends to be a browser, and so enables you to authenticate with Amazon and let Python into the good stuff. BeautifulSoup then tries to make sense of the HTML being returned by letting you parse the tag tree and grab elements of interest.

Thankfully, the updated physical order history uses ID and class names, which makes it a little easier to home in on different aspects of the order, so this wasn’t too tricky. The Kindle order history is another matter though: nested tables with no identifiers, so the only way I could find an order block was to grab table rows which have bgcolor=’#ffffff’! Not pretty. The Kindle order page also doesn’t give any information about price. Although I didn’t need to include order total in the visualisation below, having price for the Kindle books was crucial because a large chunk of my downloads will have been free, out-of-print editions, and including these wouldn’t have been a fair comparison. So, to get price, I had to send a further request for each individual order page linked from the Kindle order history.
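To give a flavour of the row-matching trick, here is a minimal sketch. It is not the script I used: it swaps BeautifulSoup for the standard library's html.parser so that it runs without any installs, and the sample markup, titles and dates are invented for illustration. The only thing taken from the real page is the idea of picking out table rows by their bgcolor attribute.

```python
from html.parser import HTMLParser


class OrderRowParser(HTMLParser):
    """Collect the text fragments of every <tr bgcolor="#ffffff"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []    # one list of text fragments per matching row
        self._depth = 0   # greater than 0 while inside a matching row

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        if self._depth:
            # A nested row inside a row we are already collecting.
            self._depth += 1
        elif (dict(attrs).get("bgcolor") or "").lower() == "#ffffff":
            self._depth = 1
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "tr" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._depth and text:
            self.rows[-1].append(text)


# Invented sample markup; the real order page is far messier.
sample_html = """
<table>
  <tr bgcolor="#eeeeee"><td>Header</td></tr>
  <tr bgcolor="#ffffff"><td>A Hypothetical Novel</td><td>12 March 2011</td></tr>
  <tr bgcolor="#ffffff"><td>Another Made-Up Title</td><td>2 April 2011</td></tr>
</table>
"""

parser = OrderRowParser()
parser.feed(sample_html)
order_rows = parser.rows
print(order_rows)
```

Matching on a presentational attribute like this is fragile, since any restyle of the page would break it, but with no IDs or classes to hang on to there isn't much else to go on.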

A little while later, the two scripts gave me 581 items ordered since 2000 (including the free eBooks)! This includes non-book orders from Amazon and, helpfully, it appears that the Amazon ASIN identifier starts with a B when the product ID isn’t an ISBN, i.e. isn’t a book. This meant it was easy to separate out the two. I then manually removed anything that looked like a photography book, and brought the data into Tableau.
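The ASIN check is simple enough to sketch in a few lines. The order records and the is_book helper below are made up for illustration, and I'm assuming the check is only applied to the physical order list, since Kindle editions typically carry their own B-prefixed identifiers and come from the separate Kindle scrape anyway.

```python
def is_book(asin):
    """Heuristic: an ASIN that starts with 'B' is not an ISBN, so not a book."""
    return not asin.strip().upper().startswith("B")


# Invented order records; real ones came from the scraped order history.
physical_orders = [
    {"asin": "0123456789", "title": "Hypothetical paperback"},
    {"asin": "B000000000", "title": "Hypothetical gadget"},
    {"asin": "012345678X", "title": "Hypothetical cookbook"},
]

books = [order for order in physical_orders if is_book(order["asin"])]
non_books = [order for order in physical_orders if not is_book(order["asin"])]

print([order["title"] for order in books])
```

This is only as good as the observation behind it: it is a pattern I noticed in my own data, not a documented rule.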

My books vs paid Kindle books purchases (Current Kindle period: September - May)


Surprise! My Kindle purchases per month in the valid period (it’s only been 9 months since I got my Kindle, so I’m only comparing September-May each year) nearly mirror my physical book purchases from last year. The total for this year is higher, but looking further back, my Amazon book buying has steadily increased year on year, so there isn’t justification to say that the Kindle has affected my overall book buying quantity – though it’s clear that the majority of my purchases are now Kindle books.
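For anyone wanting to make the same like-for-like comparison outside Tableau, the September to May windowing can be sketched as below. The dates are invented and season_start_year is a hypothetical helper name; the only thing taken from the analysis above is the rule that June to August orders are left out.

```python
from collections import Counter
from datetime import date


def season_start_year(d):
    """Return the year in which the September-May window containing d began.

    Dates in June, July or August fall outside every window and give None.
    """
    if d.month >= 9:
        return d.year
    if d.month <= 5:
        return d.year - 1
    return None


# Invented order dates, purely for illustration.
order_dates = [
    date(2009, 10, 3),
    date(2010, 2, 14),
    date(2010, 7, 1),   # summer, so excluded from the comparison
    date(2010, 9, 20),
    date(2011, 5, 30),
]

seasons = (season_start_year(d) for d in order_dates)
counts = Counter(s for s in seasons if s is not None)
print(sorted(counts.items()))
```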

One assumption quashed – next time we’ll look at the Python scripts, and then take a look at cumulative order costs over 10 years!

[admin note: migrated from FindingVirtue to Ixyl in April 2019]

Data reporting: inner and outer joins

First of all, apologies for the styling mess; we are still decorating around here.

I want to propose that almost no management information report should ever contain an inner join. That is, it should always have a dominant entity and a series of subordinates. This flies in the face of database integrity: in theory your data integrity is watertight and you will always have perfectly matched joins. In practice, I haven’t seen this.

Let’s backtrack.

A simple database contains a Person table and an Orders table. The Person table contains a unique numeric identifier which we’ll use as the primary key. The Orders table contains a unique order identifier, and also the Person identifier used as the foreign key. When you’re running a website based on this database, you typically want to do two things:

1) show the user who they are

2) show the user what they’ve ordered

Let’s also think about join types: an inner join will return records where the key (PersonID) exists in both the Person and the Orders table. It won’t show you records where PersonID exists in Person but not in Orders, and it won’t show you records where PersonID exists in Orders but not in Person (rightly so – your database is broken if such records exist).

In situation 1, you’re just probing the Person table for a given ID – simple enough and no joins involved. You might want to show them some information about the number of orders they’ve made, and that could be an interesting experience – we’ll come back to that.

In situation 2, you’re listing the Orders table for a given PersonID – also simple enough. You barely need the join at all, but an inner join – selecting all records where the PersonID exists in both the Person and the Orders table – makes perfect sense. If there’s nothing to show, there’s nothing to show – the page is personal to you, so there’s just nothing to show. It’s not like you’re missing – you know you’re on the website and exist, so you just have no orders. The join is passively broken.

Let’s play at management information though. Here you are never an individual: you’re always taking an overview and wanting to use your Person table as a critical measure of ranking performance. You now care about the combination of Person and Orders, where it’s a many-to-many situation, not just the one Person to many (or none) Orders you had as a user.

Say your Person table has 60,000 customers. 30,000 of these have placed one order (for this example, you get banned after your first order). If you make an inner join on these two, you’ll end up with 30,000 customers and a single order record for each. Great! We know… very little indeed. We’ve learned that of those people who have made an order, they’ve made an order within the rules of our market. (Now OK, the one order thing is farcical, but it’s to avoid the multiplication of records – you could easily replace the ‘one order’ with a sum of orders and get something more meaningful, but….)

As someone looking at the overview, the most important thing to you is to see everything, even and especially when it has no subsequent impact. Your question is “what proportion of my customers are placing orders?” As an analyst, you can’t answer this with an inner join: an inner join will only ever give you the customers who have made orders, so it will always be 100%. That’s not useful.

I believe an analyst should only ever use one type of join, and it should be a left or right join. Personally I’ve always gone with left outers, but I’m left handed – you might like rights. An outer join says ‘give me everything from the dominant table, and then whatever matches from the next’. The join is actively incomplete. As an analyst, you have to decide what your dominant entity is, and then always ensure there is subordination from there. That decision of the first dominant table is important, but in truth it’s usually pretty easy to work out – in many cases it’s a person, a company, or a machine. An inner join requires that there is data at the other end: the whole point of management information is to show where there are gaps, where there is non-engagement, so that you can work on it and improve so that your performance increases. Gradually those null foreign keys should become populated.
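Here is a small sketch of the difference, using Python's built-in sqlite3 module. The four customers and two orders are invented, and the table and column names are just the ones used in this post, so treat it as an illustration of the argument and not as anyone's production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (
        OrderID  INTEGER PRIMARY KEY,
        PersonID INTEGER REFERENCES Person(PersonID)
    );
    INSERT INTO Person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cat'), (4, 'Dev');
    INSERT INTO Orders VALUES (101, 1), (102, 3);
""")

# Inner join: only customers who have ordered survive the join.
inner_customers = conn.execute("""
    SELECT COUNT(DISTINCT p.PersonID)
    FROM Person p
    INNER JOIN Orders o ON o.PersonID = p.PersonID
""").fetchone()[0]

# Left outer join: Person is the dominant table, Orders the subordinate.
# COUNT(DISTINCT o.PersonID) skips the NULLs left by unmatched customers.
total_customers, customers_with_orders = conn.execute("""
    SELECT COUNT(DISTINCT p.PersonID), COUNT(DISTINCT o.PersonID)
    FROM Person p
    LEFT JOIN Orders o ON o.PersonID = p.PersonID
""").fetchone()

ordering_share = customers_with_orders / total_customers
print(inner_customers, total_customers, customers_with_orders, ordering_share)
```

With the inner join there is no denominator left to divide by: everyone it returns has ordered. The left join keeps all four customers in view, so the share of customers ordering (half, in this toy data) falls straight out.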

Indeed, look at that as a base measure: the proportion of null foreign keys you have is a measure of how much work you have to do. It should never be zero, otherwise you haven’t found an uncaptured market yet; you should always have partially empty tables hanging off one another. It isn’t about data integrity – it’s about exposing possibilities. I worry when I see an inner join: it means something is possibly being excluded that I might care about. Like all the potential signed-up customers who’ve never ordered anything from me. Time for a ‘Hi, it’s been a while!’ email, perhaps?
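As a rough sketch of that base measure, again with sqlite3 and invented data (the Email column and the addresses are made up for the example), the left join plus an IS NULL filter gives you both the mailing list and the proportion of customers still to be won over:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (PersonID INTEGER PRIMARY KEY, Email TEXT);
    CREATE TABLE Orders (
        OrderID  INTEGER PRIMARY KEY,
        PersonID INTEGER REFERENCES Person(PersonID)
    );
    INSERT INTO Person VALUES
        (1, 'ann@example.com'), (2, 'bob@example.com'),
        (3, 'cat@example.com'), (4, 'dev@example.com'),
        (5, 'eve@example.com');
    INSERT INTO Orders VALUES (101, 1), (102, 3), (103, 3);
""")

# Customers with no matching order: the subordinate side of the join is NULL.
never_ordered = conn.execute("""
    SELECT p.PersonID, p.Email
    FROM Person p
    LEFT JOIN Orders o ON o.PersonID = p.PersonID
    WHERE o.OrderID IS NULL
    ORDER BY p.PersonID
""").fetchall()

total_customers = conn.execute("SELECT COUNT(*) FROM Person").fetchone()[0]
unengaged_share = len(never_ordered) / total_customers

print(never_ordered, unengaged_share)
```

An inner join over the same tables would never surface these three people at all, which is exactly the worry.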

That’s my experience, anyway – I may be totally wrong about some of this, but that’s the way it’s worked out in practice. Your thoughts are very much appreciated.

[admin note: migrated from FindingVirtue to Ixyl in April 2019]

We choose, in this decade, to banish instantiation

From this day forth, I can confirm the new programming paradigm will be:


# ********************************************************************
# TRASHY LITTLE SUBROUTINES
# ********************************************************************

As found in the Lunar Landing Guidance Equations of the Apollo Guidance Computer.

On non-constructive criticism of populist writers

Provoked by comments on Coding Horror – I Stopped Reading Your Blog Years Ago.

If you are a blog reader: your faith in knowledge is cumulative, so learn how many imprints of a particular “truth” you need before you’re satisfied that it’s a decent reflection of reality. This threshold is your dividing line between entertainment and knowledge.

If you are a blog writer: people either read things because they have to (technical blogs) or because they don’t have to (entertaining blogs). More people read the latter, and the good ones point to the former, forming a useful knowledge ladder. However, attacking the former, whether ad hominem or direct, whether justified or not, does not encourage an uninformed reader to want to read your version of the truth. One thing I’ve yet to see Jeff do is throw his toys out of the pram, yet on nearly every post he makes, there is a series of commenters who do write valuable, technical blogs, who embarrass themselves and diminish the exposure of their knowledge by piling in with non-constructive criticism.

On discrete/embedded coding strategies

The programming language should handle the business logic – get the stuff out of the database, make some decisions about it based on the other known factors, and end up with the data you want to display, described in a way which is comprehensible and reusable to the programming language, retaining the metadata obtained through the business logic. (for instance, pageTitle = “You’re Doing It Wrong”; bunchOfArrays = commentID, commentText, userID, userName, commentDate, gravatarID, gravatarCategory) There should be no HTML here (other than perhaps any originally embedded in commentText itself) because *at this stage you don’t even know whether you want to output it as HTML*. Here you are deciding on the data to output and how to describe it.

The data and its descriptors should then be handed over to a module which is capable of transforming it into the output language. Usually for our purposes, this is HTML, but it could easily be a CSV download, an RSS feed, or a bunch of emails. (well, why not?) This can be a templating engine which has templates for different output types, and the only logic it should really engage in is looping through arrays provided to it, i.e. foreach, and what to do when data has not been provided (e.g. a null gravatarID should mean that no img tag is output). There should be no decision making because your first stage has already provided you with all the information needed. If you are a template and are given a single variable, you output it once within the template definition. If you are given an array, you output it several times within the template definition. Here you get to define the order in which the data should be displayed, and the method (markup) you’re using to describe it. (for instance, whether it is *semantically* better to output the userName before or after the commentText, regardless of how you want it to be visually displayed.)

The final stage is presentational, usually in the form of CSS, and takes the structured markup and tells the device how to display it. Here you get to choose the color of the text, whether the userName should visually appear before or after the commentText (not the best example: think of a sidebar and body text; far too many developers output the sidebar and then the body text because the sidebar is being displayed on the left, when it is more accessible to output the body text then the sidebar, and style it so that they’re the other way round) and so on.

The first stage is defined by business logic and turns raw data into parsed data. (database -> decisions -> data+metadata)

The second stage is defined by semantic rules and usability-led, accessibility-led, platform-specific definitions and turns parsed data into structured data. (data+metadata -> output definition -> structured data)

The third stage is defined by presentational rules (design) and turns structured data into “displayed” data. (structured data -> style definition -> output data)
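The first two stages can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended framework: the function names, the sample comments and the /gravatar/ URL are all invented, and escaping commentText on output is my own choice here (the real thing might deliberately allow some embedded HTML through). The third stage would be a stylesheet targeting the classes and ids emitted below.

```python
import csv
import html
import io


def build_page_data():
    """Stage 1: business logic. Returns data plus metadata, no markup."""
    return {
        "pageTitle": "You're Doing It Wrong",
        "comments": [
            {"commentID": 1, "userName": "alice",
             "commentText": "First!", "gravatarID": "abc123"},
            {"commentID": 2, "userName": "bob",
             "commentText": "Tabs > spaces", "gravatarID": None},
        ],
    }


def render_html(data):
    """Stage 2, HTML flavour: only looping and 'is it missing?' checks."""
    parts = [f"<h1>{html.escape(data['pageTitle'])}</h1>"]
    for c in data["comments"]:
        parts.append(f'<div class="comment" id="comment-{c["commentID"]}">')
        parts.append(f'<p class="commentText">{html.escape(c["commentText"])}</p>')
        parts.append(f'<span class="userName">{html.escape(c["userName"])}</span>')
        if c["gravatarID"] is not None:  # null gravatarID: no img tag at all
            src = "/gravatar/" + html.escape(c["gravatarID"])
            parts.append(f'<img class="gravatar" src="{src}" alt="">')
        parts.append("</div>")
    return "\n".join(parts)


def render_csv(data):
    """Stage 2, CSV flavour: same data, different output definition."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["commentID", "userName", "commentText"])
    for c in data["comments"]:
        writer.writerow([c["commentID"], c["userName"], c["commentText"]])
    return buf.getvalue()


page_data = build_page_data()
page_html = render_html(page_data)
page_csv = render_csv(page_data)
print(page_html)
print(page_csv)
```

The point of the split is that build_page_data never changes when a new output type comes along; you just write another renderer.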

With embedded HTML all this is too difficult so you only offer one output stream, severely limiting your own options in the future. Also, with well designed output, the designer should never need to come back to “ask for another class”. The structured data should already contain enough semantic information in the form of the tag used and the id/class provided, to be able to hang any design elements off a set of CSS selectors.

If you are not designing your application in such a discrete way that a new “corporate image” or a print stylesheet decision only affects the presentational stage, or that a new usability issue only affects the second stage, or that a business decision only affects the first stage, then you are always going to run into trouble.

Visualisation – decline in interest rates

There’s an interesting post at PTS Blog about a chart used in the WSJ to show the decline in interest rates. There are some issues with the visualisation but I think the bigger problem is with using a very selective dataset which may serve more to support the writer’s position rather than reflecting reality. I left some remarks on the blog but here’s a very rough-and-ready view of 3 banks’ rates using a longer time-frame. The past few months do show rapid decline, but set against a different starting point back in 2004, the overall drop is somewhat less significant than the original chart makes out.

Graph showing the rise and then fall of interest rates for the UK, Australian and European central banks from 2004-2008

We

After reawakening this morning and seeing that Senate and House were also blue, I had a funny thought: that we would now be able to move on and start changing the world for the better.  I am from the UK and haven’t considered the USA to be a part of ‘we’ for the past 28% of my life.  Welcome back.

And so it begins

Time to get settled in.  Turn all your TVs to different news channels, reprogramme your multi-remote to switch audio between them, open at least 82 Firefox tabs, and hunker down!

Places to watch:

Making Light – Discussion hosted by Bruce Security Schneier
TPM have a dynamically updating dashboard

Edit: Daily Kos have a nice map too
Good luck all; the turnout at least looks promising.

Random: Tool “Remix”

It’s an aberration, and an insult.

But I was told to do it, and so I comply.

Tool – Eulogy (Lyxi Mix #10)

The brief: mix Lateralus with ITV’s Rainbow. Near impossible, I feel. So the secondary was: Eulogy mixed with the theme tune of Cities of Gold. I hope I’ve achieved this, and yet destroyed the beautiful song even further by splicing all sorts of nonsense from my wife’s wav collection.

Oh the terrible days we own. I apologise, I understand, and will maintain. This is like jazz; everything remains, for better or for worse. It’s better this way; we can at least reflect, and say sorry – one day. You talking to me? 😉