Data reporting: inner and outer joins

First of all apologies for the styling mess, we are still decorating around here.

I want to propose that nearly every management information report written should never contain an inner join. That is, it should always have a dominant entity and a series of subordinates. This flies in the face of database integrity: in theory your data integrity is watertight and you will always have perfectly matched joins. In practice, I haven’t seen this.

Let’s backtrack.

A simple database contains a Person table and an Orders table. The Person table contains a unique numeric identifier which we’ll use as the primary key. The Orders table contains a unique order identifier, and also the Person identifier used as the foreign key. When you’re running a website based on this database, you typically want to do two things:

1) show the user who they are

2) show the user what they’ve ordered

Let’s also think about join types: an inner join will return records where the key (Person) exists in both the Person and the Orders table. It won’t show you records where PersonID exists in Person but not in orders, and it won’t show you records where PersonID exists in Orders but not in Person (rightly so – your database is broken if it does).

In situation 1, you’re just probing the Person table for a given ID – simple enough and no joins involved. You might want to show them some information about the number of orders they’ve made, and that could be an interesting experience – we’ll come back to that.

In situation 2, you’re listing the Order table for a given personID – also simple enough, you barely need the join at all but an inner join – selecting all records where the personID exists in both the Person and the Orders table – makes perfect sense. If there’s nothing to show, there’s nothing to show – the page is personal to you, so there’s just nothing to show. It’s not like you’re missing – you know you’re on the website and exist, so you just have no orders. The join is passively broken.

Let’s play at management information though. Here you are never an individual, you’re always taking an overview and wanting to use your Person table as a critical measure of ranking performance. You now care about the combination of Person and Others, where it’s a many to many situation, not just the one Person to many (or none) Orders you had as a user.

Say your Person table has 60,000 customers. 30,000 of these have placed one order (for this example, you get banned after your first order). If you make an inner join on these two, you’ll end up with 30,000 customers and a single order record for each. Great! We know… very little indeed. We’ve learned that of those people who have made an order, they’ve made an order within the rules of our market. (now OK, the one order thing is farsical but it’s to avoid the multiplication of records – you could easily replace the ‘one order’ with a sum of orders and get something more meaningful, but….)

As someone looking at the overview, the most important thing to you is to see everything, even and especially when it has no subsequent impact. Your question is “what proportion of my customers are placing orders?” As an analyst, you can’t answer this with an inner join: an inner join will only ever give you the customers who have made orders, so it will always be 100%. That’s not useful.

I believe an analyst should only ever use one type of join, and it should be a left or right join. Personally I’ve always gone with left outers, but I’m left handed – you might like rights. An outer join says ‘give me everything from the dominant table, and then whatever matches from the next’. The join is actively incomplete. As an analyst, you have to decide what your dominant entity is, and then always ensure there is subordination from there. That decision of the first dominant table is important, but in truth it’s usually pretty easy to work out – in many cases it’s a person, a company, or a machine. An inner join requires that there is data at the other end: the whole point of management information is to show where there are gaps, where there is non-engagement, so that you can work on it and improve so that your performance increases. Gradually those null foreign keys should become populated.

Indeed, look at that as a base measure: the proportion of null foreign keys you have is a measure of how much work you have to do. It should never be zero, otherwise you haven’t found an uncaptured market yet; you should always have partially empty tables hanging off one another. It isn’t about data integrity – it’s about exposing possibilities. I worry when I see an inner join: it means something is possibly being excluded that I might care about. Like all the potential signed-up customers who’ve never ordered anything from me. Time for a ‘Hi, it’s been a while!’ email, perhaps?

That’s my experience, anyway – I may be totally wrong about some of this, but that’s the way it’s worked out in practice. Your thoughts very appreciated.

[admin note: migrated from FindingVirtue to Ixyl in April 2019]

Leave a comment

Your email address will not be published. Required fields are marked *