Rolling up your shirtsleeves has long been associated with the idea of working hard. I have zero issues with working hard – but you should work smart as well. I was reminded of this recently in a discussion with a colleague regarding a table that stores distinct record counts by day. The idea behind the design was to allow for a quick rollup to determine how many distinct records were added over the course of the month by adding the daily values. Putting it nicely, that is a terrible idea. With this particular data set, transactions can change over the course of several days, weeks, and even months or years. To illustrate how this can lead you down the road to ruin, I’m going to use an example everyone should be able to understand – church records.
Let’s say I am going to start with two key measures of success – offerings and attendance (a head count metric applicable in every business) at Sunday services. For now, I’m going to exclude any dimensional data, and focus on these two metrics. Because you aren’t going to take roll call in Sunday service, and the offerings all get tossed into a collective plate, the individual attendee and/or donation cannot be determined as the grain of the table. Instead, the grain is the Sunday service (for now we’ll assume a single service per week for simplicity). Here is a sample data set for the four Sundays in a single month.
1st – $100, 4 attendees
2nd – $100, 5 attendees
3rd – $200, 10 attendees
4th – $120, 8 attendees
Rolling the dollar amounts up is pretty straightforward – the sum of all four Sundays is $520. The average per week is $520/4 = $130. Even the average donation per attendee works, either at the rollup level or at the individual service. Fact tables are built for accumulating dollar amounts, so this should work all the time.
The headcount attendance metric, on the other hand, is a bit more complex. It is a count distinct at the weekly service level. As such, it is impossible to do analytics and effective rollup on this metric. For example, it may be that you only have 10 distinct people in total for all 4 weeks. On the other hand, you may have 27 distinct attendees, as none of them attended twice. It’s most likely that there is some combination of the two, with some having attended just once, and others having attended multiple times. This is the key problem with a rollup of a count distinct values as a metric. Unless you have the most atomic level of the metric’s data within the grain of the fact, there is no way you can absolutely and with certainty parse out the data and allocate it well enough to discern what the count distinct value would be rolled up to over a period of time (in this case, over the course of a month).
This is not a new problem. In the preface to Virginia Soldiers in the Revolution, Hamilton J. Eckenrode discusses this at length. On page 5, he states, “Unfortunately such payments [for militia service] are generally put down as having been made to some officer for himself and his company as a whole…” In other words, when the total payment record for the militia company has been rolled up to the company level, there’s no method by which even today’s data scientist can determine to whom the payments were eventually made. Not only does this prevent knowing the details of the individuals who served (name, rank, etc.), but it also prevents us from knowing how many served – or even for how long that particular tour lasted.
Why is this example relevant today? One of the problems with tracking revolutionary war service was that militia soldiers were shuffled through captains and reorganized with each militia call-up. While the speed of reorganization is far quicker than what you will see in today’s modern corporation, the core question is the same – how are your resources allocated throughout the organization? Just as importantly in today’s world, how do you track this over time? As metrics move into analytics, it is imperative that today’s leaders are able to track changes in employee and team performance as management or team organizations change. For more detailed reading on this, I highly recommend Jac Fitz-Enz’s books, specifically The New HR Analytics. In the meantime, please don’t roll up distinct counts as gospel – if you do, you’ll run the risk of adding 2+2 and getting 5.