The Best of Sam Cooke was the first CD I ever bought. One of my favorite songs on that CD, “Wonderful World“, begins with the iconic line “Don’t know much about history.” Through the rest of the song, Cooke sings about a number of the other things he “don’t know much about”. He then ends each verse with a contrasting item he does know. For example, he “don’t know much about” trigonometry or algebra, but he does “know one and one is two”. For me, this song perfectly exemplifies the use and misuse of NULL values within a data ecosystem, and specifically within the data warehouse.
For those readers who do not know, a NULL value is a data value that is completely unknown and/or does not exist. In a transactional system, NULL values are an absolute must. When a user is entering a transaction into an application screen, not every field will be relevant. For example, when entering the name of a customer, the surname (Sr., Jr., III, etc.) is only relevant on occasion. Most people don’t have a surname. For storage and efficiency purposes, nothing gets entered into that field for those customers where it isn’t relevant, and consequently, nothing gets stored in the DB, either.
Within the data warehouse, on the other hand, you want to avoid NULLs when you can – especially within the dimensions. The Kimball group, founded by guru Ralph Kimball, states the reasoning succinctly, so I will refer the reader to their take on it as follows:
Null-valued dimension attributes result when a given dimension row has not been fully populated, or when there are attributes that are not applicable to all the dimension’s rows. In both cases, we recommend substituting a descriptive string, such as Unknown or Not Applicable in place of the null value. Nulls in dimension attributes should be avoided because different databases handle grouping and constraining on nulls inconsistently. [Quoted from the Kimball group, here]
The key reasoning is the inconsistency with the way NULL values are handled in different DBs and query tools. Many tools ignore NULLs at all times, which is detrimental when the user is trying to exclude specific values. For example, if the user wants to only show customers born in Virginia, it’s easy enough to filter the birth_state_cd column on the state value of “VA”. However, if the user wants to exclude customers born in Virginia, a filter of not in (“VA”) would exclude native Virginians, but in most systems and query tools, it would also exclude all customers whose birth_state_cd is NULL (the default value in the source system for unknown). This would compound their first mistake – because I can say without any bias at all that Virginians are naturally good looking and intelligent. It’s a fact, Jack!
My terrible comedy and shameless plugs for other articles on this site aside, hopefully you can see how this could create confusing results for your end users – especially when they are writing ad-hoc queries. Instead of giving your end users the chance to have unexpected results, substitute a standard consistent string instead, as suggested in the quote above. Not only will this allow for your users to get better results, but it will also give you the opportunity to begin consistently auditing the number of records missing key data elements, and perhaps improve your data entry processes as a result. If you happen to listen to Sam Cooke while you do your work, all the better for you.
2 Comments Add yours