Gross! When Dirty Data Flows Downstream

When my wife and I were married a decade ago, we discussed honeymoon options. I suggested Hawaii, Italy, or Greece. My wife, being as intelligent as she is beautiful, said that she would be good with any of those options. Sensing that this gave me an opportunity to plan something really cool, I volunteered to take care of the arrangements and surprise her with the results. After all, my analysis of the trip options showed that all three locations had mountains, beaches, and volcanoes. In July, all three have average highs in the low to mid-80s. This let her pack a bag that could literally go to all three locations. To teach my hyper-planning wife the beauty of versatility, I bought a Frommer’s guide to each location for her to thumb through and dog ear activities that looked interesting.

Like a House Hunters episode, I started the choosing process by eliminating one option – Hawaii. I reasoned that we would be starting a family soon and that by being in the US, I could guarantee access to key products like diapers and wipes. This in turn made me think that it would be a great family trip later down the road. Of the latter two, my first choice was Italy, due in part to having taken five years of Latin in middle and high school. However, during the planning process, I came across a number of articles like this one, talking about towns around Italy that were struggling to get trash off of their streets. Worse, I subsequently read about a garbage collector’s strike across southern Italy, which made an already poor situation worse. This resulted in my planning a trip to Greece, which to be perfectly fair, was neck and neck with Italy for me.

One of the highlights of our honeymoon for me was the trip to the palace at Knossos. Since seeing a slideshow of it in elementary school, the ancient home of the Minoans had been high on my list of places to see before I died. During the tour, I got to show my wife my two favorite parts – the dolphin mosaics and some of the world’s earliest toilets. That’s right, I said one of the highlights of my honeymoon was seeing 3,500 year old toilets. Note that these were in the queen’s apartment. Even back then the ladies’ room was nicer than the men’s facilities (apparently either a chamber pot or the nearest tree)! The toilets were pretty rudimentary, but they did use water to flush out the waste.

I know you’re asking, “what does this have to do with data warehousing”? If you were paying attention to the above paragraphs, you’ll notice two distinct topics – the undesired accumulation of garbage and a process to dispose of waste. So, dear reader, let me ask you this: How many times have you dealt with dirty data in your warehouse due to a “Garbage In, Garbage Out” error in your data management process? You shouldn’t avoid the issue, as I did when I avoided Italy on our honeymoon. You also shouldn’t let the waste accumulate – you need a process by which to get things clean again, following the example of the waste disposal system at Knossos. For thousands of years, humans have recognized that filth breeds disease, and dirty source data is the disease that can kill the usefulness of any warehouse (other other downstream applications and reports for that matter).

With that said, how to you solve it? That largely depends on where you are in your data warehouse process. If you are just starting out from a single source or two, then you should begin with a review of the data quality within the source systems. Often this is a painful exercise, but good health is often a painful exercise (there’s a reason I only run after my children or away from danger). That doesn’t mean you shouldn’t do it (similarly, I should probably run more and write less). Your data will likely never be perfect, and that’s ok. Humans enter the data at some point, and humans are inherently fallible. However, perfection should still be the goal, and for the data movement steps, is absolutely attainable. By starting at the source, you minimize the garbage going in, which in turn will minimize the garbage coming out of the warehouse in the form of reports. For continuously improving your data quality, I’m a big fan of a Six Sigma methodology, but it doesn’t really matter if you apply that specific process to it. What does matter is that you apply some process to discern errors and correct them going forward. Often this can be through changing the data entry process to something more automated. As you build your ETL process, I recommend adding data quality checks at each step in the process. Similar to turning on the full logging within an application, this can potentially slow down your ETL process, but in my opinion, these checks are valuable steps even if you choose to turn them off. After all, it is far easier to turn something back on than it is to create a data quality check within your ETL process while you are frantically troubleshooting how bad records were inserted into your warehouse.

In the above paragraph, we largely covered what to do with a new data ecosystem. If, however, you are working with a more mature data warehouse, I would recommend doing the inverse. Unless you already know of a bad process causing inaccuracies in the warehouse, start with one of your commonly used dimensions. While many people would say that you begin with the fact due to the need to ensure the numbers are accurate, I would disagree. First, your users will be filtering most queries on dimensional data. As we’ve previously discussed in Don’t Know Much About… NULL Values, a filter on a dimension field containing NULL values may accidentally exclude records meant to be included. Second, if the dimension has erroneous or duplicate keys, it won’t matter how accurate your fact table is, as it won’t properly map anyway. Third, your dimension should have far less records than your fact table, making clean up more efficient and far less overwhelming. Within the dimension, do some basic initial analysis – I.E. – select count(1) with group by on a single column from the dimension table. Look for data integrity issues with the landing or staging area first. Once you are confident in the ETL process taking place on your Data warehouse system, check the values against the source system in a similar manner. Remember to exclude anything in the OLTP with an update timestamp after the ETL job executed, as those records will be different by default.

With one clean dimension, the validation of the metrics within your fact table should be easy to do. Simple summations and counts with groupings from your recently cleaned dimension when compared with the source will provide all the confidence you need to either prioritize the fact table or move on to the next dimension within your star schema. As you create your validation scripts, I would recommend documenting them, and perhaps even scripting them as data quality checks going forward. I know I’ve essentially stated that several paragraphs ago, but it is important enough to mention again. Repeat this process with each dimension, and at the end of it, you should have a data match to your source system. After all, this portion of our work is an exact science. Here is where things can get most interesting, and where BI can often provide the greatest value. Once your warehouse data perfectly matches the source data, any discrepancies in your data quality must inherently be source system issues. As you discover these, write additional scripts to check for these errors going forward. Most importantly, bring these to your business for them to review the process that created the data defects in the first place. If you can suggest the solution without delaying the notification of your clients, I would do so. Not only is it the right thing for the business, but it’s also the right thing for you, as you get to control the solution to ensure it doesn’t create additional problems in the downstream ETL code.

Rock the Data!

Gross! When Dirty Data Flows Downstream

3 Comments Add yours

Leave a comment Cancel reply

Share this:

Related

3 Comments Add yours

Leave a comment Cancel reply