Shoot for the Star Schema!


The phrase “Shoot for the stars!” has long been associated with trying to achieve something thought impossible, with the idea that while you may not get all of the way there, by aiming for something grandiose, you will at least achieve more than if you set your goals at something easily accomplished. In more recent years, this has been replaced by “Give 110%”, which leads me to remind the reader to beware the use of a percentage as a metric. (As an aside, that particular phrase is a personal pet peeve of mine, as if I could give more than 100% of myself, I’d finish the cloning process and go full bore Multiplicity.)

In data warehousing, you should quite literally always shoot for the stars – star schemas, that is. A star schema will give you the optimal design for simplicity and performance. Simplicity of design also helps with data integrity issues, as simpler designs will be more easily audited for errors and inconsistencies. So, with that said, exactly what is a star schema? Put simply, a star schema is a series of dimensional tables grouped around a single fact table, with each dimension joined to the fact table by the dimension key (or perhaps the natural key and an active record indicator, as discussed in our Type 2 dimension article). Generally, each dimensional attribute will have zero or more transactions on the fact table. However, as the focus of your star schema should be built around analysis of those transactions, all of these should be inner joins.

To determine what data elements go where, there’s a very simple rule of thumb I like to follow. First, I try and group all of the text fields around a transaction together into dimensions. by associating the textual nouns with the adjectives describing them. For example, a customer dimension would begin with the data elements stored in the transaction that allow for you to associate a unique customer with the transaction. From there, you would add in all of the additional details that would describe that customer. If the customer is an individual, perhaps you include demographic information such as birth date or gender along with the customer identifier, customer name, etc. The dimensional table should be loaded with each distinct record for the dimension (essentially you are adding a group by statement on the entire dimensional record prior to populating it into the table).

Next, I group all of the point in time attributes that apply to that specific transaction into the fact table. This will include date (and time if relevant), the transaction id, and all numeric values you wish to aggregate. It’s important to note that the aggregation is the key – plenty of numeric values can be stored within a dimension. In the aforementioned customer dimension, current age would be an example of this. A measurement that would apply to the dimensional record every time would also qualify – drug dosage for a specific NDC would be another example. At this point, you will also add in the key from each dimensional table in order to look up the dimensional details that apply to the fact record in question. Every transaction should have a record, and only one record, as determined by your predefined transactional grain (I.E. – Is the grain the whole order, the specific item ordered, etc.).

That’s it – you’ve got an overview of the basics now. Please leave questions/comments below which will be used to help fill in the details in future articles.

 

 

4 Comments Add yours

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s