While we’ve previously discussed the importance of determining the grain of your fact table, we really didn’t cover exactly what data elements should go into the table. One of the reasons much of this site will concentrate on dimensions is because the dimensional data keys are integral to the fact table build. Once you determine the grain and decide on the dimensions, the fact table in a traditional star schema will be a representation of the point in time at which each of the dimensional attributes meet. In addition, any numeric attributes or measures solely associated with the specific transaction will be included. Finally, a fact table may or may not include a transaction id from the original source system.
If you’re reading this, you are probably already familiar with a typical spreadsheet design. Across the top, you have a series of column headers, and on the left, the rows are numbered. This in turn is meant to mimic the flat files developed as the original database systems (anyone reading this remember punch cards?). What you may not know is that these were designed in order to assist the US Census Bureau with the collection and storage of data – in 1890 (Sadly, in spite of this innovation, most of the 1890 census was destroyed by fire in 1921). The census records were set up similar to how a spreadsheet is today – column headers define the data listed below them, while the leftmost column typically defines the row.
The census format took its cue from Auditor’s Account Books, which were used to keep incoming and outgoing records to and from the government’s treasury. These books listed a textual description of the transaction on the left with columns for the numeric values on the right. Each page often included running totals at the top and bottom of each page. Believe it or not, but similar ledger records have been found going back more than 5,000 years. You heard me right – there were bean counters back when all the people had worth counting were beans. Over the millenia, people have alternated between recording these records from top to bottom or left to right in an attempt to improve efficiency. Perhaps this was the first data warehouse debate. The primary difference between the Auditor’s Accounts and the census book is that the census was attempting to record multiple metrics simultaneously, while the Auditor was only recording cash income and expenditures. The column format of the Auditor’s books enables the quick totaling of the values, as all of these are numeric. The census, on the other hand, often records textual values within the cells, along with a series of check boxes that could be construed as a precursor to the modern day database flag field.
So… how does this relate to building a fact table? It’s actually pretty simple. A proper star schema provides you the flexibility of both a vertical design and a horizontal. For speed of querying, rather than storing duplicate values once per record that is relevant, it instead stores the distinct values in the dimension table. As a result, the user can quickly query and filter any and all of the textual values within the data warehouse. The fact contains only the dimensional ID fields you will need to join to the text within the dimensions and any numeric values that can be aggregated vertically in a column as were done in the Auditor’s books of old. The result of this should be a series of wide dimensional tables and a long fact table. When joined, the filters on the distinct dimensional values in turn filter the fact table through the use of inner joins. The filtered fact table values will then roll up to whatever grouping is applied (if any). The output will then appear to be just like the original flat file format, but will be far more efficient to store, audit, and query than if it had queried a flat file.