Something I hear quite often is an ETL or IT resource declaring something to the effect of “99% accuracy is industry standard”. Well, I’m going to tell you here and now that 99% isn’t good enough, and I’m going to demonstrate why with a real-life example.
Genealogists avidly search to fill in gaps in the stories of their ancestors, to the tune of nearly a billion dollars annually in website revenues alone. While researching a Revolutionary War ancestor, I became aware of the pension project over at http://revwarapps.org. Being a data nerd, I realized that the project could be greatly enhanced by tracking down muster rolls and pay rolls, which would allow for further analysis and research at the company level. For example, one soldier may remember his captain’s name perfectly forty years later, while another may not. George Woodard remembered serving under Captain John Graves at the Siege of Ninety Six in 1781. William Caldwell remembered it being Captain Clemmons. Luke Valentine, on the other hand, remembered serving under Adam Clement, Robert Cobbs, and Edmund Tate. Both are on the muster roll of Captain Adam Clements of Bedford, VA. I know at this point you are thinking, “What the heck does this have to do with poor ETL processes?”
Here’s why the above story is relevant:
In the 1930s, the National Archives hired a number of individuals to help index its military records, so as to ease the burden that research requests placed on archivists while a new American hobby, genealogy, was in its infancy. These individuals were hired to read muster rolls, pay rolls, and other records, and to create an indexed card catalog for ease of searching. While searching for the service of an ancestor of mine, I repeatedly ignored one of these cards, as it was indexed under the name “Battrick Linch”. Years later, while transcribing a series of muster rolls to assist the pension project, I came across the full muster roll of Captain Clements. It was only while transcribing it that I noticed the data errors. First, its heading clearly said the men enlisted in Bedford, VA. However, the roll was (and currently remains) indexed as Monongalia County, VA. In addition, one of the men listed was clearly Patrick Lynch, not Battrick Linch.
I’m certainly not the first person to come across these errors, but once an error is entered, it gets reproduced again and again. Once that occurs, it becomes far more work to correct than it would have been to prevent the issue by aiming for 100% accurate data in the first place. Imagine if the data that was wrong were yours. Would you want this to be you? Do it right the first time, every time.
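To put a rough number on what 99% means, here is a back-of-the-envelope sketch in Python. The record count and the number of copy steps are made-up figures for illustration, not numbers from the archives, and the function names are my own. It also assumes that errors are independent of one another and that nothing downstream ever repairs them, which is a simplification.

```python
def expected_errors(records: int, accuracy: float) -> float:
    """Expected number of wrong records when each record is
    independently correct with probability `accuracy`."""
    return records * (1.0 - accuracy)


def accuracy_after_stages(accuracy: float, stages: int) -> float:
    """Chance a record is still correct after `stages` copy or
    transform steps, each keeping it intact with probability
    `accuracy` and none of them fixing an earlier mistake."""
    return accuracy ** stages


if __name__ == "__main__":
    # Hypothetical load of one million records at "industry standard" 99%.
    print(expected_errors(1_000_000, 0.99))

    # Hypothetical pipeline of five hand-offs, each 99% accurate.
    print(accuracy_after_stages(0.99, 5))
```

Under those assumptions, 99% on a million records still leaves roughly ten thousand wrong ones, and five 99% hand-offs in a row leave only about 95% of records intact.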