Old habits die hard. That, in a nutshell, is the concept behind regression to the mean. To understand this concept, let’s first define what mean, means (and forgive me for sounding like Bill Clinton during the Lewinski affair). Mean is the highfalutin way statisticians say “average”. With regression to the mean, the philosophy is that while outliers may occur, things will average out over time. Nowhere is this more easily understood than in the world of sports.
Let’s look at the case of Eric Thames. In 181 games across two seasons in the majors, Thames posted a cumulative .250 batting average and a slugging percentage of .434. He then went to Korea and played professionally for three years, where he put up an impressive .348 average with a .715 slugging percentage. Now with the Milwaukee Brewers, he started 2017 off on fire, with a .345 average and .810 slugging percentage. Articles appeared everywhere, touting his new skills, approach to the game, etc. MLB drug tested him extensively to see if PEDs were causing this newfound improvement, as his April also included 11 home runs and 28 runs scored – both club records. His star looked to be on the rise with no end to his good fortune in sight.
Of course, if that were the case, he wouldn’t be in this article. Thames crashed in May, hard, and his ups and downs have continued so far this season. As of the writing of this article, he is posting a .236 average with a mere .507 slugging percentage (Edit – he finished the 2017 season on an upswing with a .247 average and a .518 slugging percentage). This is a perfect example of regression to the mean. Quite simply, his April numbers were too small a sample size to prove change occurred. Instead, April proved to be an outlier, and Thames reverted to the batter he’s always been. Sort of.
As Mark Twain once popularized, “There are three kinds of lies. Lies, darned lies, and statistics.” It’s imperative in any kind of analytics or BI position that you tell the complete story. Even sporting that .236 average, Thames is also earning a .348 on base percentage (edit – he finished 2017 with a .359 on base percentage), thanks in large part to the plate discipline he learned as he matured as a hitter, but also in part to the power he showed earlier this year. As of today, he’s been walked 66 times – tied for best on the team (he finished with 80 walks on the season). He’s getting a free pass to first on 14.75% of his plate appearances, nearly three times the 5.5% he posted earlier in his career (He finished with 14.5% of the season). His strikeout ratio, previously at nearly 27% of his plate appearances, has crept up slightly to 29% of his plate appearances, which is still pretty close to the batter he always was (he finished the 2017 season with 29.5%).
As we’ve discussed previously, sample size is key to understanding the relevance of your data. The smaller the sample size, the greater the likelihood that an outlier may occur. One of the most important jobs of the BI analyst is to sift through the larger data sets to determine if a new pattern is emerging, or if it is simply an outlier. If it is an outlier, the odds are that eventually the pattern will normalize over time and regress to the average for the data set as a whole.