In my last article, I explored the correlations between golfers’ player stats and highlighted the problem of correlated variables. Today, we’re going to get into the weeds a bit and outline how to disentangle them, and what happens when we do.
Correlation between variables is a pretty well-known problem in data science generally, and it becomes a problem for fitting models efficiently when our list of variables starts to explode. To get around it, there’s an approach called dimensionality reduction, which uses the correlations between variables to transform them so that they’re as independent from one another as possible. Once they’re transformed, we can calculate what’s called the explained variance ratio, which measures how much of the total variation in our variables is captured by each transformed component. The “too long; didn’t read” version: when we have a bunch of correlated variables, dimensionality reduction lets us see how many truly independent variables we actually have.
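Here’s a quick sketch of that transform-and-measure step in code, using scikit-learn and made-up data (not the golfer stats): three correlated variables are built from just two underlying signals, and the explained variance ratio reveals that only two components really matter.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic example: three correlated variables built from two
# underlying independent signals plus a little noise.
rng = np.random.default_rng(0)
s1, s2 = rng.normal(size=(2, 500))
X = np.column_stack([
    s1 + 0.05 * rng.normal(size=500),                    # mostly signal 1
    0.8 * s1 + 0.2 * s2 + 0.05 * rng.normal(size=500),   # a mix of both
    s2 + 0.05 * rng.normal(size=500),                    # mostly signal 2
])

pca = PCA().fit(X)
# Each entry is the share of total variance explained by one
# transformed component; the entries sum to 1.
print(pca.explained_variance_ratio_)
```

Even though we fed in three variables, the first two components soak up nearly all of the variance, which is PCA’s way of telling us there are really only two independent things going on.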
Now that the math lesson is over (no more, I promise), how does this apply to golfer stats? I went back to the data set from last week’s article, which had a bunch of correlated variables (Driving Distance/Accuracy, GIR, Scrambling, PPR), and ran them through a dimensionality reduction algorithm called principal component analysis, or PCA. The fitted PCA model gave me a backdoor peek at how many truly independent variables there are in player stats, and how much variance each one explains. Here’s the explained variance ratio for each of the five transformed components:
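For anyone who wants to reproduce the shape of that analysis, here’s a minimal sketch. The column names and data are stand-ins (randomly generated, not real tour numbers), but the pipeline — standardize the five stats, fit PCA, read off the cumulative explained variance — is the same idea.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the article's data set: five stats driven
# by three underlying factors plus noise (real analysis would load
# actual tour data here).
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 3))       # three underlying factors
mix = rng.normal(size=(3, 5))          # how factors feed the stats
stats = pd.DataFrame(
    base @ mix + 0.1 * rng.normal(size=(200, 5)),
    columns=["driving_distance", "driving_accuracy",
             "gir", "scrambling", "ppr"],
)

# Standardize first so no stat dominates purely because of its scale.
X = StandardScaler().fit_transform(stats)
pca = PCA().fit(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
# How many components does it take to cross, say, 90% of the variance?
n_needed = int(np.searchsorted(cumulative, 0.90)) + 1
print(cumulative, n_needed)
```

Because this toy data is built from three underlying factors, three components cover the bulk of the variance — the same pattern the article found in the real golfer stats.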
The eye-popping stat: 91 percent of the “true” variance in player stats is explained by just three components. The remaining nine percent suggests the rest of those stats is largely noise, and by extension, so is part of a player’s results.
For those of you wondering why there aren’t any labels on that graph, i.e. “how much variance does transformed Driving Distance explain vs. transformed GIR,” that’s by design: there aren’t any meaningful labels to give. The downside of something like PCA is that once we transform our inputs, each component becomes a blend of all of them, and we lose the ability to relate it to a concept we’re intuitively familiar with. So with our player stats, once they’re transformed, we can no longer explain what’s actually going on in these variables or what they represent.
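A tiny illustration of why the components resist labeling (again with throwaway random data): each principal component is a weighted blend of every original input, so no single component maps cleanly onto any one familiar stat.

```python
import numpy as np
from sklearn.decomposition import PCA

# Three correlated variables; PCA will produce three components.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
pca = PCA().fit(X)

# Rows are components; columns are weights on the original variables.
# A row like [0.6, -0.5, 0.6] mixes every input at once, which is why
# a component can't be read as "transformed Driving Distance."
print(np.round(pca.components_, 2))
```

Every row has nonzero weight spread across the inputs, so there’s no honest way to slap a single stat’s name on a component — which is exactly the interpretability cost described above.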
Doesn’t sound very useful, does it? If we can’t explain what these new variables are, how do we know what “true” stats to look for in golfers? And what’s the point of applying all of these high-level algorithms if there’s not any tangible benefit? The benefit part is coming next week when we tackle one of the most elusive concepts in DFS golf and arguably sports in general: Luck.