Welcome to the third and final entry in my three-part series on leadership in hockey. The first part dealt with leadership as a latent variable, and the second covered leadership as an independent variable in the regression equation. This installment will focus on how leadership may fit into the error term within the regression equation.
Before I begin, I want to take a moment to let you all know how surprised I was at the response to the previous post on leadership and regression analysis. In the analytics chapter of his latest book, Bob McKenzie quotes Vince Ferrari (the alias for Timothy Barnes) as saying:
The math Fs people up. A lot of what is written now is way over the top with numbers. It’s crazy–kids with graduate degrees in stats or math, and they’re always looking for raw numbers and they want to throw them into the regression hopper and say, “Your answer is this.” Hockey fans, people in general, don’t know what this shit means. So they’re either impressed or they say, “This is shit. You’re a nerd.” Neither is a good thing. The kid with the PhD is probably a good person but he’s fucking up really bad. The conversation doesn’t move forward.
The basic argument from Barnes, and several other big names in contemporary hockey analytics, is that we have to keep things at a pretty basic level. However, I am not really after an analytics job with a hockey team (I have a very good job), and this blog is really a pet project that centers around the question of how different types of analysis can be usefully brought together. This essentially makes it a theory and epistemology blog, which means that it will be heavy going at times in order to do justice to the topics at hand. I would argue, contrary to what Mr. Barnes says, that these types of posts do move the conversation forward. We need to learn to do what we do very well and with precision, and then we need to learn to do it all better.
A portion, perhaps a small one, of people who read about hockey want to read about how the numbers work, and about the logic behind the analysis. I am now more convinced of this than ever because my post on regression tripled my previous record for hits. The nerds have spoken.
Sorry for the long intro, but I really wanted to get that off my chest. Now on to error terms, which are the real topic of this post.
Types of Errors
When you look at a regression equation, different types of error occur in different places. Here is the equation for multiple regression, which I am re-posting so you don’t have to open up a new browser window to have it handy:
Yi = β0 + β1Xi1 + β2Xi2 + … + βpXip + εi
The first place error can occur is in the measurement of the independent variables (IVs), which are the β1Xi1 + β2Xi2 + … + βpXip part of the equation. The “p” is the number of IVs that you are dealing with. Mistakes can happen when these variables are measured. For example, if you are using shots directed at the opposing net as an IV, you may miss a shot or two, or a shot that drifts to the corner may be coded differently by two different people. If shots are being tracked electronically, there may be an error in coding or a blip in the signal that causes a shot attempt to go unrecorded. There are ways of dealing with such errors, largely through the use of more sophisticated modeling techniques such as Structural Equation Modeling. For now, measurement errors within the model are not particularly important because they have little to do with situating leadership within the analysis.
The second locus for errors is not an error per se, but I will include it here anyway just to keep the story complete. Residuals are basically the difference between the expected values (defined by the regression line) and the actual observed values. Those are not errors so much as issues with model fit.
The third, and most important, place where error is located in the regression equation is the term εi. The share of variability in the dependent variable (DV) that is left in this term is 1.00 (i.e. 100%) minus the R² value, where R² is the proportion of variability in the DV that can be attributed to the IVs. For example, if we create a multiple regression model with an R² of 0.62, the share left in εi would work out to 0.38 (i.e. 38% of the variability in the DV is not accounted for by the IVs). This all seems pretty straightforward, until you delve a bit deeper into it, as I will do in the next section.
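To make the arithmetic concrete, here is a minimal sketch in Python using NumPy. Everything in it is invented for illustration: the variable names (`shots`, `goals`), the coefficients, and the simulated data are assumptions, not real hockey numbers. The data are built so that roughly 64% of the variability in the DV comes from the single IV, and the code then recovers R² and the share left over from the fitted model.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef                # observed minus fitted values
    ss_res = float(np.sum(residuals ** 2))       # variability the model misses
    ss_tot = float(np.sum((y - y.mean()) ** 2))  # total variability in the DV
    return coef, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 500
shots = rng.normal(size=n)   # hypothetical standardized IV
other = rng.normal(size=n)   # stand-in for everything the model leaves out
goals = 1.0 + 0.8 * shots + 0.6 * other  # hypothetical DV

coef, r2 = fit_ols(shots, goals)
unexplained = 1.0 - r2       # share of DV variability not attributed to the IV
print(f"R^2 = {r2:.2f}, unexplained share = {unexplained:.2f}")
```

The two printed numbers always sum to 1, which is all the "1.00 minus R²" calculation amounts to.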
How Much Room Is There for Leadership in εi?
Continuing with the example above, if the share of variability left in εi works out to be 0.38, then variables we have not considered, including leadership, may account for that amount of the variability in the DV. However, as David A. Freedman usefully points out, this is not always the case. For purposes of this discussion, the main reason why εi is not always just the sum of the variables omitted from the analysis can be found in the assumptions of linear regression.
The key assumption that needs to be considered is that the IVs are not strongly correlated with each other. In current hockey analysis, pretty much all shot-based metrics are very likely to be correlated with one another, which means that multiple linear regression models can probably include only one shot-based metric as an IV. There is also the fear of creating endogenous variables (i.e. variables that can be explained by functions within the model), which would make the numbers produced by the model unreliable.
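One common way to check for this problem is the variance inflation factor (VIF), which regresses each IV on all the others and reports 1 / (1 − R²). The sketch below is only an illustration under invented assumptions: the three metrics (`all_attempts`, `unblocked`, `faceoffs`) are simulated, with the second deliberately built as a near copy of the first to mimic two overlapping shot-based metrics.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from a least-squares fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def vif(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(factors)

rng = np.random.default_rng(1)
n = 400
all_attempts = rng.normal(size=n)                    # hypothetical shot-attempt metric
unblocked = all_attempts + 0.2 * rng.normal(size=n)  # hypothetical near-duplicate metric
faceoffs = rng.normal(size=n)                        # hypothetical unrelated metric

vifs = vif(np.column_stack([all_attempts, unblocked, faceoffs]))
for name, value in zip(["all_attempts", "unblocked", "faceoffs"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

A frequently cited rule of thumb treats VIF values above roughly 5 to 10 as a warning sign. In this simulation the two overlapping shot metrics should come out far above that range, while the unrelated metric should sit close to 1.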
If we assume that leadership is not correlated with existing shot metrics (which is probably a safe assumption, but still needs to be empirically tested), we still cannot say that the maximum amount that leadership can account for in the model is the εi value. The reason is that there will be many factors that are not considered, and some of those factors will be correlated with each other. Thus the rules of multiple linear regression prohibit the model from ever realistically reaching a perfect fit. For example, if we included leadership in the multiple regression model, it would lower the εi value. However, we would not be able to add other IVs that are correlated with leadership, such as “mental toughness” or “determination.” Moreover, part of εi would also reflect speed, agility, and other such physical skills that we will be able to measure more accurately if/when SportVU filters down to fans.
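The point about omitted, mutually correlated factors can be sketched with simulated data. Everything here is hypothetical: no real leadership measure exists, so `leadership`, `toughness`, `speed`, and all coefficients are invented purely to show the mechanics. Leadership is generated to be uncorrelated with shots but strongly correlated with toughness.

```python
import numpy as np

def unexplained_share(X, y):
    """1 - R^2 from a least-squares fit of y on X with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2))

rng = np.random.default_rng(2)
n = 1000
shots = rng.normal(size=n)       # hypothetical shot-based metric
leadership = rng.normal(size=n)  # hypothetical score, independent of shots
# Hypothetical trait built to correlate with leadership at about 0.9
toughness = 0.9 * leadership + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)
speed = rng.normal(size=n)       # hypothetical physical skill, never measured here
wins = (0.7 * shots + 0.3 * leadership + 0.3 * toughness
        + 0.3 * speed + 0.4 * rng.normal(size=n))

base = unexplained_share(shots, wins)
with_leadership = unexplained_share(np.column_stack([shots, leadership]), wins)
print(f"shots only:         unexplained share = {base:.2f}")
print(f"shots + leadership: unexplained share = {with_leadership:.2f}")
```

Adding the leadership column lowers the unexplained share, but it does not drive it to zero, because speed and pure noise remain outside the model. Part of the drop also reflects toughness, which the leadership column partly stands in for since the two are correlated, so the change cannot be read as leadership's contribution alone.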
To sum all of this up, “kitchen sink” regression models that include everything you can find will almost always violate the key assumptions of regression analysis. Leadership currently resides in the εi of the regression model, and if we include it as an IV it would be subject to the same rules (multicollinearity) and issues (measurement accuracy) as any other variable in the model.
Leadership is a latent variable that currently exists in the εi of regression equations. A lack of data currently prevents us from measuring leadership. If we did measure it and include it in the regression analysis, it would be an IV expressed as βpXip.
Regression equations are a useful case study to show how “intangibles” can be combined with current analytics to form a more complete picture of what is happening on the ice. However, this is not the only way of combining old and new approaches to hockey analysis. In my next entry I will discuss mediating and moderating variables as a starting point for creating increasingly complex models.