This is the second post in a three-part series on leadership and hockey analytics. The first post, which you can find here, discusses leadership as a latent variable. The next two entries will go through where leadership slots into linear regression analysis, with this post focusing specifically on what regression is and how it works, and the next dealing with error terms and residuals. Just so you know what to expect, discussion of leadership will disappear for a while as I go through key points about regression analyses. It does come back toward the end of the post, once the stage is set and everyone (hopefully) has a firm enough foundation to understand what I am talking about.
Although I will try to use plain language as much as possible, some of the details will be technical and “mathy.” That is simply the nature of the beast. My advice to readers is to just skip past the things you do not understand, if there are any, and focus on getting the gist of the argument. Ideally this post will function both as a part of my larger argument about leadership and as a reference page for hockey fans who want to give regression analysis a whirl.
Just to be up front and honest, I have seen hockey bloggers attempt regression analysis a few times and, to be blunt, a lot of it has been disappointing. I understand the drive to apply advanced statistical procedures, such as regression analysis, to hockey analytics. After all, with the spate of recent hires in the analytics community, blogs have become a type of open audition that, if done well, can attract the attention of NHL teams looking to fill out their respective analytics departments. However, every advanced statistical procedure comes with many assumptions and/or conditions that absolutely have to be met in order to produce meaningful results. When the assumptions are not met the results should be treated with suspicion, at best, and may even be misleading or meaningless. For this reason, my secondary goal is to go through the logic and key assumptions so that readers can more easily spot the analyses that are essentially incorrect and/or nonsensical.
As usual, I have absolutely no interest in calling anyone in particular out, nor will I link to any illustrative examples of poorly done regression analyses. That game is boring and pointless, and I refuse to knock others down in an attempt to prop myself up. If I link to something you can assume that it is, in my eyes, solid. I also made this post link heavy, and if you click on words or phrases that are a different color and underlined you will be directed to far more detailed overviews. So, with that long introduction out of the way, it is time to move onwards and upwards.
What is Regression Analysis?
A regression is basically a model that describes the influence of one or more independent (or explanatory, or predictor) variables upon a dependent variable (the one you want to predict). One example in hockey would be zone entries, where it is possible to use regression analysis to chart how well controlled zone entries predict regulation wins. Regression models are evaluated by how accurately the independent variables predict future outcomes.
It is very important to note right from the beginning that there are hundreds of types of regression analyses. Each type of regression applies to a different situation, so selection of the correct form of regression is extremely important. Frustratingly, each type of regression also has its own set of rules and assumptions. As a result, using regression models well is all about practice and gaining useful criticism/feedback. Everyone makes mistakes early on and having a thick skin, accompanied by a willingness to accept and integrate constructive criticism, is a must.
The three types of regression analysis that have the most obvious applicability to hockey analytics in their current form (and this will change if and when SportVu data becomes available) are linear regression, multiple linear regression, and logistic regression. A linear regression is a simple model with one dependent variable and one independent variable. The relationship between the two can be drawn using a straight line. A multiple linear regression is similar to a linear regression, except that it contains two or more independent predictor variables. Logistic regression is a totally different animal, because the dependent variable is binary rather than continuous, which means it can be used to predict outcomes such as wins and losses. For purposes of keeping the word count down to a level where someone may actually read all of this I am going to focus on linear and multiple linear regressions for the remainder of this post. For those that are curious, fantastic information about logistic regression analysis can be found here.
The terminology surrounding the various forms of regression analyses can be confusing. I am going to take a moment to outline some key terms that are important when discussing linear and multiple linear regressions.
Regression equation: A linear regression essentially maps out a linear trend, which is expressed as a straight line. The equation looks like this:
Yi = β0 + β1Xi + εi
In the equation Y is the dependent variable and X is the independent variable. Note that this is simply another way of expressing the formula for a straight line, which is:
y = mx + b
My reason for pointing this out is that some programs, such as Excel, generate regression equations that follow the second form (straight line). The same information is there, but the position of terms on either end of the plus sign are reversed, and (more importantly) the error term is absent.
Intercept: The intercept is the point at which the straight line crosses the y (vertical) axis. In technical language, it is the value of the dependent variable when the independent variable is equal to zero. In the first equation it is β0, while in the second equation it is the b.
Slope: The slope is the direction (angle) that the line takes once it departs from the intercept. Speaking in math language slope is the amount of increase in y brought about by a unit increase in x, and is expressed as β1 in the regression equation and m in the equation for a straight line. Once we have this information the line is drawn, and we can use it to calculate a predicted value for the dependent variable based on a given value for the independent variable.
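To make the slope and intercept concrete, here is a minimal sketch of fitting a line by least squares in Python. The zone entry and goal numbers are invented purely for illustration:

```python
from statistics import mean

# Invented data: controlled zone entries per game (x) vs. goals per game (y)
x = [10, 12, 15, 18, 20, 25]
y = [1.0, 1.4, 1.9, 2.1, 2.6, 3.1]

x_bar, y_bar = mean(x), mean(y)

# Least squares slope: sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar   # the fitted line passes through (x_bar, y_bar)

def predict(x_new):
    """Point prediction from the fitted line: y = intercept + slope * x."""
    return intercept + slope * x_new
```

This is exactly the information a stats program hands you: once the slope and intercept are calculated, the line is fully determined and `predict` gives you a value of y for any x.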
Point estimation: The regression equation can be read as a type of number sentence that provides an unknown value of Y, which is referred to as a point estimate. For example, if we were plotting out rises in the NHL salary cap ceiling it may read: “The cap is set at $69 million (intercept) and goes up $2 million per year (slope), so the salary cap in six seasons is projected to be $81 million.” Solving the equation looks like this:
Y = $69 million + ($2 million per season) (6 seasons)
Y = $69 million + $12 million
Y = $81 million
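The same number sentence written out in code (the figures are the ones from the example above):

```python
# Point estimation with the straight-line form: Y = intercept + slope * X
intercept = 69_000_000   # current cap in dollars (beta_0, or b)
slope = 2_000_000        # projected increase per season (beta_1, or m)
seasons = 6

predicted_cap = intercept + slope * seasons   # $81 million
```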
Residuals: When you plot real data onto a chart it is immediately apparent that very few, if any, of the dots actually fall directly onto the regression line. The line itself is established by a mathematical criterion known as “least squares criterion for goodness of fit,” which essentially calculates the line that most closely matches the data. A residual is the vertical distance between an individual data point and the regression line.
r² (pronounced “r-squared”, also known as the coefficient of determination): r² is a numeric expression of how much of the variability in the dependent variable can be attributed to the independent variable. In other words, r² is an effect size. This score ranges from 0 to 1, and the number can be expressed as a percentage. For example, in a linear regression model with an r² of 0.34 we can say that 34% of the variability in the dependent variable can be attributed to the independent variable. As a general rule of thumb that is used in the social sciences, an r² of 0.2 is the lower threshold for a successful model.
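Both residuals and r² fall out of a few lines of arithmetic. This sketch assumes a line has already been fitted; the observed values and predictions are invented for illustration:

```python
from statistics import mean

y     = [2.0, 3.1, 3.9, 5.2, 6.0]   # observed values (made up)
y_hat = [2.1, 2.9, 4.0, 5.0, 6.1]   # predictions from some fitted line

# A residual is the vertical distance between each point and the line
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# r^2 = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
```

The closer the points sit to the line, the smaller the squared residuals, and the closer r² gets to 1.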
Error: In the regression equation that is listed above, the ε represents “error.” The error is random in the sense that it is specific to each observation (i.e. some observations have less error, and are closer to the regression line, than others). A common assumption is that the error term represents the sum total of the potential effects of the variables that are missing from the equation. That assumption is mathematically questionable, as explained here. So while I have no issue with saying that a part of the error term in some regression equations may be attributed to the absence of a measure for leadership, I do so knowing full well that showing this is a complex process. For now, let’s just say that if fancy stats can predict about 60% of wins, some part of the other 40% may plausibly be accounted for by leadership and other factors. Attributing the entire error term to luck is a stretch.
Data cleaning: It seems odd to include data cleaning in this list, but it is important. Linear regression models are highly susceptible to extreme cases. If Erik Karlsson’s shot attempts and points, for example, are way above the normal trend for NHL defensemen it is important to remove him from the analysis. If you do not, the prediction will be off as a result of extreme cases pulling the line toward themselves. To use a non-hockey example, imagine a neighbourhood where almost everyone is making about $80-100k per year. Bill Gates moves in, and he takes in $2 billion that year. If you leave Gates in the analysis, the line may estimate salaries in the area as being in the hundreds of thousands, which is grossly inflated due to this one extreme case (known as an outlier). A model that does not remove outliers has questionable value as a tool for predicting outcomes. A straightforward overview of rules for detecting outliers can be found here.
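One common rule of thumb (one of several covered in the overview linked above) flags any point sitting more than 1.5 interquartile ranges outside the middle 50% of the data. A quick sketch of the Bill Gates neighbourhood, with invented incomes in thousands of dollars:

```python
from statistics import quantiles

incomes = [80, 85, 88, 92, 95, 100, 2_000_000]  # thousands; one Gates-sized outlier

q1, _, q3 = quantiles(incomes, n=4)   # quartile cut points
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = [x for x in incomes if low <= x <= high]  # drops the extreme case
```

After cleaning, the regression line reflects the neighbourhood rather than the one billionaire.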
Multiple linear regression: A multiple linear regression is simply a linear regression with two or more independent variables. The equation looks like this:
Yi = β0 + β1Xi1 + β2Xi2 + … + βpXip + εi
Note that the middle part of the old equation that contained the slope (β1) and independent variable (X) is now expanded to include multiple independent variables. The interesting thing about multiple regression analysis is that adding a variable to the equation will always increase, and never decrease, the R² (note that a capital R is used in multiple regression, but it means the same thing as the old r²). However, you have to be extremely careful about which variables you add into the mix, as we will see when going through the basic assumptions of linear regression models. Just as a heads-up for what is coming: adding a leadership scale is a fantastic choice that meets all of the standards.
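That “always increases, never decreases” property is easy to demonstrate. The sketch below (using numpy, with entirely made-up data) fits a model with one real predictor, then adds a second predictor that is pure noise, and R² still does not go down:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)              # pure noise, unrelated to the outcome
y = 2.0 * x1 + rng.normal(size=n)    # outcome driven by x1 only

def r_squared(predictors, y):
    """R^2 from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    total = y - y.mean()
    return 1 - (resid @ resid) / (total @ total)

r2_one = r_squared([x1], y)        # x1 alone
r2_two = r_squared([x1, x2], y)    # add the noise variable
```

This is why a high R² on its own is not proof of a good model: you can always inflate it by throwing in more variables.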
Now that we have the terminology straight, the next step is to quickly go through the assumptions of linear (and multiple linear) regression models. I am not likely to make many friends here, largely because (based on what I have seen) these assumptions have often been ignored by hockey bloggers playing with regressions, but I think it is important to go through them. There are four key assumptions that are absolutely essential in regression models, and a fifth assumption is added for multiple linear regression models. These assumptions are rules, not suggestions. If you ignore any of these assumptions your results will be hot garbage on a stick. At a bare minimum, if any of the assumptions is violated the author must alert the reader and explain why the issue was not fixed.
I am going to run through these quickly in a down and dirty way, but if you want more detail about the consequences of violating regression assumptions you can find them in this paper. If you are looking for more detail, and a more complete and complex overview of regressions and their assumptions, this is a great place to start.
Linearity: There must be a linear relation between each independent variable and the dependent variable. If, for example, you are using save percentage as a dependent variable and age as an independent variable you may run into trouble. The reason is that age goes up in a continuous linear fashion, while save percentage starts low, rises until about 23 or 24, and then tails off. This suggests there may be a curvilinear rather than linear relationship, but it is impossible to know without testing the relationship. I would not trust the results unless the person doing the analysis provided F-test scores showing that the relationship falls within the threshold for linearity.
Homoscedasticity: This looks at whether the variance in the errors is increasing or decreasing over time. The basic idea is that if the errors shrink or grow at either end of the regression line then all of the estimates are likely to be off. To go back to the salary cap example, the real cap increases are not evenly spread out at roughly $2 million per season. If one season saw an increase of $1 million, the next $3 million, followed by $2 million and so on, the model is probably fine. However, if the increases run $2 million, $1 million, $2 million, $2.5 million, $3 million, $4 million, $10 million, etc., then the variability is increasing over time, which makes the model heteroscedastic.
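A crude way to eyeball this (a simplified cousin of the formal Goldfeld-Quandt test) is to order the residuals by time or by fitted value and compare the variance of the second half to the first half. The residual values here are invented for illustration:

```python
from statistics import pvariance

# Residuals ordered by time (or by fitted value); values invented for illustration
residuals_ok  = [0.5, -0.4, 0.3, -0.5, 0.4, -0.3]   # spread stays even
residuals_bad = [0.1, -0.1, 0.2, -0.5, 1.5, -3.0]   # spread fans out over time

def variance_ratio(residuals):
    """Variance of the second half of the residuals over the first half."""
    half = len(residuals) // 2
    return pvariance(residuals[half:]) / pvariance(residuals[:half])
```

A ratio near 1 is reassuring; a very large (or very small) ratio suggests heteroscedasticity.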
Normality: This is another assumption that can easily be problematic for hockey analytics. The variables used in a regression model are assumed to have a normal distribution, which means the values should look like a bell curve when you plot them out. If the bell curve looks like it is leaning over to the right or left the variable is skewed, and if the bell curve is low and flat it is said to be kurtotic. I have not looked at all of them, but hockey stats seem prone to having weird distributions. The solution is to use statistical transformations, such as log, square root, cube root, etc., to reshape the data until it conforms to a bell curve. The process is described in Tukey’s Ladder of Powers, which is summarized here. This is not a fun one to deal with, and it may lead to issues with interpretation of results. The thing is that if you do not do a transformation when you have data that is skewed or kurtotic, the ensuing analysis will produce numbers that are flat out wrong.
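Here is one rung of Tukey’s ladder in action: a quick skewness calculation (the standard third-moment formula) before and after a log transformation, using an invented right-skewed sample:

```python
from math import log
from statistics import mean, pstdev

def skewness(values):
    """Standard third-moment skewness: roughly 0 for a symmetric distribution."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

raw = [1, 1, 2, 2, 3, 3, 4, 5, 8, 40]   # long right tail (invented data)
logged = [log(v) for v in raw]          # one rung down Tukey's ladder
```

The log step pulls the long tail in, so the skewness of `logged` sits much closer to zero than that of `raw`.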
Independence: This is more specifically known as independence of errors. The general idea is that all of the variables must be independent from one another. A common problem is serial correlation, where the variables are obviously interacting in systematic ways. Another error, which occurs on occasion in hockey blogs, is using endogenous variables. Endogenous variables are jointly determined within the system rather than being independent. For example, if an analytics blogger used corsi and save percentage in the same regression analysis the result would be a violation of the independence assumption. The reason is that corsi is defined as shots on goal plus shots that miss the goal or are blocked, while save percentage is shots saved divided by shots on goal. Note that the term “shots on goal” appears once in each of the formulas used in calculating these supposedly separate variables. As a side note, the problem with many of the fancy stats used today is that they share too much common ground, most notably shots directed toward the net. SportVu will add speed and direction, which will help with respect to making good regression models. Another possibility is (surprise surprise) adding elements like leadership scales.
Collinearity (multiple regression only): Also known as multicollinearity. In multiple regression models, each independent variable must be uncorrelated with the others. The reason is that it is almost impossible to assess the independent impact of independent variables that are highly correlated (check this using tolerance and variance inflation factor, VIF, calculations). The same critique about hockey analytics found under “Independence” applies here.
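Tolerance and VIF are straightforward to compute by hand: regress each predictor on the others, and VIF = 1 / (1 − R²) from that auxiliary regression (tolerance is just 1 − R²). A numpy sketch with made-up predictors, one pair nearly collinear:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    n, k = X.shape
    factors = []
    for j in range(k):
        # Regress column j on all the other columns (plus an intercept)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        total = X[:, j] - X[:, j].mean()
        r2 = 1 - (resid @ resid) / (total @ total)
        factors.append(1 / (1 - r2))
    return factors

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)               # unrelated to a -> VIFs near 1
c = a + 0.05 * rng.normal(size=100)    # nearly a copy of a -> huge VIFs

vif_ok = vif(np.column_stack([a, b]))
vif_bad = vif(np.column_stack([a, c]))
```

A common rule of thumb treats a VIF above roughly 5 or 10 as a red flag that two predictors are measuring the same thing.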
I have no way of knowing how many analytics bloggers ensure that the assumptions of a regression are met before using the test, largely because they tend to omit key pieces of information when they present their findings (or, in some cases, “findings”). The common argument is that people looking through hockey blogs don’t want to be bogged down with heavy numbers. Fair enough. However, outsiders still need some way of checking to ensure that the numbers they are reading are a) accurate, and b) meaningful. I would suggest providing a link to a separate page that goes through the key diagnostic information.
With respect to leadership, examination of linear regression models shows that leadership would slot into the error term. In the case of multiple linear regression modeling, leadership would fit well as a separate variable, due to the fact that it would add to the predictive power of the model without being endogenous.