It has been a little while since I put up a new entry, and I want to start with a deeply personal note. We have the usual back to school busy season, except with all of the extras that go along with having a daughter who is a brain tumor survivor. That means a lot of meetings, a lot of extra discussions and planning, and a lot of extra stress. The good news is that my daughter completed her first week of full school days in three years. She was diagnosed part way through grade 1, which is when full days started for her, so she has actually had less than 10 full weeks of full days at school in her entire life. This was a huge accomplishment.
Moving on to hockey, I have had a few ideas for entries, but they will take a bit more time to pull together. I’ll talk a bit about those after I post a few additional comments about my theory and hockey analytics entry, as well as goaltending metrics.
Theory and Big Data
In this entry I argued that theory is central to any good quantitative analysis. My basic argument is that if we keep running measures against one another, we will eventually come up with correlations. However, those correlations may not make sense and may not actually be “real.” The chances of this happening increase as our data sets grow, because when data sets become huge, even tiny differences become statistically significant.
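To make the sample-size point concrete, here is a minimal sketch using a standard two-proportion z-test. The save percentages (.915 and .916) and the shot counts are invented purely for illustration, and the helper name `two_proportion_z` is hypothetical. It assumes both goalies face the same number of shots.

```python
import math

def two_proportion_z(p1, p2, n):
    """Two-sided z-test for two proportions, each observed on n trials."""
    pooled = (p1 + p2) / 2  # equal n, so the pooled rate is the plain average
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)  # standard error of the gap
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Made-up example: the same .001 gap in save percentage at three sample sizes.
for n in (1_000, 100_000, 10_000_000):
    z, p = two_proportion_z(0.915, 0.916, n)
    print(f"n = {n:>10,}  z = {z:5.2f}  p = {p:.4f}")
```

The gap between the two goalies never changes, but the test statistic grows with the square root of the sample size. The same trivial difference is nowhere near significant at 1,000 shots and is overwhelmingly "significant" at ten million, which says nothing about whether it matters on the ice.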
A few days ago I came across a video that says a lot of the same things, although this speaker is far more gifted than I am at putting these points across:
I encourage everyone reading this to take 15 minutes and watch this video. It is very relevant to what we are seeing today, with the founding of analysis sites such as War on Ice and Progressive Hockey. I want to say up front that in no way, shape, or form am I criticizing such sites. The people who run these pages are fantastic at what they do, and the information they share is a gold mine for hockey fans. I appreciate and applaud their effort.
My point is simply that users of such “big data” need to be cautious. Finding a strong association while playing with the data does not necessarily mean that the association is real. This is much like the first segment of the video clip, when Jo Røislien talks about how, if a large number of things fall on the floor in a random way, you may find your initial in them. Seeing something in randomness does not always mean it is a real “thing.”
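The same idea can be sketched with pure noise. The example below is an illustration under stated assumptions, not an analysis of real hockey data: it invents 40 meaningless "measures" for 30 teams, correlates every pair, and counts how many clear a cutoff of about 0.36, which is roughly the conventional 5% threshold for a correlation on 30 observations. The names `pearson` and `count_spurious` are hypothetical helpers.

```python
import itertools
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def count_spurious(n_measures=40, n_teams=30, threshold=0.36, seed=1):
    """Count pairs of random, unrelated measures whose |r| exceeds threshold."""
    rng = random.Random(seed)
    # Every "measure" is independent noise, so no real relationship exists.
    measures = [[rng.gauss(0, 1) for _ in range(n_teams)]
                for _ in range(n_measures)]
    pairs = list(itertools.combinations(measures, 2))
    flagged = sum(1 for a, b in pairs if abs(pearson(a, b)) > threshold)
    return flagged, len(pairs)

flagged, total = count_spurious()
print(f"{flagged} of {total} pairs of pure-noise measures look 'significant'")
```

With 40 measures there are 780 pairs to compare, so even a 5% false-positive rate should turn up a few dozen "findings" that mean nothing. Without a theory to say which comparisons are worth making, there is no way to tell those apart from real associations.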
I said it before, and I will keep saying it. Theory, theory, and more theory is the key.
Save Percentages and Measuring Goaltender Quality
My plan moving forward after my previous post about age and goaltender regression was to track save percentages for goaltenders based on scores (e.g. tied, up one goal, up more than one goal, down one goal, down more than one goal). My theory was that maybe better goaltenders in the league were the ones who have great percentages when it matters most, which is tied or within one goal either way. However, within days of planning this out, and trying to figure out how I would collect and compile the data in the upcoming season, War on Ice went up and promptly added a feature that allows users to track goalie save percentages by score, controlling for minutes played. They basically did what I wanted to do, only they did it much better.
I am eternally grateful to them for this, because when I sorted through some of the numbers they were not good in terms of providing evidence for my theory. I may revisit this in more detail down the line, but at this point I think it is safe to say that (at least for me) this data is one more nail in the coffin for the theory that save percentages are a measure of how great a goaltender is.
In the next few weeks I will be taking on two larger projects that will take much more time than my usual posts. The first is to come up with Bayesian estimates for each team making the playoffs based on spending and one of the fancy stats. I will select the fancy stat I want to use after running the assorted measures against success reaching the post season. I tried to find this information online, and I was shocked that I could not find it. Anyway, once I select my variable I plan on doing a test run based on last year’s information. I will re-run the analysis at the 20 game point of the season based on where teams stand at that point. That should probably do it in terms of the prediction, unless something drastically changes.
The second post that I am working on is more complicated. I’ve seen a few bloggers starting to play with linear regressions, which is cool. Regressions come with quite a few requirements, which I will go through. There is also an error term that people often omit or forget. What this error term means is pretty complicated as well, but it is important in terms of formulating an analysis and making statements based on the results of that analysis. To make a long story short, I’ll probably spend a lot of time talking about the ins and outs of regressions, such as linearity and additivity, statistical independence, homoscedasticity, and normality. Whether these assumptions hold can have a huge impact on a regression analysis.
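As a rough preview of where the error term sits, here is a minimal sketch of an ordinary least squares fit with one predictor, written as y = a + b·x + e. The data are synthetic (a true intercept of 2.0, a true slope of 0.5, and normally distributed noise, all chosen arbitrarily), and the split-half comparison of residual spread is only a crude eyeball check on homoscedasticity, not a formal test. The helper names `fit_line` and `spread` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x + error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return intercept, slope

def spread(values):
    """Population standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

# Synthetic data: the rng.gauss term plays the role of the error term e.
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

intercept, slope = fit_line(xs, ys)
# Residuals are the fitted model's estimates of the unobserved errors.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Crude homoscedasticity check: residual spread for low x versus high x.
ordered = [r for _, r in sorted(zip(xs, residuals))]
half = len(ordered) // 2
low_spread, high_spread = spread(ordered[:half]), spread(ordered[half:])

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"residual spread: low x = {low_spread:.2f}, high x = {high_spread:.2f}")
```

If the two spreads were very different, or the residuals curved when plotted against x, that would be a hint that the constant-variance or linearity assumption is in trouble and that the usual conclusions from the regression deserve less trust.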
That’s it for today, short and sweet as promised.