NHL Home Ice Advantage

Description

Most sports have some notion of a home team advantage. In this analysis, we examine data from NHL games to determine whether there is a home ice advantage and, if so, whether some metric can explain it.

library(dplyr)  # provides mutate(), if_else(), filter(), and distinct()

df <- read.csv("https://moneypuck.com/moneypuck/playerData/careers/gameByGame/all_teams.csv")
df <- df |>
  mutate(result = if_else(goalsFor > goalsAgainst, 1, 0)) |>
  mutate(home_or_away = if_else(home_or_away == "HOME", 1, 0)) |>
  filter(situation == 'all') |>
  distinct(gameId, .keep_all = TRUE)

df <- df[c("home_or_away", "goalsFor", "goalsAgainst", "shotsOnGoalFor", "shotsOnGoalAgainst", "hitsFor", "hitsAgainst")]

#write_csv(df, file="df_clean.csv")

Data Description

The data comes from a large dataset of stats from every NHL game since 2008. After cleaning the data, there are 7 columns in the dataset: home_or_away, goalsFor, goalsAgainst, shotsOnGoalFor, shotsOnGoalAgainst, hitsFor, and hitsAgainst. From this dataset of 20,430 unique NHL games, we will randomly select 1000 games.

df <- read.csv("df_clean.csv")
set.seed(7496)
sample <- df[sample(nrow(df), 1000, replace=FALSE), ]

First, we summarize the data to see how many sampled games fall into the home and away groups (the split should be close to 50/50) and determine the average goals scored in each group. We also look at the distribution of goals in each group. Although both the home and away scores are available for every game, we use only one team's score per game. The two-sample test requires independent samples, and the home and away scores from the same game likely influence each other.

table(sample$home_or_away)
## 
##   0   1 
## 509 491
boxplot(goalsFor ~ home_or_away, data=sample, main="NHL Goals per Game", ylab="Goals", xlab="", names=c("Away","Home"))

home_goals <- sample$goalsFor[sample$home_or_away == 1]
summary(home_goals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.081   4.000   9.000
away_goals <- sample$goalsFor[sample$home_or_away == 0]
summary(away_goals)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   2.654   4.000   8.000
hist(home_goals, main="Home Goals", xlab="Goals")

hist(away_goals, main="Away Goals", xlab="Goals")

The mean goals scored by home teams (3.08) and away teams (2.65) differ slightly, but the difference is too small to see on a boxplot, where both groups share the same median and quartiles. Although the histograms are skewed to the right and not normally distributed, they have similar shapes. Next, we will formally test whether the average goals scored by the home team is greater than that of the away team at a significance level of alpha = 0.05.

Analysis

Does the home team score more goals than the away team?

H0: u_home = u_away
H1: u_home > u_away
alpha = 0.05

Using 2-sample t test statistic:

t = (x1_bar - x2_bar) / sqrt((s1^2/n1) + (s2^2/n2))

Reject H0 if p <= 0.05, otherwise, do not reject H0.

t.test(sample$goalsFor[sample$home_or_away == 1], sample$goalsFor[sample$home_or_away == 0], alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  sample$goalsFor[sample$home_or_away == 1] and sample$goalsFor[sample$home_or_away == 0]
## t = 4.0598, df = 983.72, p-value = 2.651e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.2539792       Inf
## sample estimates:
## mean of x mean of y 
##  3.081466  2.654224
t.test(sample$goalsFor[sample$home_or_away == 1], sample$goalsFor[sample$home_or_away == 0], alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  sample$goalsFor[sample$home_or_away == 1] and sample$goalsFor[sample$home_or_away == 0]
## t = 4.0598, df = 983.72, p-value = 5.302e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2207269 0.6337580
## sample estimates:
## mean of x mean of y 
##  3.081466  2.654224

Since the p-value of the one-sided test is less than 0.05, we reject the null hypothesis H0. There is significant evidence, at the alpha = 0.05 level, that u_home is greater than u_away; that is, the home team scores more goals on average than the away team. From the two-sided test, we are 95 percent confident that the true difference in mean goals per game (home minus away) is between 0.221 and 0.634.
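To make the test statistic above concrete, the following is a minimal sketch that computes the Welch t statistic, its degrees of freedom, and the one-sided p-value by hand, then compares them with `t.test()`. It runs on simulated Poisson goal counts, not the NHL sample; the names `home`, `away`, and the rates 3.1 and 2.7 are assumptions made only for illustration.

```r
# Simulated stand-in data (NOT the NHL sample): Poisson goal counts
set.seed(1)
home <- rpois(491, lambda = 3.1)  # hypothetical home scoring rate
away <- rpois(509, lambda = 2.7)  # hypothetical away scoring rate

# Each group's contribution to the squared standard error: s^2 / n
v_home <- var(home) / length(home)
v_away <- var(away) / length(away)

# Welch statistic: difference in means over the unpooled standard error
t_stat <- (mean(home) - mean(away)) / sqrt(v_home + v_away)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_home + v_away)^2 /
  (v_home^2 / (length(home) - 1) + v_away^2 / (length(away) - 1))

# One-sided p-value for H1: u_home > u_away
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

# The built-in test should agree with the hand computation
fit <- t.test(home, away, alternative = "greater")
print(all.equal(unname(fit$statistic), t_stat))
print(all.equal(unname(fit$parameter), df_welch))
print(all.equal(fit$p.value, p_value))
```

The same relationship explains the two outputs above: with a positive t statistic, the two-sided p-value (5.302e-05) is exactly twice the one-sided p-value (2.651e-05).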

Next we will examine the relationship between shots on goal and goals scored and determine if there is a linear relationship between the two variables.

Is there a linear relationship between goals scored and shots on goal?

plot(sample$shotsOnGoalFor, sample$goalsFor, ylab="Goals", xlab="Shots")

cor(sample$shotsOnGoalFor, sample$goalsFor)
## [1] 0.1313744

A correlation of 0.131 indicates a weak, positive linear association between shots on goal and goals scored. Next, we will fit a linear model and test the linear relationship between the 2 variables.

H0: beta1 = 0
H1: beta1 != 0
alpha = 0.025

Using F statistic: F = Reg MS/Res MS

Reject H0 if p <= 0.025, otherwise, do not reject H0.

m <- lm(sample$goalsFor~sample$shotsOnGoalFor)
summary(m)
## 
## Call:
## lm(formula = sample$goalsFor ~ sample$shotsOnGoalFor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1667 -1.1006 -0.0675  1.1639  5.8664 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.877488   0.241417   7.777 1.85e-14 ***
## sample$shotsOnGoalFor 0.033057   0.007896   4.187 3.08e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.66 on 998 degrees of freedom
## Multiple R-squared:  0.01726,    Adjusted R-squared:  0.01627 
## F-statistic: 17.53 on 1 and 998 DF,  p-value: 3.082e-05
plot(m)

resid <- resid(m)
hist(resid)

The residuals look approximately normally distributed and there are no major outliers. Since the p-value is less than 0.025, we reject the null hypothesis H0. There is significant evidence, at the alpha = 0.025 level, that beta1 is not equal to 0 and that there is a linear relationship between shots on goal and goals scored. The estimate for beta1 is 0.033, which means, in the context of this data, that the number of goals scored is expected to increase by about 0.33 for every additional 10 shots on goal. Based on the adjusted R-squared value of 0.01627, only about 1.63% of the variability in goals scored can be explained by the number of shots on goal.
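The F statistic used above (Reg MS / Res MS) can be rebuilt from the sums of squares of a fitted model. The sketch below does this on simulated data, not the NHL sample; the variables `shots` and `goals` and the rates used to generate them are illustrative assumptions. It also checks that, with a single predictor, F equals the square of the slope's t value.

```r
# Simulated stand-in data (NOT the NHL sample)
set.seed(2)
shots <- rpois(1000, lambda = 30)                # hypothetical shots on goal
goals <- rpois(1000, lambda = 2 + 0.03 * shots)  # hypothetical goals
fit <- lm(goals ~ shots)

# Regression sum of squares (1 df) and residual sum of squares (n - 2 df)
ss_reg <- sum((fitted(fit) - mean(goals))^2)
ss_res <- sum(residuals(fit)^2)

# F = Reg MS / Res MS
f_stat <- (ss_reg / 1) / (ss_res / df.residual(fit))

# In simple linear regression the F statistic is the squared slope t value
t_slope <- coef(summary(fit))["shots", "t value"]
print(all.equal(f_stat, unname(summary(fit)$fstatistic["value"])))
print(all.equal(f_stat, t_slope^2))
```

The output above shows the same identity: the slope's t value of 4.187 squares to about 17.53, the reported F statistic, and both carry the same p-value.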

Since the data for hits per game is also available, we can add it as an explanatory variable to create a multiple linear regression model.

m.mult <- lm(sample$goalsFor~sample$shotsOnGoalFor + sample$hitsFor)
summary(m.mult)
## 
## Call:
## lm(formula = sample$goalsFor ~ sample$shotsOnGoalFor + sample$hitsFor)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2431 -1.1118 -0.0912  1.1451  5.6861 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.179169   0.279119   7.807 1.47e-14 ***
## sample$shotsOnGoalFor  0.033287   0.007883   4.223 2.63e-05 ***
## sample$hitsFor        -0.013015   0.006076  -2.142   0.0324 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.657 on 997 degrees of freedom
## Multiple R-squared:  0.02176,    Adjusted R-squared:  0.0198 
## F-statistic: 11.09 on 2 and 997 DF,  p-value: 1.725e-05

With a p-value of 0.0324, the coefficient for hits is not statistically significant at the alpha = 0.025 level used above (it would be at alpha = 0.05), when controlling for the number of shots on goal. The adjusted R-squared of the model does, however, improve to about 0.02, which means that the multiple linear regression model fits this sample data slightly better than the simple linear regression.
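One way to formalize the comparison between the simple and multiple regression models is a partial F test on the nested models. The following is a sketch on simulated data, not the NHL sample; the data frame `games`, its columns, and the generating rates are assumptions for illustration. When one predictor is added, the partial F statistic equals the squared t value of that predictor's coefficient, so the two tests give the same p-value.

```r
# Simulated stand-in data (NOT the NHL sample)
set.seed(3)
n <- 1000
games <- data.frame(
  shots = rpois(n, lambda = 30),  # hypothetical shots on goal
  hits  = rpois(n, lambda = 22)   # hypothetical hits
)
games$goals <- rpois(n, lambda = 2.2 + 0.03 * games$shots - 0.01 * games$hits)

# Nested models: the second adds hits to the first
m_simple <- lm(goals ~ shots, data = games)
m_mult   <- lm(goals ~ shots + hits, data = games)

# Partial F test: does adding hits reduce the residual sum of squares
# by more than chance alone would suggest?
cmp <- anova(m_simple, m_mult)
f_partial <- cmp$F[2]
p_partial <- cmp$`Pr(>F)`[2]

# Equivalent to the t test on the hits coefficient in the larger model
hits_row <- coef(summary(m_mult))["hits", ]
print(all.equal(f_partial, unname(hits_row["t value"])^2))
print(all.equal(p_partial, unname(hits_row["Pr(>|t|)"])))
```

Passing the data frame through `data =` rather than writing `games$goals ~ games$shots` keeps the coefficient labels short and lets `predict()` work on new data.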

Next, we will examine whether the number of shots on goal can explain the home ice advantage or whether there are other contributing factors.

Is there a difference between the mean goals scored of home and away teams when taking number of shots into account?

library(car)  # provides Anova() with Type III tests
Anova(lm(sample$goalsFor ~ sample$home_or_away + sample$shotsOnGoalFor), type=3)
## Anova Table (Type III tests)
## 
## Response: sample$goalsFor
##                        Sum Sq  Df F value    Pr(>F)    
## (Intercept)            163.64   1  59.956 2.375e-14 ***
## sample$home_or_away     30.14   1  11.045 0.0009218 ***
## sample$shotsOnGoalFor   32.84   1  12.033 0.0005450 ***
## Residuals             2721.04 997                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANCOVA shows the difference between home and away teams after adjusting for covariates in the model. This analysis adjusts for the number of shots on goal to see if the differences between the home and away teams are still significant or rather due to differences in number of shots on goal between the groups.

In this case, the conclusion is unchanged from the t test. The p-value for home_or_away (0.0009) shows that the difference between home and away teams is still significant after adjusting for shots on goal.
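The adjusted comparison can also be sketched with base R only, as a check on what the ANCOVA is testing. The example below uses simulated data, not the NHL sample; the data frame `games`, its columns, and the effect sizes are assumptions for illustration. It tests the home effect by comparing a model with and without the home indicator, both including shots. For a model with no interaction terms, these marginal F tests should correspond to the predictor rows of a Type III table.

```r
# Simulated stand-in data (NOT the NHL sample)
set.seed(4)
n <- 1000
games <- data.frame(home = rbinom(n, size = 1, prob = 0.5))
games$shots <- rpois(n, lambda = 29 + 2 * games$home)  # hypothetical
games$goals <- rpois(n, lambda = 1.8 + 0.03 * games$shots + 0.3 * games$home)

full    <- lm(goals ~ home + shots, data = games)
reduced <- lm(goals ~ shots, data = games)

# F test for the home effect after adjusting for shots on goal
adj <- anova(reduced, full)
f_home <- adj$F[2]

# drop1() reports the same marginal F test for every term at once
marginal <- drop1(full, test = "F")
print(all.equal(f_home, marginal["home", "F value"]))

# Compare the home effect before and after adjusting for shots
unadjusted <- unname(coef(lm(goals ~ home, data = games))["home"])
adjusted   <- unname(coef(full)["home"])
print(c(unadjusted = unadjusted, adjusted = adjusted))
```

If shots on goal accounted for the home effect, the adjusted coefficient would shrink toward zero relative to the unadjusted one and its F test would lose significance.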

Conclusion

Based on the above analysis, there is evidence of a home ice advantage in the NHL, based on the number of goals scored per game. There is also evidence of a linear relationship between goals scored and shots on goal. However, the number of shots on goal does not explain the difference in goals scored per game between home and away teams.