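library(dplyr)  # provides mutate(), if_else(), filter(), distinct()

df <- read.csv("https://moneypuck.com/moneypuck/playerData/careers/gameByGame/all_teams.csv")
df <- df |>
  mutate(result = if_else(goalsFor > goalsAgainst, 1, 0)) |>       # 1 = more goals for than against
  mutate(home_or_away = if_else(home_or_away == "HOME", 1, 0)) |>  # 1 = home, 0 = away
  filter(situation == 'all') |>                                    # keep only the 'all' situation rows
  distinct(gameId, .keep_all = TRUE)                               # keep one team's row per game
# Keep only the columns used in the analysis
df <- df[c("home_or_away", "goalsFor", "goalsAgainst", "shotsOnGoalFor", "shotsOnGoalAgainst", "hitsFor", "hitsAgainst")]
#write_csv(df, file="df_clean.csv")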
First, we will summarize the data to see how many of the sampled games are home and away games (the split should be close to 50/50) and compare the average goals scored in each group. We will also look at the distributions of the data. Although both the home and away scores from each game are available, we will use only one team’s score from each game. The two-sample test below assumes that the two populations are independent, and the home and away scores from the same game probably influence each other in some way.
##
## 0 1
## 509 491
boxplot(goalsFor ~ home_or_away, data=sample, main="NHL Goals per Game", ylab="Goals", xlab="", names=c("Away","Home"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 3.081 4.000 9.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.654 4.000 8.000
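The sampling and summary steps behind the output above are not shown. A minimal sketch of how they might look, assuming a simple random sample of 1,000 games drawn without replacement: the data here are simulated so the example runs offline, and the seed, the stand-in data frame `games`, and the name `game_sample` are all hypothetical (the report itself calls its sample `sample`).

```r
# Simulated stand-in for the cleaned data frame; NOT the MoneyPuck data
set.seed(1)  # hypothetical seed, not taken from the original analysis
games <- data.frame(
  home_or_away = rbinom(5000, 1, 0.5),
  goalsFor     = rpois(5000, lambda = 3)
)

# Assumed sampling step: 1,000 rows drawn without replacement
game_sample <- games[sample(nrow(games), 1000), ]

# Counts of away (0) and home (1) rows; the split should be near 50/50
counts <- table(game_sample$home_or_away)
print(counts)

# Summary of goals scored, home rows first and away rows second
home_summary <- summary(game_sample$goalsFor[game_sample$home_or_away == 1])
away_summary <- summary(game_sample$goalsFor[game_sample$home_or_away == 0])
print(home_summary)
print(away_summary)
```

Because the data are simulated, the printed counts and summaries will not match the values reported in this section.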
The average goals scored by home and away teams differ slightly (3.08 versus 2.65), but the difference is too small to see on a boxplot, where both groups have a median of 3. Although the histograms are skewed to the right and not normally distributed, they at least have similar shapes. Next, we will formally test whether the average goals scored by the home team is greater than that of the away team at a significance level of alpha = 0.05.
H0: u_home = u_away
H1: u_home > u_away
alpha = 0.05
Using the 2-sample t test statistic:
t = (x1_bar - x2_bar) / sqrt((s1^2/n1) + (s2^2/n2))
Reject H0 if p <= 0.05, otherwise, do not reject H0.
t.test(sample$goalsFor[sample$home_or_away == 1], sample$goalsFor[sample$home_or_away == 0], alternative = "greater")
##
## Welch Two Sample t-test
##
## data: sample$goalsFor[sample$home_or_away == 1] and sample$goalsFor[sample$home_or_away == 0]
## t = 4.0598, df = 983.72, p-value = 2.651e-05
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.2539792 Inf
## sample estimates:
## mean of x mean of y
## 3.081466 2.654224
t.test(sample$goalsFor[sample$home_or_away == 1], sample$goalsFor[sample$home_or_away == 0], alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: sample$goalsFor[sample$home_or_away == 1] and sample$goalsFor[sample$home_or_away == 0]
## t = 4.0598, df = 983.72, p-value = 5.302e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2207269 0.6337580
## sample estimates:
## mean of x mean of y
## 3.081466 2.654224
Since the one-sided p-value (2.651e-05) is less than 0.05, we reject the null hypothesis H0. There is significant evidence, at the alpha = 0.05 level, that u_home is greater than u_away; that is, the home team scores more goals on average than the away team. From the two-sided test, we are 95 percent confident that the true difference in means (home minus away) is between 0.221 and 0.634 goals per game.
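The reported test output can be cross-checked with a little arithmetic. The sketch below backs the standard error out of the two-sided confidence interval and then recomputes the t statistic, the one-sided p-value, and the one-sided lower bound. All numbers are copied from the t.test output above; the variable names are illustrative.

```r
# Values copied from the t.test() output above
mean_home    <- 3.081466
mean_away    <- 2.654224
ci_two_sided <- c(0.2207269, 0.6337580)
df_welch     <- 983.72

diff_means <- mean_home - mean_away

# The two-sided 95% interval is diff +/- qt(0.975, df) * SE, so SE can be recovered
half_width <- diff(ci_two_sided) / 2
se_diff    <- half_width / qt(0.975, df_welch)

# Welch t statistic: difference in means divided by its standard error
t_stat <- diff_means / se_diff

# One-sided p-value for H1: u_home > u_away
p_one_sided <- pt(t_stat, df_welch, lower.tail = FALSE)

# Lower bound of the one-sided 95% confidence interval
lower_bound <- diff_means - qt(0.95, df_welch) * se_diff

print(c(t = t_stat, p = p_one_sided, lower = lower_bound))
```

The recomputed values should agree with the reported t = 4.0598, p-value = 2.651e-05, and lower bound 0.2539792 up to rounding of the printed inputs.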
Next we will examine the relationship between shots on goal and goals scored and determine if there is a linear relationship between the two variables.
## [1] 0.1313744
A correlation of 0.131 indicates a weak, positive linear association between shots on goal and goals scored. Next, we will fit a linear model and test the linear relationship between the 2 variables.
H0: beta1 = 0
H1: beta1 != 0
alpha = 0.025
Using F statistic: F = Reg MS/Res MS
Reject H0 if p <= 0.025, otherwise, do not reject H0.
##
## Call:
## lm(formula = sample$goalsFor ~ sample$shotsOnGoalFor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1667 -1.1006 -0.0675 1.1639 5.8664
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.877488 0.241417 7.777 1.85e-14 ***
## sample$shotsOnGoalFor 0.033057 0.007896 4.187 3.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.66 on 998 degrees of freedom
## Multiple R-squared: 0.01726, Adjusted R-squared: 0.01627
## F-statistic: 17.53 on 1 and 998 DF, p-value: 3.082e-05
The residuals look approximately normally distributed and there are no major outliers. Since the p-value (3.082e-05) is less than 0.025, we reject the null hypothesis H0. There is significant evidence, at the alpha = 0.025 level, that beta1 is not equal to 0 and that there is a linear relationship between shots on goal and goals scored. The estimate for beta1 is 0.033, which means, in the context of this data, that the number of goals scored is expected to increase by about 0.33 for every 10 additional shots on goal. Based on the multiple R-squared value of 0.01726, only about 1.73% of the variability in goals scored is explained by the number of shots on goal (the adjusted R-squared is 0.01627).
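In simple linear regression the correlation, R-squared, the slope's t statistic, and the overall F statistic are all tied together, so the output above can be checked by hand. The sketch below uses only numbers reported in this section: r = 0.1313744, the slope estimate, and n = 1000, which is inferred from the 998 residual degrees of freedom. The variable names are illustrative.

```r
r     <- 0.1313744  # correlation between goals scored and shots on goal
n     <- 1000       # inferred: 998 residual df + 2 estimated coefficients
slope <- 0.033057   # estimated beta1

# With a single predictor, multiple R-squared equals the squared correlation
r_squared <- r^2

# t statistic for the slope, written in terms of r
t_slope <- r * sqrt(n - 2) / sqrt(1 - r^2)

# With a single predictor, the overall F statistic is the square of that t statistic
f_stat  <- t_slope^2
p_value <- pf(f_stat, 1, n - 2, lower.tail = FALSE)

# Expected change in goals for 10 additional shots on goal
goals_per_10_shots <- 10 * slope

print(c(r_squared = r_squared, t = t_slope, F = f_stat, p = p_value,
        goals_per_10_shots = goals_per_10_shots))
```

These should line up with the reported multiple R-squared (0.01726), slope t value (4.187), and F-statistic (17.53), up to rounding.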
Since the data for hits per game is also available, we can add it as an explanatory variable to create a multiple linear regression model.
##
## Call:
## lm(formula = sample$goalsFor ~ sample$shotsOnGoalFor + sample$hitsFor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2431 -1.1118 -0.0912 1.1451 5.6861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.179169 0.279119 7.807 1.47e-14 ***
## sample$shotsOnGoalFor 0.033287 0.007883 4.223 2.63e-05 ***
## sample$hitsFor -0.013015 0.006076 -2.142 0.0324 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.657 on 997 degrees of freedom
## Multiple R-squared: 0.02176, Adjusted R-squared: 0.0198
## F-statistic: 11.09 on 2 and 997 DF, p-value: 1.725e-05
With a p-value of 0.0324, the coefficient for hits is not statistically significant at the alpha = 0.025 level used for the simple regression test, when controlling for the number of shots on goal (it would be significant at alpha = 0.05). The adjusted R-squared of the model does, however, improve from 0.0163 to 0.0198, which means that the multiple linear regression model fits this sample data slightly better than the simple linear regression.
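Adjusted R-squared penalizes R-squared for the number of predictors, which is why it is the fairer number to compare across the two models. A minimal sketch of the standard formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), applied to the multiple R-squared values reported above. Here n = 1000 is inferred from the residual degrees of freedom, and the function name is illustrative.

```r
# Adjusted R-squared for a model with p predictors fit to n observations
adjusted_r_squared <- function(r_squared, n, p) {
  1 - (1 - r_squared) * (n - 1) / (n - p - 1)
}

n <- 1000  # inferred: 998 residual df + 2 coefficients in the simple model

adj_simple   <- adjusted_r_squared(0.01726, n, p = 1)  # shots on goal only
adj_multiple <- adjusted_r_squared(0.02176, n, p = 2)  # shots on goal + hits

print(c(simple = adj_simple, multiple = adj_multiple))
```

The results should match the reported adjusted R-squared values of 0.01627 and 0.0198 up to rounding.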
Next, we will examine whether the number of shots on goal can explain the home ice advantage or whether there are other contributing factors.
## Anova Table (Type III tests)
##
## Response: sample$goalsFor
## Sum Sq Df F value Pr(>F)
## (Intercept) 163.64 1 59.956 2.375e-14 ***
## sample$home_or_away 30.14 1 11.045 0.0009218 ***
## sample$shotsOnGoalFor 32.84 1 12.033 0.0005450 ***
## Residuals 2721.04 997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANCOVA shows the difference between home and away teams after adjusting for covariates in the model. This analysis adjusts for the number of shots on goal to see whether the difference between the home and away teams is still significant or is instead due to a difference in the number of shots on goal between the groups.
In this case, the conclusion is the same. The p-value for home_or_away (0.0009) shows that the difference between the groups is still significant after adjusting for shots on goal.
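Each F value in the table is the term's mean square divided by the residual mean square. The sketch below recomputes the F statistics and p-values from the sums of squares and degrees of freedom printed above. The variable names are illustrative, and small discrepancies are expected because the printed sums of squares are rounded.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above
ss_home  <- 30.14
ss_shots <- 32.84
ss_resid <- 2721.04
df_term  <- 1
df_resid <- 997

# Residual mean square: the denominator of every F statistic in the table
ms_resid <- ss_resid / df_resid

f_home  <- (ss_home / df_term) / ms_resid
f_shots <- (ss_shots / df_term) / ms_resid

# Upper-tail probabilities from the F distribution
p_home  <- pf(f_home, df_term, df_resid, lower.tail = FALSE)
p_shots <- pf(f_shots, df_term, df_resid, lower.tail = FALSE)

print(c(F_home = f_home, p_home = p_home, F_shots = f_shots, p_shots = p_shots))
```

The recomputed values should be close to the reported F values (11.045 and 12.033) and p-values (0.0009218 and 0.0005450).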
Based on the above analysis, there is evidence of a home ice advantage in the NHL, measured by the number of goals scored per game. There is also evidence of a weak linear relationship between goals scored and shots on goal. However, the number of shots on goal alone does not explain the difference in goals scored per game between home and away teams.