The following report looks at NHL game statistics since the 2008-2009 season. The game data is downloaded from moneypuck.com (https://moneypuck.com/data.htm). The report demonstrates data wrangling techniques, an analysis of categorical and numerical variables, and comparisons of several pairs of variables. It also examines the distribution of goals per game, the applicability of the Central Limit Theorem, and various sampling methods.

Importing Data

The following data frame captures game-level data for all teams since the 2008 season. The team data is filtered to the rows that cover all situations combined (even strength, power play, short-handed, etc.). Team names are standardized and the Atlanta Thrashers are renamed to the Winnipeg Jets (same franchise). A new column is created to easily see whether the result of the game was a win or loss. Only a subset of columns is printed in this section to show a more concise example of the data.

library(tidyverse)
library(plotly)
library(sampling)

team_data <- read.csv("https://moneypuck.com/moneypuck/playerData/careers/gameByGame/all_teams.csv")

team_data <- team_data |>
  #standardize team names
  mutate(team = replace(team, team == 'S.J', 'SJS')) |>
  mutate(team = replace(team, team == 'L.A', 'LAK')) |>
  mutate(team = replace(team, team == 'T.B', 'TBL')) |>
  mutate(team = replace(team, team == 'N.J', 'NJD')) |>
  mutate(team = replace(team, team == 'ATL', 'WPG')) |>
  mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'S.J', 'SJS')) |>
  mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'L.A', 'LAK')) |>
  mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'T.B', 'TBL')) |>
  mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'N.J', 'NJD')) |>
  mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'ATL', 'WPG')) |>
  #Create a game result column; any game where goalsFor does not exceed goalsAgainst, including equal totals, is labeled "L"
  mutate(result = if_else(goalsFor > goalsAgainst, "W", "L")) |>
  #filter by all situations
  filter(situation == 'all')

head(team_data)[c("team","opposingTeam","season","goalsFor","goalsAgainst","result")]
##   team opposingTeam season goalsFor goalsAgainst result
## 1  NYR          TBL   2008        2            1      W
## 2  NYR          TBL   2008        2            1      W
## 3  NYR          CHI   2008        4            2      W
## 4  NYR          PHI   2008        4            3      W
## 5  NYR          NJD   2008        4            1      W
## 6  NYR          BUF   2008        1            3      L
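
The repeated replacements above can also be expressed with a single lookup vector. The following is a minimal base R sketch, not the method used in this report; abbrev_map and standardize_team() are hypothetical names introduced for illustration, and the toy vector stands in for the team and opposingTeam columns.

```r
# Hypothetical lookup: old or dotted abbreviation -> standardized abbreviation
abbrev_map <- c("S.J" = "SJS", "L.A" = "LAK", "T.B" = "TBL",
                "N.J" = "NJD", "ATL" = "WPG")

# Replace any value found in the lookup; leave all other values unchanged
standardize_team <- function(x) {
  hit <- x %in% names(abbrev_map)
  x[hit] <- unname(abbrev_map[x[hit]])
  x
}

# Toy input standing in for a team column
toy_teams <- c("S.J", "NYR", "ATL", "T.B")
standardized <- standardize_team(toy_teams)
print(standardized)
```

With dplyr loaded, a helper like this could be applied to both columns at once with mutate(across(c(team, opposingTeam), standardize_team)).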

Analysis

Visualizing Number of Games Played and Goals Scored

games_played <- team_data |>
  group_by(team) |>
  summarise(n=n())

data <- games_played[order(games_played$n, decreasing = TRUE), ]

games <- data$n

x <- list(
  title = "Team"
)
y <- list(
  title = "Games Played"
)

p1 <- plot_ly(x=reorder(data$team,-games),y=games, name = 'Games Played', type = 'bar') |>
  layout(title="Games Played Since 2008", xaxis = x, yaxis = y)
p1
avg_goals <- team_data |>
  group_by(team) |>
  #count games in the same summary so totals and game counts always refer to the same team
  summarise(goalsFor_total = sum(goalsFor), games = n()) |>
  mutate(avg_goals = goalsFor_total/games)

data <- avg_goals[order(avg_goals$avg_goals, decreasing = TRUE), ]

goals <- data$avg_goals
teams <- data$team

x <- list(
  title = "Team"
)
y <- list(
  title = "Goals"
)

p2 <- plot_ly(x = reorder(teams,-goals), y = goals, type = 'bar') |>
  layout(title="Average Goals Scored per Game", xaxis = x, yaxis = y)
p2
paste("The highest number of goals scored in one game is",max(team_data$goalsFor), "by", team_data$team[team_data$goalsFor == max(team_data$goalsFor)],"against", team_data$opposingTeam[team_data$goalsFor == max(team_data$goalsFor)],sep = " ")
## [1] "The highest number of goals scored in one game is 11 by PIT against DET"

The Boston Bruins have played the most games since 2008. The game counts include playoff games (the 94 games recorded for the 2022 Oilers later in this report exceed the 82-game regular season), so it makes sense that a consistent playoff contender would have the most games played. Outdoor games and games at international venues are part of the regular-season schedule and do not add to a team's total. The Vegas Golden Knights and Seattle Kraken are newer franchises and, as expected, have played fewer games than the rest of the league. With regard to average goals per game, Vegas has been a competitive team in nearly every season since joining the league, which is consistent with the team having the highest average goals scored per game.

Season Summary

This section analyzes the best and worst teams in terms of goal production.

season_goals <- team_data |>
  group_by(team, season) |>
  summarise(goalsFor_total = sum(goalsFor), games_played=n())

paste("The highest number of goals scored in one season is",max(season_goals$goalsFor_total), "by", season_goals$team[season_goals$goalsFor_total == max(season_goals$goalsFor_total)],"in", season_goals$season[season_goals$goalsFor_total == max(season_goals$goalsFor_total)],sep = " ")
## [1] "The highest number of goals scored in one season is 393 by COL in 2021"
paste("The lowest number of goals scored in one season is",min(season_goals$goalsFor_total), "by", season_goals$team[season_goals$goalsFor_total == min(season_goals$goalsFor_total)],"in", season_goals$season[season_goals$goalsFor_total == min(season_goals$goalsFor_total)],sep = " ")
## [1] "The lowest number of goals scored in one season is 109 by FLA in 2012"
## [2] "The lowest number of goals scored in one season is 109 by NSH in 2012"

The 2012 season has unusually low goal totals. This is because the 2012 regular season began on January 19, 2013. The season start was delayed from its original October 11, 2012 date due to a lockout imposed by the NHL franchise owners, so fewer games were played. Goals per game played is therefore a better indicator of the best and worst seasons than total goals.

season_goals_per_game <- season_goals |>
  mutate(goals_per_game = goalsFor_total/games_played)

paste("The lowest number of goals per game scored in one season is",round(min(season_goals_per_game$goals_per_game),2), "by", season_goals_per_game$team[season_goals_per_game$goals_per_game == min(season_goals_per_game$goals_per_game)],"in", season_goals_per_game$season[season_goals_per_game$goals_per_game == min(season_goals_per_game$goals_per_game)],sep = " ")
## [1] "The lowest number of goals per game scored in one season is 1.83 by BUF in 2013"
paste("The highest number of goals per game scored in one season is",round(max(season_goals_per_game$goals_per_game),2), "by", season_goals_per_game$team[season_goals_per_game$goals_per_game == max(season_goals_per_game$goals_per_game)],"in", season_goals_per_game$season[season_goals_per_game$goals_per_game == max(season_goals_per_game$goals_per_game)],sep = " ")
## [1] "The highest number of goals per game scored in one season is 3.93 by EDM in 2022"

With a 1.83 goals-per-game average in the 2013 season, the Buffalo Sabres ended the season with a record of:

buf_2013 <- team_data |>
  filter(team == "BUF", season == 2013)

buf_record_2013 <- table(buf_2013$result)
buf_record_2013
## 
##  L  W 
## 68 14

With a 3.93 goals-per-game average in the 2022 season, the Edmonton Oilers ended the season with the following record. The 94 games exceed the 82-game regular season, so the count includes playoff games:

edm_2022 <- team_data |>
  filter(team == "EDM", season == 2022)

edm_record_2022 <- table(edm_2022$result)
edm_record_2022
## 
##  L  W 
## 38 56
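
To put the two records on the same scale, the counts can be converted to a share of games won. This is a small sketch using only the counts printed in the two tables above; win_share() is a hypothetical helper name, and "W" here follows the report's result column (a game counts as a win only when goalsFor exceeds goalsAgainst).

```r
# Win/loss counts copied from the two tables above
buf_2013_record <- c(L = 68, W = 14)
edm_2022_record <- c(L = 38, W = 56)

# Share of games won: wins divided by total games
win_share <- function(record) unname(record["W"] / sum(record))

buf_win_share <- win_share(buf_2013_record)
edm_win_share <- win_share(edm_2022_record)
print(round(c(BUF_2013 = buf_win_share, EDM_2022 = edm_win_share), 3))
```

Under these counts Buffalo won roughly 17% of its games and Edmonton roughly 60%.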

Analysis of Stats for the Best and Worst Seasons

This section plots goals against shots, penalty minutes, hits, takeaways, and faceoff wins, first for the 2022 EDM Oilers season and then for the 2013 BUF Sabres season. Wins are shown in green and losses in red.

plot_ly(data = edm_2022, y = ~goalsFor, x = ~shotsOnGoalFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "EDM Oilers 2022 Goals vs. Shots")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~penalityMinutesFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "EDM Oilers 2022 Goals vs. PIMs")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~hitsFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "EDM Oilers 2022 Goals vs. Hits")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~takeawaysFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "EDM Oilers 2022 Goals vs. Takeaways")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~faceOffsWonFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "EDM Oilers 2022 Goals vs. Faceoff Wins")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~shotsOnGoalFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "BUF Sabres 2013 Goals vs. Shots")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~penalityMinutesFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "BUF Sabres 2013 Goals vs. PIMs")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~hitsFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "BUF Sabres 2013 Goals vs. Hits")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~takeawaysFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "BUF Sabres 2013 Goals vs. Takeaways")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~faceOffsWonFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
  layout(title = "BUF Sabres 2013 Goals vs. Faceoff Wins")

Total Goal Distribution

This section examines the distribution of total goals scored per game by both teams combined, and then draws samples from that distribution. Because each game appears once per team in the data, total goals and total shots are calculated and duplicate rows are removed so that each game is counted only once.

dist_data <- team_data |>
  mutate(goals_total = goalsFor + goalsAgainst,
         shots_total = shotsOnGoalFor + shotsOnGoalAgainst) |>
  distinct(gameId, .keep_all = TRUE)

plot_ly(data=dist_data, x=~goals_total, type = "histogram")
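
The de-duplication step can be illustrated on a tiny example. The sketch below uses made-up values (not taken from the moneypuck data) and base R's duplicated() as an analogue of distinct(gameId, .keep_all = TRUE); the names toy and toy_games are hypothetical.

```r
# Toy data (made-up values): each game appears twice, once per team
toy <- data.frame(
  gameId       = c(1, 1, 2, 2),
  team         = c("NYR", "TBL", "BOS", "MTL"),
  goalsFor     = c(2, 1, 4, 3),
  goalsAgainst = c(1, 2, 3, 4)
)

# Total goals is the same from either team's row of a game
toy$goals_total <- toy$goalsFor + toy$goalsAgainst

# Keep the first row per gameId so each game is counted once
toy_games <- toy[!duplicated(toy$gameId), ]
print(toy_games[c("gameId", "goals_total")])
```

Because both rows of a game carry the same total, it does not matter which of the two rows is kept.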

Central Limit Theorem

To demonstrate the Central Limit Theorem, 1000 samples of size 10 are drawn at random, and the mean and standard deviation of the sample means are compared to the mean and standard deviation of the original distribution. The sample means should be centered on the original distribution mean and clustered much more tightly around it, with a standard deviation close to the original standard deviation divided by the square root of the sample size.

print("Total distribution mean:")
## [1] "Total distribution mean:"
mean(dist_data$goals_total)
## [1] 5.682542
print("Total distribution standard deviation:")
## [1] "Total distribution standard deviation:"
sd(dist_data$goals_total)
## [1] 2.301972
samples <- 1000
sample.size <- 10
xbar <- numeric(samples)
set.seed(7496)
for (i in 1:samples) {
  xbar[i] <- mean(sample(dist_data$goals_total, sample.size, replace=FALSE))
}

plot_ly(x=xbar, type="histogram")
print("Mean of sample means:")
## [1] "Mean of sample means:"
mean(xbar)
## [1] 5.6985
print("Standard deviation of sample means:")
## [1] "Standard deviation of sample means:"
sd(xbar)
## [1] 0.7072255
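
As a quick check on the result above, the Central Limit Theorem predicts that the standard deviation of the sample means is approximately the population standard deviation divided by the square root of the sample size. The sketch below only reuses the numbers printed in this section; the variable names are introduced here for illustration.

```r
# Values copied from the output above
pop_sd      <- 2.301972   # standard deviation of goals_total
sample_size <- 10         # games per sample
observed_sd <- 0.7072255  # standard deviation of the 1000 sample means

# Standard error predicted by the Central Limit Theorem: sigma / sqrt(n)
expected_se <- pop_sd / sqrt(sample_size)

print(round(expected_se, 3))
print(round(observed_sd / expected_se, 3))
```

The predicted value is about 0.728, so the observed 0.707 is within roughly 3% of it.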

Random Sampling

Simple Random Sampling

Goals scored in 500 sample games chosen at random, without replacement, from all games in the distribution.

N <- nrow(dist_data)
sample.size <- 500
set.seed(7496)
s <- srswor(sample.size, N)
rows <- (1:nrow(dist_data))[s!=0]
sample.1 <- dist_data[rows,]

table(sample.1$goals_total)
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 
##  1  9 25 50 60 98 66 90 36 35 18 10  1  1
plot_ly(x=sample.1$goals_total, type="histogram")
sm.1 <- summary(sample.1$goals_total)
names(sm.1) <- c("Min", "Q1", "Q2", "Mean", "Q3", "Max")
sm.1
##    Min     Q1     Q2   Mean     Q3    Max 
##  0.000  4.000  6.000  5.766  7.000 13.000
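
For readers without the sampling package, the same kind of draw can be sketched in base R. This is an illustrative analogue, not the code used above: N, n, and the seed are hypothetical, and the 0/1 vector s mimics the indicator vector that the report filters with s != 0.

```r
set.seed(1)  # arbitrary seed for this sketch, not the report's seed

N <- 1000  # hypothetical population size
n <- 50    # hypothetical sample size

# Draw n distinct row indices; every subset of size n is equally likely
rows <- sort(sample.int(N, n))

# Equivalent 0/1 indicator vector of length N
s <- integer(N)
s[rows] <- 1L

print(sum(s))
```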

Systematic Sampling with Unequal Probabilities

Goals scored in 500 sample games, with each game's probability of being chosen proportional to its total shots.

set.seed(7496)
pik <- inclusionprobabilities(dist_data$shots_total, sample.size)
s <- UPsystematic(pik)
sample.2 <- dist_data[s != 0, ]
table(sample.2$goals_total)
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 
##  2  7 22 51 56 93 56 95 47 38 16 10  4  3
plot_ly(x=sample.2$goals_total, type="histogram")
sm.2 <- summary(sample.2$goals_total)
names(sm.2) <- c("Min", "Q1", "Q2", "Mean", "Q3", "Max")
sm.2
##    Min     Q1     Q2   Mean     Q3    Max 
##  0.000  4.000  6.000  5.938  7.000 13.000
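
The idea behind this method can be sketched in base R on a small example. The code below is an illustration under stated assumptions, not the sampling package's implementation: the shot totals, sample size, and seed are hypothetical, and the probability formula is valid as written only when no value exceeds 1.

```r
set.seed(1)  # arbitrary seed for this sketch

shots <- c(40, 55, 60, 72, 48, 65, 58, 62)  # hypothetical shot totals
n <- 3                                       # hypothetical sample size

# Inclusion probability proportional to size: pi_k = n * x_k / sum(x)
# (valid as-is only when every pi_k is at most 1)
pik <- n * shots / sum(shots)

# Systematic selection: one random start u, then one unit is picked each time
# the running total of pik crosses u, u + 1, u + 2, ...
u <- runif(1)
upper <- cumsum(pik)
lower <- c(0, upper[-length(upper)])
selected <- which(floor(upper - u) > floor(lower - u))

print(selected)
```

Units with larger shot totals cover a longer stretch of the running total, so they are more likely to contain one of the crossing points.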

Stratified Sampling by Season

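Goals scored in sample games chosen at random within each season, with each season's share of the 500-game target proportional to its number of games. Because the stratum sizes are rounded individually, the sample below contains 499 games rather than 500.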

freq <- table(dist_data$season)
sizes <- round(500 * freq / sum(freq))
st <- sampling::strata(dist_data, stratanames = c("season"),
                       size = sizes,
                       method = "srswor")
sample.3 <- sampling::getdata(dist_data, st)
table(sample.3$goals_total)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  17 
##  11  22  59  65 108  67  80  26  34  16   4   5   1   1
plot_ly(x=sample.3$goals_total, type="histogram")
sm.3 <- summary(sample.3$goals_total)
names(sm.3) <- c("Min", "Q1", "Q2", "Mean", "Q3", "Max")
sm.3
##    Min     Q1     Q2   Mean     Q3    Max 
##  1.000  4.000  5.000  5.615  7.000 17.000
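
The counts in the table above sum to 499 because rounding each stratum size separately does not guarantee the target total. One common remedy is largest-remainder allocation, sketched below in base R; allocate_proportional() is a hypothetical helper, the stratum counts are made up, and this is not the allocation used in the report.

```r
# Proportional allocation that always sums to the target sample size,
# using the largest-remainder method
allocate_proportional <- function(freq, n) {
  exact <- n * freq / sum(freq)   # ideal, possibly fractional, stratum sizes
  sizes <- floor(exact)           # start from the rounded-down sizes
  shortfall <- n - sum(sizes)     # units still to hand out
  if (shortfall > 0) {
    # Give one extra unit to the strata with the largest fractional parts
    top <- order(exact - sizes, decreasing = TRUE)[seq_len(shortfall)]
    sizes[top] <- sizes[top] + 1
  }
  sizes
}

# Hypothetical stratum counts chosen so that plain rounding falls short
freq <- c(A = 1, B = 1, C = 1)
rounded   <- round(10 * freq / sum(freq))
allocated <- allocate_proportional(freq, 10)

print(sum(rounded))
print(sum(allocated))
```

In this toy case plain rounding allocates 9 units while the largest-remainder version allocates the full 10.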

Conclusions

The scatter plots of goals against other favorable stats tend to show that games with many shots had more goals and games with many hits had fewer goals. Hits are a favorable stat for establishing an aggressive game; however, too many hits may be a sign of frustration that the game is not going the team's way. Additionally, the systematic sample, in which games with higher shot totals are more likely to be chosen, has a slightly higher mean goal total (5.94) than the total population (5.68).