This report examines NHL game statistics since the 2008-2009 season. The game data is downloaded from moneypuck.com (https://moneypuck.com/data.htm). The report demonstrates data wrangling techniques, analyzes categorical and numerical variables, and explores several pairs of variables. It also examines the distribution of goals per game, the applicability of the Central Limit Theorem, and various sampling methods.
The following data frame captures game-level data for all teams since the 2008 season. The team data is filtered to the "all" situation, which combines every game state (even strength, power play, short-handed, etc.). Team names are standardized and the Atlanta Thrashers are renamed to the Winnipeg Jets (same franchise). A new column records whether each game was a win or a loss; a game is labeled a win only when goalsFor exceeds goalsAgainst, so any game with equal goal counts is labeled a loss. Only a few columns are displayed here to keep the example concise.
library(tidyverse)
library(plotly)
library(sampling)
team_data <- read.csv("https://moneypuck.com/moneypuck/playerData/careers/gameByGame/all_teams.csv")
team_data <- team_data |>
#standardize team names
mutate(team = replace(team, team == 'S.J', 'SJS')) |>
mutate(team = replace(team, team == 'L.A', 'LAK')) |>
mutate(team = replace(team, team == 'T.B', 'TBL')) |>
mutate(team = replace(team, team == 'N.J', 'NJD')) |>
mutate(team = replace(team, team == 'ATL', 'WPG')) |>
mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'S.J', 'SJS')) |>
mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'L.A', 'LAK')) |>
mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'T.B', 'TBL')) |>
mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'N.J', 'NJD')) |>
mutate(opposingTeam = replace(opposingTeam, opposingTeam == 'ATL', 'WPG')) |>
#Create a game result column
mutate(result = if_else(goalsFor > goalsAgainst, "W", "L")) |>
#filter by all situations
filter(situation == 'all')
head(team_data)[c("team","opposingTeam","season","goalsFor","goalsAgainst","result")]
## team opposingTeam season goalsFor goalsAgainst result
## 1 NYR TBL 2008 2 1 W
## 2 NYR TBL 2008 2 1 W
## 3 NYR CHI 2008 4 2 W
## 4 NYR PHI 2008 4 3 W
## 5 NYR NJD 2008 4 1 W
## 6 NYR BUF 2008 1 3 L
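The repeated `replace()` calls above can also be expressed with a single lookup vector. The following is a minimal base R sketch on toy rows, not the report's actual code; `team_lookup`, `standardize_team`, and `toy` are hypothetical names introduced only for illustration.

```r
# Old abbreviation -> standardized abbreviation (same mapping as in the report).
team_lookup <- c("S.J" = "SJS", "L.A" = "LAK", "T.B" = "TBL",
                 "N.J" = "NJD", "ATL" = "WPG")

# Hypothetical helper: replace values found in the lookup, keep all others.
standardize_team <- function(x) {
  mapped <- unname(team_lookup[x])   # NA where x has no entry in the lookup
  ifelse(is.na(mapped), x, mapped)
}

# Toy rows standing in for the real data (values invented for illustration).
toy <- data.frame(
  team = c("S.J", "NYR", "ATL"),
  opposingTeam = c("L.A", "T.B", "BOS"),
  stringsAsFactors = FALSE
)

# Apply the same helper to both team columns.
toy[c("team", "opposingTeam")] <- lapply(toy[c("team", "opposingTeam")],
                                         standardize_team)
print(toy)
```

Keeping the mapping in one place means a new abbreviation only needs to be added once, rather than once per column.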
games_played <- team_data |>
group_by(team) |>
summarise(n=n())
data <- games_played[order(games_played$n, decreasing = TRUE), ]
games <- data$n
x <- list(
title = "Team"
)
y <- list(
title = "Games Played"
)
p1 <- plot_ly(x = reorder(data$team, -games), y = games, name = 'Games Played', type = 'bar') %>%
layout(title="Games Played Since 2008", xaxis = x, yaxis = y)
p1
avg_goals <- team_data |>
group_by(team) |>
#mean(goalsFor) equals total goals divided by games played for each team
summarise(goalsFor_total = sum(goalsFor), avg_goals = mean(goalsFor))
data <- avg_goals[order(avg_goals$avg_goals, decreasing = TRUE), ]
goals <- data$avg_goals
teams <- data$team
x <- list(
title = "Team"
)
y <- list(
title = "Goals"
)
p2 <- plot_ly(x = reorder(teams, -goals), y = goals, type = 'bar') %>%
layout(title="Average Goals Scored per Game", xaxis = x, yaxis = y)
p2
paste("The highest number of goals scored in one game is",max(team_data$goalsFor), "by", team_data$team[team_data$goalsFor == max(team_data$goalsFor)],"against", team_data$opposingTeam[team_data$goalsFor == max(team_data$goalsFor)],sep = " ")
## [1] "The highest number of goals scored in one game is 11 by PIT against DET"
The Boston Bruins have played the most games since 2008. This makes sense: they are a consistent playoff contender and a league staple that is often invited to play in classic games, outdoor games, and games at international venues. The Vegas Golden Knights and Seattle Kraken are new franchises and are expected to have fewer games played than the rest of the league. With regard to average goals per game, Vegas has been a competitive team since joining the league, so it is plausible that it has the highest average of goals scored per game.
This section analyzes the best and worst teams in terms of goal production.
season_goals <- team_data |>
group_by(team, season) |>
#drop the grouping afterwards so later steps work on a plain data frame
summarise(goalsFor_total = sum(goalsFor), games_played = n(), .groups = "drop")
paste("The highest number of goals scored in one season is",max(season_goals$goalsFor_total), "by", season_goals$team[season_goals$goalsFor_total == max(season_goals$goalsFor_total)],"in", season_goals$season[season_goals$goalsFor_total == max(season_goals$goalsFor_total)],sep = " ")
## [1] "The highest number of goals scored in one season is 393 by COL in 2021"
paste("The lowest number of goals scored in one season is",min(season_goals$goalsFor_total), "by", season_goals$team[season_goals$goalsFor_total == min(season_goals$goalsFor_total)],"in", season_goals$season[season_goals$goalsFor_total == min(season_goals$goalsFor_total)],sep = " ")
## [1] "The lowest number of goals scored in one season is 109 by FLA in 2012"
## [2] "The lowest number of goals scored in one season is 109 by NSH in 2012"
The 2012 season has an unusually low number of goals scored because it was shortened: the regular season began on January 19, 2013, delayed from its original October 11, 2012 start date by a lockout imposed by the NHL franchise owners. Goals per game played is therefore a better indicator of the best and worst seasons.
season_goals_per_game <- season_goals |>
mutate(goals_per_game = goalsFor_total/games_played)
paste("The lowest number of goals per game scored in one season is",round(min(season_goals_per_game$goals_per_game),2), "by", season_goals_per_game$team[season_goals_per_game$goals_per_game == min(season_goals_per_game$goals_per_game)],"in", season_goals_per_game$season[season_goals_per_game$goals_per_game == min(season_goals_per_game$goals_per_game)],sep = " ")
## [1] "The lowest number of goals per game scored in one season is 1.83 by BUF in 2013"
paste("The highest number of goals per game scored in one season is",round(max(season_goals_per_game$goals_per_game),2), "by", season_goals_per_game$team[season_goals_per_game$goals_per_game == max(season_goals_per_game$goals_per_game)],"in", season_goals_per_game$season[season_goals_per_game$goals_per_game == max(season_goals_per_game$goals_per_game)],sep = " ")
## [1] "The highest number of goals per game scored in one season is 3.93 by EDM in 2022"
With a 1.83 goals-per-game average in the 2013 season, the Buffalo Sabres ended the season with a record of:
buf_2013 <- team_data |>
filter(team == "BUF", season == 2013)
buf_record_2013 <- table(buf_2013$result)
buf_record_2013
##
## L W
## 68 14
With a 3.93 goals-per-game average in the 2022 season, the Edmonton Oilers ended the season with a record of:
edm_2022 <- team_data |>
filter(team == "EDM", season == 2022)
edm_record_2022 <- table(edm_2022$result)
edm_record_2022
##
## L W
## 38 56
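The records above depend on how `result` was defined: a win only when `goalsFor > goalsAgainst`, and a loss otherwise. The sketch below uses invented toy values to show how a game with equal goal counts is labeled under that rule, next to a hypothetical three-way alternative. Whether the MoneyPuck data actually contains equal-count games (for example, games decided by shootout) is an assumption that should be checked against the data; the label `"T"` is illustrative only.

```r
# Toy games (values invented): a win, an equal-count game, and a loss.
toy <- data.frame(goalsFor = c(3, 2, 2), goalsAgainst = c(1, 2, 4))

# Rule used in the report: anything that is not a strict win becomes "L".
two_way <- ifelse(toy$goalsFor > toy$goalsAgainst, "W", "L")

# Hypothetical alternative that keeps equal-count games separate as "T".
three_way <- ifelse(toy$goalsFor > toy$goalsAgainst, "W",
                    ifelse(toy$goalsFor < toy$goalsAgainst, "L", "T"))

print(table(two_way))
print(table(three_way))
```

If equal-count games do occur in the data, the `L` column in the tables above would include them.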
This section visualizes goals against shots, penalty minutes, hits, takeaways, and faceoff wins, first for the Edmonton Oilers' 2022 season and then for the Buffalo Sabres' 2013 season.
plot_ly(data = edm_2022, y = ~goalsFor, x = ~shotsOnGoalFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "EDM Oilers 2022 Goals vs. Shots")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~penalityMinutesFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "EDM Oilers 2022 Goals vs. PIMs")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~hitsFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "EDM Oilers 2022 Goals vs. Hits")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~takeawaysFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "EDM Oilers 2022 Goals vs. Takeaways")
plot_ly(data = edm_2022, y = ~goalsFor, x = ~faceOffsWonFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "EDM Oilers 2022 Goals vs. Faceoff Wins")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~shotsOnGoalFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "BUF Sabres 2013 Goals vs. Shots")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~penalityMinutesFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "BUF Sabres 2013 Goals vs. PIMs")
plot_ly(data = buf_2013, y = ~goalsFor, x = ~hitsFor, type = "scatter", mode = "markers", color = ~result, colors = c("red", "green")) |>
layout(title = "BUF Sabres 2013 Goals vs. Hits")
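A scatter plot shows the direction of a relationship visually; a correlation coefficient can summarize it in one number. The sketch below is a minimal base R example on invented toy values, not results from the report's data. The column names mirror those used above, but `toy`, `r_shots`, and `r_hits` are hypothetical.

```r
# Toy game-level values (invented for illustration only).
toy <- data.frame(
  shotsOnGoalFor = c(25, 28, 30, 33, 35, 38, 40, 42),
  hitsFor        = c(30, 28, 27, 22, 25, 18, 20, 15),
  goalsFor       = c(1, 2, 2, 3, 2, 4, 4, 5)
)

# Pearson correlation of goals with each candidate variable.
r_shots <- cor(toy$goalsFor, toy$shotsOnGoalFor)
r_hits  <- cor(toy$goalsFor, toy$hitsFor)

print(round(c(shots = r_shots, hits = r_hits), 2))
```

The same two `cor()` calls could be applied to `edm_2022` or `buf_2013` to put a number on the patterns seen in the plots.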
This section examines the distribution of total goals scored per game by both teams combined, and then samples from that distribution. Total goals and total shots are calculated for each row, and duplicate rows are removed so that each game is counted only once.
dist_data <- team_data |>
#columns are referenced directly inside mutate rather than through team_data$
mutate(goals_total = goalsFor + goalsAgainst) |>
mutate(shots_total = shotsOnGoalFor + shotsOnGoalAgainst) |>
distinct(gameId, .keep_all = TRUE)
plot_ly(data=dist_data, x=~goals_total, type = "histogram")
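The de-duplication step matters because each game appears once per team, and both rows carry the same combined total. A minimal base R sketch on invented toy rows illustrates the idea; `toy` and `games` are hypothetical names, and `!duplicated()` is used here as a base R stand-in for `distinct()`.

```r
# Toy data: two games, each recorded once from each team's perspective.
toy <- data.frame(
  gameId       = c(1, 1, 2, 2),
  team         = c("NYR", "TBL", "BOS", "MTL"),
  goalsFor     = c(2, 1, 4, 3),
  goalsAgainst = c(1, 2, 3, 4)
)

# Both rows of a game give the same combined total.
toy$goals_total <- toy$goalsFor + toy$goalsAgainst

# Keep only the first row for each gameId so each game counts once.
games <- toy[!duplicated(toy$gameId), ]
print(games[c("gameId", "goals_total")])
```

Without this step, every game's total would be counted twice in the histogram.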
To demonstrate the Central Limit Theorem, 1000 samples of size 10 are chosen randomly, and the mean and standard deviation of the sample means are compared to the mean and standard deviation of the original distribution. The sample means should be centered on the original distribution's mean but clustered much more tightly around it than individual games are.
## [1] "Total distribution mean:"
## [1] 5.682542
## [1] "Total distribution standard deviation:"
## [1] 2.301972
samples <- 1000
sample.size <- 10
xbar <- numeric(samples)
set.seed(7496)
for (i in 1:samples) {
xbar[i] <- mean(sample(dist_data$goals_total, sample.size, replace=FALSE))
}
plot_ly(x=xbar, type="histogram")
## [1] "Sample mean:"
## [1] 5.6985
## [1] "Sample standard deviation:"
## [1] 0.7072255
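The Central Limit Theorem also predicts how tight the clustering should be: the standard deviation of the sample means is approximately the population standard deviation divided by the square root of the sample size. The sketch below checks that against the values reported above, then repeats the experiment on a synthetic population. The Poisson population is an assumption made only for illustration and is not claimed to match the real goal distribution.

```r
# Values reported above.
sigma       <- 2.301972    # population standard deviation
n           <- 10          # sample size
se_observed <- 0.7072255   # standard deviation of the 1000 sample means

# Standard error predicted by the Central Limit Theorem.
se_expected <- sigma / sqrt(n)
print(round(se_expected, 3))   # about 0.728, close to the observed 0.707

# Same experiment on a synthetic population (assumed Poisson, illustration only).
set.seed(1)
pop  <- rpois(20000, lambda = 5.7)
xbar <- replicate(2000, mean(sample(pop, n, replace = FALSE)))
ratio <- sd(xbar) / (sd(pop) / sqrt(n))
print(round(ratio, 2))         # should be near 1
```

The small gap between 0.728 and 0.707 is within what one would expect from using only 1000 samples.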
Goals scored in 500 games chosen by simple random sampling without replacement from the entire distribution.
N <- nrow(dist_data)
sample.size <- 500
set.seed(7496)
s <- srswor(sample.size, N)
rows <- (1:nrow(dist_data))[s!=0]
sample.1 <- dist_data[rows,]
table(sample.1$goals_total)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 1 9 25 50 60 98 66 90 36 35 18 10 1 1
## Min Q1 Q2 Mean Q3 Max
## 0.000 4.000 6.000 5.766 7.000 13.000
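For readers without the sampling package, simple random sampling without replacement can be sketched with base R's `sample()`. This is an illustrative equivalent on a toy population size, not the report's code; `N`, `n`, `idx`, and `indicator` here are toy values and hypothetical names.

```r
set.seed(1)
N <- 1000   # toy population size
n <- 50     # toy sample size

# Draw n distinct row numbers; every row has the same chance of selection.
idx <- sample(N, n)

# Build a 0/1 selection vector of length N, the form srswor() returns.
indicator <- integer(N)
indicator[idx] <- 1L

print(sum(indicator))   # number of selected rows
```

Rows can then be pulled with `which(indicator != 0)`, mirroring the `s != 0` filter used above.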
Goals scored in 500 games chosen by systematic sampling, with inclusion probabilities proportional to each game's total shots, so games with more shots are more likely to be selected.
set.seed(7496)
pik <- inclusionprobabilities(dist_data$shots_total, sample.size)
s <- UPsystematic(pik)
sample.2 <- dist_data[s != 0, ]
table(sample.2$goals_total)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 2 7 22 51 56 93 56 95 47 38 16 10 4 3
## Min Q1 Q2 Mean Q3 Max
## 0.000 4.000 6.000 5.938 7.000 13.000
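The idea behind unequal-probability systematic sampling can be sketched in base R. This is a simplified illustration of the general technique on invented toy data, not the sampling package's implementation; it assumes every inclusion probability is below 1, and `shots`, `pik`, `u`, and `selected` are hypothetical names.

```r
set.seed(1)
N <- 1000
n <- 50

# Toy size variable standing in for total shots per game (invented values).
shots <- sample(40:80, N, replace = TRUE)

# Inclusion probabilities proportional to size; they sum to n.
pik <- n * shots / sum(shots)
stopifnot(all(pik < 1))   # assumption needed for this simple version

# Lay the probabilities end to end, pick a random start u in (0, 1), and
# select every unit whose interval contains one of the points u, u+1, u+2, ...
u <- runif(1)
crossings <- ceiling(cumsum(pik) - u)
selected <- diff(c(0, crossings)) > 0

print(sum(selected))                        # n units are selected
# Expected mean size in the sample is pulled above the population mean.
print(c(population = mean(shots), expected_sample = sum(pik * shots) / n))
```

Because high-shot games are over-represented by design, any variable that rises with shots, such as goals, would be expected to have a higher sample mean than the population mean.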
Goals scored in 500 games chosen by stratified random sampling, with each season's sample size proportional to its share of all games.
freq <- table(dist_data$season)
sizes <- round(500 * freq / sum(freq))
st <- sampling::strata(dist_data, stratanames = c("season"),
size = sizes,
method = "srswor")
sample.3 <- sampling::getdata(dist_data, st)
table(sample.3$goals_total)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 17
## 11 22 59 65 108 67 80 26 34 16 4 5 1 1
## Min Q1 Q2 Mean Q3 Max
## 1.000 4.000 5.000 5.615 7.000 17.000
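Proportional allocation can be sketched in base R as follows. The toy data and season counts are invented so that the rounded stratum sizes add up exactly; with real data, rounding may leave the total slightly above or below the target. The names `toy`, `sizes`, and `strat_sample` are hypothetical.

```r
set.seed(1)

# Toy population: three seasons with different numbers of games (invented).
toy <- data.frame(
  season      = rep(c(2008, 2012, 2022), times = c(400, 240, 360)),
  goals_total = rpois(1000, lambda = 5.7)
)

n <- 100                                  # target sample size
freq <- table(toy$season)                 # games per season
# Each season's share of the sample matches its share of the population.
sizes <- setNames(as.integer(round(n * freq / sum(freq))), names(freq))

# Simple random sample without replacement within each season.
rows <- unlist(lapply(names(sizes), function(s) {
  sample(which(toy$season == as.numeric(s)), sizes[[s]])
}))
strat_sample <- toy[rows, ]

print(sizes)
print(table(strat_sample$season))
```

A shortened season therefore contributes fewer games to the sample, in the same proportion as it does to the full data.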
The scatter plots of goals against other favorable stats suggest that games with many shots tended to have more goals, while games with many hits tended to have fewer. Hits help establish an aggressive game, but too many hits may be a sign of frustration that the game is not going a team's way. Additionally, the systematic sample that gave higher selection probability to games with higher shot totals has a slightly higher mean goals per game (5.938) than the full population (5.683).