NFL Analytics: Exploring Early-Season Defensive EPA with nflfastR

Long story short: I am a big Pittsburgh Steelers fan. I grew up about 45 minutes outside of Pittsburgh. My father worked in the steel mills his entire life. There was really no other option for me in terms of my fandom.

Because of that, I usually gravitate towards working with the Pittsburgh Steelers when I fiddle around with NFL analytics in my free time.

This time around, I wanted to take a look at the Steelers defense. Yes, we are only two weeks into the season, but there is already talk about this year’s version of the Steel Curtain “being one for the ages.”

For instance, ESPN is talking about the Steelers’ defense getting after QBs at a “championship rate.”

They are also ranked first in pressure and blitz rate through the first two weeks of the season.

Indeed, all signs seem to be pointing towards this being one of those Steelers defenses that carry the team into the playoffs and allow us to tell our grandchildren about it one day.

But, I got to wondering: how does this defense stack up against other versions of the Steelers defense through the first two weeks of the season?

To answer this question, I turned to the R programming language and the nflfastR package.

First, I grabbed the play-by-play for every single play in the NFL from 1999 to 2019:

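##Loading packages (tidyverse for map_df, dplyr, and ggplot2; ggalt for geom_dumbbell)
library(tidyverse)
library(ggalt)

seasons <- 1999:2019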

##Grabbing play-by-play data from 1999-2019
pbp <- map_df(seasons, function(x) {
  readRDS(
    url(
      paste0("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_",x,".rds")
    )
  )
})

And then I did the same for just the 2020 season.

##Grabbing play-by-play data for 2020
data.2020 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2020.rds'))

Now, to be honest, I could have done this in one step. But, I wanted the 2020 data pulled by itself for another study I was doing at the same time. That said, not a huge issue.
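For what it's worth, the one-step version might look something like the sketch below. The helper names `pbp_url()` and `load_pbp()` are my own, made up for illustration, and are not part of nflfastR. The actual download call is left commented out because it pulls a lot of data over the network.

```r
library(purrr)

# Hypothetical helper: build the raw GitHub URL for one season's
# play-by-play file (same URL pattern used in the code above)
pbp_url <- function(season) {
  paste0(
    "https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_",
    season, ".rds"
  )
}

# Hypothetical helper: read every requested season and stack the
# results into one data frame
load_pbp <- function(seasons) {
  map_df(seasons, function(x) readRDS(url(pbp_url(x))))
}

# One call would cover 1999 through 2020 (commented out: large download)
# pbp.all <- load_pbp(1999:2020)
```

With everything in one data frame, the later filter and summarise steps would only need to be written once, and there would be nothing to merge at the end.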

After pulling the play-by-play data, it is just a matter of keeping only the plays from the first two weeks of each season where the Steelers are the defensive unit on the field.

##Filtering 1999-2019 data for just PIT defense stats
pit.defense.df <- pbp %>%
  filter(defteam == "PIT" & week <= 2)

##Pulling 1999-2019 PIT defense EPA
pit.defense.epa <- pit.defense.df %>%
  group_by(season) %>%
  summarise(epa.pass.def = mean(epa[pass == 1], na.rm = TRUE),
            epa.rush.def = mean(epa[rush == 1], na.rm = TRUE),
            num.plays = n())

##Filtering 2020 data for just PIT defense stats
pit.2020defense.df <- data.2020 %>%
  filter(defteam == "PIT" & week <= 2)

##Pulling 2020 PIT defense EPA
pit.2020defense.epa <- pit.2020defense.df %>%
  group_by(season) %>%
  summarise(epa.pass.def = mean(epa[pass == 1], na.rm = TRUE),
            epa.rush.def = mean(epa[rush == 1], na.rm = TRUE),
            num.plays = n())

As you can see, I keep only the rows where the ‘defteam’ variable is “PIT” and ‘week’ is less than or equal to 2.

Afterwards, I group by ‘season’ and then summarise to get the mean EPA (expected points added) per play for both rushing and passing defense, plus a count of plays.
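If the `mean(epa[pass == 1])` pattern looks odd, here is a tiny sketch of the same group-and-summarise step on a handful of made-up plays. The `toy` data frame and its numbers are invented purely for illustration and are not real nflfastR rows.

```r
library(dplyr)

# Made-up plays for illustration only (not real nflfastR data)
toy <- data.frame(
  season = c(2019, 2019, 2019, 2020, 2020, 2020),
  pass   = c(1, 1, 0, 1, 0, 0),
  rush   = c(0, 0, 1, 0, 1, 1),
  epa    = c(-0.5, 0.3, 0.2, -1.0, NA, -0.4)
)

toy.epa <- toy %>%
  group_by(season) %>%
  # epa[pass == 1] keeps only the EPA values from pass plays in each
  # season; na.rm = TRUE drops missing EPA values before averaging
  summarise(epa.pass.def = mean(epa[pass == 1], na.rm = TRUE),
            epa.rush.def = mean(epa[rush == 1], na.rm = TRUE),
            num.plays = n())

print(toy.epa)
```

One thing the toy data makes visible: `n()` counts every row in the group, including the row whose EPA is missing, so the play count and the number of plays behind each average are not necessarily the same.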

And, now, a quick explanation of EPA from our friends over at Advanced Football Analytics.

“Football is a sport of strategy and decision making. But before we can compare the potential risks and rewards of various options, we need to be able to properly measure the value of possible outcomes.

The value of a football play has traditionally been measured in yards gained. Unfortunately, yards is a flawed measure because not all yards are equal. For example, a 4-yard gain on 3rd down and 3 is much more valuable than a 4-yard gain on 3rd and 8. Any measure of success must consider the down and distance situation.

Field position is also an important consideration. Yards gained near the goal line are tougher to come by and are more valuable than yards gained at midfield. Yards lost near one’s own goal line can be more costly as well.

We can measure the values of situations and, by extension, the outcomes of plays by establishing an equivalence in terms of points. To do this we can start by looking back through recent NFL history at the ‘next points scored’ for all plays. For example, if we look at all 1st and 10s from an offense’ own 20-yard line, the team on offense will score next slightly more often than its opponent. If we add up all the ‘next points’ scored for and against the offense’s team, whether on the current drive or subsequent drives, we can estimate the net point advantage an offense can expect for any football situation. For a 1st and 10 at an offense’s own 20, it’s +0.4 net points, and at the opponent’s 20, it’s +4.0 net points. These net point values are called Expected Points (EP), and every down-distance-field position situation has a corresponding EP value.

Suppose the offense has a 1st and 10 at midfield. This situation is worth +2.0 EP. A 5-yard gain would set up a 2nd and 5 from the 45, which corresponds to a +2.1 EP. Therefore, that 5-yard gain in that particular situation represents a +0.1 gain in EP. This gain is called Expected Points Added (EPA). Likewise, a 5-yard loss on 1st down at midfield would create a 2nd and 15 from the offense’s own 45. That situation is worth +1.2 EP, representing a net difference of -0.8 EPA.”
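Put another way, EPA is just a subtraction: the EP after the play minus the EP before it. Here is a minimal sketch using the numbers from the quote above. The `epa_from_ep()` helper is my own name for illustration and is not an nflfastR function.

```r
# Expected Points values taken from the quote above
ep.before     <- 2.0  # 1st and 10 at midfield
ep.after.gain <- 2.1  # 2nd and 5 after a 5-yard gain
ep.after.loss <- 1.2  # 2nd and 15 after a 5-yard loss

# Hypothetical helper: EPA is the EP after the play minus the EP before it
epa_from_ep <- function(ep.start, ep.end) {
  ep.end - ep.start
}

# Rounded to one decimal to hide floating-point noise
epa.gain <- round(epa_from_ep(ep.before, ep.after.gain), 1)
epa.loss <- round(epa_from_ep(ep.before, ep.after.loss), 1)

print(c(gain = epa.gain, loss = epa.loss))
```

The gain comes out to +0.1 and the loss to -0.8, matching the figures in the quote.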

Okay … back to the fun stuff now.

In terms of defensive EPA, we want to see negative numbers. That means the defense is taking expected points away from the offensive unit.

Now that we have all the EPA data, we need to do a quick merging of the information since we pulled 1999:2019 and then the 2020 seasons at different times:

##Merging 1999-2019 and 2020 data into one
pit.defense.epa <- rbind(pit.defense.epa, pit.2020defense.epa)

After that, we can build the graph:

##Plotting (column names match the summarise step: epa.pass.def, epa.rush.def, num.plays)
ggplot(pit.defense.epa, aes(x = epa.pass.def, xend = epa.rush.def, y = season)) +
  geom_dumbbell(color = "black",
                size = 3,
                colour_x = "#ffb612",
                colour_xend = "#ffb612",
                size_x = 6,
                size_xend = 6) +
  scale_y_continuous(breaks = seq(1999, 2020)) +
  scale_x_continuous(breaks = c(-0.5, -0.25, 0.0, 0.25, 0.5)) +
  theme_bw() +
  theme(panel.background=element_rect(fill="#FFFFFF")) +
  theme(plot.background=element_rect(fill="#FFFFFF")) +
  theme(panel.border=element_rect(colour="#FFFFFF")) +
  theme(panel.grid.major=element_line(colour="#D0D0D0",size=.75)) +
  theme(axis.text.x=element_text(angle = 50, vjust = 0.5, size=11,colour="#535353",face="bold")) +
  theme(axis.text.y=element_text(size=11,colour="#535353",face="bold")) +
  theme(axis.title.y=element_text(size=11,colour="#535353",face="bold",vjust=1.5)) +
  theme(axis.title.x=element_text(size=11,colour="#535353",face="bold",vjust=-.5)) +
  theme(aspect.ratio = 9 / 16) +
  geom_text(color = "black", size = 4, fontface = "bold",
            aes(x = epa.rush.def, label = "R")) +
  geom_text(color = "black", size = 4, fontface = "bold",
            aes(x = epa.pass.def, label = "P")) +
  geom_rect(data = pit.defense.epa, aes(xmin = .50, xmax = .55, ymin = -Inf, ymax = Inf), fill = "#efefe3") +
  geom_text(data = pit.defense.epa, aes(label = num.plays, y = season, x = .52), fontface = "bold", size = 3) +
  geom_text(data=filter(pit.defense.epa, season == 2020), aes(x=.50, y=season, label="Total Plays"),
            color="#7a7d7e", size=3.1, vjust=-2.7, hjust = .40, fontface="bold") +
  labs(title = "Steelers' Defensive EPA: 1999 - 2020",
       subtitle = "First 2 Weeks of Every Season | R = Rushing, P = Passing", 
       caption = "Twitter: @BradCongelio  |  Data: nflfastR") +
  coord_cartesian(clip = "off") +
  xlab("Mean EPA per Play") +
  ylab("Season") +
  theme(plot.title=element_text(face="bold", colour="#3C3C3C",size=20)) +
  theme(plot.subtitle=element_text(size=9)) +
  theme(plot.caption=element_text(size=7, color="#7a7d7e"))

So, the 2020 version of the Steelers defense is certainly worthy of the praise it is getting right now. Both rushing and passing defense EPA are in the negative numbers, which is awesome.

But, this year’s version isn’t quite as scary as that nasty 2007 defense, or even the 2008 and 2018 versions.

That said, based on this quick study using the nflfastR package, it is clear that the 2020 Steelers defense is trending in the right direction.

Brad Congelio

An Assistant Professor in the College of Business at Kutztown University of Pennsylvania, Brad Congelio uses data science and analytics to investigate the sport industry.
