How to create a conditional and efficient rolling average using dplyr

70 views Asked by At

I'm working with pitch-by-pitch baseball data and am looking to create a number of rolling averages, a few of which are conditional based on another column. The data is 6 million rows, so ideally the method isn't too inefficient. I understand dplyr is probably not the best choice for efficiency - however would like to utilize it if at all possible. Working with 32 GB of memory so I do have a little wiggle room

Here's an example of the data:

mydata <- data.frame(pitch_type = c("FB", "SI", "CU", "FB", "CH", "FB", "FS", "SL", "FB", "CH"), 
                      velocity = c(99, 97, 83, 97, 85, 101, 82, 84, 100, 83))
mydata
   pitch_type velocity
1          FB       99
2          SI       97
3          CU       83
4          FB       97
5          CH       85
6          FB       101
7          FS       82
8          SL       84
9          FB      100
10         CH       83

In my actual code I have data grouped by pitcher, sorted chronologically, etc. but basically I just want to be able to calculate rolling averages for certain pitch types. I also cannot have NA values when pitch type is not equal to what I'm trying to calculate, I'd like for it to just revert to the previous calculation.

Here's an example of what I'm looking for. Here I want to calculate average velocity for last two observations where pitch_type == "FB":

mydata_updated
   pitch_type velocity l2_velo_fb
1          FB       99         NA
2          SI       97         NA
3          CU       83         NA
4          FB       97      98.00
5          CH       85      98.00
6          FB      101      99.00
7          FS       82      99.00
8          SL       84      99.00
9          FB      100      100.5
10         CH       83      100.5

Seems relatively straightforward, but for the life of me cannot find similar examples online to what I'm looking for. I have numerous different columns and conditions to work through in creating new columns, so creating a new dataframe for each filter is far from ideal.

I did find this example:

mutate(last1000FBvelo = ifelse(pitch_type %in% c("FB"),
rollapply(release_speed, 1000, mean, fill = NA, align = 'right', na.rm = TRUE), NA),

but instead of NA I need it to revert to whatever the previous value was. Also, just a few of these columns took about 30 minutes to run.

I also realize there are a number of packages that can compute rolling averages (zoo, RccpRoll, slider, runner are ones I've came across). It's hard to tell what is the most efficient for my use case - assuming an of them can achieve the conditional filtering.

Any input is greatly appreciated

Also - not important or even necessary, but would also be awesome if there's a simple option to have a weighted moving average that weights more recent observations a little more.

1

There are 1 answers

0
r2evans On

I'm not certain about 99.25, but here's a method:

library(dplyr)
# library(zoo)
mydata %>%
  mutate(
    l2_velo_fb = if (first(pitch_type == "FB")) {
        zoo::rollmeanr(velocity, 2, na.pad = TRUE) 
      } else rep(NA_real_, n()),
    .by=pitch_type) |>
  mutate(l2_velo_fb = zoo::na.locf(l2_velo_fb, na.rm = FALSE))
#    pitch_type velocity l2_velo_fb
# 1          FB       99         NA
# 2          SI       97         NA
# 3          CU       83         NA
# 4          FB       97       98.0
# 5          CH       85       98.0
# 6          FB      101       99.0
# 7          FS       82       99.0
# 8          SL       84       99.0
# 9          FB      100      100.5
# 10         CH       83      100.5

The if (...) zoo::rollmeanr(..) else NA is really just a small efficiency: if you don't want to calculate a rolling mean for non-FB data, then using if (.) before rollmeanr keeps us from calculating and discarding the results. (That's also the reason I group by pitch_type.)

I think it's incorrect because a simple 2-wide rolling right-mean is:

filter(mydata, pitch_type == "FB") %>%
  mutate(vb=zoo::rollmeanr(velocity, 2, na.pad = TRUE))
#   pitch_type velocity    vb
# 1         FB       99    NA
# 2         FB       97  98.0
# 3         FB      101  99.0
# 4         FB      100 100.5

The use of .by= is predicated on dplyr_1.1.0 or newer. If you have older, then change to

mydata %>%
  group_by(pitch_type) %>%
  mutate(
    l2_velo_fb = ...
  ) %>%
  ungroup() %>% ...