Rolling regression across multiple columns with NAs in R

81 views Asked by At

I want to regress each column of y against x with width 12 and calculate Beta, Intercept, and R-2 values. I want a minimum of 12 observations and for cases where there isn't overlapping data, for these to be NAs. For my output, I want three matrices (dataframes) with column headers y1,y2,y3, containing all Beta values, all Intercept values, and all R-2 values respectively. The output matrices should each be of size 3x53.

In practice, my y-matrix has 2000+ columns, so I'd like to avoid looping if possible.

Example:
y <- data.frame(y1 = c(NA,NA,NA, rnorm(50)),
                y2 = c(NA, NA, NA, NA, NA, rnorm(48)),
                y3 = rnorm(53))
x <- data.frame(x = c(NA, NA,NA, NA, NA, rnorm(48)))

my attempt using rollapply:

model_output <- rollapply(data = cbind.data.frame(y, x),
                          width = 12, FUN = function(a,b) {lm(a ~ b)} ,
                          by.column = FALSE, align = "right",
                          fill = NA)
2

There are 2 answers

0
G. Grothendieck On BEST ANSWER

Using the data in the Note at the end from the question define a roll function which performs the linear regression and then applies a passed in function fun. Then use it to define Intercept, Beta and R2.

library (zoo)

fm <- lm(as.matrix(y) ~ x, x)

roll <- function(fun) rollapplyr(1:nrow(y), 12, FUN = 
   \(ix) fun(update(fm, subset = ix)), fill = NA, by.column = FALSE) 

Intercept <- roll(\(x) coef(x)[1, ])
Beta <- roll(\(x) coef(x)[2, ])
R2 <- roll(\(x) sapply(summary(x), function(z) z$r.squared))

Note

y <- data.frame(y1 = c(NA,NA,NA, rnorm(50)),
                y2 = c(NA, NA, NA, NA, NA, rnorm(48)),
                y3 = rnorm(53))
x <- data.frame(x = c(NA, NA,NA, NA, NA, rnorm(48)))
0
evolet herrington On

You can use the zoo package's rollapply function in combination with lapply to achieve this. Here's a streamlined way to get the desired matrices:

library(zoo)

y <- data.frame(y1 = c(NA,NA,NA, rnorm(50)),
                y2 = c(NA, NA, NA, NA, NA, rnorm(48)),
                y3 = rnorm(53))
x <- data.frame(x = c(NA, NA,NA, NA, NA, rnorm(48)))

compute_stats <- function(y) {
  if(any(is.na(y))) return(c(Beta = NA, Intercept = NA, R2 = NA))
  model <- lm(y ~ x$x)
  c(Beta = coef(model)[2], Intercept = coef(model)[1], R2 = summary(model)$r.squared)
}

results <- lapply(y, function(col) {
  t(rollapplyr(data = col, width = 12, FUN = compute_stats, by.column = TRUE, fill = NA))
})

betas <- do.call(cbind, lapply(results, function(res) res$Beta))
intercepts <- do.call(cbind, lapply(results, function(res) res$Intercept))
r2s <- do.call(cbind, lapply(results, function(res) res$R2))

colnames(betas) <- colnames(intercepts) <- colnames(r2s) <- colnames(y)

The variables betas, intercepts, and r2s now contain the Beta values, the Intercept values, and the R-squared values respectively.