I'm trying to take cumulative sums for each column of a matrix. Here's my code in R:
testMatrix = matrix(1:65536, ncol=256);
library(microbenchmark)
microbenchmark(apply(testMatrix, 2, cumsum), times=10000L);
Unit: milliseconds
                         expr      min       lq     mean  median       uq      max neval
 apply(testMatrix, 2, cumsum) 1.599051 1.766112 2.329932 2.15326 2.221538 93.84911 10000
I used Rcpp for comparison:
library(Rcpp)
cppFunction('NumericMatrix apply_cumsum_col(NumericMatrix m) {
    // For each column, add the previous entry to the current one,
    // turning the column into its running (cumulative) sum.
    for (int j = 0; j < m.ncol(); ++j) {
        for (int i = 1; i < m.nrow(); ++i) {
            m(i, j) += m(i - 1, j);
        }
    }
    return m;
}');
microbenchmark(apply_cumsum_col(testMatrix), times=10000L);
Unit: microseconds
                         expr     min      lq     mean  median      uq      max neval
 apply_cumsum_col(testMatrix) 205.833 257.719 309.9949 265.986 276.534 96398.93 10000
So the C++ code is 7.5 times as fast. Is it possible to do better than apply(testMatrix, 2, cumsum) in pure R? It feels like I have an order of magnitude overhead for no reason.
                        
Using a byte-compiled for loop is slightly faster than the apply call on my system. That makes sense, since it does less work than apply, which has to split the matrix into columns, collect the results in a list, and simplify that back into a matrix. As expected, though, the R loop is still slower than the simple C++ function you wrote.
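
For reference, here is a minimal sketch of that approach, assuming a loop that overwrites each column with its cumulative sum (the names colCumsum and colCumsum_c are my own):

library(compiler)

# Plain R loop: overwrite each column with its cumulative sum.
# Unlike apply, this avoids building a list of results and
# re-simplifying it back into a matrix.
colCumsum <- function(m) {
    for (j in seq_len(ncol(m))) {
        m[, j] <- cumsum(m[, j])
    }
    m
}

# Byte-compile the loop before benchmarking it.
colCumsum_c <- cmpfun(colCumsum)
microbenchmark(colCumsum_c(testMatrix), times=10000L);

Note that on recent R versions the JIT byte-compiles functions on first use anyway, so the explicit cmpfun call mainly matters on older R.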