I am trying to extrapolate forwards for missing values after the last non-missing value in a series using the growth rate given in another column.
This is a toy example below. My data is grouped according to product and region and should be arranged by year within grouping. I have sales data with some NA values. I am interested in using the World.growth rate to extrapolate the NA values for sales which appear after the last non-missing value in each series. So say for example Product = 1, Region = 1, I want the Interpolated.sales for 1973 to be 1972 value * World.growth value given in 1973.
Here is my data:
> library(openxlsx)
> library(dplyr)
>
> sales.data <- read.xlsx("\\Interpolate_sales.xlsx", sheet = 1)
>
> print(sales.data)
Product Region Year Sales World.growth Extrapolated.sales
1 1 1 1970 NA 0.88 NA
2 1 1 1971 141.906913 0.72 141.906913
3 1 1 1972 9.605398 1.01 9.605398
4 1 1 1973 NA 1.18 11.334370
5 1 1 1974 NA 0.84 9.520871
6 1 2 1970 60.062486 0.88 60.062486
7 1 2 1971 124.904150 0.72 124.904150
8 1 2 1972 NA 1.01 126.153191
9 1 2 1973 NA 1.18 148.860765
10 1 2 1974 NA 0.84 125.043043
11 1 2 1975 NA 1.23 153.802943
12 1 3 1970 63.298780 0.88 63.298780
13 1 3 1971 90.219126 0.72 90.219126
14 1 3 1972 107.271043 1.01 107.271043
15 1 3 1973 129.122561 1.18 129.122561
16 1 3 1974 NA 0.84 108.462951
This is my dplyr attempt (Is there perhaps a better way by writing a function or something? I guess this will also be handy if I want to backwards extrapolate missing values at the start of a series.)
> sales.data.extrapolated <- sales.data %>%
+ group_by(Product, Region) %>%
+ arrange(Product, Region, Year) %>%
+ mutate(Extrapolated.sales.manual = case_when(
+ row_number() > max(which(!is.na(Sales))) ~ lag(Sales) * World.growth,
+ TRUE ~ Sales
+ ))
>
> print(sales.data.extrapolated)
# A tibble: 16 × 7
# Groups: Product, Region [3]
Product Region Year Sales World.growth Extrapolated.sales Extrapolated.sales.manual
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1970 NA 0.88 NA NA
2 1 1 1971 142. 0.72 142. 142.
3 1 1 1972 9.61 1.01 9.61 9.61
4 1 1 1973 NA 1.18 11.3 11.3
5 1 1 1974 NA 0.84 9.52 NA
6 1 2 1970 60.1 0.88 60.1 60.1
7 1 2 1971 125. 0.72 125. 125.
8 1 2 1972 NA 1.01 126. 126.
9 1 2 1973 NA 1.18 149. NA
10 1 2 1974 NA 0.84 125. NA
11 1 2 1975 NA 1.23 154. NA
12 1 3 1970 63.3 0.88 63.3 63.3
13 1 3 1971 90.2 0.72 90.2 90.2
14 1 3 1972 107. 1.01 107. 107.
15 1 3 1973 129. 1.18 129. 129.
16 1 3 1974 NA 0.84 108. 108.
As you can see, this only calculates the value for the first missing value and none beyond.
This is a "reduction" problem and must be done step-wise, since "this row's" new value depends on the results of "last row's" value ... the normal vectorized operations in R don't know the updated value for the previous value until after the whole vector has been calculated.
For this, I'll use
Reduce, thoughpurrr::accumulateis nearly a drop-in replacement if you're already usingpurrr.Walkthrough:
Reduceis called eithernrows-1times (ifinit=is not provided) ornrowsif it is provided;prevvalis set to theinit=value we provided (NA) andindis1(the first fromrow_number());init=completely, then on the first iteration,prevvalwould be1andindwould be2, the first two values inrow_number(); not what we wantprevvalis the previous row'sSales, andindis the row number ... they are different "types" of numbers, not to be intermingled, which is why havingprevval=1, ind=2was a problemcoalesce(..)calculates the expected (extrapolated) new value, but only uses it ifSales[ind]isNA; that is,coalescereturns (vector-wise in general, though here it is always length-1) the first non-NAvalue of the arguments providedcoalescebecomes theprevvalon the next iteration of the anon-function (whichindnow being incremented, since we usedrow_number().Reducegives you the last value of the iterations, but we want every step of the way, so we addaccumulate=TRUE;init=, the return vector is actually lengthnrows+1, and we know that the first is a throw-away (because we usedinit=), so we remove it with[-1]The
purrr::accumulateversion, if you're curious:Moving the
row_number()as the first argument, using.init=as we didinit=, and keeping[-1]as before.