Massive dataset - average values by month and location

32 views Asked by At

I would really appreciate some assistance with sorting some data to reduce temporal auto-correlation for a massive dataset that I have (almost 2600 rows). I'm sure this is probably quite an easy thing to do with something like tidyverse but I'm still very new to data cleaning using R and need to learn.

It's a very large table with entomological data... insect trap count, longitude, latitude, date, and a load of environmental variables (some numeric, some categorical). A hypothetical (much smaller) example would be:

df <- data.frame(
insects = round(runif(20)*100),
long = c(rep(22.4,5), rep(18.4,5), rep(21.8,5), rep(20.1,5)),
lat = c(rep(-34.1,5), rep(-34.4,5), rep(-33.8,5), rep(-33.7,5)),
land = c(rep("AG8",5), rep("GA6",5), rep("HE1",5), rep("JA2",5)),
temp = round(runif(20)*15),
NDVI = c(rep(c(48032730,48493073,48449380,48423773,48420000),4)),
precip = round(runif(20)*15),
date = lubridate::as_date(rep(c("2006-01-15", "2006-01-25", "2006-01-04", "2007-02-15", "2007-02-18", "2006-01-15", "2008-02-20", "2008-02-01", "2009-04-08", "2009-04-19"),2)))

I'd like to create a new dataframe, whereby every location (long/lat) which is the same, that has multiple counts in a calendar month, is all merged into one row, whereby the insect count and the environmental variables are all averaged for that month (the categorical values should be the same for every location.

So for this dataframe example above, instead of 20 rows I would end up with: 2 rows for first location (jan 2006 and feb 2007) 3 rows for second location (jan 2006, feb 2008, apr 2009) 2 rows for third location (jan 2006 and feb 2007) 3 rows for fourth location (jan 2006, feb 2008, apr 2009)

Thanks for your help

1

There are 1 answers

1
geek45 On BEST ANSWER

First, create a new column for month and year.
Next, group by location (long, lat) and month_year, and calculate the averages:

df <- df %>%
  mutate(month_year = format(date, "%Y-%m"))

df_averaged <- df %>%
  group_by(long, lat, month_year, land) %>%
  summarise(insects = mean(insects),
            temp = mean(temp),
            NDVI = mean(NDVI),
            precip = mean(precip))

df_averaged <- df_averaged %>%
  mutate(month_year = as.Date(paste0(month_year, "-01")))

print(df_averaged)

> print(df_averaged)
# A tibble: 10 × 8
# Groups:   long, lat, month_year [10]
    long   lat month_year land  insects  temp      NDVI precip
   <dbl> <dbl> <date>     <chr>   <dbl> <dbl>     <dbl>  <dbl>
 1  18.4 -34.4 2006-01-01 GA6      75   11    48032730    1   
 2  18.4 -34.4 2008-02-01 GA6      93.5  6.5  48471226.   6.5 
 3  18.4 -34.4 2009-04-01 GA6      11.5 13    48421886.   9   
 4  20.1 -33.7 2006-01-01 JA2      38   13    48032730    6   
 5  20.1 -33.7 2008-02-01 JA2      45   11    48471226.   7.5 
 6  20.1 -33.7 2009-04-01 JA2      56.5 11    48421886.   6   
 7  21.8 -33.8 2006-01-01 HE1      45.7  4    48325061    7.67
 8  21.8 -33.8 2007-02-01 HE1      28    7    48421886.   7   
 9  22.4 -34.1 2006-01-01 AG8      68.3  9.33 48325061    8.67
10  22.4 -34.1 2007-02-01 AG8      56    8.5  48421886.   9