R difftime() producing incorrect intervals

74 views Asked by At

The data was imported from a CSV file, with the dates originally in characters. I have mutated them separately as I want to do some further analysis on them.

The only other question on the topic I can find, the answer is complicated by needing to allow for pre-1970 dates, but my data is from the last few years, so that isn't a consideration.

EARLIEST_DATE <- c(06/07/2021, 15/11/2019, 01/10/2019, 24/09/2019, 10/11/2020, 21/09/2021, 21/07/2020, 18/10/2022, 05/10/2021, 22/10/2021)
LATEST_DATE <- c(28/09/2021, 20/12/2019, 31/03/2020, 18/09/2020, 15/06/2021, 28/06/2022, 04/09/2020, 28/02/2023, 02/11/2021, 08/07/2022)


library(dplyr)    

df <- data.frame(EARLIEST_DATE, LATEST_DATE)

df %>%
  mutate(EARLIEST_DATE = as.Date(EARLIEST_DATE,'%d/%m/%Y'),
         LATEST_DATE = as.Date(LATEST_DATE,'%d/%m/%Y'))

df %>%
  summarise(weeks = ceiling(as.numeric(difftime(EARLIEST_DATE, LATEST_DATE, units = "weeks"))))

I get the following results:

1 -1157
2  -265
3    78
4   209
5  -239
6  -352
7   774
8   970
9   152
10  744

I can't work out why it's giving such wildly incorrect values.

They should all be positive, and none of the intervals is more than a year, so it's not even as if it's presenting the difference in days. The ceiling clause is so that the value returned for weeks is at least 1.

I can calculate the differences fine in Excel, but the file is so large it keeps crashing whenever I try to do anything, hence using R.

1

There are 1 answers

0
NicChr On

Try parsing them as strings before the calculation. Using lubridate's dmy() to parse day-month-year formatted strings, I seem to get a more sensible output.

library(dplyr)
library(lubridate)
EARLIEST_DATE <- dmy(c("06/07/2021", "15/11/2019", "01/10/2019", "24/09/2019", "10/11/2020", "21/09/2021", "21/07/2020", "18/10/2022", "05/10/2021", "22/10/2021"))
LATEST_DATE <- dmy(c("28/09/2021", "20/12/2019", "31/03/2020", "18/09/2020", "15/06/2021", "28/06/2022", "04/09/2020", "28/02/2023", "02/11/2021", "08/07/2022"))


df <- tibble(EARLIEST_DATE, LATEST_DATE)


df %>%
  reframe(weeks = ceiling(as.numeric(difftime(LATEST_DATE, EARLIEST_DATE, units = "weeks"))))
#> # A tibble: 10 x 1
#>    weeks
#>    <dbl>
#>  1    12
#>  2     5
#>  3    26
#>  4    52
#>  5    31
#>  6    40
#>  7     7
#>  8    19
#>  9     4
#> 10    37

Created on 2023-04-12 with reprex v2.0.2