Problem I'm trying to solve: I am attempting to write a function that iteratively updates certain rows in a master dataframe. This has to work with .RData files.
Issue: When you load a .RData file with load(), all of its objects are loaded into the environment. I need to select a single dataframe and then perform a series of anti_join() and semi_join() calls to update certain rows.
Reprex problem setup:
library(dplyr)
set.seed(19572)
n <- 40
old_dat1 <- rnorm(n) |>
  as.data.frame(length = n) |>
  cbind(group = sample(rep(c(1, 2, 3), length.out = n))) |>
  cbind(timepoint = sample(rep(c(1, 2), length.out = n))) |>
  cbind(condition = sample(rep(c(1, 2), length.out = n))) |>
  mutate(idx = seq_len(n))
set.seed(11232)
n <- 10
old_dat2 <- rnorm(n) |>
  as.data.frame(length = n) |>
  cbind(group = sample(rep(c(1, 2, 3), length.out = n))) |>
  mutate(idx = seq_len(n))
save(old_dat1,
     old_dat2,
     file = "dat.RData")
Here's what the original data looks like:
> head(old_dat1)
rnorm(n) group timepoint condition idx
1 -0.85557487 3 2 2 1
2 -0.32528867 3 1 1 2
3 -0.07737735 3 1 2 3
4 1.31808404 1 2 2 4
5 -0.41356510 3 2 2 5
6 0.77771936 2 2 2 6
Now we want to load old_dat1, update some rows, and merge it back into the original dataframe. Because load() restores everything in the file, we get both old_dat1 and old_dat2, even though we don't currently need old_dat2. We then declare the group and timepoint for the current analysis.
load("dat.RData")
> ls()
[1] "old_dat1" "old_dat2"
group = 1
timepoint = 1
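As a side note on how load() behaves: it accepts an envir argument and invisibly returns the names of the objects it restored, so the contents of an .RData file can be inspected without touching the global environment. A minimal sketch, using a temporary file and two made-up dataframes, a and b (both hypothetical stand-ins for old_dat1 and old_dat2):

```r
# Hypothetical stand-ins for old_dat1 / old_dat2
a <- data.frame(idx = 1:3, x = c(0.1, 0.2, 0.3))
b <- data.frame(idx = 1:2)
rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)

# Restore into a fresh environment rather than the global one
e <- new.env()
loaded_names <- load(rdata_path, envir = e)

loaded_names          # names of every object stored in the file
only_a <- e[["a"]]    # pull out a single dataframe by name
```

Whether this fits the workflow below is a separate matter; it only illustrates what load() does with a file that holds several objects.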
We start with the dataframes imported with load() and then run a series of calculations for each group x timepoint combination. We calculate a new DV for a subset of rows, which looks something like this:
## More Fake Data
tmp <- rbind(
  old_dat1 |>
    filter(group == .GlobalEnv$group &
             timepoint == 1 & condition == 1) |>
    mutate(new_dv = sample(rep(
      c("res_a", "res_b"), length.out = n()
    ))),
  old_dat1 |>
    filter(group == .GlobalEnv$group &
             timepoint == 1 & condition == 2) |>
    mutate(new_dv = sample(rep(
      c("res_a", "res_b"), length.out = n()
    )))
)
> head(tmp)
rnorm(n) group timepoint condition idx new_dv
1 -0.3696799 1 1 1 15 res_b
2 0.7211805 1 1 1 21 res_a
3 -0.6030503 1 1 1 31 res_a
4 -0.9662953 1 1 2 23 res_b
5 0.3723939 1 1 2 33 res_a
6 2.1657427 1 1 2 36 res_b
We may or may not repeat this across different groups, timepoints, etc. depending on what data is available per subject since not every subject will see every combination of conditions.
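To make "only the combinations that are available" concrete, here is a rough sketch of iterating over the group x timepoint pairs that actually occur in the data. The dataframe toy and its values are made up for illustration, and the real new-DV calculation is replaced by a plain subset:

```r
library(dplyr)

# Toy stand-in for old_dat1 (hypothetical values)
toy <- data.frame(
  idx       = 1:8,
  group     = c(1, 1, 2, 2, 3, 3, 1, 2),
  timepoint = c(1, 2, 1, 1, 2, 2, 1, 2),
  condition = c(1, 2, 1, 2, 1, 2, 2, 1)
)

# Only the group x timepoint combinations that actually occur
combos <- distinct(toy, group, timepoint)

# One subset per observed combination; the real calculation would go here
subsets <- lapply(seq_len(nrow(combos)), function(i) {
  filter(toy,
         group == combos$group[i],
         timepoint == combos$timepoint[i])
})
```

Each element of subsets would then play the role of tmp for one pass of the update.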
We're now ready to merge these results back into the original dataframe.
If the data were saved as an .RDS file, the solution would go something like this:
combine_fx <- function(dir,
                       df,
                       grouping,
                       timepoint) {
  # NOTE: condition_new is not defined anywhere in this reprex
  tmp_combined <- readRDS(dir) |>
    # Drop the rows of the master df that are about to be replaced
    filter(!(group == grouping & condition == condition_new)) |>
    # Add the input df/condition to the combined df
    dplyr::bind_rows(df |>
      filter(group == grouping &
               (condition == condition_new | is.na(condition_new))))
  tmp_combined
}
# Where
dir <- "dat.RData"
combine_fx(
  dir = dir,
  df = tmp,
  grouping = .GlobalEnv$group,
  timepoint = .GlobalEnv$timepoint
)
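For comparison, this is the RDS round trip that the function above relies on: saveRDS() stores exactly one object without its name, and readRDS() returns that object, which is why it can sit at the head of a pipe. A minimal sketch with a made-up dataframe master and a temporary file (both hypothetical):

```r
# Hypothetical single dataframe standing in for old_dat1
master <- data.frame(idx = 1:3, group = c(1, 2, 3))
rds_path <- tempfile(fileext = ".rds")

saveRDS(master, rds_path)       # stores one object, without its name
restored <- readRDS(rds_path)   # returns that object, so it can be assigned or piped
```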
My result should be the original dataframe with the new DV column iteratively updated. It would look something like this:
old_dat1_updated_group1 <- anti_join(x = old_dat1, y = tmp) |>
  dplyr::bind_rows(y = tmp) |>
  arrange(idx)
> head(old_dat1_updated_group1)
rnorm(n) group timepoint condition idx new_dv
1 -0.85557487 3 2 2 1 <NA>
2 -0.32528867 3 1 1 2 <NA>
3 -0.07737735 3 1 2 3 <NA>
4 1.31808404 1 2 2 4 <NA>
5 -0.41356510 3 2 2 5 <NA>
6 0.77771936 2 2 2 6 <NA>
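The head() output shows only NA because none of the first six rows belongs to group 1 at timepoint 1. To see the rows that did receive a value, filter on the new column. A small sketch of the same anti_join() / bind_rows() pattern on made-up data (master and updates are hypothetical), joined explicitly on idx:

```r
library(dplyr)

# Toy master table and a toy batch of updated rows (hypothetical values)
master  <- data.frame(idx = 1:5, group = c(3, 1, 2, 1, 3))
updates <- data.frame(idx = c(2L, 4L), group = c(1, 1),
                      new_dv = c("res_a", "res_b"))

updated <- master |>
  anti_join(updates, by = "idx") |>  # drop the rows that are being replaced
  bind_rows(updates) |>              # add the replacements, which carry new_dv
  arrange(idx)

# Rows that received a value
filter(updated, !is.na(new_dv))
```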
BUT, because load() restores multiple dataframes into the environment rather than returning a single object, we can't use the readRDS() line in combine_fx.
So how can I reference a specific dataframe while sticking with .RData instead of .RDS files?