How to fix "Cannot allocate vector of size..." when using filter-function?


In a university class, I need to work with a pretty big longitudinal data set: the .rds file is around 300 MB, in total 380,000 observations of 5,160 variables. The data set goes back to 1984, but I only need the years from 2012 onward. So to make things easier and more manageable, I want to load the whole data set once, use the filter function to get rid of all the years before 2012, then discard all the variables I don't need with the select function, and save the whole thing into a new, much smaller, more manageable .rds file.

This is my code so far:

library(dplyr)
setwd("F:/data")
pl <- readRDS("pl.rds")

pl <- pl %>% filter(syear > 2012)
saveRDS(pl, file = "pl_2012.rds")

Loading the data set pl does actually work on my desktop PC (on my laptop, I can't even do that), but when I try to use filter I get: "Error: cannot allocate vector of size 14.5 Gb".

I know this means that there's not enough memory for the operation. However, I don't understand why it happens here. The filter function should trim the object down and get rid of all the years I don't need, so the object in the workspace should end up significantly smaller. I purposely assigned the result back to pl itself, to reduce its size rather than create an additional object that takes up more memory. So why do I still get this error, and more importantly, what can I do to fix it? Of course, I already closed every other non-essential task and application in the background to free as much RAM as possible. Is there anything else I can do? I have 16 GB of RAM, other people in my class have 16 GB as well, and for them the same method works just fine... so there must be a way.
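For reference, this is a minimal, self-contained sketch of the order of operations I'm now considering: dropping unneeded columns before the row filter, so each copying step touches far less data. A toy data frame stands in for pl here, and the column names other than syear are placeholders for the variables actually needed:

```r
# Toy stand-in for pl; "syear" matches the real data, the rest are placeholders
pl <- data.frame(syear = c(2010, 2013, 2015),
                 pid   = 1:3,
                 other = c("a", "b", "c"))

keep <- c("syear", "pid")      # hypothetical selection of needed variables
pl <- pl[, keep]               # drop unused columns first: later copies are smaller
gc()                           # hand freed memory back before the big step
pl <- pl[pl$syear > 2012, ]    # the row filter now copies far fewer columns
```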

1 Answer

Tjark van de Merwe

For working with large datasets, the arrow package might provide a solution. See its documentation for some examples.

But in the case of your code, you could use:

library(dplyr)
library(arrow)

setwd("F:/data")
pl <- readRDS("pl.rds")

# define folder to store partitioned data file
dataset_path <- file.path(getwd(), "subset")
if (!dir.exists(dataset_path)) dir.create(dataset_path)

# break up file in smaller subsets
pl %>%
  group_by(syear) %>%
  write_dataset(dataset_path)

rm(pl)
gc()

# check
list.files(dataset_path, recursive = TRUE)

# make connection to data
dset <- open_dataset(dataset_path)

# query lazily: the filter is executed by arrow, and only the
# matching rows are loaded into memory when collect() is called
pl <- dset %>%
  filter(syear > 2012) %>%
  collect()

And you can use this approach not only to filter, but for all kinds of dplyr operations, without needing the full dataset in memory.