Extract data iteratively from subfolders of a Google Drive folder


In a general folder, I have a set of subfolders that I would like to open iteratively in order to download the datasets they contain in .xlsx format.

The main folder contains subfolders whose names follow a specific pattern, and within each subfolder there is a single .xlsx dataset named similarly to the subfolder that contains it.

I was wondering how to extract them with some iterative function. Based on code I found on this forum, I readapted the following for loop, but with no results:

url = 'urlnamexxx'
for (folder in url) {
  temp <- tempfile(fileext = ".xlsx")
  download.file(url, temp)
  readxl::read_xlsx(temp)
}

Could you please give me some suggestions?

If anything is unclear, please comment below and let me know which details I should provide.

1 answer

Accepted answer, by VonC:

The script you provided is intended to download a single Excel file from a given URL, save it as a temporary file, and then read it into R using the readxl::read_xlsx() function. As written, though, for (folder in url) iterates over a one-element character vector, download.file(url, temp) never uses the loop variable folder, and the result of read_xlsx() is discarded rather than assigned. It also cannot work for URLs that point to Google Drive folders rather than files.
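For completeness: if you did have a vector of direct download links (not folder URLs), the loop would need to iterate over that vector and keep each result. A minimal sketch, where the URLs are hypothetical placeholders:

```r
# Hypothetical direct download links; replace with your own
urls <- c("https://example.com/data1.xlsx",
          "https://example.com/data2.xlsx")

datasets <- lapply(urls, function(u) {
  temp <- tempfile(fileext = ".xlsx")
  on.exit(unlink(temp))                # clean up the temp file afterwards
  download.file(u, temp, mode = "wb")  # "wb" avoids corrupting binary files on Windows
  readxl::read_xlsx(temp)
})
```

Note the result of each read_xlsx() call is kept in the list `datasets`, one data frame per URL. This does not apply to Google Drive folder links, which is why the API-based approach below is needed.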

To extract data from Google Drive in a structured manner, you would need to use Google Drive's API, which provides a way to list and download files. For R, there are several packages that can facilitate this, such as googledrive.

That would involve:

  1. Authenticate with Google Drive:

    library(googledrive)
    drive_auth()
    
  2. Identify the parent folder:

    folder <- drive_get("~/path/to/your/parent/folder")
    
  3. List all subfolders (type = "folder" keeps only folders in the listing):

    subfolders <- drive_ls(path = folder, type = "folder")
    
  4. Loop over the subfolders, identifying and downloading the .xlsx files (seq_len() avoids the 1:0 trap when a subfolder contains no .xlsx file):

    for (i in seq_len(nrow(subfolders))) {
        subfolder <- subfolders[i, ]
        files <- drive_ls(path = subfolder)
        xlsx_files <- files[grepl("\\.xlsx$", files$name), ]
        for (j in seq_len(nrow(xlsx_files))) {
            file <- xlsx_files[j, ]
            # drive_download()'s `path` is the destination file, not a directory
            drive_download(file,
                           path = file.path("~/path/to/save/files", file$name),
                           overwrite = TRUE)
        }
    }
    

This script authenticates with Google Drive, locates the parent folder, lists its subfolders, and then iterates over these subfolders to identify and download the .xlsx files they contain.

Do replace "~/path/to/your/parent/folder" and "~/path/to/save/files" with your actual paths.
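As an aside, if your version of googledrive supports the `recursive` argument of `drive_ls()`, the nested loops can be collapsed into a single listing call. A sketch, under that assumption and with the same placeholder paths:

```r
library(googledrive)

folder <- drive_get("~/path/to/your/parent/folder")

# List every .xlsx file anywhere under the parent folder in one call
xlsx_files <- drive_ls(path = folder, pattern = "\\.xlsx$", recursive = TRUE)

for (i in seq_len(nrow(xlsx_files))) {
    drive_download(xlsx_files[i, ],
                   path = file.path("~/path/to/save/files", xlsx_files$name[i]),
                   overwrite = TRUE)
}
```

This trades explicit control over the folder structure for brevity; the nested-loop version above is easier to adapt if you need per-subfolder logic.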

That script assumes that you have already set up Google Drive API credentials and installed the googledrive package (install.packages("googledrive")).
Also make sure you have the necessary permissions to access the files and folders on Google Drive.
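Finally, since the goal is to work with the datasets in R, once the files are downloaded you can read them all in one step. A sketch, where "~/path/to/save/files" stands in for the destination directory you used above:

```r
library(readxl)

# Same placeholder destination directory as in the download step
dest_dir <- "~/path/to/save/files"

xlsx_paths <- list.files(dest_dir, pattern = "\\.xlsx$", full.names = TRUE)

# One data frame per file, named after the file it came from
datasets <- setNames(lapply(xlsx_paths, read_xlsx),
                     tools::file_path_sans_ext(basename(xlsx_paths)))
```

Each dataset is then available as datasets[["name-of-file"]].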