I have downloaded a set of files from the internet using an R script that runs through a csv containing download links. However, each downloaded file is named after the last part of the link in the csv column titled external, instead of after the ID that I want.
My data looks like this:
| ID | external |
| -------- | --------------------------------------------- |
| ABC_101 | https://peaches.com/12345_download |
| ABC_102 | https://peaches.com/123456_download |
So when the file downloads it is named 12345_download, but I want it to be named ABC_101. I am working with over 1,000 files, so ideally I would write an R script that matches the last part of each link in the external column to the downloaded file and renames that file to the corresponding value in the ID column.
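To make the goal concrete, here is a minimal sketch of the mapping on the two example rows. It assumes the downloaded file name is exactly the last path segment of the link; the data frame `links` and the names `downloaded_as` and `id_lookup` are hypothetical stand-ins, not part of the real csv.

```r
# Hypothetical stand-in for the real csv: the two example rows from the table.
links <- data.frame(
  ID = c("ABC_101", "ABC_102"),
  external = c("https://peaches.com/12345_download",
               "https://peaches.com/123456_download"),
  stringsAsFactors = FALSE
)

# basename() keeps everything after the last "/", which is the name the
# downloaded file ends up with.
links$downloaded_as <- basename(links$external)

# Named vector used as a lookup table: downloaded name -> wanted ID.
id_lookup <- setNames(links$ID, links$downloaded_as)

id_lookup[["12345_download"]]  # "ABC_101"
```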

    library(dplyr)
    library(stringdist)
    library(writexl)

    # set working directory for project to access files
    setwd("/home2/peach/")

    # read files in folder and get list of file names
    # (the dot is escaped so the pattern matches a literal ".jpg")
    file_names <- list.files(path = "peaches_downloads/downloads/",
                             all.files = TRUE,
                             full.names = TRUE,
                             recursive = TRUE,
                             pattern = "\\.jpg") %>%
      data.frame(paths = .)

    # extract the part of each file name that comes before .jpg
    # (directory and "_download" suffix removed) and add it as a column
    file_names$match.name <- file_names %>%
      pull(paths) %>%
      basename() %>%
      gsub(pattern = "\\.jpg.*", replacement = "") %>%
      gsub(pattern = "_download", replacement = "")

    # read in excel/csv file with names to change to
    name_data <- read.csv("peaches_downloads/xlsx/fruit_full_dump_.csv")

    # matching with external
    # extract the last part of each external link so it can be matched
    # against the file names; the external column itself is left unchanged
    name_data <- name_data %>%
      mutate(external_match = external %>%
               basename() %>%
               gsub(pattern = "\\.jpg.*", replacement = "") %>%
               gsub(pattern = "_download", replacement = ""))

    # TRUE where a downloaded file has a matching link in the csv
    check2 <- data.frame(check2 = file_names$match.name %in% name_data$external_match)

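One possible way to do the renaming itself, offered as a sketch rather than a definitive solution. It assumes each file's name (without its extension) equals the last part of its link, keeps whatever extension the file already has, and defaults to a dry run that only returns the planned renames. The function `rename_by_id`, its `dry_run` argument, and the temporary demo folder with two empty files are hypothetical and exist only for illustration.

```r
# Hypothetical demo setup: a temporary folder holding two fake downloads.
Your_files_directory <- file.path(tempdir(), "downloads_demo")
unlink(Your_files_directory, recursive = TRUE)
dir.create(Your_files_directory)
file.create(file.path(Your_files_directory,
                      c("12345_download.jpg", "123456_download.jpg")))

name_data <- data.frame(
  ID = c("ABC_101", "ABC_102"),
  external = c("https://peaches.com/12345_download",
               "https://peaches.com/123456_download"),
  stringsAsFactors = FALSE
)

rename_by_id <- function(dir, name_data, dry_run = TRUE) {
  old_paths <- list.files(dir, full.names = TRUE, recursive = TRUE)

  # Matching key on the file side: file name without directory or extension.
  old_keys <- tools::file_path_sans_ext(basename(old_paths))
  # Matching key on the csv side: last part of the link, extension removed.
  csv_keys <- tools::file_path_sans_ext(basename(name_data$external))

  idx <- match(old_keys, csv_keys)  # NA where a file has no link in the csv
  found <- !is.na(idx)

  # Keep the original extension (if any) on the new name.
  ext <- tools::file_ext(basename(old_paths[found]))
  ids <- as.character(name_data$ID[idx[found]])
  new_names <- ifelse(nzchar(ext), paste0(ids, ".", ext), ids)

  plan <- data.frame(
    from = old_paths[found],
    to = file.path(dirname(old_paths[found]), new_names),
    stringsAsFactors = FALSE
  )

  # Refuse to overwrite: two files mapping to one ID, or a target that exists.
  stopifnot(!anyDuplicated(plan$to), !any(file.exists(plan$to)))

  if (!dry_run && nrow(plan) > 0) {
    plan$renamed <- file.rename(plan$from, plan$to)
  }
  plan
}

# Inspect the plan first, then rename for real.
plan <- rename_by_id(Your_files_directory, name_data)
result <- rename_by_id(Your_files_directory, name_data, dry_run = FALSE)
```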
Where "Your_files_directory" is the directory you downloaded your files to, and name_data is your data, which contains ID and external.