Read Excel file with highlighted cells using R

Question

Read Excel file with highlighted cells using R

895 views Asked by Flore At 27 January 2022 at 13:08

I have a lot of long Excel files and it is too hard to handle them manually. I'm reading them in R to identify the highlighted yellow cells as in the image

The objective: is to loop over the days and the hours in the file in order to construct a data frame that indicates the option according to the hour as

I am following these answers: answer1, answer2, answer3 to do the job using the libraries xlsx, openxlsx and tidyr

library(xlsx)
library(openxlsx)
library(tidyr)

wb     <- loadWorkbook("active.xlsx") #the table is saved in the file active.xlsx
sheet1 <- getSheets(wb)[[1]]
rows  <- getRows(sheet1)
cells <- getCells(rows)
styles <- sapply(cells, getCellStyle)
cellColor <- function(style) 
   {
    fg  <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")
    return(rgb)
   }
mycolor <- (yellow = "ffff00")
m     <- match(sapply(styles, cellColor), mycolor)

But the data is neither read nor processed correctly and the code is not yielding the needed result, I am not even close!

Is it possible to guide me and link a tutorial or a package in R which I can use to detect the highlighted cells and to construct the required dataframe?

Original Q&A

There are 2 answers

**Paul van Leeuwen** · Answer 1 · 2022-01-27T13:26:39+00:00

Paul van Leeuwen On 27 January 2022 at 13:26

Check out the free online book'Spreadsheet Munging Strategies':

https://nacnudus.github.io/spreadsheet-munging-strategies/ with the tidyxl package.

In the case of colored cells:

https://nacnudus.github.io/spreadsheet-munging-strategies/tidy-formatted-cells.html

**Jonni** · Answer 2 · 2023-03-19T21:12:28+00:00

I made an excel file similar to one shown named openxlsx highlighted cell.xlsx

Using openxlsx, create a workbook then load existing "openxlsx highlighted cell.xlsx". This will preserve the format of the excelworkbook, but not reading in the data. If using RStudio IDE you can look and View the existing styles in the workbook

library(tidyverse)
library(openxlsx)

wb <- createWorkbook()
wb <- openxlsx::loadWorkbook(file = "path_to_file\\openxlsx highlighted cell.xlsx",isUnzipped = F)

wb$styleObjects  #shows all styles

Looking at styles, the second style has the yellow highlighted cell as indicated by rgb value, it also lists the corresponding rows and column coordinates of the cells with this format.

Next, make vectors of the rows and columns, these will be used later

highlighted_rows <- wb$styleObjects[[2]]$rows
highlighted_cols <- wb$styleObjects[[2]]$cols

Next, read in the data as a separate workbook, this will read in the data as it appears in the excel. since one of these is a merged cell and is important for your date, specify that the merge cell should be filled. This will enable you to track the date/option combination. Set colnames to FALSE since the first row will be used as data and not column names

wb_read <- createWorkbook()
wb_read <- readWorkbook(xlsxFile = "path_to_file\\openxlsx highlighted cell.xlsx",fillMergedCells = T,colNames = FALSE)

Excel did not like the time row so I had to rewrite the time values (also converted 12:30:00 to 00:30:00), but this could be due to how I formatted the excel originally

times <- seq(from=as.POSIXct("2005-02-05 00:30:00"),to=as.POSIXct("2005-02-05 09:30:00"),by="30 min") %>% 
    substr(start = 12, stop = 20)  

wb_read[1,3:21] <- times

Only applied this to first row since row 15 is duplicate of row 1 for time

You essentially will need indices of where the times and options are located in their respective series. So we make two vectors, one that has the time position - this will include the initial A and B columns that don't have time values

times_positions <- wb_read[1,]
times_positions <- times_positions %>% 
    t()

And similarly where the option value is and date value (this is the merged cell that was filled in with earlier code)

option_index <- wb_read %>% 
    pull(X2)

date_index <- wb_read %>% 
    pull(X1)

Since you also want to retain NA values or times with there is no highlighted cell, you'll also need an expanded dataset to later join the highlighted cell data with. This code could be improved with a function, but I left as is for demo/there were just 2 dates:

date1 <- seq(from=as.POSIXct("2005-01-05 00:30:00"),to=as.POSIXct("2005-01-05 09:30:00"),by="30 min")
date2 <- seq(from=as.POSIXct("2005-02-05 00:30:00"),to=as.POSIXct("2005-02-05 09:30:00"),by="30 min") 

date1 <- as_tibble(date1) %>% 
    set_names("X")
date2 <- as_tibble(date2) %>% 
    set_names("X")

all_date_times <- bind_rows(date1,date2)

Then it is just bringing together the highlighted rows/cells indexing the vectors for what that should be, and joining with all combinations to identify where there would be NA highlights.

df <- as_tibble(bind_cols(highlighted_rows,highlighted_cols)) %>% 
    set_names("rows", "columns") %>% 
    mutate(Option = option_index[rows],
           time = times_positions[columns],
           date = mdy(date_index[rows])) %>%   #change to date here
    dplyr::select(Option, time, date) %>% 
    mutate(date_time = paste(date,time) %>% strptime(.,format = "%Y-%m-%d %H:%M:%S")) %>% 
    right_join(all_date_times, by = c("date_time" = "X")) %>% 
    arrange(date_time) %>% 
    dplyr::select(date_time, Option) %>% 
    rename("Date" = "date_time")
    
df %>% 
View()

TechQA.

Read Excel file with highlighted cells using R

There are 2 answers

Related Questions in R

Related Questions in TIDYR

Related Questions in XLSX

Related Questions in READXL

Related Questions in R-XLSX

Popular Questions

Trending Questions