Function to search multiple strings across multiple columns and create binary output variables

Question

Function to search multiple strings across multiple columns and create binary output variables

61 views Asked by L.Steell At 27 February 2024 at 12:47

I am working with a clinical data table that contains hospital episodes for a few hundred thousand people across an approx 15 year period. I have cleaned my data table to now be formatted as one row per episode, each of which contains the date of the episode and the associated primary and secondary diagnoses (up to 20 secondary diagnoses). The diagnoses columns contain strings of ICD-10 disease codes.

I am interested in extracting data on approximately 60 conditions from this table, each of which is made up of a variable number of clinical codes (e.g. asthma has five associated ICD-10 codes, anxiety 3 codes, lyme disease 3 codes..). I'd like to write a function that searches across all of the primary and secondary diagnosis columns, checks for any of the strings associated with the 60 conditions, and creates a binary yes/no output which is coded 1 if any of the associated ICD codes are present, or otherwise 0.

I've worked out how to do this on the primary diagnoses column only ('diag_icd10') using grepl within ifelse within mutate, but having to write across many many lines of code to get what I want for all 60 conditions. Shortened example below...

anxiety_codes <- c("F400", "F401", "F402")
asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
chronic_lyme_codes <- c("A692", "G630", "M012")

# Checking primary diagnosis column 'diag_icd10'
data <- data |> 
  mutate(
    anxiety_prim = ifelse(grepl(paste(anxiety_codes, collapse = '|'), diag_icd10), 1, 0),
    asthma_prim = ifelse(grepl(paste(asthma_codes, collapse = '|'), diag_icd10), 1, 0),
    lymedisease_prim = ifelse(grepl(paste(chronic_lyme_codes, collapse = '|'), diag_icd10), 1, 0))

My problems are:

I'd like to make this a function that checks for all 60 conditions and creates the binary condition_prim variables without having to write 60+ lines of code.
I'd also like to create a similar function that checks for the same string patterns but across multiple secondary diagnosis columns (21 columns total... 'sec_diag_0' : 'sec_diag_20') and returns to a "condition"_sec variable whether the conditions were present or not.
To amalgamate the two, checking across both the primary and secondary columns and returning to a single condition variable whether the conditions were present or not.

Original Q&A

There are 2 answers

G. Grothendieck On 27 February 2024 at 14:22

Create a data frame, codesDF, that associates codes and conditions and also create a test input data frame DF. Then join the them and use model.matrix.

library(dplyr)

# codes has icd10 and condn columns and data has icd10 column
make01 <- function(data, codes) {
  data %>%
    left_join(codes) %>%
    mutate(model.matrix(~ condn - 1, .) %>% as.data.frame) %>%
    setNames(sub("condn(.*)_codes", "\\1", x = names(.)))
}

# test - create inputs codesDF and DF and then run make01

anxiety_codes <- c("F400", "F401", "F402")
asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
chronic_lyme_codes <- c("A692", "G630", "M012")
codesDF <- mget(ls(pattern = "_codes$")) %>%
  stack() %>%
  setNames(c("icd10", "condn"))

# test input data frame
set.seed(123)
DF <- data.frame(icd10 = sample(codesDF[[1]], 6))

make01(DF, codesDF)

giving

Joining with `by = join_by(icd10)`
  icd10              condn anxiety asthma chronic_lyme
1  F402      anxiety_codes       1      0            0
2  M012 chronic_lyme_codes       0      0            1
3  F401      anxiety_codes       1      0            0
4  J458       asthma_codes       0      1            0
5  G630 chronic_lyme_codes       0      0            1
6  J451       asthma_codes       0      1            0

**Ben** · Accepted Answer · 2024-02-27T23:35:53+00:00

Here is one approach using example data. You can make a named list of your diagnosis vectors:

anxiety_codes <- c("F400", "F401", "F402")
asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
chronic_lyme_codes <- c("A692", "G630", "M012")

dx_list <- mget(ls(pattern = "(\\w+)_codes"))
names(dx_list) <- sub("_codes$", "", names(dx_list))

dx_list

Which looks like this:

$anxiety
[1] "F400" "F401" "F402"

$asthma
[1] "J450" "J451" "J458" "J459" "J46X"

$chronic_lyme
[1] "A692" "G630" "M012"

I made an example data.frame with an id, the primary diagnosis code, and a couple of secondary diagnostic codes.

df <- data.frame(
  id = 1:3,
  diag_icd10 = c("F400", "J458", "M012"),
  sec_diag_0 = c("J450", NA, "J46X"),
  sec_diag_1 = c(NA, NA, "F401")
)

Here is the data.frame:

  id diag_icd10 sec_diag_0 sec_diag_1
1  1       F400       J450       <NA>
2  2       J458       <NA>       <NA>
3  3       M012       J46X       F401

You can use sapply through your diagnosis list, and first check the primary diagnosis column and change column name to include "prim" for primary. For secondary diagnoses, use Reduce and lapply through multiple columns, then label with suffix of "sec" for secondary.

df_result <- cbind(
  df[1],
  setNames(
    as.data.frame(sapply(dx_list, \(x) +(df[,2] %in% x))),
    paste(names(dx_list), "prim", sep = "_")
  ),
  setNames(
    as.data.frame(sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,3:4], `%in%`, x)))),
    paste(names(dx_list), "sec", sep = "_")
  )
)

The result looks like this:

  id anxiety_prim asthma_prim chronic_lyme_prim anxiety_sec asthma_sec chronic_lyme_sec
1  1            1           0                 0           0          1                0
2  2            0           1                 0           0          0                0
3  3            0           0                 1           1          1                0

You can combine like columns and add additional summary (primary or secondary diagnosis):

sapply(split.default(df_result[-1], sub("_(prim|sec)$", "", names(df_result[-1]))),
       \(x) +(rowSums(x) > 0))

Which has this output (can be added to df_result if desired):

     anxiety asthma chronic_lyme
[1,]       1      1            0
[2,]       0      1            0
[3,]       1      1            1

Or, if you want to just do all columns at once, try:

cbind(
  df[1],
  setNames(
    as.data.frame(sapply(dx_list, \(x) +(df[,2] %in% x))),
    paste(names(dx_list), "prim", sep = "_")
  ),
  setNames(
    as.data.frame(sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,3:4], `%in%`, x)))),
    paste(names(dx_list), "sec", sep = "_")
  ),
  sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,2:4], `%in%`, x)))
)

TechQA.

Function to search multiple strings across multiple columns and create binary output variables

There are 2 answers

Related Questions in R

Related Questions in FUNCTION

Related Questions in GREPL

Popular Questions

Trending Questions