I am working with a clinical data table that contains hospital episodes for a few hundred thousand people across an approximately 15-year period. I have cleaned the table so that it has one row per episode, each containing the date of the episode and the associated primary and secondary diagnoses (up to 20 secondary diagnoses). The diagnosis columns contain strings of ICD-10 disease codes.
I am interested in extracting data on approximately 60 conditions from this table, each of which is defined by a variable number of clinical codes (e.g. asthma has five associated ICD-10 codes, anxiety three, Lyme disease three). I'd like to write a function that searches across all of the primary and secondary diagnosis columns, checks for any of the strings associated with each of the 60 conditions, and creates a binary output per condition, coded 1 if any of the associated ICD codes are present and 0 otherwise.
I've worked out how to do this on the primary diagnosis column only ('diag_icd10') using grepl within ifelse within mutate, but that means writing a separate line of code for each of the 60 conditions. A shortened example is below:
anxiety_codes <- c("F400", "F401", "F402")
asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
chronic_lyme_codes <- c("A692", "G630", "M012")
# Checking primary diagnosis column 'diag_icd10'
data <- data |>
  mutate(
    anxiety_prim = ifelse(grepl(paste(anxiety_codes, collapse = '|'), diag_icd10), 1, 0),
    asthma_prim = ifelse(grepl(paste(asthma_codes, collapse = '|'), diag_icd10), 1, 0),
    lymedisease_prim = ifelse(grepl(paste(chronic_lyme_codes, collapse = '|'), diag_icd10), 1, 0)
  )
My problems are:
- I'd like to make this a function that checks for all 60 conditions and creates the binary condition_prim variables without having to write 60+ lines of code.
- I'd also like to create a similar function that checks for the same string patterns across multiple secondary diagnosis columns (21 columns total, 'sec_diag_0' to 'sec_diag_20') and records in a condition_sec variable whether each condition was present or not.
- Finally, I'd like to amalgamate the two, checking across both the primary and secondary columns and recording in a single condition variable whether each condition was present or not.
Here is one approach using example data. You can make a named list of your diagnosis vectors:
Which looks like this:
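A minimal sketch of such a named list, reusing the three code vectors from the question. The list name dx_list and the element names are assumptions chosen for illustration; the element names later become the prefixes of the indicator columns.

```r
# Hypothetical list name; the codes are copied from the question.
dx_list <- list(
  anxiety     = c("F400", "F401", "F402"),
  asthma      = c("J450", "J451", "J458", "J459", "J46X"),
  lymedisease = c("A692", "G630", "M012")
)

print(dx_list)
#> $anxiety
#> [1] "F400" "F401" "F402"
#>
#> $asthma
#> [1] "J450" "J451" "J458" "J459" "J46X"
#>
#> $lymedisease
#> [1] "A692" "G630" "M012"
```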
I made an example data.frame with an id, the primary diagnosis code, and a couple of secondary diagnostic codes. Here is the data.frame:
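A small made-up data set of this shape might look as follows. The rows are invented for illustration only; the column names follow the question ('diag_icd10', 'sec_diag_0', 'sec_diag_1'), and the name df is an assumption.

```r
# Invented example rows; NA marks an episode with no code in that column.
df <- data.frame(
  id         = 1:5,
  diag_icd10 = c("F400", "J450", "A692", "I10", "J46X"),
  sec_diag_0 = c("J451", NA, "F401", "E119", NA),
  sec_diag_1 = c(NA, "G630", NA, NA, "K219"),
  stringsAsFactors = FALSE
)
```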
You can use sapply through your diagnosis list, first checking the primary diagnosis column and changing the column names to include "prim" for primary. For secondary diagnoses, use Reduce and lapply through multiple columns, then label with a suffix of "sec" for secondary.
The result looks like this:
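One way this step could be written in base R is sketched below. The data, the names dx_list, df, sec_cols and df_result, and the helper pattern_for are assumptions for illustration. grepl returns FALSE for NA, so empty secondary slots count as "not present". Note that sapply only returns a matrix here because the data have more than one row.

```r
dx_list <- list(
  anxiety     = c("F400", "F401", "F402"),
  asthma      = c("J450", "J451", "J458", "J459", "J46X"),
  lymedisease = c("A692", "G630", "M012")
)
df <- data.frame(
  id         = 1:5,
  diag_icd10 = c("F400", "J450", "A692", "I10", "J46X"),
  sec_diag_0 = c("J451", NA, "F401", "E119", NA),
  sec_diag_1 = c(NA, "G630", NA, NA, "K219"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one regex alternation per condition, e.g. "F400|F401|F402".
pattern_for <- function(codes) paste(codes, collapse = "|")

# Primary diagnosis: one 0/1 column per condition.
prim <- sapply(dx_list, function(codes) {
  as.integer(grepl(pattern_for(codes), df$diag_icd10))
})
colnames(prim) <- paste0(colnames(prim), "_prim")

# Secondary diagnoses: test every sec_diag_* column, then OR the results together.
sec_cols <- grep("^sec_diag_", names(df), value = TRUE)
sec <- sapply(dx_list, function(codes) {
  hits <- lapply(df[sec_cols], function(col) grepl(pattern_for(codes), col))
  as.integer(Reduce(`|`, hits))
})
colnames(sec) <- paste0(colnames(sec), "_sec")

df_result <- cbind(df, prim, sec)
```

With all 60 conditions in the list, the same few lines produce all 120 indicator columns. Because grepl matches substrings, a code such as "J450" would also match a longer stored code that contains it.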
You can combine like columns and add additional summary (primary or secondary diagnosis):
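A sketch of that combining step, assuming the indicator columns follow the condition_prim / condition_sec naming pattern. The small df_result built here is invented so the example runs on its own, and the name any_dx is an assumption.

```r
# Invented indicator columns standing in for the earlier result.
df_result <- data.frame(
  id           = 1:3,
  anxiety_prim = c(1L, 0L, 0L),
  anxiety_sec  = c(0L, 0L, 1L),
  asthma_prim  = c(0L, 1L, 0L),
  asthma_sec   = c(1L, 0L, 0L)
)

# Recover the condition names by stripping the _prim / _sec suffixes.
flag_cols  <- grep("_(prim|sec)$", names(df_result), value = TRUE)
conditions <- unique(sub("_(prim|sec)$", "", flag_cols))

# A condition is present if it was flagged as primary OR secondary.
any_dx <- sapply(conditions, function(cond) {
  as.integer(df_result[[paste0(cond, "_prim")]] == 1 |
             df_result[[paste0(cond, "_sec")]] == 1)
})
```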
Which has this output (can be added to df_result if desired):
Or, if you want to just do all columns at once, try:
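A possible all-in-one version is sketched below: it treats the primary column and every sec_diag_* column as one set and returns a single indicator per condition. The data and the names dx_list, df, dx_cols, any_dx and df_all are assumptions for illustration.

```r
dx_list <- list(
  anxiety     = c("F400", "F401", "F402"),
  asthma      = c("J450", "J451", "J458", "J459", "J46X"),
  lymedisease = c("A692", "G630", "M012")
)
df <- data.frame(
  id         = 1:5,
  diag_icd10 = c("F400", "J450", "A692", "I10", "J46X"),
  sec_diag_0 = c("J451", NA, "F401", "E119", NA),
  sec_diag_1 = c(NA, "G630", NA, NA, "K219"),
  stringsAsFactors = FALSE
)

# Primary column plus all secondary columns.
dx_cols <- c("diag_icd10", grep("^sec_diag_", names(df), value = TRUE))

# For each condition: search every diagnosis column, then OR the hits row-wise.
any_dx <- sapply(dx_list, function(codes) {
  pattern <- paste(codes, collapse = "|")
  hits <- lapply(df[dx_cols], function(col) grepl(pattern, col))
  as.integer(Reduce(`|`, hits))
})

df_all <- cbind(df, any_dx)
```

On the real table the same code should work unchanged as long as the secondary columns share the sec_diag_ prefix; with a few hundred thousand people and 21 secondary columns, it may be worth timing it on a subset first.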