Extracting character sequences from text using the stringr package in R

I have a text column named 'OBSERVA'. Somewhere in the text there may be a sequence of 8 digits, a code that I would like to extract to fill another column. For example, one of the rows in the OBSERVA column contains: 'DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.' In this case, I need to extract the number 29932940. I used the 'str_extract' function from the 'stringr' package, but the result is not satisfactory: the 8-digit sequence is not identified and I only get NAs.

library(stringr)
dados_sivep_tratados$Teste  <- ifelse(
  dados_sivep_tratados$NU_DO == 0 & !is.na(dados_sivep_tratados$OBSERVA),
  str_extract(dados_sivep_tratados$OBSERVA, "\\b\\d{8}\\b"),
  NA
)

There is 1 answer

Answer by geek45 (best answer)

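Example that handles numbers of different lengths before the hyphen. The lookahead (?=-) matches a run of digits only when it is immediately followed by '-':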

library(stringr)

df <- data.frame(
  OBS = c(
    "DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
    "DO 29932967840-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL."
  )
)

df$ExtractedNumber <- str_extract(df$OBS, "\\d+(?=-)")

print(df$ExtractedNumber)
# [1] "29932940"    "29932967840"
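
If the code must be exactly 8 digits, as the question states, the lookahead above can be combined with a fixed count and a negative lookbehind so that the tail of a longer number is not picked up. The following is only a sketch under assumptions: the data frame `dados` and its values are made up for illustration, and the fill condition simply mirrors the one in the question.

```r
library(stringr)

# Hypothetical data mirroring the question's columns (values are invented)
dados <- data.frame(
  NU_DO = c(0, 0, 12345678, 0),
  OBSERVA = c(
    "DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
    "DO 29932967840-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
    "DO 11111111-1 CODE ALREADY RECORDED",
    "NO CODE IN THIS NOTE"
  ),
  stringsAsFactors = FALSE
)

# Exactly eight digits: not preceded by another digit, followed by a hyphen
codigo <- str_extract(dados$OBSERVA, "(?<!\\d)\\d{8}(?=-)")

# Same condition as in the question: fill only where NU_DO is 0
# and OBSERVA is not missing; every other row gets NA
dados$Teste <- ifelse(
  dados$NU_DO == 0 & !is.na(dados$OBSERVA),
  codigo,
  NA_character_
)

print(dados$Teste)
```

Only the first row is filled here: the second has 11 digits before the hyphen, the third has a non-zero NU_DO, and the fourth contains no code. Note that if NU_DO itself is NA, the condition evaluates to NA and ifelse() returns NA for that row, which is worth checking in the real data.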