Standardizing address formatting in R


I'm cleaning a medium-sized data set in R (provided to me) that includes address information. I need to remove some of that information but am unsure how to do so, because the text after the ZIP code varies from row to row. Below is a sample:

addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2  75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C")

In essence, I need to transform these addresses into the following format:

"515 DUMMY 1 75253",
"1000 DUMMY 2  75211",
"3948 DUMMY 3 75217",
"4545 DUMMY 4 75217"

Thanks in advance for any help you may be able to provide.


There are 2 answers

r2evans (best answer)

A classic regex approach works here. I'll add one more address that starts with a 5-digit house number to make sure we don't over-remove.

addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2  75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C",
               "45454 DUMMY 4 75217 MAP 68C")
sub("^(.+)\\b(\\d{5})\\b.*", "\\1\\2", addresses)
# [1] "515 DUMMY 1 75253"   "1000 DUMMY 2  75211" "3948 DUMMY 3 75217"  "4545 DUMMY 4 75217"  "45454 DUMMY 4 75217"

Regex:

"^(.+)\\b(\\d{5})\\b.*"
 ^^^^^                    at least one character at the beginning of the
                          string, so that a leading 5-digit house number
                          is not mistaken for the ZIP code (a little fragile)
      ^^^        ^^^      word boundaries
         ^^^^^^^^         exactly five digits ([0-9])
                    ^^    anything else (discarded)

The (...) are capture groups, and \\1\\2 in the replacement puts those two groups back, dropping everything after the ZIP code.
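
To make the "a little fragile" caveat concrete, here is a small sketch. The first input is hypothetical (it is not in the question's data), and the lazy .+? variant with perl = TRUE is an assumed alternative, not part of the answer above. Because .+ is greedy, the pattern keeps everything up to the last standalone 5-digit number, so a 5-digit token after the ZIP code would survive. The lazy version stops at the first standalone 5-digit number that has at least one character before it.

```r
# Hypothetical input: the first address has a 5-digit token after the ZIP code
tricky <- c("515 DUMMY 1 75253 12345",
            "45454 DUMMY 4 75217 MAP 68C")

# Greedy .+ : keeps text up to the LAST standalone 5-digit number
greedy <- sub("^(.+)\\b(\\d{5})\\b.*", "\\1\\2", tricky, perl = TRUE)

# Lazy .+? : stops at the FIRST standalone 5-digit number that has at
# least one character before it, so a leading house number is still skipped
lazy <- sub("^(.+?)\\b(\\d{5})\\b.*", "\\1\\2", tricky, perl = TRUE)

print(greedy)
# [1] "515 DUMMY 1 75253 12345" "45454 DUMMY 4 75217"
print(lazy)
# [1] "515 DUMMY 1 75253"   "45454 DUMMY 4 75217"
```

The lazy variant has its own weak spot: a standalone 5-digit number inside the street part would be taken as the ZIP code, so which variant is safer depends on the data.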

Quick edit: I don't like having to double-backslash everything, so in R 4.0.0 or later, which supports "raw strings", we can do

sub(r"{^(.+)\b(\d{5})\b.*}", r"{\1\2}", addresses)

I think it makes it a little easier to read, though we still need to mentally discard the leading/trailing braces (we can also use r"(..)" or r"[..]"; the delimiter must be one of the pairs (), [], or {}).
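
As a quick check (a sketch, assuming R 4.0.0 or later), the raw-string pattern is the very same string as the double-backslash version, whichever of the delimiter pairs (), [], or {} is used. The variable names are only for illustration.

```r
escaped <- "^(.+)\\b(\\d{5})\\b.*"

# The same pattern written with each raw-string delimiter pair
raw_brace   <- r"{^(.+)\b(\d{5})\b.*}"
raw_paren   <- r"(^(.+)\b(\d{5})\b.*)"
raw_bracket <- r"[^(.+)\b(\d{5})\b.*]"

same <- c(identical(escaped, raw_brace),
          identical(escaped, raw_paren),
          identical(escaped, raw_bracket))
print(same)
# [1] TRUE TRUE TRUE
```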

Roman
# Extract the part of each address up to and including the ZIP code
# (adjust the pattern to your use case).
#
# Pattern pieces:
#   [0-9]{3,}       at least 3 digits
#   [:space:]+      at least 1 whitespace character
#   [A-Za-z]{3,}    at least 3 letters of any case
#   [0-9]{1,}       at least 1 digit
#   [0-9]{5}        exactly 5 digits (the ZIP code)
#
# Other possible variations:
#   [0-9]{1,5}      1 to 5 digits
#   [A-Z]           one uppercase letter
# More info: https://stringr.tidyverse.org/articles/regular-expressions.html
result <- stringr::str_extract(
    addresses,
    "[0-9]{3,}[:space:]+[A-Za-z]{3,}[:space:]+[0-9]{1,}[:space:]+[0-9]{5}"
)

# Remove all extra spaces
result_cleaned <- stringr::str_replace_all(result, "[:space:]{2,}", " ")
result_cleaned
# [1] "515 DUMMY 1 75253"  "1000 DUMMY 2 75211" "3948 DUMMY 3 75217" "4545 DUMMY 4 75217"
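
For comparison, the same extract-then-collapse idea can be sketched in base R, without stringr. This is an assumed equivalent, not the code from the answer above: it uses regexpr() and regmatches(), writes the POSIX class in bracket form ([[:space:]]), and matches letters with [A-Za-z]. Note that collapsing whitespace also turns the double space in "1000 DUMMY 2  75211" into a single space, which differs from the expected output shown in the question.

```r
addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2  75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C")

pattern <- "[0-9]{3,}[[:space:]]+[A-Za-z]{3,}[[:space:]]+[0-9]+[[:space:]]+[0-9]{5}"

# Locate the first match in each string (-1 means no match)
m <- regexpr(pattern, addresses)

# Keep NA for non-matching inputs, as stringr::str_extract() does
result <- rep(NA_character_, length(addresses))
result[m != -1] <- regmatches(addresses, m)

# Collapse runs of whitespace into a single space
result_cleaned <- gsub("[[:space:]]{2,}", " ", result)
print(result_cleaned)
```

If the original spacing must be preserved, the gsub() step can simply be skipped and result used directly.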