I have a medium-sized data set (provided to me) containing address information that I'm cleaning in R. I need to remove everything after the ZIP code, but I'm unsure how to do so, because the text that follows the ZIP code varies from row to row. Below is a sample:
addresses <- c("515 DUMMY 1 75253 69AP",
"1000 DUMMY 2 75211",
"3948 DUMMY 3 75217 69Q",
"4545 DUMMY 4 75217 MAP 68C")
In essence, I need to transform these addresses into the following format:
"515 DUMMY 1 75253",
"1000 DUMMY 2 75211",
"3948 DUMMY 3 75217",
"4545 DUMMY 4 75217"
Thanks in advance for any help you may be able to provide.
This looks like a job for a classic regex approach. I'll add one more address that starts with another 5-digit number to make sure we don't over-remove.
Regex:
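A minimal sketch of that approach, assuming the ZIP code is the last standalone 5-digit token in each string (the fifth address and the variable name `cleaned` are made up for illustration):

```r
# Sample data from the question, plus one hypothetical address that
# starts with a 5-digit number to check that we don't over-remove.
addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2 75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C",
               "12345 DUMMY 5 75217 MAP 68C")

# Group 1: everything before the ZIP code (greedy, so it runs up to the
#          LAST standalone 5-digit number in the string).
# Group 2: the 5-digit ZIP code itself, delimited by word boundaries.
# The trailing .* matches whatever follows and is dropped.
cleaned <- sub("^(.*)\\b([0-9]{5})\\b.*$", "\\1\\2", addresses, perl = TRUE)

cleaned
# [1] "515 DUMMY 1 75253"   "1000 DUMMY 2 75211"  "3948 DUMMY 3 75217"
# [4] "4545 DUMMY 4 75217"  "12345 DUMMY 5 75217"
```

Because the first group is greedy, a standalone 5-digit token appearing after the ZIP code would be kept instead, so this only holds if the data never has one.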
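The (...) are saved groups, and \\1\\2 restores those two groups in the replacement.
Quick edit: I don't like having to double-backslash everything, so in a newer R (4.0.0 or later) we can write the pattern and replacement as "raw strings", e.g. r"{..}", with single backslashes.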
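A minimal sketch of the raw-string form, under the same assumption that the ZIP code is the last standalone 5-digit token (the fifth address and the name `cleaned_raw` are made up for illustration; requires R 4.0.0 or later):

```r
addresses <- c("515 DUMMY 1 75253 69AP",
               "1000 DUMMY 2 75211",
               "3948 DUMMY 3 75217 69Q",
               "4545 DUMMY 4 75217 MAP 68C",
               "12345 DUMMY 5 75217 MAP 68C")  # hypothetical extra case

# r"{...}" is a raw string: backslashes inside it are taken literally,
# so \b and \1 no longer need to be doubled.
cleaned_raw <- sub(r"{^(.*)\b([0-9]{5})\b.*$}", r"{\1\2}", addresses, perl = TRUE)

# The raw strings are the same character values as the escaped ones.
identical(r"{\1\2}", "\\1\\2")
# [1] TRUE
```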
I think it makes it a little easier to read, though we still need to mentally discard the leading/trailing braces (we can also use r"(..)" or r"[..]" as delimiters).