Word Boundaries in R for Word Filtering

I am filtering some pathway names, and I want to include only pathways that contain "HEMATOPOIETIC", "ERYTHROCYTE", "ERYTHROID", "STEM", "STEMNESS", or "HEMATOPOIESIS", and then make a bar plot of the results. This is my script so far:

obese <- gsea_obese %>% 
  filter(FDR.q.val < 0.25) %>% 
  filter(NOM.p.val < 0.05) %>% 
  filter(str_detect(NAME, "HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|STEM|STEMNESS|HEMATOPOIESIS")) %>% 
  arrange(FDR.q.val)

ggplot(obese, aes(reorder(NAME, NES), NES)) +
  geom_col(aes(fill=FDR.q.val)) +
  scale_fill_gradient(low = "blue", high = "red") +
  coord_flip() +
  labs(x="Pathway", y="Normalized Enrichment Score",
       title="Enriched Genes JAK2 vs WT in Obese mice")

In my results, I got unwanted pathways like "REACTOME_NEURONAL_SYSTEM", "REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM", "GOBP_IMMUNE_SYSTEM_DEVELOPMENT", ... and I noticed that this is because the word "SYSTEM" contains "STEM". I tried using "\\b", but then the pattern does not match any pathway names at all. Maybe I have mispositioned "\\b" in the script, so any help would be appreciated!

Edit: Here's a sample of the input:

gsea_obese <- c(
  "RAMALHO_STEMNESS_UP",
  "IVANOVA_HEMATOPOIESIS_INTERMEDIATE_PROGENITOR",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_QTL_CIS",
  "IVANOVA_HEMATOPOIESIS_EARLY_PROGENITOR",
  "RIZ_ERYTHROID_DIFFERENTIATION",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS",
  "RIZ_ERYTHROID_DIFFERENTIATION_CCNE1",
  "REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM",
  "GOBP_ERYTHROCYTE_HOMEOSTASIS",
  "REACTOME_NEURONAL_SYSTEM"
)

The output I expected excludes any pathway that does not match the pattern I specified. I managed to solve it with filter(str_detect(NAME, "HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|\\b STEM|STEMNESS|HEMATOPOIESIS")). However, I also tried filter(str_detect(NAME, "\\b HEMATOPOIETIC|\\b ERYTHROCYTE|\\b ERYTHROID|\\b STEM|\\b STEMNESS|\\b HEMATOPOIESIS")), and I don't understand why this version does not return any match.

There is 1 answer

r2evans

_ does not constitute a word boundary:

grepl("\\ba\\b", "_a_")
# [1] FALSE

This is because \b in regex "matches the empty string at either edge of a word", and a "word" is typically [[:alnum:]_] (it explicitly includes the underscore). (See ?regex for all of that with more explanations.)
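A small sketch of how this plays out on names shaped like the ones in the question (the test strings here are made up for illustration). It also shows what appears to be going wrong with the "\\b STEM" attempt: the space after \\b is part of the pattern, so it demands a literal space that these names never contain.

```r
# "_" counts as a word character, so there is no \b between "_" and "S".
underscore_sep <- grepl("\\bSTEM\\b", "HEMATOPOIESIS_STEM_CELL")
underscore_sep
# [1] FALSE

# With spaces as separators, \b behaves the way the question expected.
space_sep <- grepl("\\bSTEM\\b", "HEMATOPOIESIS STEM CELL")
space_sep
# [1] TRUE

# "\\b STEM" asks for a literal space before STEM, which is absent here.
literal_space <- grepl("\\b STEM", "HEMATOPOIESIS_STEM_CELL")
literal_space
# [1] FALSE
```

That would also explain why the first workaround in the question appeared to succeed: the "\\b STEM" alternative can never match these names, so only the remaining keywords (including STEMNESS) are doing any work.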

We need to refine the regex to look for specific boundary conditions. Using (^|_) matches either the beginning of the string or a _ before the word, and (_|$) does the same for a _ or the end of the string after it:

grepl("(^|_)a(_|$)", c("a_", "_a_", "_a", "_aa_"))
# [1]  TRUE  TRUE  TRUE FALSE

We will also need to wrap the alternatives in a parenthesized group so that the boundary conditions apply to every keyword and not just the first and last. I think that means your regex should look something like this:

"(^|_)(HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|STEM|STEMNESS|HEMATOPOIESIS)(_|$)"

If there are other boundary-like characters, you may want to include them in those book-end patterns. For instance, if a boundary could be a space or underscore, then you might use

"(^|[_ ])(HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|STEM|STEMNESS|HEMATOPOIESIS)([_ ]|$)"

as in:

txt <- c("REACTOME_NEURONAL_SYSTEM", "SOMETHING_STEM_ELSE", "STEM_QUUX")
grepl("(^|[_ ])(HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|STEM|STEMNESS|HEMATOPOIESIS)([_ ]|$)", txt)
# [1] FALSE  TRUE  TRUE
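If listing every possible separator becomes tedious, one alternative (not part of the approach above, and only a sketch) is to use lookarounds with perl = TRUE and say "not preceded or followed by a letter or digit". The names sep_free_ptn and txt2, and the extra test strings, are made up for illustration.

```r
keep <- c("HEMATOPOIETIC", "ERYTHROCYTE", "ERYTHROID",
          "STEM", "STEMNESS", "HEMATOPOIESIS")

# Negative lookbehind/lookahead: the keyword must not touch a letter or digit
# on either side, so "_", " ", "-", and string edges all act as boundaries.
sep_free_ptn <- paste0("(?<![A-Za-z0-9])(", paste(keep, collapse = "|"),
                       ")(?![A-Za-z0-9])")

txt2 <- c("REACTOME_NEURONAL_SYSTEM", "SOMETHING_STEM_ELSE",
          "STEM QUUX", "STEM-CELL", "RAMALHO_STEMNESS_UP")
hits <- grepl(sep_free_ptn, txt2, perl = TRUE)
hits
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
```

The trade-off is that this treats every non-alphanumeric character as a separator, which may be broader than you want; the explicit (^|[_ ]) form keeps the set of separators under your control.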

Using your added data:

gsea_obese <- c(
  "RAMALHO_STEMNESS_UP",
  "IVANOVA_HEMATOPOIESIS_INTERMEDIATE_PROGENITOR",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_QTL_CIS",
  "IVANOVA_HEMATOPOIESIS_EARLY_PROGENITOR",
  "RIZ_ERYTHROID_DIFFERENTIATION",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS",
  "RIZ_ERYTHROID_DIFFERENTIATION_CCNE1",
  "REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM",
  "GOBP_ERYTHROCYTE_HOMEOSTASIS",
  "REACTOME_NEURONAL_SYSTEM"
)
keep <- c("HEMATOPOIETIC", "ERYTHROCYTE", "ERYTHROID", "STEM", "STEMNESS", "HEMATOPOIESIS")
ptn <- paste0("(^|[ _])(", paste(keep, collapse = "|"), ")([ _]|$)")
ptn
# [1] "(^|[ _])(HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|STEM|STEMNESS|HEMATOPOIESIS)([ _]|$)"

stringr::str_detect(gsea_obese, ptn)
#  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
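To tie this back to the pipeline in the question, here is a minimal sketch of the same pattern used inside filter(). The data frame gsea_df and its numeric values are hypothetical toy data; only the column names (NAME, NES, NOM.p.val, FDR.q.val) come from the question.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; the values are invented for illustration only.
gsea_df <- data.frame(
  NAME = c("RAMALHO_STEMNESS_UP",
           "REACTOME_NEURONAL_SYSTEM",
           "GOBP_ERYTHROCYTE_HOMEOSTASIS"),
  NES = c(1.8, 1.5, -1.2),
  NOM.p.val = c(0.01, 0.02, 0.03),
  FDR.q.val = c(0.10, 0.05, 0.20),
  stringsAsFactors = FALSE
)

keep <- c("HEMATOPOIETIC", "ERYTHROCYTE", "ERYTHROID",
          "STEM", "STEMNESS", "HEMATOPOIESIS")
ptn <- paste0("(^|[ _])(", paste(keep, collapse = "|"), ")([ _]|$)")

obese <- gsea_df %>%
  filter(FDR.q.val < 0.25, NOM.p.val < 0.05) %>%
  filter(str_detect(NAME, ptn)) %>%   # drops REACTOME_NEURONAL_SYSTEM
  arrange(FDR.q.val)

obese$NAME
# [1] "RAMALHO_STEMNESS_UP"          "GOBP_ERYTHROCYTE_HOMEOSTASIS"
```

Building the pattern from the keep vector means adding or removing a keyword later only touches one place, and the resulting obese data frame can be passed to the ggplot() call from the question unchanged.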