I am filtering some pathway names, and I want to include only pathways that contain "HEMATOPOIETIC", "ERYTHROCYTE", "ERYTHROID", "STEM", "STEMNESS", or "HEMATOPOIESIS", and then make a bar plot of the results. This is my script so far:
obese <- gsea_obese %>%
  filter(FDR.q.val < 0.25) %>%
  filter(NOM.p.val < 0.05) %>%
  filter(str_detect(NAME, "HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|STEM|STEMNESS|HEMATOPOIESIS")) %>%
  arrange(FDR.q.val)
ggplot(obese, aes(reorder(NAME, NES), NES)) +
  geom_col(aes(fill = FDR.q.val)) +
  scale_fill_gradient(low = "blue", high = "red") +
  coord_flip() +
  labs(x = "Pathway", y = "Normalized Enrichment Score",
       title = "Enriched Genes JAK2 vs WT in Obese mice")
In my results, I got unwanted pathways like "REACTOME_NEURONAL_SYSTEM", "REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM", "GOBP_IMMUNE_SYSTEM_DEVELOPMENT", ... and I noticed that this is because the word "SYSTEM" contains "STEM". I tried using "\\b", but then the pattern does not match any pathway names at all. Maybe I have mispositioned "\\b" in the script, so any help would be appreciated!
Edit: Here's a sample of the input:
gsea_obese <- c(
  "RAMALHO_STEMNESS_UP",
  "IVANOVA_HEMATOPOIESIS_INTERMEDIATE_PROGENITOR",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_QTL_CIS",
  "IVANOVA_HEMATOPOIESIS_EARLY_PROGENITOR",
  "RIZ_ERYTHROID_DIFFERENTIATION",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS",
  "RIZ_ERYTHROID_DIFFERENTIATION_CCNE1",
  "REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM",
  "GOBP_ERYTHROCYTE_HOMEOSTASIS",
  "REACTOME_NEURONAL_SYSTEM"
)
The output I expected should not include names that fall outside the pattern I have specified. I managed to solve it with
filter(str_detect(NAME, "HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|\\b STEM|STEMNESS|HEMATOPOIESIS")), however I also tried filter(str_detect(NAME, "\\b HEMATOPOIETIC|\\b ERYTHROCYTE|\\b ERYTHROID|\\b STEM|\\b STEMNESS|\\b HEMATOPOIESIS")) and I don't understand why that version does not return any match.
_ does not constitute a word boundary.
This is because \b in regex "matches the empty string at either edge of a word", and a "word" is typically [[:alnum:]_] (it explicitly includes the underscore). (See ?regex for all of that with more explanations.)
We need to refine the regex to look for specific boundary conditions. Using (^|_) will look for a _-boundary or beginning-of-string, and similarly (_|$) for end-of-string. We will also need to group the pattern strings into a (-group. I think that means your regex should look something like this:
(^|_)(HEMATOPOIETIC|ERYTHROCYTE|ERYTHROID|STEM|STEMNESS|HEMATOPOIESIS)(_|$)
If there are other boundary-like characters, you may want to include them in those book-end patterns. For instance, if a boundary could be a space or underscore, then you might use a character class such as [_ ] in place of the bare _, à la (^|[_ ])(...)([_ ]|$).
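As a rough check of the behaviour described above, the sketch below compares a few boundary styles on two names from the question plus one made-up name, "STEM CELL", which is only there to show the space case. It assumes stringr is installed. The last line also shows that a space typed after \\b is a literal character in the pattern, so "\\b STEM" can only match where a real space precedes STEM.

```r
library(stringr)

# Two names from the question; "STEM CELL" is a hypothetical name added
# only to illustrate a space-delimited case.
x <- c("BYSTRYKH_HEMATOPOIESIS_STEM_CELL_QTL_CIS",
       "REACTOME_NEURONAL_SYSTEM",
       "STEM CELL")

# No boundary at all: "STEM" is also found inside "SYSTEM".
no_boundary <- str_detect(x, "STEM")

# \b treats "_" as a word character, so there is no boundary next to it.
word_boundary <- str_detect(x, "\\bSTEM\\b")

# Explicit bookends: start-or-underscore before, underscore-or-end after.
underscore_only <- str_detect(x, "(^|_)STEM(_|$)")

# Bookends that accept either an underscore or a space.
underscore_or_space <- str_detect(x, "(^|[_ ])STEM([_ ]|$)")

# The space after \\b is literal: a real space must precede "STEM".
literal_space <- str_detect(x, "\\b STEM")

print(rbind(no_boundary, word_boundary, underscore_only,
            underscore_or_space, literal_space))
```

Only the bookended patterns separate "_STEM_" from "SYSTEM" here; which bookend characters to allow depends on what delimiters actually occur in your names.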
Using your added data:
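A minimal sketch of how this might look on the sample names from the question, assuming stringr is installed. The names gsea_names, keywords, pattern and keep are introduced here for illustration only; the pattern is built from the keyword list so the bookends are written once.

```r
library(stringr)

# Sample pathway names copied from the question (trailing comma removed).
gsea_names <- c(
  "RAMALHO_STEMNESS_UP",
  "IVANOVA_HEMATOPOIESIS_INTERMEDIATE_PROGENITOR",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_QTL_CIS",
  "IVANOVA_HEMATOPOIESIS_EARLY_PROGENITOR",
  "RIZ_ERYTHROID_DIFFERENTIATION",
  "BYSTRYKH_HEMATOPOIESIS_STEM_CELL_AND_BRAIN_QTL_CIS",
  "RIZ_ERYTHROID_DIFFERENTIATION_CCNE1",
  "REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM",
  "GOBP_ERYTHROCYTE_HOMEOSTASIS",
  "REACTOME_NEURONAL_SYSTEM"
)

# Group the keywords, then book-end the group with
# "start or underscore" and "underscore or end".
keywords <- c("HEMATOPOIETIC", "ERYTHROCYTE", "ERYTHROID",
              "STEM", "STEMNESS", "HEMATOPOIESIS")
pattern <- paste0("(^|_)(", paste(keywords, collapse = "|"), ")(_|$)")

keep <- str_detect(gsea_names, pattern)

# Names that are filtered out.
gsea_names[!keep]
# [1] "REACTOME_CYTOKINE_SIGNALING_IN_IMMUNE_SYSTEM"
# [2] "REACTOME_NEURONAL_SYSTEM"
```

In the original pipeline, where NAME is a column of a data frame, the same pattern could be dropped in as filter(str_detect(NAME, pattern)).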