Join two tables by index range but table length increased

Question

Join two tables by index range but table length increased

61 views Asked by monotonic At 22 December 2023 at 07:29

I'm trying to follow this answer that join two tables with range: https://stackoverflow.com/a/46341899/6636572

I want to join two tables where one has some ranges and another are numbers and I want to annotate the numbers by matching which range those numbers are in from another dataframe.

However the length of the table surprisingly increased. I expect it to stay the same as 2832. What's happening? How to trouble shoot this? The result doesn't change if I use fuzzy_left_join

> head(gene_list_selected)
  chr start_pos  end_pos  gene_name
1   1  55013806 55100417     ACOT11
2   1  55074849 55089200    FAM151A
3   1  55107412 55175940      MROH7
4   1  55107412 55208328 MROH7-TTC4
5   1  55181494 55208328       TTC4
6   1  55222570 55230226      PARS2
> head(df)
            rsid      pos
1 1:55013860:C:T 55013860
2 1:55013957:G:A 55013957
3 1:55014013:C:T 55014013
4 1:55014095:C:T 55014095
5 1:55014099:C:T 55014099
6 1:55014100:G:A 55014100
> nrow(gene_list_selected)
[1] 21
> nrow(df)
[1] 2832
> df_with_gene_name<-df %>% 
+     fuzzy_inner_join(gene_list_selected, by = c("pos"="start_pos","pos"="end_pos"), match_fun = list(`>=`, `<=`)) 
> 
> nrow(df_with_gene_name)
[1] 3298

I want the end result to keep rsid, pos, from, start while having additional columns of one-hot coded binary table of gene_name.

> result
rsid, chr, pos, from, start, ACOT11, FAM151A, MROH7, MROH7-TTC4, TTC4, PARS2
1:55013860:C:T 1 55013860 55013806 55100417 1 0 0 0 0 0

Original Q&A

There are 1 answers

**L Tyrone** · Answer 1 · 2023-12-22T08:34:16+00:00

The reason df_with_gene_name has more rows than df is because the start_pos and end_pos of your gene_name values are not mutually exclusive. Therefore, a record in df has the potential to qualify for more than one gene_name. If so, one record will be returned for each match. See the example below as an illustration. You haven't indicated how you want to handle the multiple matches, but pivot_wider() is useful in these cases and will give you single rows per rsid. Note I have modified your example data to highlight an instance where multiple matches are found.

Update: new, much faster answer:

It is possible to achieve the desired result using only dplyr and tidyr, as fuzzyjoin is very slow and can cause issues with RAM usage for lower-spec computers. This new solution is 9 times faster than using fuzzyjoin (according to microbenchmak):

library(dplyr)
library(tidyr)

gene_list_selected <- read.table(text = "chr start_pos  end_pos  gene_name
  1  55013806 55100417     ACOT11
  1  55074849 55089200    FAM151A
  1  55107412 55175940      MROH7
  1  55107412 55208328 MROH7-TTC4
  1  55181494 55208328       TTC4
  1  55222570 55230226      PARS2", header = TRUE)

df <- read.table(text = "rsid      pos
1:55013860:C:T 55013860
1:55013957:G:A 55013957
1:55014013:C:T 55014013
1:55014095:C:T 55014095
1:55014099:C:T 55014099
1:55014100:G:A 55014100
1:dupmatch:E.G 55107413", header = TRUE)

df_with_gene_name <- df %>%
  inner_join(., gene_list_selected, 
            by = join_by(pos >= start_pos, 
                         pos <= end_pos)) %>%
  pivot_wider(id_cols = rsid,
              names_from = gene_name,
              values_from = pos)

df_with_gene_name
# A tibble: 7 × 4
           rsid    ACOT11    MROH7 `MROH7-TTC4`
           <chr>    <int>    <int>        <int>
1 1:55013860:C:T 55013860       NA           NA
2 1:55013957:G:A 55013957       NA           NA
3 1:55014013:C:T 55014013       NA           NA
4 1:55014095:C:T 55014095       NA           NA
5 1:55014099:C:T 55014099       NA           NA
6 1:55014100:G:A 55014100       NA           NA
7 1:dupmatch:E.G       NA 55107413     55107413

Original answer:

library(fuzzyjoin)
library(dplyr)
library(tidyr)

gene_list_selected <- read.table(text = "chr start_pos  end_pos  gene_name
  1  55013806 55100417     ACOT11
  1  55074849 55089200    FAM151A
  1  55107412 55175940      MROH7
  1  55107412 55208328 MROH7-TTC4
  1  55181494 55208328       TTC4
  1  55222570 55230226      PARS2", header = TRUE)

df <- read.table(text = "          rsid      pos
1:55013860:C:T 55013860
1:55013957:G:A 55013957
1:55014013:C:T 55014013
1:55014095:C:T 55014095
1:55014099:C:T 55014099
1:55014100:G:A 55014100
1:dupmatch:E.G 55107413", header = TRUE)

# Fuzzy join and pivot
df_with_gene_name <- df %>%
  fuzzy_inner_join(gene_list_selected, 
                   by = c("pos"="start_pos","pos"="end_pos"),
                   match_fun = list(`>=`, `<=`)) %>%
  pivot_wider(id_cols = rsid,
              names_from = gene_name,
              values_from = pos)

TechQA.

Join two tables by index range but table length increased

There are 1 answers

Related Questions in JOIN

Related Questions in DPLYR

Related Questions in TIDYVERSE

Related Questions in FUZZYJOIN

Popular Questions

Trending Questions