Join two tables by index range but table length increased

61 views Asked by At

I'm trying to follow this answer that join two tables with range: https://stackoverflow.com/a/46341899/6636572

I want to join two tables where one has some ranges and another are numbers and I want to annotate the numbers by matching which range those numbers are in from another dataframe.

However the length of the table surprisingly increased. I expect it to stay the same as 2832. What's happening? How to trouble shoot this? The result doesn't change if I use fuzzy_left_join

> head(gene_list_selected)
  chr start_pos  end_pos  gene_name
1   1  55013806 55100417     ACOT11
2   1  55074849 55089200    FAM151A
3   1  55107412 55175940      MROH7
4   1  55107412 55208328 MROH7-TTC4
5   1  55181494 55208328       TTC4
6   1  55222570 55230226      PARS2
> head(df)
            rsid      pos
1 1:55013860:C:T 55013860
2 1:55013957:G:A 55013957
3 1:55014013:C:T 55014013
4 1:55014095:C:T 55014095
5 1:55014099:C:T 55014099
6 1:55014100:G:A 55014100
> nrow(gene_list_selected)
[1] 21
> nrow(df)
[1] 2832
> df_with_gene_name<-df %>% 
+     fuzzy_inner_join(gene_list_selected, by = c("pos"="start_pos","pos"="end_pos"), match_fun = list(`>=`, `<=`)) 
> 
> nrow(df_with_gene_name)
[1] 3298

I want the end result to keep rsid, pos, from, start while having additional columns of one-hot coded binary table of gene_name.

> result
rsid, chr, pos, from, start, ACOT11, FAM151A, MROH7, MROH7-TTC4, TTC4, PARS2
1:55013860:C:T 1 55013860 55013806 55100417 1 0 0 0 0 0
1

There are 1 answers

4
L Tyrone On

The reason df_with_gene_name has more rows than df is because the start_pos and end_pos of your gene_name values are not mutually exclusive. Therefore, a record in df has the potential to qualify for more than one gene_name. If so, one record will be returned for each match. See the example below as an illustration. You haven't indicated how you want to handle the multiple matches, but pivot_wider() is useful in these cases and will give you single rows per rsid. Note I have modified your example data to highlight an instance where multiple matches are found.

Update: new, much faster answer:

It is possible to achieve the desired result using only dplyr and tidyr, as fuzzyjoin is very slow and can cause issues with RAM usage for lower-spec computers. This new solution is 9 times faster than using fuzzyjoin (according to microbenchmak):

library(dplyr)
library(tidyr)

gene_list_selected <- read.table(text = "chr start_pos  end_pos  gene_name
  1  55013806 55100417     ACOT11
  1  55074849 55089200    FAM151A
  1  55107412 55175940      MROH7
  1  55107412 55208328 MROH7-TTC4
  1  55181494 55208328       TTC4
  1  55222570 55230226      PARS2", header = TRUE)

df <- read.table(text = "rsid      pos
1:55013860:C:T 55013860
1:55013957:G:A 55013957
1:55014013:C:T 55014013
1:55014095:C:T 55014095
1:55014099:C:T 55014099
1:55014100:G:A 55014100
1:dupmatch:E.G 55107413", header = TRUE)

df_with_gene_name <- df %>%
  inner_join(., gene_list_selected, 
            by = join_by(pos >= start_pos, 
                         pos <= end_pos)) %>%
  pivot_wider(id_cols = rsid,
              names_from = gene_name,
              values_from = pos)

df_with_gene_name
# A tibble: 7 × 4
           rsid    ACOT11    MROH7 `MROH7-TTC4`
           <chr>    <int>    <int>        <int>
1 1:55013860:C:T 55013860       NA           NA
2 1:55013957:G:A 55013957       NA           NA
3 1:55014013:C:T 55014013       NA           NA
4 1:55014095:C:T 55014095       NA           NA
5 1:55014099:C:T 55014099       NA           NA
6 1:55014100:G:A 55014100       NA           NA
7 1:dupmatch:E.G       NA 55107413     55107413

Original answer:

library(fuzzyjoin)
library(dplyr)
library(tidyr)

gene_list_selected <- read.table(text = "chr start_pos  end_pos  gene_name
  1  55013806 55100417     ACOT11
  1  55074849 55089200    FAM151A
  1  55107412 55175940      MROH7
  1  55107412 55208328 MROH7-TTC4
  1  55181494 55208328       TTC4
  1  55222570 55230226      PARS2", header = TRUE)

df <- read.table(text = "          rsid      pos
1:55013860:C:T 55013860
1:55013957:G:A 55013957
1:55014013:C:T 55014013
1:55014095:C:T 55014095
1:55014099:C:T 55014099
1:55014100:G:A 55014100
1:dupmatch:E.G 55107413", header = TRUE)

# Fuzzy join and pivot
df_with_gene_name <- df %>%
  fuzzy_inner_join(gene_list_selected, 
                   by = c("pos"="start_pos","pos"="end_pos"),
                   match_fun = list(`>=`, `<=`)) %>%
  pivot_wider(id_cols = rsid,
              names_from = gene_name,
              values_from = pos)