Pair each combination of two columns and calculate sum for a third column in data.table

Question

59 views Asked by Ann At 07 September 2023 at 06:42

I have two very large data frames, df1 and df2. df1 contains the columns "from1", "to1" and "count". The values in "from1" and "to1" represent commuting points and can occur multiple times:

from1	to1	count
10020	10020	20
10020	10020	10
10020	22001	NA
30030	20020	2
45001	32001	100
45001	32001	NA
45001	45001	1
90080	45002	NA

In df2, I would like to create every possible combination of the "from" and "to" values. Then I would like to fill a new column, "count_total", with the sum of commuters for each individual pair. If a combination does not occur in df1, or its count is NA, I would like to fill in 0. My desired output:

from2	to2	count_total
10020	10020	30
10020	22001	0
10020	20020	0
10020	32001	0
10020	45001	0
10020	45002	0
30030	10020	0
30030	22001	0
30030	20020	2

...

I tried the following, but it did not sum up the values for "count_total" correctly:

    df2 <- CJ(from2 = unique(df1$from1), 
                to2 = unique(df1$to1))


    df2[, count_total := sum(df1$count[
             df1$from1 == from2 &
               df1$to1 == to2
                ]), by = .(from2, to2)]

What did I do wrong? Thanks!

There are 3 answers

Mark On 07 September 2023 at 07:45

Steps:

Load required packages
Process dt with lazy_dt(), so we can use dplyr functions on it
Summarise, to combine rows with the same from1 and to1
Complete the data, which creates rows for every combination of from1 and to1, with default values of 0
Since it's now lazy, we call as.data.table() to have it actually do the work

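# summarise() comes from dplyr and complete() from tidyr; dtplyr translates both
pacman::p_load(data.table, dtplyr, dplyr, tidyr)

# df1 is the data from the question
dt <- df1 |> lazy_dt()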

dt |> 
  summarise(count = sum(count, na.rm = TRUE), .by = c(from1, to1)) |>
  complete(from1, to1, fill = list(count = 0)) |> 
  as.data.table()

Output:

    from1   to1 count
 1: 10020 10020    30
 2: 10020 20020     0
 3: 10020 22001     0
 4: 10020 32001     0
 5: 10020 45001     0
 6: 10020 45002     0
 7: 30030 10020     0
 8: 30030 20020     2
 9: 30030 22001     0
10: 30030 32001     0
11: 30030 45001     0
12: 30030 45002     0
13: 45001 10020     0
14: 45001 20020     0
15: 45001 22001     0
16: 45001 32001   100
17: 45001 45001     1
18: 45001 45002     0
19: 90080 10020     0
20: 90080 20020     0
21: 90080 22001     0
22: 90080 32001     0
23: 90080 45001     0
24: 90080 45002     0
    from1   to1 count
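For readers who would rather stay in plain data.table, the same two steps (summarise, then complete) can be sketched as below. This is only an illustration under assumptions: `df1` is rebuilt here from the sample rows in the question, and the names `agg`, `grid` and `out` are made up for the example.

```r
library(data.table)

# Sample data copied from the question
df1 <- data.table(
  from1 = c(10020L, 10020L, 10020L, 30030L, 45001L, 45001L, 45001L, 90080L),
  to1   = c(10020L, 10020L, 22001L, 20020L, 32001L, 32001L, 45001L, 45002L),
  count = c(20, 10, NA, 2, 100, NA, 1, NA)
)

# Step 1: summarise, collapsing duplicate (from1, to1) pairs and ignoring NA
agg <- df1[, .(count = sum(count, na.rm = TRUE)), by = .(from1, to1)]

# Step 2: complete, building every from1/to1 combination and joining the sums onto it
grid <- CJ(from1 = unique(df1$from1), to1 = unique(df1$to1))
out  <- agg[grid, on = .(from1, to1)]

# Combinations that never occur in df1 come back as NA; set them to 0
out[is.na(count), count := 0]
```

Summarising before the join keeps the joined table at exactly one row per combination, which is probably what you want with very large inputs.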

s_baldur On 08 September 2023 at 15:19

Using set():

library(data.table)

df2 <- df1[, CJ(from1, to1, unique = TRUE)][, count := 0L]

# CJ has already created a key
for (i in seq_len(nrow(df1))) {  # seq_len() is also safe when df1 has zero rows
  if (is.na(df1$count[i])) next
  row <- df2[df1[i], which = TRUE]
  set(df2, row, "count", value = df2[row, count] + df1$count[i])
}

r2evans (Accepted Answer) On 07 September 2023 at 12:34

We can do a merge-then-summarize:

library(data.table)
setDT(df1)
# Join from df1's side so that every matching row is kept; an update join
# (count2 := i.count) would keep only the last match per pair
df1[CJ(from2 = unique(df1$from1), to2 = unique(df1$to1)),
    on = .(from1 == from2, to1 == to2)
  ][, .(count2 = sum(c(0, count), na.rm = TRUE)), by = .(from2 = from1, to2 = to1)]
#     from2   to2 count2
#     <int> <int>  <num>
#  1: 10020 10020     30
#  2: 10020 20020      0
#  3: 10020 22001      0
#  4: 10020 32001      0
#  5: 10020 45001      0
#  6: 10020 45002      0
#  7: 30030 10020      0
#  8: 30030 20020      2
#  9: 30030 22001      0
# 10: 30030 32001      0
# ---                   
# 15: 45001 22001      0
# 16: 45001 32001    100
# 17: 45001 45001      1
# 18: 45001 45002      0
# 19: 90080 10020      0
# 20: 90080 20020      0
# 21: 90080 22001      0
# 22: 90080 32001      0
# 23: 90080 45001      0
# 24: 90080 45002      0
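As for "What did I do wrong?": judging from the sample data, one likely cause is that `sum()` returns NA as soon as any matching count is NA. The sketch below re-runs the attempt from the question with `na.rm = TRUE` added. It assumes `df1` looks like the sample rows in the question (rebuilt here by hand) and is only an illustration, not a tuned solution for very large tables.

```r
library(data.table)

# Sample data copied from the question
df1 <- data.table(
  from1 = c(10020L, 10020L, 10020L, 30030L, 45001L, 45001L, 45001L, 90080L),
  to1   = c(10020L, 10020L, 22001L, 20020L, 32001L, 32001L, 45001L, 45002L),
  count = c(20, 10, NA, 2, 100, NA, 1, NA)
)

# A single NA poisons the sum unless it is removed
sum(c(100, NA))                # NA
sum(c(100, NA), na.rm = TRUE)  # 100

# The original attempt, with na.rm = TRUE added
df2 <- CJ(from2 = unique(df1$from1), to2 = unique(df1$to1))
df2[, count_total := sum(df1$count[df1$from1 == from2 & df1$to1 == to2],
                         na.rm = TRUE),
    by = .(from2, to2)]
```

This scans all of `df1` once per combination, so it may be slow on large data; the join-based version above does the same work in one pass.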
