Summarize in a column using a condition and return a new row with the summed value

72 views Asked by At

I have a dataset and I am trying to find a solution for it using dplyr. My goal is to summarize the values in the columns value and percentage, but only for the value smaller than 10 and add this to a new item name called: "cheap_stuff", while removing the rows with the low values.

My data looks like this:

df <- data.frame(group=c(rep("A",4), rep("B",4), rep("C",4), rep("D",4)),
                 value=c(1, 23, 15, 5,  3,  45, 7,  21, 4,  8,  26, 30, 3,  9,  37, 68),
                 percentage=c(2.27, 52.27,  34.09,  11.36   ,3.95   ,59.21  ,9.21   ,27.63  ,5.88   ,11.76  ,38.24  ,44.12  ,2.56   ,7.69, 31.62, 58.12),
                 item=c("cheap1","expensive1"   ,"expensive2",  "cheap2",
                 "cheap1",  "expensive1","cheap2","expensive2",
                 "cheap1","cheap2","expensive1","expensive2",
                 "cheap1","cheap2","expensive1","expensive2"))

view(df)
   group value percentage       item
1      A     1       2.27     cheap1
2      A    23      52.27 expensive1
3      A    15      34.09 expensive2
4      A     5      11.36     cheap2
5      B     3       3.95     cheap1
6      B    45      59.21 expensive1
7      B     7       9.21     cheap2
8      B    21      27.63 expensive2
9      C     4       5.88     cheap1
10     C     8      11.76     cheap2
11     C    26      38.24 expensive1
12     C    30      44.12 expensive2
13     D     3       2.56     cheap1
14     D     9       7.69     cheap2
15     D    37      31.62 expensive1
16     D    68      58.12 expensive2

My desired output looks like this:

   group value percentage        item
1      A     6      13.64 cheap_stuff
2      A    23      52.27  expensive1
3      A    15      34.09  expensive2
4      B    10      13.16 cheap_stuff
5      B    45      59.21  expensive1
6      B    21      27.63  expensive2
7      C    12      17.65 cheap_stuff
8      C    26      38.24  expensive1
9      C    30      44.12  expensive2
10     D    12      10.26 cheap_stuff
11     D    37      31.62  expensive1
12     D    68      58.12  expensive2

This post comes in the right direction, Summarize with mathematical conditions in dplyr But, there all values are summed, and a new column is created.

I have tried something like this:

library(dplyr)
df%>%
  group_by(group) %>%
  mutate(item= replace(item, which(value <10),"cheap_stuff")) %>%
  mutate(value = sum(value[value < 10]))

But that fails in the sense that I can not removed the rows that I want, and it write over the rows with expensive values.

# A tibble: 16 × 4
# Groups:   group [4]
   group value percentage item       
   <chr> <dbl>      <dbl> <chr>      
 1 A         6       2.27 cheap_stuff
 2 A         6      52.3  expensive1 
 3 A         6      34.1  expensive2 
 4 A         6      11.4  cheap_stuff
 5 B        10       3.95 cheap_stuff
 6 B        10      59.2  expensive1 
 7 B        10       9.21 cheap_stuff
 8 B        10      27.6  expensive2 
 9 C        12       5.88 cheap_stuff
10 C        12      11.8  cheap_stuff
11 C        12      38.2  expensive1 
12 C        12      44.1  expensive2 
13 D        12       2.56 cheap_stuff
14 D        12       7.69 cheap_stuff
15 D        12      31.6  expensive1 
16 D        12      58.1  expensive2 
1

There are 1 answers

4
one On

Using value<10 instead of grepl:

df %>%
  group_by(group,item=case_when(value < 10~"cheap_stuff",
                                T~item)) %>%
  summarise(value=sum(value),
            percentage=sum(percentage))%>%
  ungroup

   group item        value percentage
   <chr> <chr>       <dbl>      <dbl>
 1 A     cheap_stuff     6       13.6
 2 A     expensive1     23       52.3
 3 A     expensive2     15       34.1
 4 B     cheap_stuff    10       13.2
 5 B     expensive1     45       59.2
 6 B     expensive2     21       27.6
 7 C     cheap_stuff    12       17.6
 8 C     expensive1     26       38.2
 9 C     expensive2     30       44.1
10 D     cheap_stuff    12       10.2
11 D     expensive1     37       31.6
12 D     expensive2     68       58.1

Original answer:

df %>%
  group_by(group,item=case_when(grepl("cheap",item,fixed=T)~"cheap_stuff",
                                T~item)) %>%
  summarise(value=sum(value),
            percentage=sum(percentage))

   group item        value percentage
   <chr> <chr>       <dbl>      <dbl>
 1 A     cheap_stuff     6       13.6
 2 A     expensive1     23       52.3
 3 A     expensive2     15       34.1
 4 B     cheap_stuff    10       13.2
 5 B     expensive1     45       59.2
 6 B     expensive2     21       27.6
 7 C     cheap_stuff    12       17.6
 8 C     expensive1     26       38.2
 9 C     expensive2     30       44.1
10 D     cheap_stuff    12       10.2
11 D     expensive1     37       31.6
12 D     expensive2     68       58.1