How to create group-mean segments in R scatterplot?

I'm trying to partition my data into equal-sized bins, calculate the mean value of wdi_expedu for each bin, then overlay a horizontal segment of bin-specific means on a scatterplot of wdi_expedu and left_seats.

I used this code to add a bin variable that splits my dataset into 4 bins, then to create a new data frame with the group mean for each bin.

j <- 4

data_selected_na <- data_selected_na %>%
  mutate(bin = as.integer(cut_number(wdi_expedu,j)))

bin_means <- data_selected_na %>%
  group_by(bin) %>%
  summarise(mean_wdi_expedu = mean(wdi_expedu))

I tried to use this code to create my scatterplot, but the segments span the whole width of the graph. How do I limit the x-values of the segments so that each one only spans the width of its corresponding bin (i.e., the segment for the bin 1 group mean covers exactly the x-values in bin 1)? I don't want to code the x-limit coordinates by hand, since I will eventually scale the j value up to 100.

ggplot(data_selected_na, aes(x = left_seats, y = wdi_expedu)) +
  geom_point() +
  geom_segment(data = bin_means, aes(x = min(data_selected_na$left_seats), xend = max(data_selected_na$left_seats), y = bin_means$mean_wdi_expedu, yend = bin_means$mean_wdi_expedu), color = "red") +
  labs(title = "Scatterplot with Horizontal Lines for j = 4", x = "Left Seats", y = "Education Spending") +
  theme_minimal()

[Image: my attempted scatterplot]

Answer by stefan:

The issue is that you use the overall min and max of left_seats for every segment. If you want one segment per bin, you have to use the lower and upper boundary of each bin instead. To this end, stick with the default interval labels returned by cut (aside: if you want integer codes, you could set labels = FALSE), then extract the lower and upper boundaries into separate columns, e.g. with tidyr::separate_wider_regex.

Finally, you computed the mean of wdi_expedu per bin of wdi_expedu. Since you want to compare wdi_expedu across left_seats, you probably want to bin by left_seats instead.
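To make the label format concrete, here is a minimal sketch of what cut returns with and without labels = FALSE. The vector `x` is made up for illustration and is not part of the question's data.

```r
# Hypothetical toy vector, only for illustrating the return types of cut()
x <- c(1, 5, 10, 15, 20)

# Default: a factor whose levels are interval labels of the form "(lower,upper]"
f <- cut(x, breaks = 3)
levels(f)

# labels = FALSE: plain integer bin codes instead of interval labels
codes <- cut(x, breaks = 3, labels = FALSE)
codes
```

The interval labels are what the regex-based extraction relies on; the integer codes carry no boundary information.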

Using some fake random example data:

library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

set.seed(123)

j <- 4

data_selected_na <- data.frame(
  left_seats = sample(seq(70), 200, replace = TRUE),
  wdi_expedu = runif(200, 1, 8)
)
data_selected_na <- data_selected_na %>%
  mutate(bin = cut(left_seats, breaks = j))

bin_means <- data_selected_na %>%
  group_by(bin) %>%
  summarise(mean_wdi_expedu = mean(wdi_expedu)) %>%
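  # Split each interval label "(lower,upper]" into its parts: unnamed
  # patterns are matched and dropped, named ones (x, xend) become columns
  tidyr::separate_wider_regex(
    bin,
    patterns = c(
      "\\(",
      x = "\\-?[\\d\\.]+",
      ",",
      xend = "\\-?[\\d\\.]+",
      "\\]"
    ),
    cols_remove = FALSE
  ) |>
  # The extracted boundaries are character strings, so convert to numeric
  mutate(across(c(x, xend), as.numeric))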

ggplot(
  data_selected_na,
  aes(x = left_seats, y = wdi_expedu)
) +
  geom_point() +
  geom_segment(
    data = bin_means,
    aes(
      x = x,
      xend = xend,
      y = mean_wdi_expedu,
      yend = mean_wdi_expedu
    ),
    color = "red"
  ) +
  labs(
    title = "Scatterplot with Horizontal Lines for j = 4",
    x = "Left Seats", y = "Education Spending"
  ) +
  theme_minimal()
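As a possible alternative that avoids parsing the labels, the segment endpoints can be taken from the observed minimum and maximum of left_seats within each bin. This is only a sketch under a few assumptions: it reuses the fake data from above under the hypothetical names `dat` and `seg`, and it swaps in ggplot2::cut_number (as in the question) to get roughly equal-sized bins, whereas cut(left_seats, breaks = j) produces equal-width bins.

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

set.seed(123)

j <- 4

# Same kind of fake data as above; `dat` is a hypothetical name
dat <- data.frame(
  left_seats = sample(seq(70), 200, replace = TRUE),
  wdi_expedu = runif(200, 1, 8)
)

# One row per bin: observed x-range of the bin and the mean of y within it
seg <- dat %>%
  mutate(bin = cut_number(left_seats, j)) %>%
  group_by(bin) %>%
  summarise(
    x = min(left_seats),
    xend = max(left_seats),
    mean_wdi_expedu = mean(wdi_expedu),
    .groups = "drop"
  )

p <- ggplot(dat, aes(x = left_seats, y = wdi_expedu)) +
  geom_point() +
  geom_segment(
    data = seg,
    aes(x = x, xend = xend, y = mean_wdi_expedu, yend = mean_wdi_expedu),
    color = "red"
  ) +
  theme_minimal()

p
```

Because the endpoints come from the data rather than from the bin boundaries, neighbouring segments leave small gaps between them. Also note that cut_number may fail for large j when the variable has many tied values, so scaling up to j = 100 is only likely to work if left_seats has enough distinct values.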