how to bin numerical data when its mixed with character

Question

how to bin numerical data when its mixed with character

54 views Asked by chocoball At 10 February 2024 at 17:47

Let's say I have 3 columns of data in R:

Type: 'A' 'B' 'C' 'D' 'E' 'F' 'G'
Value: UT 30 45 50 62 70 72
Efficiency: 70 72 80 88 90 92 98

I want to bin just the 'numerical' data in the 'Value' column in incremement of 20 and display that on X-axis, while leaving the 'text' value in place [so the x-axis will display: UT 30-50 50-70 70-90], while showing 'Efficiency' on the Y-axis, with color = type.

Binning numerical datatype seems straight forward: bins <- seq(30, 80, by = 20) then plotting, but having that 'UT' is giving me a real challenge.

I'm a noob; just played around

#create bins
bins <- seq(30, 90, by = 20) 

#create plot 
ggplot(df, aes(x = cut(`Value`, breaks = bins, labels = sprintf("%d-%d", bins[-length(bins)], bins[-1])), y = 'Efficiency', color = `Type`))

Original Q&A

There are 1 answers

**knitz3** · Accepted Answer · 2024-02-10T18:29:10+00:00

df <- data.frame(
  Type = c("A", "B", "C", "D", "E", "F", "G"),
  Value = c("UT", "30", "45", "50", "62", "70", "72"),
  Efficiency = c(70, 72, 80, 88, 90, 92, 98)
)

Take a step back and consider the organization of your dataset. Columns in dataframes are vectors of a single type, so by including "UT" in that column, this is a character column:

dplyr::glimpse(df)

Rows: 7
Columns: 3
$ Type       <chr> "A", "B", "C", "D", "E", "F", "G"
$ Value      <chr> "UT", "30", "45", "50", "62", "70", "72"
$ Efficiency <dbl> 70, 72, 80, 88, 90, 92, 98

Your Value column appears to originate from numerical data, but you have that pesky UT. In this case, I think it's better to get more detailed with the dataframe so you can more precisely describe the situation:

# New column to give context to the value column
df$Value_type <- ifelse(df$Value == "UT", "UT", NA)
df$Value[df$Value == "UT"] <- NA
df$Value <- as.numeric(df$Value)

# Categories for the Value column
df <- df |>
  dplyr::mutate(Value_cat = dplyr::case_when(
    30 <= Value & Value < 50 ~ "30-50",
    50 <= Value & Value < 70 ~ "50-70",
    70 <= Value & Value < 90 ~ "70-90",
    Value_type == "UT" ~ "UT",
    .default = NA
  ))

# Set factor levels so that any plots have desired order 
df$Value_cat <- factor(df$Value_cat, levels = c(
  "UT", "30-50", "50-70", "70-90"
))

dplyr::glimpse(df)

Rows: 7
Columns: 5
$ Type       <chr> "A", "B", "C", "D", "E", "F", "G"
$ Value      <dbl> NA, 30, 45, 50, 62, 70, 72
$ Efficiency <dbl> 70, 72, 80, 88, 90, 92, 98
$ Value_type <chr> "UT", NA, NA, NA, NA, NA, NA
$ Value_cat  <fct> UT, 30-50, 30-50, 50-70, 50-70, 70-90, 70-90

Now we're a bit closer to the fabled "tidy" dataset: Where rows are each individual observations, and each observation has characteristics that are well-defined in separate variables / columns. You also get the convenience of accessing your Value measurements in a numerical vector. It's generally good to avoid casting numbers as strings when you need to write any sort of logic about their numerical properties.

Since you have a category "UT", your x-axis is really categorical, rather than numeric. I suggest a barplot with "bins" created from these categories, rather than a histogram where you set bin width, as histograms are generally meant for continuous numerical data.

library(ggplot2)
# Since your `fill` is not 1:1 with your x-axis, you can set
# the position of the bars in the position argument
ggplot(data = df, mapping = aes(x = Value_cat, y = Efficiency)) +
  geom_col(aes(fill = Type), position = position_dodge())

TechQA.

how to bin numerical data when its mixed with character

There are 1 answers

Related Questions in R

Related Questions in GGPLOT2

Related Questions in BINNING

Popular Questions

Trending Questions