Why isn't df$VARNAME the same as df[,i] when reading data using the R haven package?

46 views Asked by At

Context: I've read the PISA 2022 data using the haven package and now I want to create an auxiliary df that consists of three columns:

  • variable name (e.g. EFFORT1)
  • variable label (e.g. How much effort did you put into this test?"
  • variable values (e.g. 1, 2, 3, ...)

The issue is that the label and values are accesible if I type attributes(pisa_df$EFFORT1), but NOT if I type attributes(pisa_df[,i]). Why would this be, and is there a way to get around this? I have >1000 variables so typing them one by one is not an option. I've tried something like pisa_df$get(colnames(pisa_df)[i]) but of course it doesn't work.

It seems like a very newbie question but I can't even figure out how to search for possible answers. Thanks in advance!

2

There are 2 answers

3
r2evans On BEST ANSWER

Up front, the reason that attributes(pisa_df$EFFORT1) works but attributes(pisa_df[,1]) does not is because of Why does subsetting a column from a data frame vs. a tibble give different results. Namely, in native R, [.data.frame when reducing to a single column drops to a vector, but tbl_df does not. The base [ can choose to not reduce to a vector by adding the drop=FALSE argument.

mt <- mtcars[1:3,]
mt[,1]
# [1] 21.0 21.0 22.8
mt[,1, drop=FALSE]
#                mpg
# Mazda RX4     21.0
# Mazda RX4 Wag 21.0
# Datsun 710    22.8
tibble(mt)[,1]
# # A tibble: 3 × 1
#     mpg
#   <dbl>
# 1  21  
# 2  21  
# 3  22.8

The workaround is to use $ with names or [[ with a column index,

mt[[1]]
# [1] 21.0 21.0 22.8
tibble(mt)[[1]]
# [1] 21.0 21.0 22.8

In your case, working on a SAS file takes little effort to give what you want. Using the "school questionnaire file" (it was easy for me to get), we can do something like below.

Up front, I'm demonstrating grabbing the labels and unique values for a few columns. Some of the columns are all unique (e.g., SCH has 21,629 rows, and column CNTSCHID has 21,629 distinct values), so I'm not certain if that is as interesting to you. Regardless, while I'm choosing a few, you can use this for all of them without problem.

Also, some of the values are character, some are numeric, so we must either convert all numbers to strings, or we have two separate columns. I'll choose the latter for demonstration, as I think converting all to string would be simpler for you to adapt yourself.

SCH <- haven::read_sas(unz("SCH_QQQ_SAS.zip", "cy08msp_sch_qqq.sas7bdat"))
library(dplyr)
columns <- c(1, 3, 4, 5)
quux <- lapply(columns, function(ind) {
  out <- tibble(column = names(SCH)[ind], label = attributes(SCH[[ind]])$label)
  if (is.character(SCH[[ind]])) {
    cbind(out, tibble(values_chr = unique(SCH[[ind]]))) 
  } else cbind(out, tibble(values_num = unique(SCH[[ind]])))
}) |>
  bind_rows() |>
  tibble()
quux
# # A tibble: 21,794 × 4
#    column label                    values_chr values_num
#    <chr>  <chr>                    <chr>           <dbl>
#  1 CNT    Country code 3-character ALB                NA
#  2 CNT    Country code 3-character QAZ                NA
#  3 CNT    Country code 3-character ARG                NA
#  4 CNT    Country code 3-character AUS                NA
#  5 CNT    Country code 3-character AUT                NA
#  6 CNT    Country code 3-character BEL                NA
#  7 CNT    Country code 3-character BRA                NA
#  8 CNT    Country code 3-character BRN                NA
#  9 CNT    Country code 3-character BGR                NA
# 10 CNT    Country code 3-character KHM                NA
# # ℹ 21,784 more rows
# # ℹ Use `print(n = ...)` to see more rows

The notion is that if a particular column is character, then you would use values_chr (for whatever work you're doing). If you choose only character columns, then you can forego the if/else and just put out values of the distinct strings.

This can be done without dplyr if needed, with just a little more effort.

0
Adriano Mello On

Consider the sample data (dput at the end):

> aux_df
# A tibble: 3 × 6
  EFFORT1 EFFORT2 OCOD1 OCOD2 OCOD3 PROGN   
    <dbl>   <dbl> <chr> <chr> <chr> <chr>   
1      10      10 243   9412  9999  00080002
2       9       8 8189  9999  9999  00080001
3      10      10 9999  9999  9999  0008000

Let's create a tibble with names and labels for it with purrr map, pluck and attr_getter:

aux_labels <- map_dfr(
  colnames(aux_df), 
  \(x) tibble(
    column = x, 
    label = pluck(aux_df, x, attr_getter("label"))))

Names and labels output:

> aux_labels
# A tibble: 6 × 2
  column  label                                                                   
  <chr>   <chr>                                                                   
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment)
2 EFFORT2 How much effort would you have invested? (after cognitive assessment)   
3 OCOD1   ISCO-08 Occupation code - Mother                                        
4 OCOD2   ISCO-08 Occupation code - Father                                        
5 OCOD3   ISCO-08 Occupation code - Self                                          
6 PROGN   Unique national study programme code  

Now, choose acolumn and merge it with the respective values:

aux_output <- cross_join(
  filter(aux_labels, column == "EFFORT1"), 
  select(aux_df, value = "EFFORT1"))

The output:

> aux_output
# A tibble: 3 × 3
  column  label                                                                    value
  <chr>   <chr>                                                                    <dbl>
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment)    10
2 EFFORT1 How much effort did you put into this test? (after cognitive assessment)     9
3 EFFORT1 How much effort did you put into this test? (after cognitive assessment)    10

If you want to make it to some (or all) columns at once, taking into account the possible different classes, you can try this:

# Choose some columns (or all: colnames(aux_df))
aux_columns <- c("EFFORT1", "EFFORT2", "OCOD1")

#
aux_output <- map_dfr(
  aux_columns,
  \(x) cross_join(
    filter(aux_labels, column == x), 
    aux_df %>% 
      select(value = x) %>% 
      mutate(value_class = class(value), value = as.character(value))))

The output:

> aux_output
# A tibble: 9 × 4
  column  label                                                                    value value_class
  <chr>   <chr>                                                                    <chr> <chr>      
1 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 10    numeric    
2 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 9     numeric    
3 EFFORT1 How much effort did you put into this test? (after cognitive assessment) 10    numeric    
4 EFFORT2 How much effort would you have invested? (after cognitive assessment)    10    numeric    
5 EFFORT2 How much effort would you have invested? (after cognitive assessment)    8     numeric    
6 EFFORT2 How much effort would you have invested? (after cognitive assessment)    10    numeric    
7 OCOD1   ISCO-08 Occupation code - Mother                                         243   character  
8 OCOD1   ISCO-08 Occupation code - Mother                                         8189  character  
9 OCOD1   ISCO-08 Occupation code - Mother                                         9999  character 

That's it. Here's the sample data's dput:

aux_df <- structure(list(
  EFFORT1 = structure(
    c(10, 9, 10), 
    label = "How much effort did you put into this test? (after cognitive assessment)"), 
  EFFORT2 = structure(
    c(10, 8, 10), 
    label = "How much effort would you have invested? (after cognitive assessment)"), 
  OCOD1 = structure(
    c("243", "8189", "9999"), 
    label = "ISCO-08 Occupation code - Mother"), 
  OCOD2 = structure(
    c("9412", "9999", "9999"), 
    label = "ISCO-08 Occupation code - Father"), 
  OCOD3 = structure(
    c("9999", "9999", "9999"), 
    label = "ISCO-08 Occupation code - Self"), 
  PROGN = structure(
    c("00080002", "00080001", "00080001"), 
    label = "Unique national study programme code")), 
  row.names = c(NA, -3L), 
  class = c("tbl_df", "tbl", "data.frame"))