Context: I've read the PISA 2022 data using the haven package and now I want to create an auxiliary df that consists of three columns:
- variable name (e.g. EFFORT1)
- variable label (e.g. "How much effort did you put into this test?")
- variable values (e.g. 1, 2, 3, ...)
The issue is that the label and values are accessible if I type attributes(pisa_df$EFFORT1), but NOT if I type attributes(pisa_df[,i]). Why would this be, and is there a way to get around this? I have >1000 variables so typing them one by one is not an option. I've tried something like pisa_df$get(colnames(pisa_df)[i]) but of course it doesn't work.
It seems like a very newbie question but I can't even figure out how to search for possible answers. Thanks in advance!
Up front, the reason that `attributes(pisa_df$EFFORT1)` works but `attributes(pisa_df[,1])` does not is explained in "Why does subsetting a column from a data frame vs. a tibble give different results". Namely, in base R, `[.data.frame` drops to a vector when the result is a single column, but `tbl_df` does not. (The base `[` can choose not to reduce to a vector by adding the `drop=FALSE` argument.) Since `haven` returns a tibble, `pisa_df[,1]` is still a one-column tibble, and the label and values are attributes of the column vector inside it, not of the tibble.
The workaround is to use `$` with names or `[[` with a column index (or name), e.g. `pisa_df[[i]]`.
In your case, working on a SAS file takes little effort to give what you want. Using the "school questionnaire file" (it was easy for me to get), we can do something like below.
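To make the subsetting difference concrete, here is a minimal sketch on a made-up tibble rather than the PISA file. The `toy` data, its value labels, and the index `i` are assumptions for illustration only; `labelled()` is from `haven` and `tibble()` from `tibble`.

```r
library(haven)   # labelled()
library(tibble)  # tibble()

# Toy stand-in for pisa_df; the value labels here are invented.
toy <- tibble(
  EFFORT1 = labelled(
    c(1, 2, 3, 2),
    labels = c("Low" = 1, "Medium" = 2, "High" = 3),
    label  = "How much effort did you put into this test?"
  ),
  CNT = c("ALB", "ALB", "ARG", "ARG")
)

i <- 1

# `[` on a tibble returns a one-column tibble, so these are the
# tibble's own attributes, not the column's label/labels.
names(attributes(toy[, i]))

# `[[` (by index or by name) returns the column vector itself.
attr(toy[[i]], "label")
attr(toy[["EFFORT1"]], "labels")

# A plain data.frame drops to the vector unless drop = FALSE is given.
df <- as.data.frame(toy)
attr(df[, i], "label")
is.data.frame(df[, i, drop = FALSE])
```

With `[[i]]` the column can be addressed by position, so looping over all columns needs no typing of names.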
Up front, I'm demonstrating grabbing the labels and unique values for a few columns. Some of the columns are all unique (e.g., `SCH` has 21,629 rows, and column `CNTSCHID` has 21,629 distinct values), so I'm not certain if that is as interesting to you. Regardless, while I'm choosing a few, you can use this for all of them without problem.
Also, some of the values are `character`, some are `numeric`, so we must either convert all numbers to strings, or we have two separate columns. I'll choose the latter for demonstration, as I think converting all to string would be simpler for you to adapt yourself.
The notion is that if a particular column is `character`, then you would use `values_chr` (for whatever work you're doing). If you choose only `character` columns, then you can forego the `if`/`else` and just put out `values` of the distinct strings.
This can be done without `dplyr` if needed, with just a little more effort.
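A rough base-R sketch of the two-column idea follows. It is not the code from the original demonstration: the `toy` data and the helper `describe_column()` are hypothetical, and only the column names `values_chr` and `values_num` follow the description above.

```r
library(haven)   # labelled()
library(tibble)  # tibble()

# Made-up stand-in for the real file (assumption: one numeric, one character column).
toy <- tibble(
  EFFORT1 = labelled(
    c(1, 2, 3, 2),
    labels = c("Low" = 1, "Medium" = 2, "High" = 3),
    label  = "How much effort did you put into this test?"
  ),
  CNT = labelled(c("ALB", "ALB", "ARG", "ARG"), label = "Country code")
)

# Hypothetical helper: one row per (variable, distinct value).
describe_column <- function(nm, data) {
  x <- data[[nm]]                       # `[[` keeps the column's attributes
  label <- attr(x, "label", exact = TRUE)
  raw <- unclass(x)
  attributes(raw) <- NULL               # plain atomic vector, no labels or class
  vals <- sort(unique(raw), na.last = TRUE)
  is_chr <- is.character(vals)
  data.frame(
    variable   = nm,
    label      = if (is.null(label)) NA_character_ else label,
    values_chr = if (is_chr) vals else NA_character_,
    values_num = if (is_chr) NA_real_ else as.numeric(vals),
    stringsAsFactors = FALSE
  )
}

# Apply to every column by name and stack the pieces.
aux <- do.call(rbind, lapply(names(toy), describe_column, data = toy))
aux
```

On a real file, `names(toy)` would be replaced by the names of the data frame that was read in. Stacking more than 1000 pieces with `rbind` is acceptable for a one-off, and columns with many distinct values (such as IDs) could be filtered out beforehand.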