I have a dataset with three columns that should theoretically have the same number of unique observations.
Here is a sample of the data:
speciesID common_name species
s001 common lizard Zootoca vivipara
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s003 red grouse Lagopus lagopus scoticus
s003 red grouse Lagopus lagopus scoticus
s004 elk Cervus elaphus
The full dataset can be found here.
But, when I check the number of unique observations, they don't match up.
df %>% as_tibble() %>% count(speciesID) %>% nrow() #148 unique values
df %>% as_tibble() %>% count(common_name) %>% nrow() #150 unique values
df %>% as_tibble() %>% count(species) %>% nrow() #147 unique values
Is there a way to figure out where the 2 missing unique values are in the speciesID column and where the 3 missing unique values are in the species column?
Ideally, I would like to be able to identify the problem rows so that I can go back into the raw data and fix the errors (i.e., there should be 150 unique records).
My hope is that there is a way to do this in R instead of manually checking ~700 rows of data.
I did explore using anti_join, but this hasn't been very successful.
I work in R and am most comfortable with dplyr.
One way you can find what is duplicated:
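A minimal dplyr sketch of that check, assuming the full data is loaded as df with these three columns: reduce to the distinct rows, then for each column keep the values that still map to more than one distinct row.

library(dplyr)

# Collapse exact duplicates so repeated observations don't inflate the counts
distinct_df <- df %>% distinct(speciesID, common_name, species)

# speciesID values that still map to more than one distinct row
distinct_df %>% group_by(speciesID) %>% filter(n() > 1) %>% arrange(speciesID)

# Same check for the other two columns
distinct_df %>% group_by(common_name) %>% filter(n() > 1) %>% arrange(common_name)
distinct_df %>% group_by(species) %>% filter(n() > 1) %>% arrange(species)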
Each of the returned rows indicates duplicated values in one of the columns. For instance, in speciesID we have three values with 2 rows each: s011, s030, and s045, and we can see the specific values of common_name and species for those pairs. Similarly for common_name (only one level is duplicated, nearly perfectly in fact) and species (four pairs, some with commonalities).
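If you also want the positions of those rows in the raw data so you can correct them at the source, one option (a sketch building on distinct_df above; flagged_ids is just an illustrative name) is to pull the flagged keys and tag row numbers in the original data:

# speciesID values that appear with more than one distinct record
flagged_ids <- distinct_df %>% count(speciesID) %>% filter(n > 1) %>% pull(speciesID)

# Row numbers in the original data that carry those speciesIDs
df %>% mutate(raw_row = row_number()) %>% filter(speciesID %in% flagged_ids)

The same pattern works for common_name and species; just swap the column used in count() and filter().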