I have a dataset with three columns that should theoretically have the same number of unique observations.
Here is a sample of the data:
speciesID common_name species
s001 common lizard Zootoca vivipara
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s002 social tuco-tuco Ctenomys sociabilis
s003 red grouse Lagopus lagopus scoticus
s003 red grouse Lagopus lagopus scoticus
s004 elk Cervus elaphus
The full dataset can be found here.
But, when I check the number of unique observations, they don't match up.
df %>% as_tibble() %>% count(speciesID) %>% nrow() #148 unique values
df %>% as_tibble() %>% count(common_name) %>% nrow() #150 unique values
df %>% as_tibble() %>% count(species) %>% nrow() #147 unique values
Is there a way to figure out where the 2 missing unique values are in the speciesID column and where the 3 missing unique values are in the species column?
Ideally, I would like to be able to identify the problem rows so that I can go back into the raw data and fix the errors (i.e., there should be 150 unique records).
My hope is that there is a way to do this in R instead of manually checking ~700 rows of data.
I did explore using anti_join, but this hasn't been very successful.
I work in R and am most comfortable with dplyr.
One way you can find what is duplicated:
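A minimal dplyr sketch of that check, assuming the full data is loaded as df with these three columns: reduce to the distinct rows, then for each column keep the values that still map to more than one distinct row.

library(dplyr)

# Collapse exact duplicates so repeated observations don't inflate the counts
distinct_df <- df %>% distinct(speciesID, common_name, species)

# speciesID values that still map to more than one distinct row
distinct_df %>% group_by(speciesID) %>% filter(n() > 1) %>% arrange(speciesID)

# Same check for the other two columns
distinct_df %>% group_by(common_name) %>% filter(n() > 1) %>% arrange(common_name)
distinct_df %>% group_by(species) %>% filter(n() > 1) %>% arrange(species)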
Each of the returned rows indicates duplicated values in one of the columns. For instance, in speciesID we have three values with 2 rows each: s011, s030, and s045, and we can see the specific values of common_name and species for those pairs. Similarly for common_name (only one level is duplicated, nearly perfectly in fact) and species (four pairs, some with commonalities).
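If you also want the positions of those rows in the raw data so you can correct them at the source, one option (a sketch building on distinct_df above; flagged_ids is just an illustrative name) is to pull the flagged keys and tag row numbers in the original data:

# speciesID values that appear with more than one distinct record
flagged_ids <- distinct_df %>% count(speciesID) %>% filter(n > 1) %>% pull(speciesID)

# Row numbers in the original data that carry those speciesIDs
df %>% mutate(raw_row = row_number()) %>% filter(speciesID %in% flagged_ids)

The same pattern works for common_name and species; just swap the column used in count() and filter().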