Creating a unique ID column in a large dataset


How can I create a column of new unique IDs to replace the existing IDs in a large dataset of about 26,000 observations?

I have a dataset with 26,668 observations and need to create a unique ID for each individual in each year. For example, the year 2000 has about 2,000 individual IDs, and each individual should get a new number that combines the year with the existing ID, such as 20001 for ID 1 in 2000. I need to do the same for every year from 2000 to 2018. How can I do this in R?

I tried this

New2 <- df1 %>% mutate(NewID = 20000 + (year - min(year)) * 10000 + id)

but this does not generate the expected unique ID for 2002, 2004, etc. For example, for the year 2000 the existing ID of an individual is 1, so the new column should contain 20001. For 2002 it should contain 20021. There are about 2,000 to 4,000 observations per year, and the years range from 2000 to 2018. How can I resolve this in R?


There is 1 answer

jay.sf On

Simply use the newIDs you have just created.

> library(dplyr)
> YourData %>%
+   mutate(NewID = newIDs)
  OldID Value  NewID
1     1     1 NewID1
2     2     2 NewID2
3     3     3 NewID3
4     4     4 NewID4

Or do it in one step.

> YourData %>%
+   mutate(NewID = paste0("NewID", seq_len(nrow(.))))
  OldID Value  NewID
1     1     1 NewID1
2     2     2 NewID2
3     3     3 NewID3
4     4     4 NewID4

Edit

This actually works for any number of observations.

> dat %>%
+   mutate(NewID = paste0("NewID", seq_len(nrow(.))))
    OldID Value    NewID
1       1     1   NewID1
2       2     2   NewID2
3       3     3   NewID3
...
10     10    10  NewID10
11     11    11  NewID11
...
100   100   100 NewID100
101   101   101 NewID101
...

If you don't like the IDs having different lengths, you can use sprintf, where %05d zero-pads the number part to five digits.

> dat %>%
+   mutate(NewID = sprintf("NewID%05d", seq_len(nrow(.))))
    OldID Value      NewID
1       1     1 NewID00001
2       2     2 NewID00002
3       3     3 NewID00003

To avoid hard coding the 5, you could use this hack, which builds the format string from the number of digits in the row count.

> dat %>%
+   mutate(NewID = sprintf(sprintf('NewID%%0%sd', nchar(nrow(dat))), seq_len(nrow(.))))
    OldID Value      NewID
1       1     1 NewID00001
2       2     2 NewID00002
3       3     3 NewID00003
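
The width computation can also be written with base R's formatC, which takes the width as an argument instead of building a format string. This is only a minimal sketch; the small n and the variable names width and NewID are chosen here for illustration and are not part of the original answer.

```r
# Illustrative size only; the answer's data uses n <- 26000
n <- 120

# Number of digits in the largest row number (3 for n = 120)
width <- nchar(as.character(n))

# Zero-pad each row number to that width, then add the prefix
NewID <- paste0("NewID", formatC(seq_len(n), width = width, flag = "0"))

head(NewID, 3)
# "NewID001" "NewID002" "NewID003"
```

Inside a pipeline the same expression could be used as the right-hand side of mutate(NewID = ...), with n replaced by the row count.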

Data:

> YourData <- data.frame(OldID = c("1", "2", "3", "4"), Value = c(1, 2, 3, 4))
> newIDs <- paste0("NewID", seq_len(nrow(YourData)))

> n <- 26000
> dat <- data.frame(OldID = as.character(seq_len(n)), Value = seq_len(n))
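
Applied to the question's goal, the same paste0 and sprintf ideas can combine the year with the existing ID. The following is only a sketch under stated assumptions: the column names year and id are taken from the question's own attempt, the four rows of df1 are invented for illustration, and base R assignment is used so the snippet runs without extra packages (the same expressions should also work inside mutate()).

```r
# Toy stand-in for the question's df1 (values are made up)
df1 <- data.frame(
  year = c(2000L, 2000L, 2002L, 2004L),
  id   = c(1L, 2L, 1L, 15L)
)

# The question's attempt: the year offset is multiplied by 10000,
# so 2002 with id 1 becomes 40001 rather than 20021
df1$Attempt <- 20000 + (df1$year - min(df1$year)) * 10000 + df1$id

# Concatenate year and id as text:
# 2000 and 1 give "20001", 2002 and 1 give "20021"
df1$NewID <- paste0(df1$year, df1$id)

# Fixed-width variant (assumes no id has more than 4 digits):
# id is zero-padded, so 2002 and 1 give "20020001"
df1$NewID_padded <- sprintf("%d%04d", df1$year, df1$id)

print(df1)
```

Because every year has four digits, the concatenated value stays unique as long as each id is unique within its year. The padded variant also keeps all IDs the same length, which may be easier to sort and read.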