I am writing code in R that should analyze two columns, P1 and P2, which both contain ID codes, together with the corresponding PAIR column.
I want each individual ID code to be used only once across all pairs, although the same ID code may appear in both P1 and P2 (just in different rows).
Furthermore, I want to exclude logical duplicates: a pair like "X30112_X30101" is a duplicate of "X30101_X30112".
In the long run I am actually looking for the maximum number of pairs, which is tricky because each ID code can only be used once, while the data show that one ID code can be paired with several others (1:n).
Unfortunately, I lack the experience to describe this better, and I think it might be a combinatorial problem. I would be grateful for any kind of help.
What have I tried so far?
So far I have only managed to solve the logical-duplicates part, using a simpler data frame, which more or less worked with this code:
# Sample data: df dataframe
df <- data.frame(
P1 = c("A", "B", "C", "W"),
P2 = c("W", "X", "Y", "A"),
PAIR = c("A_W", "B_X", "C_Y", "W_A")
)
# Function to normalize and sort pairs
normalize_and_sort <- function(pair) {
elements <- unlist(strsplit(pair, "[_\\.]"))
sorted_pair <- paste(sort(elements), collapse = "_")
return(sorted_pair)
}
# Normalize and sort the pairs and keep unique pairs
unique_pairs_df <- data.frame(PAIR = unique(sapply(df$PAIR, normalize_and_sort)))
# Print the unique_pairs_df
print(unique_pairs_df)
PAIR
1 A_W
2 B_X
3 C_Y
But this did not work with my actual data frame, maybe because my ID codes contain numbers, too.

Your idea to sort the pairs is just right. With just two columns, this is easy with the vectorized pmin and pmax:
Then you can use any standard code to remove duplicates, like
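
A minimal sketch of both steps, assuming P1 and P2 are character columns (not factors). The sample IDs are made up for illustration, and duplicated() is only one of several standard ways to drop the repeats:

```r
# Hypothetical sample data with ID codes shaped like those in the question
df <- data.frame(
  P1 = c("X30112", "X30101", "X30205", "X30101"),
  P2 = c("X30101", "X30112", "X30101", "X30300"),
  stringsAsFactors = FALSE
)

# Order each pair element-wise: lo holds the smaller ID of each row, hi the larger
lo <- pmin(df$P1, df$P2)
hi <- pmax(df$P1, df$P2)

# Rebuild the pair label so "X30112_X30101" and "X30101_X30112" become identical
df$PAIR <- paste(lo, hi, sep = "_")

# Keep only the first occurrence of every normalized pair
unique_pairs_df <- df[!duplicated(df$PAIR), ]
print(unique_pairs_df)
```

Note that this only removes logical duplicates. An ID code can still appear in several of the remaining pairs, so the "each ID only once" requirement needs a separate step.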