Pandas drop_duplicates() gives odd results: Has anybody seen this already?


When I create a DataFrame with a 3-level MultiIndex and run drop_duplicates() on it, the function seems to consider only the first two levels of the index and ignore the third.

import pandas as pd

index = [('2020-09-30', '2020-12-31', '2021-01-15'),
         ('2020-09-30', '2020-12-31', '2021-01-30'),
         ('2020-09-30', '2020-12-31', '2021-02-04'),
         ('2020-09-30', '2020-12-31', '2021-02-04')]

cols = ['values']

data = [10, 10, 10, 10]

df = pd.DataFrame(index=pd.MultiIndex.from_tuples(index), data=data, columns=cols)

The dataframe looks like this:


                                  values
2020-09-30 2020-12-31 2021-01-15      10
                      2021-01-30      10
                      2021-02-04      10
                      2021-02-04      10

There is only one duplicate: the 3rd and 4th rows are identical.

When I run drop_duplicates(), I get this:

In:
df.drop_duplicates()

Out:
                                  values
2020-09-30 2020-12-31 2021-01-15      10

I expected 3 rows back but got only 1. Has anybody come across this problem? Have I done something wrong, or is this a known issue with MultiIndexes?

There is 1 answer

Answer by mozway:

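drop_duplicates ignores the index and compares only the column values. Every row here has the same value (10) in the 'values' column, so all rows after the first are treated as duplicates. If you want to consider the index and all columns, you might use: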

out = df[~df.reset_index().duplicated().to_numpy()]
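For reference, here is a minimal, self-contained sketch that rebuilds the DataFrame from the question and compares the two approaches. The variable names (by_columns, mask, by_index_and_columns, by_index_only) are made up for illustration. The last variant, which looks at the index alone, is an additional option not mentioned above and only fits if the column values should not matter.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
index = pd.MultiIndex.from_tuples([
    ('2020-09-30', '2020-12-31', '2021-01-15'),
    ('2020-09-30', '2020-12-31', '2021-01-30'),
    ('2020-09-30', '2020-12-31', '2021-02-04'),
    ('2020-09-30', '2020-12-31', '2021-02-04'),
])
df = pd.DataFrame({'values': [10, 10, 10, 10]}, index=index)

# drop_duplicates() compares column values only, so every row after
# the first counts as a duplicate of it.
by_columns = df.drop_duplicates()

# reset_index() turns the index levels into ordinary columns, so
# duplicated() compares them too. to_numpy() gives a plain boolean
# array, which avoids aligning the mask on the (different) index.
mask = df.reset_index().duplicated().to_numpy()
by_index_and_columns = df[~mask]

# Alternative (assumption: only the index should decide what counts
# as a duplicate, regardless of the column values).
by_index_only = df[~df.index.duplicated()]

print(len(by_columns))            # 1
print(len(by_index_and_columns))  # 3
print(len(by_index_only))         # 3
```

With the data from the question, the last two variants give the same 3 rows. They would differ if two rows shared an index entry but held different values.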