When I create a dataframe with a 3-level multiindex and run drop_duplicates() on it, the function seems to focus only on the first two levels of the index and ignores the third.
import pandas as pd

index = [('2020-09-30', '2020-12-31', '2021-01-15'),
         ('2020-09-30', '2020-12-31', '2021-01-30'),
         ('2020-09-30', '2020-12-31', '2021-02-04'),
         ('2020-09-30', '2020-12-31', '2021-02-04')]
cols = ['values']
data = [10, 10, 10, 10]
df = pd.DataFrame(index=pd.MultiIndex.from_tuples(index), data=data, columns=cols)
The dataframe looks like this:
                                  values
2020-09-30 2020-12-31 2021-01-15      10
                      2021-01-30      10
                      2021-02-04      10
                      2021-02-04      10
Only one row is duplicated: the 3rd and 4th rows are identical.
When I run the drop_duplicates() function, I get this:
In:
df.drop_duplicates()
Out:
                                  values
2020-09-30 2020-12-31 2021-01-15      10
I expected 3 rows back and got only 1. Has anybody come across this problem? Have I done anything wrong, or is this a known issue with MultiIndexes?
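drop_duplicates ignores the index and compares only the column values. Since every row here has values equal to 10, rows 2 to 4 are all treated as duplicates of the first. If you want to consider the index and all columns, you might use: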
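One possible approach, offered as a sketch rather than the only way to do it: move the index levels into ordinary columns with reset_index(), ask duplicated() which rows repeat across index and values together, and use that mask to filter the original frame. The names mask and result are just illustrative, and the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the frame from the question so the snippet is self-contained.
index = [('2020-09-30', '2020-12-31', '2021-01-15'),
         ('2020-09-30', '2020-12-31', '2021-01-30'),
         ('2020-09-30', '2020-12-31', '2021-02-04'),
         ('2020-09-30', '2020-12-31', '2021-02-04')]
df = pd.DataFrame({'values': [10, 10, 10, 10]},
                  index=pd.MultiIndex.from_tuples(index))

# reset_index() turns the three index levels into regular columns, so
# duplicated() compares them together with 'values'.
mask = df.reset_index().duplicated().to_numpy()

# Keep the rows that are not flagged; the original MultiIndex is preserved.
result = df[~mask]
print(result)
```

Filtering with a boolean mask avoids having to rebuild the MultiIndex afterwards. If you would rather treat rows as duplicates based on the index alone, df[~df.index.duplicated()] should give the same 3 rows for this data.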