What is the best way to merge two dataframes that one of them has date ranges and the other one has date WITHOUT any shared columns?

46 views Asked by At

I have two DataFrames:

import pandas as pd

df1 = pd.DataFrame(
    {
        'date': ['2024-01-01','2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08', '2024-01-09', '2024-01-10', '2024-01-11', '2024-01-12', '2024-01-13'],
        'price': list(range(13))
    }
)

df2 = pd.DataFrame(
    {
        'start': ['2024-01-01', '2024-01-03', '2024-01-10'],
        'end': ['2024-01-03', '2024-01-08', '2024-01-12'],
        'id': ['a', 'b', 'c']
    }
)

And this is the expected output. I want to add id to df1:

         date  price   id
0  2024-01-01      0  NaN
1  2024-01-02      1    a
2  2024-01-03      2    a
3  2024-01-04      3    b
4  2024-01-05      4    b
5  2024-01-06      5    b
6  2024-01-07      6    b
7  2024-01-08      7    b
8  2024-01-09      8  NaN
9  2024-01-10      9  NaN
10 2024-01-11     10    c
11 2024-01-12     11    c
12 2024-01-13     12  NaN

The process is like this. Let me give you an example for row 1 of the output:

a) The date is 2024-01-02. Look up df2. There is a range for each row of df2. This date is between the first row of df2. Note that start is exclusive and end is inclusive.

b) Get the id from the identified row in df2 and put in the output.

Since there are no common columns between these two DataFrames, I have used a loop to get the output. It works but I am not sure if it is the best way:

df1['date'] = pd.to_datetime(df1.date)
df2[['start', 'end']] = df2[['start', 'end']].apply(pd.to_datetime)
for idx, row in df2.iterrows():
    start = row['start']
    end = row['end']
    id = row['id']

    df1.loc[df1.date.between(start, end, inclusive='right'), 'id'] = id

Any suggestions?

1

There are 1 answers

0
Andrej Kesely On BEST ANSWER

Another options, use pd.cut:

bins = pd.IntervalIndex.from_arrays(df2["start"], df2["end"])
df1["id"] = pd.cut(df1["date"], bins).map(dict(zip(bins, df2["id"])))

print(df1)

Prints:

         date  price   id
0  2024-01-01      0  NaN
1  2024-01-02      1    a
2  2024-01-03      2    a
3  2024-01-04      3    b
4  2024-01-05      4    b
5  2024-01-06      5    b
6  2024-01-07      6    b
7  2024-01-08      7    b
8  2024-01-09      8  NaN
9  2024-01-10      9  NaN
10 2024-01-11     10    c
11 2024-01-12     11    c
12 2024-01-13     12  NaN