Groupby and transform in pandas based on window conditions


Please help me find an optimal solution for this task.

We have a pandas dataframe with two main date columns, many other columns, and more than 20 million rows.

Here is a toy example of the dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'), pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'), pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client2', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})


We need to group by and transform the dataframe at the level of clients, but with the condition that we only look at a certain window for each client: a row counts only if date1 + 12 months <= sample_date.

The outcome would be a new column in the df dataframe (the expected values were shown as a screenshot in the original post).

Below is my code. It works, but it is very slow. Please help me optimize it using pandas methods that I'm not yet aware of!

# initializing the outcome column
df['count_products'] = np.nan

# for each row: keep the rows whose date1 + 12 months is on or before this
# row's sample_date, restrict to the same client, and count the products
for i in range(df.shape[0]):
    df_temp = df[(df['date1'] + pd.DateOffset(months=12)) <= df['sample_date'].iloc[i]]
    df_temp = df_temp[df_temp['clients'] == df['clients'].iloc[i]]
    df.loc[df.index[i], 'count_products'] = df_temp['products'].count()

I would appreciate any help!

Latest update 31.05.2023: Additional dataset:

df = pd.DataFrame({'date1': [pd.Timestamp('06.08.2018'), pd.Timestamp('30.07.2019'), pd.Timestamp('07.07.2021'), pd.Timestamp('01.11.2020')],
                   'sample_date': [pd.Timestamp('31.05.2018'), pd.Timestamp('24.07.2019'), pd.Timestamp('28.06.2021'), pd.Timestamp('30.12.2021')],
                  'clients': ['client1', 'client1', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

The expected result for this dataset was shown as a screenshot in the original post.

Latest update: the logic is to count, for each clients & sample_date pair, the number of that client's products whose date1 is at least 12 months before the sample_date.
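For concreteness, here is a minimal, deliberately slow restatement of that rule (a sketch added for illustration; the column name expected_count is arbitrary and not part of the original post):

# Reference restatement of the rule: for each row, count this client's
# products whose date1 + 12 months falls on or before the row's sample_date.
df['expected_count'] = [
    int(((df.loc[df['clients'] == c, 'date1'] + pd.DateOffset(months=12)) <= s).sum())
    for c, s in zip(df['clients'], df['sample_date'])
]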


There are 3 answers

Answer by Ani (accepted answer, score 0)

The answer is the following:

  1. You can use iterrows(): if you really must iterate over rows in pandas, do it with iterrows() (see the sketch after option 2's code below).

  2. There is a better way:

def calc_func(partition):
    # per row, count this client's products whose date1 is on or before
    # the row's sample_date (the 12-month offset is not applied here)
    partition['count_products'] = partition.apply(
        lambda row: partition.loc[partition['date1'] <= row['sample_date'], 'products'].count(),
        axis=1)
    return partition

result_df = df.groupby('clients').apply(calc_func)
# merge the per-(clients, sample_date) counts back onto df
df = df.merge(result_df.groupby(['clients', 'sample_date'], as_index=False)['count_products'].first(),
              on=['clients', 'sample_date'])
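A minimal sketch of option 1 with iterrows(), applying the 12-month rule from the question (this is an illustration, not code from the original answer; the column name count_products_iter is arbitrary):

counts = []
for _, row in df.iterrows():
    # this client's products whose date1 + 12 months is on or before sample_date
    same_client = df[df['clients'] == row['clients']]
    qualifies = (same_client['date1'] + pd.DateOffset(months=12)) <= row['sample_date']
    counts.append(int(qualifies.sum()))
df['count_products_iter'] = counts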
Answer by Vons (score 3)

If you vectorize everything, it should be faster. You also need cumcount and first (this assumes the first date1 per client is the earliest).

import pandas as pd
import numpy as np

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'), pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'), pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                  'clients': ['client1', 'client2', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

df1 = pd.DataFrame({'date1': [pd.Timestamp('06.08.2018'), pd.Timestamp('30.07.2019'), pd.Timestamp('07.07.2021'), pd.Timestamp('01.11.2020')],
                   'sample_date': [pd.Timestamp('31.05.2018'), pd.Timestamp('24.07.2019'), pd.Timestamp('28.06.2021'), pd.Timestamp('30.12.2021')],
                  'clients': ['client1', 'client1', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

# first date1 per client (assumed to be the earliest), broadcast to every row
df['date0'] = df.groupby('clients')['date1'].transform('first')

# rows where that first date plus 12 months falls on or before the sample_date
vec = (df['date0'] + pd.DateOffset(months=12)) <= df['sample_date']

# 1-based running count of qualifying products per client
df.loc[vec, 'count'] = df.loc[vec].groupby('clients')['products'].transform('cumcount') + 1

print(df)

# the same steps for the second dataset
df1['date0'] = df1.groupby('clients')['date1'].transform('first')
vec1 = (df1['date0'] + pd.DateOffset(months=12)) <= df1['sample_date']
df1.loc[vec1, 'count'] = df1.loc[vec1].groupby('clients')['products'].transform('cumcount') + 1

print(df1)

Output

       date1 sample_date  clients  products      date0  count
0 2021-04-15  2022-04-30  client1  product1 2021-04-15    1.0
1 2020-05-01  2022-04-30  client2  product2 2020-05-01    1.0
2 2022-12-31  2022-01-30  client1  product3 2021-04-15    NaN
3 2020-11-01  2021-12-30  client2  product4 2020-05-01    2.0
       date1 sample_date  clients  products      date0  count
0 2018-06-08  2018-05-31  client1  product1 2018-06-08    NaN
1 2019-07-30  2019-07-24  client1  product2 2018-06-08    1.0
2 2021-07-07  2021-06-28  client1  product3 2018-06-08    2.0
3 2020-01-11  2021-12-30  client2  product4 2020-01-11    1.0
Answer by Emma (score 10)

Your filter and groupby code looks good; you just need to apply it to the entire dataframe at once instead of row by row.

mask = (df['date1'] + pd.DateOffset(months=12)) <= df['sample_date']
df.loc[mask, 'count'] = df.loc[mask].groupby('clients')['products'].transform('count')

Update

IIUC, for each clients & sample_date pair, you want to count the products whose date1 is at least 12 months before the sample_date.

If that is the case, you can do a self-join and count with a condition.

# self-join: attach every date1 of the same client to each row as date1_y
df = df.merge(df[['clients', 'date1']], on='clients', how='left', suffixes=('', '_y'))

# joined dates that are at least 12 months before the row's sample_date
mask = (df.date1_y + pd.DateOffset(months=12)) <= df.sample_date
df.loc[mask, 'count_products'] = (df.loc[mask].groupby(['clients', 'sample_date'])
                                  .products
                                  .transform('count'))
# spread the count to every row of each (clients, sample_date) pair
df['count_products'] = (df.groupby(['clients', 'sample_date'])
                        .count_products
                        .transform('max'))
# keep one row per pair and drop the helper column
df = (df.drop_duplicates(subset=['clients', 'sample_date'])
      .drop('date1_y', axis=1))

Result

       date1 sample_date  clients  products  count_products
0 2018-06-08  2018-05-31  client1  product1             NaN
3 2019-07-30  2019-07-24  client1  product2             1.0
6 2021-07-07  2021-06-28  client1  product3             2.0
9 2020-01-11  2021-12-30  client2  product4             1.0
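If, as in the question, you want the counts as a column on the original frame (one value per input row) rather than the deduplicated view above, you can merge them back on the pair keys. A minimal sketch, assuming a copy of the input was kept in df_orig (a hypothetical name) before the self-join:

# df_orig: hypothetical copy of the input taken before the self-join above
df_orig = df_orig.merge(
    df[['clients', 'sample_date', 'count_products']],
    on=['clients', 'sample_date'],
    how='left')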