Groupby and transform in pandas based on window conditions


Please help me find an optimal solution for this task.

We have a pandas dataframe with two main date columns, many other columns, and more than 20 million rows.

Here is a toy example of the dataset:

import numpy as np
import pandas as pd

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'), pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'), pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                   'clients': ['client1', 'client2', 'client1', 'client2'],
                   'products': ['product1', 'product2', 'product3', 'product4']})


We need to group by and transform the dataframe at the level of clients, but with the condition that we only look at a certain window for each client: a row counts only if date1 + 12 months <= sample_date.

The outcome would be a new column in the df dataframe (the expected values were shown as a screenshot in the original post).

Below is my code. It works, but it is very slow. Please help me optimize it using pandas methods that I'm not yet aware of!

# initializing the outcome column
df['count_products'] = np.nan

# for each row: keep the rows whose date1 + 12 months is on or before this
# row's sample_date, restrict to the same client, and count the products
for i in range(df.shape[0]):
    df_temp = df[(df['date1'] + pd.DateOffset(months=12)) <= df['sample_date'].iloc[i]]
    df_temp = df_temp[df_temp['clients'] == df['clients'].iloc[i]]
    df.loc[df.index[i], 'count_products'] = df_temp['products'].count()

I would appreciate any help!

Latest update 31.05.2023: Additional dataset:

df = pd.DataFrame({'date1': [pd.Timestamp('06.08.2018'), pd.Timestamp('30.07.2019'), pd.Timestamp('07.07.2021'), pd.Timestamp('01.11.2020')],
                   'sample_date': [pd.Timestamp('31.05.2018'), pd.Timestamp('24.07.2019'), pd.Timestamp('28.06.2021'), pd.Timestamp('30.12.2021')],
                  'clients': ['client1', 'client1', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

The expected result for this dataset was shown as a screenshot in the original post.

Latest update: the logic is to count, for each clients & sample_date pair, the number of that client's products whose date1 is at least 12 months before the sample_date.
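For concreteness, here is a minimal, deliberately slow restatement of that rule (a sketch added for illustration; the column name expected_count is arbitrary and not part of the original post):

# Reference restatement of the rule: for each row, count this client's
# products whose date1 + 12 months falls on or before the row's sample_date.
df['expected_count'] = [
    int(((df.loc[df['clients'] == c, 'date1'] + pd.DateOffset(months=12)) <= s).sum())
    for c, s in zip(df['clients'], df['sample_date'])
]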


There are 3 answers

Answer by Ani (accepted answer, score 0)

The answer is the following:

  1. You can use iterrows(): if you really must iterate over rows in pandas, do it with iterrows() (see the sketch after option 2's code below).

  2. There is a better way:

def calc_func(partition):
    # per row, count this client's products whose date1 is on or before
    # the row's sample_date (the 12-month offset is not applied here)
    partition['count_products'] = partition.apply(
        lambda row: partition.loc[partition['date1'] <= row['sample_date'], 'products'].count(),
        axis=1)
    return partition

result_df = df.groupby('clients').apply(calc_func)
# merge the per-(clients, sample_date) counts back onto df
df = df.merge(result_df.groupby(['clients', 'sample_date'], as_index=False)['count_products'].first(),
              on=['clients', 'sample_date'])
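A minimal sketch of option 1 with iterrows(), applying the 12-month rule from the question (this is an illustration, not code from the original answer; the column name count_products_iter is arbitrary):

counts = []
for _, row in df.iterrows():
    # this client's products whose date1 + 12 months is on or before sample_date
    same_client = df[df['clients'] == row['clients']]
    qualifies = (same_client['date1'] + pd.DateOffset(months=12)) <= row['sample_date']
    counts.append(int(qualifies.sum()))
df['count_products_iter'] = counts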
Answer by Vons (score 3)

If you vectorize everything, it should be faster. You also need cumcount and first (this assumes the first date1 per client is the earliest).

import pandas as pd
import numpy as np

df = pd.DataFrame({'date1': [pd.Timestamp('2021-04-15'), pd.Timestamp('2020-05-01'), pd.Timestamp('2022-12-31'), pd.Timestamp('2020-11-01')],
                   'sample_date': [pd.Timestamp('2022-04-30'), pd.Timestamp('2022-04-30'), pd.Timestamp('2022-01-30'), pd.Timestamp('2021-12-30')],
                  'clients': ['client1', 'client2', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

df1 = pd.DataFrame({'date1': [pd.Timestamp('06.08.2018'), pd.Timestamp('30.07.2019'), pd.Timestamp('07.07.2021'), pd.Timestamp('01.11.2020')],
                   'sample_date': [pd.Timestamp('31.05.2018'), pd.Timestamp('24.07.2019'), pd.Timestamp('28.06.2021'), pd.Timestamp('30.12.2021')],
                  'clients': ['client1', 'client1', 'client1', 'client2'],
                  'products': ['product1', 'product2', 'product3', 'product4']})

# first date1 per client (assumed to be the earliest), broadcast to every row
df['date0'] = df.groupby('clients')['date1'].transform('first')

# rows where that first date plus 12 months falls on or before the sample_date
vec = (df['date0'] + pd.DateOffset(months=12)) <= df['sample_date']

# 1-based running count of qualifying products per client
df.loc[vec, 'count'] = df.loc[vec].groupby('clients')['products'].transform('cumcount') + 1

print(df)

# the same steps for the second dataset
df1['date0'] = df1.groupby('clients')['date1'].transform('first')
vec1 = (df1['date0'] + pd.DateOffset(months=12)) <= df1['sample_date']
df1.loc[vec1, 'count'] = df1.loc[vec1].groupby('clients')['products'].transform('cumcount') + 1

print(df1)

Output

       date1 sample_date  clients  products      date0  count
0 2021-04-15  2022-04-30  client1  product1 2021-04-15    1.0
1 2020-05-01  2022-04-30  client2  product2 2020-05-01    1.0
2 2022-12-31  2022-01-30  client1  product3 2021-04-15    NaN
3 2020-11-01  2021-12-30  client2  product4 2020-05-01    2.0
       date1 sample_date  clients  products      date0  count
0 2018-06-08  2018-05-31  client1  product1 2018-06-08    NaN
1 2019-07-30  2019-07-24  client1  product2 2018-06-08    1.0
2 2021-07-07  2021-06-28  client1  product3 2018-06-08    2.0
3 2020-01-11  2021-12-30  client2  product4 2020-01-11    1.0
Answer by Emma (score 10)

Your filter and groupby code looks good; you just need to apply it to the entire dataframe at once instead of row by row.

mask = (df['date1'] + pd.DateOffset(months=12)) <= df['sample_date']
df.loc[mask, 'count'] = df.loc[mask].groupby('clients')['products'].transform('count')

Update

IIUC, for each clients & sample_date pair, you want to count the products whose date1 is at least 12 months before the sample_date.

If that is the case, you can do a self-join and count with a condition.

# self-join: attach every date1 of the same client to each row as date1_y
df = df.merge(df[['clients', 'date1']], on='clients', how='left', suffixes=('', '_y'))

# joined dates that are at least 12 months before the row's sample_date
mask = (df.date1_y + pd.DateOffset(months=12)) <= df.sample_date
df.loc[mask, 'count_products'] = (df.loc[mask].groupby(['clients', 'sample_date'])
                                  .products
                                  .transform('count'))
# spread the count to every row of each (clients, sample_date) pair
df['count_products'] = (df.groupby(['clients', 'sample_date'])
                        .count_products
                        .transform('max'))
# keep one row per pair and drop the helper column
df = (df.drop_duplicates(subset=['clients', 'sample_date'])
      .drop('date1_y', axis=1))

Result

       date1 sample_date  clients  products  count_products
0 2018-06-08  2018-05-31  client1  product1             NaN
3 2019-07-30  2019-07-24  client1  product2             1.0
6 2021-07-07  2021-06-28  client1  product3             2.0
9 2020-01-11  2021-12-30  client2  product4             1.0
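If, as in the question, you want the counts as a column on the original frame (one value per input row) rather than the deduplicated view above, you can merge them back on the pair keys. A minimal sketch, assuming a copy of the input was kept in df_orig (a hypothetical name) before the self-join:

# df_orig: hypothetical copy of the input taken before the self-join above
df_orig = df_orig.merge(
    df[['clients', 'sample_date', 'count_products']],
    on=['clients', 'sample_date'],
    how='left')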