Extracting X_train and y_train from tabular data where a column is target and another columns are features with its corresponding ID

38 views Asked by At

I have this DataFrame:

timestep ID x1 x2 y
1 1 2 3 0
2 1 4 5 0
... ... ... ... ...
24 1 5 5 0
... ... ... ... ...
1 9 2 3 1
2 9 4 5 1
... ... ... ... ...
24 9 2 2 1

I expect it converted to the numpy ndarray X_train with shape (9, 24, 2). Where, 9 represent how many IDs. 24 represent n-rows aka timestep. 2 represents how many features (x1 and x2)

It same to the y_train with shape (9, 1) where 9 represents how many IDs and 1 represents scalar value since all y will be same for all 24 timesteps

The minimal reproducible example for this, when accessing X_train[0,:,:] it will returns table with ID=1 with 24 rows (timesteps) like this:

x1 x2
2 3
4 5
... ...
5 5

(the last 24-th row)

1

There are 1 answers

0
mozway On

First sort the rows by ID/timestep with sort_values, then convert to_numpy and reshape:

out = (df.sort_values(by=['ID', 'timestep'])[['x1', 'x2']]
         .to_numpy().reshape((9, 24, 2))
       )

Or, for an arbitrary number of features/columns:

n = df[['ID', 'timestep']].nunique()

out = (df.sort_values(by=['ID', 'timestep'])
         .filter(like='x')
         .to_numpy().reshape((n['ID'], n['timestamp'], -1))
       )

NB. assuming no combinations of ID/timestamps is missing/duplicated

For y, since the IDs are already sorted:

y_train = df.drop_duplicates(subset='ID')['y'].to_numpy()