Featuretools training window not including any dates in calculation


I have created mock data that resembles my actual data. For each cutoff time, I want to calculate features only from the user's history in the two days leading up to it (from cutoff_time - 2 days through cutoff_time). Instead, I get a count of 0 and NaN for the other features, even though the user has complete rows inside the requested time range.

import featuretools as ft
import pandas as pd

# Create a simple Users DataFrame
# (user_id stays a regular column so add_dataframe(index="user_id") can find it)
users_df = pd.DataFrame({
    "user_id": [1]
})

# Create a Transactions DataFrame
transactions_df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "user_id": [1, 1, 1, 1, 1],
    "transaction_time": pd.to_datetime(["2014-1-1", "2014-1-1", "2014-1-2", "2014-1-3", "2014-1-4"]),
    "value": [5, 8, 3, 2, 1]
})  # "index" stays a regular column so add_dataframe(index="index") can find it

# Create an EntitySet
es_test = ft.EntitySet(id="user_data")
es_test = es_test.add_dataframe(
    dataframe_name="users",
    dataframe=users_df,
    index="user_id"
)
es_test = es_test.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="index",
    time_index="transaction_time"
)
es_test = es_test.add_relationship("users", "user_id", "transactions", "user_id")

# Specify cutoff times
cutoff_times = pd.DataFrame({
    "user_id": [1, 1],
    "time": pd.to_datetime(["2014-1-1", "2014-1-3"]),
    "label": ["a", "b"]
})

# Calculate features for specific cutoff times
window_fm, window_features = ft.dfs(
    entityset=es_test,
    target_dataframe_name="users",
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
    training_window="2d"
)

And the check:

window_fm.loc[(1, "2014-01-03")]

returns:

COUNT(transactions)                                      0
MAX(transactions.value)                                NaN
MEAN(transactions.value)                               NaN
MIN(transactions.value)                                NaN
SKEW(transactions.value)                               NaN
STD(transactions.value)                                NaN
SUM(transactions.value)                                0.0
MODE(transactions.DAY(transaction_time))               NaN
MODE(transactions.MONTH(transaction_time))             NaN
MODE(transactions.WEEKDAY(transaction_time))           NaN
MODE(transactions.YEAR(transaction_time))              NaN
NUM_UNIQUE(transactions.DAY(transaction_time))        <NA>
NUM_UNIQUE(transactions.MONTH(transaction_time))      <NA>
NUM_UNIQUE(transactions.WEEKDAY(transaction_time))    <NA>
NUM_UNIQUE(transactions.YEAR(transaction_time))       <NA>
label                                                    b
Name: (1, 2014-01-03 00:00:00), dtype: object

But if I filter the transactions DataFrame directly for the same range:

filtered_transactions = transactions_df[
    (transactions_df['user_id'] == 1) &
    (transactions_df['transaction_time'] >= pd.to_datetime("2014-01-01")) &
    (transactions_df['transaction_time'] <= pd.to_datetime("2014-01-03"))
]
print(filtered_transactions)

there is data that should have been used:

   index  user_id transaction_time  value
0      0        1       2014-01-01      5
1      1        1       2014-01-01      8
2      2        1       2014-01-02      3
3      3        1       2014-01-03      2
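
To make the expectation concrete, here is a minimal sketch in plain Python of the aggregates I would expect for the 2014-01-03 cutoff with a two-day window. It says nothing about how Featuretools treats the window boundaries. The helper `expected_window_stats` and its `include_start` switch are hypothetical names used only for this illustration, and the rows are copied from the mock data above.

```python
from datetime import datetime, timedelta
from statistics import mean

# Mock transactions for user 1, copied from the question: (transaction_time, value).
transactions = [
    (datetime(2014, 1, 1), 5),
    (datetime(2014, 1, 1), 8),
    (datetime(2014, 1, 2), 3),
    (datetime(2014, 1, 3), 2),
    (datetime(2014, 1, 4), 1),
]


def expected_window_stats(rows, cutoff, window, include_start=True):
    """Aggregate values with timestamps between cutoff - window and cutoff.

    The cutoff itself is always included. `include_start` (hypothetical,
    for this sketch only) decides whether a row exactly at cutoff - window
    counts as inside the window.
    """
    start = cutoff - window
    values = [
        value
        for ts, value in rows
        if (start <= ts if include_start else start < ts) and ts <= cutoff
    ]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


cutoff = datetime(2014, 1, 3)
window = timedelta(days=2)

# Window start included: rows from 2014-01-01 through 2014-01-03.
closed = expected_window_stats(transactions, cutoff, window, include_start=True)
# Window start excluded: rows from 2014-01-02 through 2014-01-03.
half_open = expected_window_stats(transactions, cutoff, window, include_start=False)

print(closed)     # {'count': 4, 'sum': 18, 'min': 2, 'max': 8, 'mean': 4.5}
print(half_open)  # {'count': 2, 'sum': 5, 'min': 2, 'max': 3, 'mean': 2.5}
```

Whether or not the start of the window is included, at least two rows fall inside it, so I would not expect a count of 0 for this cutoff.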

What is wrong with my data structure, then?
