I have created mock data that resembles my actual data. I want to calculate features from slices of each user's history covering the time range from cutoff_date - 2 days up to cutoff_date. However, I get NaN values in the output even though the user has complete rows within the requested time range.
import featuretools as ft
import pandas as pd
# Create a simple Users DataFrame
users_df = pd.DataFrame({
    "user_id": [1]
}).set_index('user_id')
# Create a Transactions DataFrame
transactions_df = pd.DataFrame({
    "index": [1, 2, 3, 4, 5],
    "user_id": [1, 1, 1, 1, 1],
    "transaction_time": pd.to_datetime(["2014-1-1", "2014-1-1", "2014-1-2", "2014-1-3", "2014-1-4"]),
    "value": [5, 8, 3, 2, 1]
}).set_index('index')
# Create an EntitySet
es_test = ft.EntitySet(id="user_data")
es_test = es_test.add_dataframe(
    dataframe_name="users",
    dataframe=users_df,
    index="user_id"
)
es_test = es_test.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_df,
    index="index",
    time_index="transaction_time"
)
es_test = es_test.add_relationship("users", "user_id", "transactions", "user_id")
# Specify cutoff times
cutoff_times = pd.DataFrame({
    "user_id": [1, 1],
    "time": pd.to_datetime(["2014-1-1", "2014-1-3"]),
    "label": ["a", "b"]
})
# Calculate features for specific cutoff times
window_fm, window_features = ft.dfs(
    entityset=es_test,
    target_dataframe_name="users",
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
    training_window="2d"
)
Checking the row for the second cutoff time:
window_fm.loc[(1, "2014-01-03")]
returns:
COUNT(transactions) 0
MAX(transactions.value) NaN
MEAN(transactions.value) NaN
MIN(transactions.value) NaN
SKEW(transactions.value) NaN
STD(transactions.value) NaN
SUM(transactions.value) 0.0
MODE(transactions.DAY(transaction_time)) NaN
MODE(transactions.MONTH(transaction_time)) NaN
MODE(transactions.WEEKDAY(transaction_time)) NaN
MODE(transactions.YEAR(transaction_time)) NaN
NUM_UNIQUE(transactions.DAY(transaction_time)) <NA>
NUM_UNIQUE(transactions.MONTH(transaction_time)) <NA>
NUM_UNIQUE(transactions.WEEKDAY(transaction_time)) <NA>
NUM_UNIQUE(transactions.YEAR(transaction_time)) <NA>
label b
Name: (1, 2014-01-03 00:00:00), dtype: object
But if I filter the DataFrame manually for the same time range:
filtered_transactions = transactions_df[
    (transactions_df['user_id'] == 1) &
    (transactions_df['transaction_time'] >= pd.to_datetime("2014-01-01")) &
    (transactions_df['transaction_time'] <= pd.to_datetime("2014-01-03"))
]
print(filtered_transactions)
there is data that should have been used:
index user_id transaction_time value
0 0 1 2014-01-01 5
1 1 1 2014-01-01 8
2 2 1 2014-01-02 3
3 3 1 2014-01-03 2
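To make the expectation concrete, here is a rough pandas-only sketch of the aggregates I would expect for each cutoff time. It assumes both edges of the window are inclusive, like the manual filter above; I have not verified how Featuretools treats the window edges, so that part is an assumption. The names `window` and `expected` are only for illustration.

```python
import pandas as pd

# Same mock transactions as in the question.
transactions = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1],
    "transaction_time": pd.to_datetime(
        ["2014-01-01", "2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04"]
    ),
    "value": [5, 8, 3, 2, 1],
})

cutoffs = pd.to_datetime(["2014-01-01", "2014-01-03"])
window = pd.Timedelta(days=2)

rows = []
for cutoff in cutoffs:
    # Assumption: both window edges are inclusive, as in the manual filter.
    in_window = transactions[
        (transactions["user_id"] == 1)
        & (transactions["transaction_time"] >= cutoff - window)
        & (transactions["transaction_time"] <= cutoff)
    ]
    rows.append({
        "time": cutoff,
        "COUNT": len(in_window),
        "SUM": in_window["value"].sum(),
        "MAX": in_window["value"].max(),
        "MIN": in_window["value"].min(),
        "MEAN": in_window["value"].mean(),
    })

# One row of hand-computed aggregates per cutoff time.
expected = pd.DataFrame(rows).set_index("time")
print(expected)
```

Under that assumption, the 2014-01-03 cutoff should give a count of 4 and a sum of 18 rather than 0.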
So what is wrong with my data structure?