I am a bit confused about how to combine hyperparameter tuning and model evaluation correctly.
Should hyperparameter tuning be done on the whole dataset or only on the training set? What is the correct sequence of steps?
Could you please review my code and advise on best practice for this issue?
Here I tune the hyperparameters on the whole dataset first and then evaluate model performance on a train/test split. Is this correct? Doesn't it lead to data leakage?
Hyperparameter tuning:
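import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = X.select_dtypes(include=['int', 'float']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)
# Search over l1_ratio and alpha using ElasticNetCV's built-in cross-validation
en_cv = ElasticNetCV(l1_ratio=np.arange(0, 1.1, 0.1),
                     alphas=np.arange(0, 1.1, 0.1),
                     random_state=818,
                     n_jobs=-1)
model = make_pipeline(preprocessor, en_cv)
model.fit(X, y)  # fitted on the whole dataset
best_alpha = en_cv.alpha_
best_l1_ratio = en_cv.l1_ratio_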
Model evaluation:
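from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Refit with the tuned hyperparameters on the training split only
en_model = make_pipeline(preprocessor, ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)
en_model.fit(X_train, y_train)
y_pred = en_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(r2, mse)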
This code took about 18 minutes to run on a dataset with about 80,000 observations and about 150 columns. Is that a reasonable runtime?
Regarding the hyperparameter tuning question: it should be done on the training set only, not on the whole dataset. Tuning hyperparameters on the entire dataset introduces "data leakage": information from the test set, which is supposed to stay unseen, influences the choice of hyperparameters. The resulting performance estimates are then optimistically biased, i.e. they look better than what you would get on genuinely new data.
Concerning the second question, a baseline pipeline would look like this:
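As a minimal sketch of that sequence (split first, tune with cross-validation on the training set only, score once on the test set), assuming synthetic data from `make_regression` and an illustrative parameter grid, neither of which comes from your code:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): replace with your own X and y.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=818)

# 1. Hold out the test set before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=818
)

# 2. Keep preprocessing inside the pipeline so it is refit on every CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", ElasticNet(max_iter=10_000)),
])

# 3. Tune on the training set only (illustrative grid, not tuned for any real data).
param_grid = {
    "model__alpha": [0.01, 0.1, 1.0],
    "model__l1_ratio": [0.1, 0.5, 0.9],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

# 4. Score once on the untouched test set, using the refit best estimator.
y_pred = search.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)
print(search.best_params_, test_r2, test_mse)
```

`GridSearchCV` refits the best configuration on the full training set by default, so `search.predict` can be used directly for the final test-set score.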
Based on the above, your posted code would become:
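One way this could look, as a sketch rather than a drop-in replacement: the toy DataFrame, the `cv=5` setting, and the `l1_ratio`/`alphas` grids below are assumptions chosen so the example runs on its own, and the numeric columns are selected with `include="number"` instead of `['int', 'float']`. The key change is that `train_test_split` comes first and `ElasticNetCV` only ever sees the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for your data (assumption): two numeric columns, one categorical.
rng = np.random.default_rng(818)
n = 400
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "group": pd.Categorical(rng.choice(["a", "b", "c"], size=n)),
})
y = (3.0 * X["x1"] - 2.0 * X["x2"]
     + 1.5 * (X["group"] == "a")
     + rng.normal(scale=0.5, size=n))

# Split first, so the test rows never influence tuning or scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=818
)

numeric_features = X_train.select_dtypes(include="number").columns
categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

# ElasticNetCV runs its own cross-validation, here on the training rows only.
# Illustrative grids (assumption); both start above 0.
en_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9, 1.0],
    alphas=np.logspace(-3, 0, 10),
    cv=5,
    random_state=818,
)
model = make_pipeline(preprocessor, en_cv)
model.fit(X_train, y_train)

# Evaluate once on the held-out test set.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(en_cv.alpha_, en_cv.l1_ratio_, r2, mse)
```

Note that with this layout the preprocessor is fitted on all training rows before `ElasticNetCV` splits them into folds, so the inner folds share scaling statistics. The test set is still untouched; if you want the preprocessing refit per fold as well, wrapping a plain `ElasticNet` pipeline in `GridSearchCV` is the usual alternative.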