Applying SMOTE to just the training data during cross-validation of a Sequential Feature Selection algorithm, after a train/test split

**Split into Train and Test Datasets**

```python
X_train, X_test, y_train, y_test = train_test_split(
    X_pre, y, random_state=0, stratify=y, train_size=training_fraction
)
```

**Apply SMOTE or Some Other Balancing Algorithm**

```python
X_imputed_train_df, y_train = balancing_algorithm.fit_resample(X_imputed_train_df, y_train)
```
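For context on what the balancing step produces: SMOTE creates synthetic minority-class samples by interpolating between a minority point and one of its k nearest minority neighbors. A minimal NumPy sketch of that idea (the `smote_like` function and its parameters are illustrative, not imbalanced-learn's actual API):

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Toy SMOTE-style oversampling: interpolate between a random
    minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Distances from X_min[i] to every minority point
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                  # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# Four minority points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=5, rng=0)
print(X_new.shape)  # (5, 2): five synthetic points inside the square
```

Every synthetic point lies on a segment between two real minority points, which is why resampling before splitting leaks information: synthetic points in a validation fold are interpolations of training points.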

**Apply Sequential Feature Selection & Hyperparameter Tuning**

```python
lr1 = LogisticRegression(random_state=42, max_iter=250)
lr2 = LogisticRegression(random_state=42, max_iter=250)

sss = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=42)

sfsLR = SFS(estimator=lr1,
            k_features='best',
            forward=boolean_sfs,
            floating=False,
            scoring='f1',
            cv=sss)

pipe_lr = Pipeline([('lr2', lr2)])

lr_param_grid = [{'lr2__penalty': ['l1', 'l2'],
                  'lr2__C': param_range_fl,
                  'lr2__solver': ['liblinear', 'lbfgs']}]

lr_grid_search = GridSearchCV(estimator=pipe_lr,
                              param_grid=lr_param_grid,
                              scoring='f1',
                              cv=sss)
```

```python
grid_dict = {0: 'Logistic Regression'}
grids = [lr_grid_search]
SFSList = [sfsLR]

j = 0
for pipe, sfs in zip(grids, SFSList):
    # Fit the SFS to my amplified train data (the one that has
    # ADASYN/SMOTE samples); this also updates the SFSList items
    sfs = sfs.fit(X_ADASYN3, labels6)

    # Get selected feature indices
    selected_feature_indices = list(sfs.k_feature_idx_)

    # Create a DataFrame with the selected features
    # (.iloc already preserves the original column names)
    sfsFinal = X_ADASYN3.iloc[:, selected_feature_indices]

    # Fit the pipeline, i.e. fit each grid-search item
    # to the data with the best features
    pipe.fit(sfsFinal, labels6)
    print("I just finished calculating: " + grid_dict[j])
    j += 1
```

Thank you very much.

I am training this model, and it seems that I am applying SMOTE to the entire training set up front, when I should be applying it only to the training folds during cross-validation, for both the Sequential Feature Selection and the GridSearchCV. Is this correction a must? How can I modify my code to do so?
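One common way to get fold-only resampling (a sketch, not the only option): do the split yourself and oversample only the training fold, so each validation fold scores on untouched real data. The example below uses simple random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE/ADASYN, on a synthetic dataset; with imbalanced-learn installed, the same effect is obtained more conveniently by putting the SMOTE step inside an `imblearn.pipeline.Pipeline`, which GridSearchCV (or SFS) then applies only to the training portion of each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.utils import resample

# Imbalanced toy data standing in for the real training set
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=42)

sss = StratifiedShuffleSplit(n_splits=8, test_size=0.2, random_state=42)
scores = []
for train_idx, val_idx in sss.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    # Oversample ONLY the training fold (stand-in for SMOTE/ADASYN);
    # the validation fold stays untouched real data
    X_min, X_maj = X_tr[y_tr == 1], X_tr[y_tr == 0]
    X_up = resample(X_min, n_samples=len(X_maj), replace=True, random_state=42)
    X_bal = np.vstack([X_maj, X_up])
    y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))

    clf = LogisticRegression(max_iter=250).fit(X_bal, y_bal)
    scores.append(f1_score(y_val, clf.predict(X_val)))

print(round(float(np.mean(scores)), 3))  # mean F1 over the 8 folds
```

The same per-fold logic would wrap the SFS feature search: resample inside each of its CV splits rather than once before calling `fit`.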


There are 0 answers