StratifiedKFold results in missing labels?


I was using StratifiedKFold from scikit-learn and noticed missing labels. I started with 7 labels, but after splitting with k-fold cross-validation, every fold was missing labels '1' and '5'. Yet after training, my model's confusion matrix was still 7x7 (for each fold separately). How is that possible? And if the split is stratified, shouldn't every label be distributed across the folds class-wise?

from sklearn.model_selection import StratifiedKFold, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=2024)
print(y_train.value_counts())
y_train.value_counts().sum()
OUTPUT[1]:
Label
0    1422917
3     241329
2       6864
6       1607
5       1468
1       1081
4         27
Name: count, dtype: int64
1675293
Folds_split = StratifiedKFold(n_splits=2, shuffle=True, random_state=2024)

for i, (train_index, test_index) in enumerate(Folds_split.split(X_train, y_train)):
    
    X_train_fold = X.iloc[train_index]
    X_test_fold = X.iloc[test_index]
    y_train_fold = y.iloc[train_index]
    y_test_fold = y.iloc[test_index]

    print(y_train_fold.value_counts())
    print(len(y_train_fold))
    print(y_test_fold.value_counts())
    print(len(y_test_fold))
    print(len(y_train_fold)+len(y_test_fold))
    print(len(y_train_fold)/(len(y_train_fold)+len(y_test_fold)))
OUTPUT[2]:
Label
0    734822
3     97170
2      4547
6      1097
4        10
Name: count, dtype: int64
837646
Label
0    735395
3     96586
2      4605
6      1046
4        15
Name: count, dtype: int64
837647
1675293
0.4999997015447447
Label
0    735395
3     96586
2      4605
6      1046
4        15
Name: count, dtype: int64
837647
Label
0    734822
3     97170
2      4547
6      1097
4        10
Name: count, dtype: int64
837646
1675293
0.5000002984552553

Where are the counts for labels '1' and '5'?

Answer by DataJanitor:

As Ben Reiniger pointed out in the comments, the issue is the dataset you are using for slicing.

X_train_fold and X_test_fold are sliced from the original X rather than from X_train, and y_train_fold and y_test_fold are sliced from the original y rather than from y_train. X and y were already split by train_test_split, and StratifiedKFold.split(X_train, y_train) returns train_index and test_index as integer positions within X_train and y_train, not within the original X and y. Passing those positions to X.iloc or y.iloc therefore selects unrelated rows.
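The mismatch can be reproduced on a small synthetic dataset. This is only a sketch: the data, the column name `feature`, and the variable names `wrong_labels`, `right_train_labels`, and `right_test_labels` are made up for illustration, and the labels are deliberately sorted so that class 2 sits at the end of the original data. Whether the asker's data is ordered in a similar way is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic labels, sorted so that class 2 occupies the last 10 rows (assumption for the demo).
y = pd.Series([0] * 60 + [1] * 30 + [2] * 10, name="Label")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
train_index, test_index = next(skf.split(X_train, y_train))

# Fold indices are positions inside X_train, so they never exceed len(X_train) - 1.
max_position = max(train_index.max(), test_index.max())

# Wrong: the positions are applied to the original y, so only its first
# len(y_train) rows are reachable and class 2 can never be selected.
wrong_labels = set(y.iloc[train_index]) | set(y.iloc[test_index])

# Right: the positions are applied to the object that was passed to split().
right_train_labels = set(y_train.iloc[train_index])
right_test_labels = set(y_train.iloc[test_index])

print(max_position, len(X_train) - 1)
print(wrong_labels)
print(right_train_labels, right_test_labels)
```

In this toy setup the wrong slicing never sees class 2, while slicing y_train keeps all three classes in both parts of the fold.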

So the modified code could look like this:

for i, (train_index, test_index) in enumerate(Folds_split.split(X_train, y_train)):
    X_train_fold = X_train.iloc[train_index]
    X_test_fold = X_train.iloc[test_index]
    y_train_fold = y_train.iloc[train_index]
    y_test_fold = y_train.iloc[test_index]
    ...

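Class imbalance is not the cause of the missing labels: label '4' is the rarest class (27 samples in y_train) and still shows up in both folds, and StratifiedKFold places every class that has at least n_splits samples into each fold. Because the fold indices run from 0 to len(X_train) - 1, X.iloc and y.iloc only ever reach the first 1,675,293 rows of the original X and y. Labels '1' and '5' are presumably missing because none of those rows carry them, for example if the original data is ordered so that these classes sit in the last 25% of rows.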

With the modified code, labels '1' and '5' should appear in every fold, since each has well over n_splits samples in y_train.
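One way to confirm this is to tabulate the class counts per fold. The helper below is a hypothetical sketch, not part of scikit-learn: `fold_class_counts` and the toy `labels` series are names invented here, and the helper assumes the labels are a pandas Series. It passes a placeholder array as X, which the scikit-learn documentation notes is sufficient because the splits are derived from y.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def fold_class_counts(y_train, n_splits=2, random_state=2024):
    """Return one row of class counts per (fold, part) of a stratified split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    placeholder_X = np.zeros(len(y_train))  # only the number of samples matters here
    rows = {}
    for i, (train_index, test_index) in enumerate(skf.split(placeholder_X, y_train)):
        # Slice the same object that was passed to split(), by position.
        rows[(i, "train")] = y_train.iloc[train_index].value_counts()
        rows[(i, "test")] = y_train.iloc[test_index].value_counts()
    # A class that is absent from a fold would show up as 0 after fillna.
    return pd.DataFrame(rows).T.fillna(0).astype(int)


# Toy imbalanced labels (made up for the example).
labels = pd.Series([0] * 40 + [1] * 8 + [2] * 4, name="Label")
counts = fold_class_counts(labels)
print(counts)
```

Calling `fold_class_counts(y_train)` on the real data should show a non-zero count for all seven labels in every row if the slicing is correct.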