I was using StratifiedKFold from scikit-learn and noticed missing labels. I had 7 labels initially, but after splitting with k-fold cross-validation, every fold was missing the labels '1' and '5'. Yet after training, my model's confusion matrix was somehow 7x7 (for each fold separately). How? And if the split is stratified, shouldn't every label be distributed across the folds class-wise?
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25, random_state=2024)
print(y_train.value_counts())
y_train.value_counts().sum()
OUTPUT[1]:
Label
0 1422917
3 241329
2 6864
6 1607
5 1468
1 1081
4 27
Name: count, dtype: int64
1675293
Folds_split = StratifiedKFold(n_splits=2, shuffle=True, random_state=2024)
for i, (train_index, test_index) in enumerate(Folds_split.split(X_train, y_train)):
    X_train_fold = X.iloc[train_index]
    X_test_fold = X.iloc[test_index]
    y_train_fold = y.iloc[train_index]
    y_test_fold = y.iloc[test_index]
    print(y_train_fold.value_counts())
    print(len(y_train_fold))
    print(y_test_fold.value_counts())
    print(len(y_test_fold))
    print(len(y_train_fold)+len(y_test_fold))
    print(len(y_train_fold)/(len(y_train_fold)+len(y_test_fold)))
OUTPUT[2]:
Label
0 734822
3 97170
2 4547
6 1097
4 10
Name: count, dtype: int64
837646
Label
0 735395
3 96586
2 4605
6 1046
4 15
Name: count, dtype: int64
837647
1675293
0.4999997015447447
Label
0 735395
3 96586
2 4605
6 1046
4 15
Name: count, dtype: int64
837647
Label
0 734822
3 97170
2 4547
6 1097
4 10
Name: count, dtype: int64
837646
1675293
0.5000002984552553
Where are the counts for labels '1' and '5'?
As Ben Reiniger pointed out in the comments, the issue is the dataset you are using for slicing.
X_train_fold and X_test_fold are being sliced from the original X dataset, not from the split X_train dataset. Similarly, y_train_fold and y_test_fold are being sliced from the original y dataset. This is a problem because X_train, X_test, y_train, and y_test were already produced by train_test_split. When you call StratifiedKFold on X_train and y_train, the indices train_index and test_index are positions within these subsets, not within the original X and y.
So the fix is to slice X_train and y_train inside the loop, for example X_train_fold = X_train.iloc[train_index].
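A minimal sketch of the corrected loop follows. The small synthetic X and y (and the list fold_label_sets) are made up purely for illustration and stand in for your real data; the only change that matters is that the folds are taken from X_train and y_train.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the real data (hypothetical values).
rng = np.random.default_rng(2024)
y = pd.Series(np.repeat(np.arange(7), [700, 40, 80, 300, 10, 40, 30]), name="Label")
X = pd.DataFrame({"feature": rng.normal(size=len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=2024
)

folds_split = StratifiedKFold(n_splits=2, shuffle=True, random_state=2024)
fold_label_sets = []  # labels seen in (train part, test part) of each fold
for i, (train_index, test_index) in enumerate(folds_split.split(X_train, y_train)):
    # The indices are positions within X_train / y_train, so slice those,
    # not the original X / y.
    X_train_fold = X_train.iloc[train_index]
    X_test_fold = X_train.iloc[test_index]
    y_train_fold = y_train.iloc[train_index]
    y_test_fold = y_train.iloc[test_index]

    fold_label_sets.append(
        (set(y_train_fold.unique().tolist()), set(y_test_fold.unique().tolist()))
    )
    print(f"fold {i}")
    print(y_train_fold.value_counts())
    print(y_test_fold.value_counts())
```

With the folds taken from X_train and y_train, every label should show up in both parts of each fold.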
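The imbalance in your dataset (as seen in y_train.value_counts()) is unlikely to explain the absence of labels '1' and '5' on its own. StratifiedKFold keeps each class's proportion roughly the same in every fold, and both labels have more than 1,000 samples in y_train; a label would typically need fewer samples than n_splits to drop out of a fold.
Once the folds are sliced from X_train and y_train, you should see the counts for labels '1' and '5' in your folds.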
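If you want to check whether imbalance alone could make a label vanish, one option is to run StratifiedKFold on a label vector with the same class counts as y_train. The helper missing_labels_per_fold below is a hypothetical name introduced for this sketch, and the features are replaced by a placeholder array because only the labels drive the stratification.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def missing_labels_per_fold(y, n_splits=2, random_state=2024):
    """For each fold, return the labels absent from its train and test parts."""
    y = pd.Series(y).reset_index(drop=True)
    all_labels = set(y.unique().tolist())
    splitter = StratifiedKFold(
        n_splits=n_splits, shuffle=True, random_state=random_state
    )
    report = []
    # A placeholder X is enough: the split depends only on y.
    for train_index, test_index in splitter.split(np.zeros(len(y)), y):
        train_labels = set(y.iloc[train_index].unique().tolist())
        test_labels = set(y.iloc[test_index].unique().tolist())
        report.append((all_labels - train_labels, all_labels - test_labels))
    return report


# Class counts copied from y_train.value_counts() in the question.
counts = {0: 1422917, 3: 241329, 2: 6864, 6: 1607, 5: 1468, 1: 1081, 4: 27}
labels = np.repeat(list(counts.keys()), list(counts.values()))

for fold, (missing_train, missing_test) in enumerate(missing_labels_per_fold(labels)):
    print(fold, "missing in train:", missing_train, "missing in test:", missing_test)
```

An empty set for every fold means no label was dropped, which points back to the slicing as the cause rather than the class counts.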