So I'm using this dataset: https://www.kaggle.com/datasets/madhavmalhotra/journal-entries-with-labelled-emotions
and this video as guidance: https://www.youtube.com/watch?v=YyOuDi-zSiI&t=1077s
I have converted the values True to 1 and False to 0, and removed classes with fewer than 30 instances. Now I only have text for these classes (a sketch of this preprocessing follows the counts below):
happy 182
satisfied 133
calm 99
calm, happy, satisfied 77
happy, satisfied 73
proud 62
happy, proud, satisfied 54
excited, happy, satisfied 46
calm, satisfied 42
calm, happy 41
excited, happy, proud 37
proud, satisfied 33
frustrated 32
excited, happy 31
excited 31
Name: Emotions Felt, dtype: int64
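For reference, here is a minimal sketch of that preprocessing. The emotion column names are taken from the counts above and the combined-label column is "Emotions Felt"; the CSV file name and the assumption that the raw labels are boolean columns are mine:

import pandas as pd

# hypothetical file name - adjust to wherever the Kaggle CSV is saved
df = pd.read_csv("journal_entries.csv")

# convert the boolean emotion columns to 0/1
emotion_cols = ["happy", "satisfied", "calm", "proud", "excited", "frustrated"]
df[emotion_cols] = df[emotion_cols].astype(int)

# build the combined label string per row, e.g. "calm, happy, satisfied"
df["Emotions Felt"] = df[emotion_cols].apply(
    lambda row: ", ".join(sorted(c for c in emotion_cols if row[c] == 1)),
    axis=1,
)

# drop label combinations with fewer than 30 instances
counts = df["Emotions Felt"].value_counts()
df = df[df["Emotions Felt"].isin(counts[counts >= 30].index)]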
I'm using this code for swapping between models and machine learning methods:
from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.naive_bayes import MultinomialNB
# assuming scikit-multilearn, which provides ClassifierChain/BinaryRelevance/LabelPowerset
from skmultilearn.problem_transform import ClassifierChain

def build_model(model, mlb_estimator, xtrain, ytrain, xtest, ytest):
    # wrap the base model in the multi-label strategy and fit it
    clf = mlb_estimator(model)
    clf.fit(xtrain, ytrain)
    clf_predictions = clf.predict(xtest)
    # accuracy_score on multi-label data is subset accuracy: every label must match exactly
    acc = accuracy_score(ytest, clf_predictions)
    ham = hamming_loss(ytest, clf_predictions)  # actually a loss: lower is better
    result = {"accuracy": acc, "hamming_score": ham}
    return result

clf_chain_model = build_model(MultinomialNB(), ClassifierChain, X_train, y_train, X_test, y_test)
I got these results:
{'accuracy': 0.1815068493150685, 'hamming_score': 0.2054794520547945}
So my questions are:
Why is my accuracy so low?
How can I get higher accuracy?
I have tried swapping in different models: LogisticRegression, KNeighborsClassifier, DecisionTreeClassifier, GaussianNB, MultinomialNB, and RandomForestClassifier. I also swapped the multi-label methods (BinaryRelevance, ClassifierChain, and LabelPowerset) for each model, as in the loop sketched below. I have not tried neural network models or BERT yet.
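For completeness, a minimal sketch of how that model-by-method sweep can be driven through the build_model function above, assuming the scikit-multilearn problem-transformation classes (the import paths are an assumption about the library being used; X_train etc. come from the earlier split):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from skmultilearn.problem_transform import BinaryRelevance, ClassifierChain, LabelPowerset

# GaussianNB is omitted here because it requires dense input,
# while vectorized text features are usually sparse
models = [LogisticRegression(max_iter=1000), KNeighborsClassifier(),
          DecisionTreeClassifier(), MultinomialNB(), RandomForestClassifier()]
methods = [BinaryRelevance, ClassifierChain, LabelPowerset]

# try every base model with every problem-transformation method
for method in methods:
    for model in models:
        result = build_model(model, method, X_train, y_train, X_test, y_test)
        print(method.__name__, type(model).__name__, result)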
Some of the methods you describe have hyperparameters, which can change the performance of the models significantly. For the KNeighborsClassifier, the parameter k (n_neighbors in scikit-learn) is really important. Usually, one performs some kind of parameter optimisation with methods like k-fold cross-validation; this is needed to find the optimal parameter set for your data. You can use GridSearchCV for this (see the sketch below). In the scikit-learn documentation, there is also an example for a Support-Vector Machine.
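For illustration, a minimal sketch of tuning k with GridSearchCV. It relies on the fact that KNeighborsClassifier supports multi-label indicator targets natively, so it can be tuned directly on y_train; the grid values are just example choices:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# candidate values for k (n_neighbors)
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 15]}

# for multi-label y, scoring="accuracy" is subset accuracy (exact match of all labels)
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
print("test subset accuracy:", search.score(X_test, y_test))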