Semi-supervised SVM model running forever


I am experimenting with the Elliptic Bitcoin dataset and comparing how supervised and semi-supervised models perform on it. Here is the code for my supervised SVM model:

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# keep only the labelled transactions (class '1' or '2')
classified = class_features_df[class_features_df['class'].isin(['1','2'])]

X = classified.drop(columns=['txId', 'class', 'time step'])
y = classified[['class']]

# in this case, class 2 corresponds to licit transactions, we change this to 0 as our interest is the illicit transactions
y = y['class'].apply(lambda x: 0 if x == '2' else 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15, shuffle=False)

model_svm = svm.SVC(kernel='linear') # Linear Kernel

model_svm.fit(X_train, y_train)

# find accuracy score
y_pred = model_svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)

The above code works well and gives good results. However, when I try the same approach for semi-supervised learning, I get warnings and the model has been running for over an hour, whereas the supervised version finished in less than a minute:


unclassified = class_features_df[class_features_df['class'] == '3']

X_unclassified = unclassified[local_features_col + agg_features_col]

predictions = model_svm.predict(X_unclassified.values)


unclassified['class'] = predictions

# Combine the labeled and newly labeled unlabeled data
classified = classified.append(unclassified)


Xtrain = classified.drop(columns=['txId', 'class', 'time step'])
ytrain = classified['class'].astype('int') # astype('int') added to remove the "'<' not supported between instances of 'int' and 'str'" error raised by the SVM

X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(Xtrain, ytrain, test_size=0.3, random_state=15, shuffle=False)


model_svm.fit(X_train_lab, y_train_lab)

# Evaluate the model on the test set
y_pred = model_svm.predict(X_test_unlab)
acc = accuracy_score(y_test_unlab, y_pred)
print("Accuracy " , acc)

Additional information: transactions with class 1 or 2 are labelled, and transactions with class 3 are unlabelled (unclassified). Here is a picture of the first 5 rows of the dataset: [image]
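
To make the class convention described above concrete, here is a minimal sketch on a made-up frame. It assumes the class column holds the strings '1', '2' and '3'; `toy_df`, `LABEL_MAP` and the `label` column are hypothetical names used only for illustration. Mapping every class to an integer in one place (with -1 for unlabelled rows, the marker that scikit-learn's semi-supervised estimators expect) keeps 0/1 predictions from being mixed with the original '1'/'2' strings.

```python
import pandas as pd

# Toy stand-in for class_features_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "txId": [10, 11, 12, 13],
    "class": ["1", "2", "3", "2"],
})

# Hypothetical encoding: illicit -> 1, licit -> 0, unlabelled -> -1.
LABEL_MAP = {"1": 1, "2": 0, "3": -1}
toy_df["label"] = toy_df["class"].map(LABEL_MAP)

# Split into labelled and unlabelled rows using the integer encoding.
labelled = toy_df[toy_df["label"] != -1]
unlabelled = toy_df[toy_df["label"] == -1]

print(toy_df["label"].tolist())
print(len(labelled), len(unlabelled))
```

With a single integer encoding like this, pseudo-labels predicted as 0/1 can be written back without producing a third class value after concatenation.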

Is there something wrong with my semi-supervised implementation, or am I missing some values? Any code help will be appreciated.
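
For reference, the pseudo-labelling idea behind the code above (fit on labelled rows, predict the unlabelled rows, refit on the union) can be sketched as follows. This is only an illustration under stated assumptions: it runs on synthetic data from `make_classification`, not on the Elliptic dataset, and it swaps in a scaled `LinearSVC` for `SVC(kernel='linear')`, which is a substitution rather than what the original code uses. The split sizes and the `make_model` helper are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the feature matrix (not the real dataset).
X, y = make_classification(n_samples=2000, n_features=20, random_state=15)

# Hold out a truly labelled test set first so pseudo-labels never reach it.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.3, random_state=15, stratify=y
)

# Pretend only the first 300 remaining rows are labelled.
X_lab, y_lab = X_rest[:300], y_rest[:300]
X_unlab = X_rest[300:]


def make_model():
    # Scaling usually helps a linear SVM converge; LinearSVC tends to scale
    # better with the number of samples than SVC(kernel='linear').
    return make_pipeline(StandardScaler(), LinearSVC(dual=False))


# Step 1: supervised baseline on the labelled rows only.
base = make_model().fit(X_lab, y_lab)

# Step 2: pseudo-label the unlabelled rows, using the same 0/1 encoding.
pseudo = base.predict(X_unlab)

# Step 3: refit on labelled plus pseudo-labelled rows.
X_comb = np.vstack([X_lab, X_unlab])
y_comb = np.concatenate([y_lab, pseudo])
semi = make_model().fit(X_comb, y_comb)

acc_base = accuracy_score(y_test, base.predict(X_test))
acc_semi = accuracy_score(y_test, semi.predict(X_test))
print("baseline accuracy:", acc_base)
print("pseudo-labelled accuracy:", acc_semi)
```

Two design choices matter here. The test set contains only genuinely labelled rows, so the reported accuracy is not measured against the model's own predictions. The combined training set is much larger than the labelled subset, which is where a kernel-based `SVC` can become slow, since its fit time grows at least quadratically with the number of samples.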
