Understanding why the Leiden algorithm is not able to find communities in the iris dataset

Good morning!

I am trying to understand the Leiden algorithm and its usage to find partitions and clusterings. The example provided in the documentation already finds a partition directly, such as the following:

import leidenalg as la
import igraph as ig

G = ig.Graph.Famous('Zachary')

partition = la.find_partition(G, la.ModularityVertexPartition)
G.vs['cluster'] = partition.membership
ig.plot(partition,vertex_size = 30)

Checking partition.membership shows that it already finds 4 clusters.

However, when I try something similar with the iris dataset, the algorithm is not able to find clusters. I have tried taking the X variables and building either:

  • a 1 - correlation matrix, or
  • a pairwise distance matrix,

but neither works well (not even after scaling the values): the algorithm does not group the observations into clusters. I assume that correlations or pairwise distances are not good at separating them. What am I misunderstanding?

Here is the code (the correlation version is commented out; the pairwise-distance version is active):

import numpy as np
from sklearn import datasets
import igraph as ig
import leidenalg
import cairo
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data  # Features
y = iris.target  # Class labels
# Create an adjacency matrix based on observation similarity
# adj_matrix = abs(1-np.corrcoef(X))

adj_matrix = pairwise_distances(X)
print(adj_matrix)
# Create an igraph graph object
graph = ig.Graph.Weighted_Adjacency(adj_matrix)
# Apply the Leiden algorithm for community detection evaluating the nº of clusters created by changing the resolution parameter.
for i in np.arange(0.9,1.05,0.05):
    partition = leidenalg.find_partition(graph, leidenalg.CPMVertexPartition,
                                   resolution_parameter = i)
    print(i,len(np.unique(partition.membership)) )

#0.9 1
#0.9500000000000001 1
#1.0 150
#1.0500000000000003 150

As one can see, once the resolution reaches 1 there are 150 clusters (equal to the number of observations), and below that everything is treated as a single cluster. Let me know your ideas.

Thank you for your time.

There is 1 answer

Vincent Traag

Make sure to pass the edge weights to find_partition through its weights argument; otherwise the weights stored on the graph are not used. See the documentation for more detail.

With correlations, I highly recommend using CPM, not Modularity.