What exactly does the spark.ml.feature.BucketedRandomProjectionLSH function give as output?

I am trying to use spark.ml.feature.BucketedRandomProjectionLSH to create LSH hash vectors.

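import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

// Configure the LSH estimator with 4 hash tables
val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(0.6812920690579612)
  .setNumHashTables(4)
  .setInputCol("features")
  .setOutputCol("hashes")
// Fit on the repo vectors (repoDF and userDF both have a "features" column)
val lshModel = lsh.fit(repoDF)

// Append the "hashes" column to both DataFrames
val hashedUserDF = lshModel.transform(userDF)
val hashedRepoDF = lshModel.transform(repoDF)
hashedRepoDF.show(false)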

This gives me the following output:

// +-------+----------------------------------------------+--------------------------------+
// |repo_id|features                                      |hashes                          |
// +-------+----------------------------------------------+--------------------------------+
// |11     |(6,[0,1,2,3,4,5],[1.0,1.0,1.0,1.0,1.0,1.0])   |[[1.0], [-2.0], [-1.0], [-1.0]] |
// |12     |(6,[0,1,2,3,4,5],[9.0,-2.0,-21.0,9.0,1.0,9.0])|[[21.0], [-28.0], [18.0], [0.0]]|
// |13     |(6,[0,1,2,3,4,5],[1.0,1.0,-3.0,3.0,7.0,9.0])  |[[4.0], [-10.0], [6.0], [-3.0]] |
// |14     |(6,[0,1,2],[1.0,1.0,-3.0])                    |[[2.0], [-3.0], [2.0], [1.0]]   |
// |15     |(6,[1,2],[1.0,1.0])                           |[[-1.0], [0.0], [-2.0], [0.0]]  |
// +-------+----------------------------------------------+--------------------------------+

Based on my theoretical understanding of Bucketed Random Projection LSH, I expected each hash value to be a vector/array containing only 1s and -1s, depending on which side of the hyperplane the point lies.

The documentation for this library (https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html) is rather sparse and does not really explain the output. Can anyone help me understand the output, or point me to a source that explains it?
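
For context, here is a minimal sketch of the difference between the two schemes. It assumes that Spark follows the formula given in the LSH section of its ML feature guide: each hash table projects the vector onto a random unit vector v and returns the bucket index floor(x · v / bucketLength). The object name LshSketch, the helper names, the vector v and the sample points are all made up for illustration. Spark draws its own random vectors at fit time, so this will not reproduce the numbers in the table above.

```scala
object LshSketch {
  // Dot product of two dense vectors of equal length.
  def dot(x: Array[Double], v: Array[Double]): Double =
    x.zip(v).map { case (a, b) => a * b }.sum

  // Bucketed random projection: project x onto v, cut the projection line
  // into buckets of width bucketLength, and return the bucket index.
  // The result can be any integer-valued double, not just 1 or -1.
  def bucketedHash(x: Array[Double], v: Array[Double], bucketLength: Double): Double =
    math.floor(dot(x, v) / bucketLength)

  // Sign random projection (the scheme the question had in mind): only the
  // side of the hyperplane matters, so the result is always 1.0 or -1.0.
  def signHash(x: Array[Double], v: Array[Double]): Double =
    if (dot(x, v) >= 0.0) 1.0 else -1.0

  def main(args: Array[String]): Unit = {
    val v = Array(0.6, 0.8)   // hypothetical unit vector (assumption)
    val bucketLength = 0.5    // hypothetical bucket width (assumption)
    val points = Seq(Array(1.0, 1.0), Array(5.0, 4.0), Array(-2.0, -1.5))
    for (x <- points) {
      val b = bucketedHash(x, v, bucketLength)
      val s = signHash(x, v)
      println(s"${x.mkString("[", ", ", "]")} bucketed=$b sign=$s")
    }
  }
}
```

With these made-up inputs the bucketed hashes are 2.0, 12.0 and -5.0, while the sign hashes are 1.0, 1.0 and -1.0. If the assumption about Spark's formula holds, this would explain why the hashes column contains values such as 21.0 or -28.0: each inner one-element vector is the bucket index from one of the 4 hash tables, and a smaller bucketLength spreads points over more buckets.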
