I am trying to use org.apache.spark.ml.feature.BucketedRandomProjectionLSH to create LSH hash vectors:
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(0.6812920690579612)
  .setNumHashTables(4)
  .setInputCol("features")
  .setOutputCol("hashes")

// Fit on the repo vectors, then hash both DataFrames with the same model.
val lshModel = lsh.fit(repoDF)
val hashedUserDF = lshModel.transform(userDF)
val hashedRepoDF = lshModel.transform(repoDF)
hashedRepoDF.show(false)
This gives me the following output:
// +-------+----------------------------------------------+--------------------------------+
// |repo_id|features                                      |hashes                          |
// +-------+----------------------------------------------+--------------------------------+
// |11     |(6,[0,1,2,3,4,5],[1.0,1.0,1.0,1.0,1.0,1.0])   |[[1.0], [-2.0], [-1.0], [-1.0]] |
// |12     |(6,[0,1,2,3,4,5],[9.0,-2.0,-21.0,9.0,1.0,9.0])|[[21.0], [-28.0], [18.0], [0.0]]|
// |13     |(6,[0,1,2,3,4,5],[1.0,1.0,-3.0,3.0,7.0,9.0])  |[[4.0], [-10.0], [6.0], [-3.0]] |
// |14     |(6,[0,1,2],[1.0,1.0,-3.0])                    |[[2.0], [-3.0], [2.0], [1.0]]   |
// |15     |(6,[1,2],[1.0,1.0])                           |[[-1.0], [0.0], [-2.0], [0.0]]  |
// +-------+----------------------------------------------+--------------------------------+
Based on my theoretical understanding of Bucketed Random Projection LSH, I was under the impression that each hash value should be a vector/array consisting only of 1s and -1s, depending on which side of the hyperplane the point lies. Instead, the output contains values such as 21.0 and -28.0.
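To make the two interpretations concrete, here is a minimal sketch in plain Scala (no Spark dependency). `signHash` is the sign-based scheme I expected. `bucketHash` assumes the hash is instead `floor(x · v / bucketLength)` for a random unit vector `v`, which is how I read the LSH section of the Spark MLlib guide. The names `LshSketch`, `signHash`, `bucketHash` and `randomUnitVector` are hypothetical, and the random vectors are seeded arbitrarily, so the numbers will not match the Spark output above.

```scala
import scala.util.Random

object LshSketch {
  // Dot product of two dense vectors of equal length.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Sign-based random projection: +1 or -1 depending on which side of the
  // hyperplane (with normal v) the point x lies. This is what I expected.
  def signHash(x: Array[Double], v: Array[Double]): Double =
    if (dot(x, v) >= 0.0) 1.0 else -1.0

  // Bucketed random projection (assumed formula): project x onto v, then
  // return the index of the bucket of width bucketLength it falls into.
  def bucketHash(x: Array[Double], v: Array[Double], bucketLength: Double): Double =
    math.floor(dot(x, v) / bucketLength)

  // Random direction: Gaussian components normalised to unit length.
  def randomUnitVector(dim: Int, rng: Random): Array[Double] = {
    val raw = Array.fill(dim)(rng.nextGaussian())
    val norm = math.sqrt(raw.map(c => c * c).sum)
    raw.map(_ / norm)
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L) // arbitrary seed, not Spark's
    val x = Array(9.0, -2.0, -21.0, 9.0, 1.0, 9.0) // features of repo_id 12
    val bucketLength = 0.6812920690579612
    // One random vector per hash table (numHashTables = 4).
    val vs = Array.fill(4)(randomUnitVector(x.length, rng))
    println(vs.map(v => signHash(x, v)).mkString("sign:   [", ", ", "]"))
    println(vs.map(v => bucketHash(x, v, bucketLength)).mkString("bucket: [", ", ", "]"))
  }
}
```

Under that assumption each hash table would yield one bucket index stored as a double, which can be any integer value rather than just 1 or -1, and a smaller bucket length would spread the same points over more buckets.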
The documentation for this class (https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html) is rather sparse and does not explain the output. Can anyone help me understand the output, or point me to a source that explains it?