How to hash a signature matrix to buckets in Locality-sensitive hashing (LSH)

560 views Asked by At

I understand the algorithm behind creating signature matrix from shingles by applying hash functions. However I don't understand how to hash a specific band in a signature matrix to buckets. Assume in matrix M, band b1 we have following values for documents C1-C5:

C1  C2  C3  C4  C5
1   0   0   0   2
3   2   1   2   2
0   1   3   1   1

Just by looking at the values, we see C2 and C4 are identical in this band and they should hash into the same bucket. But the other columns will hash into different buckets.

My question is how to hash these columns into buckets and know if they are identical without actually comparing them. How should I define the hash function to map a column to a bucket? like sum of the elements in the column?

1

There are 1 answers

1
Kuzeko On

Trivial way, assume you have 1024 buckets

hash = 1 
for val  in column:
   hash = (( hash * 33 )  +  val ) % 1024

The % 1024 is necessarily the number of buckets you have. 33 is a magic number which works well in practice with strings

the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.

See http://www.cse.yorku.ca/~oz/hash.html