Modifying the hashCode() method in Java so that vectors whose Jaccard similarity is above a certain threshold get the same hash code, with good accuracy
Example:
vector 1: [1,1,0,0,1,0] vector 2: [1,1,0,0,0,0]
They share 2 set positions out of 3 positions set in either vector, so their Jaccard similarity is 2/3 (about 0.67).
How can I modify the hashCode() method in Java so that vectors with a similarity of 0.5 or above go into the same bucket, i.e. get the same hash code?
Note: I am not using the MinHash/LSH candidate-pair approach. The hash code has to be generated from the vector itself.
The goal is not to do it perfectly (which is impossible), but to do it as accurately as possible.
There will be situations where vectors A and B are similar enough, and B and C are similar enough, but A and C are not. The hash function then has to either group A with B, group B with C, or put A, B and C all together.
This is impossible. Jaccard similarity is a property of a pair of vectors, while the hash code must depend only on the contents of a single vector.
You can easily construct three vectors A, B and C such that (A,B) and (B,C) satisfy your criteria, which would force all three to generate the same hash code, but (A,C) does not. Hash-code equality is transitive; similarity above a threshold is not.
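To make the counterexample concrete, here is a minimal sketch with a threshold of 0.5. The class name `JaccardTransitivity`, the helper `jaccard`, and the three vectors are made up for illustration; they are not taken from the question.

```java
public class JaccardTransitivity {
    // Jaccard similarity of two equal-length 0/1 vectors:
    // (positions set in both) / (positions set in at least one).
    public static double jaccard(int[] a, int[] b) {
        int both = 0;
        int either = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == 1 && b[i] == 1) both++;
            if (a[i] == 1 || b[i] == 1) either++;
        }
        // Assumption: two all-zero vectors are treated as identical.
        return either == 0 ? 1.0 : (double) both / either;
    }

    public static void main(String[] args) {
        int[] a = {1, 1, 0, 0};
        int[] b = {1, 1, 1, 1};
        int[] c = {0, 0, 1, 1};

        System.out.println(jaccard(a, b)); // 0.5 -> A and B must share a hash code
        System.out.println(jaccard(b, c)); // 0.5 -> B and C must share a hash code
        System.out.println(jaccard(a, c)); // 0.0 -> A and C should not, yet they would
    }
}
```

Whatever hashCode() returns for B, the first two results force A and C to return the same value, even though they have nothing in common.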