I'm trying to implement a KNN algorithm where my variables have 9 dimensions. I originally have only 1K points in my set, but it might grow to 10-20K. Some of these dimensions are small ordinal scales (1-4, 1-6) and others are budget values in millions. I would like to define a distance function that correctly represents the closeness of new values, but not all dimensions are equally important. For example, a very similar budget, or a close value on the small scale (1-4), is much more indicative of similarity than a close value on the bigger scale (1-6).
I originally tried a standard normalization across all dimensions to equalize the ranges, but this makes distances in the small-range dimensions much more important by default. The idea is to transform the dimensions so that I can use a standard distance measure for the KNN algorithm, and therefore an optimized implementation such as faiss.
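For reference, what I tried is essentially min-max scaling. A minimal sketch with made-up toy data (the split into four 1-4 scales, three 1-6 scales, and two budget columns is hypothetical):

```python
import numpy as np

# Toy stand-in for my 9-dimensional data (hypothetical column layout).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 5, size=(1000, 4)),      # 1-4 ordinal scales
    rng.integers(1, 7, size=(1000, 3)),      # 1-6 ordinal scales
    rng.uniform(0.5, 50.0, size=(1000, 2)),  # budgets in millions
]).astype("float32")

# Standard min-max normalization: every dimension is mapped to [0, 1].
# The problem: one step on a 1-4 scale becomes 1/3 while one step on a
# 1-6 scale becomes 1/5, so small-range dimensions dominate the distance.
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```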
I'm divided between two alternatives:
- Use a different normalization technique, or different techniques depending on the importance of each dimension
- Weight the distance contribution of each dimension and use a brute-force KNN implementation, which could give issues when the dataset grows (because, from my understanding, optimized KNN algorithms only work with well-defined distance functions); see the sketch after this list
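For the second alternative, I realized the weighting might be folded into a standard distance: a per-dimension weighted squared Euclidean distance, sum_i w_i * (x_i - y_i)^2, is identical to plain L2 on features pre-multiplied by sqrt(w_i), so faiss could still be used. A minimal sketch, with made-up placeholder weights and random data standing in for the normalized points:

```python
import numpy as np
import faiss  # pip install faiss-cpu

rng = np.random.default_rng(0)
X_norm = rng.random((1000, 9), dtype="float32")  # stand-in for the normalized data

# Hypothetical per-dimension importance weights (placeholders, to be tuned).
w = np.array([3, 3, 3, 3, 1, 1, 1, 2, 2], dtype="float32")

# Pre-scaling folds the weights into standard L2:
# sum_i w_i * (x_i - y_i)^2 == ||sqrt(w)*x - sqrt(w)*y||^2
Xw = X_norm * np.sqrt(w)

index = faiss.IndexFlatL2(Xw.shape[1])  # exact L2 search over 9 dimensions
index.add(Xw)

# A query point must be normalized and scaled identically before searching.
distances, neighbours = index.search(Xw[:1], 5)  # 5 nearest neighbours of point 0
```

If that identity holds, I wouldn't need the brute-force fallback at all, only a sensible choice of weights.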
I'm open to using something other than KNN, or any other approach to finding similarity, if it solves the speed issues or the impact of dimensions with different ranges. Any idea how to better define the distance?