I'm working with a medium-sized dataset (shape = (14013, 46)).
I want to smooth each sample with its k nearest neighbours.
I'm training my model with:

import sklearn.metrics.pairwise
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='ball_tree',
                        metric=sklearn.metrics.pairwise.cosine_distances).fit(data)
And the smoothing function is as follows:
import numpy as np

def smooth(x, nbrs, data, alpha):
    """
    input:
        alpha: the smoothing factor
        nbrs:  fitted NearestNeighbors from sklearn
        data:  the original data
               (since NearestNeighbors returns only the indices, not the samples)
        x:     the sample we want to smooth
    output:
        x smoothed with its nearest neighbours
    """
    distances, indices = nbrs.kneighbors(x.reshape(1, -1))
    # turn cosine distances into similarities: sim = |1 - dist|
    similarities = np.abs(1 - distances[0])
    norm = similarities.sum()
    if norm == 0:
        print("No neighbours were found.")
        return x
    # neighbour weights sum to (1 - alpha); x itself keeps weight alpha
    weights = (1 - alpha) * similarities / norm
    neighbours = data[indices[0]]  # look the neighbour samples up by index
    other = neighbours * weights[:, np.newaxis]
    z = (alpha * x).reshape(1, -1)
    return np.concatenate((other, z), axis=0).sum(axis=0)
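
For reference, this is roughly how I drive it; the alpha value and the 1000-sample slice here are placeholders, not my exact settings:

smoothed = np.array([smooth(x, nbrs, data, alpha=0.7) for x in data[:1000]])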
My questions:

- How is it possible that no neighbours were found? (I experienced it on my dataset, hence the if condition.)
- The fitting ("training") takes 18 seconds, but smoothing ~1000 samples takes more than 20 minutes. I'm willing to accept less accurate results if the smoothing process gets shorter. Is that possible? Which parameters should I change to achieve it?
First of all, why use a ball tree? Maybe your metric implies this choice, but if that's not the case, you could use a k-d tree too.
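
A minimal sketch of that switch, assuming a metric that k-d trees support (they only accept a fixed list of metrics such as euclidean or chebyshev; a Python callable like cosine_distances rules them out):

from sklearn.neighbors import NearestNeighbors

# Same fit, backed by a k-d tree; 'euclidean' stands in for whichever
# kd-tree-compatible metric fits your data, and n_neighbors=10 is arbitrary
nbrs_kd = NearestNeighbors(n_neighbors=10, algorithm='kd_tree',
                           metric='euclidean').fit(data)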
I will approach your question from a theoretical point of view. The radius parameter is set to 1.0 by default. This might be too small for your data set since, if I understand correctly, it specifies the radius around the query within which to look. So I would suggest running your code with increasing values of this parameter until you get some results; then decrease it until you get no results, increase it a bit again, and you have the optimal parameter for your data set.
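
A minimal sketch of that probing, assuming data is your matrix and query stands for one sample; the candidate radii are made-up values, and radius_neighbors is the query type the radius parameter governs:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(algorithm='ball_tree').fit(data)
for radius in (0.25, 0.5, 1.0, 2.0, 4.0):
    # count how many neighbours fall inside each candidate radius
    distances, indices = nbrs.radius_neighbors(query.reshape(1, -1), radius=radius)
    print(radius, len(indices[0]))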
An important parameter is leaf_size, which actually affects how many points you end up checking when a query arrives. A small value for this parameter might yield faster execution with less precision.

You might also want to check this answer of mine, which explains the trade-off between speed and accuracy, a trade-off that is critical to understand when doing nearest neighbour search.
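
To illustrate the leaf_size experiment, a rough timing sketch; the leaf sizes, n_neighbors=10, and the 100-query batch are arbitrary placeholders:

import time
from sklearn.neighbors import NearestNeighbors

for leaf_size in (10, 30, 100):
    nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree',
                            leaf_size=leaf_size).fit(data)
    start = time.time()
    nbrs.kneighbors(data[:100])  # time a small batch of queries
    print(leaf_size, time.time() - start)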