scipy.sparse.coo_matrix.max returns the maximum value of each row or column, given an axis. I would like to know not the value, but the index of the maximum value of each row or column. I haven't found a way to make this in an efficient manner yet, so I'll gladly accept any help.
Argmax of each row or column in scipy sparse matrix
5.7k views Asked by Jimmy C AtThere are 6 answers
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                I would suggest studying the code for
moo._min_or_max_axis
where moo is a coo_matrix.
mat = mat.tocsc()  # for axis=0
mat.sum_duplicates()
major_index, value = mat._minor_reduce(min_or_max)
not_full = np.diff(mat.indptr)[major_index] < N
value[not_full] = min_or_max(value[not_full], 0)
mask = value != 0
major_index = np.compress(mask, major_index)
value = np.compress(mask, value)
return coo_matrix((value, (np.zeros(len(value)), major_index)),
                      dtype=self.dtype, shape=(1, M))
Depending on the axis it prefers to work with csc over csr.  I haven't had time analyze this, but I'm guessing it should be possible to include argmax in the calculation.
This suggestion may not work.  The key is the mat._minor_reduce method, which does, with some refinement:
ufunc.reduceat(mat.data, mat.indptr[:-1])
That is is applies the ufunc to blocks of the matrix data array, using the indptr to define the blocks.  np.sum, np.maxiumum are ufunc where this works.  I don't know of an equivalent argmax ufunc.
In general if you want to do things by 'row' for a csr matrix (or col of csc), you either have to iterate over the rows, which is relatively expensive, or use this ufunc.reduceat to do the same thing over the flat mat.data vector.  
group argmax/argmin over partitioning indices in numpy
tries to perform a argmax.reduceat.  The solution there might be adaptable to a sparse matrix.
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                If A is your scipy.sparse.coo_matrix, then you get the row and column of the maximum value  as follows:
I=A.data.argmax()
maxrow = A.row[I]
maxcol=A.col[I]
To get the index of maximum value on each row see the EDIT below:
from scipy.sparse import coo_matrix
import numpy as np
row  = np.array([0, 3, 1, 0])
col  = np.array([0, 2, 3, 2])
data = np.array([-3, 4, 11, -7])
A= coo_matrix((data, (row, col)), shape=(4, 4))
print A.toarray()
nrRows=A.shape[0]
maxrowind=[]
for i in range(nrRows):
    r = A.getrow(i)# r is 1xA.shape[1] matrix
    maxrowind.append( r.indices[r.data.argmax()] if r.nnz else 0)
print maxrowind 
r.nnz is the  the count of explicitly-stored values (i.e. nonzero values)
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                As others mention there is now built-in argmax() for scipy.sparse matrices. However, I found it to be quite slow for large matrices so I had a look at the source code. The logic is very smart, but it contains a python loop slowing things down. Taking the source code and reducing it to argmax per row for example (while sacrificing all generality, shape checking etc. for simplicity) and decorating it with numba can give some nice speed improvements.
Here's the function:
import numpy as np
from numba import jit
def argmax_row_numba(X):
    return _argmax_row_numba(X.shape[0], X.indptr, X.data, X.indices)
@jit(nopython=True)
def _argmax_row_numba(shape, indptr, data, indices):
    # prep an array to hold the indices
    ret = np.zeros(shape)
    # figure out which lines actually contain data
    nz_lines, = np.diff(indptr).nonzero()
    # loop through the lines
    for i in nz_lines:
        p, q = indptr[i: i + 2]
        line_data = data[p: q]
        line_indices = indices[p: q]
        am = np.argmax(line_data)
        ret[i] = line_indices[am]
    return ret
Generating a matrix for testing:
from scipy.sparse import random
size = 10000
m = random(m=size, n=size, density=0.0001, format="csr")
n_vals = m.data.shape[0]
m.data = np.random.random(size=n_vals).astype("float")
# the original scipy implementation reformatted to return a np.array
maxima1 = np.squeeze(np.array(m.argmax(axis=1)))
# calling the numba version
maxima2 = argmax_row_numba(m)
# Check that the results are the same
print(np.allclose(maxima1, maxima2))
# True
Timing results:
%timeit m.argmax(axis=1)
# 30.1 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit argmax_row_numba(m)
# 211 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
                        
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                The latest release of the numpy_indexed package (disclaimer: I am its author) can solve this problem in an efficient and elegant manner:
import numpy_indexed as npi
col, argmax = group_by(coo.col).argmax(coo.data)
row = coo.row[argmax]
Here we group by col, so its the argmax over the columns; swapping row and col will give you the argmax over the rows.
                        
                            
                        
                        
                            On
                            
                            
                                                    
                    
                Expanding on the answers from @hpaulj and @joeln and using code from group argmax/argmin over partitioning indices in numpy as suggested, this function will calculate argmax over columns for CSR or argmax over rows for CSC:
import numpy as np
import scipy.sparse as sp
def csr_csc_argmax(X, axis=None):
    is_csr = isinstance(X, sp.csr_matrix)
    is_csc = isinstance(X, sp.csc_matrix)
    assert( is_csr or is_csc )
    assert( not axis or (is_csr and axis==1) or (is_csc and axis==0) )
    major_size = X.shape[0 if is_csr else 1]
    major_lengths = np.diff(X.indptr) # group_lengths
    major_not_empty = (major_lengths > 0)
    result = -np.ones(shape=(major_size,), dtype=X.indices.dtype)
    split_at = X.indptr[:-1][major_not_empty]
    maxima = np.zeros((major_size,), dtype=X.dtype)
    maxima[major_not_empty] = np.maximum.reduceat(X.data, split_at)
    all_argmax = np.flatnonzero(np.repeat(maxima, major_lengths) == X.data)
    result[major_not_empty] = X.indices[all_argmax[np.searchsorted(all_argmax, split_at)]]
    return result
It returns -1 for the argmax of any rows (CSR) or columns (CSC) that are completely sparse (i.e., that are completely zero after X.eliminate_zeros()).
From scipy version 0.19, both
csr_matrixandcsc_matrixsupportargmax()andargmin()methods.