I have a square boolean matrix M of size N, stored by rows and I want to count the number of bits set to 1 for each column.
For instance for n=4:
1101
0101
0001
1001
M stored as { { 1,1,0,1}, {0,1,0,1}, {0,0,0,1}, {1,0,0,1} };
result = { 2, 2, 0, 4};
I can obviously
- transpose the matrix M into a matrix M'
- popcount each row of M'.
Good algorithms exist for matrix transposition and popcounting through bit manipulation.
My question is: would it be possible to "merge" such algorithms into a single one ?
Note that N could be quite large (say 1024 and more) regarding 64 bits architecture.

I made some benchmark between the two approaches:
I wrote a naive version and an AVX2 one for both approaches. I used some functions (found on stackoverflow or elsewhere) for the AVX2 "transpose+popcount" approach.
In my test, I make the assumption that the input is a
nbRowsx32matrix in a bits packed format (nbRows itself being a multiple of 32); the matrix is therefore stored as an array ofuint32_t.The code is the following:
Some remarks:
g++ -O3 -march=native ../Test.cpp -o ./Test -laelf64Now the results:
So it seems that the "row by row" in AVX2 is the best so far.
Note that when I saw this result (less than 2 cycles per row), I made no more effort to optimize the AVX2 "transpose+popcount" method, which should be feasable by computing several popcounts in parallel (I may test it later).