I have a matrix B, a symmetric banded matrix stored as a 1D array, that has to be added to a general matrix A. To illustrate the example let ccc=3, which would imply B is a pentadiagonal matrix, thus the (i,j) elements would be stored as follows:
matrix_B(i,j) [(1,1), (1,2), (1,3), (2,2), (2,3), (2,4), (3,3), ...]
array_B(idx) [ 1, 2, 3, 4, 5, 6, 7, ...]
see here how there is no need to store (2,1), (3,1) and (3,2) since (i,j) = (j,i).
I have the following code, which is an oversimplified version of the actual working one:
subroutine iter(n,ccc,matrix_A,array_B,const)
implicit none
integer,intent(in)::n,ccc
complex*16,intent(inout)::matrix_A(n,n)
complex*16,intent(in)::array_B(n*ccc)
complex*16,intent(in)::const
integer::i,j,idx
do i=1,n
do j=max(i-ccc+1,1),min(i+ccc-1,n) ! elements outside this range are 0 and thus not stored
matrix_A(i,j)=matrix_A(i,j)+array_B(idx(i,j,ccc))*const
end do
end do
end subroutine iter
function idx(i,j,ccc)
integer idx,i,j,ccc
! if(abs(i-j).gt.(ccc-1)) then
! write(*,*) 'wrong values i,j',i,j
! end if
if(i.le.j) then
idx=(i-1)*(ccc)+1+j-i
else
idx=(j-1)*(ccc)+1+i-j
endif
return
end
The problem is, this is a rather slow routine, and it is the bottleneck of a large program. Besides looping over the fast index i (is not so trivial in the real code), I thought about doing something like this:
do i=1,n
j=i
matrix_A(i,j)=matrix_A(i,j)+array_B((i-1)*(ccc)+1+j-i)*const
do j=i+1,min(i+ccc-1,n)
matrix_A(i,j)=matrix_A(i,j)+array_B((i-1)*(ccc)+1+j-i)*const
matrix_A(j,i)=matrix_A(j,i)+array_B((i-1)*(ccc)+1+j-i)*const
end do
end do
Here the function is inlined manually (now i<=j always), and the inner loop is halved, but this does not give significant increase in performance. Is there a way to optimize this further? More specifically, knowing the same value has to be added to matrix_A(i,j) and matrix_A(j,i) (which is not symmetric).
Thank you.
Trying to improve this solution. Using fortran95 and gfortran-8.
I propose 2 things:
I've tested this piece of code (your
iterroutine) and the results look OK, but I leave the benchmarking to you :)