I seem to be missing something when attempting to use NVBLAS with the Intel Fortran compilers.
I appear to be linking and using nvblas.conf correctly as I see feedback from the initialization of NVBLAS at runtime. However, NVBLAS does not seem to be intercepting the calls to DGEMM as only the CPU implementation is executed. This is despite using:
NVBLAS_CPU_RATIO_CGEMM 0.0
in nvblas.conf (or removing it entirely).
If I disable access to the CPU BLAS implementation by removing:
NVBLAS_CPU_BLAS_LIB /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so
the program crashes at runtime, as I would expect.
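For reference, a minimal nvblas.conf for this setup might look like the sketch below. The key names are those documented for NVBLAS and the CPU BLAS path is the one quoted above; the log file name and the GPU list value are assumptions chosen for illustration. Note that the CPU ratio keys are per routine, so double-precision DGEMM calls would be governed by NVBLAS_CPU_RATIO_DGEMM, while NVBLAS_CPU_RATIO_CGEMM applies only to the single-precision complex routine.

```
# nvblas.conf (sketch; the log file name and GPU list are assumed values)
NVBLAS_LOGFILE nvblas.log
# CPU BLAS fallback, path as quoted in the post
NVBLAS_CPU_BLAS_LIB /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so
# Use every GPU NVBLAS can see (assumption; a list of device ids also works)
NVBLAS_GPU_LIST ALL
# Ratio keys are per routine: DGEMM has its own key, separate from CGEMM
NVBLAS_CPU_RATIO_DGEMM 0.0
```

A ratio of 0.0 asks NVBLAS to send all of that routine's work to the GPU, so with this key in place the log file is a quick way to check whether the DGEMM calls are reaching NVBLAS at all.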
The compiler options I am currently using are shown below. I have also tried linking MKL manually, with the same results.
# Compiler options
FFLAGS=-O3 -axAVX,SSE4.2 -msse3 -align array32byte -fpe1 -fno-alias -openmp -mkl=parallel -heap-arrays 32
# Linker options
LDFLAGS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas
# List of libraries used
LIBS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas
An example of a call to DGEMM is as follows:
call dgemm('N','T',nCols2,nCols1,nOcc(s),2.0d0/dble(nSpins),C2,nRowsP,C(:,:,s),nRowsP,0.0d0,P(i21,i11,s),nOrbsP)
Unfortunately, I am currently limited to using the Intel compilers, but that restriction will be lifted shortly (at which point I will use CUDA Fortran to optimize data movement).
I am not sure what is going on here. If I take a very simple DGEMM example (cribbed directly from the MKL Fortran guide):
If I build the code with the Intel compiler (12.1) and run it under nvprof (note that I don't have access to MKL at the moment, so I am using OpenBLAS built with ifort):
I get what I expect: offload of the DGEMM call to the GPU. When I do this:
I get no offload to the GPU. If you can't reproduce this, then the problem may lie with your compiler version (you haven't said which one you are using). If you can, then perhaps the somewhat fancier build options you are using are interacting with NVBLAS in an unexpected way.