NVBLAS with Intel Fortran compilers


I seem to be missing something when attempting to use NVBLAS with the Intel Fortran compilers.

I appear to be linking and using nvblas.conf correctly as I see feedback from the initialization of NVBLAS at runtime. However, NVBLAS does not seem to be intercepting the calls to DGEMM as only the CPU implementation is executed. This is despite using:

NVBLAS_CPU_RATIO_CGEMM 0.0 

in nvblas.conf (or removing it entirely).
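For reference, a minimal nvblas.conf aimed at DGEMM can be sketched as below. This is an illustration based on my reading of the NVBLAS documentation, not a verified fix: the ratio keyword is suffixed with the routine it applies to, so a CGEMM ratio would not be expected to govern DGEMM, and the documented default ratio is zero in any case. The library path is the one from this question; the log file name is an arbitrary choice.

```shell
# Write a minimal nvblas.conf into the current directory.
# Keywords are taken from the NVBLAS documentation; values are illustrative.
cat > nvblas.conf <<'EOF'
# CPU BLAS used for everything NVBLAS does not offload (mandatory)
NVBLAS_CPU_BLAS_LIB  /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so
# Use all visible GPUs
NVBLAS_GPU_LIST ALL
# Fraction of each DGEMM kept on the CPU; note the D prefix
NVBLAS_CPU_RATIO_DGEMM 0.0
# Log file name (arbitrary, chosen for this sketch)
NVBLAS_LOGFILE nvblas.log
EOF

# Show the ratio line that was written.
grep '^NVBLAS_CPU_RATIO' nvblas.conf
```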

If I disable access to the CPU BLAS implementation by removing:

NVBLAS_CPU_BLAS_LIB  /ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs/libmkl_rt.so

the program crashes at runtime, as I would expect.

The compiler options I am currently using are shown below. I have also tried linking MKL manually, with the same results.

# Compiler options
FFLAGS=-O3 -axAVX,SSE4.2 -msse3 -align array32byte -fpe1 -fno-alias -openmp -mkl=parallel -heap-arrays 32

# Linker options
LDFLAGS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas

# List of libraries used
LIBS= -L/ccc/home/wilkinson/EMPIRE-2064/src/dynamiclibs -lnvblas
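One thing worth ruling out with options like these is link order. The NVBLAS documentation says libnvblas has to come before the CPU BLAS on the link line so that its symbols are found first. I have not verified where ifort places the libraries pulled in by -mkl=parallel, so the following is only a sketch of how to check: the helper name nvblas_first is hypothetical, and it simply compares the position of the libraries in a list of NEEDED entries read from stdin.

```shell
# Report whether libnvblas precedes the CPU BLAS in a list of NEEDED
# entries read from stdin (e.g. from `readelf -d <binary> | grep NEEDED`).
nvblas_first() {
    awk '
        /libnvblas/                  { if (!nv)  nv  = NR }
        /libmkl|libopenblas|libblas/ { if (!cpu) cpu = NR }
        END {
            if (!nv)                  print "nvblas not linked"
            else if (cpu && cpu < nv) print "CPU BLAS before nvblas"
            else                      print "nvblas first"
        }'
}

# Demo on a hand-written list shaped like readelf output (not a real binary).
printf '%s\n' \
    ' (NEEDED) Shared library: [libnvblas.so.7.5]' \
    ' (NEEDED) Shared library: [libmkl_rt.so]' | nvblas_first
```

On a real build this would be fed from readelf -d on the executable. If the CPU BLAS is linked statically it will not appear in the list at all, and calls bound inside the executable cannot be intercepted. The NVBLAS documentation also describes preloading the library with LD_PRELOAD as an alternative to relinking.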

An example of a call to DGEMM is as follows:

call dgemm('N','T',nCols2,nCols1,nOcc(s),2.0d0/dble(nSpins),C2,nRowsP,C(:,:,s),nRowsP,0.0d0,P(i21,i11,s),nOrbsP)

Unfortunately, I am currently limited to the Intel compilers, but that restriction will be lifted shortly (at which point I will use CUDA Fortran to optimize data movement).

Answer by talonmies:

I am not sure what is going on here. Take a very simple DGEMM example (cribbed directly from the MKL Fortran guide):

      PROGRAM   MAIN

      IMPLICIT NONE

      DOUBLE PRECISION ALPHA, BETA
      INTEGER          M, K, N, I, J
      PARAMETER        (M=8000, K=8000, N=8000)
      DOUBLE PRECISION A(M,K), B(K,N), C(M,N)


      PRINT *, "Initializing data for matrix multiplication C=A*B for "
      PRINT 10, " matrix A(",M," x",K, ") and matrix B(", K," x", N, ")"
10    FORMAT(a,I5,a,I5,a,I5,a,I5,a)
      PRINT *, ""
      ALPHA = 1.0 
      BETA = 0.0

      PRINT *, "Intializing matrix data"
      PRINT *, ""
      DO I = 1, M
        DO J = 1, K
          A(I,J) = (I-1) * K + J
        END DO
      END DO

      DO I = 1, K
        DO J = 1, N
          B(I,J) = -((I-1) * N + J)
        END DO
      END DO

      DO I = 1, M
        DO J = 1, N
          C(I,J) = 0.0
        END DO
      END DO

      PRINT *, "Computing matrix product using DGEMM subroutine"
      CALL DGEMM('N','N',M,N,K,ALPHA,A,M,B,K,BETA,C,M)
      PRINT *, "Computations completed."
      PRINT *, ""

      PRINT *, "Top left corner of matrix A:"
      PRINT 20, ((A(I,J), J = 1,MIN(K,6)), I = 1,MIN(M,6))
      PRINT *, ""

      PRINT *, "Top left corner of matrix B:"
      PRINT 20, ((B(I,J),J = 1,MIN(N,6)), I = 1,MIN(K,6))
      PRINT *, ""

 20   FORMAT(6(F12.0,1x))

      PRINT *, "Top left corner of matrix C:"
      PRINT 30, ((C(I,J), J = 1,MIN(N,6)), I = 1,MIN(M,6))
      PRINT *, ""

 30   FORMAT(6(ES12.4,1x))

      PRINT *, "Example completed."
      STOP 

      END

I build the code with the Intel compiler (12.1) and run it under nvprof. Note that I don't have access to MKL at the moment, so I am using OpenBLAS built with ifort:

$ ifort -o nvblas_test nvblas_test.f -L/opt/cuda-7.5/lib64 -lnvblas
$ echo -e "NVBLAS_CPU_BLAS_LIB  /opt/openblas/lib/libopenblas.so\nNVBLAS_AUTOPIN_MEM_ENABLED\n" > nvblas.conf

$ nvprof --print-gpu-summary ./nvblas_test
==23978== NVPROF is profiling process 23978, command: ./nvblas_test
[NVBLAS] Config parsed
 Initializing data for matrix multiplication C=A*B for 
 matrix A( 8000 x 8000) and matrix B( 8000 x 8000)

 Intializing matrix data

 Computing matrix product using DGEMM subroutine
 Computations completed.

 Top left corner of matrix A:
          1.           2.           3.           4.           5.           6.
       8001.        8002.        8003.        8004.        8005.        8006.
      16001.       16002.       16003.       16004.       16005.       16006.
      24001.       24002.       24003.       24004.       24005.       24006.
      32001.       32002.       32003.       32004.       32005.       32006.
      40001.       40002.       40003.       40004.       40005.       40006.

 Top left corner of matrix B:
         -1.          -2.          -3.          -4.          -5.          -6.
      -8001.       -8002.       -8003.       -8004.       -8005.       -8006.
     -16001.      -16002.      -16003.      -16004.      -16005.      -16006.
     -24001.      -24002.      -24003.      -24004.      -24005.      -24006.
     -32001.      -32002.      -32003.      -32004.      -32005.      -32006.
     -40001.      -40002.      -40003.      -40004.      -40005.      -40006.

 Top left corner of matrix C:
 -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15
 -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15
 -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15
 -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15
 -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15
 -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16

 Example completed.
==23978== Profiling application: ./nvblas_test
==23978== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 92.15%  8.56855s       512  16.736ms  9.6488ms  21.520ms  void magma_lds128_dgemm_kernel<bool=0, bool=0, int=5, int=5, int=3, int=3, int=3>(int, int, int, double const *, int, double const *, int, double*, int, int, int, double const *, double const *, double, double, int)
  7.38%  685.77ms      1025  669.04us     896ns  820.55us  [CUDA memcpy HtoD]
  0.47%  44.017ms        64  687.77us  504.56us  763.05us  [CUDA memcpy DtoH]

I get what I expect: the DGEMM call is offloaded to the GPU. When I then disable GPU execution of DGEMM:

$ echo "NVBLAS_GPU_DISABLED_DGEMM" >> nvblas.conf 
$ nvprof --print-gpu-summary ./nvblas_test
==23991== NVPROF is profiling process 23991, command: ./nvblas_test
[NVBLAS] Config parsed
 Initializing data for matrix multiplication C=A*B for 
 matrix A( 8000 x 8000) and matrix B( 8000 x 8000)

 Intializing matrix data

 Computing matrix product using DGEMM subroutine
 Computations completed.

 Top left corner of matrix A:
          1.           2.           3.           4.           5.           6.
       8001.        8002.        8003.        8004.        8005.        8006.
      16001.       16002.       16003.       16004.       16005.       16006.
      24001.       24002.       24003.       24004.       24005.       24006.
      32001.       32002.       32003.       32004.       32005.       32006.
      40001.       40002.       40003.       40004.       40005.       40006.

 Top left corner of matrix B:
         -1.          -2.          -3.          -4.          -5.          -6.
      -8001.       -8002.       -8003.       -8004.       -8005.       -8006.
     -16001.      -16002.      -16003.      -16004.      -16005.      -16006.
     -24001.      -24002.      -24003.      -24004.      -24005.      -24006.
     -32001.      -32002.      -32003.      -32004.      -32005.      -32006.
     -40001.      -40002.      -40003.      -40004.      -40005.      -40006.

 Top left corner of matrix C:
 -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15  -1.3653E+15
 -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15  -3.4131E+15
 -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15  -5.4608E+15
 -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15  -7.5086E+15
 -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15  -9.5563E+15
 -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16  -1.1604E+16

 Example completed.
==23991== Profiling application: ./nvblas_test
==23991== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%     768ns         1     768ns     768ns     768ns  [CUDA memcpy HtoD]

I get no offload to the GPU. If you can't reproduce this, then the problem probably lies with your compiler version (you haven't said which one you are using). If you can, then perhaps the somewhat fancier build options you are using are interacting with NVBLAS in an unexpected way.
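The comparison of the two profiles can be scripted. The sketch below uses a hypothetical helper, gemm_offloaded, that looks for a GEMM kernel in the rows of an nvprof GPU summary. Kernel names differ between CUDA versions, so the match is deliberately loose, and it only accepts rows that begin with a percentage so that the program's own "DGEMM subroutine" message is not mistaken for a kernel. The sample rows are abbreviated from the output above.

```shell
# Succeed if an nvprof GPU summary read from stdin has a GEMM kernel row.
# Only rows that start with a percentage are considered.
gemm_offloaded() {
    grep -Eqi '^ *[0-9.]+%.*gemm'
}

# Two summary rows shaped like the profiles above (kernel signature abbreviated).
with_gpu=' 92.15%  8.56855s  512  16.736ms  9.6488ms  21.520ms  void magma_lds128_dgemm_kernel<...>(...)'
cpu_only='100.00%  768ns  1  768ns  768ns  768ns  [CUDA memcpy HtoD]'

for summary in "$with_gpu" "$cpu_only"; do
    if printf '%s\n' "$summary" | gemm_offloaded; then
        echo "GEMM kernel found: offloaded to the GPU"
    else
        echo "no GEMM kernel found: CPU only"
    fi
done
```

Against a real run, the summary would be piped in with something like nvprof --print-gpu-summary ./nvblas_test 2>&1 | gemm_offloaded, assuming nvprof writes its report to stderr as it does by default in the versions I have used.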