Why is my code crashing with "Illegal address during kernel execution" when I use subroutines to copy or deallocate members of derived type variables?

68 views Asked by At

I'm porting a big Fortran code to OpenACC and I'm struggling with some (rather simple) operations. I tried to summarize the problem in the test example that can be found below. I do know that the code is not optimal (it is splitting the array along the 2nd dimension, making the data non-contiguous), but this is not the point. I'd like to use this rather simple example to better understand OpenACC procedures. Here is the code.

First a module, where I define a structure with some allocatable data (in the form of pointers) and the dimensions, plus the subroutines I use to copy the data and to get the indices for the chunking of the data

module test
implicit none
!
type test_type
  integer                            :: n1,n2,n3
  integer, pointer, dimension(:,:,:) :: dat
end type test_type
!
contains
  subroutine copydat(datain,dataout,indi,inde)
    !$ACC ROUTINE VECTOR
    implicit none
    integer, intent(in)          :: indi, inde
    type(test_type), intent(in)  :: datain
    type(test_type), intent(out) :: dataout
    !
    dataout%dat(:,indi:inde,:) = datain%dat(:,indi:inde,:)
  end subroutine copydat
  !
  subroutine get_indices(n2,i_chunk,n_chunks,indi,inde)
    !$ACC ROUTINE SEQ
    implicit none
    integer, intent(in)  :: n2, i_chunk, n_chunks
    integer, intent(out) :: indi, inde
    !
    indi = 1 + (i_chunk - 1)*n2/n_chunks
    inde = i_chunk*n2/n_chunks
  end subroutine get_indices
end module test

Second, the main program, where we read the dimensions of data from a file ('num.txt'), we initialize data1 with some data, we copy the data from data1 to data2 on the device, we update the data in data2 on the host, and finally we check the results.

program main
use openacc
use test
implicit none
integer         :: n_chunks, i_chunk, n1, n2, n3, i1, i2, i3, indi, inde
type(test_type) :: data1, data2

open(10,file='num.txt')
read(10,*) n1
read(10,*) n2
read(10,*) n3
read(10,*) n_chunks
close(10)

write(*,'(4(A,I3),A)') 'Data of size ', n1 , ', ', n2, ', ', n3, ', splitted in ', n_chunks, ' chunks along dim2'

data1%n1 = n1
data1%n2 = n2
data1%n3 = n3
data2%n1 = n1
data2%n2 = n2
data2%n3 = n3
allocate(data1%dat(n1,n2,n3),data2%dat(n1,n2,n3))
do i3 = 1,n3
  do i2 = 1,n2
    do i1 = 1,n1
      data1%dat(i1,i2,i3) = i1 + (i2-1)*n1 + (i3-1)*n1*n2
    end do
  end do
end do

!$ACC ENTER DATA COPYIN(data1,data2)
!$ACC ENTER DATA CREATE(data1%dat,data2%dat)
!$ACC UPDATE DEVICE(data1%dat)

!$ACC PARALLEL DEFAULT(present) PRIVATE(indi,inde)
!$ACC LOOP GANG INDEPENDENT PRIVATE(indi,inde)
do i_chunk=1,n_chunks
  ! --- Current chunk indices ---
  call get_indices(n2,i_chunk,n_chunks,indi,inde)
  print *, "#", __pgi_gangidx(), ": ", indi, inde
  data2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:)
  !call copydat(data1,data2,indi,inde)
end do
!$ACC END LOOP
!$ACC END PARALLEL

!$ACC UPDATE SELF(data2%dat)
!$ACC EXIT DATA DELETE(data1%dat,data2%dat)
!$ACC EXIT DATA DELETE(data1,data2)

print *, maxval(abs(data1%dat-data2%dat))

deallocate(data1%dat,data2%dat)

end program main

The idea is that we want to split the data in the array along the 2nd dimension in n_chunks parts. Each gang should copy the data from data1 to data2. For testing I'm using in num.txt: 128 256 32 64

This means that each gang should copy a block of 128 x 4 x 32 from data1 to data2. The code is compiled on a Tesla P100 with nvfortran -acc -ta=tesla:cc60 -fast -Minfo=accel -o main.exe test.F90 Here are my questions:

  1. First, why if I remove the condition PRIVATE(indi,inde) from the directive !$ACC PARALLEL DEFAULT(present) I get the following runtime error?

Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I thought it was enough to declare the private variables on the loop directive, but in this example it seems like either I declare the two scalars as private on the parallel region, or I don't declare them as private at all (is the latter a problem, or since they are declared as intent(in) in get_indices, they are kept private?).

  1. If I use data2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:) in the loop, I get the correct result and everything works fine. The loop is scheduled with 64 gangs and vectorized with a vector_length of 128. On the other hand, if I comment out the line data2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:) and activate the command call copydat(data1,data2,indi,inde), the code tries to vectorize the operation with a vector_length of 32 and crashes with the error

Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

If I decrease n2 from 256 to 128, the code is still vectorized with a length of 32, but at least it doesn't crash. I don't understand this behaviour. Can someone explain?

EDIT:
Concerning question 1., I tried 4 different combinations of the private clause:
Try A:

!$ACC PARALLEL DEFAULT(present)
!$ACC LOOP GANG INDEPENDENT

Try B:

!$ACC PARALLEL DEFAULT(present) PRIVATE(indi,inde)
!$ACC LOOP GANG INDEPENDENT PRIVATE(indi,inde)

Try C:

!$ACC PARALLEL DEFAULT(present) PRIVATE(indi,inde)
!$ACC LOOP GANG INDEPENDENT

Try D:

!$ACC PARALLEL DEFAULT(present)
!$ACC LOOP GANG INDEPENDENT PRIVATE(indi,inde)

Try A, B, and C all work correctly, while Try D crashes with the error
Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
I'm a bit confused, since I was expecting "Try D" to force indi and inde private among gangs, but shared among workers and threads, whereas I was expecting the other three cases to have indi and inde private for gans and also for threads. What am I missing here? What is the behaviour of indi and inde (private/shared) for the 4 different cases outlined above?

Concerning question 2., I'd like to report also an additional strange behaviour: if instead of having deallocate(data1%dat,data2%dat) in program, I introduce the subroutine

subroutine deldat(data0)
  implicit none
  type(test_type), intent(inout)  :: data0
  !
  deallocate(data0%dat)
end subroutine deldat

in the module test, and I call it in program with

call deldat(data1)
call deldat(data2)

(i.e., I replace the explicit deallocate with the subroutine calls), the code also crashes with the error
Failing in Thread:1 call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
One thing that I noticed is that when I compile the code with deallocate(data1%dat,data2%dat), the compiler gives
87, Loop is parallelizable
Loop carried reuse of data2%dat$p prevents parallelization

One the other hand, if I use the routine deldat to deallocate the arrays, the compiler writes
87, Loop is parallelizable
Loop carried reuse of dat$f prevents parallelization
Loop is parallelizable
Loop carried reuse of data2%dat$p prevents parallelization

that is, it adds the two lines
Loop is parallelizable
Loop carried reuse of dat$f prevents parallelization
to the message. I don't know if this is a bug of the compiler (the fact that the code crashes when I deallocate within the subroutine), or if it is something to be expected.

In the code I'm porting to GPU, the derived datatypes have a lot of different fields (not just one as in the example above). Therefore, we created three subroutines to allocate, copy, and deallocate the members of the derived datatypes. Unfortunately, at present I am unable to use these routines with OpenACC directives because of these problems.

1

There are 1 answers

5
Mat Colgrove On

First, why if I remove the condition PRIVATE(indi,inde) from the directive !$ACC PARALLEL DEFAULT(present) I get the following runtime error?

By default, scalars are thread (vector) private, but when calling a seq routine from a gang region, only one thread executes the routine. Hence the other thread's "indi" and "inde" variables are uninitialized.

By putting the "private" on a gang loop, the scalars are gang private but shared by the vectors, hence all threads have initialized values.

if I comment out the line data2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:) and activate the command call copydat(data1,data2,indi,inde), the code tries to vectorize the operation with a vector_length of 32 and crashes with the error

This looks like a compiler issue to me so I went ahead and filed a problem report, TPR #35448, and sent it to engineering for investigation.

I know we've had other issues with derived types in the past (they're tricky to get correct).

The work arounds would be to inline the routine via the "-Minline" flag, or pass in the "dat" arrays directly rather than through the derived type.

If I decrease n2 from 256 to 128, the code is still vectorized with a length of 32, but at least it doesn't crash. I don't understand this behaviour. Can someone explain?

I'll need to wait for engineering to understand the root cause, though for the vector length 32, that's expected. We lower the vector length to 32 when there's vector routines in order to support reductions in vector routines as well as to remove the need to synchronize threads. Given the kernel will use 150 registers for calls, the occupancy already gets lowered, so the additional threads don't really help or at least offset by the additional thread syncs.

If possible, you'll want to try inlining routines to avoid this overhead.

EDIT:

Adding more info since Fabio is still having issues while the code works for me. One possible difference is that I put the test module in the same source as the main program. In this case, the compiler is able to inline "copydat". If split into separate sources, it doesn't appear to get inlined.

To fix this, I changed the array syntax to use explicit loops. We may want to do this anyway so we can apply the "loop" directive to all dimensions. As seen in the compiler feedback, only the middle dimension is getting implicitly parallelized with the array syntax.

Also with array syntax, per the Fortran standard, the right-hand side must be fully evaluated before assignment, which means the compiler must create temp arrays to hold the intermediary results. While the compiler can often optimize away the temp arrays, it may not be in this case. Hence it's possible the issues he's seeing are with a heap overflow due to the allocation of the temp array. The device heap is quite small by default plus allocation can hurt performance.

Here's my updated version.

test_mod.F90:

module test
implicit none
!
type test_type
  integer                            :: n1,n2,n3
  integer, pointer, dimension(:,:,:) :: dat
end type test_type
!
contains
  subroutine copydat(datain,dataout,indi,inde,n1,n3)
    !$ACC ROUTINE VECTOR
    implicit none
    integer, intent(in)          :: indi, inde, n1, n3
    type(test_type), intent(in)  :: datain
    type(test_type), intent(out) :: dataout
    integer :: i,j,k
    !
    !$acc loop vector collapse(3)
    do i = 1, n1
      do j = indi,inde
        do k = 1,n3
          dataout%dat(i,j,k) = datain%dat(i,j,k)
        enddo
      enddo
    enddo
    !dataout%dat(:,indi:inde,:) = datain%dat(:,indi:inde,:)
  end subroutine copydat
  !
  subroutine get_indices(n2,i_chunk,n_chunks,indi,inde)
    !$ACC ROUTINE SEQ
    implicit none
    integer, intent(in)  :: n2, i_chunk, n_chunks
    integer, intent(out) :: indi, inde
    !
    indi = 1 + (i_chunk - 1)*n2/n_chunks
    inde = i_chunk*n2/n_chunks
  end subroutine get_indices
end module test

test.F90:

program main
use openacc
use test
implicit none
integer         :: n_chunks, i_chunk, n1, n2, n3, i1, i2, i3, indi, inde
type(test_type) :: data1, data2

open(10,file='num.txt')
read(10,*) n1
read(10,*) n2
read(10,*) n3
read(10,*) n_chunks
close(10)

write(*,'(4(A,I3),A)') 'Data of size ', n1 , ', ', n2, ', ', n3, ', splitted in ', n_chunks, ' chunks along dim2'

data1%n1 = n1
data1%n2 = n2
data1%n3 = n3
data2%n1 = n1
data2%n2 = n2
data2%n3 = n3
allocate(data1%dat(n1,n2,n3),data2%dat(n1,n2,n3))
do i3 = 1,n3
  do i2 = 1,n2
    do i1 = 1,n1
      data1%dat(i1,i2,i3) = i1 + (i2-1)*n1 + (i3-1)*n1*n2
    end do
  end do
end do

!$ACC ENTER DATA COPYIN(data1,data2)
!$ACC ENTER DATA CREATE(data1%dat,data2%dat)
!$ACC UPDATE DEVICE(data1%dat)

!$ACC PARALLEL DEFAULT(present)
!$ACC LOOP GANG INDEPENDENT PRIVATE(indi,inde)
do i_chunk=1,n_chunks
  ! --- Current chunk indices ---
  call get_indices(n2,i_chunk,n_chunks,indi,inde)
  print *, "#", __pgi_gangidx(), ": ", indi, inde
  !data2%dat(:,indi:inde,:) = data1%dat(:,indi:inde,:)
  call copydat(data1,data2,indi,inde,n1,n3)
end do
!$ACC END LOOP
!$ACC END PARALLEL

!$ACC UPDATE SELF(data2%dat)
!$ACC EXIT DATA DELETE(data1%dat,data2%dat)
!$ACC EXIT DATA DELETE(data1,data2)

print *, maxval(abs(data1%dat-data2%dat))

deallocate(data1%dat,data2%dat)

end program main

Compilation:

Note that I'm using NVHPC 24.3, but tested with earlier versions of the compiler as well. Also, this is on a Pascal device to match what Fabio's using, but I tested on Ampere and Volta as well.

% nvfortran -V24.3 -acc -Minline -Minfo=inline,accel test_mod.F90 test.F90
test_mod.F90:
test.F90:
test_mod.F90:
copydat:
     10, Generating NVIDIA GPU code
         19, !$acc loop vector collapse(3) ! threadidx%x
         20,   ! threadidx%x collapsed
         21,   ! threadidx%x collapsed
     19, Loop is parallelizable
     20, Loop is parallelizable
     21, Loop is parallelizable
get_indices:
     29, Generating acc routine seq
         Generating NVIDIA GPU code
test.F90:
main:
     32, Generating enter data copyin(data1,data2)
     33, Generating enter data create(data2%dat(:,:,:),data1%dat(:,:,:))
     34, Generating update device(data1%dat(:,:,:))
     36, Generating implicit firstprivate(n_chunks)
         Generating NVIDIA GPU code
         38, !$acc loop gang ! blockidx%x
         43,   ! threadidx%x collapsed
             !$acc loop vector(128) collapse(3) ! threadidx%x
     36, Generating default present(data1,data1%dat(:,:,:),data2%dat(:,:,:),data2)
     38, Generating implicit firstprivate(n1,n3,n2)
     40, get_indices inlined, size=4, file test_mod.F90 (29)
     43, copydat inlined, size=12, file test_mod.F90 (10)
          43, Loop is parallelizable
     48, Generating update self(data2%dat(:,:,:))
     49, Generating exit data delete(data1%dat(:,:,:),data2%dat(:,:,:))
     50, Generating exit data delete(data2,data1)

If using separate compilation, then we need to perform a two pass compilation. The first pass to collect the inline information into an inline library, then compile using this info in the second pass.

% nvfortran -V24.3 -acc -Mextract=lib:inlib test_mod.F90
% nvfortran -V24.3 -acc -Mextract=lib:inlib test.F90
% nvfortran -V24.3 -c -acc -Minline=lib:inlib -Minfo=inline,accel test_mod.F90
copydat:
     10, Generating NVIDIA GPU code
         19, !$acc loop vector collapse(3) ! threadidx%x
         20,   ! threadidx%x collapsed
         21,   ! threadidx%x collapsed
     19, Loop is parallelizable
     20, Loop is parallelizable
     21, Loop is parallelizable
get_indices:
     29, Generating acc routine seq
         Generating NVIDIA GPU code
% nvfortran -V24.3 -c -acc -Minline=lib:inlib -Minfo=inline,accel test.F90
main:
     32, Generating enter data copyin(data1,data2)
     33, Generating enter data create(data2%dat(:,:,:),data1%dat(:,:,:))
     34, Generating update device(data1%dat(:,:,:))
     36, Generating implicit firstprivate(n_chunks)
         Generating NVIDIA GPU code
         38, !$acc loop gang ! blockidx%x
         43,   ! threadidx%x collapsed
             !$acc loop vector(128) collapse(3) ! threadidx%x
     36, Generating default present(data1,data1%dat(:,:,:),data2%dat(:,:,:),data2)
     38, Generating implicit firstprivate(n1,n3,n2)
     40, get_indices inlined, size=4, file test_mod.F90 (29)
     43, copydat inlined, size=12, file test_mod.F90 (10)
          43, Loop is parallelizable
     48, Generating update self(data2%dat(:,:,:))
     49, Generating exit data delete(data1%dat(:,:,:),data2%dat(:,:,:))
     50, Generating exit data delete(data2,data1)
% nvfortran -V24.3 -acc test_mod.o test.o
% a.out
Data of size 128, 256,  32, splitted in  64 chunks along dim2
 #           55 :           221          224
 #           15 :            61           64
 #           21 :            85           88
 #           27 :           109          112
 #           43 :           173          176
... cut due to length ...
 #           63 :           253          256
            0