I am using OpenMP with target offload and found that my application segfaults when OpenMP is limited to a single thread. I boiled it down to the following snippet:
#include <omp.h>

int main(){
    int res = 0;
    #pragma omp parallel num_threads(1)
    {
        #pragma omp single
        {
            #pragma omp taskgroup task_reduction(+:res)
            {
                #pragma omp target in_reduction(+:res) nowait
                {
                    res++;
                }
            }
        }
    }
}
Compiled with
clang++ -fopenmp -fopenmp-targets=nvptx64 --offload-arch=sm_61 -O0 main.cpp
using clang 17.0.0 and CUDA 12.1, and run on an Ubuntu 22.04 machine with an i7-12700K and a GTX 1080 Ti.
This segfaults when num_threads is set to 1, but works fine with more than one thread (e.g. num_threads(2)) or when nowait is omitted from the target construct, so that the encountering thread waits for the target region to finish.
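For comparison, here is a minimal sketch of the non-crashing variant: the same construct nesting, still with num_threads(1), but without nowait on the target construct. The names run_without_nowait and check_without_nowait are made up for illustration. The expected value of 1 assumes that in_reduction on target is handled correctly by the compiler, which may not hold for every toolchain. When built without -fopenmp the pragmas are ignored and the code simply runs serially.

```cpp
#include <cassert>

// Hypothetical helper: same nesting as the reproducer, minus `nowait`
// on the target construct, so the target region is not a deferred task.
int run_without_nowait() {
    int res = 0;
    #pragma omp parallel num_threads(1)
    {
        #pragma omp single
        {
            #pragma omp taskgroup task_reduction(+:res)
            {
                // No `nowait`: the encountering thread waits here until
                // the target region has completed.
                #pragma omp target in_reduction(+:res)
                {
                    res++;
                }
            }
        }
    }
    return res;
}

// Hypothetical check: the single increment should be visible after the
// taskgroup ends (assumes in_reduction on target works as specified).
void check_without_nowait() {
    assert(run_without_nowait() == 1);
}
```

Calling check_without_nowait() from main and compiling with the same command line as above gives a quick way to confirm that only the nowait version is affected.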
From my understanding, this should work just fine even with a single thread.
According to mailing list posts and GitHub issues, support for in_reduction in clang is incomplete or missing as of this writing.