In SYCL, there are three types of memory: host memory, device memory, and Unified Shared Memory (USM). For host and device memory, data exchange requires explicit copying, while data movement to and from USM is managed implicitly by the SYCL runtime.
Unfortunately, while implementing GPU acceleration for a numerical kernel using SYCL, I found an up to 4000% decrease in performance just by switching from sycl::malloc_device() to sycl::malloc_shared() - even though all I do is repeatedly resubmit the same SYCL kernel, without any attempt to access data from the host.
When building the code with sycl::malloc_device() using OpenSYCL, targeting AMD HIP for gfx906 (Radeon VII / Instinct MI50), the program finishes in 0.27 seconds:
$ time ./fdtd.elf
simulate 16974593 cells for 10 timesteps.
real 0m0.271s
user 0m0.253s
sys 0m0.020s
When building the same code with sycl::malloc_shared(), the program takes 10.6 seconds to complete:
simulate 16974593 cells for 10 timesteps.
real 0m10.649s
user 0m15.172s
sys 0m0.196s
This is a 3925% slowdown.
After enabling "Above 4G Decoding" and "Re-sizable BAR" support in the BIOS, it takes 3.8 seconds instead of 10.6 seconds. But this doesn't fix the underlying problem of needless memory transfers - a 1300% performance hit is still significant.
I also tested a similar kernel using the Intel DPC++ compiler previously, and saw similar results on the same hardware.
I suspect that the slowdown is caused by needless host and device copying, but I'm not sure. What heuristics does a SYCL runtime use to determine whether copying is needed?
The sample code is attached below.
ArrayNXYZ.hpp: 4-dimensional array (n, x, y, z) wrapper class.
#include <sycl/sycl.hpp>
template <typename T>
struct ArrayXYZN
{
ArrayXYZN() {}
inline T& operator() (const unsigned int n, const unsigned int x, const unsigned int y, const unsigned int z) const
{
size_t offset = n * n_stride + x * x_stride + y * y_stride + z;
return array[offset];
}
unsigned long n_stride, x_stride, y_stride, size;
T *array;
};
template <typename T>
ArrayXYZN<T>* CreateArrayXYZN(sycl::queue Q, const unsigned int* numLines)
{
unsigned int n_max = 3;
unsigned int x_max = numLines[0];
unsigned int y_max = numLines[1];
unsigned int z_max = numLines[2];
unsigned long n_stride = x_max * y_max * z_max;
unsigned long x_stride = y_max * z_max;
unsigned long y_stride = z_max;
if (n_stride % 128 != 0)
{
n_stride += 128 - (n_stride % 128);
}
// allocate 1D linear buffer
size_t size = n_stride * n_max;
#ifdef USM
T *buf = sycl::malloc_shared<T>(size, Q);
#else
T *buf = sycl::malloc_device<T>(size, Q);
#endif
// zero memory
Q.submit([&](sycl::handler& h) {
h.memset(buf, 0, size * sizeof(T));
});
Q.wait();
// allocate wrapper class
ArrayXYZN<T>* array = new ArrayXYZN<T>();
array->n_stride = n_stride;
array->x_stride = x_stride;
array->y_stride = y_stride;
array->size = size * sizeof(T);
array->array = buf;
return array;
}
fdtd.cpp:
#include <sycl/sycl.hpp>
#include "ArrayNXYZ.hpp"
/*
* UpdateVoltages
*
* Using Finite Difference Time Domain (FDTD) method,
* calculate new electric field array "volt" based on
* magnetic field "curr" and two electromagnetic field
* operators "vv" and "vi", precalculated from the
* physical materials before starting up simulation.
*/
void UpdateVoltages(
const ArrayXYZN<float>& volt,
const ArrayXYZN<float>& curr,
const ArrayXYZN<float>& vv,
const ArrayXYZN<float>& vi,
int x, int y, int z
)
{
// note: each (x, y, z) cell has three polarizations
// x, y, z, these are different from the cell's
// coordinates (x, y, z)
//for x polarization
float volt0 = volt(0, x, y, z);
volt0 *= vv(0, x, y, z);
volt0 +=
vi(0, x, y, z) * (
curr(2, x, y , z ) -
curr(2, x, y-1, z ) -
curr(1, x, y , z ) +
curr(1, x, y , z-1)
);
//for y polarization
float volt1 = volt(1, x, y, z);
volt1 *= vv(1, x, y, z);
volt1 +=
vi(1, x, y, z) * (
curr(0, x , y, z ) -
curr(0, x , y, z-1) -
curr(2, x , y, z ) +
curr(2, x-1, y, z )
);
//for z polarization
float volt2 = volt(2, x, y, z);
volt2 *= vv(2, x, y, z);
volt2 +=
vi(2, x, y, z) * (
curr(1, x , y , z) -
curr(1, x-1, y , z) -
curr(0, x , y , z) +
curr(0, x , y-1, z)
);
volt(0, x, y, z) = volt0;
volt(1, x, y, z) = volt1;
volt(2, x, y, z) = volt2;
}
int main(void)
{
const unsigned int numLines[3] = {257, 257, 257};
const int timesteps = 10;
sycl::queue Q;
ArrayXYZN<float>& volt = *CreateArrayXYZN<float>(Q, numLines);
ArrayXYZN<float>& curr = *CreateArrayXYZN<float>(Q, numLines);
ArrayXYZN<float>& vv = *CreateArrayXYZN<float>(Q, numLines);
ArrayXYZN<float>& vi = *CreateArrayXYZN<float>(Q, numLines);
size_t size = numLines[0] * numLines[1] * numLines[2];
fprintf(stderr, "simulate %zu cells for %d timesteps.\n", size, timesteps);
for (int i = 0; i < timesteps; i++) {
Q.submit([&](sycl::handler &h) {
h.parallel_for<class Voltage>(
sycl::range(numLines[0] - 1, numLines[1] - 1, numLines[2] - 1),
[=](sycl::item<3> itm) {
/*
* The first cell on each dimension has data dependency
* outside the simulation box (boundary condition).
* Ignore them for now.
*/
int x = itm.get_id(0) + 1;
int y = itm.get_id(1) + 1;
int z = itm.get_id(2) + 1;
UpdateVoltages(volt, curr, vv, vi, x, y, z);
}
);
});
Q.wait();
}
}
I've solved the problem myself. There were three issues at play.
XNACK
The fundamental problem is that XNACK is disabled. Exactly what XNACK does, or how it can be enabled, is poorly documented in all but a few places; I believe this answer is the only comprehensive guide on the entire Web.

XNACK (GPU retry on page fault) is needed for on-demand page migration between the host and the GPU. Without it, HIP's shared memory operates in a degraded mode: memory will not be automatically migrated based on access patterns. Thus, if USM is to be used, XNACK must be enabled.

You can check whether XNACK is enabled by looking at your platform name. If it contains xnack- (e.g. gfx906:sramecc+:xnack-), XNACK is disabled. If it contains xnack+, XNACK is enabled.

Unfortunately, not all dedicated GPUs support it. Most GPUs from the GFX10/GFX11 series since RDNA do not support XNACK. Thus the use of Unified Shared Memory, which is the recommended practice and heavily used in SYCL programming, suffers a serious hit.
If you're lucky, many dedicated GPUs in the GFX9 series do support XNACK (based on the ISA table in ROCm), but it's disabled by the amdgpu kernel driver by default, possibly due to stability concerns since it's still an experimental feature. It can be enabled manually via the amdgpu kernel module parameter noretry=0 or the boot-time kernel argument amdgpu.noretry=0.
To enable XNACK:

1. It must be supported by the hardware.
2. It must be enabled in the Linux kernel via the noretry=0 flag. After enabling, clinfo or rocminfo should report xnack+ in the GPU's ISA name.
3. It must be enabled in the runtime via the environment variable HSA_XNACK=1 before running a HIP program.
4. (Optionally) Compile your code with an xnack+ target (e.g. use gfx906:xnack+ instead of the plain target name gfx906). This should maximize performance, but your binary will no longer run on devices without XNACK. In my case, I found almost no performance difference.

To check whether XNACK is really enabled, AMD ships a small demo program, /opt/rocm/hip/bin/hipInfo. Run it with AMD_LOG_LEVEL=4 HSA_XNACK=1 ./hipInfo; it should report xnack: 1 at the beginning of its output.

On my particular distro (Gentoo), one needs to build dev-libs/rocr-runtime with USE=debug to allow debugging. The hipInfo program is not built by default, but its source can be found in /usr/share/hip/samples/1_Utils/hipInfo. Change the Makefile's HIPCC=$(HIP_PATH)/bin/hipcc to HIPCC=hipcc and run make.

After enabling
XNACK, the performance of my code became normal: the performance hit is now only 200%, not 1000% or 4000%.

Without XNACK:
With XNACK:
prefetch() and mem_advise()

The next problem is how to achieve good performance without XNACK. The answer is to use the performance hints prefetch() and mem_advise(). When XNACK is disabled, prefetching is essentially a manual copy from the host to the GPU. Also, since the overhead of USM is not zero and page migration is imperfect, these hints are still needed to maximize performance even when XNACK is supported.

prefetch()
One should prefetch data before the GPU needs to use it, by adding prefetch calls for each array immediately after CreateArrayXYZN(). After this change, the performance hit was reduced to only 200%, not 1000% or 4000%.
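The prefetch calls can be sketched as follows (a sketch against the arrays defined in fdtd.cpp; recall that the size field of ArrayXYZN stores the allocation size in bytes, which is what sycl::queue::prefetch() expects):

```cpp
// Hint the runtime to migrate each USM allocation to the device up front,
// instead of relying on on-demand page faults during the first kernel run.
Q.prefetch(volt.array, volt.size);
Q.prefetch(curr.array, curr.size);
Q.prefetch(vv.array, vv.size);
Q.prefetch(vi.array, vi.size);
Q.wait();  // ensure migration completes before the timing loop starts
```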
mem_advise()
Then, one uses platform-specific performance hints to tell the underlying runtime that we want the data to stay on the GPU. Unfortunately, there's no standard for the available hints, so they're device-specific and you may need a lookup table in your program.
For OpenSYCL with the AMD HIP backend, mem_advise() hints are passed directly into HIP's hipMemAdvise() (source code). AMD HIP provides the following hints of interest:

- hipMemAdviseSetReadMostly: the data will mostly be read and only occasionally written to
- hipMemAdviseSetPreferredLocation: set the preferred location of the data to the specified device
- hipMemAdviseSetCoarseGrain: switch from the default fine-grain coherence to coarse-grain, which only guarantees coherence at dispatch boundaries but allows more aggressive optimization

Thus, I added mem_advise() calls for each of the four arrays.
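These calls can be sketched as follows (a sketch: sycl::queue::mem_advise() takes the advice as a plain int, which OpenSYCL forwards to hipMemAdvise(); the value 100 corresponds to hipMemAdviseSetCoarseGrain in HIP's headers, and using the named enum via hip_runtime_api.h is more robust than the raw constant):

```cpp
// hipMemAdviseSetCoarseGrain == 100 in HIP's hip_runtime_api.h;
// prefer the named enum if you can include the HIP headers.
constexpr int advice_coarse_grain = 100;
Q.mem_advise(volt.array, volt.size, advice_coarse_grain);
Q.mem_advise(curr.array, curr.size, advice_coarse_grain);
Q.mem_advise(vv.array,   vv.size,   advice_coarse_grain);
Q.mem_advise(vi.array,   vi.size,   advice_coarse_grain);
Q.wait();
```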
After this modification, USM performance is now almost as good as device memory.
I found hipMemAdviseSetReadMostly and hipMemAdviseSetPreferredLocation had no effect, but hipMemAdviseSetCoarseGrain was able to close the final performance gap between device memory and USM - at the expense of data coherency during simultaneous execution on host and device, which may or may not be acceptable for your application. I believe that for my use case, an explicit Q.wait() is adequate.

Above 4G Decoding & Resizable BAR
Finally, enabling "Above 4G Decoding" and "Resizable BAR" can improve the performance of host-to-GPU data transfers. After enabling these features in the firmware, the performance hit without XNACK or prefetching dropped from 4000% to 1300%. It's not a real solution to the problem, but it helps to maximize performance once USM is fixed using the previous methods.

Discussion
Lack of XNACK

The fundamental problem appears to be that most AMD discrete GPUs either disable the XNACK feature by default or don't support it at all, even though the silicon theoretically appears to have had this capability since GFX8, according to the ISA table in ROCm. Exactly what XNACK does, or how it can be enabled, is poorly documented in all but a few places.

What is XNACK
According to AMD's tutorial:
The documentation of Oak Ridge National Laboratory's supercomputer also states:
How do I enable XNACK
It must be supported by the hardware.
It must be enabled in the Linux kernel via the noretry=0 flag. After enabling, clinfo or rocminfo should report xnack+ in the GPU's ISA name.

It must be enabled in the runtime via the environment variable HSA_XNACK=1 before running a HIP program.

Unfortunately, many dedicated desktop GPUs do not support it, making USM almost useless.
If you're lucky, many dedicated GPUs in the GFX9 series do support XNACK (based on the ISA table in ROCm), but it's disabled by the amdgpu kernel driver by default, possibly due to stability concerns since it's still an experimental feature. It can be enabled manually via the kernel module parameter noretry=0 or the boot-time kernel argument amdgpu.noretry=0.

The Linux kernel documentation says:
Support Status
Unfortunately, many dedicated GPUs since RDNA (most GPUs from the GFX10/GFX11 series) do not support XNACK.
Even on supercomputer cards like the MI100 or the MI250X, support was non-existent until recently, and even now it's still experimental. According to a 2020 research paper:

AMD ROCm developers currently state that the feature is still experimental:
SYCL
It appears that the SYCL runtime has very limited capabilities for managing implicit memory transfers. I asked, "What heuristics does a SYCL runtime use to determine whether copying is needed?" The answer is that there are currently little or none (unlike, say, a CPU's memory controller). On AMD GPUs, OpenSYCL's USM is implemented on top of hipMallocManaged(); thus, SYCL's on-demand paging depends entirely on HIP. Good performance is therefore achieved through programmer-provided hints.
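Putting the fixes together, the shared-memory setup path ends up looking roughly like this (a condensed sketch of the techniques above; 100 stands in for HIP's hipMemAdviseSetCoarseGrain, and the program should be run with HSA_XNACK=1 where the hardware supports it):

```cpp
T *buf = sycl::malloc_shared<T>(size, Q);   // USM shared allocation
Q.memset(buf, 0, size * sizeof(T));
Q.mem_advise(buf, size * sizeof(T), 100);   // hipMemAdviseSetCoarseGrain (AMD HIP specific)
Q.prefetch(buf, size * sizeof(T));          // migrate to the device before first use
Q.wait();
```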
The SYCL 2020 specification also says: