I'm using Cudafy.NET and I'm having trouble with the BlockSize: it produces different results in some situations. In short, the difference is here:
//correct results when using this line
gpu.Launch(1, 7, "kernelfx_alldata", 10, devdata, devnmin, devnmax, devgmin, devgmax, devtest);
//incorrect results when using this line
gpu.Launch(1, 8, "kernelfx_alldata", 10, devdata, devnmin, devnmax, devgmin, devgmax, devtest);
A detailed explanation of the problem:
I have 10 items to loop over. The GridSize is 1.
CASE 1: CudafyModes.Target = eGPUType.OpenCL and the BlockSize is 1 to 7. The results are correct.
CASE 2: CudafyModes.Target = eGPUType.OpenCL and the BlockSize is 8 or more. The results are incorrect.
CASE 3: CudafyModes.Target = eGPUType.Emulator and the BlockSize is anything from 1 upward (including 8 or more). The results are correct.
The example code is shown below. Initializing the variables:
double[,] data;
double[] nmin, nmax, gmin, gmax;
void initializeVars()
{
    data = new double[10, 10];
    for (int i = 0; i < 10; i++)
    {
        data[i, 0] = 100 + i;
        data[i, 1] = 32 + i;
        data[i, 2] = 22 + i;
        data[i, 3] = -20 - i;
        data[i, 4] = 5522 + 10 * i;
        data[i, 5] = 40 + i;
        data[i, 6] = 14 - i;
        data[i, 7] = 12 + i;
        data[i, 8] = -10 + i;
        data[i, 9] = 10 + 10 * i;
    }
    nmin = new double[10];
    nmax = new double[10];
    gmin = new double[10];
    gmax = new double[10];
    for (int i = 0; i < 10; i++)
    {
        nmin[i] = -1;
        nmax[i] = 1;
        gmin[i] = i;
        gmax[i] = 11 * i * i + 1;
    }
}
GPU launch code:
private void button1_Click(object sender, EventArgs e)
{
    CudafyModes.Target = eGPUType.OpenCL;
    CudafyModes.DeviceId = 0;
    CudafyTranslator.Language = eLanguage.OpenCL;
    CudafyModule km = CudafyTranslator.Cudafy();
    Cudafy.Host.GPGPU gpu = Cudafy.Host.CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
    gpu.LoadModule(km);
    initializeVars();
    double[,] devdata = gpu.Allocate<double>(data); gpu.CopyToDevice(data, devdata);
    double[] devnmin = gpu.Allocate<double>(nmin); gpu.CopyToDevice(nmin, devnmin);
    double[] devnmax = gpu.Allocate<double>(nmax); gpu.CopyToDevice(nmax, devnmax);
    double[] devgmin = gpu.Allocate<double>(gmin); gpu.CopyToDevice(gmin, devgmin);
    double[] devgmax = gpu.Allocate<double>(gmax); gpu.CopyToDevice(gmax, devgmax);
    double[] test = new double[10];
    double[] devtest = gpu.Allocate<double>(test);
    gpu.Launch(1, 8, "kernelfx_alldata", 10, devdata, devnmin,
        devnmax, devgmin, devgmax, devtest);
    gpu.CopyFromDevice(devtest, test);
    gpu.FreeAll();
}
The Cudafy kernel:
[Cudafy]
public static void kernelfx_alldata(GThread thread, int N, double[,] data, double[] nmin, double[] nmax, double[] gmin, double[] gmax, double[] test)
{
    int tid = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
    while (tid < N)
    {
        double[] tmp = thread.AllocateShared<double>("tmp", 10);
        tmp[0] = 1;
        for (int i = 1; i < 10; i++)
        {
            tmp[i] = data[tid, i - 1];
        }
        for (int i = 1; i < 10; i++)
        {
            tmp[i] = (nmax[i - 1] - nmin[i - 1]) / (gmax[i - 1] - gmin[i - 1]) * (tmp[i] - gmin[i - 1]) + nmin[i - 1];
        }
        test[tid] = tmp[1];
        tid = tid + thread.blockDim.x * thread.gridDim.x;
    }
}
The correct (CASE 1 and CASE 3) results are:
test[0]=199.0 test[1]=201.0 test[2]=203.0 test[3]=205.0 test[4]=207.0 test[5]=209.0 test[6]=211.0 test[7]=213.0 test[8]=215.0 test[9]=217.0
The incorrect (CASE 2) results are:
test[0]=213.0 test[1]=213.0 test[2]=213.0 test[3]=213.0 test[4]=213.0 test[5]=213.0 test[6]=213.0 test[7]=213.0 test[8]=217.0 test[9]=217.0
When the BlockSize is lower than 8, the results are correct. But when the BlockSize is 8 or greater, the results are incorrect. In order to use the GPU efficiently, the BlockSize must be greater than 8.
What is wrong with this code?
Best Regards...
Declaring tmp as a 2D array, with the thread ID as the first index, solves the problem. The working code is below:
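Some background on why this helps: `AllocateShared` gives one buffer per block, so every thread in the block reads and writes the same `tmp`. The plain C# sketch below is not Cudafy code; it is a CPU-only model of the `tmp[1]` path, with hypothetical names (`SharedTmpSketch`, `Simulate`, `slotPerThread`). It assumes the threads of a block run in lock step and that the highest-numbered thread's store is the one that survives, which is one possible outcome of the race rather than a guarantee.

```csharp
using System;

static class SharedTmpSketch
{
    // Hypothetical CPU-only model of the kernel's tmp[1] path; not Cudafy code.
    // Assumption: threads run in lock step and the last thread's store wins.
    public static double[] Simulate(int blockSize, bool slotPerThread)
    {
        const int N = 10;
        var test = new double[N];

        // One block, so each pass of the kernel's while loop covers blockSize items.
        for (int first = 0; first < N; first += blockSize)
        {
            int active = Math.Min(blockSize, N - first);
            var tmp = new double[slotPerThread ? active : 1];
            var loaded = new double[active];

            // tmp[1] = data[tid, 0], where data[tid, 0] = 100 + tid
            for (int t = 0; t < active; t++)
                tmp[slotPerThread ? t : 0] = 100 + first + t;

            // Scaling step: every thread reads first, then every thread writes.
            // With nmin = -1, nmax = 1, gmin = 0, gmax = 1 it reduces to 2 * x - 1.
            for (int t = 0; t < active; t++)
                loaded[t] = tmp[slotPerThread ? t : 0];
            for (int t = 0; t < active; t++)
                tmp[slotPerThread ? t : 0] = 2 * loaded[t] - 1;

            // test[tid] = tmp[1]
            for (int t = 0; t < active; t++)
                test[first + t] = tmp[slotPerThread ? t : 0];
        }
        return test;
    }

    static void Main()
    {
        // One slot shared by the whole block:
        // 213 213 213 213 213 213 213 213 217 217
        Console.WriteLine(string.Join(" ", Simulate(8, slotPerThread: false)));

        // One slot per thread:
        // 199 201 203 205 207 209 211 213 215 217
        Console.WriteLine(string.Join(" ", Simulate(8, slotPerThread: true)));
    }
}
```

Under those assumptions the shared-slot run reproduces the CASE 2 numbers, and giving each thread its own slot restores the expected ones. The sketch does not explain why block sizes below 8 happen to work on this device; with an unsynchronized shared buffer, the outcome depends on how the device schedules the threads.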