I wrote a program that uses C++ AMP to run on my laptop's GPU (Intel HD Graphics 520). The GPU kernel is long, so I will give a high-level description (let me know if more is needed).
Note that I fall into the "know enough to be dangerous" category of programmer.
parallel_for_each(accelerator_view, number_of_runs.extent, [data](index<1> idx) restrict(amp)
{
    double total = data.starting_total[idx];
    // these "working variables" are used for a variety of things in the code
    double working_variable = 0.0;
    double working_variable2 = 0.0;
    for (int i = 0; i < 20; i++)
    {
        // ...do lots of stuff. "total" is changed by various factors...
        // total is still a positive number that is greater than zero
        // working_variable now has a positive non-zero value, and I want to find what %
        // of the remaining total that value is
        working_variable2 = 1.0 / total;
        working_variable2 = working_variable * working_variable2;
        // Note that if I write it like this the same issue will happen:
        working_variable2 = working_variable / total;
        // ...keep going and doing more things, write some values to data...
        if (total == 0)
            break;
    }
});
When I run this without much else running on my computer, it works fine and I get the results I expect.
Where it gets tricky is when I am stressing the system (or at least I think I am). I stress-test the system by 1) starting my program, 2) opening Chrome, and 3) going to YouTube and starting a video.
When I do that I get unexpected results (either while a program is opening or while a video is playing). I traced it back to the "1.0 / total" calculation returning infinity (inf), even though "total" is greater than zero. Here is an example of what I output to the console when this issue happens:
total = 51805.6
1.0 / total = inf
precise_math::pow(total, -1) = 1.93029e-05
I am running the kernel about 1.6 million times, and between 0 and 15 of those 1.6 million runs hit this issue. The number of failures varies from run to run, and so does which threads are affected.
So I feel confident that "total" is not zero and that this is not a divide-by-zero situation. What am I missing? What could be causing this issue? Is there any way to prevent it? I am thinking of replacing all division in the kernel with precise_math::pow(num, -1).
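To make that workaround concrete, here is a minimal sketch of what I have in mind. It is written as plain host-side C++ so it can be compiled anywhere; it is not my actual kernel code. The names Reciprocal, guarded_reciprocal, and fraction_of_total are hypothetical helpers I made up for illustration. Inside the restrict(amp) kernel I assume the equivalent would use precise_math::isfinite and precise_math::pow instead of the std:: versions.

```cpp
#include <cmath>

// Hypothetical helper (not from my real kernel): the result of a guarded reciprocal.
struct Reciprocal {
    double value;      // the reciprocal that will actually be used
    bool   fell_back;  // true if plain division gave a non-finite result
};

// Compute 1.0 / total. If the result is not finite even though total is a
// finite, non-zero number, recompute it as pow(total, -1) and flag it.
inline Reciprocal guarded_reciprocal(double total) {
    double r = 1.0 / total;
    if (!std::isfinite(r) && std::isfinite(total) && total != 0.0) {
        return { std::pow(total, -1.0), true };
    }
    return { r, false };
}

// Fraction of the remaining total that `part` represents.
inline double fraction_of_total(double part, double total) {
    return part * guarded_reciprocal(total).value;
}
```

For example, fraction_of_total(working_variable, total) would replace the two-step division in the loop. On a correctly behaving host the fallback branch should never be taken, so the fell_back flag would mainly be useful for counting how often the GPU hits the bad result.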
P.S. Yes, I am aware that part of the answer is "don't watch videos while running". It is the opening of programs during execution that I am most concerned about.
Thank you!