Using OpenMP #pragma directives to speed up nested C++ for loops

I have this code and need help using OpenMP #pragmas to speed it up. I want to parallelize the for loops over the variables i and j. The values of n and m can also be much larger, for example n = 1000, m = 500.

My code:

for (int i = 0; i < n; ++i) {
    for (int j = 0; j < m; ++j) {
        matrix[i][j] = {static_cast<double>(distribution(gen)),
                        static_cast<double>(distribution(gen))};
    }
}

#pragma omp parallel
{
    n = 50;
    m = 50;
    ReS = 0;
    ImS = 0;

    for (int nu = 0; nu < 1; nu++) {
#pragma omp for num_threads(4)
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double* matrixPtr = &matrix[i][j].first;
                double* matrixPtrSecond = &matrix[i][j].second;

                for (double theta = 0; theta < PI; theta += theta_plus) {
                    for (double phi = 0; phi < 2 * PI; phi += theta_plus) {
                        double angle = *matrixPtr * cos(phi) * sin(theta) +
                                       *matrixPtrSecond * sin(phi) * sin(theta);
                        ReS += cos(angle);
                        ImS += sin(angle);
                    }
                }
            }
        }
    }
}

1 Answer

Answered by John Bollinger:

The

#pragma omp parallel
{
    // ...

in your original code causes the whole associated block to be executed by each of the threads of a team, rather than dividing the work among them. That seems counterproductive. At minimum, the intention is unclear.
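To see the effect, here's a minimal sketch (my snippet, not from the question): the body of a bare parallel block runs once per thread, so the print fires once for every thread in the team.

#include <omp.h>
#include <cstdio>

int main() {
    // The whole block is executed by every thread of the team,
    // so this prints once per thread, not once in total.
    #pragma omp parallel num_threads(4)
    {
        std::printf("hello from thread %d\n", omp_get_thread_num());
    }
}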

The ...

for (int nu = 0; nu < 1; nu++)

... loop unconditionally executes exactly one iteration. That seems pointless.

The main computation performs frequent updates to the shared variables ReS and ImS. That would be an absolute performance killer (and, without synchronization, a data race), but the structure is well suited for using a reduction to resolve that problem.
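As a rough sketch of what a + reduction arranges behind the scenes (my illustration, with a placeholder loop body standing in for your computation): each thread accumulates into private copies, which are folded into the shared variables once at the end.

#include <omp.h>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 1000;
    double ReS = 0, ImS = 0;

    #pragma omp parallel
    {
        double ReS_local = 0, ImS_local = 0;   // per-thread partial sums

        #pragma omp for
        for (int i = 0; i < n; i++) {
            // placeholder body, not the real computation
            ReS_local += std::cos(i * 0.001);
            ImS_local += std::sin(i * 0.001);
        }

        #pragma omp critical    // each thread folds in its partials once
        {
            ReS += ReS_local;
            ImS += ImS_local;
        }
    }

    std::printf("ReS = %f, ImS = %f\n", ReS, ImS);
}

A reduction(+:ReS,ImS) clause expresses this pattern in one line, and lets the runtime combine the partials more efficiently than a critical section.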

You said (I think) that you wanted to parallelize the loops over i and j. The code you present parallelizes only the outer of those, though of course that carries the inner loop along with it. The loop structure is compatible with parallelizing over the (i, j) pairs instead of just over i, via collapse(2), so I'll show that below, though I'm by no means sure it will make much difference.

Here's how that might apply to the main loop:

#if 0
// Remove this:
#pragma omp parallel
{
#endif
n = 50;
m = 50;
ReS = 0;
ImS = 0;

#pragma omp parallel for collapse(2) num_threads(4) reduction(+:ReS,ImS)
for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
        double *matrixPtr = &matrix[i][j].first;
        double *matrixPtrSecond = &matrix[i][j].second;

        for (double theta = 0; theta < PI; theta += theta_plus) {
            for (double phi = 0; phi < 2 * PI; phi += theta_plus) {
                double angle = *matrixPtr * cos(phi) * sin(theta)
                             + *matrixPtrSecond * sin(phi) * sin(theta);
                ReS += cos(angle);
                ImS += sin(angle);
            }
        }
    }
}
#if 0
}
#endif

There may also be opportunities to substantially speed up even the serial version of the computation, as discussed in comments on the question, but I'm focusing here on the use of OpenMP directives, which is what the question seems to be asking about.
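For instance, one plausible serial improvement (my assumption; the comments referred to are not shown here) is to precompute the angular factors, which depend only on theta and phi and not on (i, j), and to hoist sin(theta) out of the phi loop:

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical refactoring; the function name and signature are mine.
void accumulate(const std::vector<std::vector<std::pair<double, double>>>& matrix,
                double theta_plus, double& ReS, double& ImS)
{
    const double PI = std::acos(-1.0);

    // sin(theta)*cos(phi) and sin(theta)*sin(phi), computed once for all (i, j)
    std::vector<double> cx, cy;
    for (double theta = 0; theta < PI; theta += theta_plus) {
        const double st = std::sin(theta);   // hoisted out of the phi loop
        for (double phi = 0; phi < 2 * PI; phi += theta_plus) {
            cx.push_back(st * std::cos(phi));
            cy.push_back(st * std::sin(phi));
        }
    }

    ReS = ImS = 0;
    // The "#pragma omp parallel for collapse(2) reduction(+:ReS,ImS)"
    // shown above would still drop onto these two loops unchanged.
    for (std::size_t i = 0; i < matrix.size(); ++i) {
        for (std::size_t j = 0; j < matrix[i].size(); ++j) {
            const double a = matrix[i][j].first;
            const double b = matrix[i][j].second;
            for (std::size_t k = 0; k < cx.size(); ++k) {
                const double angle = a * cx[k] + b * cy[k];
                ReS += std::cos(angle);
                ImS += std::sin(angle);
            }
        }
    }
}

This removes the cos(phi), sin(phi), and sin(theta) calls from the innermost loop, leaving only the two trig calls on angle itself.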