I recently started working with OpenMP to do some 'research' for a project at university. I have a rectangular, evenly spaced grid on which I'm solving a partial differential equation with an iterative scheme, so I basically have two for-loops (one each for the x- and y-direction of the grid) wrapped in a while-loop for the iterations.
Now I want to investigate different parallelization schemes for this. The first (obvious) approach was a spatial parallelization of the for-loops. That works fine.
The approach I have problems with is a trickier idea: each thread calculates all grid points. The first thread starts solving the equation at the first grid row (y=0). When it's finished, that thread moves on to the next row (y=1) and so on. At the same time, thread #2 can already start at y=0, because all the necessary information is already available. The threads just need a kind of manual synchronization so they can't overtake each other.
For that I use an array called check. It contains, for each grid row, the ID of the thread that is currently allowed to work on it. While the upcoming row is not 'ready' (the value in check[j] is not correct), the thread spins in an empty while-loop until it is.
Things will get clearer with an MWE:
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main()
{
    // initialize variables
    int iter = 0;                       // iteration step counter
    int check[100] = { 0 };             // initialize all rows for thread #0
    #pragma omp parallel num_threads(2)
    {
        int ID, num_threads, nextID;
        double u[100 * 300] = { 0 };
        // get parallelization info
        ID = omp_get_thread_num();
        num_threads = omp_get_num_threads();
        // determine next valid id
        if (ID == num_threads - 1) nextID = 0;
        else nextID = ID + 1;
        // iteration loop until the abort criterion (HERE: SIMPLIFIED) is met
        while (iter<1000)
        {
            // rows (j=0 and j=99 are boundary conditions and don't have to be calculated)
            for (int j = 1; j < (100 - 1); j++)
            {
                // manual synchronization: wait until the previous thread has completed enough rows
                while (check[j + 1] != ID)
                {
                    //printf("Thread #%d is waiting!\n", ID);
                }
                // gridpoints in row j
                for (int i = 1; i < (300 - 1); i++)
                {
                    // solve PDE on gridpoint
                    // replaced by random operation to consume time
                    double ignore = pow(8.39804,10.02938) - pow(12.72036,5.00983);
                }
                // update of the check array; atomic to avoid a race condition
                // (atomic write takes a single assignment, not a block)
                #pragma omp atomic write
                check[j] = nextID;
            }// for j
            #pragma omp atomic write
            check[100 - 1] = nextID;
            #pragma omp atomic
            iter++;
            #pragma omp single
            {
                printf("Iteration step: %d\n\n", iter);
            }
        }//while
    }// omp parallel
}//main
The thing is, this MWE actually works on my machine. But as soon as I copy it into my project, it doesn't. Additionally, the outcome is different every time: it stops either after the first iteration or after the third.
Another weird thing: when I remove the slashes from the comment in the inner while-loop, it works! The output then contains some
"Thread #1 is waiting!"
but that's reasonable. It looks to me like I somehow created a race condition, but I don't know where.
Does somebody have an idea what the problem could be? Or a hint on how to realize this kind of synchronization?
                        
I think you are mixing up atomicity and memory consistency. The OpenMP standard actually describes this very nicely in 1.4 Memory Model and 1.4.3 The Flush Operation: OpenMP provides a relaxed-consistency, shared-memory model. Each thread is allowed to keep its own temporary view of memory (machine registers, cache, or other local storage), and a value written by one thread is only guaranteed to become visible to another thread after both sides perform a flush. Your plain read of check[j + 1] in the wait loop is neither atomic nor does it imply a flush.
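Concretely, because that read carries no synchronization, the compiler is free to load the value once and then spin on a register; conceptually (tmp is just an illustrative name), your empty wait loop may be compiled as if it were:

int tmp = check[j + 1];   // value loaded once, kept in a register
while (tmp != ID)
{
    // spins forever if the loaded value was stale
}

That would also explain why uncommenting the printf appears to 'fix' it: the opaque function call makes it much harder for the compiler to keep the value cached in a register, but the underlying data race remains.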
To avoid that, you should also make the read of check[] atomic and specify the seq_cst clause on your atomic constructs. This clause forces an implicit flush on the operation (it is called a sequentially consistent atomic construct). Disclaimer: I can't really try the code right now.
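A minimal sketch of that change, assuming a compiler with OpenMP 4.0 support (the version that introduced the seq_cst clause):

// wait loop: sequentially consistent atomic read of the row owner
int owner;
do
{
    #pragma omp atomic read seq_cst
    owner = check[j + 1];
} while (owner != ID);

// ... compute row j ...

// hand the row over with a matching seq_cst atomic write
#pragma omp atomic write seq_cst
check[j] = nextID;

The same seq_cst clause should go on the atomic write to check[100 - 1] after the j-loop.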
Further notes:
- I think the iter stop criterion is bogus the way you use it, but I guess that's irrelevant given that it is not your actual criterion.
- I assume this variant will perform worse than the spatial decomposition. You lose a lot of data locality, especially on NUMA systems. But of course it is fine to try and measure.
- There seems to be a discrepancy between your code (using check[j + 1]) and your description "At the same time thread #2 can already start at y=0".