I used a small test program to measure how well OpenMP parallelizes a recursive computation done in arbitrary precision with the mpfr/gmp libraries. As expected, the OpenMP overhead makes the parallel version slower at low precision, but with enough bits the parallel version becomes faster.
The sequential loops look like this:
....
for ( i = 0; i < 1000; i++ ) {
    mpfr_set_d ( z1, 0.0, MPFR_RNDN );
    mpfr_set_d ( z2, 0.0, MPFR_RNDN );
    ...
    iter = 0;
    while ( iter < 10000 ) {
         mpfr_sqr ( tmp1, z1, MPFR_RNDN );
         mpfr_sqr ( tmp2, z2, MPFR_RNDN );
         mpfr_sub ( tr, tmp1, tmp2, MPFR_RNDN );
         mpfr_add ( tr, tr, cr, MPFR_RNDN );
         mpfr_mul_2si ( tmp3, z1, 1, MPFR_RNDN );
         ...
         iter++;
    }
}
and the parallel version:
....
omp_set_dynamic(0);
for ( i = 0; i < 10; i++ ) {
    mpfr_set_d ( z2, 0.0, MPFR_RNDN );
    mpfr_set_d ( z1, 0.0, MPFR_RNDN );
    ...
    iter = 0;
    while ( iter < 10000 ) {
        #pragma omp parallel num_threads(4)
        {
            switch ( omp_get_thread_num() ) {
            case 0:
                mpfr_sqr ( tmp1, z1, MPFR_RNDN );
                mpfr_sqr ( tmp2, z2, MPFR_RNDN );
                mpfr_sub ( tr, tmp1, tmp2, MPFR_RNDN );
                mpfr_add ( tr, tr, cr, MPFR_RNDN ); break;
            case 1:
                mpfr_mul_2si ( tmp3, z1, 1, MPFR_RNDN );
                mpfr_mul ( ti, tmp3, z2, MPFR_RNDN );
                mpfr_add ( ti, ti, ci, MPFR_RNDN ); break;
            ...
                mpfr_mul_2si ( tti, tti, 1, MPFR_RNDN ); break;
            }
        }
        mpfr_set ( z1, tr, MPFR_RNDN );
        mpfr_set ( z2, ti, MPFR_RNDN );
        mpfr_set ( d1, ttr, MPFR_RNDN );
        mpfr_set ( d2, tti, MPFR_RNDN );
        iter++;
    }
}
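The per-iteration pattern above (fork, compute the independent parts, join, copy back) can be tried without mpfr. The following is only a minimal sketch in double precision, not the actual test program: it covers just the two cases shown (tr and ti), it uses `omp sections` in place of the switch on omp_get_thread_num(), and the function names iterate_seq and iterate_par are made up for illustration. Built without -fopenmp the pragmas are ignored and both sections simply run one after the other.

```c
#include <assert.h>

/* Sequential reference: z <- z^2 + c, starting from z = 0.
   (z1, z2) are the real and imaginary parts, as in the loops above. */
void iterate_seq(double cr, double ci, int iters, double *z1, double *z2)
{
    double a = 0.0, b = 0.0;
    assert(iters >= 0);
    for (int k = 0; k < iters; k++) {
        double tr = a * a - b * b + cr;   /* corresponds to case 0 */
        double ti = 2.0 * a * b + ci;     /* corresponds to case 1 */
        a = tr;
        b = ti;
    }
    *z1 = a;
    *z2 = b;
}

/* Same recurrence, but one parallel region is opened per iteration.
   ASSUMPTION: "omp sections" stands in for the switch on the thread
   number; tr and ti are shared, each written by exactly one section. */
void iterate_par(double cr, double ci, int iters, double *z1, double *z2)
{
    double a = 0.0, b = 0.0, tr = 0.0, ti = 0.0;
    assert(iters >= 0);
    for (int k = 0; k < iters; k++) {
        #pragma omp parallel sections num_threads(2)
        {
            #pragma omp section
            { tr = a * a - b * b + cr; }
            #pragma omp section
            { ti = 2.0 * a * b + ci; }
        }
        /* The implicit barrier at the end of the region guarantees
           that tr and ti are complete before they are copied back. */
        a = tr;
        b = ti;
    }
    *z1 = a;
    *z2 = b;
}
```

Calling both functions with the same c and iteration count should give matching results. In double precision the work per region is tiny, so this variant is dominated by the cost of opening and closing the parallel region.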
Running times in seconds, system A, sequential:
- 320 bits: 11
- 640 bits: 16
- 960 bits: 21
- 2560 bits: 60
- 5000 bits: 152

Running times in seconds, system A, parallel:
- 320 bits: 15
- 640 bits: 16
- 960 bits: 18
- 2560 bits: 32
- 5000 bits: 65

Running times in seconds, system B, sequential:
- 320 bits: 13
- 640 bits: 18
- 960 bits: 27
- 2560 bits: 80
- 5000 bits: 202

Running times in seconds, system B, parallel:
- 320 bits: 51
- 640 bits: 54
- 960 bits: 56
- 2560 bits: 76
- 5000 bits: 128
 
System A: Fedora 19, kernel 3.11.10-200.fc19.x86_64,
Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
System B: CentOS 6.5, kernel 2.6.32-431.1.2.0.1.el6.x86_64,
Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
ltrace shows about the same percentages for called functions and system calls on both systems. Both systems use the latest gmp, mpfr and gcc versions. Why is system B so much worse than system A, with what looks like many times more OpenMP overhead? Has the Linux kernel improved that much in this regard? Are there kernel parameters or similar settings I should look at? Could CPU hardware differences or limitations explain it? Are there any other explanations? Do I have to install Fedora 19 on B to fix this?
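To compare the bare cost of entering and leaving a parallel region on A and B, independent of mpfr, something like the following could be timed on each machine. This is a rough sketch under my own assumptions, not part of the test program: the names now_seconds and time_empty_regions are invented here, the wall clock comes from C11 timespec_get, and the atomic increment adds a small cost of its own on top of the fork/join.

```c
#include <assert.h>
#include <time.h>

/* Wall-clock time in seconds (C11 timespec_get, not monotonic). */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Opens and closes `regions` parallel regions that do almost nothing.
   Returns the elapsed wall-clock seconds. *entries receives how many
   times a thread entered a region in total, so the caller can check
   that a team was really started (roughly 4 * regions with OpenMP,
   exactly regions when compiled without -fopenmp). */
double time_empty_regions(long regions, long *entries)
{
    long count = 0;
    assert(regions >= 0);
    double start = now_seconds();
    for (long k = 0; k < regions; k++) {
        #pragma omp parallel num_threads(4)
        {
            /* Each thread of the team bumps the shared counter once. */
            #pragma omp atomic
            count++;
        }
    }
    double elapsed = now_seconds() - start;
    *entries = count;
    return elapsed;
}
```

Dividing the returned time by the number of regions gives an approximate per-region cost that can be set against the differences between the sequential and parallel timings above.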
Update: Thanks for the tip. It did change the results for system B.
Running times in seconds, system B, parallel (before -> after):
- 320 bits: 51 -> 23
- 640 bits: 54 -> 26
- 960 bits: 56 -> 29
- 2560 bits: 76 -> 47
- 5000 bits: 128 -> 99
 
B is still behind A, but the gap has become a lot smaller.