fast double exp2 function in C


I need a double-precision version of the following fast exp2 function; can you help me? Please don't answer that it is only an approximation, so a double version is pointless and casting the result to (double) is enough. The function, which I found somewhere and which works for me, is much faster than exp2f(), but unfortunately I could not find any double-precision version of it:

inline float fastexp2(float p)   /* requires <stdint.h> for uint32_t */
{
    if (p < -126.f) p = -126.f;  /* clamp below to avoid subnormal results */
    int w = (int)p;
    float z = p - (float)w;
    if (p < 0.f) z += 1.f;       /* z is now the fractional part of p, in [0, 1) */
    union { uint32_t i; float f; } v =
        { (uint32_t)((1 << 23) * (p + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z)) };
    return v.f;
}

There is 1 answer

Answered by njuffa (score 10):

My assumption is that the existing code from the question assumes IEEE-754 binary floating-point computation, in particular mapping C's float type to IEEE-754's binary32 format.

The existing code suggests that only floating-point results in the normal range are of interest: subnormal results are avoided by clamping the input from below and overflow is ignored. So for float computation valid inputs are in the interval [-126, 128). By exhaustive test, I found that the maximum relative error of the function in the question is 7.16e-5, and that the root-mean-square (RMS) error is 1.36e-5.

My assumption is that the desired change to double computation should widen the range of allowed inputs to [-1022, 1024), and that identical relative accuracy should be maintained. The code is written in a fairly cryptic fashion. So as a first step, I rearranged it into a more readable version. In a second step, I tweaked the coefficients of the core approximation so as to minimize the maximum relative error. This results in the following ISO-C99 code:

#include <math.h>    // floorf
#include <stdint.h>  // int32_t
#include <string.h>  // memcpy

/* compute 2**p, for p in [-126, 128). Maximum relative error: 5.04e-5; RMS error: 1.03e-5 */
float fastexp2 (float p)
{
    const int FP32_MIN_EXPO = -126; // exponent of minimum binary32 normal
    const int FP32_MANT_BITS = 23;  // number of stored mantissa (significand) bits
    const int FP32_EXPO_BIAS = 127; // binary32 exponent bias
    float res;

    p = (p < FP32_MIN_EXPO) ? FP32_MIN_EXPO : p; // clamp below
    /* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
    float w = floorf (p); // integral part
    float z = p - w;      // fractional part
    /* approximate 2**z-1 for z in [0, 1) */
    float approx = -0x1.6e7592p+2f + 0x1.bba764p+4f / (0x1.35ed00p+2f - z) - 0x1.f5e546p-2f * z;
    /* assemble the exponent and mantissa components into final result */
    int32_t resi = ((1 << FP32_MANT_BITS) * (w + FP32_EXPO_BIAS + approx));
    memcpy (&res, &resi, sizeof res);
    return res;
}

Refactoring and retuning of the coefficients resulted in slight improvements in accuracy: the maximum relative error is now 5.04e-5 and the RMS error 1.03e-5. It should be noted that floating-point arithmetic is generally not associative; therefore any re-association of floating-point operations, whether by manual code changes or by compiler transformations, could affect the stated accuracy and requires careful re-testing.

I do not expect any need to modify the code as it compiles into efficient machine code for common architectures, as can be seen from trial compilations with Compiler Explorer, e.g. gcc with x86-64 or gcc with ARM64.

At this point it is obvious what needs to be changed to switch to double computation: change all instances of float to double and all instances of int32_t to int64_t, adjust the type suffixes of literal constants and the names of math functions, and replace the format-specific parameters of IEEE-754 binary32 with those of IEEE-754 binary64. The coefficients of the core approximation need re-tuning to make the best possible use of double precision.

#include <math.h>    // floor
#include <stdint.h>  // int64_t
#include <string.h>  // memcpy

/* compute 2**p, for p in [-1022, 1024). Maximum relative error: 4.93e-5. RMS error: 9.91e-6 */
double fastexp2 (double p)
{
    const int FP64_MIN_EXPO = -1022; // exponent of minimum binary64 normal
    const int FP64_MANT_BITS = 52;   // number of stored mantissa (significand) bits
    const int FP64_EXPO_BIAS = 1023; // binary64 exponent bias
    double res;

    p = (p < FP64_MIN_EXPO) ? FP64_MIN_EXPO : p; // clamp below
    /* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
    double w = floor (p); // integral part
    double z = p - w;     // fractional part
    /* approximate 2**z-1 for z in [0, 1) */
    double approx = -0x1.6e75d58p+2 + 0x1.bba7414p+4 / (0x1.35eccbap+2 - z) - 0x1.f5e53c2p-2 * z;
    /* assemble the exponent and mantissa components into final result */
    int64_t resi = ((1LL << FP64_MANT_BITS) * (w + FP64_EXPO_BIAS + approx));
    memcpy (&res, &resi, sizeof res);
    return res;
}

Both maximum relative error and root-mean-square error decrease very slightly, to 4.93e-5 and 9.91e-6, respectively. This is as expected: for an approximation that is only accurate to roughly 15 bits, it matters little whether intermediate computation is performed with 24 bits of precision or with 53. The computation uses a division, which tends to be slower for double than for float on all platforms I am familiar with, so the double-precision port does not seem to offer any significant advantage other than perhaps eliminating conversion overhead when the calling code computes in double precision.