Why does my Google Benchmark result depend on the order of execution?

I am trying to benchmark the potential performance increase of emplace_back vs. push_back using Google Benchmark:

#include <benchmark/benchmark.h>

#include <vector>
#include <string>
using namespace std;

static void BM_pushback(benchmark::State& state) {
    for (auto _ : state) {
        vector<string> data;
        for (int i = 0; i < 10000000; ++i)
            data.push_back("A long string to avoid sbo"); // constructs a temporary string, then moves it in
    }
}

static void BM_emplaceback(benchmark::State& state) {
    for (auto _ : state) {
        vector<string> data;
        for (int i = 0; i < 10000000; ++i)
            data.emplace_back("A long string to avoid sbo"); // constructs the string in place from the literal
    }
}

BENCHMARK(BM_pushback)->UseRealTime();
BENCHMARK(BM_emplaceback)->UseRealTime();

BENCHMARK_MAIN();

Using g++ with -O3 I get:

Running ./benchmark
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.60, 0.48, 0.53
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_pushback/real_time     886383300 ns    886276679 ns            1
BM_emplaceback/real_time  698194513 ns    698159138 ns            1
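
Note that with this element count each benchmark only gets a single library iteration (the Iterations column shows 1). As a sketch of what I could try, the registrations below ask for several repetitions so the run-to-run spread becomes visible; the repetition count is an arbitrary choice on my part, and Repetitions/ReportAggregatesOnly are, as far as I know, standard Google Benchmark options:

// Sketch only: request multiple repetitions per benchmark so the
// reported numbers are aggregates rather than a single pass.
BENCHMARK(BM_pushback)->UseRealTime()->Repetitions(10)->ReportAggregatesOnly(true);
BENCHMARK(BM_emplaceback)->UseRealTime()->Repetitions(10)->ReportAggregatesOnly(true);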

If, however, I change the order of execution by registering the benchmarks the other way around, that is

BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK(BM_pushback)->UseRealTime();

I get the opposite ranking:

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_emplaceback/real_time  863837627 ns    863814885 ns            1
BM_pushback/real_time     765406073 ns    765407578 ns            1

Can someone explain this behavior? I think this might have something to do with caching, but what exactly is going on here? Just as a note: what seems to help is reducing the benchmark size, i.e. adding only 10000 instead of 10000000 strings to the vector (which makes the library run more iterations of each benchmark).
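
For completeness, here is a rough sketch of how the element count could be passed in as a benchmark argument instead of being hardcoded, so both sizes can be compared in one run. BM_pushback_sized is just an illustrative name, and the DoNotOptimize call is my assumption for keeping the vector's contents observable; it uses the same includes and using-declaration as the code above.

// Sketch: parametrize the element count via state.range(0) so the same
// benchmark can run with 10000 and 10000000 elements; everything other
// than those two sizes is illustrative.
static void BM_pushback_sized(benchmark::State& state) {
    for (auto _ : state) {
        vector<string> data;
        for (int i = 0; i < state.range(0); ++i)
            data.push_back("A long string to avoid sbo");
        benchmark::DoNotOptimize(data.data()); // keep the work from being optimized away
    }
}
BENCHMARK(BM_pushback_sized)->Arg(10000)->Arg(10000000)->UseRealTime();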
