I am trying to benchmark the potential performance increase of emplace_back vs. push_back using Google Benchmark:
#include <benchmark/benchmark.h>
#include <vector>
#include <string>
using namespace std;
static void BM_pushback(benchmark::State& state) {
for(auto _ : state) {
vector<string> data;
for(int i=0;i<10000000;++i)
data.push_back("A long string to avoid sbo");
}
}
static void BM_emplaceback(benchmark::State& state) {
for(auto _ : state) {
vector<string> data;
for(int i=0;i<10000000;++i)
data.emplace_back("A long string to avoid sbo");
}
}
BENCHMARK(BM_pushback)->UseRealTime();
BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK_MAIN();
Using g++ with -O3 I get:
Running ./benchmark
Run on (8 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.60, 0.48, 0.53
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_pushback/real_time 886383300 ns 886276679 ns 1
BM_emplaceback/real_time 698194513 ns 698159138 ns 1
If I, however, change the order of execution, that is
BENCHMARK(BM_emplaceback)->UseRealTime();
BENCHMARK(BM_pushback)->UseRealTime();
I receive opposite timings:
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_emplaceback/real_time 863837627 ns 863814885 ns 1
BM_pushback/real_time 765406073 ns 765407578 ns 1
Can someone explain this behavior? I think, this might have something to do with caching but what exactly is going on here? Just as a note: What seems to help is to reduce the benchmarking size, i.e. adding only 10000 instead of 10000000 strings to my vector (which leads to more cycles being performed by the library).