Looping Performance in C++

Today I was testing the performance of a piece of code that basically accesses each element in a container within a for loop. The result was quite shocking: the std::for_each version was more than 10 times faster than the raw loop. What?

Test for yourself

#include <vector>
#include <array>
#include <algorithm>
#include <chrono>
#include <iostream>

int main()
{
    // Six buffers of 480000 floats each, all set to 0.5f.
    std::array<std::vector<float>, 6> buffers;
    for (int i = 0; i < 6; ++i) {
        buffers[i] = std::vector<float>(480000, 0.5f);
    }

    // 1) Raw index-based loop.
    auto tstart = std::chrono::high_resolution_clock::now();
    auto accum = 0;
    for (int i = 0; i < 6; ++i) {
        for (size_t j = 0; j < buffers[i].size(); ++j) {
            if (buffers[i][j] < 1.0f)
                ++accum;
        }
    }
    auto tend = std::chrono::high_resolution_clock::now();
    auto duration = tend - tstart;
    std::cout << "Raw loop: " << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << std::endl;

    // 2) Range-based for loop.
    tstart = std::chrono::high_resolution_clock::now();
    accum = 0;
    for (const auto& buffer : buffers) {
        for (const auto& value : buffer) {
            if (value < 1.0f)
                ++accum;
        }
    }
    tend = std::chrono::high_resolution_clock::now();
    duration = tend - tstart;
    std::cout << "Range-based for loop: " << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << std::endl;

    // 3) Nested std::for_each with lambdas.
    tstart = std::chrono::high_resolution_clock::now();
    accum = 0;
    std::for_each(buffers.begin(), buffers.end(),
        [&accum](const std::vector<float>& buffer) {
            std::for_each(buffer.begin(), buffer.end(), [&accum](float value) { if (value < 1.0f) ++accum; });
        }
    );
    tend = std::chrono::high_resolution_clock::now();
    duration = tend - tstart;
    std::cout << "std::for_each: " << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << std::endl;
}

I was using VS2015 with a Debug build. Here’s the output (times in milliseconds):

Raw loop: 978

Range-based for loop: 426

std::for_each: 66

However, when I switched to a Release build:

Raw loop: 2

Range-based for loop: 2

std::for_each: 5

That’s what I had been expecting. When I changed the timer precision to nanoseconds, the raw loop turned out to be slightly faster than the range-based for loop.

Lesson learned

The compiler sure knows how to optimize your code. So do your profiling with optimizations turned on.