Benchmarking Clang’s -fbuiltin-std-forward
UPDATE, 2022-12-26: Well, I have eggnog on my face. Turns out I used the wrong Clang binary for some of my benchmarks! So today I reran all of the numbers. I also took this opportunity to control for the cost of #include <utility> (which might have unfairly penalized the std::forward numbers), and to use this Python script to shuffle and interleave the benchmark iterations. These changes greatly reduced the observed effects! If you want to see the previous numbers and find out how they changed, please check the git history.
In 2022 we saw a lot of interest (finally!) in the costs of std::move and std::forward. For example, in April Richard Smith landed -fbuiltin-std-forward in Clang; in September Vittorio Romeo lamented “The sad state of debug performance in C++”; and in December the MSVC team landed [[msvc::intrinsic]].
Recall that std::forward<Arg>(arg) should be used only on forwarding references, and that when you do, it’s exactly equivalent to static_cast<Arg&&>(arg), or equivalently decltype(arg)(arg).
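As a quick sanity check of that equivalence, here is a minimal sketch. The names spellings_agree and check_forward_spellings are made up for illustration; they don't appear in Hana, Clang, or libc++. It asks the compiler to confirm that the three spellings have the same type (and therefore the same value category) for lvalue, const lvalue, and rvalue arguments.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical helper (not from Hana or libc++): compiles only if the three
// spellings of "forward arg" produce exactly the same type.
template <class Arg>
bool spellings_agree(Arg&& arg) {
    using ViaForward = decltype(std::forward<Arg>(arg));
    using ViaStaticCast = decltype(static_cast<Arg&&>(arg));
    using ViaDecltype = decltype(decltype(arg)(arg));
    static_assert(std::is_same<ViaForward, ViaStaticCast>::value,
                  "std::forward<Arg>(arg) and static_cast<Arg&&>(arg) differ");
    static_assert(std::is_same<ViaForward, ViaDecltype>::value,
                  "std::forward<Arg>(arg) and decltype(arg)(arg) differ");
    (void)arg;  // arg is only named in unevaluated contexts above
    return true;
}

// Exercises the helper with an lvalue, a const lvalue, and two rvalues.
inline void check_forward_spellings() {
    int x = 1;
    const int c = 2;
    assert(spellings_agree(x));             // Arg deduced as int&
    assert(spellings_agree(c));             // Arg deduced as const int&
    assert(spellings_agree(42));            // Arg deduced as int
    assert(spellings_agree(std::move(x)));  // Arg deduced as int
}
```

The interesting part happens at compile time: if any spelling disagreed, the static_asserts would reject the program before it ever ran.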
But historically std::forward has been vastly more expensive to compile, because as far as the compiler is concerned, it’s just a function template that needs to be instantiated, codegenned, inlined, and so on. Way back in March 2015 (seven and a half years ago!) Louis Dionne did a little compile-time benchmark and found that he could win “an improvement of […] about 13.9%” simply by search-and-replacing all of Boost.Hana’s std::forwards into static_casts. So he did that.
Now, these days, Clang understands std::forward just like it understands strlen. (You can disable that clever behavior with -fno-builtin-strlen and -fno-builtin-std-forward, respectively.) As I understand it, this means that Clang will avoid generating debug info for instantiations of std::forward, and also inline it into the AST more eagerly. Basically, Clang can short-circuit some of the compile-time cost of std::forward. But does it short-circuit enough of the cost to win back Louis’s 13.9% improvement? Would that patch from 2015 still pass muster today? Let’s reproduce Louis’s benchmark numbers and find out!
In the spirit of reproducibility, I’m going to walk through my entire benchmark-gathering process here. If you just want to see the pretty bar graphs, you might as well jump to the Conclusion.
All numbers were collected on my MacBook Pro running macOS 12.6.1, using a top-of-tree Clang and libc++ built in RelWithDebInfo mode. (This Clang was actually my trivially-relocatable branch, but that doesn’t matter for our purposes.) I tried not to do anything that would grossly interfere with the laptop’s performance during the test (e.g. hunt polyomino snakes); still, be aware that this experiment was poorly controlled in that respect.
Let’s begin!
Check out the old revision of Hana
First, let’s build that old revision of Hana with our top-of-tree Clang and libc++. Instructions for building Clang and libc++ (somewhat bit-rotted, I admit) are in “How to build LLVM from source, monorepo version” (2019-11-09).
$ git clone https://github.com/boostorg/hana/
$ cd hana
$ git checkout 540f665e51~
At this point you might need to make some local changes to get the old revision of Hana to build. I had to make the following five changes:
1. To avoid accidentally including headers from /usr/local/include/boost/hana:
CMakeLists.txt
-find_package(Boost)
+set(Boost_FOUND 0)
2. Because libc++ 16 will change the format of _LIBCPP_VERSION:
include/boost/hana/config.hpp
-# define BOOST_HANA_CONFIG_LIBCPP BOOST_HANA_CONFIG_VERSION( \
- ((_LIBCPP_VERSION) / 1000) % 10, 0, (_LIBCPP_VERSION) % 1000)
+# define BOOST_HANA_CONFIG_LIBCPP BOOST_HANA_CONFIG_VERSION(16, 0, 0)
3. To avoid a compiler warning:
include/boost/hana/core/make.hpp
- static_assert((sizeof...(X), false),
+ static_assert(((void)sizeof...(X), false),
4. To fix an ambiguity with the uint from <sys/types.h> on macOS:
test/tuple.cpp
- prepend(uint<0>, tuple_c<unsigned int, 1>),
+ prepend(boost::hana::uint<0>, tuple_c<unsigned int, 1>),
5. I also made this change up front, because we’re going to need this when we switch to the real std::forward, and we don’t want the cost of including <utility> to be a confounding factor.
$ echo '#include <utility>' >>include/boost/hana/detail/std/forward.hpp
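Change 3 in the list above deserves a short illustration. In a comma expression such as (sizeof...(X), false), the left operand is evaluated and thrown away, which is presumably what the compiler warning complains about. Casting it to void says the discard is intentional, while the condition still depends on the template parameters. The names dependent_false and check_dependent_false below are hypothetical, not Hana's code.

```cpp
#include <cassert>

// Hypothetical stand-in for the condition Hana passes to static_assert:
// always false, but still dependent on the parameter pack X.
template <class... X>
constexpr bool dependent_false() {
    // The (void) cast marks the discarded left operand as intentional.
    return ((void)sizeof...(X), false);
}

// The value is false no matter which types are supplied.
inline void check_dependent_false() {
    assert(!dependent_false<>());
    assert(!(dependent_false<int, char>()));
}
```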
Build the test case
$ mkdir build
$ cd build
$ cmake .. -DCMAKE_CXX_COMPILER=$HOME/llvm-project/build/bin/clang++
$ for i in 1 2 3 4; do make clean; time make -j1 compile.test.tuple; done
This gives us an initial set of timing results:
detail::forward<T> -O0 including link time | user | system | real |
---|---|---|---|
make compile.test.tuple | 114.115s | 6.726s | 127.441s |
make compile.test.tuple | 118.420s | 6.987s | 133.369s |
make compile.test.tuple | 114.480s | 6.430s | 127.353s |
make compile.test.tuple | 116.171s | 6.585s | 130.244s |
Ignore the linker
make is compiling and linking, but we really only care about the cost of compiling. So let’s eliminate the cost of linking. StackOverflow provides this incantation:
$ cd .. ; rm -rf build ; mkdir build ; cd build
$ cmake .. -DCMAKE_CXX_LINK_EXECUTABLE=/usr/bin/true \
-DCMAKE_CXX_COMPILER=$HOME/llvm-project/build/bin/clang++
$ for i in 1 2 3 4; do make clean; time make -j1 compile.test.tuple; done
detail::forward<T> -O0 | user | system | real |
---|---|---|---|
make compile.test.tuple | 105.757s | 4.984s | 113.710s |
make compile.test.tuple | 108.211s | 5.322s | 118.285s |
make compile.test.tuple | 114.888s | 6.059s | 125.776s |
make compile.test.tuple | 107.775s | 5.261s | 117.796s |
Compare -O0, -O2 -g, and -O3
At this point I realize that we’re getting CMake’s default “Debug” build type, which is basically -O0, and not terribly realistic. So let’s also benchmark the build types “RelWithDebInfo” (-O2 -g) and “Release” (-O3).
$ cd .. ; rm -rf build ; mkdir build ; cd build
$ cmake .. -DCMAKE_CXX_LINK_EXECUTABLE=/usr/bin/true \
-DCMAKE_CXX_COMPILER=$HOME/llvm-project/build/bin/clang++ \
-DCMAKE_BUILD_TYPE=RelWithDebInfo
$ for i in 1 2 3 4; do make clean; time make -j1 compile.test.tuple; done
detail::forward<T> -O2 -g | user | system | real |
---|---|---|---|
make compile.test.tuple | 214.718s | 6.197s | 225.538s |
make compile.test.tuple | 207.280s | 5.663s | 215.013s |
make compile.test.tuple | 207.382s | 6.160s | 216.759s |
make compile.test.tuple | 215.180s | 6.244s | 223.713s |
$ cd .. ; rm -rf build ; mkdir build ; cd build
$ cmake .. -DCMAKE_CXX_LINK_EXECUTABLE=/usr/bin/true \
-DCMAKE_CXX_COMPILER=$HOME/llvm-project/build/bin/clang++ \
-DCMAKE_BUILD_TYPE=Release
$ for i in 1 2 3 4; do make clean; time make -j1 compile.test.tuple; done
detail::forward<T> -O3 | user | system | real |
---|---|---|---|
make compile.test.tuple | 166.621s | 4.348s | 175.636s |
make compile.test.tuple | 161.489s | 3.985s | 168.599s |
make compile.test.tuple | 163.130s | 4.390s | 170.926s |
make compile.test.tuple | 170.153s | 4.604s | 178.123s |
Replace detail::std::forward with std::forward
Now, Boost.Hana actually uses a hand-rolled template named detail::std::forward instead of the STL std::forward. That’s certainly going to mess with our numbers, if Clang doesn’t realize that detail::std::forward behaves like std::forward. Let’s replace detail::std::forward with std::forward and collect again:
$ git stash
$ git checkout 540f665e51~
$ git grep -l 'std::forward' .. | xargs sed -i -e 's/::boost::hana::detail::std::forward/::std::forward/g'
$ git grep -l 'std::forward' .. | xargs sed -i -e 's/boost::hana::detail::std::forward/::std::forward/g'
$ git grep -l 'std::forward' .. | xargs sed -i -e 's/hana::detail::std::forward/::std::forward/g'
$ git grep -l 'std::forward' .. | xargs sed -i -e 's/detail::std::forward/::std::forward/g'
$ git commit -a -m 'dummy message'
$ git stash pop
$ cd .. ; rm -rf build ; mkdir build ; cd build
$ cmake .. -DCMAKE_CXX_LINK_EXECUTABLE=/usr/bin/true \
-DCMAKE_CXX_COMPILER=$HOME/llvm-project/build/bin/clang++ \
-DCMAKE_BUILD_TYPE=Debug
$ for i in 1 2 3 4; do make clean; time make -j1 compile.test.tuple; done
std::forward<T> -O0 | user | system | real |
---|---|---|---|
make compile.test.tuple | 103.223s | 5.352s | 115.064s |
make compile.test.tuple | 100.498s | 5.166s | 110.846s |
make compile.test.tuple | 100.564s | 4.951s | 109.218s |
make compile.test.tuple | 101.097s | 5.087s | 113.160s |

std::forward<T> -O2 -g | user | system | real |
---|---|---|---|
make compile.test.tuple | 197.714s | 5.325s | 204.809s |
make compile.test.tuple | 205.491s | 6.348s | 216.938s |
make compile.test.tuple | 210.890s | 6.581s | 221.418s |
make compile.test.tuple | 201.158s | 5.645s | 208.776s |

std::forward<T> -O3 | user | system | real |
---|---|---|---|
make compile.test.tuple | 159.796s | 4.275s | 167.517s |
make compile.test.tuple | 156.718s | 4.398s | 164.323s |
make compile.test.tuple | 160.336s | 4.120s | 169.296s |
make compile.test.tuple | 166.621s | 4.760s | 175.783s |
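To make the detail::std::forward versus std::forward comparison concrete, here is a minimal sketch of what a hand-rolled forward looks like. This is not Hana's actual code: the namespace handrolled and the helpers category, relay, and check_handrolled_forward are made up for illustration. The semantics match std::forward, but because the template is not the standard library's own std::forward, the compiler has to treat it as an ordinary function template.

```cpp
#include <cassert>
#include <type_traits>

namespace handrolled {  // hypothetical namespace, not Hana's

// Forwards an lvalue as either an lvalue or an rvalue, depending on T.
template <class T>
constexpr T&& forward(typename std::remove_reference<T>::type& t) noexcept {
    return static_cast<T&&>(t);
}

// Forwards an rvalue as an rvalue; forwarding it as an lvalue is rejected.
template <class T>
constexpr T&& forward(typename std::remove_reference<T>::type&& t) noexcept {
    static_assert(!std::is_lvalue_reference<T>::value,
                  "cannot forward an rvalue as an lvalue");
    return static_cast<T&&>(t);
}

}  // namespace handrolled

// Returns 1 for lvalues and 2 for rvalues, to observe the value category.
inline int category(int&) { return 1; }
inline int category(int&&) { return 2; }

// Perfect-forwards its argument through the hand-rolled template.
template <class T>
int relay(T&& value) {
    return category(handrolled::forward<T>(value));
}

inline void check_handrolled_forward() {
    int x = 0;
    assert(relay(x) == 1);  // lvalue stays an lvalue
    assert(relay(5) == 2);  // rvalue stays an rvalue
}
```

Every distinct T used with relay instantiates handrolled::forward<T> once more, which is the kind of per-instantiation cost this benchmark measures.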
Compare -fno-builtin-std-forward
My understanding is that all of Clang’s special handling for std::forward can be toggled on and off via -fno-builtin-std-forward (see the relevant commit). So let’s find out whether -fno-builtin-std-forward actually does slow down the compile.
$ cd .. ; rm -rf build ; mkdir build ; cd build
$ cmake .. -DCMAKE_CXX_LINK_EXECUTABLE=/usr/bin/true \
-DCMAKE_CXX_COMPILER=$HOME/llvm-project/build/bin/clang++ \
-DCMAKE_CXX_FLAGS=-fno-builtin-std-forward \
-DCMAKE_BUILD_TYPE=Debug
$ for i in 1 2 3 4; do make clean; time make -j1 compile.test.tuple; done
std::forward<T> -O0 -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 108.377s | 5.352s | 120.684s |
make compile.test.tuple | 112.791s | 5.913s | 125.567s |
make compile.test.tuple | 110.949s | 5.461s | 121.821s |
make compile.test.tuple | 108.780s | 5.253s | 118.924s |

std::forward<T> -O2 -g -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 222.592s | 7.022s | 233.819s |
make compile.test.tuple | 220.630s | 6.608s | 230.786s |
make compile.test.tuple | 216.572s | 6.156s | 226.708s |
make compile.test.tuple | 212.081s | 5.916s | 221.148s |

std::forward<T> -O3 -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 160.622s | 3.822s | 167.232s |
make compile.test.tuple | 164.214s | 4.055s | 171.536s |
make compile.test.tuple | 171.738s | 4.612s | 180.935s |
make compile.test.tuple | 163.612s | 4.103s | 171.175s |
We should also collect these numbers for the original detail::std::forward.
detail::forward<T> -O0 -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 109.730s | 5.318s | 120.951s |
make compile.test.tuple | 108.594s | 5.197s | 120.456s |
make compile.test.tuple | 106.325s | 5.006s | 116.721s |
make compile.test.tuple | 106.106s | 4.907s | 116.015s |

detail::forward<T> -O2 -g -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 213.302s | 6.133s | 222.003s |
make compile.test.tuple | 213.664s | 6.035s | 221.804s |
make compile.test.tuple | 221.149s | 6.862s | 231.551s |
make compile.test.tuple | 209.036s | 5.749s | 217.968s |

detail::forward<T> -O3 -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 161.180s | 4.017s | 168.324s |
make compile.test.tuple | 170.037s | 4.576s | 178.451s |
make compile.test.tuple | 167.476s | 4.481s | 179.675s |
make compile.test.tuple | 164.460s | 4.157s | 169.649s |
Switch to static_cast<T&&>
Now the moment we’ve been waiting for!
Apply the commit that switched Hana from detail::std::forward to static_cast<T&&>:
$ git stash
$ git checkout 540f665e51
$ git stash pop
and run all the same benchmarks again:
$ cd .. ; rm -rf build ; mkdir build ; cd build
$ cmake .. -DCMAKE_CXX_LINK_EXECUTABLE=/usr/bin/true \
-DCMAKE_CXX_COMPILER=$HOME/llvm-project/build/bin/clang++
$ for i in 1 2 3 4; do make clean; time make -j1 compile.test.tuple; done
static_cast<T&&> -O0 | user | system | real |
---|---|---|---|
make compile.test.tuple | 91.421s | 4.558s | 102.832s |
make compile.test.tuple | 95.103s | 4.918s | 106.348s |
make compile.test.tuple | 95.386s | 4.780s | 105.589s |
make compile.test.tuple | 98.711s | 5.516s | 111.344s |

static_cast<T&&> -O2 -g | user | system | real |
---|---|---|---|
make compile.test.tuple | 202.589s | 5.949s | 211.770s |
make compile.test.tuple | 194.883s | 5.456s | 202.536s |
make compile.test.tuple | 196.569s | 5.693s | 207.602s |
make compile.test.tuple | 193.946s | 5.444s | 201.535s |

static_cast<T&&> -O3 | user | system | real |
---|---|---|---|
make compile.test.tuple | 153.371s | 3.884s | 159.933s |
make compile.test.tuple | 149.820s | 3.681s | 156.982s |
make compile.test.tuple | 153.347s | 4.038s | 161.922s |
make compile.test.tuple | 151.943s | 3.947s | 159.044s |

static_cast<T&&> -O0 -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 96.991s | 5.295s | 109.769s |
make compile.test.tuple | 93.719s | 4.668s | 105.419s |
make compile.test.tuple | 94.057s | 4.684s | 99.843s |
make compile.test.tuple | 93.761s | 4.699s | 103.461s |

static_cast<T&&> -O2 -g -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 191.906s | 5.193s | 199.158s |
make compile.test.tuple | 196.225s | 5.444s | 204.091s |
make compile.test.tuple | 194.927s | 5.341s | 202.128s |
make compile.test.tuple | 201.406s | 5.967s | 210.618s |

static_cast<T&&> -O3 -fno-builtin-std-forward | user | system | real |
---|---|---|---|
make compile.test.tuple | 153.070s | 3.940s | 160.831s |
make compile.test.tuple | 155.854s | 3.989s | 164.267s |
make compile.test.tuple | 150.787s | 3.783s | 158.836s |
make compile.test.tuple | 149.737s | 3.739s | 157.007s |
Conclusion
All of the numbers above, collated into a single bar graph:
The -fno-builtin numbers are kind of silly, because ordinary programmers won’t be toggling that option and because it only hurts anyway. So here’s the same graph without the -fno-builtin bars:
On top-of-tree Clang, Boost.Hana’s replacement of std::forward<T> with static_cast<T&&> produces a compile-speed improvement somewhere between 3 and 6 percent on Louis’s benchmark. Replacing the opaque detail::std::forward with static_cast gives 7 to 10 percent. Both are significantly smaller than the 13.9% speedup Louis saw on his own machine back in 2015.
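For readers who want to redo the arithmetic, here is a minimal sketch that derives percentages from the tables above. It assumes the statistic is the mean of the four “real” timings per configuration; the post does not say which statistic the quoted ranges use, so treat that as my assumption. All identifiers (Runs, mean, improvement_pct, the table names) are made up for illustration. Only the numbers come from the tables without -fno-builtin-std-forward.

```cpp
#include <array>
#include <cassert>

using Runs = std::array<double, 4>;  // four "real" timings, in seconds

// Arithmetic mean of the four runs.
inline double mean(const Runs& r) {
    return (r[0] + r[1] + r[2] + r[3]) / 4.0;
}

// Percentage by which `after` compiles faster than `before`, on average.
inline double improvement_pct(const Runs& before, const Runs& after) {
    return 100.0 * (mean(before) - mean(after)) / mean(before);
}

// "real" columns copied from the tables above (builtin left enabled).
const Runs detail_O0  = {{113.710, 118.285, 125.776, 117.796}};
const Runs detail_O2g = {{225.538, 215.013, 216.759, 223.713}};
const Runs detail_O3  = {{175.636, 168.599, 170.926, 178.123}};
const Runs stdfwd_O0  = {{115.064, 110.846, 109.218, 113.160}};
const Runs stdfwd_O2g = {{204.809, 216.938, 221.418, 208.776}};
const Runs stdfwd_O3  = {{167.517, 164.323, 169.296, 175.783}};
const Runs cast_O0    = {{102.832, 106.348, 105.589, 111.344}};
const Runs cast_O2g   = {{211.770, 202.536, 207.602, 201.535}};
const Runs cast_O3    = {{159.933, 156.982, 161.922, 159.044}};

// True if lo < value < hi.
inline bool between(double value, double lo, double hi) {
    return lo < value && value < hi;
}

inline void check_quoted_ranges() {
    // std::forward -> static_cast: quoted as "between 3 and 6 percent".
    assert(between(improvement_pct(stdfwd_O0, cast_O0), 3.0, 6.0));
    assert(between(improvement_pct(stdfwd_O2g, cast_O2g), 3.0, 6.0));
    assert(between(improvement_pct(stdfwd_O3, cast_O3), 3.0, 6.0));
    // detail::std::forward -> static_cast: quoted as "7 to 10 percent";
    // the bounds here are loosened slightly to allow for rounding.
    assert(between(improvement_pct(detail_O0, cast_O0), 6.0, 11.0));
    assert(between(improvement_pct(detail_O2g, cast_O2g), 6.0, 11.0));
    assert(between(improvement_pct(detail_O3, cast_O3), 6.0, 11.0));
}
```

With only four runs per configuration and visible run-to-run noise, these percentages are rough indications, not precise measurements.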
Note that switching from detail::std::forward (unrecognized by Clang) to the standard std::forward helps quite a bit on this benchmark, indicating that -fbuiltin-std-forward is doing its job. The opt-out flag -fno-builtin-std-forward does, as expected, return things to their previous (worse) state.
Practically speaking, would you see any compile-time speedup if you made a similar change to your own codebase? Unlikely, unless your codebase looks a lot like this one. Hana sees a speedup because it uses a huge amount of perfect forwarding, and because this specific benchmark is a stress test focused on std::tuple. Contrariwise, the typical industry codebase ought to spend most of its time compiling non-templates, and use std::forward only rarely.
But if you’re a library writer, it seems that “for compile-time performance, avoid instantiating std::forward” is still plausible advice: much less applicable today on Clang than it was seven years ago (or even one year ago), but a small effect is still noticeable.
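One common way for library code to follow that advice is a small forwarding macro that expands to a cast, so no function template is instantiated at all. The sketch below is an illustration under my own naming, not something Hana does: MYLIB_FWD, Sink, relay, and check_fwd_macro are all hypothetical.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical macro: forwards a forwarding-reference parameter with a plain
// cast. decltype(x) is the declared type T&&, and reference collapsing makes
// decltype(x)&& the same type again.
#define MYLIB_FWD(x) static_cast<decltype(x)&&>(x)

// Counts which overload was chosen, to observe the value category.
struct Sink {
    int copies = 0;
    int moves = 0;
    void take(const std::string&) { ++copies; }
    void take(std::string&&) { ++moves; }
};

// Perfect-forwards `value` to the sink without naming std::forward.
template <class T>
void relay(Sink& sink, T&& value) {
    sink.take(MYLIB_FWD(value));
}

inline void check_fwd_macro() {
    Sink sink;
    std::string s = "lvalue";
    relay(sink, s);                      // lvalue: const& overload
    relay(sink, std::string("rvalue"));  // prvalue: && overload
    relay(sink, std::move(s));           // xvalue: && overload
    assert(sink.copies == 1);
    assert(sink.moves == 2);
}
```

The trade-off is readability: a macro is less self-documenting than std::forward, so it probably only pays for itself in heavily templated library code like the kind benchmarked here.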
I’d be interested to hear the results of this benchmark on a recent GCC or MSVC.