Trivial relocation, std::swap
, and a $2000 prize
std::swap
, and a $2000 prizeThis is the first in a series of at least three weekly blog posts. Each post
(I,
II,
III,
IV)
will explain
one of the problems facing P1144 “std::is_trivially_relocatable
”
and P2786R0 “Trivial relocatability options” as
we try to (1) resolve their technical differences and (2) convince the rest of the C++ Committee
that these resolutions are actually okay to ship.
Today’s topic is trivial swapping.
Is trivial swap beneficial?
Trivial swappability first showed up on this blog in June 2018,
even before P1144 was a thing.
Back then I said that optimizing std::swap
into three memmoves had “zero benefit as far as I can see.” But soon
after that, I implemented a trivial swap
in my libc++ fork, and saw that the benefits were actually
significant. For example, std::rotate
is based on swap
, so any improvement in swap
is picked up for free
by std::rotate
(Godbolt).
And by std::partition
(Godbolt).
And by std::swap_ranges
, std::array::swap
, and so on.
The notion that some types are “trivially swappable” was also picked up by Nathan Myers in
P2187 “std::swap_if
” (July 2020),
although that particular paper ended up not going anywhere.
Microsoft’s STL goes out of its way to recognize trivially copyable types in swap
, array::swap
,
swap_ranges
, etc., and thus collects bug reports from people asking why it doesn’t use trivial swap
in more places and for more types:
- “Automatically using the vectorized std::swap_ranges and std::reverse” (July 2018)
- “std::swap of arrays […] specialization for trivial types” (April–May 2022)
So, trivial swap does seem like it would be nice to have.
Does trivial relocatability imply trivial swappability?
The Standard’s [utility.swap]/3 defines the semantics of std::swap
as follows:
Exchanges values stored in two locations.
It doesn’t say any more than that about how the values are swapped. So a reasonable naïve implementation would be
alignas(T) char temp[sizeof(T)];
std::relocate_at(&a, (T*)temp);
std::relocate_at(&b, &a);
std::relocate_at((T*)temp, &b);
That is: “Relocate a
’s value into temporary housing; then relocate b
into a
’s old home; finally,
relocate a
’s value from the temporary back into b
.” Compare this to the
present-day implementation
T temp = std::move(a);
a = std::move(b);
b = std::move(temp);
// temp.~T() at end of scope
which was in turn significantly more efficient than C++98’s
T temp = a;
a = b;
b = temp;
// temp.~T() at end of scope
The change from C++98 to C++11’s move-enabled implementation was kind of a big deal:
nothing pre-C++11 had ever given std::swap
permission to “move-from” a
like that. Similarly, nothing
pre-P1144 has ever given std::swap
permission to “relocate-from” a
. This would be something new.
Relocating-from is a particularly big deal because it involves destroying the original object.
If we swap a
and b
as above, we’ll end the lifetime of the original a
object
and then start a new object’s lifetime (also at a
) on the next line.
If I took a pointer to a
before the swap, is it “dangling” after the swap because the object
it was pointing at has been destroyed? There’s still an object at that address, of course; but
now it’s a “different” object. (C++’s formal wording doesn’t handle this kind of thing very well.
The relevant wording is in [basic.life]/8,
around the term of art transparently replaceable; it’s currently the subject of
multiple CWG issues,
which should make these lifetime shenanigans perfectly legal once the wording issues are resolved.)
Now, in practice, we wouldn’t bother using std::relocate_at
inside std::swap
, because for
non-trivially-relocatable types std::relocate_at
likely costs more than move-assignment.
So all we need to do is distinguish trivially swappable types (which we’ll swap “by implementation magic”
without ending their lifetimes) from non-trivially swappable types (which we’ll swap the
traditional C++11 way).
P2786R0 section 9.6, top of page 17, makes the intriguing claim that “all trivially swappable types are trivially relocatable.” The idea is that if
T
is trivially swappable, then you can always relocate a singleT
simply by pretending that you have already constructed aT
at the destination address, trivially swapping the realT
with the imaginaryT
, and then quitting the pretense that anyT
exists at the source address anymore.My impression is that “trivially swappable” and “trivially relocatable” are, indeed, exactly synonymous. But P2786’s argument feels like sophistry: It’s not obvious how one could implement “trivial relocation in terms of trivial swapping” in practice, and of course it wouldn’t be efficient to do so. Contrariwise, “trivial swapping in terms of trivial relocation” is easily implemented in practice.
Uh-oh: std::swap
can’t unconditionally memcpy
Let’s assume we’ve solved all of the formal wording issues above. We still face a practical problem: sometimes a trivially swappable type can’t be trivially swapped! I’ve previously covered this issue in “When is a trivially copyable type not trivially copyable?” (2018-07-13).
Consider this code:
struct Lesser {
explicit Lesser(int, int);
short s;
char c;
};
void swapit(Lesser& a, Lesser& b) {
std::swap(a, b);
}
You might think that swapit
could be implemented as three calls to memmove
:
from &a
to the stack, from &b
to &a
, and from the stack back to &b
. So far
so good. But how many bytes should those memmove
s copy? sizeof(Lesser)
?
Wrong!
template<class Comp>
struct Widget {
[[no_unique_address]] Comp comp;
char d;
};
On the Itanium ABI used by GCC and Clang, Widget<Lesser>::d
is stored in the tail-padding
byte of Widget<Lesser>::comp
. This means that if you swap all 4 bytes of the Lesser
object, you’re actually also swapping the byte stored in its tail padding.
So if you have implemented Widget
’s member swap
function in the obvious straightforward way…
void swap(Widget& y) noexcept {
using std::swap;
swap(comp, y.comp);
swap(d, y.d);
}
…you’ll find that it swaps the d
values twice! This is an unmitigated disaster.
We can’t have this.
But wait, doesn’t Microsoft use memcpy
for trivial swapping?
Yes, and they see this “disaster” in practice. But, because Microsoft’s ABI uses the tail-padding so much less often than Itanium’s ABI, the disaster is mitigated by happening only occasionally, not constantly.
Here’s a concrete example of MSVC getting it wrong (Godbolt):
struct __declspec(align(8)) TailPad {
int a;
};
struct Evil : TailPad {
int b;
};
int main() {
Evil e1 = {1,2};
Evil e2 = {3,4};
TailPad& tp1 = e1;
TailPad& tp2 = e2;
std::swap(tp1, tp2); // should swap 1 and 3, leave 2 and 4 alone
assert(e1.b == 2); // fails at runtime
}
MSVC’s std::swap
dispatches to the precompiled library routine __std_swap_ranges_trivially_swappable_noalias
with parameters that ask it to swap sizeof(TailPad)
bytes — but that’s too many,
and it ends up swapping Evil::b
when we didn’t mean to.
All three of libc++, libstdc++, and MSVC currently have similar misbehavior when copying trivially copyable (yet tail-padded) objects. (This is now libstdc++ bug #108846.)
std::copy(&tp1, &tp1 + 1, &tp2)
std::move(&tp1, &tp1 + 1, &tp2)
std::copy_backward(&tp1, &tp1 + 1, &tp2 + 1)
std::move_backward(&tp1, &tp1 + 1, &tp2 + 1)
std::copy_n(&tp1, 1, &tp2)
std::move_n(&tp1, 1, &tp2)
std::rotate_copy
,std::set_union
, etc., which dispatch tostd::copy
std::swap_ranges(&tp1, &tp1 + 1, &tp2)
(MSVC only)
The simplest fix in these cases is to check the size of the input range. If it’s a range of 1 element,
it might be a potentially overlapping subobject, and so we can’t memcpy
. If it’s a contiguous range of
2 or more elements, then it’s safe to memcpy
: as far as I know, neither major ABI
ever stores things in the tail padding of an array element.
But std::swap
itself doesn’t know “the size of the input range.” It only ever receives single
objects. So swap
isn’t as easy to fix as copy
.
If std::swap
can’t memcpy
, then how do we trivially swap?
My current draft of D1144R7 (git) suggests five possible ways forward:
#1. Per Dana Jansens,
bring the ABI notion of “data size” into standard C++. Lesser
’s “data size” is less than its sizeof
,
so it’s simply not trivially relocatable. (But it remains trivially copyable? This doesn’t satisfy me,
for multiple reasons.)
#2. Bring it in behind the scenes: allow the compiler to consider any type non-trivially relocatable
for implementation-defined secret reasons. (For example, Lesser
might be trivially relocatable on MSVC, because MSVC’s ABI
promises never to store anything in its tail padding; but non-trivial on GCC, because GCC’s ABI doesn’t promise.)
#3. Bring it all the way into the core language, e.g. by adding datasizeof(T)
alongside sizeof(T)
.
std::swap
then can use memcpy for trivially swappable types, as long as it swaps datasizeof(T)
bytes
and not sizeof(T)
bytes.
#4. Completely disallow the public generic algorithm std::swap(T&, T&)
from using trivial relocation.
Behind the scenes, vendors can always create a private entrypoint __swap_complete_objects(T&, T&)
that does
the exact same thing as swap
but with a precondition that the objects are complete.
std::rotate
, std::partition
, and so on could be reimplemented in terms of __swap_complete_objects
.
This solution is unsatisfying because it requires heroics from the vendor.
Remember, the benefit of trivial swap
is that it speeds up rotate
and partition
for free. If the
vendor has to go reimplement rotate
and partition
in terms of some new private functionality, it’s
not free anymore.
#5. Bless what MSVC is doing today. Recharacterize that situation I called an “unmitigated disaster”:
It’s not a bug, it’s a precondition. Does std::swap
swap the wrong number of bytes when you use it on
overlapping subobjects? Don’t do that, then!
P1144R6 implicitly assumed solution #5;
but that was before we came up with examples like Widget<Lesser>
, that seem pretty obviously not the
programmer’s fault. I think we have to make Widget<Lesser>
work. Therefore I no longer think
solution #5 is acceptable.
Solution #4 is the only solution that is fully backward-compatible with C++23, so it’s the most likely
path forward. However, those “heroics” to reimplement things in terms of __swap_complete_objects
turn out
to be extremely difficult. This week I gave it a quick try in libc++, and gave up in despair.
The problem is all the layers of indirection:
rotate
is implemented in terms ofswap_ranges
, which is implemented in terms ofiter_swap
, which is a customization point implemented in terms ofswap
, which is a customization point. Only the very highest level knows whether it’s safe to usememmove
; only the very lowest level needs to know.
“Implementation bounty” X-Prize
I found solution #4 above very difficult to implement; but maybe I’m missing something. So I’m putting my money where my mouth is!
I hereby offer a $2000 (USD) cash prize to the first person who produces an efficient and conforming implementation
of std::swap_ranges
, std::rotate
, and std::partition
— using memmove
-based trivial swapping where possible,
but falling back to a conforming std::swap
implementation when the objects involved might be
potentially overlapping subobjects.
UPDATE, 2023-03-04: My libc++ fork now passes all eight cases in the test suite! So the prize for
patches to libc++, specifically, is claimed. But as of this writing the prize remains claimable for
patches to libstdc++ or Microsoft STL. See “Update on my trivial swap
prize offer” (2023-03-04).
Here (backup) is the test suite that the winning implementation will pass. It consists of seven mandatory test cases:
must_use_memmove_1
:std::rotate
on iterators of typeTR*
must_use_memmove_2
:std::ranges::rotate
on iterators of typeTR*
must_use_memmove_3
:std::partition
on iterators of typeTR*
must_use_memmove_4
:std::ranges::partition
on iterators of typeTR*
must_use_memmove_5
:std::swap_ranges
on iterators of typeTR*
must_not_fail_6
:std::swap
on two potentially overlapping subobjects of typeTR
must_not_fail_7
:std::rotate
on a range of potentially overlapping subobjects of typeTR
And one bonus test case that would be nice-to-have, but you’ll win the prize even without it:
should_use_memmove_8
:std::rotate
on iterators of typereverse_iterator<TR*>
“Passing” tests 1–5 and 8 is defined as “generating code with no calls to operator delete
,”
i.e., generating zero calls to TR::operator=(TR&&)
. “Passing” tests 6 and 7 simply means not
failing the assertions. And of course your patched library should remain conforming — for example,
it should pass its own upstream test suite.
As of 2023-02-22, Clang trunk passes only tests 6 and 7; my libc++ fork passes only tests 1, 3, 5, and 8. (UPDATE: As of 2023-03-04, my libc++ fork passes all eight tests.)
The easiest solutions for me to judge will be patches against libc++ (either against trunk or against my P1144 fork), because I can build them locally. Patches against libstdc++ or Microsoft STL are also acceptable, but will be harder for me to evaluate without help. By the way, if you patch GCC to natively recognize trivially relocatable types, I’d be thrilled to assist you in getting your patched version its own place in the dropdown on Godbolt Compiler Explorer!
If you have a solution that passes all seven test cases, you can email me, or find me on the cpplang Slack.