P1144 PMR koans

This is the fourth in a series of blog posts (I, II, III, IV) that I started after the Issaquah WG21 meeting, in order to (1) resolve the technical differences between P1144 “std::is_trivially_relocatable” and the now-multiple Bloomberg papers on the topic (P2786R1, P2839R0, and the comparative analysis P2814R0), and (2) convince the rest of the C++ Committee that these resolutions are actually okay to ship.

This post’s topic is the reason (as I understand it) that Bloomberg is suddenly re-interested in trivial relocatability: the polymorphic allocator model.

If the C++ allocator model and/or PMR are unfamiliar to you, you should check out yesterday’s “A not-so-quick introduction to the C++ allocator model” (2023-06-02).

If the unfamiliar part is trivial relocation, check out the wealth of information in my “Trivially Relocatable” talk from C++Now 2019, and/or the #relocatability tag on this blog.

Things we’d like to have

Let’s start with some “friendly” examples. We want the implementation to optimize each of these examples in the “obvious” way. This section contains no tricks or gotchas. Those will come in the next section.

We’ll use the following types in the examples:

struct TC { int i; };
struct TR { std::shared_ptr<int> p; };

static_assert(std::is_trivially_copyable_v<TC>);
static_assert(std::is_trivially_relocatable_v<TR>);

1. Trivial relocatability should propagate in the natural way

We already relied on this premise when we said that TR was trivially relocatable. Rule-of-Zero types, all of whose members are trivially relocatable, must be considered trivially relocatable themselves.

2. Vector reallocation should use memcpy

std::vector<TC> v(100);
v.reserve(v.capacity() + 1);

Today, this will use TC’s move-constructor and destructor in a loop; but we’d like it to use a single 400-byte memcpy instead. Right? (P1144 and P2786 agree on this.)

There are several subcategories here, each with their own ramifications:

2b. Vector reallocation of PMR types should use memcpy

std::vector<std::pmr::vector<int>> v(100);
v.reserve(v.capacity() + 1);

2c. Reallocation in `pmr::vector` should use memcpy

std::pmr::vector<TC> v(100);
v.reserve(v.capacity() + 1);

2d. Reallocation of PMR types in `pmr::vector` should use memcpy

std::pmr::vector<std::pmr::vector<int>> v(100);
v.reserve(v.capacity() + 1);

P2786 §4.3 and §6.4 strongly imply that P2786 would like to support this use-case in particular. Also notice that (2b) and (2d) imply is_trivially_relocatable_v<pmr::vector<int>>; this will become important later. I don’t think there’s any way to achieve (2b) or (2d) without violating (3)…

3. PMR shouldn’t be a special case

namespace My {
  template<class T>
  struct polymorphic_allocator { ~~~~ };

  template<class T>
  using vector = std::vector<T, My::polymorphic_allocator<T>>;
};

My::vector<My::vector<int>> v(100);
~~~~

PMR types have many non-trivial layers of indirection (e.g. polymorphic_allocator::construct and the soon-to-be-undeprecated polymorphic_allocator::destroy. We’ll be tempted to say “The implementation doesn’t actually need to look through all these layers; it can just look for exactly pmr::foo and special-case it somehow.” Here are three counterarguments:

C++ should always permit reinventing the wheel. My::polymorphic_allocator should be able to work exactly as well as pmr::polymorphic_allocator. If Joe Coder can’t implement PMR without dabbling in secret vendor-specific magic, that’s bad.
Vendors themselves think special cases are yucky. If you manage your own implementation (cough Bloomberg), feel free to add all the special cases you want; but if WG21 standardizes a proposal that permits “bad performance, or good performance after a bunch of ad-hoc special-case code,” well, that’s just a verbose way of saying “bad performance.” (Vendors already don’t like PMR. You want them to cruft up their code with special cases just to rescue PMR’s performance? That’s not going to happen.) In other words, if the vendor can’t implement PMR without dabbling in vendor-specific magic, that’s bad.
Most importantly, if you really believe the vendor should just special-case PMR, then you don’t need P1144 or P2786 at all! Look at libstdc++, for example; it’s recognized deque as a special case for vector reallocation since late 2018. Vendors don’t need the Standard’s permission to add special cases, when they want to (which is rare). A proposal must handle arbitrary user-defined types like My::vector and My::allocator, or it has no point at all.

4. Vector insert and erase should use memmove

std::vector<TC> v(100);
v.erase(v.begin());

Today, this will call TC’s move-assignment operator 99 times and then destroy the rightmost TC. But surely we’d like it to destroy the leftmost TC and then memmove 99*sizeof(TC) bytes leftward. Otherwise we’re leaving performance on the table.

P2786 doesn’t mention this case, but Bloomberg’s bsl::vector does this exact optimization; it would be silly to forbid std::vector from doing it. See “Should assignment affect is_trivially_relocatable?” (2024-01-02).

Again there are subcategories here, each with their own ramifications. The analogous cases to (2b) and (2c) are actually a little tricky:

std::vector<std::pmr::vector<int>> v(100);
v.erase(v.begin());

std::pmr::vector<TC> v(100);
v.erase(v.begin());

4d. Insert and erase of PMR types in `pmr::vector` should use memmove

std::pmr::vector<std::pmr::vector<int>> v(100);
v.erase(v.begin());

Today, this will call pmr::vector<int>’s move-assignment operator 99 times and then destroy the rightmost pmr::vector. But since all of the elements in v have the same allocator, this is equivalent to destroying the leftmost pmr::vector and then memmoving 99*sizeof(v[0]) bytes leftward.

We’d really like to make this work, if we can.

5. `swap_ranges` of trivially copyable types should use memswap

TC a[100];
TC b[100];
std::swap_ranges(a, a+100, b);

Today, this will call std::swap(a[i], b[i]) in a loop; but we’d like it to swap a’s bytes with b’s, in whatever size chunks are most performant. Right?

5b. `rotate` of trivially copyable types should use memswap

TC a[100];
std::rotate(a, a+50, a+100);

In practice this is equivalent to std::swap_ranges(a, a+50, a+50), and will automatically benefit from whatever performance enhancements we give swap_ranges.

Problematic koans

The memory-stomping deallocator

Godbolt:

template<class T>
struct A : std::allocator<T> {
  template<class U>
  void destroy(U *p) {
    std::destroy_at(p);
    memset(p, 0xDE, sizeof(U));
  }
};

int main() {
  std::vector<int, A<int>> v = {1,2,3,4,5};
  v.erase(v.begin());
  assert(v.data()[4] == 0xDEDEDEDE); // UB
}

Allocator A’s intent is to mitigate the damage of dangling pointer dereferences by overwriting every destroyed object with 0xDEDEDEDE. But if v.erase ends the lifetime of v[4] by relocating out of it (instead of calling A::destroy on it), then v[4] won’t get overwritten. This defeats A’s intention.

The effect in this case is theoretically unobservable: the line reading v.data()[4] has UB. (vector would be within its rights to go back after the destruction of v[4] and write into that memory an arbitrary value; say, “5”.)

The natural solution in this case is to say that vector::erase isn’t allowed to use relocation if A::destroy is non-trivial. The core language refuses to make a type “naturally” trivially relocatable unless its value-semantic operations are defaulted; vector likewise should refuse to trivially relocate a type unless the allocator’s construct and destroy are defaulted.

Allocators already nerf the vector pessimization: a nothrow-move-constructible type can still throw from A::construct. This is LWG 2461.

The PMR aggregate

Godbolt:

struct Agg {
  std::pmr::vector<int> pv_;
};

int main() {
  std::vector<Agg> v;
  v.push_back({ std::pmr::vector<int>({1,2}, &mr1) });
  v.push_back({ std::pmr::vector<int>({3,4}, &mr2) });
  v.erase(v.begin());
  assert(v[0].pv_.get_allocator().resource() == &mr1);
}

If you accept (3) plus any of (2b,2d,4d), then you ought to accept is_trivially_relocatable_v<pmr::vector<int>>. Then (1) implies that Agg is also trivially relocatable; then (4) implies that this v.erase should use memmove. But then we have a problem!

There are several natural solutions in this case:

Reject (2b,2d,4d) or (3): pmr::vector<int> simply can’t be trivially relocatable. This means it gets none of the optimizations we’re talking about, unless you reject (3) and write special-case optimizations for it.
Reject (4,4d): insert and erase simply can’t use memmove.
Make the “strong claim” (as I historically have) that v violates vector’s preconditions. We’ve put into v a type which satisfies std::relocatable but the values we inserted, in this context, fail to model std::relocatable. This is UB, just like if we tried to std::sort an array of float where some of the values were NaN. The entire STL has an invisible precondition that types that look value-semantic (statically at compile time) must also act value-semantic (dynamically at runtime), and if they don’t, that’s simply “library UB” — all bets are off.
Make the weaker claim that Agg does model std::relocatable, but only by treating pv_.get_allocator() as non-salient. Then our mistake wasn’t in the erase, but only that the assert expects something it has no right to expect. We may reasonably expect a certain value in v[0], but we mustn’t expect anything in particular of its non-salient attributes. It would be equally groundless of us to base the correctness of our C++ program on the assumption that, say, v[0].pv_.capacity() == 2.

The evil allocator-aware type

This is why (4c) was tricky. Godbolt:

struct Widget {
  using allocator_type = std::pmr::polymorphic_allocator<Widget>;
  Widget(int i) : i_(i) {}
  explicit Widget(int i, allocator_type) : i_(i) {}
  explicit Widget(const Widget& rhs, allocator_type) : i_(rhs.i_ + 100) {}
  int i_;
};
static_assert(std::is_trivially_copyable_v<Widget>);

int main() {
  std::pmr::vector<Widget> v = {1,2,3};
  assert(v[0].i_ == 101);
  v.reserve(v.capacity() + 1);
  assert(v[0].i_ == 201);
}

If you accept (2d), then you should accept that v’s reallocation should be able to use memcpy. After all, the element type Widget is trivially copyable, just like int! But Widget’s constructor overload set has been carefully crafted to count the number of times it’s called via polymorphic_allocator::construct.

This snippet is by far the most pathological, but it’s also the most clearly well-defined. We’re not inspecting deallocated memory, nor non-salient state. Widget clearly models all the semantic requirements of std::relocatable — heck, it’s trivially copyable! So there are relatively few ways we can deal with this.

Reject (2c,2d): As in “The memory-stomping deallocator,” say vector<T,A> isn’t allowed to reallocate in terms of relocation when A::construct or A::destroy is non-trivial. This implies that pmr::vector<int> can’t use memcpy to reallocate its buffer.
Reject (3), e.g. by using our intimate knowledge of polymorphic_allocator::construct to let pmr::vector<T> use memcpy as long as !uses_allocator<T, allocator_type>. Even then, (2d) remains out of reach.

Playing around with this example, you’ll find that GCC has trouble with the very first assert: GCC 13+ ingeniously constructs Widget(int, allocator_type) directly into v’s storage, instead of constructing Widget(int) into an initializer_list and then Widget(const Widget&, allocator_type) into v. That’s now bug #110102.

The rotate

Godbolt:

std::pmr::vector<int> vv[] = {
  std::pmr::vector<int>({1,2,3}, &mr1),
  std::pmr::vector<int>({4,5,6}, &mr2),
};
std::rotate(vv, vv+1, vv+2);
  // In practice, equivalent to swap(vv[0], vv[1])

If you accept (3) plus any of (2b,2d,4d), then you accept is_trivially_relocatable_v<pmr::vector<int>>. If you then accept (5,5b), you’ll expect this rotate to use memswap. But that would swap the arenas of vv[0] and vv[1]!

Swapping the arenas would actually be preferable to what this code does today, i.e., UB and heap corruption. But there is a third possible outcome, preferred by Bloomberg: to copy vv[0]’s data into mr2 and copy vv[1]’s data into mr1, just as if we had done vv[0] = std::exchange(vv[1], vv[0]) manually.

There are a few ways to deal with this:

Reject is_trivially_relocatable_v<pmr::vector<int>>; but nobody wants to do that.
Make the strong claim that the range vv violates the preconditions of rotate, as if you tried to sort a range of NaNs.
Make the weaker claim that the rotate is well-defined but get_allocator() isn’t salient, so it’s actually okay that vv[0] switched arenas as a result of the rotate.
Claim that rotate(vv, vv+1, vv+2) should actually be well-defined, e.g. by adopting P0178 “Allocators and swap” (Meredith, February 2016). As a corollary, reject (5,5b): generic algorithms like swap, swap_ranges, rotate, sort, etc. simply aren’t allowed to use trivial relocation, even on TC, because they must act as-if they’re using ADL swap and they’re no longer allowed to assume that ADL swap swaps only the values of its arguments. (See LWG 2153 for more details.)

I think this is the biggest disagreement between P1144 and Bloomberg: On the one hand I’m completely unwilling to give up the performance benefits of (5,5b), and on the other hand the Bloomberg authors seem unwilling yet to agree either that allocators are non-salient or that it’s okay for rotate(vv, vv+1, vv+2) to remain undefined.

Here are a couple of pie-in-the-sky ideas that might lead to answers of their own:

Our last claim above complained that the library couldn’t tell whether ADL swap was “defaulted.” We could teach the core language about swap and allow it to be defaulted, the same way C++20 allows an ADL operator== to be defaulted. A defaulted operation can be queried for triviality, exactly as I proposed in “The Best Type Traits C++ Doesn’t Have” (C++Now 2018). N3746 “A C++1y Swap Operator” (Brown, August 2013) was an early attempt to bring swap into the core language. We could try again, without the operator syntax (which I think N3746 proposed only because C++11 couldn’t =default anything that wasn’t a special member function).
C++’s object model physically ensures that a contiguous range of polymorphic objects (e.g. Cat[5], vector<Cat>) always have the same dynamic type. (If you can legally use pointer arithmetic to get from one to the next, they must have the same type.) Could we logically mandate somehow that a contiguous range of PMR-style objects must all be associated with the same arena? P2685 “Language Support for Scoped Objects” (Meredith & Berne, May 2023) is related, but I haven’t read more than the introduction yet. (Obvious pitfall: Piecewise contiguous ranges, like deque<pmr::string>. Contiguity is not cut-and-dried; in the limit this leads to the strong claim.)

This concludes the set of problematic koans… for now. If I stumble across any more problematic snippets, I’ll update this post.

Posted 2023-06-03

allocators implementation-divergence relocatability value-semantics varna-2023 wg21

P1144 PMR koans

Things we’d like to have

1. Trivial relocatability should propagate in the natural way

2. Vector reallocation should use memcpy

2b. Vector reallocation of PMR types should use memcpy

2c. Reallocation in pmr::vector should use memcpy

2d. Reallocation of PMR types in pmr::vector should use memcpy

3. PMR shouldn’t be a special case

4. Vector insert and erase should use memmove

4d. Insert and erase of PMR types in pmr::vector should use memmove

5. swap_ranges of trivially copyable types should use memswap

5b. rotate of trivially copyable types should use memswap

Problematic koans

The memory-stomping deallocator

The PMR aggregate

The evil allocator-aware type

The rotate

2c. Reallocation in `pmr::vector` should use memcpy

2d. Reallocation of PMR types in `pmr::vector` should use memcpy

4d. Insert and erase of PMR types in `pmr::vector` should use memmove

5. `swap_ranges` of trivially copyable types should use memswap

5b. `rotate` of trivially copyable types should use memswap