C++ idiom of the day: arrow_proxy

I’m currently working on a reference implementation of P0429R6 flat_map. Here’s a C++ metaprogramming idiom I learned in the process.

Consider a zip_range that refers to two other containers and lets you iterate them in parallel:

std::vector<int> keys;
std::vector<int> values;
for (auto&& kv : make_zip_range(keys, values)) {
std::cout << kv.first << ": " << kv.second << "\n";
kv.second += 1;
}


Notice that in C++17 we can use CTAD and structured bindings to rewrite this as

for (auto&& [k, v] : zip_range(keys, values)) {
std::cout << k << ": " << v << "\n";
}


I don’t recommend it, but you can do it.

We can implement most of zip_range easily:

template<class Keys, class Values>
class zip_range {
using KeyIterator = typename Keys::iterator;
using ValueIterator = typename Values::iterator;

Keys *keys_;
Values *values_;
public:
using iterator = zip_iterator<KeyIterator, ValueIterator>;

iterator begin();
iterator end();

// boilerplate and const versions omitted
};

template<class KeyIterator, class ValueIterator>
class zip_iterator {
KeyIterator kit_;
ValueIterator vit_;
public:
using KeyType = typename KeyIterator::value_type;
using ValueType = typename ValueIterator::value_type;

using iterator_category = std::common_type_t<
nonstd::iterator_category_t<KeyIterator>,
nonstd::iterator_category_t<ValueIterator>
>;
using difference_type = ptrdiff_t;
using value_type = std::pair<KeyType, ValueType>;


…But now we hit a stumbling block! What should our reference type be? This is the type returned by operator*, which is also the type of auto&& kv in our sample code above. So it needs to be something with .first and .second members, but somehow kv.second must refer to a particular element in values! This gives us the right idea:

    using reference = std::pair<KeyType&, ValueType&>;

reference operator*() const {
return {*kit_, *vit_};
}


In fact, to deal correctly with vector<bool>, we should use the member typedefs provided by KeyIterator and ValueIterator:

    using KeyReference = typename KeyIterator::reference;
using ValueReference = typename ValueIterator::reference;
using reference = std::pair<KeyReference, ValueReference>;

reference operator*() const {
return {*kit_, *vit_};
}


So far so good.

The real stumbling block: operator->

In the preceding paragraph, we succeeded at making this snippet work:

std::vector<int> keys = ...;
std::vector<int> values = ...;
auto zit = make_zip_range(keys, values).begin();
int& x = (*zit).first;
assert(&x == &keys.front());


Now we hit the real issue, which is what I really want to talk about!

int& x = zit->first;
assert(&x == &keys.front());


Recall the way an overloaded operator-> works in C++. For a standard smart pointer, it would look like this:

T *operator->() const {
return rawptr_;
}


That is, int& x = zit->first will be evaluated the same as if we’d written

decltype(auto) temp = zit.operator->();
int& x = temp->first;


So we need our operator-> to return a pointer to a type with a first member. Basically, we need to return a pair<KeyReference, ValueReference>*. Okay, but where does that pair object live? Who controls its lifetime?

The crazy unsafe way to solve our problem is:

using reference = std::pair<KeyReference, ValueReference>;
using pointer = reference*;

reference operator*() const {
return {*kit_, *vit_};
}

pointer operator->() const {
static reference r;
r = {*kit_, *vit_};
return &r;
}


This is unsafe because if we wrote something like use(zit->first, (zit+1)->first) we’d have undefined behavior: two modifications of r without an intervening sequence point.

Note that the above snippet actually won’t compile because reference is not default-constructible, but we can fix that by using std::optional<reference> r.

C++ weirdness to the rescue!

Notice that our expansion of zit->first is expressed in terms of temp->first. That is, operator-> is in some sense “recursive”: if our zip_iterator’s overloaded operator-> returns an object with an overloaded operator->, then the compiler will call that operator->, and so on, all the way down, until it (hopefully) bottoms out at a call to the built-in -> that works on native pointers.

Therefore a safe, but still crazy, way to solve our problem is:

using reference = std::pair<KeyReference, ValueReference>;
using pointer = std::unique_ptr<reference>;

reference operator*() const {
return {*kit_, *vit_};
}

pointer operator->() const {
return std::make_unique<reference>(*kit_, *vit_);
}


This works because zit->first evaluates zit.operator->() to produce a unique_ptr temp1, and then evaluates temp1.operator->() to produce a raw pointer reference *temp2, and finally evaluates temp2->first to produce a KeyReference. However, if we do it this way, every single use of zit->first causes a heap allocation! This is certainly not okay. (Godbolt.)

So what we do is a little trick called an arrow proxy. It looks like this:

using reference = std::pair<KeyReference, ValueReference>;
using pointer = arrow_proxy<reference>;

reference operator*() const {
return {*kit_, *vit_};
}

pointer operator->() const {
return pointer{{*kit_, *vit_}};
}


where arrow_proxy is defined as this seven-line helper struct:

template<class Reference>
struct arrow_proxy {
Reference r;
Reference *operator->() {
return &r;
}
};


This works because zit->first evaluates zit.operator->() to produce an arrow_proxy<reference> temp1, and then evaluates temp1.operator->() to produce a raw pointer to the pair stored inside the arrow_proxy object itself. The arrow_proxy, just like the unique_ptr in our previous example, lives just long enough to make this code work — it lives until the end of the full-expression containing the -> operator, and then is destroyed, taking the pair<KeyReference, ValueReference> with it. (Godbolt shows perfect codegen.)

Non-const operator->()

Notice that in our finished arrow_proxy, the operator->() is a non-const member function. This goes against everything we’ve been taught! (See const is a contract” (2019-01-03).) Normally, operator-> represents the “dereferencing” operation, and you don’t need to modify an iterator in order to dereference it. That’s why zip_iterator::operator->() const is declared const.

Yet Godbolt shows that if we add const to arrow_proxy::operator->(), we get weird compiler errors. This is because on all three major library implementations, vector<bool>::reference::operator=(bool) is a non-const member function! (This was probably a mistake, but it’s too late for any library vendor to fix it now.)

We could eliminate the compiler error by making r a mutable member of arrow_proxy, but in my opinion that cure would be worse than the disease.

My intuition here is that zip_iterator::operator->() const is properly declared const because it represents the “dereferencing” operation; but arrow_proxy::operator->() can be declared non-const because it doesn’t “represent” any semantic operation. An arrow_proxy is not a pointer-like thing that is semantically “dereferenced.” We shouldn’t bother trying to make it “const-correct” according to semantic rules, because it has no semantics; it’s nothing but an invisible implementation detail of zip_iterator.

So there you are. When implementing zip_iterator, you’re going to need an arrow_proxy.

Various implementations of the arrow_proxy idiom — not all 100% conforming to my recommendations here — can be found in pybind11, Paul Fultz’s hmr, Ryan Haining’s cppitertools, Boost iterator_facade (which calls it operator_arrow_dispatch::proxy), and Boost iterator_archetypes.

Posted 2019-02-06