C++ idiom of the day: arrow_proxy

I’m currently working on a reference implementation of P0429R6 flat_map. Here’s a C++ metaprogramming idiom I learned in the process.

Consider a zip_range that refers to two other containers and lets you iterate them in parallel:

std::vector<int> keys;
std::vector<int> values;
for (auto&& kv : make_zip_range(keys, values)) {
    std::cout << kv.first << ": " << kv.second << "\n";
    kv.second += 1;
}

Notice that in C++17 we can use CTAD and structured bindings to rewrite this as

for (auto&& [k, v] : zip_range(keys, values)) {
    std::cout << k << ": " << v << "\n";
}

I don’t recommend it, but you can do it.

We can implement most of zip_range easily:

template<class Keys, class Values>
class zip_range {
    using KeyIterator = typename Keys::iterator;
    using ValueIterator = typename Values::iterator;

    Keys *keys_;
    Values *values_;
public:
    using iterator = zip_iterator<KeyIterator, ValueIterator>;

    iterator begin();
    iterator end();

    // boilerplate and const versions omitted
};

template<class KeyIterator, class ValueIterator>
class zip_iterator {
    KeyIterator kit_;
    ValueIterator vit_;
public:
    using KeyType = typename KeyIterator::value_type;
    using ValueType = typename ValueIterator::value_type;

    using iterator_category = std::common_type_t<
        nonstd::iterator_category_t<KeyIterator>,
        nonstd::iterator_category_t<ValueIterator>
    >;
    using difference_type = ptrdiff_t;
    using value_type = std::pair<KeyType, ValueType>;

…But now we hit a stumbling block! What should our reference type be? This is the type returned by operator*, which is also the type of auto&& kv in our sample code above. So it needs to be something with .first and .second members, but somehow kv.second must refer to a particular element in values! This gives us the right idea:

    using reference = std::pair<KeyType&, ValueType&>;

    reference operator*() const {
        return {*kit_, *vit_};
    }

In fact, to deal correctly with vector<bool>, we should use the member typedefs provided by KeyIterator and ValueIterator:

    using KeyReference = typename KeyIterator::reference;
    using ValueReference = typename ValueIterator::reference;
    using reference = std::pair<KeyReference, ValueReference>;

    reference operator*() const {
        return {*kit_, *vit_};
    }

So far so good.

The real stumbling block: operator->

In the preceding paragraph, we succeeded at making this snippet work:

std::vector<int> keys = ...;
std::vector<int> values = ...;
auto zit = make_zip_range(keys, values).begin();
int& x = (*zit).first;
assert(&x == &keys.front());

Now we hit the real issue, which is what I really want to talk about!

int& x = zit->first;
assert(&x == &keys.front());

Recall the way an overloaded operator-> works in C++. For a standard smart pointer, it would look like this:

T *operator->() const {
    return rawptr_;
}

That is, int& x = zit->first will be evaluated the same as if we’d written

decltype(auto) temp = zit.operator->();
int& x = temp->first;

So we need our operator-> to return a pointer to a type with a first member. Basically, we need to return a pair<KeyReference, ValueReference>*. Okay, but where does that pair object live? Who controls its lifetime?

The crazy unsafe way to solve our problem is:

using reference = std::pair<KeyReference, ValueReference>;
using pointer = reference*;

reference operator*() const {
    return {*kit_, *vit_};
}

pointer operator->() const {
    static reference r;
    r = {*kit_, *vit_};
    return &r;
}

This is unsafe because if we wrote something like use(zit->first, (zit+1)->first) we’d have undefined behavior: two modifications of r without an intervening sequence point.

Note that the above snippet actually won’t compile because reference is not default-constructible, but we can fix that by using std::optional<reference> r.

C++ weirdness to the rescue!

Notice that our expansion of zit->first is expressed in terms of temp->first. That is, operator-> is in some sense “recursive”: if our zip_iterator’s overloaded operator-> returns an object with an overloaded operator->, then the compiler will call that operator->, and so on, all the way down, until it (hopefully) bottoms out at a call to the built-in -> that works on native pointers.

Therefore a safe, but still crazy, way to solve our problem is:

using reference = std::pair<KeyReference, ValueReference>;
using pointer = std::unique_ptr<reference>;

reference operator*() const {
    return {*kit_, *vit_};
}

pointer operator->() const {
    return std::make_unique<reference>(*kit_, *vit_);
}

This works because zit->first evaluates zit.operator->() to produce a unique_ptr temp1, and then evaluates temp1.operator->() to produce a raw pointer reference *temp2, and finally evaluates temp2->first to produce a KeyReference. However, if we do it this way, every single use of zit->first causes a heap allocation! This is certainly not okay. (Godbolt.)

So what we do is a little trick called an arrow proxy. It looks like this:

using reference = std::pair<KeyReference, ValueReference>;
using pointer = arrow_proxy<reference>;

reference operator*() const {
    return {*kit_, *vit_};
}

pointer operator->() const {
    return pointer;
}

where arrow_proxy is defined as this seven-line helper struct:

template<class Reference>
struct arrow_proxy {
    Reference r;
    Reference *operator->() {
        return &r;
    }
};

This works because zit->first evaluates zit.operator->() to produce an arrow_proxy<reference> temp1, and then evaluates temp1.operator->() to produce a raw pointer to the pair stored inside the arrow_proxy object itself. The arrow_proxy, just like the unique_ptr in our previous example, lives just long enough to make this code work — it lives until the end of the full-expression containing the -> operator, and then is destroyed, taking the pair<KeyReference, ValueReference> with it. (Godbolt shows perfect codegen.)

Non-const operator->()

Notice that in our finished arrow_proxy, the operator->() is a non-const member function. This goes against everything we’ve been taught! (See const is a contract” (2019-01-03).) Normally, operator-> represents the “dereferencing” operation, and you don’t need to modify an iterator in order to dereference it. That’s why zip_iterator::operator->() const is declared const.

Yet Godbolt shows that if we add const to arrow_proxy::operator->(), we get weird compiler errors. This is because on all three major library implementations, vector<bool>::reference::operator=(bool) is a non-const member function! (This was probably a mistake, but it’s too late for any library vendor to fix it now.)

We could eliminate the compiler error by making r a mutable member of arrow_proxy, but in my opinion that cure would be worse than the disease.

My intuition here is that zip_iterator::operator->() const is properly declared const because it represents the “dereferencing” operation; but arrow_proxy::operator->() can be declared non-const because it doesn’t “represent” any semantic operation. An arrow_proxy is not a pointer-like thing that is semantically “dereferenced.” We shouldn’t bother trying to make it “const-correct” according to semantic rules, because it has no semantics; it’s nothing but an invisible implementation detail of zip_iterator.

So there you are. When implementing zip_iterator, you’re going to need an arrow_proxy.


Various implementations of the arrow_proxy idiom — not all 100% conforming to my recommendations here — can be found in pybind11, Paul Fultz’s hmr, Ryan Haining’s cppitertools, Boost iterator_facade (which calls it operator_arrow_dispatch::proxy), and Boost iterator_archetypes.

Posted 2019-02-06