Does bulk memmove speed up std::remove_if? (No.)
std::remove_if? (No.)This morning I was reading the umpteenth std-proposals thread proposing some variety of
unstable_remove
and it occurred to me that one odd thing about a swap-and-pop-based unstable_remove
is that it tends to replace large swaths of contiguous removals by reversing the
elements that are kept. For example (Godbolt):
template<class BidirIt, class Pred>
BidirIt unstable_remove_if(BidirIt first, BidirIt last, Pred pred) {
while (true) {
first = std::find_if(first, last, pred);
if (first == last) return first;
while (true) {
--last;
if (first == last) return first;
if (!pred(*last)) break;
}
*first++ = std::move(*last);
}
}
int main() {
auto in234 = [](int x) { return 2 <= x && x <= 4; };
std::vector<int> v = {1,2,3,4,5,6,7,8,9};
v.erase(unstable_remove_if(v.begin(), v.end(), in234), v.end());
// v is {1,9,8,7,5,6}
}
It would be more aesthetically pleasing to produce {1,7,8,9,5,6}: for trivially copyable
elements, this would be a single memmove. In a sense, “reversing the elements” is extra work
that we might save time by avoiding. But to avoid that work, we might have to do even more
work in the form of bookkeeping. I haven’t attempted to benchmark any such change to
unstable_remove.
But the same aesthetic consideration applies to the ordinary, stable std::remove_if.
Traditionally it’s a loop over operator=, like this:
template <class FwdIt, class Pred>
FwdIt smooth_remove_if(FwdIt first, FwdIt last, Pred pred) {
FwdIt dfirst = std::find_if(first, last, pred);
if (dfirst != last) {
for (first = std::next(dfirst); first != last; ++first) {
if (!pred(*first)) {
*dfirst++ = std::move(*first);
}
}
}
return dfirst;
}
But what if we “chunked” the writes to *dfirst, like this?
template <class FwdIt, class Pred>
FwdIt chunky_remove_if(FwdIt first, FwdIt last, Pred pred) {
FwdIt dfirst = std::find_if(first, last, pred);
if (dfirst != last) {
for (first = std::next(dfirst); first != last; ++first) {
if (!pred(*first)) {
FwdIt sfirst = first;
first = std::find_if(std::next(first), last, pred);
dfirst = std::move(sfirst, first, dfirst);
if (first == last) break;
}
}
}
return dfirst;
}
Delegating the work to the std::move algorithm
allows the library to use memmove for that work, if the element type happens to be trivially
copyable. The nested loop complicates the generated code (Godbolt),
but is it worth it? Do we actually save CPU cycles?
Well, if the average length of a “run” of survivors is long, then we
might expect the cost of a single large memmove to beat out the cost of operator=-in-a-loop.
But at the other extreme, if the average length of a run is only 1 element, then memmove won’t
save anything; we’ll be paying all the bookkeeping cost for none of the benefit.
By “bookkeeping” I mean not only the extra register and icache pressure of the more complicated algorithm, but also the overhead of the call to memmove itself, and that memmove will necessarily start by checking whether it needs to handle forward overlap. (That branch is completely predictable — it doesn’t — but is still overhead absent from the “smooth” version.)
I wrote a little benchmark (backup)
to time a call to std::remove_if on an array of a million ints,
with a bit-masking predicate that removed half of the elements; one in 8;
one in 128; or one in 1024. I contrived this benchmark deliberately to play
to the “chunky” algorithm’s strengths: trivially copyable elements, with large swaths moved
contiguously.
The “smooth” column here is merely to prove that my smooth_remove_if implementation
is essentially the same as libc++’s default implementation:
my chunky_remove_if can’t blame its defeat on “library vendor magic.”
Yet it is defeated, and decisively so:
| Elements removed | std |
smooth | chunky | Penalty | smooth assignments |
chunky memmoves |
|---|---|---|---|---|---|---|
| 1 in 2 | 3122 us | 3157 us | 3797 us | +20% | 499549 | 249971 |
| 1 in 8 | 1082 us | 1105 us | 1471 us | +33% | 874707 | 109724 |
| 1 in 128 | 416 us | 420 us | 468 us | +11% | 992148 | 7750 |
| 1 in 1024 | 327 us | 326 us | 396 us | +21% | 998482 | 958 |
As expected, chunky_remove_if loses when the expected length of a memmove is short.
But it also loses when the expected length of a memmove is long! And, counterintuitively,
as we remove fewer elements (and thus execute more individual assignments in the “smooth”
case and fewer individual memmoves in the “chunky” case), smooth_remove_if gets faster and
faster, and chunky_remove_if’s performance penalty doesn’t budge.
I tentatively blame this on the branch predictor. The fewer elements we remove, the more
predictable the result of pred(*first). Removing half the elements is the worst case for
smooth_remove_if simply because that turns it into a tight loop over a single completely
unpredictable branch. Making that branch more predictable dominates every other consideration,
including any speedup we might get from memmove (which, after all, is also a tight loop over
a mostly predictable branch: “have we counted all the way to n yet?”)
Conclusion: “Chunking” the elements moved by remove_if into larger memmoves
isn’t an optimization; it’s a pessimization.
This doesn’t directly say anything about unstable_remove, but it does suggest that
when it comes to benchmarking remove-related algorithms, we ought not to rabbit-hole on
simply minimizing the number of move-assignments; that kind of thing may be outweighed
by cache and branch-prediction effects.
