std::priority_queue is missing an operation

About a year ago, I read this great paper by Melissa O’Neill titled “The Genuine Sieve of Eratosthenes” (2009). She explains how most programs (especially those written by beginners, but also those written by experts) that claim to implement a “sieve” are really implementing a different algorithm, which she names the “unfaithful sieve” — and then goes on to show that the unfaithful sieve is asymptotically even worse than the naïve “trial division” algorithm.
Here’s how the paper describes the genuine algorithm:

Whereas the original [unfaithful-sieve] algorithm crosses off all multiples of a prime at once, we perform these “crossings off” in a lazier way: crossing off just-in-time. For this purpose, we will store a table in which, for each prime p that we have discovered so far, there is an “iterator” holding the next multiple of p to cross off. Thus, instead of crossing off all the multiples of, say, 17, at once (impossible, since there are infinitely many for our limit-free algorithm), we will store the first one (at 17 × 17; i.e., 289) in our table of upcoming composite numbers. When we come to consider whether 289 is prime, we will check our composites table and discover that it is a known composite with 17 as a factor, remove 289 from the table, and insert 306 (i.e., 289 + 17). In essence, we are storing “iterators” in a table keyed by the current value of each iterator.
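To make the quoted idea concrete, here is a minimal bounded sketch of that table-of-iterators scheme (my own illustration, not code from the paper or from my program), using a std::map keyed by the next value to be crossed off:

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

int main()
{
    // Key: the next composite to be crossed off. Value: the primes whose
    // "iterators" are currently parked at that composite.
    std::map<int64_t, std::vector<int64_t>> composites;
    for (int64_t n = 2; n < 300; ++n) {
        auto it = composites.find(n);
        if (it == composites.end()) {
            std::cout << n << '\n';              // n is prime
            composites[n * n].push_back(n);      // first crossing-off is at n*n
        } else {
            for (int64_t p : it->second) {
                composites[n + p].push_back(p);  // advance p's iterator one step
            }
            composites.erase(it);
        }
    }
}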
So naturally I went and implemented O’Neill’s “Genuine Sieve” algorithm in C++, and posted the resulting code on StackOverflow to get feedback. Let’s pass over the surprisingly negative review comments and skip straight to what I want to talk about. :)
My C++14 code has two fundamental pieces, both of which match the “iterator” or “unbounded range” concepts. The first piece is iotarator, which generates an unbounded range of increasing integers:
template<typename Int>
class iotarator {
    Int value = 0;
public:
    explicit iotarator() = default;
    explicit iotarator(Int v) : value(v) {}
    Int operator*() const { return value; }
    iotarator& operator++() { value += 1; return *this; }
    iotarator operator++(int) { auto ret = *this; ++*this; return ret; }
    bool operator==(const iotarator& rhs) const { return value == rhs.value; }
    bool operator!=(const iotarator& rhs) const { return !(*this == rhs); }
};
The second piece is sieverator, which fits over the end of an iotarator and filters its range according to O’Neill’s sieve algorithm. Then, the program to produce an infinite stream of primes to stdout looks like this:
int main()
{
    iotarator<__int128_t> iota(2);
    sieverator<__int128_t> sieve(iota);
    for (int p : sieve) {
        std::cout << p << std::endl;
    }
}
As written, my original program produced the first 10,000,000 primes in about 97 seconds.
$ time ./a.out | head -10000000 | tail -1
179424673
real 1m36.686s
user 1m38.039s
sys 0m38.026s
Confronted with the fact that my “genuine sieve” program was horribly slow compared to an expert’s quick-and-dirty implementation of the “unfaithful sieve”, I (well, first I waited a year, but then I) started looking at where the bottlenecks actually were in my original code.
Obvious bottleneck 1: iostreams
Replacing std::endl with '\n' reduces the running time from 97 seconds to 71 seconds. Replacing std::cout with printf("%d\n", p) reduces the running time a tiny bit more, to 67 seconds.
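For concreteness, the output loop after both tweaks looks something like this:

for (int p : sieve) {
    printf("%d\n", p);  // '\n', unlike std::endl, doesn't flush the stream on every line
}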
Obvious bottleneck 2: premature pessimizations
My original code did at least three things that might have been slowing down my benchmark numbers:
- Using __int128_t (which should be “wide enough for anyone”) instead of int64_t (which happens to be wide enough for this particular benchmark, and is of a convenient size to do multiplications quickly on x86-64).
- Using a weird type-erased iterator pattern in order to make sieverator<Int> not depend on the type of iterator it’s filtering. In reality, I don’t think it even makes sense to filter anything except an iotarator… and anyway, there’s nothing wrong with templating sieverator on the iterator type — sieverator<Int, Iterator> (see the sketch after this list). Premature type-erasure is definitely an antipattern I should not have fallen into.
- My original iotarator took a step parameter, which in practice was always set to 1. (I removed this parameter already, in the code I quoted above, but you can see it in my original program.) I bet the compiler inlined away all the references to this parameter anyway, but just in case, let’s remove it.
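Concretely, after these changes the driver might look like this (a sketch; the two-argument sieverator<Int, Iterator> is the templated form proposed in the second bullet):

#include <cstdint>
#include <cstdio>

int main()
{
    iotarator<int64_t> iota(2);                           // int64_t, not __int128_t
    sieverator<int64_t, iotarator<int64_t>> sieve(iota);  // templated on the iterator type
    for (int64_t p : sieve) {
        printf("%lld\n", (long long)p);
    }
}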
After undoing these premature pessimizations, we’re down to 50 seconds.
Bottleneck 3: __sift_down
By the way, if you don’t know how a min-heap works, now would be a good time to learn. This blog post will assume you already know.
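(If you just want a one-glance refresher: a min-heap lives in an array where the children of the element at index i sit at indices 2i+1 and 2i+2, and every parent is no greater than its children. A tiny self-contained illustration of mine:)

#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

int main()
{
    std::vector<int> v = {2, 5, 3, 9, 7, 4};
    // 2's children are 5 and 3; 5's children are 9 and 7; 3's child is 4.
    assert(std::is_heap(v.begin(), v.end(), std::greater<>()));  // a valid min-heap
    assert(v.front() == 2);  // the minimum is always at the root, v[0]
}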
At this point, I decided to open up Instruments Time Profiler and look at where the hotspot actually was. (Amdahl’s Law for the win!) It turned out that we were spending 94% of our time in sieverator<__int128>::is_already_crossed_off(__int128), which is not surprising; that’s where the real work gets done. sieverator looks something like this:
template<class T> using min_heap =
    std::priority_queue<T, std::vector<T>, std::greater<>>;

// The heap's element type: for each discovered prime, the next multiple
// of it that remains to be crossed off. std::greater<> compares via
// operator> below, so the heap's top is the smallest upcoming composite.
struct pair {
    Int next_crossed_off_value;
    Int prime_increment;
    pair(Int v, Int p) : next_crossed_off_value(v), prime_increment(p) {}
    bool operator>(const pair& rhs) const {
        return next_crossed_off_value > rhs.next_crossed_off_value;
    }
};
min_heap<pair> m_pq;

sieverator& operator++() {
    cross_off_multiples_of_prime(m_current);
    do {
        ++m_iter;
        m_current = *m_iter;
    } while (is_already_crossed_off(m_current));
    return *this;
}

void cross_off_multiples_of_prime(Int value) {
    m_pq.emplace(value * value, value);
}

bool is_already_crossed_off(Int value) {
    if (value != m_pq.top().next_crossed_off_value) {
        return false;
    } else {
        do {
            auto x = m_pq.top();
            m_pq.pop();
            m_pq.emplace(x.next_crossed_off_value + x.prime_increment, x.prime_increment);
        } while (value == m_pq.top().next_crossed_off_value);
        return true;
    }
}
What is a bit concerning is that this function was spending 58% of its time in a libc++ library routine called __sift_down! This is because when we pop an element from the priority queue, the Standard specifies that we’ll do it by moving the last element to the top (replacing the top element, which we are popping off), and then sifting that element down until it reaches its proper place in the new, one-element-shorter heap. Since it was on the bottom of the heap to begin with, its proper place is probably going to be near the bottom again; so each pop costs us log N operations.
And then we immediately emplace a new element, which goes to the end of the heap and then sifts up to its proper place! (This is likely to be a no-op: the value we are emplacing is very large, and its proper place will be near the bottom of the heap.)
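Here’s a tiny worked example of those two sifts (my own illustration), using the standard heap algorithms on a five-element min-heap:

#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

int main()
{
    // Pop-then-push 10 into the min-heap {1, 3, 2, 9, 4}:
    std::vector<int> v = {1, 3, 2, 9, 4};
    std::pop_heap(v.begin(), v.end(), std::greater<>());   // sift-down: {2, 3, 4, 9, 1}
    v.back() = 10;
    std::push_heap(v.begin(), v.end(), std::greater<>());  // sift-up: a no-op here
    for (int x : v) printf("%d ", x);                      // prints "2 3 4 9 10"
}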
So, for each element that we pop-and-then-reemplace in our priority queue, we are re-shuffling the heap twice. First we sift down an arbitrary element that was already in its proper place to begin with; and then we add our new element and sift it up to its proper place. Obviously it would be more efficient if we just replaced the top element with our new element and sifted it down! But this operation is missing from the standard priority_queue, and even missing from the standard <algorithm> header.

In other words, the operation that we are really trying to do is this:
namespace nonstd {

template<class T, class Container, class Compare>
class priority_queue : public std::priority_queue<T, Container, Compare>
{
public:
    template<class... Args>
    void reemplace_top(Args&&... args) {
        this->pop();
        this->emplace(std::forward<Args>(args)...);
    }
};

} // namespace nonstd
(Here’s the code.) We can rewrite this in terms of the standard heap algorithms:
namespace nonstd {

template<class T, class Container, class Compare>
class priority_queue : public std::priority_queue<T, Container, Compare>
{
public:
    template<class... Args>
    void reemplace_top(Args&&... args) {
        // c and comp are protected members of the dependent base class,
        // so they must be named via this->.
        std::pop_heap(this->c.begin(), this->c.end(), this->comp);
        this->c.back() = T(std::forward<Args>(args)...);
        std::push_heap(this->c.begin(), this->c.end(), this->comp);
    }
};

} // namespace nonstd
But that doesn’t change anything about the bad behavior yet. We’re still calling __sift_down in pop_heap, and then calling __sift_up in push_heap. Two sifts, where we want just one.
Now, because the Standard doesn’t actually specify the exact way in which the library represents a min-heap in terms of a sequence container, this is where we have to go slightly non-portable. Let’s assume that a min-heap is represented in the “obvious” way: the root of the heap is at c.front() (this much is guaranteed), and then the root’s children are at c[1] and c[2] (this is technically not guaranteed, but it is “obvious”), and so on. So we can write our own versions of sift_down and sift_up that will interoperate seamlessly with any library’s push_heap and pop_heap, as long as the library doesn’t go out of its way to be malicious.
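For instance, here is a minimal sketch (mine) of such a sift_down over that “obvious” layout; it takes the same comparator sense as the standard heap algorithms, so passing std::greater<> maintains a min-heap. sift_up is analogous.

#include <cstddef>
#include <utility>

template<class RandomIt, class Compare>
void sift_down(RandomIt first, RandomIt last, Compare comp)
{
    std::size_t n = last - first;
    std::size_t i = 0;  // sift the root element down to its proper place
    while (true) {
        std::size_t child = 2*i + 1;
        if (child >= n) return;
        if (child + 1 < n && comp(first[child], first[child + 1])) {
            ++child;  // pick the child that belongs closer to the top
        }
        if (!comp(first[i], first[child])) return;  // heap property holds; done
        std::swap(first[i], first[child]);
        i = child;
    }
}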
Once we have these sift_up and sift_down algorithms, we can avoid the extra sift. Assuming for convenience that our sift_down is spelled exactly the same way that libc++ spells it, we’ll write this:
namespace nonstd {

template<class T, class Container, class Compare>
class priority_queue : public std::priority_queue<T, Container, Compare>
{
public:
    template<class... Args>
    void reemplace_top(Args&&... args) {
        using std::__sift_down;  // Don't try this at home. Or do. Whatever.
        this->c.front() = T(std::forward<Args>(args)...);
        __sift_down(this->c.begin(), this->c.end(), this->comp,
                    this->c.size(), this->c.begin());
    }
};

} // namespace nonstd
(Here’s the code.) The portable version (with two sifts) finds the 10,000,000th prime in about 50 seconds. The version above (with one sift) does the same thing in 28 seconds!
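(In the sieve itself, the inner loop of is_already_crossed_off now performs one reemplace_top per crossed-off entry, instead of a pop plus an emplace; a sketch, reusing the member names from earlier:)

do {
    auto x = m_pq.top();
    // One sift-down, instead of pop()'s sift-down plus emplace()'s sift-up:
    m_pq.reemplace_top(x.next_crossed_off_value + x.prime_increment, x.prime_increment);
} while (value == m_pq.top().next_crossed_off_value);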
This huge speedup surprises me, because I argued above that the second sift (the sift-up of our newly inserted item) should be a no-op. And that’s all we’re skipping here. We’re still sifting down a very big element from the root down to the bottom half of the heap; the only difference is that we’re not then going and touching another element at the very tail end of the heap. And somehow, omitting that touch saves us about half of our running time! Either my analysis is wrong, or this is some sort of cache effect. I really cannot fully explain it. (The Time Profiler shows that we are now spending 81% of our time in __sift_down, as opposed to the original 58%. This matches our intuition that we eliminated a __sift_up that was not showing up in the profile, increasing the relative proportion of time spent inside __sift_down. The __sift_up wasn’t showing up in the profile because of inlining.)
Anyway, we have now knocked our original running time of 97 seconds down to an improved time of 28 seconds. Here’s the itemized bill:
essential complexity           28 seconds
premature pessimization tax   +17 seconds
priority_queue tax            +22 seconds
iostreams tax                 +30 seconds
------------------------------------------
Total:                         97 seconds
What do you think — should the STL’s priority_queue be extended to include a replace_top operation?

Here’s a libc++ patch that implements reemplace_top, plus a matching <algorithm>-header algorithm which I’m calling poke_heap because I can’t come up with any really good name. Feel free to use it, until your friendly neighborhood STL vendor includes it in their implementation.