std::is_heap could be faster

The other day I was noodling around with some libc++ unit-test code that looked roughly like this (Godbolt):
```cpp
template<class A>
auto extract_container(A& a) {
  struct UnwrapAdaptor : A { A::container_type& cc = A::c; };
  return UnwrapAdaptor(a).cc;
}

template<class Adaptor>
void test_push_range(bool is_heapified) {
  int in1[] = {1,3,7};
  int in2[] = {2,4,5,6};
  int expected[] = {1,3,7,2,4,5,6};
  Adaptor a;
  a.push_range(in1);
  a.push_range(in2);
  if (auto c = extract_container(a); is_heapified) {
    assert(std::ranges::is_heap(c));
    assert(std::ranges::is_permutation(c, expected));
  } else {
    assert(std::ranges::equal(c, expected));
  }
}

int main() {
  test_push_range<std::stack<int>>(false);
  test_push_range<std::queue<int>>(false);
  test_push_range<std::priority_queue<int>>(true);
}
```
I tried to extend main to test some non-default containers, too.
(Incidentally, did you know stack’s default container is deque,
not vector?)
```cpp
test_push_range<std::stack<int, std::vector<int>>>(false);
test_push_range<std::stack<int, std::list<int>>>(false);
```
…And suddenly the unit test no longer compiled!

```
error: no matching function for call to object of type 'const __is_heap'
   22 |     assert(std::ranges::is_heap(c));
      |            ^~~~~~~~~~~~~~~~~~~~
[...]
note: because 'std::list<int> &' does not satisfy 'random_access_range'
```
That caught me by surprise. Why should testing whether a range is heapified
require random access? It’s simple to implement with only forward traversal,
and make_heap/is_heap seems to analogize perfectly against sort/is_sorted.
is_sorted doesn’t require random access; why should is_heap?
| Ranges algorithm | Constraint |
|---|---|
| `sort` | `random_access_range` |
| `is_sorted` | `forward_range` |
| `is_sorted_until` | `forward_range` |
| `make_heap` | `random_access_range` |
| `is_heap` | `random_access_range` (could be `forward_range`) |
| `is_heap_until` | `random_access_range` (could be `forward_range`) |
| `partition` | `forward_range` |
| `is_partitioned` | `input_range` |
| `is_partitioned_until` | `input_range` |
| `unique` | `forward_range` |
| `is_uniqued` | `forward_range` |
| `adjacent_find` | `forward_range` |
Note that is_partitioned only needs to view one element at a time, and remember
the boolean value of pred(elt). By contrast, is_sorted, adjacent_find, and is_heap
need to view two elements at a time; that’s why they can’t handle input_range.
The two “could bes” in the table above seem to indicate a design defect.
Benchmark it!
libc++’s implementation of std::is_heap looks shockingly inefficient:
its 24 lines spend a lot of time recomputing subexpressions like __first + __c (to get to the c’th element)
and 2 * __p + 1 (to compute the child index from the parent). A straight-line implementation,
avoiding all arithmetic and just advancing two iterators in lockstep,
would have taken only 18 lines:
```cpp
template <class _Compare, class _ForwardIterator, class _Sentinel>
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _ForwardIterator
__is_heap_until(_ForwardIterator __first, _Sentinel __last, _Compare&& __comp) {
  _ForwardIterator __child = __first;
  if (__child == __last) {
    return __child;
  }
  while (true) {
    ++__child;
    if (__child == __last || __comp(*__first, *__child))
      break;
    ++__child;
    if (__child == __last || __comp(*__first, *__child))
      break;
    ++__first;
  }
  return __child;
}
```
Surprisingly, libstdc++ and MS STL also spend time adding and dividing (or shifting) when they don’t need to.
Still, just because a piece of code looks inefficient doesn’t always mean that it is inefficient. So I whipped up a simple benchmark:
```cpp
void BM_vector_half(benchmark::State& state) {
  auto n = state.range(0);
  auto v = std::vector<unsigned>(n);
  std::mt19937 g;
  std::generate(v.begin(), v.end(), g);
  std::make_heap(v.begin(), v.begin() + (n / 2));
  for (auto _ : state) {
    benchmark::DoNotOptimize(v);
    auto it = std::is_heap_until(v.begin(), v.end());
    benchmark::DoNotOptimize(it);
  }
}
```
vector_full is the same but with make_heap(v.begin(), v.end()).
deque_{half,full} are the same but with deque instead of vector.
Here’s the benchmark result before and after applying the patch above to libc++.
In this case, our eyes don’t lie: what looks inefficient is inefficient!
(Note that our comparator here is less<int>, which is cheap. An expensive comparator
could dwarf the cost of arithmetic, making the benefit of this patch less perceptible.)
| Benchmark | n | CPU time (before) | CPU time (after) | % |
|---|---|---|---|---|
| vector_half | 1K | 4155 ns | 2658 ns | −36% |
| vector_half | 100K | 311560 ns | 263807 ns | −15% |
| vector_half | 10M | 32487789 ns | 26089364 ns | −19% |
| vector_full | 1K | 8813 ns | 5294 ns | −39% |
| vector_full | 100K | 644535 ns | 535987 ns | −16% |
| vector_full | 10M | 59771100 ns | 52807359 ns | −11% |
| deque_half | 1K | 4186 ns | 2662 ns | −36% |
| deque_half | 100K | 375844 ns | 264856 ns | −29% |
| deque_half | 10M | 31999750 ns | 26152052 ns | −18% |
| deque_full | 1K | 9002 ns | 5271 ns | −41% |
| deque_full | 100K | 619876 ns | 523727 ns | −15% |
| deque_full | 10M | 65697333 ns | 52488343 ns | −20% |
Even with the current specification of is_heap, we can achieve this performance today.
The algorithm can remain constrained on random_access_range
(thus doing nothing to solve the motivating use-case that introduced this post)
while internally using nothing more than forward-iterator operations.
But once the algorithm uses nothing more than forward-iterator operations…
wouldn’t it be nice for the Standard to say so?
Conclusions

- libc++ should improve its `is_heap` implementation along the lines above, reaping the performance benefit.
- So should libstdc++ and Microsoft STL, I expect.
- WG21 should consider relaxing the constraint on `is_heap` and `is_heap_until` from “random access” to “forward,” incidentally solving my original use-case. I’ll probably bring a paper to this effect at some point. Note that if such a paper were adopted, all three vendors would have to change their implementations to the faster one.
