std::is_heap could be faster

The other day I was noodling around with some libc++ unit-test code that looked roughly like this (Godbolt):
```cpp
template<class A>
auto extract_container(A& a) {
  struct UnwrapAdaptor : A { A::container_type& cc = A::c; };
  return UnwrapAdaptor(a).cc;
}

template<class Adaptor>
void test_push_range(bool is_heapified) {
  int in1[] = {1,3,7};
  int in2[] = {2,4,5,6};
  int expected[] = {1,3,7,2,4,5,6};
  Adaptor a;
  a.push_range(in1);
  a.push_range(in2);
  if (auto c = extract_container(a); is_heapified) {
    assert(std::ranges::is_heap(c));
    assert(std::ranges::is_permutation(c, expected));
  } else {
    assert(std::ranges::equal(c, expected));
  }
}

int main() {
  test_push_range<std::stack<int>>(false);
  test_push_range<std::queue<int>>(false);
  test_push_range<std::priority_queue<int>>(true);
}
```
I tried to extend main to test some non-default containers, too.
(Incidentally, did you know stack’s default container is deque,
not vector?)
```cpp
test_push_range<std::stack<int, std::vector<int>>>(false);
test_push_range<std::stack<int, std::list<int>>>(false);
```
…And suddenly the unit test no longer compiled!

```
error: no matching function for call to object of type 'const __is_heap'
   22 |     assert(std::ranges::is_heap(c));
      |            ^~~~~~~~~~~~~~~~~~~~
[...]
note: because 'std::list<int> &' does not satisfy 'random_access_range'
```
That caught me by surprise. Why should testing whether a range is heapified
require random access? It’s simple to implement with only forward traversal,
and make_heap/is_heap seems to analogize perfectly against sort/is_sorted.
is_sorted doesn’t require random access; why should is_heap?
| Ranges algorithm | Constraint |
|---|---|
| `sort` | `random_access_range` |
| `is_sorted` | `forward_range` |
| `is_sorted_until` | `forward_range` |
| `make_heap` | `random_access_range` |
| `is_heap` | `random_access_range` (could be `forward_range`) |
| `is_heap_until` | `random_access_range` (could be `forward_range`) |
| `partition` | `forward_range` |
| `is_partitioned` | `input_range` |
| `is_partitioned_until` | `input_range` |
| `unique` | `forward_range` |
| `is_uniqued` | `forward_range` |
| `adjacent_find` | `forward_range` |
Note that is_partitioned only needs to view one element at a time, and remember
the boolean value of pred(elt). By contrast, is_sorted, adjacent_find, and is_heap
need to view two elements at a time; that’s why they can’t handle input_range.
The two “could bes” in the table above seem to indicate a design defect.
Benchmark it!
libc++’s implementation of std::is_heap looks shockingly inefficient:
its 24 lines spend a lot of time recomputing subexpressions like __first + __c (to get to the c’th element)
and 2 * __p + 1 (to compute the child index from the parent). A straight-line implementation,
avoiding all arithmetic and just advancing two iterators in lockstep,
would have taken only 18 lines:
```cpp
template <class _Compare, class _ForwardIterator, class _Sentinel>
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _ForwardIterator
__is_heap_until(_ForwardIterator __first, _Sentinel __last, _Compare&& __comp) {
  _ForwardIterator __child = __first;
  if (__child == __last) {
    return __child;
  }
  while (true) {
    ++__child;
    if (__child == __last || __comp(*__first, *__child))
      break;
    ++__child;
    if (__child == __last || __comp(*__first, *__child))
      break;
    ++__first;
  }
  return __child;
}
```
Surprisingly, libstdc++ and MS STL also spend time adding and dividing (or shifting) when they don’t need to.
Still, just because a piece of code looks inefficient doesn’t always mean that it is inefficient. So I whipped up a simple benchmark:
```cpp
void BM_vector_half(benchmark::State& state) {
  auto n = state.range(0);
  auto v = std::vector<unsigned>(n);
  std::mt19937 g;
  std::generate(v.begin(), v.end(), g);
  std::make_heap(v.begin(), v.begin() + (n / 2));
  for (auto _ : state) {
    benchmark::DoNotOptimize(v);
    auto it = std::is_heap_until(v.begin(), v.end());
    benchmark::DoNotOptimize(it);
  }
}
```
vector_full is the same but with make_heap(v.begin(), v.end()).
deque_{half,full} are the same but with deque instead of vector.
Here’s the benchmark result before and after applying the patch above to libc++.
In this case, our eyes don’t lie: what looks inefficient is inefficient!
(Note that our comparator here is less<int>, which is cheap. An expensive comparator
could dwarf the cost of arithmetic, making the benefit of this patch less perceptible.)
| Benchmark | n | CPU time (before) | CPU time (after) | % |
|---|---|---|---|---|
| vector_half | 1K | 4155 ns | 2658 ns | −36% |
| vector_half | 100K | 311560 ns | 263807 ns | −15% |
| vector_half | 10M | 32487789 ns | 26089364 ns | −19% |
| vector_full | 1K | 8813 ns | 5294 ns | −39% |
| vector_full | 100K | 644535 ns | 535987 ns | −16% |
| vector_full | 10M | 59771100 ns | 52807359 ns | −11% |
| deque_half | 1K | 4186 ns | 2662 ns | −36% |
| deque_half | 100K | 375844 ns | 264856 ns | −29% |
| deque_half | 10M | 31999750 ns | 26152052 ns | −18% |
| deque_full | 1K | 9002 ns | 5271 ns | −41% |
| deque_full | 100K | 619876 ns | 523727 ns | −15% |
| deque_full | 10M | 65697333 ns | 52488343 ns | −20% |
Even with the current specification of is_heap, we can achieve this performance today.
The algorithm can remain constrained on random_access_range
(thus doing nothing to solve the motivating use-case that introduced this post)
while internally using nothing more than forward-iterator operations.
But once the algorithm uses nothing more than forward-iterator operations…
wouldn’t it be nice for the Standard to say so?
Conclusions

- libc++ should improve its `is_heap` implementation along the lines above, reaping the performance benefit.
- So should libstdc++ and Microsoft STL, I expect.
- WG21 should consider relaxing the constraint on `is_heap` and `is_heap_until` from “random access” to “forward,” incidentally solving my original use-case. I’ll probably bring a paper to this effect at some point. Note that if such a paper were adopted, all three vendors would have to change their implementations to the faster one.
