std::is_heap could be faster

The other day I was noodling around with some libc++ unit-test code that looked roughly like this (Godbolt):

template<class A>
auto extract_container(A& a) {
  struct UnwrapAdaptor : A { A::container_type& cc = A::c; };
  return UnwrapAdaptor(a).cc;
}

template<class Adaptor>
void test_push_range(bool is_heapified) {
  int in1[] = {1,3,7};
  int in2[] = {2,4,5,6};
  int expected[] = {1,3,7,2,4,5,6};
  Adaptor a;
  a.push_range(in1);
  a.push_range(in2);
  if (auto c = extract_container(a); is_heapified) {
    assert(std::ranges::is_heap(c));
    assert(std::ranges::is_permutation(c, expected));
  } else {
    assert(std::ranges::equal(c, expected));
  }
}

int main() {
  test_push_range<std::stack<int>>(false);
  test_push_range<std::queue<int>>(false);
  test_push_range<std::priority_queue<int>>(true);
}

I tried to extend main to also test some non-default containers. (Incidentally, did you know stack’s default container is deque, not vector?)

  test_push_range<std::stack<int, std::vector<int>>>(false);
  test_push_range<std::stack<int, std::list<int>>>(false);

…And suddenly the unit test no longer compiled!

error: no matching function for call to object of type 'const __is_heap'
   22 |     assert(std::ranges::is_heap(c));
      |            ^~~~~~~~~~~~~~~~~~~~
[...]
note: because 'std::list<int> &' does not satisfy 'random_access_range'

That caught me by surprise. Why should testing whether a range is heapified require random access? It’s simple to implement with only forward traversal, and make_heap/is_heap seems perfectly analogous to sort/is_sorted. is_sorted doesn’t require random access; why should is_heap?

Ranges algorithm          Constraint
sort                      random_access_range
  is_sorted               forward_range
  is_sorted_until         forward_range
make_heap                 random_access_range
  is_heap                 random_access_range (could be forward_range)
  is_heap_until           random_access_range (could be forward_range)
partition                 forward_range
  is_partitioned          input_range
  is_partitioned_until    input_range
unique                    forward_range
  is_uniqued              forward_range
  adjacent_find           forward_range

Note that is_partitioned only needs to view one element at a time, and remember the boolean value of pred(elt). By contrast, is_sorted, adjacent_find, and is_heap need to view two elements at a time; that’s why they can’t handle input_range.

The two “could bes” in the table above seem to indicate a design defect.

Benchmark it!

libc++’s implementation of std::is_heap looks shockingly inefficient: its 24 lines spend a lot of time recomputing subexpressions like __first + __c (to reach the __c’th element) and 2 * __p + 1 (to compute a child’s index from its parent’s). A straight-line implementation, avoiding index arithmetic entirely and just advancing two iterators in lockstep, would have taken only 18 lines:

template <class _Compare, class _ForwardIterator, class _Sentinel>
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _ForwardIterator
__is_heap_until(_ForwardIterator __first, _Sentinel __last, _Compare&& __comp) {
  _ForwardIterator __child = __first;
  if (__child == __last) {
    return __child;
  }
  while (true) {
    ++__child;
    if (__child == __last || __comp(*__first, *__child))
      break;
    ++__child;
    if (__child == __last || __comp(*__first, *__child))
      break;
    ++__first;
  }
  return __child;
}

Surprisingly, libstdc++ and MS STL also spend time adding and dividing (or shifting) when they don’t need to.

Still, just because a piece of code looks inefficient doesn’t always mean that it is inefficient. So I whipped up a simple benchmark:

void BM_vector_half(benchmark::State& state) {
  auto n = state.range(0);
  auto v = std::vector<unsigned>(n);
  std::mt19937 g;
  std::generate(v.begin(), v.end(), g);
  std::make_heap(v.begin(), v.begin() + (n / 2));
  for (auto _ : state) {
    benchmark::DoNotOptimize(v);
    auto it = std::is_heap_until(v.begin(), v.end());
    benchmark::DoNotOptimize(it);
  }
}

vector_full is the same but with make_heap(v.begin(), v.end()). deque_{half,full} are the same but with deque instead of vector.

Here’s the benchmark result before and after applying the patch above to libc++. In this case, our eyes don’t lie: what looks inefficient is inefficient! (Note that our comparator here is less<unsigned>, which is cheap. An expensive comparator could dwarf the cost of the index arithmetic, making the benefit of this patch less perceptible.)

Benchmark     n      CPU time (before)   CPU time (after)      %
vector_half   1K           4155 ns            2658 ns        −36%
vector_half   100K       311560 ns          263807 ns        −15%
vector_half   10M      32487789 ns        26089364 ns        −19%
vector_full   1K           8813 ns            5294 ns        −39%
vector_full   100K       644535 ns          535987 ns        −16%
vector_full   10M      59771100 ns        52807359 ns        −11%
deque_half    1K           4186 ns            2662 ns        −36%
deque_half    100K       375844 ns          264856 ns        −29%
deque_half    10M      31999750 ns        26152052 ns        −18%
deque_full    1K           9002 ns            5271 ns        −41%
deque_full    100K       619876 ns          523727 ns        −15%
deque_full    10M      65697333 ns        52488343 ns        −20%

Even with the current specification of is_heap, we can achieve this performance today: the algorithm can remain constrained on random_access_range (thus doing nothing for the motivating use-case that opened this post) while internally using nothing more than forward-iterator operations. But once the algorithm uses nothing more than forward-iterator operations… wouldn’t it be nice for the Standard to say so?

Conclusions

  • libc++ should improve its is_heap implementation along the lines above, reaping the performance benefit.

  • So should libstdc++ and Microsoft STL, I expect.

  • WG21 should consider relaxing the constraint on is_heap and is_heap_until from “random access” to “forward,” incidentally solving my original use-case. I’ll probably bring a paper to this effect at some point. Note that if such a paper were adopted, all three vendors would have to change their implementations to the faster one.

Posted 2026-05-11