The STL is more than `std::accumulate`

`std::accumulate`

Conor Hoekstra gives great talks on algorithms. Notably, “Algorithm Intuition” (C++Now 2019) and “Better Algorithm Intuition” (code::dive 2019). However, every time I watch one of his talks where he uses STL algorithms to solve some programming problem, I come away feeling like

Or maybe

Let’s look at two examples of “using STL algorithms” in this limited sense, and how we might solve the same problems differently (maybe using the STL, maybe not). Both of these examples are taken from Conor’s CppCon 2020 talk “Structure and Interpretation of Computer Programs: SICP.”

## Sum of two largest squares

Circa 36m55s, Conor finishes polishing a function to compute the sum of the squares of the two largest integers in a list.

```
// IMP1
auto solution(auto const& v) {
auto const [a, b] = std::accumulate(
std::cbegin(v),
std::cend(v),
std::pair{0, 0},
[](auto acc, auto e) {
auto [a, b] = acc;
if (e > a) { b = a; a = e; }
else if (e > b) { b = e; }
return std::pair{a, b};
});
return a * a + b * b;
}
```

“This code makes me really happy. Some would argue that it’s not as nice as a

`for`

loop, and I would have to vehemently disagree.”

To me, that’s not “STL” code; that’s just open-coding with extra steps.
You know how in a pre-Coroutines world we have to maintain the “state” of our state machine by hand?
That’s basically what we’re doing here, via `acc`

— it’s just a manual way
of managing the local variables `a`

and `b`

in a way that can be threaded
through `accumulate`

. We could use `for`

to write that *exact same* algorithm
much more simply:

```
// IMP2
auto core_solution(const auto& v) {
int a = 0;
int b = 0;
for (auto&& e : v) {
if (e > a) { b = a; a = e; }
else if (e > b) { b = e; }
}
return a * a + b * b;
}
```

Conor’s `IMP1`

using `std::accumulate`

with an ad-hoc accumulator function
is analogous to a C++17 state machine that manually manages stack frames;
`IMP2`

, using core-language `for`

, is analogous to a C++20 function
using core-language `co_await`

. It lets us unkink our control flow and stop
manually managing our stack frames.

Meanwhile, here’s what I would call the actually “STL classic” solution to Conor’s puzzle:

```
// IMP3
auto stl_solution(const auto& v) {
int top[2] = {};
std::ranges::partial_sort_copy(v, top, std::greater());
return std::inner_product(top, std::end(top), top, 0);
}
```

This directly translates the problem statement: “Copy `v`

’s top 2 elements
to `top`

; then compute the sum of the squares of `top`

’s elements.”

Unfortunately, both libstdc++ and libc++ produce absolutely awful code for this
formulation: their `partial_sort_copy`

implementations are very inliner-hostile.
In fact, libstdc++’s `std::ranges::partial_sort_copy`

is somehow worse than their
`std::partial_sort_copy`

— as of this writing, you can actually generate strictly better code
by replacing

```
std::ranges::partial_sort_copy(v, top, std::greater());
```

with

```
std::partial_sort_copy(std::begin(v), std::end(v),
top, std::end(top), std::greater());
```

You’re probably wondering why I bothered to write `std::end(top)`

when I could have written `top+2`

, right? Well, I wanted to highlight
that it is trivial to extend `stl_solution`

to compute the sum of squares of
the top 3 values, or the top 10 values. All you have to do is change the bound
of the array variable `top`

in one place.
Everything else in this (three-line) function just keeps working.
Consider how much you’d need to change either of the open-coded solutions to sum
the squares of, say, the top 10 values.

*This*, to me, is the benefit of using STL algorithms and containers. It’s not that
they will give you the absolute best codegen. It’s that they’ll give you an appropriate
level of maintainability, while generating code that costs *no more* than whatever
you might have written by hand (with that same level of maintainability).

STL algorithms, used correctly, are *self-documenting.* Compare `IMP2`

to `IMP3`

;
which one is easier to understand? (I claim the answer is “`IMP3`

,” because it
is a line-by-line translation of the problem statement.) Now compare
`IMP2`

to `IMP1`

; which one is easier to understand? I claim that *at best* they
are equally comprehensible. In practice, `IMP1`

is harder to understand, because it
has all the stateful-variable ickiness of `IMP2`

*plus* the reader must understand
what’s happening with that `std::accumulate`

.

## Leibniz pi approximation

Circa 49m38s, Conor presents this implementation of a pi-calculating algorithm:

```
// IMP1
auto leibniz_pi_approximation(int n) {
return (rv::iota(0, n)
| rv::transform([](auto e){ return 1 + 2 * e;})
| rv::chunk(2)
| rv::transform([](auto rng){ return 1.0 / (rng[0] * rng[1]); })
| hs::accumulate(0.0, std::plus{})) * 8;
}
```

At 52m42s, there’s an “iteratively refined” version of the same algorithm. Conor doesn’t explicitly say so, but the big advantage of this version over the previous is that this one produces decent codegen.

```
// IMP2
auto leibniz_pi_approximation2_alt(int n) {
return (rv::iota(0, n)
| rv::transform([s = -1.0](auto e) mutable {
s *= -1;
return s / (1 + 2 * e); })
| hs::accumulate(0.0, std::plus{})) * 4;
}
```

Notice that this code has the same overall outline as the “sum of squares”
program: we’re iterating over our input sequence and mutating a manually
managed “stack frame” of variables (in this case just `s`

) which conceptually
live outside the loop.

In “core-language” C++, the same algorithm might look something like this:

```
// IMP3
double leibniz_pi_approximation(int n) {
double sum = 0.0;
double s = -1.0;
for (int e = 0; e < n; ++e) {
s = -s;
sum += s / (1 + 2 * e);
}
return sum * 4;
}
```

And we could re-obfuscate the code using `std::accumulate`

like this:

```
// IMP4
double leibniz_pi_approximation(int n) {
auto [sum, s] = rv::iota(0, n)
| hs::accumulate(
std::pair(0.0, 1.0),
[](auto acc, auto e) {
return std::pair(acc.first + acc.second/(1+2*e), -acc.second);
});
return sum * 4;
}
```

All of these versions have basically the same codegen at `-O3`

(but of course
the non-Ranges version `IMP3`

is vastly faster to compile).

By the way, you don’t need mutability just to get a repeating sequence
of `+1, -1`

in range-v3. You could use `rv::cycle`

, like this:

```
// IMP5
double leibniz_pi_approximation(int n) {
int signs[] = {+1, -1};
return (rv::zip_with(
[](auto e, double s) { return s / (1+2*e); },
rv::iota(0), rv::cycle(signs))
| rv::take_exactly(n)
| hs::accumulate(0.0)) * 4;
}
```

However, if you do this, then again the codegen suffers, compared to the baseline
`for`

-loop (`IMP3`

) or the two “manual stack frame management” solutions (`IMP2`

,
`IMP4`

).

Is there a “classic STL” version of this algorithm? I don’t think so, because it’s really
just a numeric operation: it takes in *one* number (not a collection of elements), and it
produces *one* number (not a collection). I think `IMP3`

is the best solution, not just
because its codegen is better, but because it most clearly explains what it’s doing.
The closest I can get to a “classic STL” solution would be, like,

```
// IMP6
double leibniz_pi_approximation(int n) {
std::vector<double> terms(n);
std::iota(terms.begin(), terms.end(), 0);
for (auto&& e : terms) { e = 1.0 / (1+2*e); }
for (auto&& e : terms | rv::stride(2)) { e = -e; }
return -4 * std::accumulate(terms.begin(), terms.end(), 0.0);
}
```

But this is obviously not an acceptable solution, because it heap-allocates for a problem (computing pi) that fundamentally doesn’t require heap-allocation.

Here are the line counts I saw for each version of `leibniz_pi_approximation`

on Godbolt, using GCC 10.2 at `-O2`

:

Version | C++17 | C++20 |
---|---|---|

`IMP1` |
227 | 219 |

`IMP2` |
49 | 49 |

`IMP3` |
35 | 35 |

`IMP4` |
49 | 49 |

`IMP5` |
83 | 83 |

`IMP6` |
180 | 172 |

By the way, I think the problem with `IMP1`

is that the parameter to `rv::chunk(2)`

is a runtime function parameter, whereas we really want it to be a compile-time
(template) parameter. If there were an `rv::chunk<2>()`

, we could eliminate a
whole heap-allocation there. Something like this:

```
template<int N>
struct chunk {
friend auto operator|(auto&& lhs, const chunk& rhs) {
return rv::zip_with(
[](auto x, auto y) { return std::array{x, y}; },
lhs | rv::stride(2),
lhs | rv::drop(1) | rv::stride(2)
);
}
};
```

This modification chops `IMP1`

’s codegen from 219 lines to 176 —
in the vicinity of `IMP6`

— but it’s still an order of magnitude worse
than the “good” versions.