The abstraction penalty for wide integer math on x86-64

Back in November 2018, following a thread on the std-proposals mailing list, I sat down and wrote out a really simple “wide integer” library.

“Wide integers” are integers of fixed-but-bigger-than-word-length size.
For example, the `__uint128_t`

supported by GCC and Clang would be
considered a “wide” integer.
Wide integers are not to be confused with “big integers,” i.e., integers of
dynamic (heap-allocated) size.

I can also imagine “wide floating-point,” “wide fixed-point,” “wide rational”
(that is, a number expressed and stored as the *ratio* of two wide integers),
and “big rational.” I doubt that “big fixed-point” or “big floating-point”
make philosophical sense.

My library provides a single class template: `Wider<T>`

.
The `T`

can be either `uint64_t`

, or some specialization of `Wider`

.
Each `Wider`

doubles the bit-width of the unsigned integer represented;
so we can say

```
using Uint128 = Wider<uint64_t>;
using Uint256 = Wider<Uint128>;
using Uint512 = Wider<Uint256>;
using Uint1024 = Wider<Uint512>;
```

I looked at Clang’s codegen for `__uint128_t`

to see what might be the most optimal
codegen, and then tweaked my code to try to achieve the same codegen. It was
actually pretty easy, on Clang/LLVM! On GCC, not so much.

The idea is that we can write plain old C++ code — maybe with a few intrinsics to help out at the lowest level — and then let the compiler do all the hard work of optimizing it. Here’s the code for wide-integer addition:

```
inline bool producecarry(uint64_t& x, uint64_t y) {
x += y;
return (x < y);
}
inline bool addcarry(bool cf, uint64_t& x, uint64_t y) {
return _addcarry_u64(cf, x, y, (unsigned long long*)&x);
}
template<class Int64>
struct Wider {
Int64 lo;
Int64 hi;
friend bool producecarry(Wider& x, const Wider& y) {
return addcarry(producecarry(x.lo, y.lo), x.hi, y.hi);
}
friend bool addcarry(bool cf, Wider& x, const Wider& y) {
cf = addcarry(cf, x.lo, y.lo);
cf = addcarry(cf, x.hi, y.hi);
return cf;
}
friend Wider& operator+=(Wider& x, const Wider& y) { (void)producecarry(x, y); return x; }
friend Wider operator+(Wider x, const Wider& y) { x += y; return x; }
};
```

Compile this with Clang trunk and we see that the compiler produces the exact same code for

```
Uint128 example(const Uint128& x, const Uint128& y)
{
return x + y;
}
__uint128_t example(const __uint128_t& x, const __uint128_t& y)
{
return x + y;
}
```

And that code is optimal:

```
movq (%rsi), %rax
addq (%rdi), %rax
movq 8(%rsi), %rdx
adcq 8(%rdi), %rdx
retq
```

GCC trunk, on the other hand, performs abysmally.

Anywhere there’s a gap between the performance of `Wider<T>`

and the performance of
a built-in type like `__uint128_t`

, that’s an opportunity for some compiler writer
to go improve the codegen. In this way, `WideIntProofOfConcept`

is kind of like
the “Stepanov Abstraction Penalty Benchmark”;
it points to places where the compiler could do better at recognizing peephole-level idioms.

Back in 2018–2019 I filed a bunch of bugs against LLVM inspired by this benchmark: 39968, 40090, 24545, 31754, 40486, 40825. They were all fixed and closed quite expeditiously!

Bug 40908, an ICE on using `__int128`

as an NTTP,
I actually discovered while writing “Is `__int128`

integral? A survey”
(2019-02-28). It remains open.

To see the current “penalties” for using `Wider<T>`

versus the built-in types,
see the README on GitHub.
There’s also a Python script you can download to regenerate
the table using Godbolt’s API, if you want to try other (Godbolt-supported) compilers,
or just to see whether my numbers are up-to-date.

Finally, thanks to Niall Douglas for (two years ago) motivating this whole thing by pointing
out how absymally awful Boost.Multiprecision is! Here is
a reduced version of Niall’s benchmark for `a + b`

, showing `boost::multiprecision::uint128_t`

alongside native `unsigned __int128`

; `Wider<uint64_t>`

; and the new kid on the block,
Abseil’s `absl::uint128`

.

**Abseil’s absl::uint128 seems to be high-quality.** I do detect some places where it gives worse codegen
than

`Wider<uint64_t>`

— for example, try the expression `a + -b`

—
but ye gods, Abseil is *worlds*better than Boost.Multiprecision!