The abstraction penalty for wide integer math on x86-64
Back in November 2018, following a thread on the std-proposals mailing list, I sat down and wrote out a really simple “wide integer” library.
“Wide integers” are integers of fixed-but-bigger-than-word-length size.
For example, the __uint128_t
supported by GCC and Clang would be
considered a “wide” integer.
Wide integers are not to be confused with “big integers,” i.e., integers of
dynamic (heap-allocated) size.
I can also imagine “wide floating-point,” “wide fixed-point,” “wide rational” (that is, a number expressed and stored as the ratio of two wide integers), and “big rational.” I doubt that “big fixed-point” or “big floating-point” make philosophical sense.
My library provides a single class template: Wider<T>
.
The T
can be either uint64_t
, or some specialization of Wider
.
Each Wider
doubles the bit-width of the unsigned integer represented;
so we can say
using Uint128 = Wider<uint64_t>;
using Uint256 = Wider<Uint128>;
using Uint512 = Wider<Uint256>;
using Uint1024 = Wider<Uint512>;
I looked at Clang’s codegen for __uint128_t
to see what might be the most optimal
codegen, and then tweaked my code to try to achieve the same codegen. It was
actually pretty easy, on Clang/LLVM! On GCC, not so much.
The idea is that we can write plain old C++ code — maybe with a few intrinsics to help out at the lowest level — and then let the compiler do all the hard work of optimizing it. Here’s the code for wide-integer addition:
inline bool producecarry(uint64_t& x, uint64_t y) {
x += y;
return (x < y);
}
inline bool addcarry(bool cf, uint64_t& x, uint64_t y) {
return _addcarry_u64(cf, x, y, (unsigned long long*)&x);
}
template<class Int64>
struct Wider {
Int64 lo;
Int64 hi;
friend bool producecarry(Wider& x, const Wider& y) {
return addcarry(producecarry(x.lo, y.lo), x.hi, y.hi);
}
friend bool addcarry(bool cf, Wider& x, const Wider& y) {
cf = addcarry(cf, x.lo, y.lo);
cf = addcarry(cf, x.hi, y.hi);
return cf;
}
friend Wider& operator+=(Wider& x, const Wider& y) { (void)producecarry(x, y); return x; }
friend Wider operator+(Wider x, const Wider& y) { x += y; return x; }
};
Compile this with Clang trunk and we see that the compiler produces the exact same code for
Uint128 example(const Uint128& x, const Uint128& y)
{
return x + y;
}
__uint128_t example(const __uint128_t& x, const __uint128_t& y)
{
return x + y;
}
And that code is optimal:
movq (%rsi), %rax
addq (%rdi), %rax
movq 8(%rsi), %rdx
adcq 8(%rdi), %rdx
retq
GCC trunk, on the other hand, performs abysmally.
Anywhere there’s a gap between the performance of Wider<T>
and the performance of
a built-in type like __uint128_t
, that’s an opportunity for some compiler writer
to go improve the codegen. In this way, WideIntProofOfConcept
is kind of like
the “Stepanov Abstraction Penalty Benchmark”;
it points to places where the compiler could do better at recognizing peephole-level idioms.
Back in 2018–2019 I filed a bunch of bugs against LLVM inspired by this benchmark: 39968, 40090, 24545, 31754, 40486, 40825. They were all fixed and closed quite expeditiously!
Bug 40908, an ICE on using __int128
as an NTTP,
I actually discovered while writing “Is __int128
integral? A survey”
(2019-02-28). It remains open.
To see the current “penalties” for using Wider<T>
versus the built-in types,
see the README on GitHub.
There’s also a Python script you can download to regenerate
the table using Godbolt’s API, if you want to try other (Godbolt-supported) compilers,
or just to see whether my numbers are up-to-date.
Finally, thanks to Niall Douglas for (two years ago) motivating this whole thing by pointing
out how absymally awful Boost.Multiprecision is! Here is
a reduced version of Niall’s benchmark for a + b
, showing boost::multiprecision::uint128_t
alongside native unsigned __int128
; Wider<uint64_t>
; and the new kid on the block,
Abseil’s absl::uint128
.
Abseil’s absl::uint128
seems to be high-quality. I do detect some places where it gives worse codegen
than Wider<uint64_t>
— for example, try the expression a + -b
—
but ye gods, Abseil is worlds better than Boost.Multiprecision!