Bit-vector manipulations in standard C++

Recently on the std-proposals mailing list, Madhur Chauhan proposed that C++ needs a better way to find the first set bit in an arbitrary-length bit-string. Today, if you want to manipulate bit-strings of length N, you have these options:

For N <= 64, use unsigned long long.
On some platforms, for N <= 128, use __uint128.
On Clang 13, use _ExtInt(N). (Clang 14 drops support for N > 128, which removes any reason to use _ExtInt as far as I know.)
Use something like my Wider<T>, which is statically sized and stack-allocated.
Use std::bitset<N>, which is statically sized and stack-allocated.
Use std::vector<bool>, which is dynamically resizeable and heap-allocated.
Use boost::dynamic_bitset<>, which is dynamically resizeable and heap-allocated.

Suppose we have a bit-string stored in one of these ways, and we want to find the second-lowest 1-bit.

Conventionally, the lowest-order bit will be (x & 1) for integer-like types which are written high-to-low, but x[0] for sequence-like types which are written low-to-high. So the operation that for__uint128 and std::bitset<N> we spell >> (“right-shift”) is, for vector<bool>, spelled std::shift_left.

We’ll generalize the problem by defining a function f(n, x) that finds the lowest 1-bit in x whose index is greater than or equal to the given index n, or returns some large value if no such 1-bit exists. The original problem, of finding the index of the second-lowest 1-bit in x, becomes:

y = f(f(0, x) + 1, x);
if (y < N) ~~~

Here are the current state-of-the-art methods for implementing f, as far as I know. Notice that I’ll use int for bit indices, since 2147483647 bits ought to be enough for anyone. Godbolt:

int f(int n, unsigned long long x) {
    return n >= N ? N : std::countr_zero(x >> n) + n;
}

#ifndef _MSC_VER
static int f(int n, __uint128_t x) {
    constexpr unsigned long long mask = ~0uLL;
    return (x == 0) ? 128 :
           (n < 64 && (x & mask) != 0) ?
               std::countr_zero((unsigned long long)(x >> n)) + n :
           f(n - 64, x >> 64) + 64;
}
#endif

int f(int n, const Wider<T>& x) {
    Wider<T> mask = Wider<T>(1) << n;
    for (int i = n; i < N; ++i, mask <<= 1) {
        if (x & mask) return i;
    }
    return N;
}

int f(int n, const std::bitset<N>& x) {
#ifdef __GLIBCXX__
    return (n == 0) ? x._Find_first() : x._Find_next(n-1);
#else
    for (int i = n; i < N; ++i) {
        if (x[i]) return i;
    }
    return N;
#endif
}

static int f(int n, const std::vector<bool>& x) {
    if (n >= N) return N;
    return std::find(x.begin() + n, x.end(), true) - x.begin();
}

static int f(int n, const boost::dynamic_bitset<>& x) {
    if (n >= N) return N;
    return (n == 0) ? x.find_first() : x.find_next(n-1);
}

For unsigned long long, we can’t just call std::countr_zero(x >> n) because n might be out of range, making x >> n undefined behavior.

libc++ provides std::countr_zero for __uint128_t, but GNU libstdc++ does not, and MSVC doesn’t support __uint128_t at all. See also “Is __int128 integral?” (2019-02-28).

GNU libstdc++ provides implementation-detail methods _Find_first and _Find_next on std::bitset, with exactly the same semantics as the public find_first and find_next methods on boost::dynamic_bitset. Oddly, libstdc++ does not provide those methods for their vector<bool>, even though the two types’ elements are laid out in the same way.

STL algorithms for bit iterators

libc++ provides clever specializations of certain STL algorithms for bit iterators specifically, so that they can exploit full-word-length instructions instead of extracting “elements” bit by bit. For example, std::find can use rep bsfq, and std::count can use popcntq. The full list of algorithms that libc++ optimizes is: std::find, count, fill, fill_n, copy, copy_backward, move, move_backward, swap_ranges, rotate, and equal. Oddly, as of this writing libc++ does not optimize std::mismatch for bit iterators, even though mismatch can be considered a building block of equal. Maybe a mismatch optimization will be added along the way to implementing vector<bool>::operator<=>. Even more oddly, libc++ fails to optimize copy_n, so in Clang 15 std::copy(first, first+n, dest) is 1000x faster than std::copy_n(first, n, dest) when first is a bit iterator. I’ve submitted a patch for the issue and expect it’ll be fixed soon.

You might expect libraries also to specialize std::all_of, any_of, and none_of. The problem with those algorithms is that they don’t default their Predicate argument: you must provide a predicate, making any_of more like count_if than like count. Thus, on libc++, std::count(v.begin(), v.end(), false) == 0 will run much faster than std::ranges::all_of(v, std::identity()).

Microsoft STL optimizes fill, fill_n, find, and count for bit iterators. (Search the code for uses of _Is_vb_iterator.)

libstdc++ optimizes only std::fill for bit iterators.

From the vantage point of 2022, you might think that std::bitset<N> is to std::vector<bool> in the same way that std::array<T, N> is to std::vector<T>. That is, it uses stack storage with a fixed size instead of dynamic storage, but ought to provide basically the same iterator API otherwise. Sadly, this is not the case!

std::bitset is one of those “not quite really STL” types, like std::string and std::valarray, that tries to fit a couple of use-cases but none of them are “STL iterable container.” Arguably, the hint is in its name: whereas vector<bool> is clearly a vector, a sequence container, of boolean values, bitset is a set, a collection of small integer indices. bs.test(42) is the equivalent of s.find(42); bs.set(42) is the equivalent of s.insert(42); and so on. (As usual, these named methods throw std::out_of_range on error, and bitset provides operator[] for faster unchecked access.)

If you were going to iterate over the “contents” of such an object, what would you expect to see?

std::bitset<1000> bs;
bs.set(42);
for (auto elt : bs) {
    std::cout << bs;
}

Surely you wouldn’t expect to see “false, false, false, false, …”! So std::bitset (and boost::dynamic_bitset too) simply don’t provide iterators at all. Instead, dynamic_bitset provides find_first and find_next, and libstdc++’s bitset provides secret _Uglified versions of the same methods.

Like a Python set, bitset provides overloaded &, |, ^, ~ that perform intersection, union, symmetric-difference, and invert operations. But then, as if the type author were free-associating on the other meanings of those operators, it goes on to provide << and >>. Which makes no sense at all if you’re thinking of bitset either as a sequence container of bits (like vector<bool>) or a set of indices (as we were just doing in the previous paragraph), but perfect sense if you’re thinking of it as an integer type.

std::bitset<N> ends up behaving similarly to an integer type with N bits… except that it can’t do the full suite of arithmetic operations out of the box. You can still implement those operations tediously by hand. (Thanks to Maciej Hehl on StackOverflow for this code. Note that adding operators to a type we don’t own, such as bitset, is very ill-advised in practice; we’d do better to name this function plus, or wrap the bitsets in a class of our own devising.)

template<size_t N>
auto operator+(const std::bitset<N>& a, const std::bitset<N>& b) {
    auto carry = a & b;
    auto result = a ^ b;
    while (carry.any()) {
        auto shifted = carry << 1;
        carry = result & shifted;
        result ^= shifted;
    }
    return result;
}

For a move-semantic type like string or set<int>, we’d want to pass a by value and accumulate the result in a instead of making a whole new copy. But for giant trivial stack-storage types like bitset and array, move buys us nothing and NRVO buys us much. So, for these types, we pass by reference and return in ways that can be NRVO’ed.

Conclusion

There’s not much to conclude here, except that C++ currently has a bunch of ways to do bit-strings, and none of them are particularly well crafted. vector<bool> is iterable but not arithmetic’able; bitset<N> is arithmetic’able but fails to be iterable or even resizable; different library vendors provide vastly different performance profiles for STL algorithms on bit iterators.

Madhur and I produced a benchmark (backup) comparing all these different ways of “find-next’ing” in a bit-string of a million elements. Here are the results on libc++; you can see that libc++’s std::find is a huge winner.

----------------------------------------------------
Benchmark               Time        CPU   Iterations
----------------------------------------------------
VectorBoolManual     1862 ns    1862 ns       100000
VectorBoolStdFind      69 ns      44 ns     20286454
BitsetManual         7300 ns    2357 ns       100000
DynBitsetManual      6864 ns    2876 ns       252099
DynBitsetFindnext      73 ns      41 ns     15469004

Here are the results on libstdc++; note that libstdc++’s std::find is a huge loser, even compared to manual iteration:

----------------------------------------------------
Benchmark               Time        CPU   Iterations
----------------------------------------------------
VectorBoolManual     2017 ns    2016 ns       100000
VectorBoolStdFind    8323 ns    3479 ns       100000
BitsetManual         3532 ns    3481 ns       100000
BitsetFindnext         57 ns      30 ns     18880905
DynBitsetManual      3611 ns    2099 ns       331986
DynBitsetFindnext      47 ns      30 ns     29609779

These benchmark numbers were prooduced on VMs at godbolt.org, so they should be taken with a grain of salt especially when comparing across the two benchmark runs. But no amount of salt will erase a 100x speedup or slowdown.

Posted 2022-11-05

benchmarks implementation-divergence llvm standard-library-trivia stl-classic