It’s not always obvious when tail-call optimization is allowed

I initially wrote this as part of a new entry in “A C++ acronym glossary” (2019-08-02), but decided that it would be better to pull it out into its own post.

“TCO” stands for “tail call optimization.” This is a compiler optimization that takes what appears in the source code as a function call (which usually pushes an activation frame onto the call stack) and turns it into a plain old jump (which does not push a frame on the call stack). This is possible only when the function call appears at the very tail end of the function — something like return bar(). The compiler is saying, “Look, I know bar is going to end by returning to its caller; that’s just me, and I have nothing left to do. So let’s just trick bar into returning to my caller!”

Tail call optimization is often possible for return bar() but not for, e.g., return bar()+1.

In C++, it can be very hard for a human to figure out exactly where TCO is allowed to happen. The main reason is non-trivial destructors:

void foo1() { bar(); }  // tail call
void foo2() { std::lock_guard lk(m); bar(); }  // not a tail call

including destructors of temporary objects:

void bar(std::string_view);
void foo1() { bar("hello"); }  // tail call
void foo2() { bar("hello"s); }  // not a tail call

but even when everything is trivially destructible, you might need to adjust the stack pointer or something, thus preventing TCO. Godbolt:

void bar();
void escape(const int&);
void foo1() { escape(42); bar(); }  // tail-call on GCC and MSVC
void foo2() { const int i = 42; escape(i); bar(); }  // not a tail-call

Interestingly, in the above example, GCC and MSVC emit jmp bar for foo1, but Clang and ICC miss that optimization.

You might reasonably ask why we can’t do the same optimization for foo2. I think the reason is that C++ guarantees that every variable (within its lifetime) has a unique address. If we were to implement bar like this:

const int *addr_of_i;
void escape(const int& i) {
    addr_of_i = &i;
}
void bar() {
    int j;
    assert(&j != addr_of_i);
}

then the program could tell whether the implementation had (non-conformingly) put j into the same memory location as i. However, there’s no rule that says j can’t share a memory location with the temporary object produced from 42, since that temporary’s lifetime doesn’t overlap with j’s.


Whether an implementation does TCO is kinda-sorta observable, in the sense that a tail-recursive function might use O(1) stack space with TCO but O(n) without TCO — thus “blowing the stack” when the recursion goes deep enough. However, C++’s abstract machine doesn’t really have any notion of blowing the stack. There’s no conforming way for a C++ program to detect that condition or deal with it.

Sufficiently paranoid C++ code will therefore avoid very deep recursion. One technique for doing this is to do tail-recursion optimization by hand:

int gcd(int x, int y) {
    if (x == 0) return y;
    return gcd(y % x, x);
}

becomes

int gcd(int x, int y) {
top:
    if (x == 0) return y;
    std::tie(x, y) = std::tuple(y % x, x);
    goto top;
}

which in turn becomes

int gcd(int x, int y) {
    while (x != 0) {
        std::tie(x, y) = std::tuple(y % x, x);
    }
    return y;
}

It’s often pragmatically important to do this last step, not just because structured programming makes the code easier for humans to understand, but also because goto is one of the very few C++ constructs that prevents marking the function as constexpr. If you want your function to be constexpr (for whatever reason), you must avoid goto. This is a rare case of C++ being stylistically opinionated.

(If that tie = tuple trick is new to you, you might enjoy my CppCon 2020 talk “Back to Basics: Algebraic Data Types.”.)

Posted 2021-01-09