Better diagnostics through operator-precedence parsing

All C-style programming languages have too many levels of operator precedence. This confuses their programmers, who have to remember not just the basic PEMDAS stuff but the really arcane precedences, too:

x & y == z doesn’t mean (x & y) == z but rather x & (y == z). That is, it generally Does The Wrong Thing.
x == y == z and x < y < z have the “correct” natural precedence, but if you see one of these in code, it’s almost certainly a bug, because it’s not asking whether x, y, and z have the same value; it’s asking whether (x == y) has the same value as z!
Some expressions are simply bizarre. I’d bet you have no intuition for the meaning of x < y | z, do you? Of course you can reason that < has roughly the precedence of ==, and | has roughly the precedence of &, so this should parse roughly the same as x == y & z. But that’s not intuition, it’s reasoning: a conscious process that requires active engagement.
!x && y is another expression that’s easy to reason about, once you’re on the lookout for it; but at a glance it’s easy to mistake for !(x && y).
Even limiting ourselves to simple arithmetic, it’s easy to misinterpret an expression like x / y * z.

Jonathan Müller wrote on this subject in “Operator precedence is broken” (July 2017); it may be worth your time to go read his post before tackling the rest of this one.

All five of the problems above are solvable within the operator-precedence subsystem! The key observation is that all of these ways of “making an expression problematic” boil down to the same thing: the expression contains two adjoining operators whose resolution is non-obvious, such as & next to ==, or / next to * (but not the reverse).

By “next to,” of course I don’t mean that the operators must appear lexically consecutive in the source code: we also want to diagnose x / a->b * z, despite the -> operator’s coming lexically between / and *. It turns out that the meaning of “next to” is exactly captured by this definition:

Two operators @ and $ within the same expression are said to be “next to” each other if parsing the expression with an operator-precedence parser requires at some point comparing the relative precedences of @ and $.

In this post I construct a simple operator-precedence parser using Dijkstra’s shunting-yard algorithm to parse an expression grammar almost, but not quite, entirely unlike C’s; and then modify it to diagnose all five of the issues above (and many more).

Lexer

Here’s a simple lexer for a C-style expression grammar. A convenient thing about C’s lexer is that every operator token is either a single byte, or else a string where every prefix of that string is also a valid operator. For example, <<= is a valid token; and so are its prefixes << and <. Therefore our lexer never needs to “look ahead” more than a single character.
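The prefix property is easy to check mechanically. Here is a minimal sketch; the function name is_prefix_closed and the two sample operator lists are my own, chosen only for illustration:

```python
def is_prefix_closed(operators):
  """Return True if every non-empty proper prefix of every operator is also an operator."""
  ops = set(operators)
  return all(op[:i] in ops for op in ops for i in range(1, len(op)))

# Hypothetical operator sets, for illustration only.
print(is_prefix_closed(['<', '<<', '<<=', '<=', '=', '==']))  # True
print(is_prefix_closed(['<', '<<=']))                          # False: '<<' is missing
```

If a check like this fails for your operator list, a one-character-at-a-time lexer would presumably split the offending token too early.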

Since this lexer is merely an uninteresting prerequisite for the fun parsing part, we’ll impose no structure on our non-operator tokens: alphanumeric tokens like 42, 0x7, foo, if, 5g are all treated uniformly. In fact, since we use Python’s built-in isalnum, we’ll consider underscore to be an (unrecognized) operator, not an identifier.
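A quick look at what isalnum accepts shows why underscore falls on the operator side. The sample strings are the ones from the paragraph above plus two of my own; the groupby line only imitates how the split falls out and is not how the lexer itself works:

```python
from itertools import groupby

samples = ['42', '0x7', 'foo', 'if', '5g', '_', 'foo_bar']
results = {text: text.isalnum() for text in samples}
print(results)
# {'42': True, '0x7': True, 'foo': True, 'if': True, '5g': True, '_': False, 'foo_bar': False}

# Grouping consecutive characters by isalnum imitates the resulting split.
pieces = [''.join(group) for _, group in groupby('foo_bar', key=str.isalnum)]
print(pieces)  # ['foo', '_', 'bar']
```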

OPERATORS = [
  '+', '-', '*', '/', '%',
  '<', '<=', '>', '>=', '==',
  # etc.
]

class Lexer:
  def tokenize(self, chars):
    self.token = ''
    for ch in chars:
      if ch.isspace():
        yield from self.emit_if(True)
      elif ch.isalnum():
        yield from self.emit_if(not self.token.isalnum())
        self.token += ch
      else:
        yield from self.emit_if(self.token.isalnum())
        yield from self.emit_if((self.token + ch) not in OPERATORS)
        self.token += ch
    yield from self.emit_if(True)

  def emit_if(self, b):
    if b and self.token:
      yield self.token
      self.token = ''

Test with:

while True:
  line = input()
  tokens = Lexer().tokenize(line)
  print(' '.join(tokens))

For example, entering “abc*d<=-ef/g” should print “abc * d <= - ef / g.”
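As a cross-check on that output, here is a different technique: a longest-match tokenizer built on a regular expression. This is not the lexer above. SAMPLE_OPERATORS and regex_tokenize are hypothetical names, the pattern is ASCII-only (isalnum is not), and re.findall silently skips characters it doesn't recognize instead of yielding them as tokens.

```python
import re

# Assumption: a short operator list for illustration; a real list would be longer.
SAMPLE_OPERATORS = ['+', '-', '*', '/', '%', '<', '<=', '>', '>=', '==']

def regex_tokenize(text, operators=SAMPLE_OPERATORS):
  # Try the longest operators first, so that '<=' wins over '<'.
  alternatives = sorted(operators, key=len, reverse=True)
  pattern = '[A-Za-z0-9]+|' + '|'.join(re.escape(op) for op in alternatives)
  return re.findall(pattern, text)

print(' '.join(regex_tokenize('abc*d<=-ef/g')))  # abc * d <= - ef / g
```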

Parser (without diagnostics)

Let’s make a simple shunting-yard parser. It will consume a stream of tokens in infix order (such as the stream generated by our previous step’s lexer) and produce a stream of the same tokens rearranged in postfix (RPN) order. In the input stream unary and binary - are represented identically; but in the output stream we’ll distinguish unary operators by the prefix U. Thus the output “a b U- -” will correspond to the input “a - -b,” while the output “a b - U-” will correspond to the input “-(a-b).”

The parser needs to know the precedence of each operator. We encapsulate that comparison in a helper function. Notice that unary operators are given to has_higher_precedence already encoded with the U prefix.

PREC = {
  k: v for v, ks in enumerate([
    ['||'], ['&&'], ['|'], ['^'], ['&'],
    ['==', '!='], ['<', '<=', '>', '>='], ['<=>'],
    ['<<', '>>'], ['+', '-'], ['*', '/', '%'],
    ['.*', '->*'],
    ['U'+t for t in ['+', '-', '!', '~', '*', '&']],
    ['.', '->'],
  ]) for k in ks
}
PREC['('] = -float('inf')

def has_higher_precedence(a, b):
  return PREC[a] >= PREC[b]
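To see what that dictionary comprehension builds, here is a minimal sketch on a three-level table. The names levels and prec, and the choice of operators, are mine and only illustrative; the real table is PREC above.

```python
levels = [
  ['+', '-'],        # level 0: binds loosest
  ['*', '/', '%'],   # level 1
  ['U+', 'U-'],      # level 2: binds tightest
]
# Each operator maps to the index of the list it appears in.
prec = {op: level for level, ops in enumerate(levels) for op in ops}
print(prec)
# {'+': 0, '-': 0, '*': 1, '/': 1, '%': 1, 'U+': 2, 'U-': 2}

print(prec['*'] >= prec['+'])  # True: a stacked '*' is popped when '+' arrives
print(prec['-'] >= prec['+'])  # True: equal levels also pop
print(prec['+'] >= prec['*'])  # False: a stacked '+' stays put when '*' arrives
```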

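This version of has_higher_precedence assumes every operator is left-associative: because ties count as “higher,” an operator already on the stack is popped before an equal-precedence operator is pushed. Supporting right-associative operators such as = or ** or => simply requires inspecting the associativity of a: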

def has_higher_precedence(a, b):
  if a in ['=', '**', '=>']:
    return PREC[a] > PREC[b]
  else:
    return PREC[a] >= PREC[b]
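The effect of that one-character change (> instead of >=) shows up in the postfix output. Here is a stripped-down sketch that handles only operands and binary operators, with no parentheses or unary operators. PREC_DEMO, RIGHT_ASSOC, pops_before, and to_postfix are hypothetical names, and the two-level table is invented for the demonstration.

```python
# Hypothetical table: '-' is left-associative, '**' is right-associative.
PREC_DEMO = {'-': 0, '**': 1}
RIGHT_ASSOC = {'**'}

def pops_before(a, b):
  # a is on top of the stack; b is the incoming operator.
  if a in RIGHT_ASSOC:
    return PREC_DEMO[a] > PREC_DEMO[b]
  return PREC_DEMO[a] >= PREC_DEMO[b]

def to_postfix(tokens):
  out, stack = [], []
  for token in tokens:
    if token in PREC_DEMO:
      while stack and pops_before(stack[-1], token):
        out.append(stack.pop())
      stack.append(token)
    else:
      out.append(token)  # operand
  while stack:
    out.append(stack.pop())
  return ' '.join(out)

print(to_postfix('a - b - c'.split()))    # a b - c -     that is, (a - b) - c
print(to_postfix('a ** b ** c'.split()))  # a b c ** **   that is, a ** (b ** c)
```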

Now the parser itself goes like this:

def infix_to_postfix(tokens):
  stack = []
  expect_primary = True
  for token in tokens:
    if token.isalnum():
      if not expect_primary:
        raise ValueError('got "%s" when a binary operator was expected' % token)
      expect_primary = False
      yield token
    elif token == '(':
      if not expect_primary:
        raise ValueError('got "(" when a binary operator was expected')
      stack.append(token)
    elif token == ')':
      if expect_primary:
        raise ValueError('got ")" when a primary-expression was expected')
      while stack and stack[-1] != '(':
        yield stack.pop()
      if not stack:
        raise ValueError('right parenthesis with no preceding left parenthesis')
      stack.pop()
    elif expect_primary:
      if ('U' + token) not in PREC:
        raise ValueError('unknown unary operator "%s"' % token)
      stack.append('U' + token)
    else:
      if token not in PREC:
        raise ValueError('unknown binary operator "%s"' % token)
      while stack and has_higher_precedence(stack[-1], token):
        yield stack.pop()
      stack.append(token)
      expect_primary = True
  while stack:
    if stack[-1] == '(':
      raise ValueError('left parenthesis with no matching right parenthesis')
    yield stack.pop()

Test with:

while True:
  line = input()
  tokens = Lexer().tokenize((ch for ch in line))
  try:
    print(' '.join(infix_to_postfix(tokens)))
  except ValueError as e:
    print('error:', e)

For example, entering “abc*d<=-ef/g” should print “abc d * ef U- g / <=.”
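Postfix output is hard to eyeball, so it can help to turn it back into fully parenthesized infix. Here is a minimal sketch; postfix_to_infix is a hypothetical helper, and it assumes well-formed input in the format described above (alphanumeric operands, unary operators carrying the U prefix, everything else binary).

```python
def postfix_to_infix(tokens):
  stack = []
  for token in tokens:
    if token.isalnum():
      stack.append(token)                            # operand
    elif token.startswith('U'):
      operand = stack.pop()                          # unary operator such as 'U-'
      stack.append('(%s%s)' % (token[1:], operand))
    else:
      right = stack.pop()                            # binary operator
      left = stack.pop()
      stack.append('(%s %s %s)' % (left, token, right))
  return stack.pop()

print(postfix_to_infix('abc d * ef U- g / <='.split()))  # ((abc * d) <= ((-ef) / g))
print(postfix_to_infix('a b U- -'.split()))              # (a - (-b))
print(postfix_to_infix('a b - U-'.split()))              # (-(a - b))
```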

Adding diagnostics

Since every pair of “adjoining” operators passes through has_higher_precedence, it’s simple to diagnose all the problems I listed at the top of this post. At first we might think to do it via a blacklist of known-problematic pairs:

def has_higher_precedence(a, b):
  if a in ['U!'] and b in ['&&', '||']:
    print("warning: (%sx %s y) is ambiguous" % (a, b))
  elif a in ['/', '%'] and b in ['*']:
    print("warning: (x %s y %s z) is ambiguous" % (a, b))
  elif a in ['&', '^', '|'] and b in ['==', '!=', '<', '<=', '>', '>=']:
    print("warning: (x %s y %s z) is ambiguous" % (a, b))
  elif a in ['<', '<='] and b in ['<', '<=']:
    print("warning: (x %s y %s z) doesn't mean what you think" % (a, b))
  return PREC[a] >= PREC[b]

However, this leaves you vulnerable to failures of imagination. For example, it seems to me that we should also warn about x & y << z (which means x & (y << z), not (x & y) << z). Where in the maze of ifs above should that case be inserted?

A better approach is to diagnose anything that’s not on a whitelist of uncontroversially non-problematic pairs! Of the roughly 667 operator pairs that our has_higher_precedence might ever see, only about half are non-problematic in my book.
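One plausible accounting for the figure of 667 is sketched below. It assumes the count pairs every operator in the PREC table (as the stack-top argument) with every binary operator (as the incoming argument), and leaves out the left parenthesis. LEVELS, all_ops, and binary_ops are names I made up; the operator lists are copied from PREC.

```python
LEVELS = [
  ['||'], ['&&'], ['|'], ['^'], ['&'],
  ['==', '!='], ['<', '<=', '>', '>='], ['<=>'],
  ['<<', '>>'], ['+', '-'], ['*', '/', '%'],
  ['.*', '->*'],
  ['U'+t for t in ['+', '-', '!', '~', '*', '&']],
  ['.', '->'],
]
all_ops = [op for ops in LEVELS for op in ops]
binary_ops = [op for op in all_ops if not op.startswith('U')]

# Left side: any operator that can sit on the stack (binary or unary).
# Right side: only an incoming binary operator triggers a comparison.
print(len(all_ops), len(binary_ops), len(all_ops) * len(binary_ops))  # 29 23 667
```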

Some pairs are nonsensical. For example, as Jonathan noted, !p->*x has the “wrong” precedence, so that it will not type-check; therefore we can omit the pair (U!, ->*) from our whitelist. Other pairs, such as (+, &&), seem nonsensical at first glance but in fact we must whitelist them anyway. Consider the expression:

a < b + 1 && c < d

In parsing this expression we call has_higher_precedence with the following pairs: (<, +); (+, &&); (<, &&); (&&, <). If (+, &&) weren’t on our whitelist, we’d get a warning — which we don’t want! We certainly should try to diagnose “unusually typed” expressions such as b + 1 && c; but we can’t do it with this operator-precedence technique alone.
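That list of pairs can be reproduced mechanically. Here is a minimal sketch that runs a binary-operators-only shunting yard and records every comparison it makes. PREC_SUBSET and compared_pairs are hypothetical names; the precedence levels are the ones PREC assigns, restricted to the operators used here. The second call revisits x / a->b * z from the introduction and shows / and * meeting despite the -> between them.

```python
# Levels as assigned by the PREC table, restricted to the operators used below.
PREC_SUBSET = {'&&': 1, '<': 6, '+': 9, '*': 10, '/': 10, '->': 13}

def compared_pairs(tokens):
  """Return each (stack top, incoming) pair whose precedences get compared."""
  pairs, stack = [], []
  for token in tokens:
    if token not in PREC_SUBSET:
      continue  # operands do not affect which pairs get compared
    while stack:
      pairs.append((stack[-1], token))
      if PREC_SUBSET[stack[-1]] < PREC_SUBSET[token]:
        break   # incoming operator binds tighter: stop popping
      stack.pop()
    stack.append(token)
  return pairs

print(compared_pairs('a < b + 1 && c < d'.split()))
# [('<', '+'), ('+', '&&'), ('<', '&&'), ('&&', '<')]
print(compared_pairs('x / a -> b * z'.split()))
# [('/', '->'), ('->', '*'), ('/', '*')]
```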

However, it seems to me that we can safely omit the pair (<<, &&) from our whitelist: those two operators really should never appear next to one another. In fact I think << should always be parenthesized; I can’t think of any operator (between the extremes of . and =) where x << y @ z is both clear and useful. The useful expressions, such as x << y + 1, are not clear; and the clear expressions, such as x << y && z, are not useful.

My suggested whitelist ends up looking like this:

UNPROBLEMATIC = sum([
  [('||',r) for r in ['||', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
  [('&&',r) for r in ['&&', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
  [(x,x) for x in '|^&'],
  [(l,r) for l in ['==', '!=', '<', '<=', '>', '>='] for r in ['||', '&&', '+', '-', '*', '/', '%']],
  [(l,r) for l in ['+', '-', '*'] for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
  [('/',r) for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=', '+', '-']],
  [('%',r) for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=']],
  [(l,r) for l in ['U+', 'U-', 'U~', 'U*', 'U&'] for r in ['||', '&&', '|', '^', '&', '==', '!=', '<', '<=', '>', '>=', '<=>', '<<', '>>', '+', '-', '*', '/', '%']],
], [])

def has_higher_precedence(a, b):
  if a in ['==', '!='] and (a == b):
    print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
  elif a in ['<', '<='] and b in ['<', '<=']:
    print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
  elif a in ['>', '>='] and b in ['>', '>=']:
    print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
  elif a[0] == 'U' and b in ['.*', '->*']:
    print('warning: (%sx%sy) means (%sx)%sy' % (a[1:], b, a[1:], b))
  elif a in ['(', '.', '->', '.*', '->*'] or b in ['.', '->', '.*', '->*']:
    pass # not problematic
  elif (a,b) in UNPROBLEMATIC:
    pass # not problematic
  elif a[0] == 'U':
    print('warning: (%sx %s y) is ambiguous; consider adding parentheses' % (a[1:], b))
  else:
    print('warning: (x %s y %s z) is ambiguous; consider adding parentheses' % (a, b))
  return PREC[a] >= PREC[b]

Test it against some of Jonathan’s exotic expressions and it spews warnings — but I don’t think any of these warnings are “wrong”!

a & b + c * d && e ^ f == 7
 warning: (x & y + z) is ambiguous; consider adding parentheses
 warning: (x & y && z) is ambiguous; consider adding parentheses
 warning: (x && y ^ z) is ambiguous; consider adding parentheses
 warning: (x ^ y == z) is ambiguous; consider adding parentheses
 a b c d * + & e f 7 == ^ &&

arr + 32 < ~a | b
 warning: (x < y | z) is ambiguous; consider adding parentheses
 arr 32 + a U~ < b |

!x && y
 warning: (!x && y) is ambiguous; consider adding parentheses
 x U! y &&

Get the full Python code here.