Better diagnostics through operator-precedence parsing
All C-style programming languages have too many levels of operator precedence. This confuses their programmers, who have to remember not just the basic PEMDAS stuff but the really arcane precedences, too:
-
x & y == z
doesn’t mean(x & y) == z
but ratherx & (y == z)
. That is, it generally Does The Wrong Thing. -
x == y == z
andx < y < z
have the “correct” natural precedence, but if you see one of these in code, it’s almost certainly a bug, because it’s not asking whetherx
,y
, andz
have the same value; it’s asking whether(x == y)
has the same value asz
! -
Some expressions are simply bizarre. I’d bet you have no intuition for the meaning of
x < y | z
, do you? Of course you can reason that<
has roughly the precedence of==
, and|
has roughly the precedence of&
, so this should parse roughly the same asx == y & z
. But that’s not intuition, it’s reasoning: a conscious process that requires active engagement. -
!x && y
is another expression that’s easy to reason about, once you’re on the lookout for it; but at a glance it’s easy to mistake for!(x && y)
. -
Even limiting ourselves to simple arithmetic, it’s easy to misinterpret an expression like
x / y * z
.
Jonathan Müller wrote on this subject in “Operator precedence is broken” (July 2017); it may be worth your time to go read his post before tackling the rest of this one.
All five of the problems above are solvable within the operator-precedence subsystem!
The key observation is that all of these ways of “making an expression problematic”
boil down to that the expression contains two adjoining operators whose resolution is
non-obvious: &
next to ==
, say, or /
next to *
(but not the reverse).
By “next to,” of course I don’t mean that the operators must appear lexically consecutive
in the source code: we also want to diagnose x / a->b * z
, despite the ->
operator’s
coming lexically between /
and *
. It turns out that the meaning of “next to” is
exactly captured by this definition:
Two operators
@
and$
within the same expression are said to be “next to” each other if parsing the expression with an operator-precedence parser requires at some point comparing the relative precedences of@
and$
.
In this post I construct a simple operator-precedence parser using Dijkstra’s shunting-yard algorithm to parse an expression grammar almost, but not quite, entirely unlike C’s; and then modify it to diagnose all five of the issues above (and many more).
Lexer
Here’s a simple lexer for a C-style expression grammar. A convenient thing about
C’s lexer is that every operator token is either a single byte, or else a string where
every prefix of that string is also a valid operator. For example, <<=
is a valid
token; and so are its prefixes <<
and <
. Therefore our lexer never needs to “look ahead”
more than a single character.
Since this lexer is merely an uninteresting prerequisite for the fun parsing part, we’ll impose no
structure on our non-operator tokens: alphanumeric tokens like 42
, 0x7
, foo
, if
,
5g
are all treated uniformly. In fact, since we use Python’s built-in isalnum
,
we’ll consider underscore to be an (unrecognized) operator, not an identifier.
OPERATORS = [
'+', '-', '*', '/', '%',
'<', '<=', '>', '>=', '==',
# etc.
]
class Lexer:
def tokenize(self, chars):
self.token = ''
for ch in chars:
if ch.isspace():
yield from self.emit_if(True)
elif ch.isalnum():
yield from self.emit_if(not self.token.isalnum())
self.token += ch
else:
yield from self.emit_if(self.token.isalnum())
yield from self.emit_if((self.token + ch) not in OPERATORS)
self.token += ch
yield from self.emit_if(True)
def emit_if(self, b):
if b and self.token:
yield self.token
self.token = ''
Test with:
while True:
line = input()
tokens = Lexer().tokenize(line)
print(' '.join(tokens))
For example, entering “abc*d<=-ef/g
” should print “abc * d <= - ef / g
.”
Parser (without diagnostics)
Let’s make a simple shunting-yard parser. It will consume a stream of tokens
in infix order (such as the stream generated by our previous step’s lexer)
and produces a stream of the same tokens rearranged in postfix (RPN) order.
In the input stream unary and binary -
are represented identically; but
in the output stream we’ll distinguish unary operators by the prefix U
.
Thus the output “a b U- -
” will correspond to the input “a - -b
,” while
the output “a b - U-
” will correspond to the input “-(a-b)
.”
The parser needs to know the precedence of each operator. We encapsulate
that comparison in a helper function. Notice that unary operators are
given to has_higher_precedence
already encoded with the U
prefix.
PREC = {
k: v for v, ks in enumerate([
['||'], ['&&'], ['|'], ['^'], ['&'],
['==', '!='], ['<', '<=', '>', '>='], ['<=>'],
['<<', '>>'], ['+', '-'], ['*', '/', '%'],
['.*', '->*'],
['U'+t for t in ['+', '-', '!', '~', '*', '&']],
['.', '->'],
]) for k in ks
}
PREC['('] = -float('inf')
def has_higher_precedence(a, b):
return PREC[a] >= PREC[b]
This code assumes every operator is left-associative. Supporting right-associative
operators such as =
or **
or =>
simply requires inspecting the associativity of a
:
def has_higher_precedence(a, b):
if a in ['=', '**', '=>']:
return PREC[a] > PREC[b]
else:
return PREC[a] >= PREC[b]
Now the parser itself goes like this:
def infix_to_postfix(tokens):
stack = []
expect_primary = True
for token in tokens:
if token.isalnum():
if not expect_primary:
raise ValueError('got "%s" when a binary operator was expected' % token)
expect_primary = False
yield token
elif token == '(':
if not expect_primary:
raise ValueError('got "(" when a binary operator was expected')
stack.append(token)
elif token == ')':
if expect_primary:
raise ValueError('got ")" when a primary-expression was expected')
while stack and stack[-1] != '(':
yield stack.pop()
if not stack:
raise ValueError('right parenthesis with no preceding left parenthesis')
stack.pop()
elif expect_primary:
if ('U' + token) not in PREC:
raise ValueError('unknown unary operator "%s"' % token)
stack.append('U' + token)
else:
if token not in PREC:
raise ValueError('unknown binary operator "%s"' % token)
while stack and has_higher_precedence(stack[-1], token):
yield stack.pop()
stack.append(token)
expect_primary = True
while stack:
if stack[-1] == '(':
raise ValueError('left parenthesis with no matching right parenthesis')
yield stack.pop()
Test with:
while True:
line = input()
tokens = Lexer().tokenize((ch for ch in line))
try:
print(' '.join(infix_to_postfix(tokens)))
except ValueError as e:
print('error:', e)
For example, entering “abc*d<=-ef/g
” should print “abc d * ef U- g / <=
.”
Adding diagnostics
Since every pair of “adjoining” operators passes through has_higher_precedence
,
it’s simple to diagnose all the problems I listed at the top of this post. At first
we might think to do it via a blacklist of known-problematic pairs:
def has_higher_precedence(a, b):
if a in ['U!'] and b in ['&&', '||']:
print("warning: (%sx %s y) is ambiguous" % (a, b))
elif a in ['/', '%'] and b in ['*']:
print("warning: (x %s y %s z) is ambiguous" % (a, b))
elif a in ['&', '^', '|'] and b in ['==', '!=', '<', '<=', '>', '>=']:
print("warning: (x %s y %s z) is ambiguous" % (a, b))
elif a in ['<', '<='] and b in ['<', '<=']:
print("warning: (x %s y %s z) doesn't mean what you think" % (a, b))
return PREC[a] >= PREC[b]
However, this leaves you vulnerable to failures of imagination. For example, it seems
to me that we should also warn about x & y << z
(which means x & (y << z)
, not
(x & y) << z
). Where in the maze of if
s above should that case be inserted?
A better approach is to diagnose anything that’s not on a whitelist of
uncontroversially non-problematic pairs! Out of the roughly 667 operator-pairs
that our has_higher_precedence
might ever see, only about half of them are
non-problematic in my book.
Some pairs are nonsensical. For example,
as Jonathan noted, !p->*x
has the “wrong” precedence, so that it will not type-check;
therefore we can omit the pair (U!
, ->*
) from our whitelist.
Other pairs, such as (+
, &&
), seem nonsensical at first glance but in fact
we must whitelist them anyway. Consider the expression:
a < b + 1 && c < d
In parsing this expression we call has_higher_precedence
with the following
pairs: (<
, +
); (+
, &&
); (<
, &&
); (&&
, <
). If (+
, &&
) weren’t
on our whitelist, we’d get a warning — which we don’t want! We certainly should
try to diagnose “unusually typed” expressions such as b + 1 && c
; but we can’t
do it with this operator-precedence technique alone.
However, it seems to me that we can safely omit the pair (<<
, &&
) from our
whitelist: those two operators really should never appear next to one another.
In fact I think <<
should always be parenthesized; I can’t think of any operator
(between the extremes of .
and =
) where x << y @ z
is both clear and useful.
The useful expressions, such as x << y + 1
, are not clear; and the clear expressions,
such as x << y && z
, are not useful.
My suggested whitelist ends up looking like this:
UNPROBLEMATIC = sum([
[('||',r) for r in ['||', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
[('&&',r) for r in ['&&', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
[(x,x) for x in '|^&'],
[(l,r) for l in ['==', '!=', '<', '<=', '>', '>='] for r in ['||', '&&', '+', '-', '*', '/', '%']],
[(l,r) for l in ['+', '-', '*'] for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=', '+', '-', '*', '/', '%']],
[('/',r) for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=', '+', '-']],
[('%',r) for r in ['||', '&&', '==', '!=', '<', '<=', '>', '>=']],
[(l,r) for l in ['U+', 'U-', 'U~', 'U*', 'U&'] for r in ['||', '&&', '|', '^', '&', '==', '!=', '<', '<=', '>', '>=', '<=>', '<<', '>>', '+', '-', '*', '/', '%']],
], [])
def has_higher_precedence(a, b):
if a in ['==', '!='] and (a == b):
print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
elif a in ['<', '<='] and b in ['<', '<=']:
print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
elif a in ['>', '>='] and b in ['>', '>=']:
print('warning: (x %s y %s z) doesn\'t mean what you think' % (a, b))
elif a[0] == 'U' and b in ['.*', '->*']:
print('warning: (%sx%sy) means (%sx)%sy' % (a[1:], b, a[1:], b))
elif a in ['(', '.', '->', '.*', '->*'] or b in ['.', '->', '.*', '->*']:
pass # not problematic
elif (a,b) in UNPROBLEMATIC:
pass # not problematic
elif a[0] == 'U':
print('warning: (%sx %s y) is ambiguous; consider adding parentheses' % (a[1:], b))
else:
print('warning: (x %s y %s z) is ambiguous; consider adding parentheses' % (a, b))
return PREC[a] >= PREC[b]
Test it against some of Jonathan’s exotic expressions and it spews warnings — but I don’t think any of these warnings are “wrong”!
a & b + c * d && e ^ f == 7
warning: (x & y + z) is ambiguous; consider adding parentheses
warning: (x & y && z) is ambiguous; consider adding parentheses
warning: (x && y ^ z) is ambiguous; consider adding parentheses
warning: (x ^ y == z) is ambiguous; consider adding parentheses
a b c d * + & e f 7 == ^ &&
arr + 32 < ~a | b
warning: (x < y | z) is ambiguous; consider adding parentheses
arr 32 + a U~ < b |
!x && y
warning: (!x && y) is ambiguous; consider adding parentheses
x U! y &&
Get the full Python code here.
See also:
- “Two musings on the design of compiler warnings” (2020-09-02)