Advanced Programming 2018 – Parsing and Parser Combinators, Continued
Andrzej Filinski
andrzej@di.ku.dk
Department of Computer Science
University of Copenhagen
September 20, 2018
1 / 28
Last time
I General motivation for learning about parsing
I a commonly needed skill, not only for PL implementors
I Introduced basic notions of CFGs
I terminals, nonterminals, productions/rules
I Introduced basics of parser combinators
I Parser monad, simple parsers, parser-combining forms
(sequences, alternatives, iteration)
2 / 28
Today: Some more advanced topics
I Grammars with operator precedences and associativities
I Eliminating left recursion
I Lexing issues
I Esp. whitespace (where allowed, where required)
I Parsing paradigms
I Shallow vs deep vs no backtracking
I ReadP and Parsec combinator libraries
I A bit about Assignment 2 skeleton
3 / 28
Parsing expressions with operators
I Had definition of expressions:
Expr ::= Var | Num | Expr Oper Expr | ‘(’ Expr ‘)’
Oper ::= ‘+’ | ‘*’
I Prime example of an ambiguous grammar: the string “2+3*X” can be
derived in multiple ways from Expr:
[Two derivation trees for “2+3*X”: in the first, the subtree Expr Oper Expr covers “2+3”,
grouping the string as “(2+3)*X”; in the second, the subtree covers “3*X”, grouping it as “2+(3*X)”.]
I Presumably only the latter was intended, but grammar gives no
indication of this.
I And once input is parsed into AST, any choice of grouping is hard
to undo.
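I As a concrete illustration: once parsed, the two derivations become two different ASTs.
A minimal sketch; the Expr datatype and constructor names below are assumed, not fixed by the grammar:
-- Hypothetical AST type, just for illustration:
data Expr = Num Int | Var String
          | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)
-- The two derivations of "2+3*X" build two different trees:
grouping1, grouping2 :: Expr
grouping1 = Mul (Add (Num 2) (Num 3)) (Var "X")   -- "(2+3)*X"
grouping2 = Add (Num 2) (Mul (Num 3) (Var "X"))   -- "2+(3*X)", the intended one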
4 / 28
Disambiguation-augmented grammars
I Common in practice to supplement ambiguous formal CFG with
semi-formal disambiguation rules.
I Another classic example: “dangling else” problem
Stmt ::= · · ·
| ‘if’ ‘(’ Expr ‘)’ Stmt ‘else’ Stmt
| ‘if’ ‘(’ Expr ‘)’ Stmt
I How to parse “if (t1) if (t2) X=1; else X=2;”?
I Most languages: else belongs to innermost enclosing if.
I Can leave to parser implementation.
I If it has a well specified disambiguation strategy, e.g. Yacc, or our
Parser monad.
I Or (often preferable): rewrite the grammar so only intended
parse is possible.
I Exercise! (See, e.g., Wikipedia for hint if needed.)
5 / 28
Operator parsing 1: precedence
I Want to express that ‘*’ groups tighter than ‘+’.
I In general, whole hierarchy of operators
I E.g., ‘<=’ groups even looser than ‘+’, while ‘^’ (exponentiation)
groups tighter than ‘*’.
I Conventional to assign to each operator a precedence:
small(ish) number, that indicates grouping strength of operator
compared to others.
I Only relative ordering matters, not magnitude of difference.
I E.g., in Haskell, (+) has precedence 6, (*) has precedence 7.
I Stratify grammar according to operator precedences
Expr ::= Term | Expr ‘+’ Expr
Term ::= Factor | Term ‘*’ Term
Factor ::= Num | Var | ‘(’ Expr ‘)’
I Now only one possible way to parse “2+3*X”.
6 / 28
Operator parsing 2: associativity
I Precedence-stratified grammar,
Expr ::= Term | Expr ‘+’ Expr
is still ambiguous: two ways to split and parse “2+3+X”.
I For addition, does not matter much which one we choose
I Except for potential overflow and/or loss of precision.
I But if we also allow ‘-’ between terms, parsing “2-3+X” by
grouping it as “2-(3+X)” would be wrong.
I As would parsing “2-3-X” as “2-(3-X)”, so not a matter of
relative precedence of ‘+’ and ‘-’.
I Rather, among operators of same precedence, should have a
well defined grouping direction (associativity).
I Most operators (‘+’, ‘-’) associate to the left, but some to right:
I E.g., Haskell: ‘^’ (exponentiation), ‘:’ (cons), ‘->’ (function space
[as a type constructor!])
I ‘++’ (list/string append), though semantically associative (like
‘+’), is parsed as associating to the right. (Why?)
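I In Haskell, both properties are written as fixity declarations; for illustration, these are the
standard Prelude fixities:
infixl 6 +, -        -- left-associative,  precedence 6
infixl 7 *, /        -- left-associative,  precedence 7
infixr 8 ^           -- right-associative, precedence 8
infixr 5 ++          -- right-associative, precedence 5 (same as ‘:’)
infix  4 ==, <       -- non-associative,   precedence 4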
7 / 28
Disambiguating associativity
I Consider ambiguous grammar:
Expr ::= Term | Expr AddOp Expr
AddOp ::= ‘+’ | ‘-’
I To express that AddOps are left-associative, take instead:
Expr ::= Term | Expr AddOp Term
I I.e. in a valid parse, the right argument of an AddOp cannot itself
contain another AddOp (unless parenthesized).
I Now only one way to parse “2-3+X”.
I Symmetrically, for right-associative operators, can take:
LExpr ::= Expr | Expr ‘:’ LExpr
I So only way to parse “2+3:4:l” as a LExpr (where ‘+’ has higher
precedence than ‘:’) is like “(2+3):(4:l)”.
I And for operators that shouldn’t associate at all:
CExpr ::= Expr | Expr RelOp Expr
RelOp ::= ‘==’ | ‘<’ | · · ·
I Then “2==3==X” is a syntax error.
I Whereas “(2==3)==X” or “2==(3==X)” would be (syntactically) OK.
8 / 28
Left recursion
I Consider unambiguous grammar:
Exp ::= Term | Exp AddOp Term
Term ::= Num | ‘(’ Exp ‘)’
AddOp ::= ‘+’ | ‘-’
I Coded (naively/literally) with parser combinators:
pExp :: Parser Expr
pExp = do e1 <- pExp; ao <- pAddOp; e2 <- pTerm
          return $ ao e1 e2
       <|> pTerm
pTerm :: Parser Expr
pTerm = do n <- pNum; return $ Num n
        <|>
        do symbol "("; e <- pExp; symbol ")"; return e
pAddOp :: Parser (Expr -> Expr -> Expr)
pAddOp = do symbol "+"; return Add
         <|> do symbol "-"; return Sub
I Can’t even parse input “2” with pExp! Infinite recursion.
I Left recursion: parser can directly or indirectly call itself,
without consuming any input in between.
9 / 28
Eliminating left recursion
I Some parser generators can handle (indeed, prefer!)
left-recursive grammars.
I But for top-down parsers (incl. Parsec, ReadP): deadly.
I Note that for right-associative (or non-associative) operators,
grammar is not left-recursive:
Exp ::= Term | Term ‘:’ Exp
Term ::= Num | ‘(’ Exp ‘)’
I And right-recursion in a grammar is fine (for a left-to-right parser).
I Unfortunately, can’t just change associativity of ‘+’ and ‘-’ from
standard mathematical practice to simplify parsing…
I Better solution: rewrite grammar to (in EBNF):
Exp ::= Term { AddOp Term }
I “Expression is a term, followed by zero or more additions or
subtractions of more terms.”
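I A minimal sketch of coding the EBNF rule directly with the many combinator, folding the
collected operator/term pairs to the left (pTerm, pAddOp, and Expr as on the previous slides):
pExp :: Parser Expr
pExp = do t <- pTerm
          ops <- many (do ao <- pAddOp; t' <- pTerm; return (ao, t'))
          return $ foldl (\e (ao, t') -> ao e t') t ops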
10 / 28
Eliminating left recursion, cont’d
I EBNF: Exp ::= Term { AddOp Term }
I BNF: Exp ::= Term Exp′     Exp′ ::= ε | AddOp Term Exp′
I Parser-combinator code:
pExp :: Parser Expr
pExp = do e <- pTerm; pExp' e
pExp' :: Expr -> Parser Expr    -- argument is "accumulator"
pExp' e1 = do ao <- pAddOp; e2 <- pTerm; pExp' (ao e1 e2)
            <|> return e1       -- try this alternative last
I Can extract above pattern into utility combinator chainl1:
chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
p `chainl1` po = do a <- p; p' a
  where p' a1 = do o <- po; a2 <- p; p' (a1 `o` a2)
                 <|> return a1
pExp = pTerm `chainl1` pAddOp
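I The remaining precedence levels of the stratified grammar follow the same pattern; a sketch,
assuming constructors Mul, Num, Var and token parsers pNum, pVar, symbol:
pTerm :: Parser Expr
pTerm = pFactor `chainl1` (do symbol "*"; return Mul)
pFactor :: Parser Expr
pFactor = do n <- pNum; return $ Num n
          <|> do v <- pVar; return $ Var v
          <|> do symbol "("; e <- pExp; symbol ")"; return e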
11 / 28
Left factoring
I Consider a grammar with a right-associative operator
LExp ::= Exp | Exp ‘:’ LExp
I No left-recursion (assuming Exp expressed properly).
I But can’t tell up front which of the two productions to use.
I Necessitates backtracking parser, and sometimes wastes work.
I Since both alternatives start with Exp, can parse Exp
unconditionally first, and only then choose:
LExp ::= Exp LExp′
LExp′ ::= ε | ‘:’ LExp
p `chainr1` po = do a <- p; p' a
  where p' a1 = do o <- po; a2 <- p `chainr1` po; return (a1 `o` a2)
                 <|> return a1
pLExp = pExp `chainr1` (do symbol ":"; return Cons)
I Other opportunities for left-factoring abound, e.g., in:
S ::= … | ‘if’ E ‘then’ S ‘fi’ | ‘if’ E ‘then’ S ‘else’ S ‘fi’
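I For instance, the two ‘if’ productions above share the prefix ‘if’ E ‘then’ S; a left-factored
parser sketch (the Stmt constructors If/IfElse, the sub-parsers pE and pS, and a keyword token
parser are assumed names):
pIf :: Parser Stmt
pIf = do keyword "if"; e <- pE; keyword "then"; s1 <- pS
         (do keyword "else"; s2 <- pS; keyword "fi"; return $ IfElse e s1 s2)
          <|> (do keyword "fi"; return $ If e s1)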
12 / 28
Whitespace
I Most grammars allow arbitrary whitespace between tokens:
Whitespace ::= ε | ( ‘ ’ | ‘\t’ | ‘\n’ ) Whitespace
I Do not want to insert Whitespace between all pairs of adjacent
symbols in productions.
I Nor explicit calls to whitespace-skipping throughout the parser.
I Need a systematic approach: make all token parsers
responsible for skipping adjacent whitespace.
I Clearly enough to skip before each token, and at very end; or vice
versa.
I In fact, much preferable to skip after tokens (and at very
beginning)
I Invariant: each terminal parser will see first real char of input.
I Avoids re-skipping whitespace at start of every alternative.
I Much like left-factoring the grammar.
13 / 28
Skipping whitespace in parsers
I Easy to add a whitespace-skipping parser builder:
whitespace :: Parser ()
whitespace =                    -- better: use a skipMany/munch combinator
  do many (satisfy isSpace); return ()
token :: Parser a -> Parser a
token p = do a <- p; whitespace; return a
symbol :: String -> Parser ()
symbol s = token $ string s
pNum :: Parser Int
pNum = token $ do ds <- many1 (satisfy isDigit)
                  return $ read ds
...
14 / 28
Token separation
I Sometimes whitespace is required to separate adjacent tokens.
I Consider, e.g., the grammar:
Expr ::= · · · | ‘let’ Var ‘=’ Expr ‘in’ Expr
I How to define keyword :: String -> Parser (), for
pExpr = do keyword "let"; v <- pVar; symbol "="; ...
I Naive approach: keyword = symbol
I Would accept “letx=5inx+1”: probably undesirable.
I On the other hand, “let x=5in(x+1)” is OK, so can’t simply
require at least one whitespace char after keywords, either.
I Workable solution: read entire word first, then check at end:
keyword s = token $ do s' <- many1 (satisfy isAlphaNum)
                       if s' == s then return ()
                         else fail $ "expected " ++ s
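I For example (illustration only, using the examples above):
-- keyword "let" on "let x=5in(x+1)"  succeeds, leaving "x=5in(x+1)"
-- keyword "let" on "letx=5inx+1"     fails: the whole word "letx" is read, and "letx" /= "let"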
15 / 28
Delimiting keywords, continued
I Previous solution is slightly wasteful
I Will only detect a mismatch after reading the entire word, even if it
differs from the expected keyword already on the first char.
I Repeats work when used in alternatives.
I Alternative approach: negated parsers
notFollowedBy :: Parser a -> Parser ()    -- slightly odd name
-- notFollowedBy p will succeed (without consuming anything)
-- iff the input string does not start with a p.
notFollowedBy p =
  P (\s -> case runP p s of
             Right _ -> Left "illegal here"
             Left _  -> Right ((), s))
keyword s = token $ do string s
                       notFollowedBy (satisfy isAlphaNum)
eof = notFollowedBy getc    -- succeeds iff at end of input
16 / 28
Keywords, concluded
I Final twist: keywords are often reserved
I So cannot use for variable names:
reserved :: [String]
reserved = ["if", "for", "while", "do", …]
type Var = String
pVar :: Parser Var
pVar = token $
  do c <- satisfy isAlpha
     cs <- many (satisfy isAlphaNum)
     let i = c:cs
     if i `notElem` reserved then return i
       else fail "variable can't be a reserved word"
17 / 28
Lookahead and backtracking
I For alternatives, our Parser tries all productions in turn, until
one succeeds.
I In A ::= α1 | α2, a parsing failure anywhere within α1 will cause
α2 to be tried.
I But if parsing all of α1 succeeds, α2 is discarded.
I Sometimes known as shallow backtracking.
I Allows unlimited lookahead when picking alternative.
I Ordering of alternatives is significant.
I Nice balance between convenience and efficiency, but not only
possible design choice.
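I A tiny illustration of why ordering matters (a sketch, using string and parseString as in last
lecture's Either-based Parser):
pKw :: Parser String
pKw = do string "in"; return "in"
      <|> do string "int"; return "int"
-- On input "int", the first alternative succeeds after "in" and is committed to;
-- the leftover "t" then makes the rest of the parse (e.g., a final eof) fail.
-- Listing the longer keyword "int" first fixes this.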
18 / 28
Shallow backtracking is not always enough
I Example: “An A is zero or more ‘x’s and ‘y’s, followed by a ‘z’.”
A ::= B ‘z’
B ::= ‘x’ B | ‘y’ B | ε
I Greedy approach for parsing B will work fine.
I Example: “An A is one or more ‘x’s and ‘y’s, ending in an ‘x’.”
A ::= B ‘x’
B ::= ‘x’ B | ‘y’ B | ε
I Greedy parsing of B may eat too many characters, causing rest of
A to fail!
I Need to rewrite grammar to not require lookahead, e.g.:
A ::= ‘x’ C | ‘y’ A
C ::= ‘x’ C | ‘y’ A | ε
I String can end after an ‘x’, but not after a ‘y’.
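I A recogniser sketch of the rewritten grammar (results are just ()):
pA, pC :: Parser ()
pA = do satisfy (== 'x'); pC
     <|> do satisfy (== 'y'); pA
pC = do satisfy (== 'x'); pC
     <|> do satisfy (== 'y'); pA
     <|> return ()
-- No greedy over-consumption: a successful parse of A always ends right after an 'x'.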
19 / 28
Alternative: deep backtracking parser
I Consider grammar:
A ::= α B γ | δ
B ::= β1 | β2
I If parsing γ fails, could try different way of parsing B instead of
jumping straight to δ (as in shallow backtracking).
I E.g., β2 instead of β1
I Note that β1 and β2 might consume different amounts of input,
e.g., if β1 = ‘b’ and β2 = ε.
I “No choice is final until the entire input is successfully parsed”.
I Idea: make the parser return all possible ways of parsing a
nonterminal at beginning of string.
I Only minimal changes required in Parser monad:
-- Parser a = P { runP :: String -> Either Error (a, String) }
newtype Parser a = P { runP :: String -> [(a, String)] }
instance Monad Parser where
  return a = P (\s -> return (a,s))    -- builds on the [] monad!
  m >>= f  = P (\s -> do (a,s') <- runP m s; runP (f a) s')
  fail e   = P (\s -> [])              -- ignore message
20 / 28
List-based parsing, continued
I Also a few changes in basic parser combinators:
-- Parser a = P { runP :: String -> [(a, String)] }
getc :: Parser Char
getc = P (\s -> case s of "" -> []; (c:s') -> return (c,s'))
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = P (\s -> runP p1 s ++ runP p2 s)
notFollowedBy p =
  P (\s -> case runP p s of [] -> return ((), s); _ -> [])
parseString :: Parser a -> String -> Either ParseError a
parseString p s =
  case runP (do whitespace; a <- p; eof; return a) s of
    []      -> Left "cannot parse"
    [(a,_)] -> Right a    -- the _ will be "" due to the eof parser
    _       -> error "looks like my grammar is ambiguous..."
21 / 28
Pros and cons of deep backtracking
I Some gain in convenience (can handle more grammars directly).
I Potentially excessive backtracking.
I Easy to induce quadratic, or even exponential, behavior.
I Worse: may split tokens in unexpected places
I E.g., for pNum :: Parser Int, defined exactly as before:
runP pNum "123!" == [(123,"!"),(12,"3!"),(1,"23!")]
I Sometimes need to explicitly force longest parse:
munch :: (Char -> Bool) -> Parser String
munch p = do as <- many (satisfy p)
             notFollowedBy (satisfy p); return as
I But can be implemented much more efficiently:
p1 <<|> p2 =    -- like <|>, but only tries p2 if p1 fails
  P (\s -> case runP p1 s of [] -> runP p2 s; l -> l)
munch p = do c <- satisfy p; cs <- munch p; return (c:cs)
          <<|> return []    -- note: "many" uses <|> here
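I For example (with the list-based Parser; the order of the many results follows the definition
of <|> above):
-- runP (many (satisfy isDigit)) "12a"  ==  [("12","a"),("1","2a"),("","12a")]
-- runP (munch isDigit)          "12a"  ==  [("12","a")]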
22 / 28
ReadP parser library
I Behaves like list-based parser on previous slides (as further
described in “Monadic Parsing” article), but internally
implemented more efficiently.
I Uses +++ for symmetric choice instead of <|>, and a few other
naming differences.
I See Hutton & Meijer article for principles, and Hoogle ReadP for
full API.
I Note: use readP_to_S to invoke top-level parser.
I Welcome to use for AP assignments, but beware of pitfalls from
previous slides…
I Also, absolutely no feedback on errors!
I Hint: use (approximate) bisection to track down location of
parsing errors when debugging grammar and/or input strings.
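I A minimal ReadP sketch, using only documented API (munch1, satisfy, many1, readP_to_S):
import Text.ParserCombinators.ReadP
import Data.Char (isDigit)
pNum :: ReadP Int
pNum = do ds <- munch1 isDigit
          return (read ds)
-- readP_to_S pNum "123!"  ==  [(123,"!")]        (munch1 takes the longest match)
-- readP_to_S (many1 (satisfy isDigit)) "123"     returns all non-empty prefixes as parses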
23 / 28
Other extreme: non-backtracking parsers
I Also possible to parse without backtracking.
I Potentially more efficient
I (Mainly because it takes more programmer effort to transform the
grammar into a suitably restricted form first.)
I In particular, manual left-factorization.
I In A ::= α1 | α2, commit to the α1 branch as soon as it successfully
parses the first input token.
I Only tries α2 if α1 failed straight away.
I Actually OK for many practical grammars (LL(1) class).
I By default, lookahead is only one character.
I Not enough to distinguish, e.g., throw and try at the start of a
statement.
I Or between any keyword and, e.g., a variable in an assignment,
or procedure name in a call.
24 / 28
Limited backtracking: try
I To see more of the input before committing, need extra
combinator:
try :: Parser p -> Parser p
I try p tries to run p, but if it fails anywhere, it pretends that it failed
already on the first input character.
I Typical usage:
pStmt = do symbol "{"; …               -- no "try" needed here
        <|>
        do try (keyword "if"); …
        <|>
        do v <- pVar; symbol "="; ...  -- nor here
I try can actually span over any parsers, not just single tokens.
I Extreme case: try p1 <|> p2 simulates unbounded (shallow)
backtracking.
I But negates advantages of backtracking-less parser.
I Principle: only “try”-protect as much of each alternative as
needed to determine that none of the following will work.
25 / 28
Parsec parser library
I Efficient, non-backtracking parser.
I See Leijen & Meijer article for principles, and Hoogle Parsec
for full API.
I Perhaps main advantage: gives pretty good error messages
out-of-the-box.
I Location of error (line & column)
I List of tokens (or higher-level symbols) valid at this point.
I Can improve error messages further by extra hints.
I Don’t waste time on that for AP!
I For post-AP work, may want to consider Megaparsec, or other
recent parsing libraries.
I Original Parsec starting to show its age.
I But beware of excessive hype, spotty documentation.
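I A minimal Parsec sketch, just showing the entry point and that errors come with positions
(pNum is our own name; parse, many1, digit, eof are from Text.Parsec):
import Text.Parsec
import Text.Parsec.String (Parser)
pNum :: Parser Integer
pNum = do ds <- many1 digit
          return (read ds)
-- parse (pNum <* eof) "input" "123"  ==  Right 123
-- parse (pNum <* eof) "input" "12a"  gives a Left whose message includes the
--                                    line/column and the symbols expected there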
26 / 28
Something completely different
I In the SubScript assignments, the parser (and interpreter, for that
matter) modules have very short export lists.
I Avoids leaking internal implementation details.
I Avoids clashing with client’s own functions/types on bulk import.
I Never export more than listed in documentation!
I But then, how to thoroughly unit-test all the internal functions?
I Don’t put testing code into implementation!
I Solution: split into two modules:
I Implementation module exports everything.
I Interface module imports from the implementation and
re-exports only the required names.
I Link against (import from) appropriate module:
I Actual clients (and black-box testing code) link against interface
module.
I White-box testing code links against implementation module.
I Already set up this way in skeleton code for Assignment 2.
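I Schematically (module names here are only illustrative; use the ones from the skeleton):
-- File ParserImpl.hs: no export list, so everything (including internal
-- helpers) is visible to white-box tests.
module ParserImpl where
-- ... all actual definitions go here ...

-- File Parser.hs: re-exports only the documented names; clients and
-- black-box tests import this one.
module Parser (ParseError, parseString) where
import ParserImpl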
27 / 28
What next
I Assignment 2 (parsing SubScript) is out, due on Wednesday
I Labs today 13:05–15:00
I Next week: Prolog and logic programming
I Will see deep backtracking again!
28 / 28