
Advanced Programming 2018
Parsing and Parser Combinators, Continued

Andrzej Filinski
andrzej@di.ku.dk

Department of Computer Science
University of Copenhagen

September 20, 2018

1 / 28

Last time

I General motivation for learning about parsing
I a commonly needed skill, not only for PL implementors

I Introduced basic notions of CFGs
I terminals, nonterminals, productions/rules

I Introduced basics of parser combinators
I Parser monad, simple parsers, parser-combining forms

(sequences, alternatives, iteration)

2 / 28

Today: Some more advanced topics

I Grammars with operator precedences and associativities
I Eliminating left recursion

I Lexing issues
I Esp. whitespace (where allowed, where required)

I Parsing paradigms
I Shallow vs deep vs no backtracking
I ReadP and Parsec combinator libraries

I A bit about Assignment 2 skeleton

3 / 28

Parsing expressions with operators

I Had definition of expressions:

Expr ::= Var | Num | Expr Oper Expr | ‘(’ Expr ‘)’
Oper ::= ‘+’ | ‘*’

I Prime example of an ambiguous grammar: the string “2+3*X” can be
derived in multiple ways from Expr:

[The slide shows the two parse trees for “2+3*X” as overbrace derivations:
one groups the string as “(2+3)*X”, the other as “2+(3*X)”.]

I Presumably only the latter was intended, but grammar gives no
indication of this.

I And once input is parsed into AST, any choice of grouping is hard
to undo.
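
For concreteness, a minimal sketch (the Expr datatype and constructor names
below are illustrative, not from the slides) of the two ASTs a parser could
build for “2+3*X”:

data Expr = Num Int | Var String | Oper Op Expr Expr  deriving Show
data Op   = Add | Mul                                 deriving Show

leftGrouped, rightGrouped :: Expr
leftGrouped  = Oper Mul (Oper Add (Num 2) (Num 3)) (Var "X")  -- (2+3)*X
rightGrouped = Oper Add (Num 2) (Oper Mul (Num 3) (Var "X"))  -- 2+(3*X)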

4 / 28

Disambiguation-augmented grammars

I Common in practice to supplement ambiguous formal CFG with
semi-formal disambiguation rules.

I Another classic example: “dangling else” problem

Stmt ::= · · ·
| ‘if’ ‘(’ Expr ‘)’ Stmt ‘else’ Stmt
| ‘if’ ‘(’ Expr ‘)’ Stmt

I How to parse “if (t1) if (t2) X=1; else X=2;”?
I Most languages: else belongs to innermost enclosing if.

I Can leave to parser implementation.
I If it has a well specified disambiguation strategy, e.g. Yacc, or our
Parser monad.

I Or (often preferable): rewrite the grammar so only intended
parse is possible.

I Exercise! (See, e.g., Wikipedia for hint if needed.)

5 / 28

Operator parsing 1: precedence

I Want to express that ‘*’ groups tighter than ‘+’.

I In general, whole hierarchy of operators
I E.g., ‘<=’ groups even looser than ‘+’, while ‘^’ (exponentiation)
groups tighter than ‘*’.

I Conventional to assign each operator a precedence: a small(ish)
number that indicates the grouping strength of the operator
compared to others.
I Only relative ordering matters, not magnitude of difference.
I E.g., in Haskell, (+) has precedence 6, (*) has precedence 7.

I Stratify grammar according to operator precedences:

Expr ::= Term | Expr ‘+’ Expr
Term ::= Factor | Term ‘*’ Term
Factor ::= Num | Var | ‘(’ Expr ‘)’

I Now only one possible way to parse “2+3*X”.

6 / 28

Operator parsing 2: associativity

I Precedence-stratified grammar,

Expr ::= Term | Expr ‘+’ Expr

is still ambiguous: two ways to split and parse “2+3+X”.
I For addition, does not matter much which one we choose
I Except for potential overflow and/or loss of precision.

I But if we also allow ‘-’ between terms, parsing “2-3+X” by
grouping it as “2-(3+X)” would be wrong.
I As would parsing “2-3-X” as “2-(3-X)”, so not a matter of relative
precedence of ‘+’ and ‘-’.

I Rather, among operators of same precedence, should have a
well-defined grouping direction (associativity).

I Most operators (‘+’, ‘-’) associate to the left, but some to the right:
I E.g., Haskell: ‘^’ (exponentiation), ‘:’ (cons), ‘->’ (function space
[as a type constructor!])
I ‘++’ (list/string append), though semantically associative (like
‘+’), is parsed as associating to the right. (Why?)
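
In Haskell source code, these conventions are written as fixity declarations;
the standard fixities of the operators just mentioned are:

infixl 6 +, -     -- left-associative, precedence 6
infixl 7 *        -- left-associative, precedence 7
infixr 8 ^        -- right-associative, precedence 8
infixr 5 :, ++    -- right-associative, precedence 5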

7 / 28

Disambiguating associativity

I Consider ambiguous grammar:

Expr ::= Term | Expr AddOp Expr
AddOp ::= ‘+’ | ‘-’
I To express that AddOps are left-associative, take instead:

Expr ::= Term | Expr AddOp Term
I I.e. in a valid parse, the right argument of an AddOp cannot itself

contain another AddOp (unless parenthesized).
I Now only one way to parse “2-3+X”.

I Symmetrically, for right-associative operators, can take:
LExpr ::= Expr | Expr ‘:’ LExpr

I So only way to parse “2+3:4:l” as a LExpr (where ‘+’ has higher
precedence than ‘:’) is like “(2+3):(4:l)”.

I And for operators that shouldn’t associate at all:

CExpr ::= Expr | Expr RelOp Expr
RelOp ::= ‘==’ | ‘<’ | · · ·

I Then “2==3==X” is a syntax error.
I Whereas “(2==3)==X” or “2==(3==X)” would be (syntactically) OK.

8 / 28

Left recursion

I Consider unambiguous grammar:

Exp ::= Term | Exp AddOp Term
Term ::= Num | ‘(’ Exp ‘)’
AddOp ::= ‘+’ | ‘-’

I Coded (naively/literally) with parser combinators:

pExp :: Parser Expr
pExp = do e1 <- pExp; ao <- pAddOp; e2 <- pTerm; return $ ao e1 e2
   <|> pTerm
pTerm :: Parser Expr
pTerm = do n <- pNum; return $ Num n
    <|> do symbol "("; e <- pExp; symbol ")"; return e
pAddOp :: Parser (Expr -> Expr -> Expr)
pAddOp = do symbol "+"; return Add
     <|> do symbol "-"; return Sub

I Can’t even parse input “2” with pExp! Infinite recursion.

I Left recursion: parser can directly or indirectly call itself,
without consuming any input in between.

9 / 28

Eliminating left recursion

I Some parser generators can handle (indeed, prefer!)
left-recursive grammars.

I But for top-down parsers (incl. Parsec, ReadP): deadly.

I Note that for right-associative (or non-associative) operators,
grammar is not left-recursive:

Exp ::= Term | Term ‘:’ Exp
Term ::= Num | ‘(’ Exp ‘)’

I And right-recursion in grammar is fine (for a left-to-right parser).

I Unfortunately, can’t just change associativity of ‘+’ and ‘-’ from
standard mathematical practice to simplify parsing…

I Better solution: rewrite grammar to (in EBNF):

Exp ::= Term { AddOp Term }

I “Expression is a term, followed by zero or more additions or
subtractions of more terms.”

10 / 28

Eliminating left recursion, cont’d

I EBNF: Exp ::= Term { AddOp Term }
I BNF: Exp ::= Term Exp′
       Exp′ ::= ε | AddOp Term Exp′
I Parser-combinator code:

pExp :: Parser Expr
pExp = do e <- pTerm; pExp' e

pExp' :: Expr -> Parser Expr   -- argument is "accumulator"
pExp' e1 = do ao <- pAddOp; e2 <- pTerm; pExp' (ao e1 e2)
       <|> return e1           -- try this alternative last

I Can extract above pattern into utility combinator chainl1:

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
p `chainl1` po = do a <- p; p' a
  where p' a1 = do o <- po; a2 <- p; p' (a1 `o` a2)
             <|> return a1

pExp = pTerm `chainl1` pAddOp
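
As a quick sanity check (assuming the Expr constructors Num and Sub from the
pTerm/pAddOp code two slides back, and the Either-based runP from last
lecture), the result comes out left-associated as intended:

runP pExp "1-2-3" == Right (Sub (Sub (Num 1) (Num 2)) (Num 3), "")
-- i.e., grouped as (1-2)-3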

11 / 28

Left factoring

I Consider a grammar with a right-associative operator

LExp ::= Exp | Exp ‘:’ LExp
I No left-recursion (assuming Exp expressed properly).
I But can’t tell up front which of the two productions to use.

I Necessitates backtracking parser, and sometimes wastes work.

I Since both alternatives start with Exp, can parse Exp
unconditionally first, and only then choose:

LExp ::= Exp LExp′
LExp′ ::= ε | ‘:’ LExp

p `chainr1` po = do a <- p; p' a
  where p' a1 = do o <- po; a2 <- p `chainr1` po; return (a1 `o` a2)
            <|> return a1

pLExp = pExp `chainr1` (do symbol ":"; return Cons)

I Other opportunities for left-factoring abound, e.g., in:

S ::= … | ‘if’ E ‘then’ S ‘fi’ | ‘if’ E ‘then’ S ‘else’ S ‘fi’
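
A left-factored version could be coded like this (a sketch only; pE, pS,
keyword and the If/IfElse constructors are assumed names, not from the
assignment):

pIf =
  do keyword "if"; e <- pE; keyword "then"; s1 <- pS
     (do keyword "else"; s2 <- pS; keyword "fi"; return (IfElse e s1 s2))
       <|> (do keyword "fi"; return (If e s1))
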
12 / 28

Whitespace

I Most grammars allow arbitrary whitespace between tokens:

Whitespace ::= ε | ( ‘ ’ | ‘\t’ | ‘\n’ ) Whitespace

I Do not want to insert Whitespace between all pairs of adjacent
symbols in productions.

I Nor explicit calls to whitespace-skipping throughout the parser.

I Need a systematic approach: make all token parsers
responsible for skipping adjacent whitespace.

I Clearly enough to skip before each token, and at very end; or vice
versa.

I In fact, much preferable to skip after tokens (and at very
beginning)

I Invariant: each terminal parser will see first real char of input.
I Avoids re-skipping whitespace at start of every alternative.
I Much like left-factoring the grammar.

13 / 28

Skipping whitespace in parsers

I Easy to add to whitespace-skipping parser builder:

whitespace :: Parser ()
whitespace =   -- better: use skipMany/munch combinator
  do many (satisfy isSpace); return ()

token :: Parser a -> Parser a
token p = do a <- p; whitespace; return a

symbol :: String -> Parser ()
symbol s = token $ string s

pNum :: Parser Int
pNum = token $
  do ds <- many1 (satisfy isDigit)
     return $ read ds
...

14 / 28

Token separation

I Sometimes whitespace is required to separate adjacent tokens.

I Consider, e.g., grammar:

Expr ::= · · · | ‘let’ Var ‘=’ Expr ‘in’ Expr

I How to define keyword :: String -> Parser (), for

pExpr = do keyword "let"; v <- pVar; symbol "="; ...

I Naive approach: keyword = symbol
I Would accept “letx=5inx+1”: probably undesirable.
I On the other hand, “let x=5in(x+1)” is OK, so can’t simply require
at least one whitespace char after keywords, either.

I Workable solution: read entire word first, then check at end:

keyword s = token $
  do s' <- many1 (satisfy isAlphaNum)
     if s' == s then return ()
       else fail $ "expected " ++ s

15 / 28

Delimiting keywords, continued

I Previous solution is slightly wasteful
I Will only detect mismatch after reading entire word, even if it
differs from the expected keyword on the first char.
I Repeats work when used in alternatives.

I Alternative approach: negated parsers

notFollowedBy :: Parser a -> Parser ()   -- slightly odd name
-- notFollowedBy p will succeed (without consuming anything)
-- iff input string does not start with a p.
notFollowedBy p =
  P (\s -> case runP p s of
             Right _ -> Left "illegal here"
             Left _ -> Right ((), s))

keyword s = token $
  do string s
     notFollowedBy (satisfy isAlphaNum)

eof = notFollowedBy getc   -- succeeds iff at end of input

16 / 28

Keywords, concluded

I Final twist: keywords are often reserved

I So cannot use for variable names:

reserved :: [String]
reserved = ["if", "for", "while", "do", ...]

type Var = String
pVar :: Parser Var
pVar = token $
  do c <- satisfy isAlpha
     cs <- many (satisfy isAlphaNum)
     let i = c:cs
     if i `notElem` reserved then return i
       else fail "variable can't be a reserved word"

17 / 28

Lookahead and backtracking

I For alternatives, our Parser tries all productions in turn, until one
succeeds.
I In A ::= α1 | α2, a parsing failure anywhere within α1 will cause α2
to be tried.
I But if parsing all of α1 succeeds, α2 is discarded
I Sometimes known as shallow backtracking.

I Allows unlimited lookahead when picking alternative.
I Ordering of alternatives is significant.

I Nice balance between convenience and efficiency, but not the only
possible design choice.

18 / 28

Shallow backtracking is not always enough

I Example: “An A is zero or more ‘x’s and ‘y’s, followed by a ‘z’.”

A ::= B ‘z’
B ::= ‘x’ B | ‘y’ B | ε

I Greedy approach for parsing B will work fine.

I Example: “An A is one or more ‘x’s and ‘y’s, ending in an ‘x’.”

A ::= B ‘x’
B ::= ‘x’ B | ‘y’ B | ε

I Greedy parsing of B may eat too many characters, causing rest of
A to fail!
I Need to rewrite grammar to not require lookahead, e.g.:

A ::= ‘x’ C | ‘y’ A
C ::= ‘x’ C | ‘y’ A | ε

I String can end after an ‘x’, but not after a ‘y’.

19 / 28

Alternative: deep backtracking parser

I Consider grammar:

A ::= α B γ | δ
B ::= β1 | β2

I If parsing γ fails, could try a different way of parsing B instead of
jumping straight to δ (as in shallow backtracking).
I E.g., β2 instead of β1
I Note that β1 and β2 might consume different amounts of input,
e.g., if β1 = b and β2 = ε.
I “No choice is final until entire input successfully parsed”.

I Idea: make the parser return all possible ways of parsing a
nonterminal at beginning of string.
I Only minimal changes required in Parser monad:

-- Parser a = P { runP :: String -> Either Error (a, String) }
newtype Parser a = P { runP :: String -> [(a, String)] }

instance Monad Parser where
  return a = P (\s -> return (a,s))   -- builds on [] monad!
  m >>= f = P (\s -> do (a,s') <- runP m s; runP (f a) s')
  fail e = P (\s -> [])               -- ignore message
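
Aside (not on the slide): with recent GHC versions, a Monad instance also
requires Functor and Applicative instances, and fail now lives in the
MonadFail class; a minimal sketch for this list-based Parser:

instance Functor Parser where
  fmap f m = P (\s -> [ (f a, s') | (a,s') <- runP m s ])

instance Applicative Parser where
  pure a    = P (\s -> [(a,s)])
  pf <*> pa = P (\s -> [ (f a, s'') | (f,s') <- runP pf s, (a,s'') <- runP pa s' ])

instance MonadFail Parser where
  fail _ = P (\_ -> [])   -- replaces the fail method above on newer GHCs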

20 / 28

List-based parsing, continued

I Also a few changes in basic parser combinators:
— Parser a = P { runP :: String -> [(a, String)] }

getc :: Parser Char
getc = P (\s -> case s of "" -> []; (c:s') -> return (c,s'))

(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = P (\s -> runP p1 s ++ runP p2 s)

notFollowedBy p =
  P (\s -> case runP p s of [] -> return ((), s); _ -> [])

parseString :: Parser a -> String -> Either ParseError a
parseString p s =
  case runP (do whitespace; a <- p; eof; return a) s of
    []      -> Left "cannot parse"
    [(a,_)] -> Right a   -- the _ will be "" due to eof parser
    _       -> error "looks like my grammar is ambiguous..."

21 / 28

Pros and cons of deep backtracking

I Some gain in convenience (can handle more grammars directly).
I Potentially excessive backtracking.

I Easy to induce quadratic, or even exponential, behavior.

I Worse: may split tokens in unexpected places
I E.g., for pNum :: Parser Int, defined exactly as before:
runP pNum "123!" == [(123,"!"),(12,"3!"),(1,"23!")]

I Sometimes need to explicitly force longest parse:
munch :: (Char -> Bool) -> Parser String
munch p =
  do as <- many (satisfy p)
     notFollowedBy (satisfy p); return as

I But can be implemented much more efficiently:

p1 <<|> p2 =   -- like <|>, but only tries p2 if p1 fails
  P (\s -> case runP p1 s of [] -> runP p2 s; l -> l)

munch p = do c <- satisfy p; cs <- munch p; return (c:cs)
        <<|> return []   -- note: "many" uses <|> here

22 / 28

ReadP parser library

I Behaves like list-based parser on previous slides (as further
described in “Monadic Parsing” article), but internally
implemented more efficiently.

I Uses +++ for symmetric choice instead of <|>, and a few other
naming differences.

I See Hutton & Meijer article for principles, and Hoogle ReadP for
full API.

I Note: use readP_to_S to invoke top-level parser.
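
For example, a tiny ReadP parser and its top-level invocation could look like
this (a sketch; pNum and parseNum are illustrative names, not part of the
library):

import Text.ParserCombinators.ReadP
import Data.Char (isDigit)

pNum :: ReadP Int
pNum = do ds <- munch1 isDigit; skipSpaces; return (read ds)

parseNum :: String -> Either String Int
parseNum s =
  case readP_to_S (do skipSpaces; n <- pNum; eof; return n) s of
    [(n, _)] -> Right n
    _        -> Left "cannot parse"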

I Welcome to use for AP assignments, but beware of pitfalls from
previous slides…

I Also, absolutely no feedback on errors!
I Hint: use (approximate) bisection to track down location of

parsing errors when debugging grammar and/or input strings.

23 / 28

Other extreme: non-backtracking parsers

I Also possible to parse without backtracking.
I Potentially more efficient
I (Mainly because it takes more programmer effort to transform the

grammar into suitably restricted form first).
I In particular, manual left-factorization.

I In A ::= α1 | α2, commit to α1 branch as soon as it successfully
parses the first input token.

I Only tries α2 if α1 failed straight away.
I Actually OK for many practical grammars (LL(1) class).

I By default, lookahead is only one character.
I Not enough to distinguish, e.g., throw and try at start of a

sentence.
I Or between any keyword and, e.g., a variable in an assignment,

or procedure name in a call.

24 / 28

Limited backtracking: try

I To see more of the input before committing, need extra
combinator:

try :: Parser p -> Parser p
I try p tries to run p, but if it fails anywhere, pretend that it failed

already on first input character.

I Typical usage:
pStmt = do symbol "{"; ...            -- no "try" needed here
    <|> do try (keyword "if"); ...
    <|> do v <- pVar; symbol "="; ... -- nor here

I try can actually span over any parsers, not just single tokens.
I Extreme case: try p1 <|> p2 simulates unbounded (shallow)

backtracking.
I But negates advantages of backtracking-less parser.
I Principle: only “try”-protect as much of each alternative as

needed to determine that none of the following will work.

25 / 28

Parsec parser library

I Efficient, non-backtracking parser.

I See Leijen & Meijer article for principles, and Hoogle Parsec
for full API.

I Perhaps main advantage: gives pretty good error messages
out-of-the-box.

I Location of error (line & column)
I List of tokens (or higher-level symbols) valid at this point.
I Can improve error messages further by extra hints.

I Don’t waste time on that for AP!
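
A minimal sketch of using Parsec directly (standard Text.Parsec API; pKw and
pTop are made-up names, echoing the throw/try lookahead example from before):

import Text.Parsec
import Text.Parsec.String (Parser)

pKw :: Parser String   -- both keywords start with 't', so try-protect the first
pKw = try (string "throw") <|> string "try"

pTop :: Parser String
pTop = do k <- pKw; eof; return k

-- parse pTop "<input>" "try"     gives Right "try"
-- parse pTop "<input>" "throws"  gives Left <error with source position
--                                       and the expected input>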

I For post-AP work, may want to consider Megaparsec, or other
recent parsing libraries.

I Original Parsec starting to show its age.
I But beware of excessive hype, spotty documentation.

26 / 28

Something completely different

I In SubScript assignments, parser (and interpreter, for that
matter) module have very short export lists.

I Avoids leaking internal implementation details.
I Avoids clashing with client’s own functions/types on bulk import.

I Never export more than listed in documentation!

I But then, how to thoroughly unit-test all the internal functions?
I Don’t put testing code into implementation!

I Solution: split into two modules:
I Implementation module exports everything.
I Interface module imports from the implementation and

re-exports only the required names.
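
For concreteness, a minimal sketch of the two-module setup (module, type and
function names here are hypothetical, not the actual Assignment 2 names):

-- SubsParserImpl.hs: implementation module, exports everything,
-- including internal helpers that white-box tests can reach.
module SubsParserImpl where

type ParseError = String

skipSpacesHelper :: String -> String      -- internal helper
skipSpacesHelper = dropWhile (== ' ')

parseString :: String -> Either ParseError Int
parseString _ = Left "not implemented yet"

-- SubsParser.hs: interface module, re-exports only the documented names.
module SubsParser (ParseError, parseString) where

import SubsParserImpl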

I Link against (import from) appropriate module:
I Actual clients (and black-box testing code) link against interface

module.
I White-box testing code links against implementation module.

I Already set up this way in skeleton code for Assignment 2.

27 / 28

What next

I Assignment 2 (parsing SubScript) is out, due on Wednesday

I Labs today 13:05–15:00

I Next week: Prolog and logic programming
I Will see deep backtracking again!

28 / 28