Advanced Programming 2018 – Introduction to Parsing and Parser Combinators
Andrzej Filinski
andrzej@di.ku.dk
Department of Computer Science
University of Copenhagen
September 18, 2018
Background
- In practical applications, often have to read in structured textual data for further processing.
  - Could be actual program (script) code, but also serialized database dumps, marked-up text, web page templates, configuration files, build rules, …
- If expressed in a standard format (XML, JSON, YAML, …), can probably find a ready-made reader.
  - Though maybe not exactly for your language/platform
  - Or for the almost-standard format you actually need to read.
    - Proprietary/experimental extensions
    - Standards version skew
    - Not-strictly-compliant data
- But otherwise, you’re on your own.
- As a computer scientist, you’ll need to be able to cope.
Introduction to parsing
- Parsing is (for our purposes) the process of constructing …
  - … a structured representation of data in memory …
  - … from its linearized representation as a text string.
- The structured representation is usually a simple data structure, the abstract syntax tree (AST).
  - Haskell’s (and other functional languages’) algebraic data types (data) are an excellent match for this.
  - Already seen in Assignment 1.
- The description of the linearized format is usually given by a grammar.
  - Describes the concrete syntax of the language being parsed
  - And (where non-obvious) how the concrete syntax corresponds to the abstract.
  - Must be suitable for machine processing
    - Sometimes a few tweaks are necessary
Parsing principles
- Most languages (both human- and machine-oriented) have a natural two-level structure.
- The lexical level deals with rules for the structure of the atomic tokens of the language:
  - Exact rules for the format of numerals, identifiers, keywords, operator symbols, …
  - What goes into the leaves (or at least single nodes) of the AST
  - Often have finitary, or simple iterative, descriptions:
    - “A keyword is one of the following: if, then, else, for, do, …”
    - “A valid identifier is a letter followed by zero or more letters, digits, or underscores”
- Also specifies general lexical properties not associated with individual tokens, e.g.:
  - case sensitivity of identifiers/keywords
  - whitespace rules (i.e., where allowed/required)
  - comment format(s)
Parsing principles, continued
- The syntactic level deals with rules for how tokens may be composed into higher-level units: expressions, statements, functions, modules, …
  - Typically reflects the structure of the AST
  - But with a little more “noise”: parentheses, separator symbols for lists (“,” vs. “;” vs. “ ”), …
- Almost always involves non-trivial recursion:
  - An expression is a number, or a variable, or two expressions separated by an infix operator, or …
  - A statement is an assignment of an expression to a variable, or a sequence of statements enclosed in braces, or …
- In practice, the boundary between the lexical and syntactic levels is often a bit fuzzy (more later).
  - “Scannerless parsing”: unified view.
Formal grammars
- A grammar says how individual characters or tokens (called terminal symbols or just terminals) are arranged into higher-level constructs (nonterminals).
- A grammar contains a collection of rules (or productions), specifying how each nonterminal may be built out of sequences of zero or more terminals and/or nonterminals.
- One of the nonterminals is the designated start symbol.
  - “Entry point” of the grammar for a single parsing run.
  - Cf. the main function in a C/Java/Haskell program.
- Meta-notation: when talking about grammars in general, will use:
  - uppercase Roman letters (A, B, C, …) for nonterminals,
  - lowercase Roman letters (a, b, c, …) for terminals,
  - (lowercase) Greek letters (α, β, γ, …) for (possibly empty) sequences of terminals and nonterminals.
Context-free grammars
- A CFG (aka “BNF grammar”, for “Backus–Naur Form” or “Backus Normal Form”) has the following general shape (“ragged matrix”):

    A1 ::= α11 | · · · | α1n1
    …
    Am ::= αm1 | · · · | αmnm

- Informally: (mutually) recursive definitions of all the nonterminals, enumerating all valid alternatives for each.
- Important subclass of CFGs: regular grammars; nonterminals are only allowed as the last element of an αij (“tail recursion”).
- Also definable: context-sensitive, or even more general, grammars, which may have additional symbols on the LHSs of productions.
  - Rarely used in CS practice, but more common in linguistics.
Writing specific grammars
- Many common concrete conventions for writing actual grammars; will generally use the following for AP:
  - nonterminals written as capitalized identifiers in italics
  - terminals written as character sequences between single quotes and in typewriter font
  - symbols in sequences separated by spaces
  - empty sequence of symbols written as ε (instead of nothing)
- E.g., syntactic grammar for a small imperative language:

    Stmt  ::= Var ‘=’ Expr ‘;’
           |  ‘{’ Stmts ‘}’
    Stmts ::= ε
           |  Stmt Stmts
    Expr  ::= Number
           |  Var
           |  Expr Oper Expr
           |  ‘(’ Expr ‘)’
    Oper  ::= ‘+’ | ‘*’

  (Assumes Number, Var defined elsewhere)
Lexical grammars
- Lexical structure of tokens can also be described by grammars.
- Example: lexical grammar for variable names:

    Var     ::= VHChar VarRest
    VarRest ::= ε | VRChar VarRest
    VHChar  ::= ‘A’ | · · · | ‘Z’
    VRChar  ::= VHChar | ‘0’ | · · · | ‘9’ | ‘_’

- “· · ·” is informal range notation. In a completely formal grammar, would have to write out all the alternatives.
- Not technically a regular grammar (not “tail recursive”)
  - But could be turned into one, by unfolding the definitions of the character classes and refactoring.
- In practice, the lexical grammar is often specified (partially) textually.
  - Character classes may be huge (e.g., “letters” in Unicode).
  - Avoids clutter in the syntactic grammar to express that arbitrary whitespace or comments are allowed between proper tokens.
Extended BNF (EBNF) notation
- Often seen in real-world specifications: regexp-like extensions for writing CFGs a bit more concisely.
- Easily “desugared” into plain BNF (B a fresh nonterminal in each case):
- Internal grouping:

    A ::= α ( β1 | · · · | βn ) γ
  becomes
    A ::= α B γ
    B ::= β1 | · · · | βn

- Optional elements:

    A ::= α [ β ] γ
  becomes
    A ::= α B γ
    B ::= β | ε

- Iterated elements:

    A ::= α { β } γ
  becomes
    A ::= α B γ
    B ::= β B | ε

- Notation can also be nested or used multiple times in a single rule, but with diminishing readability gains.
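As a concrete instance of the iteration rule, note that the Stmts production from the earlier example grammar could equally have been written in EBNF; the desugaring reconstructs the original plain-BNF form:

```
EBNF:          Stmts ::= { Stmt }

desugars to:   Stmts ::= B
               B     ::= Stmt B | ε

and inlining the fresh nonterminal B gives back exactly

               Stmts ::= ε | Stmt Stmts
```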
Derivations and parsing
- Given a grammar (regular or context-free), can say that a nonterminal A derives a string of terminals α, written A ⇒ α.
  - Begin with the start symbol; repeatedly replace some nonterminal with one of its production RHSs, until only terminals are left.
  - Example (one nonterminal replaced in each step):

      Expr ⇒ Expr + Expr ⇒ Number + Expr ⇒ 4 + Expr ⇒ 4 + Var ⇒ 4 + X

- For parsing, consider the opposite problem: given a sequence of terminals, can it be derived from the start symbol?
  - And by which exact productions? (To get a parse tree)
- Not obvious that this is even effectively decidable.
  - Worst case “only” O(n³), where n is the length of the input [Earley]
  - For “well-behaved” grammars, can do significantly better, often close to O(n).
  - In practice, most grammars are indeed well behaved.
    - If hard to parse for a computer, not great for humans either
    - … even if the brain may have “built-in” parsing acceleration [Chomsky]
Parsing in practice
- Traditional compiler wisdom: lexing done with regular grammars (regexps), parsing with CFGs.
- In practice: not quite.
- Real languages (especially legacy ones) often have mildly context-sensitive (or worse) lexing or parsing rules that defy easy categorization. Some random examples:
  - Fortran 66 “Hollerith constants”: CALL WRITE(12HHELLO WORLD!)
  - Programmer-specified operator precedence and associativity: Haskell’s infixr 5 ++. (In SML, can even be lexically scoped!)
  - C’s “typedef problem”: is (a)*b a cast of a pointer dereference, or a multiplication? Depends on the prior declaration of a.
  - Non-reserved keywords (Fortran, PL/I): e.g., FORI may be a variable name, or the start of a FOR-loop, depending on context.
  - Indentation sensitivity (Python, Haskell).
Parsing tools
- Parsing according to a CFG may look scary.
- General parser-generator tools exist: Lex/Flex, Yacc/Bison, …
  - (Quasi-)ports to various languages, including Haskell (Alex, Happy)
  - Translate a grammar specification into an (often table-based) parsing program.
- Often give the best absolute performance:
  - Can use fancier parsing algorithms that require heavy preprocessing.
  - Avoid “interpretation overhead” at parsing time.
- But for the majority of applications, lexing/parsing is far from the most time-consuming part.
  - So performance matters less than programmer productivity.
- Also, need various hacks to handle the complications from the previous slide with pure CFG tools.
Parser combinators
- For many purposes, preferable to hand-code a parser from (almost) scratch, using some variant of recursive-descent parsing.
  - Can “escape” to the full host language whenever needed during parsing.
- Particularly convenient in Haskell:
  - A small library of parser combinators makes parser specifications almost as concise and readable as the grammar itself.
  - Lazy evaluation and infix operators help keep the notation extra compact.
  - The monad-based approach makes it easy to incorporate backtracking (if/when desired), so lookahead is not a problem.
  - Additional features and optimizations can also be pushed into the combinators by tweaking the parsing monad.
A simple parsing task
- Recall (part of) the grammar of the simple imperative language:

    Stmt  ::= Var ‘=’ Expr ‘;’
           |  ‘{’ Stmts ‘}’
    Stmts ::= ε
           |  Stmt Stmts
    Expr  ::= Number
           |  Var
           |  Expr Oper Expr
           |  ‘(’ Expr ‘)’
    Oper  ::= ‘+’ | ‘*’

- Want to parse statements and expressions into the following AST:

    type Var = String
    data Stmt = Assign Var Expr | Seq [Stmt]
    data Expr = Num Int | Ref Var | Add Expr Expr | Mul Expr Expr
    -- note: no separate case for a parenthesized Expr

- How would we like such a parser to look?
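For concreteness, a small (hypothetical) input like "{X=1;Y=X+2;}" should come out as the following value of the AST type above. The data declarations are repeated, with derived Eq/Show added, so the snippet compiles on its own:

```haskell
type Var = String
data Stmt = Assign Var Expr | Seq [Stmt]
  deriving (Eq, Show)
data Expr = Num Int | Ref Var | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

-- the statement  { X = 1 ; Y = X + 2 ; }  as an AST value:
example :: Stmt
example = Seq [ Assign "X" (Num 1)
              , Assign "Y" (Add (Ref "X") (Num 2)) ]
```

Note how the braces and semicolons of the concrete syntax leave no trace in the AST.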
Parser code using combinators

    -- Imported from general parser library:
    newtype Parser a = …
    instance Monad Parser where …  -- enables do-notation
    symbol      :: String -> Parser ()
    (<|>)       :: Parser a -> Parser a -> Parser a
    many        :: Parser a -> Parser [a]
    parseString :: Parser a -> String -> Either ParseError a

    pStmt :: Parser Stmt
    pStmt = do v <- pVar; symbol "="; e <- pExpr; symbol ";"
               return $ Assign v e
            <|>
            do symbol "{"; ss <- many pStmt; symbol "}"
               return $ Seq ss

    pVar  :: Parser Var   -- TBD
    pExpr :: Parser Expr  -- TBD
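The pVar parser left TBD above can be written directly from the lexical grammar for variable names given earlier. The sketch below jumps ahead a little and repeats a minimal toy-library core (Parser, satisfy, <|>, many, as developed on the following slides) so that it runs standalone:

```haskell
import Control.Monad (ap, liftM)
import Data.Char (isDigit)

newtype Parser a = P { runP :: String -> Either String (a, String) }

instance Functor Parser where fmap = liftM
instance Applicative Parser where
  pure a = P (\s -> Right (a, s))
  (<*>)  = ap
instance Monad Parser where
  m >>= f = P (\s -> do (a, s') <- runP m s; runP (f a) s')

-- read a char satisfying a predicate
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = P (\s -> case s of
                       (c : s') | p c -> Right (c, s')
                       _              -> Left "satisfy: no match")

-- simplified alternative: try p2 if p1 fails
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = P (\s -> either (const (runP p2 s)) Right (runP p1 s))

-- zero or more occurrences
many :: Parser a -> Parser [a]
many p = (do a <- p; as <- many p; return (a : as)) <|> return []

type Var = String

-- Var ::= VHChar VarRest, per the lexical grammar for variable names
pVar :: Parser Var
pVar = do c  <- satisfy (`elem` ['A'..'Z'])                      -- VHChar
          cs <- many (satisfy (\x -> x `elem` ['A'..'Z']
                                  || isDigit x || x == '_'))     -- VRChar
          return (c : cs)
```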
Implementing the parser library
- Can often use an existing library as a “black box”.
- But very useful to be aware of the underlying principles:
  - Understand limitations on the grammars handled.
    - E.g., left recursion, lookahead, …
    - Cf. the LALR(1) restriction for Yacc.
  - Understand performance characteristics.
    - E.g., exponential slowdown if the grammar is specified inappropriately
  - Can add extensions where needed.
    - Utility combinators (using only the exported API of the library)
    - Genuine extensions (using library internals)
- Reading materials contain presentations of two complete parser libraries (ReadP, Parsec).
- In lectures, will show bits and pieces of a toy variant, combining aspects of both.
  - Don’t use it for assignments, or any other real work!
First step: what is a Parser?
- Need to support sequencing, where consecutive parser calls consume the input string incrementally.
- First try: a pure state monad

    newtype Parser a = P {runP :: String -> (a, String)}

- E.g., pNum :: Parser Int will read the number at the start of the input string and return its value and the rest of the string (for the following parsers).
- But what if parsing fails, i.e., if the string does not start with a digit?
  - Could call error and abort the entire program.
    - Not very nice to library users.
    - Problematic for parsing nonterminals with multiple alternatives, because we can’t recover from a failure and try another parse.
Parsing with failures
- Better: explicitly account for the possibility of failure:

    newtype Parser a = P {runP :: String -> Maybe (a, String)}

- Even more general: include error information upon failure:

    type ParseError = …
    newtype Parser a =
      P {runP :: String -> Either ParseError (a, String)}

  (The Maybe version corresponds to taking type ParseError = ())
- Make it into a Monad:

    type ParseError = String

    instance Monad Parser where
      return a = P (\s -> return (a, s))  -- builds on the Either monad!
      m >>= f  = P (\s -> do (a, s') <- runP m s; runP (f a) s')
      fail e   = P (\s -> Left e)
    -- also: instance Functor Parser, instance Applicative Parser
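Under current GHC, the instance above needs a little more scaffolding to compile: Monad requires Functor and Applicative superclass instances, and fail now lives in MonadFail. A minimal runnable version of the same design:

```haskell
import Control.Monad (ap, liftM)

type ParseError = String

newtype Parser a = P { runP :: String -> Either ParseError (a, String) }

instance Functor Parser where
  fmap = liftM                      -- derived from >>= and pure

instance Applicative Parser where
  pure a = P (\s -> Right (a, s))   -- succeed without consuming input
  (<*>)  = ap

instance Monad Parser where
  m >>= f = P (\s -> do (a, s') <- runP m s   -- builds on the Either monad
                        runP (f a) s')

instance MonadFail Parser where
  fail e = P (\_ -> Left e)         -- discard state, report the error
```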
Some examples of simple parsers

    -- read a single char from the input stream, if possible
    getc :: Parser Char
    getc = P (\s -> case s of
                      (c : s') -> return (c, s')
                      ""       -> Left "unexpected end of input")

    -- read a char of a specific class,
    -- for use with, e.g., Data.Char.isDigit :: Char -> Bool
    satisfy :: (Char -> Bool) -> Parser Char
    satisfy p = do c <- getc
                   if p c then return c
                          else fail $ "unexpected '" ++ [c] ++ "'"

    -- skip an expected string
    string :: String -> Parser ()
    string s = mapM_ (\c -> satisfy (== c)) s
    -- uses mapM_ :: Monad m => (a -> m b) -> [a] -> m ()
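Collecting the definitions so far into one standalone file (the Parser core is repeated from the previous slides) lets us try the small parsers out:

```haskell
import Control.Monad (ap, liftM)
import Data.Char (isDigit)

newtype Parser a = P { runP :: String -> Either String (a, String) }

instance Functor Parser where fmap = liftM
instance Applicative Parser where
  pure a = P (\s -> Right (a, s))
  (<*>)  = ap
instance Monad Parser where
  m >>= f = P (\s -> do (a, s') <- runP m s; runP (f a) s')
instance MonadFail Parser where
  fail e = P (\_ -> Left e)

-- read a single char from the input, if possible
getc :: Parser Char
getc = P (\s -> case s of
                  (c : s') -> Right (c, s')
                  ""       -> Left "unexpected end of input")

-- read a char satisfying a predicate, e.g. Data.Char.isDigit
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = do c <- getc
               if p c then return c
                      else fail ("unexpected '" ++ [c] ++ "'")

-- skip an expected literal string
string :: String -> Parser ()
string = mapM_ (\c -> satisfy (== c))

-- e.g.: runP (string "ab") "abc"  ==  Right ((), "c")
```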
Alternatives
- How to try alternative productions for a nonterminal?
- Choosing between two RHSs, preferring the first:

    (<|>) :: Parser a -> Parser a -> Parser a
    p1 <|> p2 =
      P (\s -> case runP p1 s of
                 Right (a, s') -> Right (a, s')
                 Left e1 ->
                   case runP p2 s of
                     Right (a, s') -> Right (a, s')
                     Left e2 -> Left $ e1 ++ " or " ++ e2)

- If p1 succeeds, will not try p2.
  - In most practical grammars, at most one alternative can succeed anyway.
  - Often, infeasible alternatives fail already on the first symbol.
- Caution: if one alternative can be empty, it will always succeed, so should try it last (unless doing deep backtracking, next time).
- If both p1 and p2 fail, tries to combine the error msgs, rather than only returning the msg from p2 (which may be confusing).
Iteration
- Often want to parse any number of consecutive occurrences of a grammar element (cf. “{ · · · }” in EBNF):

    many :: Parser a -> Parser [a]
    many p = (do a <- p; as <- many p; return (a:as))
             <|> return []  -- ordering matters! (for our Parser)

    pNum :: Parser Int
    pNum = do d <- satisfy isDigit; ds <- many (satisfy isDigit)
              return $ read (d:ds)  -- using instance Read Int
    -- Ex: runP pNum "123!" == Right (123, "!")

- Could also define many1 (aka some): at least one occurrence.
- Danger: if p can succeed without consuming anything, many p will run forever!
  - In particular, many (many p) will not work.
  - Can often fix by rephrasing the grammar.
  - Or use a fancier many that exits the iteration if the argument parser succeeded without consuming any input.
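The many1 mentioned above is mutually definable with many; a standalone sketch (toy core repeated, with a simplified <|> that drops the error-message combining from the previous slide):

```haskell
import Control.Monad (ap, liftM)
import Data.Char (isDigit)

newtype Parser a = P { runP :: String -> Either String (a, String) }

instance Functor Parser where fmap = liftM
instance Applicative Parser where
  pure a = P (\s -> Right (a, s))
  (<*>)  = ap
instance Monad Parser where
  m >>= f = P (\s -> do (a, s') <- runP m s; runP (f a) s')

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = P (\s -> case s of
                       (c : s') | p c -> Right (c, s')
                       _              -> Left "satisfy: no match")

-- simplified alternative: no error-message combining
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = P (\s -> case runP p1 s of
                       Right r -> Right r
                       Left _  -> runP p2 s)

-- zero or more occurrences (non-empty case tried first: ordering matters!)
many :: Parser a -> Parser [a]
many p = many1 p <|> return []

-- one or more occurrences (aka `some`)
many1 :: Parser a -> Parser [a]
many1 p = do a <- p; as <- many p; return (a : as)

pNum :: Parser Int
pNum = fmap read (many1 (satisfy isDigit))

-- e.g.: runP pNum "123!" == Right (123, "!"), as on the slide
```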
Complete parses
- In general, all parsers consume input from the start of the string and leave the remainder.
- To parse a complete input, should check that nothing is left after the start symbol has been parsed.
- Simple check when eventually getting out of the Parser monad:

    parseStringStmt :: String -> Either ParseError Stmt
    parseStringStmt s =
      case runP pStmt s of
        Right (a, "") -> Right a
        Right (_, _)  -> Left "Garbage left at end of input"
        Left e        -> Left $ "Parsing failed: " ++ e
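The per-start-symbol wrapper above generalizes directly to the parseString function listed in the library interface earlier; a standalone sketch (toy core repeated):

```haskell
import Control.Monad (ap, liftM)

type ParseError = String

newtype Parser a = P { runP :: String -> Either ParseError (a, String) }

instance Functor Parser where fmap = liftM
instance Applicative Parser where
  pure a = P (\s -> Right (a, s))
  (<*>)  = ap
instance Monad Parser where
  m >>= f = P (\s -> do (a, s') <- runP m s; runP (f a) s')

-- run a parser on a complete input: succeed only if no input is left over
parseString :: Parser a -> String -> Either ParseError a
parseString p s =
  case runP p s of
    Right (a, "") -> Right a
    Right (_, _)  -> Left "Garbage left at end of input"
    Left e        -> Left ("Parsing failed: " ++ e)
```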
What next
- Thursday’s lecture: more advanced features and issues:
  - Dealing with left recursion
  - Operator precedences and associativities
  - Lexing issues (esp. whitespace)
  - More on error reporting
  - Deep vs. shallow backtracking
  - …
- Assignment 1 due tomorrow evening.
  - Remember testing!
    - Not just running the examples from the handout and/or OnlineTA.
- Assignment 2 (out later today) will be a parser for SubScript.
- Lab session today 11–12.
  - Mostly intended for TA help with Assignment 1, and/or Assignment 0 resubmission.
- Some grammar exercises at the end of the Sestoft & Larsen parser notes.
  - But you’ll need to read the notes first!