Advanced Programming 2018
Introduction to Parsing and Parser Combinators

Andrzej Filinski
andrzej@di.ku.dk

Department of Computer Science
University of Copenhagen

September 18, 2018

1 / 24

Background

- In practical applications, often have to read in structured
  textual data for further processing.
- Could be actual program (script) code, but also serialized
  database dumps, marked-up text, web page templates,
  configuration files, build rules, …
- If expressed in a standard format (XML, JSON, YAML, …), can
  probably find a ready-made reader.
  - Though maybe not exactly for your language/platform
  - Or for the almost-standard format you actually need to read:
    - Proprietary/experimental extensions
    - Standards version skew
    - Not-strictly-compliant data
- But otherwise, you’re on your own.
  - As a computer scientist, you’ll need to be able to cope.

2 / 24

Introduction to parsing

- Parsing is (for our purposes) the process of constructing …
  - … a structured representation of data in memory …
  - … from its linearized representation as a text string.
- The structured representation is usually a simple data
  structure, the abstract syntax tree (AST).
  - Haskell’s (and other functional languages’) algebraic data types
    (data) are an excellent match for this.
  - Have already seen this for Assignment 1.
- The description of the linearized format is usually given by a
  grammar.
  - Describes concrete syntax of language being parsed
  - And (where non-obvious) how the concrete syntax corresponds to
    the abstract.
  - Must be suitable for machine processing
    - Sometimes a few tweaks are necessary

3 / 24

Parsing principles

- Most languages (both human and machine-oriented) have a
  natural two-level structure:
- The lexical level deals with rules for the structure of the atomic
  tokens of the language:
  - Exact rules for format of numerals, identifiers, keywords,
    operator symbols, …
  - What goes into leaves (or at least single nodes) of the AST
  - Often have finitary, or simple iterative, descriptions:
    - “A keyword is one of the following: if, then, else, for, do, …”
    - “A valid identifier is a letter followed by zero or more letters,
      digits, or underscores”
- Also specifies general lexical properties not associated with
  individual tokens, e.g.:
  - case sensitivity of identifiers/keywords
  - whitespace rules (i.e., where allowed/required)
  - comment format(s)

4 / 24

Parsing principles, continued

- The syntactic level deals with rules for how tokens may be
  composed into higher-level units: expressions, statements,
  functions, modules, …
  - Typically reflects the structure of the AST
  - But with a little more “noise”: parentheses, separator symbols for
    lists (“,” vs. “;” vs. “ ”), …
- Almost always involves non-trivial recursion:
  - An expression is a number, or a variable, or two expressions
    separated by an infix operator, or …
  - A statement is an assignment of an expression to a variable, or a
    sequence of statements enclosed in braces, or …
- In practice, boundary between lexical and syntactic levels often
  a bit fuzzy (more later).
  - “Scannerless parsing”: unified view.

5 / 24

Formal grammars

- A grammar says how individual characters or tokens (called
  terminal symbols or just terminals) are arranged into
  higher-level constructs (nonterminals).
- A grammar contains a collection of rules (or productions),
  specifying how each nonterminal may be built out of sequences
  of zero or more terminals and/or nonterminals.
- One of the nonterminals is the designated start symbol.
  - “Entry point” of the grammar for a single parsing run.
  - Cf. the main function in a C/Java/Haskell program.
- Meta-notation: when talking about grammars in general, will use:
  - uppercase Roman letters (A, B, C, …) for nonterminals,
  - lowercase Roman letters (a, b, c, …) for terminals,
  - (lowercase) Greek letters (α, β, γ, …) for (possibly empty)
    sequences of terminals and nonterminals.

6 / 24

Context-free grammars

- A CFG (aka “BNF grammar”, for “Backus–Naur Form” or
  “Backus Normal Form”) has the following general shape
  (“ragged matrix”):

      A1 ::= α11 | ··· | α1n1
       ⋮
      Am ::= αm1 | ··· | αmnm

- Informally: (mutually) recursive definitions of all the
  nonterminals, enumerating all valid alternatives for each.
- Important subclass of CFGs: regular grammars; nonterminals
  are only allowed as the last element of an αij (“tail recursion”).
- Also definable: context-sensitive, or even more general,
  grammars, which may also have additional symbols on the LHSs
  of productions.
  - Rarely used in CS practice, but more common in linguistics.

7 / 24

Writing specific grammars

- Many common concrete conventions for writing actual
  grammars; will generally use the following for AP:
  - nonterminals written as capitalized identifiers in italics
  - terminals written as character sequences between single quotes
    and in typewriter font
  - symbols in sequences separated by spaces
  - empty sequence of symbols written as ε (instead of nothing)
- E.g., syntactic grammar for a small imperative language:

      Stmt  ::= Var ‘=’ Expr ‘;’
             |  ‘{’ Stmts ‘}’
      Stmts ::= ε
             |  Stmt Stmts
      Expr  ::= Number
             |  Var
             |  Expr Oper Expr
             |  ‘(’ Expr ‘)’
      Oper  ::= ‘+’ | ‘*’

  (Assumes Number, Var defined elsewhere)

8 / 24

Lexical grammars

- Lexical structure of tokens can also be described by grammars.
- Example: lexical grammar for variable names:

      Var     ::= VHChar VarRest
      VarRest ::= ε | VRChar VarRest
      VHChar  ::= ‘A’ | ··· | ‘Z’
      VRChar  ::= VHChar | ‘0’ | ··· | ‘9’ | ‘_’

- “···” is informal range notation. In a completely formal grammar,
  would have to write out all alternatives.
- Not technically a regular grammar (not “tail recursive”)
  - But could be turned into one, by unfolding definitions of
    character classes and refactoring.
- In practice, lexical grammar often specified (partially) textually.
  - Character classes may be huge (e.g., “letters” in Unicode).
  - Avoids clutter in syntactic grammar to express that arbitrary
    whitespace or comments are allowed between proper tokens.

9 / 24
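As a sanity check, the Var grammar on the previous slide can be transcribed directly into a recognizer function. This is only a sketch: the helper names isVHChar/isVRChar are invented here, and only ASCII uppercase letters are assumed, matching the ‘A’ | ··· | ‘Z’ range in the grammar.

```haskell
import Data.Char (isAsciiUpper, isDigit)

-- Character classes, straight from the grammar:
isVHChar, isVRChar :: Char -> Bool
isVHChar c = isAsciiUpper c                       -- 'A' | ... | 'Z'
isVRChar c = isVHChar c || isDigit c || c == '_'  -- VHChar | '0'..'9' | '_'

-- Var ::= VHChar VarRest, where VarRest is zero or more VRChars
isVar :: String -> Bool
isVar (c:cs) = isVHChar c && all isVRChar cs
isVar []     = False
```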

Extended BNF (EBNF) notation.

- Often seen in real-world specifications: regexp-like extensions
  for writing CFGs a bit more concisely.
  - Easily “desugared” into plain BNF.
- Internal grouping:

      A ::= α ( β1 | ··· | βn ) γ
  becomes
      A ::= α B γ
      B ::= β1 | ··· | βn

- Optional elements:

      A ::= α [ β ] γ
  becomes
      A ::= α B γ
      B ::= β | ε

- Iterated elements:

      A ::= α { β } γ
  becomes
      A ::= α B γ
      B ::= β B | ε

- Notation can also be nested or used multiple times in a single
  rule, but with diminishing readability gains.

10 / 24

Derivations and parsing

- Given a grammar (regular or context-free), can say that a
  nonterminal A derives a string of terminals α, written A ⇒ α.
  - Begin with the start symbol, repeatedly replace some nonterminal
    with one of its production RHSs, until only terminals are left.
- Example (with the selected nonterminal underlined in each step):

      Expr ⇒ Expr + Expr ⇒ Number + Expr ⇒ 4 + Expr ⇒ 4 + Var ⇒ 4 + X

- For parsing, consider the opposite problem: given a sequence
  of terminals, can it be derived from the start symbol?
  - And by which exact productions? (To get a parse tree)
- Not obvious that this is even effectively decidable.
  - Worst case “only” O(n³), where n is length of input [Earley]
  - For “well-behaved” grammars, can do significantly better, often
    close to O(n).
  - In practice, most grammars are indeed well behaved.
    - If hard to parse for a computer, not great for humans either
    - … even if brain may have “built-in” parsing acceleration [Chomsky]

11 / 24

Parsing in practice

- Traditional compiler wisdom: lexing done with regular
  grammars (regexps), parsing with CFGs.
- In practice: not quite.
- Real languages (especially legacy ones) often have mildly
  context-sensitive (or worse) lexing or parsing rules that defy
  easy categorization. Some random examples:
  - Fortran 66 “Hollerith constants”: CALL WRITE(12HHELLO WORLD!)
  - Programmer-specified operator precedence and associativity:
    Haskell’s infixr 5 ++. (In SML, can even be lexically scoped!)
  - C’s “typedef problem”: is (a)*b a cast of a pointer dereference,
    or a multiplication? Depends on prior declaration of a.
  - Non-reserved keywords (Fortran, PL/I), e.g., FORI may be a
    variable name, or the start of a FOR-loop, depending on context.
  - Indentation sensitivity (Python, Haskell).

12 / 24

Parsing tools

- Parsing according to a CFG may look scary.
- General parser-generator tools exist: Lex/Flex, Yacc/Bison, …
  - (Quasi-)ports to various languages, including Haskell
    (Alex, Happy)
  - Translate a grammar specification into an (often table-based)
    parsing program.
- Often give the best absolute performance:
  - Can use fancier parsing algorithms that require heavy
    preprocessing.
  - Avoid “interpretation overhead” at parsing time.
- But for the majority of applications, lexing/parsing is far from the
  most time-consuming part.
  - So performance matters less than programmer productivity.
- Also, need various hacks to handle complications from the previous
  slide with pure CFG tools.

13 / 24

Parser combinators

- For many purposes, preferable to hand-code a parser from
  (almost) scratch, using some variant of recursive-descent
  parsing.
  - Can “escape” to the full host language whenever needed during
    parsing.
- Particularly convenient in Haskell:
  - A small library of parser combinators makes parser specifications
    almost as concise and readable as the grammar itself.
  - Lazy evaluation and infix operators help keep the notation extra
    compact.
  - A monad-based approach makes it easy to incorporate
    backtracking (if/when desired), so lookahead is not a problem.
  - Additional features and optimizations can also be pushed into the
    combinators by tweaking the parsing monad.

14 / 24

A simple parsing task

- Recall (part of) the grammar of the simple imperative language:

      Stmt  ::= Var ‘=’ Expr ‘;’
             |  ‘{’ Stmts ‘}’
      Stmts ::= ε
             |  Stmt Stmts
      Expr  ::= Number
             |  Var
             |  Expr Oper Expr
             |  ‘(’ Expr ‘)’
      Oper  ::= ‘+’ | ‘*’

- Want to parse statements and expressions into the following AST:

      type Var = String
      data Stmt = Assign Var Expr | Seq [Stmt]
      data Expr = Num Int | Ref Var | Add Expr Expr | Mul Expr Expr
      -- note: no separate case for parenthesized Expr

- How would we like such a parser to look?

15 / 24
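To make the grammar/AST correspondence concrete: here are the AST types with derived Eq/Show instances added, plus a small unparser back to (fully parenthesized) concrete syntax. The unparse functions are an illustration added here, not part of the course material.

```haskell
type Var = String
data Stmt = Assign Var Expr | Seq [Stmt]
  deriving (Eq, Show)
data Expr = Num Int | Ref Var | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

-- Flatten an AST back to concrete syntax; parenthesizing every
-- operator application sidesteps all precedence questions.
unparseE :: Expr -> String
unparseE (Num n)   = show n
unparseE (Ref v)   = v
unparseE (Add a b) = "(" ++ unparseE a ++ "+" ++ unparseE b ++ ")"
unparseE (Mul a b) = "(" ++ unparseE a ++ "*" ++ unparseE b ++ ")"

unparseS :: Stmt -> String
unparseS (Assign v e) = v ++ "=" ++ unparseE e ++ ";"
unparseS (Seq ss)     = "{" ++ concatMap unparseS ss ++ "}"

-- e.g. unparseS (Seq [Assign "X" (Add (Num 4) (Ref "Y"))])
--      gives "{X=(4+Y);}"
```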

Parser code using combinators

-- Imported from general parser library:
newtype Parser a = …
instance Monad Parser where … -- enables do-notation
symbol :: String -> Parser ()
(<|>) :: Parser a -> Parser a -> Parser a
many :: Parser a -> Parser [a]
parseString :: Parser a -> String -> Either ParseError a

pStmt :: Parser Stmt
pStmt = do v <- pVar; symbol "="; e <- pExpr; symbol ";"
           return $ Assign v e
    <|> do symbol "{"; ss <- many pStmt; symbol "}"
           return $ Seq ss

pVar :: Parser Var -- TBD
pExpr :: Parser Expr -- TBD

16 / 24

Implementing the parser library

- Can often use an existing library as a “black box”.
- But very useful to be aware of the underlying principles:
  - Understand limitations on grammars handled.
    - E.g., left-recursion, lookahead, …
    - Cf. LALR(1) restriction for Yacc.
  - Understand performance characteristics
    - E.g., exponential slowdown if grammar specified inappropriately
  - Can add extensions where needed
    - Utility combinators (using only exported API of library)
    - Genuine extensions (using library internals)
- Reading materials contain presentations of two complete parser
  libraries (ReadP, Parsec).
- In lectures, will show bits and pieces of a toy variant, combining
  aspects of both.
  - Don’t use it for assignments, or any other real work!

17 / 24

First step: what is a Parser?

- Need to support sequencing, where consecutive parser calls
  consume the input string incrementally.
- First try: a pure state monad

      newtype Parser a = P {runP :: String -> (a, String)}

- E.g., pNum :: Parser Int will read the number at the start of the
  input string and return its value and the rest of the string (for
  following parsers).
- But what if parsing fails, i.e., if the string does not start with a digit?
  - Could call error and abort the entire program.
    - Not very nice to library users.
  - Problematic for parsing nonterminals with multiple alternatives,
    because can’t recover from failure and try another parse.

18 / 24

Parsing with failures

- Better: explicitly account for the possibility of failure:

      newtype Parser a = P {runP :: String -> Maybe (a, String)}

- Even more general: include error information upon failure:

      type ParseError = …
      newtype Parser a =
        P {runP :: String -> Either ParseError (a, String)}

  (The Maybe version corresponds to taking type ParseError = ())

- Make it into a Monad:

      type ParseError = String
      instance Monad Parser where
        return a = P (\s -> return (a,s)) -- builds on Either monad!
        m >>= f  = P (\s -> do (a,s') <- runP m s; runP (f a) s')
        fail e   = P (\s -> Left e)
      -- also: instance Functor Parser, instance Applicative Parser

19 / 24
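A note on making the slide’s instance compile today: on current GHC, a Monad instance requires accompanying Functor and Applicative instances, and fail has moved out of Monad (to MonadFail). A minimal runnable adaptation of the Either-based Parser might look as follows; pfail is a renamed stand-in for the slide’s fail.

```haskell
type ParseError = String
newtype Parser a = P { runP :: String -> Either ParseError (a, String) }

instance Functor Parser where
  fmap f m = P (\s -> do (a, s') <- runP m s; return (f a, s'))

instance Applicative Parser where
  pure a = P (\s -> Right (a, s))
  pf <*> pa = P (\s -> do (f, s')  <- runP pf s
                          (a, s'') <- runP pa s'
                          return (f a, s''))

instance Monad Parser where
  -- return defaults to pure; (>>=) threads the leftover string,
  -- building on the Either monad as on the slide
  m >>= f = P (\s -> do (a, s') <- runP m s; runP (f a) s')

-- plays the role of the slide's fail
pfail :: ParseError -> Parser a
pfail e = P (\_ -> Left e)
```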

Some examples of simple parsers

-- read a single char from the input stream, if possible
getc :: Parser Char
getc = P (\s -> case s of
                  (c : s') -> return (c, s')
                  ""       -> Left "unexpected end of input")

-- read a char of a specific class,
-- for use with, e.g., Data.Char.isDigit :: Char -> Bool
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = do c <- getc
               if p c then return c
                      else fail $ "unexpected '" ++ [c] ++ "'"

-- skip an expected string
string :: String -> Parser ()
string s = mapM_ (\c -> satisfy (== c)) s
-- uses mapM_ :: Monad m => (a -> m b) -> [a] -> m ()

20 / 24
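For comparison, base’s Text.ParserCombinators.ReadP (one of the two libraries in the reading materials) ships primitives in the same spirit: get, satisfy, and string (which returns the matched string rather than ()). readP_to_S runs a parser, yielding all (result, leftover-input) pairs:

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, satisfy, string, readP_to_S)

-- single digit, analogous to satisfy isDigit in the toy library
digit :: ReadP Char
digit = satisfy isDigit

-- readP_to_S digit "7!"          yields [('7', "!")]
-- readP_to_S (string "ab") "abc" yields [("ab", "c")]
```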

Alternatives

- How to try alternative productions for a nonterminal?
- Choosing between two RHSs, preferring the first:

      (<|>) :: Parser a -> Parser a -> Parser a
      p1 <|> p2 =
        P (\s -> case runP p1 s of
                   Right (a,s') -> Right (a,s')
                   Left e1 -> case runP p2 s of
                                Right (a,s') -> Right (a,s')
                                Left e2 -> Left $ e1 ++ " or " ++ e2)

- If p1 succeeds, will not try p2.
  - In most practical grammars, at most one alternative can succeed
    anyway.
  - Often, infeasible alternatives fail already on the first symbol.
  - Caution: if one alternative can be empty, it will always succeed, so
    should try it last (unless doing deep backtracking, next time).
- If both p1 and p2 fail, tries to combine the error msgs, rather than
  only returning the msg from p2 (which may be confusing).

21 / 24
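The left-biased behaviour of this <|> matches ReadP’s (<++) operator; ReadP also offers a symmetric choice (+++) that keeps the results of both alternatives. A small comparison:

```haskell
import Text.ParserCombinators.ReadP (ReadP, string, readP_to_S, (+++), (<++))

both, leftBiased :: ReadP String
both       = string "a" +++ string "ab"  -- symmetric: keeps all parses
leftBiased = string "a" <++ string "ab"  -- biased: right side only on failure
```

On input "ab", both yields two parses (with leftovers "b" and ""), while leftBiased commits to the first alternative.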

Iteration

- Often want to parse any number of consecutive occurrences of a
  grammar element (cf. “{ ··· }” in EBNF):

      many :: Parser a -> Parser [a]
      many p = do a <- p; as <- many p; return (a:as)
            <|> return []
        -- ordering matters! (for our Parser)

      pNum :: Parser Int
      pNum = do d <- satisfy isDigit; ds <- many (satisfy isDigit)
                return $ read (d:ds) -- using instance Read Int
      -- Ex: runP pNum "123!" == Right (123, "!")

- Could also define many1 (aka some): at least one occurrence.
- Danger: if p can succeed without consuming anything, many p
  will run forever!
  - In particular, many (many p) will not work.
  - Can often fix by rephrasing the grammar.
  - Or use a fancier many, which exits the iteration if the argument
    parser succeeded without consuming any input.

22 / 24

Complete parses

- In general, all parsers consume input from the start of the string
  and leave the remainder.
- To parse a complete input, should check that nothing is left after
  the start symbol has been parsed.
- Simple check when eventually getting out of the Parser monad:

      parseStringStmt :: String -> Either ParseError Stmt
      parseStringStmt s =
        case runP pStmt s of
          Right (a, "") -> Right a
          Right (_, _)  -> Left "Garbage left at end of input"
          Left e        -> Left $ "Parsing failed: " ++ e

23 / 24
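The same “nothing left over” check, written against ReadP for variety: keep only results whose remainder is empty (ReadP’s eof combinator expresses the same idea inside a parser). The names parseAll and number are invented here for illustration.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP (ReadP, munch1, readP_to_S)

-- Accept only parses that consumed the entire input.
parseAll :: ReadP a -> String -> Either String a
parseAll p s =
  case [a | (a, "") <- readP_to_S p s] of
    [a] -> Right a
    []  -> Left "Parsing failed"
    _   -> Left "Ambiguous parse"

number :: ReadP Int
number = fmap read (munch1 isDigit)  -- one or more digits, maximal munch
```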

What next

- Thursday’s lecture: more advanced features and issues:
  - Dealing with left recursion
  - Operator precedences and associativities
  - Lexing issues (esp. whitespace)
  - More on error reporting
  - Deep vs. shallow backtracking
  - …
- Assignment 1 due tomorrow evening.
  - Remember testing!
    - Not just running examples from handout and/or OnlineTA.
- Assignment 2 (out later today) will be a parser for SubScript.
- Lab session today 11–12.
  - Mostly intended for TA help with Assignment 1, and/or
    Assignment 0 resubmission.
- Some grammar exercises at the end of the Sestoft & Larsen parser
  notes.
  - But you’ll need to read the notes first!

24 / 24