COMP2100/COMP6442
Parsing
[Lecture 10]
Sid Chi-Kin Chau
1
Unstructured Data
2
Goals of This Lecture
• Motivation
• Learn how source code is interpreted, compiled, and executed
• Tokenization
• Parsing
• Context-free Grammars
• Recursive Descent Parsing
3
Parsing Text Data
• Modern data is often stored in text files
• Data is input to computers via various kinds of text files
• Markup languages: HTML, JSON, XML, Markdown…
• Programming languages: Java, C, C++, Haskell, Perl, Python, PHP, …
• Mathematical expressions, e.g., {(2+3)*4}/2
• Need to extract meaningful information for computers
• Need rules for writing and reading
4
Basic Idea of Parsing
• Also called syntax (syntactic) analysis
• Aim to understand the exact meaning of structured text
• Results in a parse tree, a representation of the structured text
• Preceded by a tokenizer
• Need a grammar to generate parse tree
• A grammar is a set of rules of the language used in the structured text
• Test whether a text conforms to the given grammar
5
Parsing Process
[Diagram: source text (e.g., "xx + yyy = zz;") → Tokenizer → tokens → Parser (using grammars) → parse trees → Interpreter/Compiler → executable file (binary)]
6
Tokenization
• Tokenization is the process of converting a sequence of characters into a sequence of tokens
• Example: natural language tokenization
• Input: "I want to tokenize this sentence."
• Output: I / want / to / tokenize / this / sentence / .
• In this case, each token is a vocabulary word or a punctuation mark
7
Tokenization of Structured Text
• A token is a string with an assigned meaning as a pair consisting of a token type and a token value
• Example:
public void myMethod(int var1)
(keyword, "public") (keyword, "void") (identifier, "myMethod") (punctuator, "(") (keyword, "int") (identifier, "var1") (punctuator, ")")
8
Token Types
• Tokens: type, location, name (if any)
Token Type                           Example
Punctuators                          ( ) ; , [ ]
Operators                            + - * :=
Keywords                             begin end if while try catch
Identifiers                          Square_Root
String literals                      "press Enter to continue"
Character literals                   'x'
Numeric literals (integer)           123
Numeric literals (floating point)    5.23e+2
9
Punctuators (Separators)
• Typically individual special characters
• Example: ( { } : . ;
• Sometimes double characters: tokenizer looks for longest token:
  • /*, //, -- : comment openers in various languages
• Returned as identity (type) of token
• Plus perhaps location for error messages and debugging purposes
10
Operators
• Like punctuators
• No real difference for tokenizer
• Tokenizers do not "process/execute" tokens
• Typically single or double special characters
• Operators
  • + - == <=
• Operations
  • :=
• Returned as type of token
• Plus perhaps location
11
Identifiers and Keywords
• Identifiers: function names, variable names
  • Length, allowed characters, separators
• Need to build a name table
  • Single entry for all occurrences of names like var1, myFunction
  • Typical data structure: hash table
• Tokenizer returns token types
  • Plus key (index) to table entry
  • Table entry includes location information
• Keywords: reserved identifiers (can be case-sensitive)
  • e.g., BEGIN END in Pascal, if in C, catch in C++
12
Literals
• Pre-defined constants in programming languages
• String literals
• “example”
• Character literals
• ‘c’
• Numeric literals
• 123 (Integer)
• 123.456 (Double)
13
Free-form vs. Fixed Format
• Free-form languages (modern ones)
  • White space does not matter. Ignore these:
    • Tabs, spaces, new lines, carriage returns
  • Only the ordering of tokens is important
  • e.g., C, C++, Java, JavaScript, ...
• Fixed-format languages (historical ones)
  • Layout is critical
  • e.g., Fortran, Python (indentation)
  • Tokenizer must know about layout to find tokens
  • Fixed format was born in the 1950s (for punched cards, one statement per card: easy to debug/maintain)
14
Case Equivalence
• Some programming languages are case-insensitive
  • Pascal, Ada
• Some programming languages are case-sensitive
  • C, Java
• Tokenizer ignores case if configured
• This_Routine == THIS_RouTine
• Error analysis of tokens may need exact cases
15
Tokenizer
• Tokenizer carries out pre-processing of the source text
• A finite-state machine (FSM) is used to track the input sequences of characters that can be contained within any recognized token
• The first non-whitespace character can be used to deduce the kind of token that follows
• Subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token
16
Finite State Machine of Tokenizer
[Diagram: FSM with a start state that loops on whitespace and branches on the first character: a-z/A-Z to an identifier state (accepting a-z, A-Z, 0-9), 0-9 to an integer state (accepting 0-9, with '.' leading to a double state), +, -, *, / to an operator state, and '(' or ')' to a parenthesis state]
17
General Strategy for Tokenization
• Tokenization aims to split the words in the source text
• Practical source text contains certain redundant information
• Tokenizer relies on simple strategies:
• Punctuation and whitespace may or may not be included in the resulting list of tokens
• All contiguous strings of alphabetic characters are part of one token; likewise with numbers
• Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters
18
General Approach
• Define a set of token types:
  • Enumeration type
  • tok_int, tok_if, tok_plus, tok_left_paren, etc.
• Tokenizer returns a pair consisting of a token name and an optional token value (a small sketch follows below)
• Some tokens carry associated data
  • e.g., location in the text
  • Key for identifier table
19
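A minimal Java sketch of such token pairs; the names TokenType and Token are illustrative assumptions, not the lecture's code:

// Illustrative token representation: a token is a (type, value) pair.
enum TokenType { TOK_INT, TOK_IF, TOK_PLUS, TOK_LEFT_PAREN, TOK_IDENTIFIER }

class Token {
    final TokenType type;   // the token name/type
    final String value;     // the optional token value, e.g. identifier text or literal text
    Token(TokenType type, String value) { this.type = type; this.value = value; }
    @Override public String toString() { return "(" + type + ", \"" + value + "\")"; }
}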
Abstract Class Tokenizer
public abstract class Tokenizer {
    // extract next token from the current text and save it
    public abstract void next();
    // return the current token (without type information)
    public abstract Object current();
    // check whether there is a token remaining in the text
    public abstract boolean hasNext();
}
20
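A concrete subclass of the abstract Tokenizer above, sketching the finite-state-machine idea from the earlier slides for simple arithmetic expressions. This is a minimal sketch, not the course's reference implementation; the class name ExpTokenizer and the choice of returning tokens as strings are assumptions.

public class ExpTokenizer extends Tokenizer {
    private final String text;
    private int pos = 0;            // index of the next unread character
    private Object currentToken;    // the most recently extracted token (null when exhausted)

    public ExpTokenizer(String text) { this.text = text; next(); }

    @Override
    public void next() {
        // stay in the start state while reading whitespace
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos == text.length()) { currentToken = null; return; }
        char c = text.charAt(pos);
        if (Character.isDigit(c)) {     // integer state: keep reading digits
            int start = pos;
            while (pos < text.length() && Character.isDigit(text.charAt(pos))) pos++;
            currentToken = text.substring(start, pos);
        } else {                        // operator or parenthesis: a single-character token
            currentToken = Character.toString(c);
            pos++;
        }
    }

    @Override
    public Object current() { return currentToken; }

    @Override
    public boolean hasNext() { return currentToken != null; }   // a token is still available
}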
What is a Grammar?
• Grammar is a formal method to describe a (textual) language
• Simple solution: define a finite set of all acceptable sentences
• Problems:
• Large space/memory complexity
• What to do with infinite languages?
• Practical solution: a finite recipe to describe all acceptable sentences
• A grammar is a finite description of a possibly infinite set of acceptable sentences
21
Grammars
• Grammars formally specify how languages are constructed
• A grammar specifies the rules of a language
• Production rule: “a → b” means that a can be rewritten as b
• Example, simplified English grammar:
  <sentence> → <article> <noun> <verb>
  <article> → a | the
  <noun> → cat | dog
  <verb> → runs | sleeps
22
Example
• Derivation of string “the dog sleeps” from the simplified English grammar
Derivation                          Grammar Rule
<sentence>
⇒ <article> <noun> <verb>           <sentence> → <article> <noun> <verb>
⇒ the <noun> <verb>                 <article> → the
⇒ the dog <verb>                    <noun> → dog
⇒ the dog sleeps                    <verb> → sleeps
23
Example
• Derivation of string “a cat runs” from the simplified English grammar:
Derivation                          Grammar Rule
<sentence>
⇒ <article> <noun> <verb>           <sentence> → <article> <noun> <verb>
⇒ a <noun> <verb>                   <article> → a
⇒ a cat <verb>                      <noun> → cat
⇒ a cat runs                        <verb> → runs
24
Language of Grammar
• The language is all possible derivations from a given grammar
• Example, the language of the simplified English grammar (a small enumeration sketch follows):
  L = { "a cat runs", "a cat sleeps", "the cat runs", "the cat sleeps",
        "a dog runs", "a dog sleeps", "the dog runs", "the dog sleeps" }
25
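A small sketch that enumerates this language by expanding each alternative of the grammar (the class name is illustrative):

public class EnglishLanguage {
    public static void main(String[] args) {
        String[] articles = {"a", "the"};        // <article> → a | the
        String[] nouns    = {"cat", "dog"};      // <noun> → cat | dog
        String[] verbs    = {"runs", "sleeps"};  // <verb> → runs | sleeps
        for (String art : articles)
            for (String noun : nouns)
                for (String verb : verbs)
                    System.out.println(art + " " + noun + " " + verb);  // prints all 8 sentences of L
    }
}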
Chomsky’s Grammar Hierarchy
• Type-0: Recursively Enumerable
• Specified by a Turing machine (any computer)
• Type-1: Context-sensitive
  • Rules: αAβ → αγβ
• Type-2: Context-free
  • Rules: A → γ
• Type-3: Regular
• Specified by a finite state machine
26
Context Free Grammars
• Grammars provide a precise way of specifying languages
• A context free grammar is often used to define the syntax
• A context free grammar is specified via a set of production rules
• Specified by → or ::=
• Variables (or non-terminals; surrounded with <>)
  • Example: <sentence>, <noun>
• Terminals (symbols)
• Example: a, cat, runs
• Alternatives ( | )
• Example: <article> → a | the
27
Example
• Grammar for integers:
  <integer> → <digit> | <digit> <integer>
  <digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• This generates {1, 2, 123, 001, 123001, 76860, 00000000, …}
• Exercise: What is the grammar for an unsigned integer number that is not starting with ‘0’ character?
28
Exercise
• Grammar of matching parentheses:
• ()(()()) : Acceptable/Not Acceptable?
• (())()((())) : Acceptable/Not Acceptable?
• ((()) : Acceptable/Not Acceptable?
• ))()(( : Acceptable/Not Acceptable?
• ())(() : Acceptable/Not Acceptable?
• ))()(()( : Acceptable/Not Acceptable?
<S> → <S><S> | (<S>) | ()
<S> → <S><S> | )<S>( | ()
(A simple balanced-parentheses check is sketched below.)
29
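One way to answer the exercise programmatically, sketched under the assumption that we only need a yes/no membership test for the matching-parentheses language (a counter rather than an actual parser for the grammar):

// Returns true iff s is a balanced (matching) sequence of parentheses.
public static boolean isBalanced(String s) {
    int depth = 0;
    for (char c : s.toCharArray()) {
        if (c == '(') depth++;
        else if (c == ')') depth--;
        else return false;            // only '(' and ')' are allowed
        if (depth < 0) return false;  // a ')' appeared before its matching '('
    }
    return depth == 0;                // every '(' must eventually be closed
}
// e.g. isBalanced("()(()())") == true, isBalanced("((())") == false, isBalanced("))()((") == false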
Online Demo
• Web-based grammar generator
• https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/
• Useful for online learning
30
Formal Definitions
• Grammar: G=(V,T,S,P)
• V: Set of variables
• T: Set of terminal symbols
• S: Start variable
• P: Set of production rules
• Example (the integer grammar from earlier):
  • V = { <integer>, <digit> }
  • T = {0,1,2,3,4,5,6,7,8,9}
  • S = <integer>
  • P = { <integer> → <digit> | <digit> <integer>,  <digit> → 0 | 1 | … | 9 }
31
Grammar for Mathematical Expressions
• Mathematical expressions can be generated by the following grammar:
• V = { <Exp> }
• T = {a, +, ∗, (, )}
  • Let's assume 'a' can be any number
• S = <Exp>
• P = { <Exp> → <Exp> + <Exp> | <Exp> ∗ <Exp> | (<Exp>) | a }
32
Example
• Derivation of expression “a + a * a” from the grammar
<Exp> ⇒ <Exp> + <Exp>
      ⇒ a + <Exp>
      ⇒ a + <Exp> ∗ <Exp>
      ⇒ a + a ∗ <Exp>
      ⇒ a + a ∗ a
[Parse tree diagram]
33
Evaluate Expression via Parse Tree
• The parse tree defines an evaluation order
• The expression is evaluated from the leaves to the root
• Intermediate variables are replaced by terminal values iteratively
[Diagram: parse tree for 2 + 2 ∗ 2 evaluated bottom-up; the leaves 2, 2, 2 give 2 ∗ 2 = 4 and then 2 + 4 = 6]
34
Ambiguity
• The grammar is ambiguous because there are two possible derivations!
[Diagram: two parse trees for 2 + 2 ∗ 2 under the ambiguous grammar, one evaluating to 2 + (2 ∗ 2) = 6 and the other to (2 + 2) ∗ 2 = 8]
35
Ambiguous Grammar
• A context free grammar G is ambiguous
• If there is a string w from the language of grammar G which has
  • More than one derivation tree
• But it is possible to remove ambiguity by restructuring the grammar
Ambiguous grammar:      <Exp> → <Exp> + <Exp> | <Exp> ∗ <Exp> | (<Exp>) | a
Non-ambiguous grammar:  <Exp> → <Term> + <Exp> | <Term>
                        <Term> → <Factor> ∗ <Term> | <Factor>
                        <Factor> → (<Exp>) | a
36
Example
• Derivation of expression “a + a * a” from the grammar
Derivation                          Grammar Production Rule
<Exp> ⇒ <Term> + <Exp>              <Exp> → <Term> + <Exp>
      ⇒ <Factor> + <Exp>            <Term> → <Factor>
      ⇒ a + <Exp>                   <Factor> → a
      ⇒ a + <Term>                  <Exp> → <Term>
      ⇒ a + <Factor> ∗ <Term>       <Term> → <Factor> ∗ <Term>
      ⇒ a + a ∗ <Term>              <Factor> → a
      ⇒ a + a ∗ <Factor>            <Term> → <Factor>
      ⇒ a + a ∗ a                   <Factor> → a
[Parse tree diagram]
37
Example
• Check the non-ambiguous grammar with the web-based grammar tool
• https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/
38
Implementing Parser
• Recursive descent parser
• Top-down parser
• Parse from the start variable, recursively parsing the input tokens
• Create a method for each left-hand side variable in the grammar production rules
• These methods are responsible for generating parsed nodes
39
Variable (or Symbol) as a node
• To construct a parse tree, we can adopt ideas from a tree data structure
• Each variable (or symbol) can be represented as a node in a tree
• Define a node class for each variable
40
Implement a Parser
• Given a grammar starting with <Exp>:
  <Exp> → <Term> + <Exp> | <Term>
  <Term> → <Factor> ∗ <Term> | <Factor>
  <Factor> → (<Exp>) | <Int>
• We implement based on Recursive Descent Parsing
• Node classes (a sketch with an evaluate() method follows below)
  • abstract class Exp
  • class Term extends Exp
  • class Factor extends Exp
  • class Int extends Exp
  • class AddExp extends Exp
  • class MulExp extends Exp
• Parse method for each rule in the Parser class
  • Exp parseExp(Tokenizer)
  • Exp parseTerm(Tokenizer)
  • Exp parseFactor(Tokenizer)
41
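A minimal sketch of some of these node classes; the value fields and the evaluate() method are assumptions added to show how a parse tree can be evaluated from the leaves to the root (as on the earlier evaluation slide), not the lecture's exact definitions:

abstract class Exp {
    abstract int evaluate();                       // evaluate the subtree rooted at this node
}

class Int extends Exp {
    int value;
    Int(int value) { this.value = value; }
    int evaluate() { return value; }               // a leaf evaluates to its own value
}

class AddExp extends Exp {
    Exp left, right;
    AddExp(Exp left, Exp right) { this.left = left; this.right = right; }
    int evaluate() { return left.evaluate() + right.evaluate(); }   // evaluate children first
}

class MulExp extends Exp {
    Exp left, right;
    MulExp(Exp left, Exp right) { this.left = left; this.right = right; }
    int evaluate() { return left.evaluate() * right.evaluate(); }
}

Term and Factor would be defined in the same way; with this hierarchy, new AddExp(term, exp) in parseExp accepts any pair of Exp subtrees.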
Implement a Parser
public class Parser {
    // parse <Exp>
    public static Exp parseExp(Tokenizer tok) { . . . }
    // parse <Term>
    public static Exp parseTerm(Tokenizer tok) { . . . }
    // parse <Factor>
    public static Exp parseFactor(Tokenizer tok) { . . . }
}
42
Implement a Parser
• To implement rule "<Exp> → <Term> + <Exp> | <Term>":

public static Exp parseExp(Tokenizer tok) {
    Term term = parseTerm(tok);        // this production rule starts with Term
    if (tok.current() == '+') {        // if the next token is +, apply the first production rule
        tok.next();
        Exp exp = parseExp(tok);
        return new AddExp(term, exp);  // addition between Term and Exp
    } else {
        return term;
    }
}
43
Implement a Parser
• To implement rule "<Term> → <Factor> ∗ <Term> | <Factor>":

public static Exp parseTerm(Tokenizer tok) {
    Factor factor = parseFactor(tok);     // this production rule starts with Factor
    if (tok.current() == '*') {           // if the next token is *, apply the first production rule
        tok.next();
        Term term = parseTerm(tok);
        return new MulExp(factor, term);  // multiplication between Factor and Term
    } else {
        return factor;
    }
}
44
Implement a Parser
• To implement rule "<Factor> → (<Exp>) | <Int>":

public static Exp parseFactor(Tokenizer tok) {
    if (tok.current() == '(') {          // if the next token is '(', apply the first production rule
        tok.next();
        Exp exp = parseExp(tok);
        tok.next();                      // remove ')'
        return exp;
    } else {
        Int i = new Int(tok.current());  // create an integer
        return i;
    }
}
45
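Putting the three parse methods together, a hedged usage sketch (assuming a tokenizer for integer expressions such as the ExpTokenizer sketched earlier, and assuming its tokens compare correctly against the characters tested in the parse methods):

Tokenizer tok = new ExpTokenizer("2 + 2 * 2");
Exp root = Parser.parseExp(tok);       // builds AddExp(Int(2), MulExp(Int(2), Int(2)))
System.out.println(root.evaluate());   // prints 6 under the non-ambiguous grammar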
Limitations
• Recursive Descent Parsing: Top-down parser from recursive procedures • The Recursive Descent Parsing approach will not work with left recursive
grammars. For example,
• which cannot be parsed using the recursive descent parser • However we could transform the grammar into:
• which represents the same language, but can be parsed by recursive descent parser
46
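A sketch of why left recursion defeats this approach: if parseExp were written for the left-recursive rule, it would have to call itself before consuming any token. The method name here is hypothetical.

// Hypothetical (broken) method for the left-recursive rule <Exp> → <Exp> + <Term>
public static Exp parseExpLeftRecursive(Tokenizer tok) {
    Exp exp = parseExpLeftRecursive(tok);   // recurses immediately with no token consumed: never terminates
    // ... the '+' and the <Term> would only be handled after this call returns, which it never does
    return exp;
}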
Automatic Parser Generation
[Diagram: as on slide 6, but the Grammars are fed to a Parser Generator, which produces the Parser; source text → Tokenizer → tokens → Parser → parse trees → Interpreter/Compiler → executable file (binary)]
47
Automatic Parser Generation
• Parser Generators
• JavaCC, ANTLR, Yacc/Bison
• Manual Parser Implementation
• Good error recovery and reporting
• Flexible combination of parsing and actions
• Automatic Parser Generation
• Saves manual work
• But frameworks are more complex and rigid
• Error recovery and reporting are more difficult
48
Summary
• Structured data
• Markup languages: HTML, JSON, XML, Markdown…
• Programming languages: Java, C, C++, Haskell, Perl, Python, PHP, …
• Mathematical expressions, e.g., {(2+3)*4}/2
• Tokenization
• Parsing
• Context-free Grammars
• Recursive Descent Parsing
49
50