COMP2100/COMP6442
Parsing
[Lecture 10]
Sid Chi-Kin Chau
1
Unstructured Data
2
Goals of This Lecture
• Motivation
• Learn how source code is interpreted, compiled, and executed
• Tokenization
• Parsing
• Context-free Grammars
• Recursive Descent Parsing
3
Parsing Text Data
• Modern data is often stored in text files
• Data is input to computers via various kinds of text files
• Markup languages: HTML, JSON, XML, Markdown…
• Programming languages: Java, C, C++, Haskell, Perl, Python, PHP, …
• Mathematical expressions, e.g., {(2+3)*4}/2
• Need to extract meaningful information for computers
• Need rules for writing and reading
4
Basic Idea of Parsing
• Also called syntax (syntactic) analysis
• Aim to understand the exact meaning of structured text
• Results in a parse tree, a representation of the structured text
• Preceded by a tokenizer
• Need a grammar to generate parse tree
• A grammar is a set of rules of the language used in the structured text
• Test whether a text conforms to the given grammar
5
Parsing Process
[Diagram: source text (e.g., "xx + yyy = zz;") → Tokenizer → tokens → Parser (using grammars) → parse trees → Interpreter/Compiler → executable file (binary)]
6
Tokenization
• Tokenization is the process of converting a sequence of characters into a sequence of tokens
• Example: natural language tokenization
• Input: "I want to tokenize this sentence."
• Output: I / want / to / tokenize / this / sentence / .
• In this case, each token is a vocabulary word or a punctuation mark
7
Tokenization of Structured Text
• A token is a string with an assigned meaning as a pair consisting of a token type and a token value
• Example:
public void myMethod(int var1)
(keyword, "public") (keyword, "void") (identifier, "myMethod") (punctuator, "(") (keyword, "int") (identifier, "var1") (punctuator, ")")
8
Token Types
• Tokens: type, location, name (if any)
Token Type                           Example
Punctuators                          ( ) ; , [ ]
Operators                            + - * :=
Keywords                             begin end if while try catch
Identifiers                          Square_Root
String literals                      "press Enter to continue"
Character literals                   'x'
Numeric literals (integer)           123
Numeric literals (floating point)    5.23e+2
9
Punctuators (Separators)
• Typically individual special characters
• Example: ( { } : . ;
• Sometimes double characters: tokenizer looks for longest token:
  • /*, //, -- : comment openers in various languages
• Returned as identity (type) of token
• Plus perhaps location for error messages and debugging purposes
10
Operators
• Like punctuators
• No real difference for tokenizer
• Tokenizers do not "process/execute" tokens
• Typically single or double special characters
• Operators
  • + - == <=
• Operations
  • :=
• Returned as type of token
• Plus perhaps location
11
Identifiers and Keywords
• Identifiers: function names, variable names
  • Length, allowed characters, separators
• Need to build a name table
  • Single entry for all occurrences of names like var1, myFunction
  • Typical data structure: hash table
• Tokenizer returns token types
  • Plus key (index) to table entry
  • Table entry includes location information
• Keywords: reserved identifiers (can be case-sensitive)
  • e.g., BEGIN END in Pascal, if in C, catch in C++
12
Literals
• Pre-defined constants in programming languages
• String literals
• “example”
• Character literals
• ‘c’
• Numeric literals
• 123 (Integer)
• 123.456 (Double)
13
Free-form vs. Fixed Format
• Free-form languages (modern ones)
  • White space does not matter. Ignore these:
    • Tabs, spaces, new lines, carriage returns
  • Only the ordering of tokens is important
  • e.g., C, C++, Java, JavaScript, ...
• Fixed-format languages (historical ones)
  • Layout is critical
  • e.g., Fortran, Python (indentation)
  • Tokenizer must know about layout to find tokens
  • Fixed format was born in the 1950s (for punched cards, one statement per card: easy to debug/maintain)
14
Case Equivalence
• Some programming languages are case-insensitive
  • Pascal, Ada
• Some programming languages are case-sensitive
  • C, Java
• Tokenizer ignores case if configured
• This_Routine == THIS_RouTine
• Error analysis of tokens may need exact cases
15
Tokenizer
• Tokenizer carries out pre-processing of the source text
• A finite-state machine (FSM) is used to track the input sequences of characters that can be contained within any recognized token
• The first non-whitespace character can be used to deduce the kind of token that follows
• Subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token
16
Finite State Machine of Tokenizer
[Diagram: FSM with a start state that loops on whitespace and branches on the first character: a-z/A-Z to an identifier state (accepting a-z, A-Z, 0-9), 0-9 to an integer state (accepting 0-9, with '.' leading to a double state), +, -, *, / to an operator state, and '(' or ')' to a parenthesis state]
17
General Strategy for Tokenization
• Tokenization aims to split the words in the source text
• Practical source text contains certain redundant information
• Tokenizer relies on simple strategies:
• Punctuation and whitespace may or may not be included in the resulting list of tokens
• All contiguous strings of alphabetic characters are part of one token; likewise with numbers
• Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters
18
General Approach
• Define a set of token types:
  • Enumeration type
  • tok_int, tok_if, tok_plus, tok_left_paren, etc.
• Tokenizer returns a pair consisting of a token name and an optional token value (a small sketch follows below)
• Some tokens carry associated data
  • e.g., location in the text
  • Key for identifier table
19
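A minimal Java sketch of such token pairs; the names TokenType and Token are illustrative assumptions, not the lecture's code:

// Illustrative token representation: a token is a (type, value) pair.
enum TokenType { TOK_INT, TOK_IF, TOK_PLUS, TOK_LEFT_PAREN, TOK_IDENTIFIER }

class Token {
    final TokenType type;   // the token name/type
    final String value;     // the optional token value, e.g. identifier text or literal text
    Token(TokenType type, String value) { this.type = type; this.value = value; }
    @Override public String toString() { return "(" + type + ", \"" + value + "\")"; }
}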
Abstract Class Tokenizer
public abstract class Tokenizer {
    // extract next token from the current text and save it
    public abstract void next();
    // return the current token (without type information)
    public abstract Object current();
    // check whether there is a token remaining in the text
    public abstract boolean hasNext();
}
20
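A concrete subclass of the abstract Tokenizer above, sketching the finite-state-machine idea from the earlier slides for simple arithmetic expressions. This is a minimal sketch, not the course's reference implementation; the class name ExpTokenizer and the choice of returning tokens as strings are assumptions.

public class ExpTokenizer extends Tokenizer {
    private final String text;
    private int pos = 0;            // index of the next unread character
    private Object currentToken;    // the most recently extracted token (null when exhausted)

    public ExpTokenizer(String text) { this.text = text; next(); }

    @Override
    public void next() {
        // stay in the start state while reading whitespace
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos == text.length()) { currentToken = null; return; }
        char c = text.charAt(pos);
        if (Character.isDigit(c)) {     // integer state: keep reading digits
            int start = pos;
            while (pos < text.length() && Character.isDigit(text.charAt(pos))) pos++;
            currentToken = text.substring(start, pos);
        } else {                        // operator or parenthesis: a single-character token
            currentToken = Character.toString(c);
            pos++;
        }
    }

    @Override
    public Object current() { return currentToken; }

    @Override
    public boolean hasNext() { return currentToken != null; }   // a token is still available
}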
What is a Grammar?
• Grammar is a formal method to describe a (textual) language
• Simple solution: define a finite set of all acceptable sentences
• Problems:
• Large space/memory complexity
• What to do with infinite languages?
• Practical solution: a finite recipe to describe all acceptable sentences
• A grammar is a finite description of a possibly infinite set of acceptable sentences
21
Grammars
• Grammars formally specify how languages are constructed
• A grammar specifies the rules of a language
• Production rule: “a → b” means that a can be rewritten as b
• Example, simplified English grammar:
  <sentence> → <article> <noun> <verb>
  <article> → a | the
  <noun> → cat | dog
  <verb> → runs | sleeps
22
Example
• Derivation of string “the dog sleeps” from the simplified English grammar
Derivation                          Grammar Rule
<sentence>
⇒ <article> <noun> <verb>           <sentence> → <article> <noun> <verb>
⇒ the <noun> <verb>                 <article> → the
⇒ the dog <verb>                    <noun> → dog
⇒ the dog sleeps                    <verb> → sleeps
23
Example
• Derivation of string “a cat runs” from the simplified English grammar:
Derivation                          Grammar Rule
<sentence>
⇒ <article> <noun> <verb>           <sentence> → <article> <noun> <verb>
⇒ a <noun> <verb>                   <article> → a
⇒ a cat <verb>                      <noun> → cat
⇒ a cat runs                        <verb> → runs
24
Language of Grammar
• The language is all possible derivations from a given grammar
• Example, the language of the simplified English grammar (a small enumeration sketch follows):
  L = { "a cat runs", "a cat sleeps", "the cat runs", "the cat sleeps",
        "a dog runs", "a dog sleeps", "the dog runs", "the dog sleeps" }
25
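A small sketch that enumerates this language by expanding each alternative of the grammar (the class name is illustrative):

public class EnglishLanguage {
    public static void main(String[] args) {
        String[] articles = {"a", "the"};        // <article> → a | the
        String[] nouns    = {"cat", "dog"};      // <noun> → cat | dog
        String[] verbs    = {"runs", "sleeps"};  // <verb> → runs | sleeps
        for (String art : articles)
            for (String noun : nouns)
                for (String verb : verbs)
                    System.out.println(art + " " + noun + " " + verb);  // prints all 8 sentences of L
    }
}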
Chomsky’s Grammar Hierarchy
• Type-0: Recursively Enumerable
• Specified by a Turing machine (any computer)
• Type-1: Context-sensitive
  • Rules: αAβ → αγβ
• Type-2: Context-free
  • Rules: A → γ
• Type-3: Regular
• Specified by a finite state machine
26
Context Free Grammars
• Grammars provide a precise way of specifying languages
• A context free grammar is often used to define the syntax
• A context free grammar is specified via a set of production rules
• Specified by → or ::=
• Variables (or non-terminals; surrounded with <>)
  • Example: <sentence>, <noun>
• Terminals (symbols)
• Example: a, cat, runs
• Alternatives ( | )
• Example: <article> → a | the
27
Example
• Grammar for integers:
  <integer> → <digit> | <digit> <integer>
  <digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• This generates {1, 2, 123, 001, 123001, 76860, 00000000, …}
• Exercise: What is the grammar for an unsigned integer number that is not starting with ‘0’ character?
28
Exercise
• Grammar of matching parentheses:
• ()(()()) : Acceptable/Not Acceptable?
• (())()((())) : Acceptable/Not Acceptable?
• ((()) : Acceptable/Not Acceptable?
• ))()(( : Acceptable/Not Acceptable?
• ())(() : Acceptable/Not Acceptable?
• ))()(()( : Acceptable/Not Acceptable?
<S> → <S><S> | (<S>) | ()
<S> → <S><S> | )<S>( | ()
(A simple balanced-parentheses check is sketched below.)
29
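One way to answer the exercise programmatically, sketched under the assumption that we only need a yes/no membership test for the matching-parentheses language (a counter rather than an actual parser for the grammar):

// Returns true iff s is a balanced (matching) sequence of parentheses.
public static boolean isBalanced(String s) {
    int depth = 0;
    for (char c : s.toCharArray()) {
        if (c == '(') depth++;
        else if (c == ')') depth--;
        else return false;            // only '(' and ')' are allowed
        if (depth < 0) return false;  // a ')' appeared before its matching '('
    }
    return depth == 0;                // every '(' must eventually be closed
}
// e.g. isBalanced("()(()())") == true, isBalanced("((())") == false, isBalanced("))()((") == false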
Online Demo
• Web-based grammar generator
• https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/
• Useful for online learning
30
Formal Definitions
• Grammar: G=(V,T,S,P)
• V: Set of variables
• T: Set of terminal symbols
• S: Start variable
• P: Set of production rules
• Example (the integer grammar from earlier):
  • V = { <integer>, <digit> }
  • T = {0,1,2,3,4,5,6,7,8,9}
  • S = <integer>
  • P = { <integer> → <digit> | <digit> <integer>,  <digit> → 0 | 1 | … | 9 }
31
Grammar for Mathematical Expressions
• Mathematical expressions can be generated by the following grammar:
• V = { <Exp> }
• T = {a, +, ∗, (, )}
  • Let's assume 'a' can be any number
• S = <Exp>
• P = { <Exp> → <Exp> + <Exp> | <Exp> ∗ <Exp> | (<Exp>) | a }
32
Example
• Derivation of expression “a + a * a” from the grammar
<Exp> ⇒ <Exp> + <Exp>
      ⇒ a + <Exp>
      ⇒ a + <Exp> ∗ <Exp>
      ⇒ a + a ∗ <Exp>
      ⇒ a + a ∗ a
[Parse tree diagram]
33
Evaluate Expression via Parse Tree
• The parse tree defines an evaluation order
• The expression is evaluated from the leaves to the root
• Intermediate variables are replaced by terminal values iteratively
[Diagram: parse tree for 2 + 2 ∗ 2 evaluated bottom-up; the leaves 2, 2, 2 give 2 ∗ 2 = 4 and then 2 + 4 = 6]
34
Ambiguity
• The grammar is ambiguous because there are two possible derivations!
[Diagram: two parse trees for 2 + 2 ∗ 2 under the ambiguous grammar, one evaluating to 2 + (2 ∗ 2) = 6 and the other to (2 + 2) ∗ 2 = 8]
35
Ambiguous Grammar
• A context free grammar G is ambiguous
• If there is a string w from the language of grammar G which has
  • More than one derivation tree
• But it is possible to remove ambiguity by restructuring the grammar
Ambiguous grammar:      <Exp> → <Exp> + <Exp> | <Exp> ∗ <Exp> | (<Exp>) | a
Non-ambiguous grammar:  <Exp> → <Term> + <Exp> | <Term>
                        <Term> → <Factor> ∗ <Term> | <Factor>
                        <Factor> → (<Exp>) | a
36
Example
• Derivation of expression “a + a * a” from the grammar
Derivation                          Grammar Production Rule
<Exp> ⇒ <Term> + <Exp>              <Exp> → <Term> + <Exp>
      ⇒ <Factor> + <Exp>            <Term> → <Factor>
      ⇒ a + <Exp>                   <Factor> → a
      ⇒ a + <Term>                  <Exp> → <Term>
      ⇒ a + <Factor> ∗ <Term>       <Term> → <Factor> ∗ <Term>
      ⇒ a + a ∗ <Term>              <Factor> → a
      ⇒ a + a ∗ <Factor>            <Term> → <Factor>
      ⇒ a + a ∗ a                   <Factor> → a
[Parse tree diagram]
37
Example
• Check the non-ambiguous grammar with the web-based grammar tool
• https://web.stanford.edu/class/archive/cs/cs103/cs103.1156/tools/cfg/
38
Implementing Parser
• Recursive descent parser
• Top-down parser
• Parse from the start variable, recursively parsing the input tokens
• Create a method for each left-hand side variable in the grammar production rules
• These methods are responsible for generating parsed nodes
39
Variable (or Symbol) as a node
• To construct a parse tree, we can adopt ideas from a tree data structure
• Each variable (or symbol) can be represented as a node in a tree
• Define a node class for each variable
40
Implement a Parser
• Given a grammar starting with <Exp>:
  <Exp> → <Term> + <Exp> | <Term>
  <Term> → <Factor> ∗ <Term> | <Factor>
  <Factor> → (<Exp>) | <Int>
• We implement based on Recursive Descent Parsing
• Node classes (a sketch with an evaluate() method follows below)
  • abstract class Exp
  • class Term extends Exp
  • class Factor extends Exp
  • class Int extends Exp
  • class AddExp extends Exp
  • class MulExp extends Exp
• Parse method for each rule in the Parser class
  • Exp parseExp(Tokenizer)
  • Exp parseTerm(Tokenizer)
  • Exp parseFactor(Tokenizer)
41
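A minimal sketch of some of these node classes; the value fields and the evaluate() method are assumptions added to show how a parse tree can be evaluated from the leaves to the root (as on the earlier evaluation slide), not the lecture's exact definitions:

abstract class Exp {
    abstract int evaluate();                       // evaluate the subtree rooted at this node
}

class Int extends Exp {
    int value;
    Int(int value) { this.value = value; }
    int evaluate() { return value; }               // a leaf evaluates to its own value
}

class AddExp extends Exp {
    Exp left, right;
    AddExp(Exp left, Exp right) { this.left = left; this.right = right; }
    int evaluate() { return left.evaluate() + right.evaluate(); }   // evaluate children first
}

class MulExp extends Exp {
    Exp left, right;
    MulExp(Exp left, Exp right) { this.left = left; this.right = right; }
    int evaluate() { return left.evaluate() * right.evaluate(); }
}

Term and Factor would be defined in the same way; with this hierarchy, new AddExp(term, exp) in parseExp accepts any pair of Exp subtrees.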
Implement a Parser
public class Parser {
    // parse <Exp>
    public static Exp parseExp(Tokenizer tok) { . . . }
    // parse <Term>
    public static Exp parseTerm(Tokenizer tok) { . . . }
    // parse <Factor>
    public static Exp parseFactor(Tokenizer tok) { . . . }
}
42
Implement a Parser
• To implement rule "<Exp> → <Term> + <Exp> | <Term>":

public static Exp parseExp(Tokenizer tok) {
    Term term = parseTerm(tok);        // this production rule starts with Term
    if (tok.current() == '+') {        // if the next token is +, apply the first production rule
        tok.next();
        Exp exp = parseExp(tok);
        return new AddExp(term, exp);  // addition between Term and Exp
    } else {
        return term;
    }
}
43
Implement a Parser
• To implement rule "<Term> → <Factor> ∗ <Term> | <Factor>":

public static Exp parseTerm(Tokenizer tok) {
    Factor factor = parseFactor(tok);     // this production rule starts with Factor
    if (tok.current() == '*') {           // if the next token is *, apply the first production rule
        tok.next();
        Term term = parseTerm(tok);
        return new MulExp(factor, term);  // multiplication between Factor and Term
    } else {
        return factor;
    }
}
44
Implement a Parser
• To implement rule "<Factor> → (<Exp>) | <Int>":

public static Exp parseFactor(Tokenizer tok) {
    if (tok.current() == '(') {          // if the next token is '(', apply the first production rule
        tok.next();
        Exp exp = parseExp(tok);
        tok.next();                      // remove ')'
        return exp;
    } else {
        Int i = new Int(tok.current());  // create an integer
        return i;
    }
}
45
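Putting the three parse methods together, a hedged usage sketch (assuming a tokenizer for integer expressions such as the ExpTokenizer sketched earlier, and assuming its tokens compare correctly against the characters tested in the parse methods):

Tokenizer tok = new ExpTokenizer("2 + 2 * 2");
Exp root = Parser.parseExp(tok);       // builds AddExp(Int(2), MulExp(Int(2), Int(2)))
System.out.println(root.evaluate());   // prints 6 under the non-ambiguous grammar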
Limitations
• Recursive Descent Parsing: Top-down parser from recursive procedures • The Recursive Descent Parsing approach will not work with left recursive
grammars. For example,
• which cannot be parsed using the recursive descent parser • However we could transform the grammar into:
• which represents the same language, but can be parsed by recursive descent parser
46
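A sketch of why left recursion defeats this approach: if parseExp were written for the left-recursive rule, it would have to call itself before consuming any token. The method name here is hypothetical.

// Hypothetical (broken) method for the left-recursive rule <Exp> → <Exp> + <Term>
public static Exp parseExpLeftRecursive(Tokenizer tok) {
    Exp exp = parseExpLeftRecursive(tok);   // recurses immediately with no token consumed: never terminates
    // ... the '+' and the <Term> would only be handled after this call returns, which it never does
    return exp;
}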
Automatic Parser Generation
[Diagram: as on slide 6, but the Grammars are fed to a Parser Generator, which produces the Parser; source text → Tokenizer → tokens → Parser → parse trees → Interpreter/Compiler → executable file (binary)]
47
Automatic Parser Generation
• Parser Generators
• JavaCC, ANTLR, Yacc/Bison
• Manual Parser Implementation
• Good error recovery and reporting
• Flexible combination of parsing and actions
• Automatic Parser Generation
• Saves manual work
• But frameworks are more complex and rigid
• Error recovery and reporting are more difficult
48
Summary
• Structured data
• Markup languages: HTML, JSON, XML, Markdown…
• Programming languages: Java, C, C++, Haskell, Perl, Python, PHP, …
• Mathematical expressions, e.g., {(2+3)*4}/2
• Tokenization
• Parsing
• Context-free Grammars
• Recursive Descent Parsing
49
50