PROJECT 2: Useless Symbols,
FIRST and FOLLOW sets, and
Predictive Parsing
CSE 340 SPRING 2020 Rida A. Bazzi
Project 2 Goals
▪ IhaveintroducedinclasspredictiveparsingandFIRSTandFOLLOW sets
▪ Thegoalofthisprojectistoshowyouhowtheprocessofbuildinga predictive parser can be automated
▪ Anotherimportantgoaloftheprojectistogiveyouexperiencein writing a substantial program which is non-trivial conceptually
– This will make you a better programmer
– You will have a better understanding of the power of abstraction in building code – You will have a better appreciation of the material covered so far
Outline
▪ SetOperations
▪ Grammarrepresentation
▪ Calculatinguselesssymbols
▪ Calculating FIRST sets
▪ CalculatingFOLLOWsets
▪ Determiningifagrammarhasapredictiveparser
Set Operations
▪ IncalculatingFIRSTandFOLLOWsets,youneedtorepresenttheses sets as a data structure in your program and you need to do operations on these sets
▪ Theoperationsyouneedare
– A = A ∪ (B – {Ɛ }) : Adding the elements of one set B with the exception of epsilon
to another set A and check if the set changed due to the additions
– A=A∪ {Ɛ}: Addingepsilontoasetandcheckifthesetchangedduetothe addition
– is_epsilon_in(A): Checking if epsilon belongs to a set
– printing the elements of a set according to some order
I suggest that you write a function for each of these functionalities (and others you might identify) to make your code easier to work with
Set Operations and keeping track of
change
▪ C++ has a number of libraries and data structures that can allow you to define sets. You should look at those and adopt one of them
▪ I comment on keeping track of change when adding elements of set S1 to set S2. Here is the pseudocode
for every element in S1 that is not epsilon if element is not in S2
changed = true add element to S2
In the pseudocode, changed is a Boolean variable. You can implement it as a global variable which I think will be easier than passing it around as argument. I describe how it is used in the slides on FIRST and FOLLOW (black background).
▪ I think that you should have all the functions for set operation in place before you attempt to write higher-level functi0nality. You will end up fighting less with your code
Referring to Sets
▪ Another functionality you need in calculating FIRST and FOLLOW sets is the ability to refer to something like FIRST(A) or FOLLOW(B). The rules numbered with roman numerals I through V for FIRST and FOLLOW sets that we have seen in class assume you can do that
▪ In your program, you will need
– represent terminals and non-terminals
– refer to the sets (FIRST and FOLLOW) or particular terminals and non-terminals
▪ A common approach I saw students use is to represent terminals and non- terminals as strings (remember your program will read the names of terminals and non-terminals as IDs and the lexeme string is the name)
▪ I am going to explain how you can do better than that to keep you code less cluttered
Representing Terminals and Non-Terminals
▪ You should read all terminals and non-terminals as strings and store them in a list that I will call universe. Also, the universe will include representations for epsilon (“#”) and EOF (“$”)
▪ In order to be able to refer to FIRST(A), you can use the index of A in the list, so you can say FIRST[Index(A)], where index (A) is a function that takes a string as a parameter and returns its index in the list
▪ Alternatively, you can use an unordered map for FIRST sets and another one for FOLLOW sets and refer to FIRST[A] and FOLLOW[A], where A is a string. You should lookup how to use unordered maps if you want to follow this approach
▪ Alternatively, you can have a more efficient implementation in terms of space and performance. You can store the indices and not the strings when representing grammar rules. This will effectively replace every symbol with an integer index which allows you to use FIRST[A] where A is not an index.
▪ Let us see how this can be done and then we get back to FIRST and FOLLOW
Grammar Representation
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
Rule 1
LHS 6
RHS 7 8 9 RHS_size: 3
Grammar Representation
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
Rule 2
LHS 7
RHS 10 11 RHS_size: 2
Grammar Representation
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
Rule 2
LHS 8 RHS 2 8 RHS_size: 2
Grammar Representation
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
Rule 2
LHS 8 RHS 0 RHS_size: 1
Grammar Representation Example
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
LHS: 6 RHS: 7, 8, 9
LHS: 7 RHS: 10, 11
LHS: 8 RHS: 2 , 8
LHS: 8 RHS: 0
LHS: 9 RHS: 3, 9
LHS: 9 RHS: 0
LHS: 10 RHS: 4, 10
LHS: 10 RHS: 0
LHS: 11 RHS: 5, 11
LHS: 11 RHS: 0
Rules
Grammar Representation
▪ Youneedalistofalltherules:thiscansimplybeavectorofrules
▪ EveryruleshasaLHSwhichisanintegerindex
▪ EveryrulehasaRHSwhichisavectorofintegers,oneintegerfor every symbol on the RHS
▪ ToputtheLHSandRHStogetheryoucandeclareastructurewith two fields, one for the LHS and one for the RHS
Iterating over grammar representation
▪ Onceyouhaveavectorofrules,youcaneasilyiterateoverallthe rules
▪ Also,foragivenrule,youcaneasilyiterateovertheRHS
▪ ForcalculatingFIRSTsets(seelateralso),youcannowreferto FIRST[rule.LHS] or FIRST[rule.RHS[j]], which is more convenient than writing FIRST[index(rule.LHS)] and FIRST[index(rule.RHS[j])]
▪ Havingallentriesasintegerindicesmakesthecodeeasiertowork with
▪ Thestrings(namesofvarioussymbols)areonlyneededwhenthe output is produced. To print a symbol whose index is A, you simply print Symbols[A].
Useless Symbols
A symbol is useless if it does not appear in the derivation of a string of terminals or in the derivation of the empty string
A symbol is not useless if it appears in the derivation of a string of terminals or in a derivation of the empty string
**
S ⇒ x A y ⇒ w ∈ T*
Calculating Useless Symbols
1.
We start by calculating generating symbols
– A symbol A is generating if it can derive a string in T* (sequence of zero or more terminals)
*
A ⇒ w ∈ T*
– At the end of this step, you should remove any grammar rule that has a non-generating symbol
Then we determine reachable symbols
– A symbol A is reachable if S can derive a sentential form containing A:
* S⇒xAy
– At the end of this step, you should remove all grammar rules that have non-reachable symbols
2.
The order given is important. The calculation should be done in the order given: Calculating reachable first, then calculating generating does not work
Calculating generating symbols
1. Initialization
▪ ▪
2.
all terminals are generating ɛ is generating
If A➝A1A2 …Ak isagrammarruleand A1 generating and
A2 generating and
▪
▪
▪ … and ▪ …
▪ Ak generating
A is generating
Iterative approach to calculating
generating symbols
Generating array
F
F
F
F
F
F
F
F
F
F
F
F
S➝ABC (1) A➝DE (2)
B➝bB
B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
(3)
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
Iterative approach to calculating
generating symbols: Initialization
Generating array
T
F
T
T
T
T
F
F
F
F
F
F
S➝ABC (1) A➝DE (2)
B➝bB
B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
(3)
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
F
F
F
F
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ SinceA,B,andCarenotknowntobe generating, we cannot say that S is generating, so there is no change
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
F
F
F
F
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ SinceDandEarenotknowtobe generating, we cannot say that A is generating, so there is no change
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
F
F
F
F
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ Here b is know to be generating (we can determine that by checking the array), but B is not known to be generating, so , again there is no change.
▪ Note that we do not care that the symbol B appears on the left and right sides. We just do the same check for all the rules.
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
F
F
F
F
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ Here,sinceɛisgenerating,weconclude that B is generating
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
T
F
F
F
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ B’sentrychangestotrue
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
T
F
F
F
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ Nochange
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
T
F
F
F
S➝ABC (1) A➝DE (2)
B➝bB
B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
(3)
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ Cisgenerating
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
T
T
F
F
S➝ABC (1) A➝DE (2)
B➝bB
B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
(3)
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ C’sentrychangestotrue
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
T
T
T
T
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ At the end of the first round (going over all rules), we get the array on the right
▪ Sincesomeentrieshavechanged,we need to do another round
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
T
T
T
T
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪
We examine the first rule again, but we cannot tell that S is generating because, even though B and C are generating, A is not known to be generating
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
F
T
T
T
T
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪
We examine the second rule and now we can tell that A is generating because every symbol on the RHS of the rule for A is generating.
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
T
T
T
T
T
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ SowechangeA’sentrytotrue
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
F
T
T
T
T
T
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ The remaining rules do not result in any change
▪ Butsincesomeentrieshavechangedin the second round, we need to do a third round
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
T
T
T
T
T
T
S➝ABC (1) A➝DE (2)
B➝bB
B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
(3)
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪
▪
In the third round, we determine that S is generating because all the symbols on the RHS of the rule S ➝ABC are generating and the entry for S is changed to true.
Since some entries changed in the third round, we need to do a fourth round
Iterative approach to calculating
generating symbols: Iteration
Generating array
T
F
T
T
T
T
T
T
T
T
T
T
S➝ABC (1) A➝DE (2) B➝bB (3) B➝ɛ (4) C➝ cC (5) C➝ɛ (6) D➝dD (7) D➝ɛ (8) E➝eE (9) E➝ɛ (10)
Symbols
0 1 2 3 4 5 6 7 8 9 10 11
0 1 2 3 4 5 6 7 8 9 10 11
“#”
“$”
“b”
“c”
“d”
“e”
“S”
“A”
“B”
“C”
“D”
“E”
▪ Inthefourthroundnothingchangesand we have our answer
Removing rules with non-generating
symbols
▪ Afterwecalculategeneratingsymbols,weremoveallrulesthathave a symbol that is not generating
▪ Onewaytodothisisthefollowing.Weiterateoveralltherulesinthe vector of rules
– For each rule,
▪ if every symbol in the rule is generating, push the rule to a new vector.
▪ If some symbol in the rule is not generating go do the next rule
At the end, the new vector, let us call it RulesGen contains all the grammar rules with generating symbols.
Calculating Useless Symbols
▪ Westartbycalculatinggeneratingsymbols
– A symbol is generating if it can derive a string in T* (zero or more sequence of
terminals)
▪ Thenweremoveallrulesthathaveasymbolthatisnotgenerating
▪ Wehavenowanewsetofrules,whichisRulesGen
▪ ThenwestartwiththeRulesGenvectortodeterminereachable symbols
– A symbol A is reachable if S can derive a sentential form containing the symbol:
* S⇒xAy
Calculating reachable symbols
1. 2.
S is reachable
If A➝A1A2 …Ak isagrammarrule and A is reachable
A1 andA2and … and Ak
are reachable
Calculating reachable symbols
▪ Calculationcanbedoneinawaythatissimilartohowwedid generating symbols
▪ Attheend,wehaveabooleanarrayindicatingwhichsymbolsare reachable
▪ Weremoveallrulesthathaveanon-reachablesymbol
Things to think about
▪ You should decide on the data structures you will be using. Things you need to represent are
– initial list of non-terminals
– initial list of terminals
– you should think about how these lists will be used for the various tasks and if they need to be combined into a larger list of symbols
– grammar rules: LHS, RHS
– Set representation. You should think about the operation you will need to be doing on sets
▪ Before you start coding, you should have any outline of how you will be using your data structures to implement the various tasks
▪ Before you start coding, make you you have a correct understanding of the requirements
▪ ▪
I and the TAs will be happy to look at your initial outline of how you will approach the project to give you feedback
When you start coding, we will be happy to look at your code to give you feedback. The earlier you ask the better off you will be.