Compilers and computer architecture: From strings to ASTs (1): lexing
Martin Berger 1 October 2019
1Email: M.F.Berger@sussex.ac.uk, Office hours: Wed 12-13 in Chi-2R312
1/1
Recall the function of compilers
2/1
Plan for the next 9 weeks
Source program
Lexical analysis
Syntax analysis (parsing)
Semantic analysis, e.g. type checking
Intermediate code generation
Optimisation
Code generation
Translated program
Remember the shape of compilers?
For the next 9 weeks or so, we will explore this pipeline step-by-step, starting with lexing.
3/1
From strings to ASTs
The purposes of the lexing and parsing phase is twofold.
Toconverttheinputfromstrings(arepresentationthatis convenient for humans) to an abstract syntax tree (AST), a representation that is convenient for (type-checking and) code generation.
Checksyntacticcorrectness.
4/1
From strings to ASTs
We will later define in details what ASTs are, but for now, a picture says more than 1000 words. We want to go from the representation on the left to that on the right:
while( n > 0 ){
n–;
res *= 2; }
T_while
T_greater
T_var ( n ) T_num ( 0 )
T_semicolon
T_decrement
T_var ( n )
T_var ( res )
T_var ( res )
T_update
T_mult
T_num ( 2 )
5/1
From strings to ASTs
T_while
T_greater
T_var ( n ) T_num ( 0 )
T_semicolon
T_decrement
T_var ( n )
T_var ( res )
T_var ( res )
T_update
T_mult
T_num ( 2 )
while( n > 0 ){
n–;
res *= 2; }
Why? To make the backend and type-checking much easier and faster: type-checking and code generation need acces to components of the program. E.g. to generate code for if C then Q else R we need to generate code for C. Q and R. This is difficult directly from strings. The AST is a data structure optimised for making this simple. ASTs use pointers (references) to point to program components. CPUs can maninpulate pointers very efficiently.
6/1
From strings to ASTs
It is possible to go in from strings to ASTs in one step (called scannerless parsing). But that is complicated. Most compilers use two steps.
Lexicalanalysis(lexing).
Syntacticalanalysis(parsing).
Another example of divide-and-conquer.
This will involve a bit of language theory and automata, some of which you have already seen last year in one of the introductory courses. You will also learn about tools for generating lexers and parsers.
Knowing about this, (e.g. regular expressions, context-free languages) is one of the most useful things to know as a programmer, even if you’ll never write a compiler: just about every application you’ll ever encounter will involve reading or creating formal languages.
7/1
Lexical analysis (lexing)
The purpose of lexing is to prepare the input for the parser. This involves several related jobs.
Removingirrelevantdetail(e.g.whitespace,tab-stops, new-lines).
Splitthestringintobasicpartscalledtokens.Youcanthink of them as ’atoms’.
Why is using tokens useful?
8/1
Lexical analysis (lexing)
Why is using tokens useful?
Because for syntax analysis it’s not important if a variable is
called n or numberOfUsers, and likewise for x vs y 3 vs 4 etc.
In other words, the program
if n == 3 then x := 0 else x := 1
is syntactically correct exactly when
if numberOfUsers == 666 then y := 0 else z := 171717
is syntactically correct.
The precise choice of variable name is important later (for type-checking and for code generation) so we keep the name of the identifier for later phases. But syntactic correctness is decided independently of this detail.
9/1
Lexical analysis (lexing)
Another reason why using tokens is useful?
Information hiding / modularisation: for syntax analysis it’s not important if the syntax for conditionals is if then else or IF THENDO ELSEDO etc. In other words, the reason why we deem
if n == 3 then x := 0 else x := 1
to be syntactically correct (in one language) is exactly the same as the reason why we deem
IF n == 3 THENDO x := 0 ELSEDO x := 1
to be syntactically correct (in another language).
The precise choice of keywords name is irrelevant in later stages, so let’s abstract it by way of a token representing concrete keywords. That makes it much easier to change language keywords later, only the lexer needs adaptation.
10/1
Lexical analysis (lexing)
Let’s look at an analogy. You can classify natural languages sentences according to syntactic function. For example
Synchronised dancing is great! This can be simplified as follows
Adjective Verb Punctuation
Synchronised dancing
Noun
is
great !
Adverb
Roughly: sentences of the form …
Adjective Noun Verb Adverb Punctuation
… are syntactically correct in English.
11/1
Lexical analysis (lexing)
The process of going from string “Synchronised dancing is great!” to list “Adjective Noun Verb Adverb Punctuation” is lexing. Let’s break it down.
To work out if an English sentence is syntactically correct, we can do the following:
Classifyeverywordintermsofitsfunction(tokens),i.e. perform lexical analysis.
Checkiftheorderingoftokensisacceptableaccordingto the grammar of English. (This is parsing aka sytnax analysis.)
Question: How do we know what the words are? Answer: We have a concept of boundary, typically whitespace.
We do the same with computer languages.
12/1
Breaking down this process even further
Lexing/tokenisation can be seen as two separate steps:
StartwithString
SplitstringonboundariesintoList(String)
Convert/classifytheList(String)intoList(Token)
Example:
String:“Synchroniseddancingis great!”
List(String):[“Synchronised”,“dancing”,“is”,“great”,“!”] List(Token):[Adjective,Noun,Verb,Adverb,Punctuation]
In practise, those steps are often excuted in one go for efficiency.
13/1
Question: why bother with this?
Recall that our goal is twofold:
ProduceanASTformtheinputstring,becausecode generation works with ASTs.
Checksyntacticcorrectness.
Both are much easier using a list of tokens, rather than a string. So tokenisation is a form of simplification (information hiding): it shields the next stage (parsing) from having to deal with irrelevant information.
14/1
Question: why bother with this?
In summary, lexing has two beneficial effects.
Itsimplifiesthenextstage,parsing(whichcheckssyntactic correctness and constructs the AST).
Itabstractstherestofthecompilersfromthelexicaldetail of the source language.
15/1
Lexical analysis (lexing)
When designing a programming language, or writing a compiler, we need to decide what the basic constituents of programs are, and what the grammatical structure of the language is. There are many different ways of doing this. Let’s look at one.
Keywords(IF,ELSE,WHILE,…),butnotGIRAFFE
Identifiers(x,i,username,…),butnot_+++17
Integers(0,1,17,-3,…),butnot…
Floatingpointnumbers(2.0,3.1415,-16.993,…)
Binaryoperators(+,*,&&,||,=,==,!=,:=,…)
Leftbracket
Rightbracket
Tokenboundaries(whitespace,tab-stops,newlines,
semicolon, …)
…
So the tokens are: Keywords, Identifiers, Integers, Floating point numbers, Binary operators, Left bracket, Right bracket, …
16/1
Example
Let’s look a the raw string
\t IF ( x > 1 \n) \t\t\n THEN \tz := -3.01 \n
This gives rise to the following token list:
Keyword: “IF” LeftBracket
Ident: “x”
Binop: “>”
Int: “1”
RightBracket
Keyword: “THEN” Identifier: “z”
Binop: “:=”
Unop: “-”
Float“3.01”
17/1
Lexical analysis (lexing): ingredients
Hence, to lex strings we need to decide on the following ingredients.
Tokens(representingsetsofstrings).
Boundariesbetweentokens.
Inhabitationoftokens,i.e.decidewhichstringsare classified by what tokens.
Analgorithmthatinputsastringsandoutputsatokenlist.
Whattodoifweencounteraninputthatisn’twell-formed, i.e. a string that cannot be broken down into a list of tokens.
18/1
Example: Tokens and their inhabitation
For example the token class for Identifiers is inhabited by: e.g. non-empty strings of letters, digits, and “_” (underscore), starting with a letter. (This is the language designer’s choice.)
Token class Integers is inhabited by: e.g. non-empty strings of digits.
Keywords: “ELSE”, “IF”, “NEW”, … each is its own token class.
Whitespace: a non-empty string of blanks, newlines and tab-stops. Not inhabiting any token classes, since irrelevant to rest of compilation.
Can you see a problem with this?
19/1
Tokens and their inhabitation
Some strings can be described by more than one token, e.g. “THEN” is an identifier and a keyword.
We must have a mechanism to deal with this: e.g. have priorities like: if a string is both a keyword and an identifier, it should be classified as a keyword. In other words, keywords have priority over identifiers.
20/1
Tokens and their inhabitation
Aside: Some old programming languages let keywords be identifiers, for example in PL/1 the following is valid (or something close to it):
if if = then ( if, else ) then else = 1 else else = 3
Allowing identifiers also to be keywords is rarely useful, so most modern programming languages prohibit it.
21/1
Tokens and their inhabitation
There is a second problem!
What about the string “IFTHEN”? Should it be the tokens “IF” followed by “THEN”, or a single token standing for the identifier “IFTHEN”? Usual answer: use longest match.
22/1
Tokens and their inhabitation
However, problems still remain. For efficiency, most lexers usually scan their input from left to right, but only once! A keyword might be a proper prefix of an identifier:
int formulaLength = 17; for i = 0 to …
So the lexer can only classify strings by looking ahead.
How does the lexer know from looking ahead that the “for” in “formulaLength” isn’t a keyword? Answer: because we know what word boundaries are (such as whitespace, semicolon).
23/1
Tokens and their inhabitation
Question: how far does the lexer have to look ahead (in Java-like languages) before it can decide whether for… is the keyword for or the beginning of an identifier?
The more lookahead required, the less efficient the lexing process. Some old language can require unbounded lookahead. Modern languages require little lookahead (typically 1 character). Bear that in mind when designing a language.
24/1
Summary
The goal of lexical analysis is:
Getridofsemanticallyirrelevantinformation(e.g. whitespace, tab stops, syntax of keywords, …)
Givearoughclassification(simplification)oftheinputto simplify the next stage (parsing). In detail:
Identify the lexical structure and tokens of the language.
Partition the input string into small units (tokens) used by
the parser.
Lexing does a left-to-right scan of the input string, sometimes with lookahead.
25/1
Tasks for the lexical stage
Let’s rephrase what we’ve just said in a slightly different language. The point of the lexical phase is:
1. Description of the lexical structure of the language (determine token classes). We use regular expressions for this purpose.
2. From the description in (1) derive a scanning algorithm, called lexer, that determines the token class of each lexical unit. We use FSAs (finite state automata) for this.
26/1
Regular expressions
Regular expressions are a way of formally (= precisely) describing the set of lexemes (= strings) associated with each token. Informally, with REs we can say something like this:
An integer is a non-empty string of digits.
But we want to be more terse and precise than using natural language.
Aside: Invented in the 1950s to study neurons / neural nets! (Then called “nerve nets”.)
27/1
Preparation for regular expressions
An alphabet is a set of characters. Examples {1, 4, 7} is an alphabet as is {a, b, c, …, z}, as is the empty set.
A string over an alphabet A is a finite sequence elements from A. Example with alphabet {a, b, c, …, z}:
“hellomynameismartin” “”
Question: Is “hello my name is martin” a string over the same alphabet?
Question: what are the strings that you can form over the empty set as alphabet?
28/1
Preparation for regular expressions
A language over an alphabet A is a set of strings over A. Examples over {a, b, c, …, z}:
∅,theemptylanguage.
{””}.
{”hellomynameismartin”}.
{”hellomynameismartin”,”hellomynameistom”}.
{”hellomynameismartin”,”hellomynameistom”,””}.
Thesetofallstringsofevenlengthover{a,b,c,…,z}.
Thesetofallstringsofevenlengthover{a,b}.
Whatisthelanguageofallstringsoverthealphabet {0, 1}?
Do you see what’s special about the last three examples? The languages are infinite!
29/1
Specifying languages
Finite languages (= consisting of a finite number of strings) can be given by listing all strings. This is not possible for infinite languages (e.g. the language of all integers as a language over {0, …, 9}).
Regular expressions are a mechanism to specify finite and infinite languages.
This is the real point of regular expressions (and other formal accounts of languages like context free languages that we see later): to enable a terse description of languages that are too large (typically infinite) to enumerate.
The set of all (lexically/syntactically valid) Java/C/Python/Rust … programs is infinite.
30/1
Regular expressions
Regular expressions are a tool for specifying languages. You can think of them as a “domain specific language” or an ’API’ to specify languages. We will describe them precisely but informally now.
31/1
Regular expressions
Let A be an alphabet, i.e. a set of characters. We now define two things in parallel:
TheregularexpressionsoverA.
ThelanguageofeachregularexpressionoverA.We denote the language of r.e. R by lang(R).
32/1
Regular expressions
We have 7 (basic) kinds of regular expressions over alphabet A
∅.
ε.
′c′forallc∈A.
R|R′. RR′.
R∗. (R).
Each specifies a language.
33/1
Regular expressions (1): ∅
Let A be an alphabet, i.e. a set of characters. The regular expressions over A are given by the following rules.
∅ is a regular expression, denoting the empty set {}. Now lang(∅) = {}.
34/1
Regular expressions (2): ε
Let A be an alphabet, i.e. a set of characters. The regular expressions over A are given by the following rules.
ε is a regular expression, denoting the set {””}. We write lang(ε) = {””}.
It’s important to realise that ∅ and ε are different regular expressions, denoting different languages.
35/1
Regular expressions (3): alphabet characters
Let A be an alphabet, i.e. a set of characters. The regular expressions over A are given by the following rules.
For each character c from A, ′c′ is a regular expression, denoting the language
lang(′c′) = {”c”}.
36/1
Regular expressions (4): alternatives
Let A be an alphabet, i.e. a set of characters. The regular expressions over A are given by the following rules.
If R and S are regular expression, then R|S is a regular expression, denoting the language
lang(R) ∪ lang(S). You can think of R|S as R or S.
37/1
Regular expressions (5): concatenation
Let A be an alphabet, i.e. a set of characters. The regular expressions over A are given by the following rules.
If R and S are regular expression, then RS (pronounced R concatenated with S, or R then S) is a regular expression, denoting the language
{rs|r ∈ lang(R), s ∈ lang(S)}.
Here rs is the concatenation of the strings r and s. Example: if r = ”hello” and s = ”world”, then rs is ”helloworld”.
38/1
Regular expressions (6): star
The regular expressions presented so far do not, on their own, allow as to define infinite languages. Why?
The next operator changes this. It can be seen as a simple kind of ’recursion’ or ’loop’ construct for languages.
39/1
Regular expressions (6): star
Let A be an alphabet, i.e. a set of characters. The regular expressions over A are given by the following rules.
If R is a regular expression, then R∗ (pronounced R-star) is a regular expression, denoting the language
lang(R∗) = lang(ε) ∪ lang(R) ∪ lang(RR) ∪ lang(RRR) ∪ · · · In other words
lang(R∗) = lang(RRR…R)
n≥0 n
So a string w is in lang(R∗) exactly when some number n exists
with w ∈ lang(RRR…R).
n times
40/1
Regular expressions (6): star
Example: Let A = {0, 1, 2, …, 9}, then the language of (′0′|′1′)∗
is … the set of all binary numbers.
Whoops, where do these brackets come from in (′0′|′1′)∗?
Can we drop them and write ′0′|′1′∗
No!
41/1
Regular expressions (7): brackets
Let A be an alphabet, i.e. a set of characters. The regular expressions over A are given by the following rules.
If R is a regular expression, then (R) is a regular expression with the same meaning as R, i.e.
lang((R)) = lang(R).
42/1
Regular expressions: summary
Summary: the regular expressions over an alphabet A are ∅.
ε.
′c′forallc∈A.
R|R′,providedRandR′areregularexpressions. RR′,providedRandR′areregularexpressions. R∗,providedRisaregularexpressions.
(R),providedRisaregularexpressions.
43/1
Regular expressions precedence rules
To reduce the number of brackets in regular expressions with assume the following precedence rules.
R∗ binds more highly than RR′, i.e. AB∗ is short for A(B∗). RR′bindsmorehighlythanR|R′,i.e.AB|Cisshortfor
(AB)|C.
So AB∗C|D should be read as ((A(B∗))C)|D.
44/1
Examples of REs
Alphabet
Alphabet
Alphabet
Alphabet
Alphabet
A = A = A = A = A =
{0, 1}. What {0, 1}. What {0, 1}. What {0, 1}. What {0, 1}. What
language language language language language
is ′ 1′∗ ?
is (′ 0′ |′ 1′ )∗ |1? is (′ 0′ |′ 1′ )∗ 1? is ′ 0′∗ |′ 1′∗ ?
is ′ 0′ |′ 1′∗ ?
45/1
Examples of REs
Alphabet A = {0, 1}. What language is (′0′|′1′)∗′0′′0′(′0′|′1′)∗
Answer: the set of all binary strings containing 00 as a substring.
46/1
Examples of REs
Alphabet A = {0, 1}. What regular expression over A has exactly the strings of length 4 as language?
Answer: (′0′|′1′)(′0′|′1′)(′0′|′1′)(′0′|′1′).
47/1
Examples of REs
Alphabet A = {0, 1}. What regular expression has only strings with at most one 0 as language?
Answer: ′1′∗(′0′|ε)′1′∗
48/1
Examples of REs
Alphabet A = {−, 0, 1, …, 9}. What regular expression has only strings representing positive and negative integers as language?
(′−′|ε)(′0′|′1′|′2′|′3′|′4′|′5′|′6′|′7′|′8′|′9′)(′0′|…|′9′)∗
Note that “…” is not part of regular expressions, I was just too lazy to type out all the relevant characters.
49/1
Abbreviations in REs
One often finds the following abbreviations.
Weoftenwrite1insteadof′1′,ainsteadof′a′andsoon for all elements of the alphabet A. With this convention it makes sense to write e.g. A∗.
Instead of ′a′′ ′′s′′t′′r′′i′′n′′g′ we write ′a string′ or ”a string”.
SometimesR+SiswrittenforR|S.
R+standsforRR∗.(NoteR+SisdifferentfromR+S)
Ifthere’sa’natural’orderonthealphabetweoftenspecify ranges, e.g. [a − z] instead of a|b|c|…|y|z or [0 − 9] instead of 0|1|2|3|4|5|6|7|8|9, or [2 − 5] for 2|3|4|5. Etc.
Wewrite[a−zA−Z]insteadof[a−z]|[A−Z]
R?forR|ε
50/1
Lexical specification using REs
Let us now give a lexical specification of a simple programming language using REs. We start with keywords.
“if” | “else” | “for” | “while” Recall that “if” is a shorthand for ’i”f’.
51/1
Lexical specification using REs
Next are digits and integers. Digits are:
0|1|2|3|4|5|6|7|8|9
or simpler
[0-9]
We want to refer to this RE later, so we name it.
digit = [0-9]
52/1
Lexical specification using REs
Now integers can be written as follows
integer = -?digit+
This is short for
integer =
(’-’ | epsilon)
(’0’|’1’|’2’|’3’|’4’|’5’|’6’|’7’|’8’|’9’)
(’0’|’1’|’2’|’3’|’4’|’5’|’6’|’7’|’8’|’9’)*
Abbreviations are helpful for readability (writing epsilon for ε).
53/1
Lexical specification using REs
When specifying the lexical level of programming languages we also need to specify whitespace (why?). Often the following is used.
whitespace = (’ ’ | ’\n’ | ’\t’)+
What does ’\n’ and ’\t’ mean?
54/1
Lexical specification using REs
Here is a more realistic example: specification of numbers in a real programming language. Examples 234, 3.141526 or 6.2E-14
digit = [0-9]
digits = digit+
optFraction = (’.’digits) | eps
optExponent = ( ’E’( ’-’ | eps ) digits )
| eps
num = digits optFraction optExponent
(Writing eps for epsilon.)
55/1
Real languages
Java’s (JDK 13) lexical spec: https://docs.oracle.com/ javase/specs/jls/se13/html/jls-3.html
Python 3’s lexical spec: https://docs.python.org/3/ reference/lexical_analysis.html
56/1
Lexical specification using REs
I hope you see that REs are useful and can specify the lexical level of programming languages. Two main questions are left open.
1. Can all languages be specified with REs? E.g. all syntactically valid Java programs, or all arithmetic expressions that have balanced parentheses? Answer: no! We need more general languages. e.g. context free and context sensitive languages. More about this soon.
2. How to decide, given a string s and a regular expression R, if s ∈ lang(R)? Answer: FSA as algorithms to decide the language defined by REs.
57/1
The material in the textbooks
DragonBook:Chapter2.6andChapter3.
Appel,Palsberg:Chapter2.1and2.2
“Engineeringacompiler”:Chapter2.1and2.2
58/1