COMP90045 Programming Language Implementation
Lexical Analysis and Regular Expressions
Harald Søndergaard
Lecture 2
Semester 1, 2019
PLI (Sem 1, 2019), Lexical Analysis, © University of Melbourne
Lexical Analysis
Syntax analyzers can be made to work directly on a stream of characters, considering each character individually.
However, many compilers use a separate lexical analyzer (also known as the lexer or scanner), whose main job is to transform the input from a stream of characters into a stream of tokens. The syntax analyzer then operates on this stream of tokens.
This arrangement usually yields significantly cleaner code, due to separation of concerns: the syntax analyzer discovers complex structure, while the lexical analyzer, besides discovering some simple structure, deals with some mundane, low level issues.
The Other Jobs of Lexical Analysis
Discarding unneeded parts of the input (white space and comments between tokens).
Keeping track of line numbers and sometimes column numbers (for error reporting, and in e.g., Haskell, for the offside rule).
Smashing case (for case-insensitive languages).
Processing file inclusion directives, and keeping track of which parts of which files remain to be read.
Processing macro definitions and expanding macros.
The last two are usually present only in scanners which are integrated with a preprocessor, such as the C preprocessor.
Integrating the preprocessor into the scanner can improve performance, but only at the cost of a great increase in complexity.
QUIZ: White space
In old versions of FORTRAN, white space was ignored everywhere, not just between tokens.
Guess: Which of the following are valid FORTRAN 77 statements?
READJUST MY SCREEN
CALLIGRAPHY IS FUN
REALIGN ALL ARRAYS
LOGIC ALWAYS HELPS
Knowing FORTRAN doesn’t help that much!
QUIZ: White space
READJUST MY SCREEN
Reads to a variable named “justmyscreen”.
CALLIGRAPHY IS FUN
Calls the subroutine (function) named “igraphyisfun”.
REALIGN ALL ARRAYS
Declares “ignallarrays” to be a real (float) variable.
LOGIC ALWAYS HELPS
Declares “wayshelps” to be of type “logical” (Boolean).
Token Kinds and Values
Each token output by the scanner has a kind. The set of token kinds is fixed when the scanner is designed.
Tokens of some kinds (e.g., identifiers and constants) have an associated value; tokens of other kinds (e.g., left parenthesis) do not.
Different kinds of tokens typically have associated values of different types. For identifiers, the value will be a string; for numeric constants, it may be of the relevant numeric type, e.g., int or float.
However, some lexers represent the values of numeric constants as strings, leaving the conversion to be done later.
This is useful e.g., when a compiler running on a 32-bit machine generates code for a 64-bit machine, since it allows the representation of integer constants that do not fit in 32 bits.
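A minimal sketch of this idea (the names here are our own, not from the lecture): the scanner keeps the lexeme as a string, and conversion happens only when the back end needs the value, here into Haskell's arbitrary-precision Integer, so a constant too big for the host's native int survives intact.

```haskell
-- Hypothetical token for a numeric constant whose value is kept as
-- the raw lexeme string.
data NumToken = NumLit String
  deriving (Eq, Show)

-- Deferred conversion: Integer is arbitrary precision, so a 64-bit
-- constant is representable even on a 32-bit host.
numValue :: NumToken -> Integer
numValue (NumLit s) = read s
```

For example, numValue (NumLit "9223372036854775807") recovers the full 64-bit value.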
Representing Tokens
typedef enum {
    IDENT, INT_CONST, FLOAT_CONST, BINOP, LEFTPAREN, ...
} TokenKind;

typedef union {
    char *name;
    int  int_const;
    ...
} TokenValue;

typedef struct {
    TokenKind  kind;
    TokenValue value;
} Token;
Representing Tokens
The C representation uses too much memory for tokens without values, and is clumsy and error-prone; nothing prevents you from accessing the wrong member of the union.
Languages with variant records, e.g., Ada, remove that source of error, but are still clumsy. The best and most direct representation uses algebraic types, as in Haskell:
data Token
  = Ident String
  | IntConst Int
  | FloatConst Float
  | Binop Op
  | LeftParen
  | RightParen
  ...

data Op
  = ADD
  | SUB
  | MUL
  | DIV
Lexemes and Patterns
A lexeme is a sequence of input characters that forms a token: e.g., "while", "<=", "num_widgets", "1066".
Lexical analyzers associate with each kind of token one or more patterns; a lexeme forms a token of a given kind if it matches one of the token kind’s patterns.
Some kinds of tokens, e.g., those representing keywords, have one pattern ("while") that matches just one string.
Others, e.g., the comparison operator token, have several patterns (<, <=, >, >=, =, !=), each of which matches just one string.
Yet others, e.g., identifier, have a pattern ([a-z][a-z0-9_]*) that matches many strings.
Regular Expressions
The work of the scanner consists mostly of testing strings against patterns. It is important to optimize the checking processes, because there are many patterns and many strings. (The scanner is the only part of the compiler that examines every character in the program.)
Regular expressions are the ideal formalism for expressing lexical patterns: they are expressive enough, and they have fast recognition algorithms.
They are used in many other contexts, including text editors, command interpreters and scripting languages (perl, python, JavaScript, …). Actually, such languages use extended regular expressions—adding various shorthands and a number of more powerful features.
Regular Languages
Σ        = alphabet (a set of characters)
language = a set of strings
L1 ∪ L2  = { s | s ∈ L1 or s ∈ L2 }
L1 L2    = { s1 s2 | s1 ∈ L1, s2 ∈ L2 }
L*       = { s1 s2 … sn | n ∈ ℕ, si ∈ L }

Let L1 = {ant, bcc} and L2 = {ab}. Then
L1 ∪ L2 = {ant, bcc, ab}
L1 L2   = {antab, bccab}
L2*     = {ε, ab, abab, ababab, …}
where ε stands for the empty string.
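The three operations are easy to express on finite languages. A small sketch (the function names are ours), with star truncated to a bounded number of factors, since L* is infinite in general:

```haskell
import qualified Data.Set as Set

type Lang = Set.Set String

-- L1 ∪ L2
lunion :: Lang -> Lang -> Lang
lunion = Set.union

-- L1 L2: every way of gluing a string of L1 onto a string of L2
lconcat :: Lang -> Lang -> Lang
lconcat l1 l2 =
  Set.fromList [ s1 ++ s2 | s1 <- Set.toList l1, s2 <- Set.toList l2 ]

-- {ε} ∪ L ∪ LL ∪ ... ∪ L^n, a finite approximation of L*
starUpTo :: Int -> Lang -> Lang
starUpTo n l =
  Set.unions (take (n + 1) (iterate (lconcat l) (Set.singleton "")))
```

With l1 = {ant, bcc} and l2 = {ab} as on the slide, lconcat l1 l2 is {antab, bccab}, and starUpTo 3 l2 is {ε, ab, abab, ababab}.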
Regular Expressions
We recursively define a function L taking regular expressions to regular languages as follows.
1. L(ε) = {ε}
2. L(a) = {a}
3. L(r|s) = L(r) ∪ L(s)
4. L(r s) = L(r) L(s)
5. L(r*) = L(r)*

Precedence: highest is *, then concatenation; lowest is |. Therefore a|bc*d ≡ (a)|((b)(c*)(d)).

L(a|b) = {a, b}
L((a|b)*) = {ε, a, b, aa, ab, ba, bb, …}
L(a|a*b) = {a, b, ab, aab, aaab, …}
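The five defining equations transcribe directly into a membership test. A deliberately naive sketch (constructor names are our own): matches r s decides s ∈ L(r) by trying every split of the input, exponential in the worst case but a faithful reading of the semantics.

```haskell
-- Abstract syntax of regular expressions over characters.
data RE = Eps | Sym Char | Alt RE RE | Cat RE RE | Star RE

-- matches r s  <=>  s ∈ L(r), one clause per defining equation.
matches :: RE -> String -> Bool
matches Eps s         = null s
matches (Sym c) s     = s == [c]
matches (Alt r1 r2) s = matches r1 s || matches r2 s
matches (Cat r1 r2) s =
  or [ matches r1 u && matches r2 v | (u, v) <- splits s ]
matches (Star r) s    =
  null s || or [ matches r u && matches (Star r) v
               | (u, v) <- splits s, not (null u) ]

-- All ways of cutting a string into a prefix and a suffix.
splits :: String -> [(String, String)]
splits s = [ splitAt i s | i <- [0 .. length s] ]
```

For instance, encoding a|a*b as Alt (Sym 'a') (Cat (Star (Sym 'a')) (Sym 'b')) accepts "aab" but rejects "ba", in line with L(a|a*b) above.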
QUIZ: Regular expressions
Describe these languages in English:
1 (a|b)*
  all finite strings of as and bs.
2 ((a|b|c)(a|b|c))*
  all even-length strings of as, bs and cs.
3 (a|b|c)*abba(a|b|c)*
  all finite strings of as, bs and cs containing "abba".
4 (a|ε)(abc|ε)
  the four strings ε, a, abc, and aabc.
QUIZ: Regular expressions
Write down the shortest five strings in the following languages:
1 (nerk|bob)(s|sled)
  the language only has four strings: {bobs, nerks, bobsled, nerksled}
2 a(ab)*|ε
  {ε, a, aab, aabab, aababab}
3 (f|ε)(i|ε)(s|ε)(h|ε)
  {ε, f, i, s, h}
QUIZ: Regular expressions
Bracket the regular expression aab*|b.
((aa)((b)*))|(b)
QUIZ: Regular expressions
Write a regular expression giving all strings of as and bs with exactly two as.
b*ab*ab*
Regular Definitions
A regular definition gives a name to a regular expression: di → ri
Each di is a distinct name, and each ri is a regular expression involving symbols in Σ∪{d1,…,di−1}.
alpha    → A|B|…|Z|a|b|…|z
digit    → 0|1|…|9
id       → alpha (alpha|digit)*
digits   → digit digit*
opt_frac → . digits | ε
opt_exp  → E (+|-|ε) digits | ε
float    → digits opt_frac opt_exp
Regular Definition Shorthands
One or more instances: +; that is, r+ ≡ r r*
Zero or one instance: ?; that is, r? ≡ (r|ε)
Character classes using [ ]:
[abc] ≡ (a|b|c)
[a-z] ≡ (a|b|…|z)
[A-Z0-9] ≡ (A|B|…|Z|0|1|…|9)
Example:
id → [A-Za-z][A-Za-z0-9]*
QUIZ: Regular definitions
1 Give a regular definition for all strings (of lower case letters) that contain all five vowels, in order, interspersed with arbitrary letters, for example aeiou, mandeilorium, aaeliiamoaeu.
  [a-z]*a[a-z]*e[a-z]*i[a-z]*o[a-z]*u[a-z]*
2 Give a regular definition for all strings of n as followed by n bs. That is, {ε, ab, aabb, …}.
  IMPOSSIBLE! Regular expressions cannot count: no pattern can force the number of bs to equal the number of as.
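The impossibility is a standard fact, not proved on the slide; it can be sketched with the usual finite-state (pumping) argument, using the automata material previewed at the end of this lecture:

```latex
Suppose $L = \{ a^n b^n \mid n \ge 0 \}$ were regular, recognised by a
DFA with $k$ states. Reading $a^k$ passes through the $k+1$ prefixes
$\epsilon, a, \ldots, a^k$, so by pigeonhole two of them, $a^i$ and
$a^j$ with $i < j$, reach the same state. The DFA must then give the
same verdict on $a^i b^i$ and $a^j b^i$; but $a^i b^i \in L$ while
$a^j b^i \notin L$, a contradiction.
```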
Scanner Generators
A scanner generator is a program generator. It takes a file containing a specification of a regular language, and outputs a source file that defines a lexical analyzer for that language.
lex (and its clone flex) are the classic lexer generators for C toolchains. Scanner generators exist for many other programming languages; for example, alex generates Haskell code.
These tools all use different formalisms, but the overall structure of their specifications is much the same.
Using alex
An alex specification gives the definition of tokens, together with rules that specify how string snippets should be translated into tokens.
Like all such tools, alex offers the convenience of regular definitions. Unlike other such tools, it distinguishes names used for character sets ($xyz) from names used as regular expressions (@xyz).
$digit = 0-9
$alpha = [a-zA-Z]
@ident = $alpha [$alpha $digit \_ \']*
Now we can use the more readable “$digit” and “@ident” in rules. Special characters (\t, \n, …) are escaped as usual.
Alex Rules
rules :-

  $white+    ;
  let        { \s -> LET }
  in         { \s -> IN }
  \=         { \s -> EQU }
  \+         { \s -> ADD }
  \-         { \s -> SUB }
  \*         { \s -> MUL }
  \/         { \s -> DIV }
  \(         { \s -> LPA }
  \)         { \s -> RPA }
  $digit     { \s -> NUM (read s :: Float) }
  @ident     { VAR }

Rules are a list of pattern/action pairs. Roughly, whenever the lexical analyser encounters one of these patterns, it applies the associated function to the matched string, generating a token.
The Tokens in Haskell
Where did the tokens LET, IN, and so on, come from?
Snippets of Haskell code can be included in the specification; anything between ‘{’ and ‘}’ is copied verbatim to the generated scanner.
{
data Token
  = LET | IN | EQU | ADD | SUB | MUL | DIV | LPA | RPA | NUM Float
  | VAR String
  deriving (Eq, Show)
}
Finishing Touches
The scanner generator will synthesize a function
alexScanTokens :: String -> [Token]
More complex “stateful” scanning functions can also be synthesized; the “basic” version used here is specified with
%wrapper "basic"
The scanner function will most likely be called from a parser, but to test it, we may make it a stand-alone program that we can compile:
main = do
  s <- getContents
  print (alexScanTokens s)
A Complete alex Specification: The Preamble
{
module Main (main) where
}
------------------------------------------------------------
-- Lexical analysis for a small expression language H
--
-- top   ::= exp | let ident = exp in top
-- exp   ::= term + exp | term - exp | term
-- term  ::= basic * term | basic / term | basic
-- basic ::= ( exp ) | number | ident
------------------------------------------------------------

%wrapper "basic"   -- Actions will be of type String -> Token

$digit = 0-9
$alpha = [a-zA-Z]
@ident = $alpha [$alpha $digit \_ \']*
Complete alex Specification: The Rules
rules :-

  $white+    ;
  let        { \s -> LET }
  in         { \s -> IN }
  \=         { \s -> EQU }
  \+         { \s -> ADD }
  \-         { \s -> SUB }
  \*         { \s -> MUL }
  \/         { \s -> DIV }
  \(         { \s -> LPA }
  \)         { \s -> RPA }
  $digit     { \s -> NUM (read s :: Float) }
  @ident     { VAR }
Complete alex Specification: Last Snippets
{
data Token
  = LET | IN | EQU | ADD | SUB | MUL | DIV | LPA | RPA | NUM Float
  | VAR String
  deriving (Eq, Show)

main = do
  s <- getContents
  print (alexScanTokens s)
}
Running This under Unix
Assume we name the specification H.x:
alex H.x
ghc H.hs
./H < input.txt
The Generated Lexer
The lexing function divides the input stream into a sequence of strings, each of which matches a pattern in the scanner specification.
The pattern may be a pattern for a token, or a pattern for a string to be ignored (white space or comment).
The lexer function repeatedly
reads input characters until it has built up the longest string of those characters that matches one of the patterns, and
executes the action of the first pattern matching that string.
After an action returns, the next call will start where the previous call left off. If the lexer fails to match any pattern, it reports an error.
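This longest-match, first-rule loop ("maximal munch") can be sketched in a few lines. The names below are our own illustration, not alex's machinery: each rule pairs a predicate on lexemes with an action, and an action returning Nothing discards the lexeme (white space, comments).

```haskell
import Data.Char (isAlpha, isDigit, isSpace)

type Pattern  = String -> Bool       -- does a lexeme match this rule?
type Action a = String -> Maybe a    -- Nothing = discard the lexeme

-- Take the LONGEST prefix matched by any pattern; among equal-length
-- matches, the FIRST rule listed wins. Then continue after the lexeme.
scan :: [(Pattern, Action a)] -> String -> [a]
scan _ [] = []
scan rules s =
  case [ (lexeme, act) | n <- [length s, length s - 1 .. 1]
                       , let lexeme = take n s
                       , (pat, act) <- rules
                       , pat lexeme ] of
    []                  -> error ("no pattern matches at: " ++ take 10 s)
    ((lexeme, act) : _) ->
      maybe id (:) (act lexeme) (scan rules (drop (length lexeme) s))

-- A toy rule set: skip white space, then numbers, then words.
data Tok = Num String | Word String deriving (Eq, Show)

demoRules :: [(Pattern, Action Tok)]
demoRules =
  [ (\l -> not (null l) && all isSpace l, const Nothing)
  , (\l -> not (null l) && all isDigit l, Just . Num)
  , (\l -> not (null l) && all isAlpha l, Just . Word)
  ]
```

Here scan demoRules "let 42" gives [Word "let", Num "42"]: "let" is consumed whole even though "l" alone also matches, which is exactly the maximal-munch behaviour described above. (Real generators avoid the quadratic rescanning by running a finite-state automaton, the topic of the next lecture.)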
Next Up
We discuss the use of finite-state automata to implement scanners and scanner generators.