COMP284 Scripting Languages – Handouts (8 on 1)
COMP284 Scripting Languages
Lecture 5: Perl (Part 4)
Handouts (8 on 1)
Ullrich Hustadt
Department of Computer Science
School of Electrical Engineering, Electronics, and Computer Science
University of Liverpool
Contents
1 Regular expressions (2)
Capture groups
Alternations
Anchors
Modifiers
Binding operator
COMP284 Scripting Languages Lecture 5 Slide L5 – 1
Regular expressions (2) Capture groups
Regular expressions: Capture groups and backreferences
• We often encounter situations where we want to identify the repetition
of the same or similar text, for example, in HTML markup:
…
• We might also not just be interested in the repeating text itself,
but the text between or outside the repetition
• We can characterise each individual example above
using regular expressions:
.* <\/ strong >
but we cannot characterise both without losing fidelity, for example:
<\w+>.*<\/\w+>
does not capture the ‘pairing’ of HTML tags
COMP284 Scripting Languages Lecture 5 Slide L5 – 2
Regular expressions (2) Capture groups
Regular expressions: Capture groups
The solution are capture groups and backreferences
(regexpr) creates a capture group
(?
(?:regexpr) creates a non-capturing group
\N, \gN, \g{N} backreference to capture group N
(where N is a natural number)
\g{name} backreference to a named capture group
Examples:
1 /Sorting took (\d+\.\d+) seconds/
2 /<(\w+)>.*<\/\1 >/
3 /([A-Z])0{2}(\d+)/
4 /(?
5 /((?
COMP284 Scripting Languages Lecture 5 Slide L5 – 3
Regular expressions (2) Capture groups
Regular expressions: Capture groups
Via capture variables the strings matched by a capture group are also
available outside the pattern in which they are contained
$N string matched by capture group N
(where N is a natural number)
$+{name} string matched by a named capture group
The matched strings are available until the end of the enclosing
code block or until the next successful match
Example:
$_ = “Yabba dabba doo”;
if (/((?
print “Match found: $1\n” }
Output:
Match found: abba
COMP284 Scripting Languages Lecture 5 Slide L5 – 4
Regular expressions (2) Alternations
Regular expressions: Alternations
• The regular expression regexpr1|regexpr2 matches
if either regexpr1 or regexpr2 matches
This type of regular expression is called an alternation
• Within a larger regular expression we need to enclose alternations
in a capture group or non-capturing group:
(regexpr1|regexpr2) or (?:regexpr1|regexpr2)
Examples:
1 /Mr|Ms|Mrs|Dr/
2 /cat|dog|bird/
3 /(?: Bill|Hillary) Clinton/
COMP284 Scripting Languages Lecture 5 Slide L5 – 5
Regular expressions (2) Alternations
Regular expressions: Alternations
• The order of expressions in an alternation only matters
if one expression matches a sub-expression of another
Example:
1 $_ = “cats and dogs”;
2 if (/( cat|dog|bird )/) { print “Match 1: $1\n” }
3 if (/( dog|cat|bird )/) { print “Match 2: $1\n” }
4 if (/( dog|dogs )/) { print “Match 3: $1\n” }
5 if (/( dogs|dog )/) { print “Match 4: $1\n” }
Output:
Match 1: cat
Match 2: cat
Match 3: dog
Match 4: dogs
COMP284 Scripting Languages Lecture 5 Slide L5 – 6
Regular expressions (2) Anchors
Regular expressions: Anchors
Anchors allow us to fix where a match has to start or end
\A Match only at string start
^ Match only at string start (default)
Match only at a line start (in //m)
\Z Match only at string end modulo a preceding \n
\z Match only at string end
$ Match only at string end modulo a preceding \n
Match only at a line end (in //m)
\b Match word boundary (between \w and \W)
\B Match except at word boundary
Example:
$_ = “The girl who\nplayed with fire\n”;
if (/fire\z/) { print “‘fire’ at string end\n” }
if (/fire\Z/) { print “‘fire’ at string end modulo \\n\n” }
‘fire’ at string end modulo \n
COMP284 Scripting Languages Lecture 5 Slide L5 – 7
Regular expressions (2) Modifiers
Regular expressions: Modifiers
Modifiers change the interpretation of certain characters in a regular
expression or the way in which Perl finds a match for a regular expression
/ / Default
‘.’ matches any character except ‘\n’
‘^’ matches only at string start
‘$’ matches only at string end modulo preceding \n
/ /s Treat string as a single long line
‘.’ matches any character including ‘\n’
‘^’ matches only at string start
‘$’ matches only at string end modulo preceding \n
/ /m Treat string as a set of multiple lines
‘.’ matches any character except ‘\n’
‘^’ matches at a line start
‘$’ matches at a line end
COMP284 Scripting Languages Lecture 5 Slide L5 – 8
Regular expressions (2) Modifiers
Regular expressions: Modifiers
Modifiers change the interpretation of certain characters in a regular
expression or the way in which Perl finds a match for a regular expression
/ /sm Treat string as a single long line, but detect multiple lines
‘.’ matches any character including ‘\n’
‘^’ matches at a line start
‘$’ matches at a line end
/ /i perform a case-insensitive match
Example:
$_ = “bill\nClinton”;
if (/( Bill|Hillary ). Clinton )/smi) { print “Match: $1\n” }
Output:
Match: bill
Clinton
COMP284 Scripting Languages Lecture 5 Slide L5 – 9
Regular expressions (2) Modifiers
Regular expressions: Modifiers (/ /g and / /c)
Often we want to process all matches for a regular expression,
but the following code has not the desired effect
$_ = “11 22 33”;
while (/\d+/) { print “Match starts at $ -[0]: $&\n” }
The code above does not terminate and endlessly prints out the same text:
Match starts at 0: 11
To obtain the desired behaviour of the while-loop we have to use
the / /g modifier:
/ /g In scalar context, successive invocations against a string will
move from match to match, keeping track of the position in the
string
In list context, returns a list of matched capture groups, or
if there are no capture groups, a list of matches to
the whole regular expression
COMP284 Scripting Languages Lecture 5 Slide L5 – 10
Regular expressions (2) Modifiers
Regular expressions: Modifiers (/ /g and / /c)
With the / /g modifier our code works as desired:
$_ = “11 22 33”;
while (/\d+/g) { print “Match starts at $ -[0]: $&\n” }
Output:
Match starts at 0: 11
Match starts at 3: 22
Match starts at 6: 33
An example in a list context is the following:
$_ = “ab 11 cd 22 ef 33”;
@numbers = (/\d+/g);
print “Numbers: “,join(” | “,@numbers),”\n”;
Output:
Numbers: 11 | 22 | 33
Read / /g as: Start to look for a match from the position where the last
match using / /g ended
COMP284 Scripting Languages Lecture 5 Slide L5 – 11
Regular expressions (2) Modifiers
Regular expressions: Modifiers (/ /g and / /c)
The current position in a string for a regular expression regexpr
is associated with the string, not regexpr
; different regular expressions for the same strings will move forward the
same position when used with / /g
; different strings have different positions and their respective positions
move forward independently
Example:
$_ = “ab 11 cd 22 ef 33”;
if (/\d+/g) { print “Match starts at $-[0]: $&\n” }
if (/[a-z]+/g) { print “Match starts at $-[0]: $&\n” }
if (/\d+/g) { print “Match starts at $-[0]: $&\n” }
Output:
Match starts at 3: 11
Match starts at 6: cd
Match starts at 9: 22
COMP284 Scripting Languages Lecture 5 Slide L5 – 12
Regular expressions (2) Modifiers
Regular expressions: Modifiers (/ /g and / /c)
A failed match or changing the target string resets the position
1 $_ = “ab 11 cd 22 ef 33”;
2 if (/\d+/g) { print “2: Match starts at $-[0]: $&\n” }
3 if (/ab/g) { print “3: Match starts at $-[0]: $&\n” }
4 if (/\d+/g) { print “4: Match starts at $-[0]: $&\n” }
Output:
2: Match starts at 3: 11
4: Match starts at 3: 11
To prevent the reset, an additional modifier / /c can be used
1 $_ = “ab 11 cd 22 ef 33”;
2 if (/\d+/g) { print “2: Match starts at $-[0]: $&\n” }
3 if (/ab/gc) { print “3: Match starts at $-[0]: $&\n” }
4 if (/\d+/g) { print “4: Match starts at $-[0]: $&\n” }
Output:
2: Match starts at 3: 11
4: Match starts at 9: 22
COMP284 Scripting Languages Lecture 5 Slide L5 – 13
Regular expressions (2) Binding operator
Generating regular expressions on-the-fly
The Perl parser will expand occurrences of $variable and @variable
in regular expressions
; regular expessions can be constructed at runtime
Example:
$_ = “Bart teases Lisa”;
@keywords = (“bart”,”lisa”,”marge”,’L\w+’,”t\\w+”);
while ($keyword = shift(@keywords )) {
print “Match found for $keyword: $&\n” if /$keyword/i;
}
Output:
Match found for bart: Bart
Match found for lisa: Lisa
Match found for L\w+: Lisa
Match found for t\w+: teases
COMP284 Scripting Languages Lecture 5 Slide L5 – 14
Regular expressions (2) Binding operator
Binding operator
Perl offers two binding operators for regular expressions
string =∼ /regexpr/ true iff regexpr matches string
string !∼ /regexpr/ true iff regexpr does not match string
• Note that these are similar to comparison operators not assignments
• Most of the time we are not just interested whether these expressions
return true or false, but in the side effect they have on the special
variables $N that store the strings matched by capture groups
Example:
$name = “Dr Ullrich Hustadt”;
if ($name =∼ /(Mr|Ms|Mrs|Dr)?\s*(\w+)/) {print “Hello $2\n”}
$name = “Dave Shield”;
if ($name =∼ /(Mr|Ms|Mrs|Dr)?\s*(\w+)/) {print “Hello $2\n”}
Hello Ullrich
Hello Dave
COMP284 Scripting Languages Lecture 5 Slide L5 – 15
Regular expressions (2) Binding operator
Pattern matching in a list context
• When a pattern match /regexpr/ is used in a list context,
then the return value is
• a list of the strings matched by the capture groups in regexpr
if the match succeeds and regexpr contains capture groups, or
• (a list containing) the value 1
if the match succeeds and regexpr contains no capture groups, or
• an empty list if the match fails
$name = “Dr Ullrich Hustadt”;
($t ,$f ,$l) = ($name =∼ /(Mr|Ms|Mrs|Dr)?\s*(\w+)\s+(\w+)/);
print “Name: $t, $f , $l\n”;
$name = “Dave Shield”;
($t ,$f ,$l) = ($name =∼ /(Mr|Ms|Mrs|Dr)?\s*(\w+)\s+(\w+)/);
print “Name: $t, $f , $l\n”;
Output:
Name: Dr, Ullrich , Hustadt
Name: , Dave , Shield
COMP284 Scripting Languages Lecture 5 Slide L5 – 16
Regular expressions (2) Binding operator
Pattern matching in a list context
• When a pattern match /regexpr/g is used in a list context,
then the return value is
• a list of the strings matched by the capture groups in regexpr
each time regex matches
provided that regexpr contains capture groups, or
• a list containing the string matched by regexpr each time regexpr
matches provided that regexpr contains no capture groups, or
• an empty list if the match fails
$string = “firefox: 10.3 seconds; chrome: 9.5 seconds”;
%performance = ($string =∼ /(\w+)\:\s+(\d+\.\d+)/g);
foreach $system (keys %performance) {
print “$system -> $performance{$system }\n” }
Output:
firefox -> 10.3
chrome -> 9.5
COMP284 Scripting Languages Lecture 5 Slide L5 – 17
Regular expressions (2) Binding operator
Revision
Read
• Chapter 7: In the World of Regular Expressions
• Chapter 8: Matching with Regular Expressions
of
R. L. Schwartz, brian d foy, T. Phoenix:
Learning Perl.
O’Reilly, 2011.
• http://perldoc.perl.org/perlre.html
• http://perldoc.perl.org/perlretut.html
• http://www.perlfect.com/articles/regextutor.shtml
COMP284 Scripting Languages Lecture 5 Slide L5 – 18
http://perldoc.perl.org/perlre.html
http://perldoc.perl.org/perlretut.html
http://www.perlfect.com/articles/regextutor.shtml
Lecture 5
Regular expressions (2)
Capture groups
Alternations
Anchors
Modifiers
Binding operator