Introduction to Regular Expressions
Faculty of Information Technology
Monash University
FIT5196 week 2
(Monash) FIT5196 1 / 23
Regular Expressions
A regular expression is a sequence of symbols that describes a text pattern.
• \d{4}-\d{2}-\d{2}
• wrangling
Why regular expressions?
• Regular expressions are useful for finding, replacing and extracting information
from text, such as log files, HTML/XML files, and other documents
− Search a document for “color” or “neighbor”, with or without the “u”
− Convert a tab-delimited file to a comma-delimited file
− Find duplicated words in a text
− Search for “Bob” and “Bobby” and replace them with “Robert”
• Regular expressions are useful for verifying whether input fits a text
pattern, such as verifying
− phone numbers: Does a phone number have the right number of digits?
− emails: Is an email address in a valid format?
− dates: Is a date in the right format? Does the month exceed 12?
Regular Expressions: validate emails
r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)" ¹
Figure: Railroad diagram of the pattern above, generated by https://regexper.com/
¹ http://emailregex.com/
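A minimal sketch of how the email pattern above might be used from Python. The helper name looks_like_email and the sample addresses are illustrative assumptions, not part of the slides, and the pattern checks only the overall shape of an address.

```python
import re

# Pattern copied from the slide (source: emailregex.com).
EMAIL_RE = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")


def looks_like_email(text):
    """Return True if text fits the simple email pattern (hypothetical helper)."""
    return EMAIL_RE.match(text) is not None


print(looks_like_email("jane.doe@example.com"))  # True
print(looks_like_email("jane.doe@example"))      # False: no dot after the "@" part
```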
Outline
1 Regular Expression Syntax
character sets
repetition
grouping
raw string in Python
2 Case studies
3 Summary
Regular Expression Syntax
Matching String Literals
The most basic feature of regular expressions is matching strings of one or
more literal characters, called string literals.
Everything is essentially a character in regular expressions.
• cat matches “cat”.
• cat matches the first three characters of “cattle” and “catfish”.
It is similar to searching in a word-processing program.
Matching is case-sensitive:
• cat does not match “Cat”.
How does a regular expression engine work?
Pattern: cat
Text: The cow, camel and cat communicated.
The engine scans the text from left to right and tries the pattern at each position: “cow” and “camel” start with “c” but fail on a later character, and the first match is the word “cat”.
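The left-to-right scan can be observed with Python's re module. This is only a sketch; the variable names are illustrative. Note that “communicated” also contains the letters “cat”, so a search for all matches finds two.

```python
import re

text = "The cow, camel and cat communicated."

# re.search returns only the first (leftmost) match.
first = re.search(r"cat", text)
print(first.group(), first.start())

# re.finditer keeps scanning and reports every match.
starts = [m.start() for m in re.finditer(r"cat", text)]
print(starts)

# Matching is case-sensitive, so "Cat" is not found.
print(re.search(r"cat", "Cat"))  # None
```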
Regular Expression Syntax character sets
Character sets: [...]
Assume that we are going to match the following two words:
grey gray
What should the regular expression be?
[...] indicates a set of characters
• Matches any one of the characters in the set, but only one
• The order of the characters does not matter.
• The regular expression is gr[ea]y
Figure: “gr”, then one of “e” or “a”, then “y”
• gr[ea]y does not match grAy, graay, or graey.
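A short check of gr[ea]y in Python, assuming we want each whole word to fit the pattern (hence re.fullmatch). The word list is taken from the slide.

```python
import re

pattern = re.compile(r"gr[ea]y")
words = ["grey", "gray", "grAy", "graay", "graey"]

# Keep only the words that fit the pattern from start to end.
matched = [w for w in words if pattern.fullmatch(w)]
print(matched)  # ['grey', 'gray']
```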
Character ranges: [a-zA-Z] and [0-9]
Assume that we are going to match Victorian car plate numbers, for example
XRA 000, 1AA 1AA
Note that the letters can be from A to Z, and the digits can be from 0 to 9. What
should the regular expression be?
Character ranges can be indicated
by giving two characters and
separating them with a “-”.
Example:
• [0-9]
• [a-z] or [A-Z]
Caution
• [50-99] is not all numbers from
50 to 99; it is the same as [0-9].
Figure: [50-99] is read as one of “5”, “0” to “9”, or “9”
[A-Z0-9][A-Z][A-Z]\s[0-9][A-Z0-9][A-Z0-9]
Figure: one of “A” to “Z” or “0” to “9”; one of “A” to “Z”; one of “A” to “Z”; white space; one of “0” to “9”; one of “A” to “Z” or “0” to “9”; one of “A” to “Z” or “0” to “9”
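A sketch that tries the plate pattern on the two sample plates and illustrates the [50-99] caution. The lowercase plate is an added counter-example, and the names PLATE_RE and results are illustrative.

```python
import re

PLATE_RE = re.compile(r"[A-Z0-9][A-Z][A-Z]\s[0-9][A-Z0-9][A-Z0-9]")

# True for the two sample plates, False for the lowercase one.
results = {p: bool(PLATE_RE.fullmatch(p)) for p in ["XRA 000", "1AA 1AA", "xra 000"]}
print(results)

# [50-99] is a set of single characters (5, 0 to 9, 9), so it matches one digit at a time.
digits = re.findall(r"[50-99]", "50 to 99, 7")
print(digits)  # ['5', '0', '9', '9', '7']
```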
Negative character sets: [^...]
Assume that we are going to write a regular expression that matches only the live
animals
hog dog bog
Question: what is the regular expression?
[^...]: If the first character of the set is ^, any character that is not in
the set will be matched.
• [^b]og matches “hog” and “dog”, but not “bog”.
• Caution:
− Does see[^mn] match “see”?
− Does see[^mn] match “see ”?
Try the regular expression in Pythex (http://pythex.org/)!
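The two caution questions can be checked directly. A negative set still has to consume one character, so see[^mn] needs something after “see”. The variable names below are illustrative.

```python
import re

animals = ["hog", "dog", "bog"]
live = [a for a in animals if re.fullmatch(r"[^b]og", a)]
print(live)  # ['hog', 'dog']

# No character follows "see", so the negative set has nothing to match.
no_char_left = re.search(r"see[^mn]", "see")
# A trailing space is a character that is neither "m" nor "n".
trailing_space = re.search(r"see[^mn]", "see ")
print(no_char_left, repr(trailing_space.group()))
```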
Metacharacters inside character sets: [.+]
Assume that we are going to match the following two strings:
var(9), var[0]
Now, we need to match ( ) and [ ]. How can we do that?
Metacharacters inside character sets are already escaped. In other words,
they lose their special meaning inside sets.
• Example:
− h[ai.u]t matches “hat” and “h.t”, but not “hot”
Exceptions: ], -, ^ and \ do need to be escaped.
• What happens if the . in h[ai.u]t is replaced by ], giving h[ai]u]t?
var[([][0-9][)\]]
Figure: “var”, then one of “(” or “[”, then one of “0” to “9”, then one of “)” or “]”
Without the escape, var[([][0-9][)]] is read differently: the set [)] closes at the first ], and the final ] becomes a literal character.
Figure: “var”, then one of “(” or “[”, then one of “0” to “9”, then “)”, then “]”
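A sketch comparing the escaped and unescaped versions of the pattern. The third test string, “var(9)]”, is an added example that only the unescaped pattern accepts; the names escaped, unescaped and table are illustrative.

```python
import re

escaped = re.compile(r"var[([][0-9][)\]]")    # last set contains ")" and "]"
unescaped = re.compile(r"var[([][0-9][)]]")   # set [)] followed by a literal "]"

samples = ["var(9)", "var[0]", "var(9)]"]
table = {s: (bool(escaped.fullmatch(s)), bool(unescaped.fullmatch(s))) for s in samples}
for s, (with_escape, without_escape) in table.items():
    print(s, with_escape, without_escape)

# Same effect in h[ai]u]t: the set is [ai], followed by the literal text "u]t".
print(bool(re.fullmatch(r"h[ai]u]t", "hau]t")))  # True
```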
Shorthand character sets
Shorthand   Meaning                                  Equivalent (ASCII)
\d          matches any decimal digit from 0 to 9    [0-9]
\w          matches any word character               [a-zA-Z0-9_]
\s          matches any whitespace character         [ \t\n\r\f\v]
\D          matches any non-digit character          [^0-9]
\W          matches any non-word character           [^a-zA-Z0-9_]
\S          matches any non-whitespace character     [^ \t\n\r\f\v]
Examples:
• \d\d\d\d matches four-digit numbers, such as “2018”, but not text.
• \w\w\w matches three word characters, such as “abc”, “123” and “d_b”
• \w\w\s\w matches “ab c” but not “a bc”.
• [\w]-[\w] matches two word characters separated by a hyphen.
• [^\d] is the same as [\D]
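The shorthand examples above can be tried as follows. This is a sketch: the strings “a-b” and “x-y and 1-2” are added test inputs, and the variable names are illustrative.

```python
import re

# \d\d\d\d: exactly four digits.
year_ok = bool(re.fullmatch(r"\d\d\d\d", "2018"))
text_ok = bool(re.fullmatch(r"\d\d\d\d", "year"))

# \w\w\w: three word characters; a hyphen is not a word character.
three_word_chars = [w for w in ["abc", "123", "d_b", "a-b"] if re.fullmatch(r"\w\w\w", w)]

# \w\w\s\w: the whitespace must be the third character.
spaced = (bool(re.fullmatch(r"\w\w\s\w", "ab c")), bool(re.fullmatch(r"\w\w\s\w", "a bc")))

# [\w]-[\w]: two word characters separated by a hyphen.
hyphenated = re.findall(r"[\w]-[\w]", "x-y and 1-2")

print(year_ok, text_ok, three_word_chars, spaced, hyphenated)
```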
Shorthand character sets
Caution:
• Is [^\d\s] the same as [\D\S]?
− [^\d\s]: matches a character that is neither a digit nor whitespace
Figure: none of digit, white space
− [\D\S]: matches a character that is a non-digit or a non-whitespace character, which is every character
Figure: one of non-digit, non-white space
Try the regular expressions with the following sentence: “Data Wrangling S2
2018 week 2”
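A quick comparison of the two sets on the suggested sentence. This is a sketch with illustrative variable names.

```python
import re

sentence = "Data Wrangling S2 2018 week 2"

# Characters that are neither digits nor whitespace.
neither = re.findall(r"[^\d\s]", sentence)
# Characters that are non-digits OR non-whitespace: every character qualifies.
either = re.findall(r"[\D\S]", sentence)

print("".join(neither))               # DataWranglingSweek
print(len(either) == len(sentence))   # True
```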
Regular Expression Syntax repetition
Repetition Expressions: repetition meta-characters
Meta-character   Meaning
*                Match 0 or more repetitions of the preceding regex
+                Match 1 or more repetitions of the preceding regex
?                Match 0 or 1 repetitions of the preceding regex
Examples:
• Assume we are going to match the following words
oops ooops ooooops oooooops
but not
ops
Which regular expression(s) should we use?
1 oo*ps
2 ooo*ps
3 oo+ps
4 oo?ps
The regular expressions that we can use:
ooo*ps oo+ps
oo*ps and oo?ps also match “ops”, and oo?ps allows at most two o’s.
Try the regular expressions in Pythex!
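The four candidate patterns can be tested mechanically. The helper name works is an illustrative assumption; it requires a pattern to match every target word in full and to reject “ops”.

```python
import re

targets = ["oops", "ooops", "ooooops", "oooooops"]
candidates = [r"oo*ps", r"ooo*ps", r"oo+ps", r"oo?ps"]


def works(pattern):
    """True if pattern matches every target word but not 'ops' (hypothetical helper)."""
    return (all(re.fullmatch(pattern, w) for w in targets)
            and re.fullmatch(pattern, "ops") is None)


valid = [p for p in candidates if works(p)]
print(valid)  # ['ooo*ps', 'oo+ps']
```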
Repetition Expressions: quantified repetitions
{m,n}: matches from m to n repetitions of the preceding regular
expression.
• m (min) and n (max) are non-negative integers
• m is normally given and can be 0 (Python’s re module reads an omitted m as 0)
• n is optional
Three forms
• \d{2} matches numbers with exactly 2 digits.
• \d{2,4} matches numbers with 2 to 4 digits.
• \d{2,} matches numbers with at least 2 digits (no upper limit).
Write the quantifier without spaces: Python’s re module treats \d{2, 4} as literal text, not as a repetition.
Try the “oops” example in Pythex, but with {m,n}
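One possible way to redo the “oops” exercise with quantified repetition. This is a sketch; the bounded variant o{2,4}ps and the digit string are added for illustration only.

```python
import re

targets = ["oops", "ooops", "ooooops", "oooooops"]

# o{2,}: at least two o's, no upper limit.
at_least_two = re.compile(r"o{2,}ps")
all_match = all(at_least_two.fullmatch(w) for w in targets)
rejects_ops = at_least_two.fullmatch("ops") is None

# o{2,4}: between two and four o's, so the longer words are rejected.
bounded = [w for w in targets if re.fullmatch(r"o{2,4}ps", w)]

# \d{2,4} is greedy: it takes up to four digits at a time.
numbers = re.findall(r"\d{2,4}", "7 42 2018 123456")

print(all_match, rejects_ops, bounded, numbers)
```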
Repetition Expressions: quantified repetitions
Suppose we are going to match the following
report_2018_09 assignment_2018_9
budget_18_08 assignment_18_7
but not
report_201809_39 assignment_8_9000
budget_2345678_08 assignment_000999_7
What is the regular expression?
\w+_\d{2,4}_\d{1,2}
Figure: one or more word characters, “_”, 2 to 4 digits, “_”, 1 or 2 digits
Try the regular expression in Pythex
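A sketch that runs the pattern over the eight sample names. It assumes each name is tested on its own and must fit the pattern in full (re.fullmatch); the list names are illustrative.

```python
import re

NAME_RE = re.compile(r"\w+_\d{2,4}_\d{1,2}")

wanted = ["report_2018_09", "assignment_2018_9", "budget_18_08", "assignment_18_7"]
unwanted = ["report_201809_39", "assignment_8_9000",
            "budget_2345678_08", "assignment_000999_7"]

accepted = [n for n in wanted + unwanted if NAME_RE.fullmatch(n)]
print(accepted == wanted)  # True
```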
Repetition Expressions: greedy vs. lazy regex
Greedy strategy:
• Match as much as possible before giving control to the next part of the regular
expression.
− Regular expressions try to match the longest possible string
• Example
Pattern: .*\d+
Text: number 516
.* takes “number 51”, leaving only “6” for \d+.
• Question: Given a string like
"data", "wrangling", "FIT5196, S2."
what is the match of the regular expression ".+", ".+"?
1 "data", "wrangling"
2 "wrangling", "FIT5196, S2."
3 "data", "wrangling", "FIT5196, S2."
Repetition Expressions: greedy vs. lazy regex
Lazy strategy:
• Match as little as possible before giving control to the next part of the regular
expression
• Syntax
− *?
− +?
− ??
− {m,n}?
• Example:
Pattern: .*?\d+
Text: number 516
.*? takes only “number ”, leaving “516” for \d+.
Question: Given a string like
"data", "wrangling", "FIT5196, S2."
what is the match of the regular expression ".+?", ".+?"?
1 "data", "wrangling"
2 "wrangling", "FIT5196, S2."
3 "data", "wrangling", "FIT5196, S2."
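The greedy and lazy examples can be compared side by side. This sketch adds parentheses around each part purely to show which part consumed which text (groups are covered in the next section); the variable names are illustrative.

```python
import re

# Greedy .* leaves as little as possible for \d+; lazy .*? leaves as much as possible.
greedy = re.search(r"(.*)(\d+)", "number 516")
lazy = re.search(r"(.*?)(\d+)", "number 516")
print(greedy.groups())  # ('number 51', '6')
print(lazy.groups())    # ('number ', '516')

text = '"data", "wrangling", "FIT5196, S2."'
greedy_match = re.search(r'".+", ".+"', text).group()
lazy_match = re.search(r'".+?", ".+?"', text).group()
print(greedy_match)  # the whole string (option 3)
print(lazy_match)    # "data", "wrangling" (option 1)
```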
Regular Expression Syntax grouping
Grouping: (...)
(...) matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group.
• Apply repetition operators to a group of regular expressions
• Makes regular expressions easier to read
• Capture groups for use in matching, replacing and extraction, i.e., the
contents of a group can be retrieved.
• Cannot be used inside a character set.
Examples:
• abc+ matches abc, abcc, abcccc
Figure: “ab”, then “c” one or more times
• (abc)+ matches abc, abcabc,
abcabcabc
Figure: group #1 “abc”, one or more times
• “Incident American Airlines Flight 11 involving a Boeing 767-223ER in 2001”
• Regular expression: Incident (.*) involving
Figure: “Incident ”, then group #1 (any character, zero or more times), then “ involving”
• Try it in a Python script!
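A minimal script for the grouping examples, assuming Python's re module. The variable names are illustrative; group(1) retrieves the text captured by the first pair of parentheses.

```python
import re

# Repetition applies to the whole group in (abc)+ but only to "c" in abc+.
group_repeat = bool(re.fullmatch(r"(abc)+", "abcabcabc"))
char_repeat = bool(re.fullmatch(r"abc+", "abcabcabc"))
print(group_repeat, char_repeat)  # True False

sentence = "Incident American Airlines Flight 11 involving a Boeing 767-223ER in 2001"
m = re.search(r"Incident (.*) involving", sentence)
flight = m.group(1)
print(flight)  # American Airlines Flight 11
```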
Alternation: |
“|” is an OR operator
• A|B will match any string that matches either A or B
• Ordered: the leftmost expression gets precedence.
• Multiple patterns can be daisy-chained.
• Group alternation expressions to keep them distinct.
Examples:
• apple|orange matches “apple” and “orange”
• (apple|orange) juice matches “apple juice” and “orange juice”
• w(ei|ie)rd matches both “weird” and “wierd”.
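A sketch of alternation in Python. The “grape juice” text and the Bob/Bobby ordering check are added illustrations of the “leftmost expression gets precedence” rule, not examples from the slide.

```python
import re

text = "apple juice, orange juice, grape juice"
juices = [m.group() for m in re.finditer(r"(apple|orange) juice", text)]
print(juices)  # ['apple juice', 'orange juice']

spellings = [w for w in ["weird", "wierd", "word"] if re.fullmatch(r"w(ei|ie)rd", w)]
print(spellings)  # ['weird', 'wierd']

# Alternatives are tried left to right, so the order can change the result.
short_first = re.search(r"Bob|Bobby", "Bobby").group()
long_first = re.search(r"Bobby|Bob", "Bobby").group()
print(short_first, long_first)  # Bob Bobby
```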
Regular Expression Syntax raw string in Python
The backslash plague: \
The backslash \ either indicates a special form or allows a special character to be
used without invoking its special meaning.²
Characters     Stage
\section       Text string to be matched
\\section      Escaped backslash for re.compile()
\\\\section    Escaped backslashes for a Python string literal
So, to match a literal backslash, one has to write '\\\\' as the regular
expression string.
Can we simplify the expression?
² See https://docs.python.org/3/howto/regex.html
Raw string: r"..."
A raw string suppresses the special meaning of escape sequences: the
backslash is not treated as a special character.
Regular Python string literal   Raw string
"\\\\section"                   r"\\section"
"\\w+\\s+"                      r"\w+\s+"
Regular expressions will often be written in Python code using this raw
string notation.
Try it in a Python script!
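A small script showing that the regular and raw spellings denote the same pattern. The sample text “see \section 2” is an added illustration, and the variable names are illustrative.

```python
import re

regular = "\\\\section"   # four backslashes in the source, two in the string
raw = r"\\section"        # two backslashes in the source, two in the string
print(regular == raw)     # True

# The regex engine reads the two backslashes as one literal backslash.
m = re.search(raw, r"see \section 2")
print(m.group())          # \section

same_shorthand = ("\\w+\\s+" == r"\w+\s+")
print(same_shorthand)     # True
```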
Case studies
Case study 1: validate dates
Date samples (day, month, year):
• 02/08/2018
• 2/8/2018
• 2/8/18
• 23/08/2018
• 23-08-2018
See the Jupyter notebook!
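One possible pattern for these samples, offered as a sketch and not as the notebook's solution. It assumes days 1 to 31, months 1 to 12, a 2- or 4-digit year, and “/” or “-” as separator; it does not check that the day exists in the given month or that both separators are the same. The name DATE_RE is illustrative.

```python
import re

DATE_RE = re.compile(
    r"(0?[1-9]|[12]\d|3[01])"   # day: 1 to 31, optional leading zero
    r"[/-]"                     # separator
    r"(0?[1-9]|1[0-2])"         # month: 1 to 12, optional leading zero
    r"[/-]"                     # separator
    r"(\d{4}|\d{2})"            # year: four or two digits
)

samples = ["02/08/2018", "2/8/2018", "2/8/18", "23/08/2018", "23-08-2018"]
valid = [s for s in samples if DATE_RE.fullmatch(s)]
print(valid == samples)                 # True

print(DATE_RE.fullmatch("2/13/2018"))   # None: the month exceeds 12
```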
Case study 2: validate credit card number
Assume that you’re given the job of implementing an order form for a
company that accepts payment by credit cards issued by the world’s major
credit card companies, such as Visa, MasterCard, and American Express.
• All Visa card numbers start with a 4. New cards have 16 digits. Old cards
have 13.
− 4123456789012
− 4123456789012345
• MasterCard numbers start with the numbers 51 through 55. All have
16 digits.
− 5123456789012345
− 5523456700012345
• American Express card numbers start with 34 or 37 and have 15 digits.
− 341234567890123
− 371234567890123
See the Jupyter notebook!
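A sketch that turns the three rules above into patterns. It is not the notebook's solution: the names CARD_PATTERNS and card_type are hypothetical, and only the prefix and the number of digits are checked, as described on the slide.

```python
import re

CARD_PATTERNS = {
    "Visa": re.compile(r"4\d{12}(\d{3})?"),         # 4 + 12 digits, optionally 3 more
    "MasterCard": re.compile(r"5[1-5]\d{14}"),      # 51 to 55 + 14 digits
    "American Express": re.compile(r"3[47]\d{13}"), # 34 or 37 + 13 digits
}


def card_type(number):
    """Return the card brand whose rule the number fits, or None (hypothetical helper)."""
    for name, pattern in CARD_PATTERNS.items():
        if pattern.fullmatch(number):
            return name
    return None


print(card_type("4123456789012"))      # Visa
print(card_type("5523456700012345"))   # MasterCard
print(card_type("371234567890123"))    # American Express
print(card_type("6123456789012345"))   # None
```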
Summary
Summary: what to do this week
Regular expressions are the major tool used in data parsing.
Study materials provided in Moodle.
Attend tutorial 2.
• Try to finish all the materials provided in the tutorial.
Assessment 1 will be released at the end of this week (week 2).
Topic for next week:
• Parsing data stored in different file formats: CSV, JSON, XML, Excel, and
PDF