What is a language?
Computer Science and Engineering The Ohio State University
Regular Expressions
Copyright By PowCoder代写 加微信 powcoder
Computer Science and Engineering College of Engineering The Ohio State University
Computer Science and Engineering The Ohio State University
Programming Languages
Computer Science and Engineering The Ohio State University
Q: Are C, Java, Ruby, Python, … languages in this formal sense?
Programming Languages
Computer Science and Engineering The Ohio State University
Regular Expression (RE)
Computer Science and Engineering The Ohio State University
A formal mechanism for defining a
In math, a clear distinction between:
In computer applications, there isn’t
Precise, unambiguous, well-defined
Characters in strings (the “alphabet”) Meta characters used in writing a RE
Is ‘*’ a Kleene star or an asterisk?
(a|b)*a(a|b)(a|b)(a|b)
A literal represents a character from
the alphabet
f,i,s,h, …
Whitespace is hard (invisible!)
Some are easy:
\t is a tab (ascii 0x09)
\n is a newline (ascii 0x0A)
\r is a carriage return (ascii 0x0D)
So the character ‘\’ needs to be escaped! \\ is a \ (ascii 0x5c)
Computer Science and Engineering The Ohio State University
Basic Operators
These operators are meta-characters too To represent the literal: \( \) \|
Activity: For each RE above, write out the
corresponding language explicitly (ie, as a
set of strings)
\(61(3|4)\)
Computer Science and Engineering The Ohio State University
( ) for grouping, | for choice
cat|dog|fish
R(uby|ails)
(G|g)r(a|e)y
Character Class
Set of possible characters (0|1|2|3|4|5|6|7|8|9) is annoying!
[^A-Z] a character that is not a capital letter
Syntax: [ ]
Explicit list as [0123456789]
Activity: Write the language defined by
Range as [0-9]
Negate with ^ at the beginning
0[xX][0-9a-fA-F]
Computer Science and Engineering The Ohio State University
Character Class Shorthands
\s for whitespace, ie [ \t\r\n]
\w for word character, ie [0-9a-zA-Z_]
And negations too
\D, \S, \W (ie [^\d], [^\s], [^\w])
Warning: [^\d\s] ≠ [\D\S]
POSIX standard (& Ruby) includes [[:alpha:]] alphabetic character
[[:lower:]] lowercase alphabetic character [[:digit:]] decimal digit (unicode! Eg ١) [[:xdigit:]] hexadecimal digit [[:space:]] whitespace including newlines
Computer Science and Engineering The Ohio State University
\d for digit, ie [0-9]
Includes space, tab, punctuation, etc! But does not include newline
So add . to list of meta-characters Use \. for a literal period
buckeye\.\d
Problem: What is RE for OSU email
address for everyone named Smith?
Answer is not:
Computer Science and Engineering The Ohio State University
A . matches any character (almost)
Repetition
Computer Science and Engineering The Ohio State University
Applies to preceding character or ( )
\? \* \+ \{ \}
? means 0 or 1 time
* means 0 or more times (unbounded)
+ means 1 or more times (unbounded) {k} means exactly k times {a,b}means k times, for a ≤ k ≤ b
More meta-characters to escape!
Computer Science and Engineering The Ohio State University
0[xX](0|[1-9a-fA-F][0-9a-fA-F]*)
Computer Science and Engineering The Ohio State University
(Language consisting of) strings that:
Contain only letters, numbers, and _ Start with a letter
Do not contain 2 consecutive _’s
Do not end with _
Exemplars and counter-exemplars:
EOF, 4Temp, Test_Case3, _class, a4_Sa
p_X, S__T_2
Write the corresponding RE
Computer Science and Engineering The Ohio State University
(Language consisting of) strings that:
Contain only letters, numbers, and _ Start with a letter
Do not contain 2 consecutive _’s
Do not end with _
Exemplars and counter-exemplars:
EOF, 4Temp, Test_Case3, _class, a4_Sa
p_X, S__T_2
Write the corresponding RE
Computer Science and Engineering The Ohio State University
(Language consisting of) strings that:
Contain only letters, numbers, and _ Start with a letter
Do not contain 2 consecutive _’s
Do not end with _
Exemplars and counter-exemplars:
EOF, 4Temp, Test_Case3, _class, a4_Sa
p_X, S__T_2
Write the corresponding RE
[a-zA-Z](_[a-zA-Z0-9]|[a-zA-Z0-9])*
Finite State Automota (FSA)
Computer Science and Engineering The Ohio State University
An FSA is an “accepting rule”
Finite set of states
Transition function (relation) between
states based on next character in string
DFA vs NFA
Start state (s0)
Set of accepting states
An FSA “accepts” a string if you can start in s0 and end up in an accepting
state, consuming 1 character per step
What language is defined by this FSA?
Computer Science and Engineering The Ohio State University
Computer Science and Engineering The Ohio State University
What language is defined by this FSA?
A. Binary strings (0’s and 1’s) with an
even number of 0’s
Computer Science and Engineering The Ohio State University
(Language consisting of) strings that:
Contain only letters, numbers, and _ Start with a letter
Do not contain 2 consecutive _’s
Do not end with _
Exemplars and counter-exemplars:
EOF, 4Temp, Test_Case3, _class, a4_Sa
p_X, S__T_2
Write the corresponding FSA
Computer Science and Engineering The Ohio State University
Fundamental Results
Expressive power of RE is the same as
Expressive power of RE is limited
Write a RE for “strings of balanced
()(()()), ()(), (((()))), … (((, ())((), …
Can not be done! (impossibility result) Take CSE 3321…
Computer Science and Engineering The Ohio State University
REs in Practice
Possible uses:
Report matching substrings and locations
Replace match with something else
Practical aspects of using REs this way Anchors
Greedy vs lazy matching
Computer Science and Engineering The Ohio State University
REs often used to find a “match”
A substring s within a longer string such that s is in the language defined by the RE
(CSE|cse) ?3901
Used to specify where matching string
should be with respect to a line of text
Newlines are natural breaking points
^Hello World$
^[^\d].\.jpe?g
^ anchors to the beginning of a line
$ anchors to the end of a line
Ruby: \A \z for beginning/end of string
Computer Science and Engineering The Ohio State University
Greedy vs Lazy
Repetition (+ and *) allows multiple matches to begin at same place
The match selected depends on whether the repetition matching is
Default is typically greedy
For lazy matching, use *? or +?
Example: <.*>
Title
Title
greedy, ie matches as much as possible lazy, ie matches as little as possible
Computer Science and Engineering The Ohio State University
Regular Expressions in Science and Engineering The Ohio State University
Instance of a class (Regexp) pattern = Regexp.new(‘^Rub.’)
But literal notation is common: /pattern/ /[aeiou]*/
%r{hello+} #no need to escape /
Match operator =~ (negated as !~)
‘hello world’ =~ /o/ #=> 4 /or/ =~ ‘hello’ #=> nil
Case equality, Regexp === String, -> Boolean
Options post-pended: /pattern/options
Operands: String and Regexp (in either order)
Returns index of first match (or nil if not present)
i ignore case
x ignore whitespace & comments (“free spacing”)
Strings and Regular Expressions
Computer Science and Engineering The Ohio State University
Find all matches as an array
a.scan /[[:alpha:]]/
Delimeter for spliting string into array
a.split /[aeiou]/
Substitution: sub and gsub (+/- !)
Replace first match vs all (“globally”)
a = “the quick brown fox”
a.sub /[aeiou]/,
#=> “th@ quick brown fox”
a.gsub /[aeiou]/,
Your Turn (Regular Expressions)
Computer Science and Engineering The Ohio State University
Check if phone number in valid format
phone = “614-292-2900” #not ok phone = “(614) 292-2900” #ok
format = ?
if phone ? format #well-formatted
Your Turn (Regular Expressions)
Computer Science and Engineering The Ohio State University
Check if phone number in valid format
phone = “614-292-2900” #not ok phone = “(614) 292-2900” #ok
format = /\A\(\d{3}\) \d{3}-\d{4}\z/ if phone =~ format #well-formatted
Expressive power same as RE
Language: A set of strings
RE: Defines a language
Repetition
Recipe for making elements of language Literals
Distinguish characters and metacharacters Character classes
Represent 1 character in RE
Computer Science and Engineering The Ohio State University
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com