CMPSC 311 – Introduction to Systems Programming
Regular Expressions
Professors:
(Slides are mostly by Professor Patrick McDaniel and Professor Abutalib Aghayev)
Regular expressions
• Often shortened to “regex” or (more rarely) “regexp”
• Regular expressions are a language for matching patterns
• Super-powerful find and replace tool
• Can be used in command-line tools, in shell scripts, as a text editor feature, or as part of a programming language (e.g., Python, Perl)
What are they good for?
• Searching for specifically formatted text
  • Email address
  • Phone number
  • Anything that follows a pattern
• Validating input
  • Same idea
• Powerful find-and-replace
  • E.g., change “X and Y” to “Y and X” for any X, Y
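As a rough illustration of the “X and Y” to “Y and X” idea, here is a minimal sketch using sed -E, which is not covered in these slides and is assumed here only because it is a common find-and-replace tool that accepts extended regular expressions. The sample phrase is made up. The parentheses and the \1, \2 references it relies on are explained under groups and backreferences.

```shell
# Made-up input phrase for the demo.
phrase='salt and pepper'

# Capture the word on each side of " and ", then emit them in swapped order.
swapped=$(printf '%s\n' "$phrase" | sed -E 's/([a-z]+) and ([a-z]+)/\2 and \1/')

echo "$swapped"
```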
Regex “flavors”
(source: https://gist.github.com/CMCDragonkai/)
On the command line
• The grep command is a regex filter
  • That’s what the “re” (regular expression) in the middle stands for
• grep -F looks for literal strings
• Today we will use grep -E
  • E for “extended” regular expressions
  • Very close to other languages’ flavors
  • Or egrep (an older equivalent name; recent GNU grep releases warn that it is obsolescent)
grep command syntax
• To find matches in files:
  • grep -E regex file(s)
• To filter standard input:
  • grep -E regex
• where regex is a regular expression, and file(s) are the files to search
• Options (aka “flags”):
  • -i: ignore case
  • -v: find non-matching lines (inverted search)
  • -r: search entire directories
  • -l: list the files that match, not the lines
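The two invocation forms and the four flags above can be tried on throwaway data. This is a minimal sketch: the scratch directory, file names, and file contents are invented for the demo and are not part of the course materials.

```shell
# Hypothetical scratch directory and files, created only for this demo.
dir=$(mktemp -d)
printf 'Apple pie\nbanana split\napple tart\n' > "$dir/desserts.txt"
printf 'carrot soup\n' > "$dir/soups.txt"

# Form 1: search a named file (case-sensitive, so "Apple pie" is skipped).
in_file=$(grep -E 'apple' "$dir/desserts.txt")

# Form 2: filter standard input; -i ignores case.
ignore_case=$(printf 'Apple pie\nbanana split\n' | grep -E -i 'apple')

# -v keeps only the lines that do NOT match.
inverted=$(grep -E -v 'apple' "$dir/desserts.txt")

# -r searches a whole directory; -l prints matching file names, not lines.
files=$(grep -E -r -l 'apple' "$dir")

echo "$in_file"
echo "$ignore_case"
echo "$inverted"
echo "$files"

# Clean up the scratch directory.
rm -r "$dir"
```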
Playing along …
• First, go to the course website, grab the grep-text.txt file, and put it in your VM (it is under the Files tab of Canvas)
• E1*: Now test it out:
% grep -E this grep-text.txt
this is a text file
this is a text file text file
%
*These numbered exercises are used throughout this lecture to help illustrate regular expression use. Answer key given later.
First lesson
• Letters, numbers, and a few other things match literally
  • E2: Find all the lines containing “fgh”
  • E3: Find all the lines containing “lmn”
• Note: a regex can match anywhere in the string
  • Doesn’t have to match the whole string
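To see literal matching on something other than the exercise file, here is a small sketch with made-up lines and patterns that differ from E2 and E3. Each pattern is found in the middle of a longer line, which shows that the regex does not need to cover the whole line.

```shell
# Made-up sample lines; the exercises themselves use grep-text.txt.
sample='abcdefghij
klmnop
xyz'

# Plain letters match themselves, wherever they occur inside a line.
def_lines=$(printf '%s\n' "$sample" | grep -E 'def')
mno_lines=$(printf '%s\n' "$sample" | grep -E 'mno')

echo "$def_lines"
echo "$mno_lines"
```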
Anchors
• Caret ^ matches at the beginning of a line
• Dollar sign $ matches at the end of a line
• Use quotes "…" to protect characters from the shell, e.g., grep -E "blah" grep-text.txt
• E4: Find words ending in “gry”
• E5: Find words starting with “ah”
• What happens if we use both anchors?
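The sketch below tries each anchor alone and then both together, on an invented word list with one word per line (patterns chosen to differ from E4 and E5). Single quotes are used so the shell leaves $ and ^ untouched. With both anchors, the pattern has to account for the entire line.

```shell
# Made-up word list, one word per line.
words='reading
ring
ringer
re
return'

# $ anchors the match to the end of the line.
ends_ing=$(printf '%s\n' "$words" | grep -E 'ing$')

# ^ anchors the match to the start of the line.
starts_re=$(printf '%s\n' "$words" | grep -E '^re')

# Both anchors: the whole line must be exactly "re".
only_re=$(printf '%s\n' "$words" | grep -E '^re$')

echo "$ends_ing"
echo "$starts_re"
echo "$only_re"
```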
Single-character wildcard
• Dot . matches any single character (exactly one)
• E6: Find the lines with a 6-letter word where the second, fourth, and sixth letters are “o”
• E7: Find lines with words that start with “gr” and end with “at” with exactly one character in the middle
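A quick sketch of the dot on an invented word list: the pattern below demands exactly one character between c and t, so a four-letter word and a two-letter word are both rejected. Note that the dot also matches a literal period.

```shell
# Made-up word list, one entry per line.
words='cat
cot
coat
ct
c.t'

# Exactly one character, of any kind, between "c" and "t".
one_between=$(printf '%s\n' "$words" | grep -E '^c.t$')

echo "$one_between"
```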
Multi-character wildcard
• Dot-star .* will match 0 or more characters
  • We’ll see why on the next slide
• E8: Find any words that start with “gr” and end with “at” with any number of characters in the middle
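For comparison with the single dot, this sketch runs .* between two fixed letters on an invented word list. Zero characters in the middle is allowed, so the two-letter entry matches as well.

```shell
# Made-up word list, one entry per line.
words='ct
cat
coat
carrot
cats'

# Any number of characters (including none) between "c" and a final "t".
any_between=$(printf '%s\n' "$words" | grep -E '^c.*t$')

echo "$any_between"
```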
Quantifiers
• How many repetitions of the previous thing to match?
  • Star *: 0 or more
  • Plus +: at least 1
  • Question mark ?: 0 or 1 (i.e., optional)
  • Braces {#}: exactly # times
• Try it out
  • E9: Spell check: necc?ess?ary
  • E10: Outside the US: colou?r
  • E11: Find lines with words with u, o, i, e, a in that order and at least one letter in between each
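The four quantifiers can be compared side by side by applying each one to the same letter. The sketch below uses invented strings that differ only in how many o's they contain.

```shell
# Made-up strings with zero, one, two, and three o's.
words='ggle
gogle
google
gooogle'

# * : zero or more o's, so every line matches.
star=$(printf '%s\n' "$words" | grep -E '^go*gle$')

# + : at least one o.
plus=$(printf '%s\n' "$words" | grep -E '^go+gle$')

# ? : the second o is optional, so one or two o's.
optional=$(printf '%s\n' "$words" | grep -E '^goo?gle$')

# {2} : exactly two o's.
exactly_two=$(printf '%s\n' "$words" | grep -E '^go{2}gle$')

echo "$star"
echo "$plus"
echo "$optional"
echo "$exactly_two"
```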
Character classes and spaces
• Whitespace can be matched with “\s” (a GNU grep extension; the portable POSIX form is [[:space:]])
• Square brackets [abc] will match any one of the enclosed characters
• You can use quantifiers on character classes
• E12: Find words starting with b where all the rest of the letters are c, h, or s
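Here is a small sketch of character classes on invented lines. It shows a class standing in for one letter, a quantifier applied to a class, and whitespace matching. The whitespace example uses the POSIX class [[:space:]] rather than \s, on the assumption that the command may be run with a grep other than GNU grep.

```shell
# Made-up lines for the demo.
lines='bat
bet
bit
but
big tab'

# One character from the set a, i, u between "b" and "t".
one_of=$(printf '%s\n' "$lines" | grep -E '^b[aiu]t$')

# A quantifier on a class: the whole line uses only the letters a, b, t.
only_abt=$(printf '%s\n' "$lines" | grep -E '^[abt]+$')

# A whitespace character between "g" and "t".
with_space=$(printf '%s\n' "$lines" | grep -E 'g[[:space:]]t')

echo "$one_of"
echo "$only_abt"
echo "$with_space"
```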
Ranges
• Part of character classes
• You can specify a range of characters with [a-j]
• One hex digit: [0-9a-f]
• Consonants: [b-df-hj-np-tv-z]
• E13: Find all the lines with words made exclusively from the letters a through e
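The hex-digit range from this slide can be checked on a few invented tokens. One assumption worth flagging: what a range such as a-f covers can depend on the locale, so this sketch forces LC_ALL=C to get plain ASCII ordering.

```shell
# Made-up tokens, one per line.
tokens='3f
ff
g0
7
A1'

# Lines made up entirely of lowercase hex digits.
# LC_ALL=C pins the range to ASCII order, so uppercase "A" is excluded.
hex=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E '^[0-9a-f]+$')

echo "$hex"
```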
Negative character classes
• If the first character inside the brackets is a caret, the class matches any one character except those listed
• Non-vowels: [^aeiou] (consonants, but also digits, spaces, and punctuation)
• Can be combined with ranges
• Any character that isn’t a digit: [^0-9]
• E14: Find words that contain “aq”, followed by something other than “u”
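A negated class still has to match one character; it is not the same as “nothing follows”. The sketch below, on invented words, shows that a q at the very end of a line does not satisfy q[^u], and also tries the non-digit class from the slide.

```shell
# Made-up word list, one word per line.
words='qat
quit
iraqi
aqua
iraq'

# "q" followed by one character that is not "u".
# "iraq" fails because no character at all follows its "q".
q_not_u=$(printf '%s\n' "$words" | grep -E 'q[^u]')

# Made-up codes: keep lines that contain at least one non-digit.
codes='123
12a'
has_non_digit=$(printf '%s\n' "$codes" | grep -E '[^0-9]')

echo "$q_not_u"
echo "$has_non_digit"
```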
Groups
• Parentheses (…) create groups within a regex
• Quantifiers operate on the entire group
• E15: Find words with an “m”, followed by “ach” one or more times, followed by “e”
• E16: Find words where every other character, starting with the first, is an “e”
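To see what “quantifiers operate on the entire group” means, this sketch compares the same + with and without parentheses, on invented strings.

```shell
# Made-up strings, one per line.
words='ha
haha
hahaha
hah
haa'

# With a group, + repeats the whole sequence "ha".
grouped=$(printf '%s\n' "$words" | grep -E '^(ha)+$')

# Without a group, + repeats only the preceding "a".
ungrouped=$(printf '%s\n' "$words" | grep -E '^ha+$')

echo "$grouped"
echo "$ungrouped"
```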
Alternation
• The vertical bar | denotes that either the left or right side matches
  • It’s the “or” operator
• Useful inside parentheses
• E17: Find lines with the word “this” or “fish”
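A short sketch of the “or” operator on invented lines, once inside parentheses (where it chooses between single letters) and once at the top level (where it chooses between whole words). The patterns are quoted so the shell does not treat | as a pipe.

```shell
# Made-up lines for the demo.
lines='gray cat
grey dog
green bird'

# Inside a group: "gr", then either "a" or "e", then "y".
gray_or_grey=$(printf '%s\n' "$lines" | grep -E 'gr(a|e)y')

# At the top level: either the word "cat" or the word "bird".
cat_or_bird=$(printf '%s\n' "$lines" | grep -E 'cat|bird')

echo "$gray_or_grey"
echo "$cat_or_bird"
```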
Special characters
• We’ve seen a lot already
  • ^$.*+?[]()|\
• Backslash \ will escape a special character to search for it literally
  • E18: Search for all lines with an asterisk (*)
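The effect of escaping is easiest to see by running the same pattern with and without the backslash. This sketch does that for + and for . on invented lines; single quotes keep the shell from touching the backslash.

```shell
# Made-up lines for the demo.
lines='2+2=4
22=4
a.b
axb'

# Escaped: a literal plus sign between two 2s.
plus_literal=$(printf '%s\n' "$lines" | grep -E '2\+2')

# Unescaped: + is a quantifier, so this means "one or more 2s, then a 2".
plus_quantifier=$(printf '%s\n' "$lines" | grep -E '2+2')

# Escaped dot matches only a period; unescaped dot matches any character.
dot_literal=$(printf '%s\n' "$lines" | grep -E 'a\.b')
dot_any=$(printf '%s\n' "$lines" | grep -E 'a.b')

echo "$plus_literal"
echo "$plus_quantifier"
echo "$dot_literal"
echo "$dot_any"
```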
Backreferences
• Groups in () can be referred to later
• Must match the exact same characters again
• Numbered \1, \2, \3 in the order their opening parentheses appear, counting from the start of the regex
• E19: Find lines that have words that have a four-character sequence repeated immediately
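Below is a minimal backreference sketch on invented words, using shorter repeats than E19 asks for. It assumes a grep that accepts backreferences with -E, as GNU grep does; POSIX does not require them in extended regular expressions.

```shell
# Made-up word list, one word per line.
words='book
cat
mama
banana'

# (.)\1 : any single character immediately followed by the same character.
doubled_letter=$(printf '%s\n' "$words" | grep -E '(.)\1')

# (..)\1 : any two-character sequence immediately repeated ("mama", "anan").
doubled_pair=$(printf '%s\n' "$words" | grep -E '(..)\1')

echo "$doubled_letter"
echo "$doubled_pair"
```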
Want to learn more?
• regexcrossword.com
• Great way to practice your regex-fu
• Starts with simpler tutorial puzzles and works up
• Write your own regex engine