13-regexp.pptx
Ling 131A
Introduction to NLP with Python
Regular Expressions
Marc Verhagen, Fall 2017
Today
• Python lab
– Monday 1pm and Wednesday 1pm
• Assignment 3
• Assignment 4
– WordNet and POS tagging
• NLTK Chapter 3
– Regular Expressions
Reading HTML
Note: this is an image from the book, but it is not quite right
Reading HTML
from urllib import request
from bs4 import BeautifulSoup
response = request.urlopen("http://nltk.org/")
html_content = response.read().decode("utf8")
text_content = BeautifulSoup(html_content, "html.parser").get_text()
$ pip3 install bs4
The example in the book uses nltk.clean_html(), but that does not exist
anymore. Instead you can use BeautifulSoup.
The example in the book uses a URL that does not exist anymore
import nltk
from urllib import request
url = "http://www.gutenberg.org/files/158/158-0.txt"
response = request.urlopen(url)
raw = response.read().decode("utf8")
tokens = nltk.word_tokenize(raw)
Finite State Processing
• Chomsky Hierarchy of languages/grammars
• Difference is in complexity of grammar and
power of processing automaton
• We will look at context free grammars later, but
now focus just on regular grammars
Grammar | Language               | Automaton                | Rules
Type-0  | Recursively enumerable | Turing machine           | x → y
Type-1  | Context sensitive      | Linear-bounded automaton | xAy → xBy
Type-2  | Context free           | Pushdown automaton       | A → x
Type-3  | Regular                | Finite state automaton   | A → a; A → aB
Finite State Processing
Behold a finite state automaton (FSA)
States, start state, final state(s), transitions,
transition labels, alphabet of labels
Accepts all and only the strings of some regular language
Finite State Processing
[Animation over several slides: the FSA consumes the input string one symbol at a time, following the labeled transitions]
At the end of the string and in a final state → success
Finite State Processing
• Characteristics
– It is finite (only a finite number of states)
– It is fast: processing time is typically linear in the
length of the input, that is, O(n).
– It only works for regular languages
– It is the basis for regular expressions
• In fact, you can prove that for each FSA there is an
equivalent regular grammar and vice versa.
– It is still powerful enough for many NLP tasks
Regular Expression Basics
Say you have an alphabet { a, b }
- a and b are regular expressions
- If x and y are regular expressions
then
- xy is a regular expression
- x|y is a regular expression
- x* is a regular expression
- x? is a regular expression
- x+ is shorthand for xx*
- (x) is a regular expression
Example (over the larger alphabet {a, b, c, d}): a(ba)*cd*
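A minimal sketch (not from the slides) that tests the example a(ba)*cd* in Python, using re.fullmatch so the whole string has to be generated by the expression:

import re
pattern = re.compile(r"a(ba)*cd*")
for s in ["ac", "abacdd", "abababac", "bac", "acdcd"]:
    # fullmatch() succeeds only if the entire string fits the pattern
    print(s, "->", "accepted" if pattern.fullmatch(s) else "rejected")
# ac, abacdd, abababac -> accepted; bac, acdcd -> rejected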
Regular Expressions
• Always defined with reference to some
alphabet (Σ)
– Generally, ASCII characters, A-Z
– Could be smaller (Σ = {a, b, !}) or larger (Unicode)
Regular expressions
• Programming language-independent
– Has good support in Python, Perl, Java
• Practical for text processing
• Specifies members of a set, for example:
– All strings that contain 3 capital H’s
– All strings that contain “Lemma”
– All strings that contain an upper-case vowel
– All strings that begin with a b, d, or g
– All strings that end with VBZ
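A hedged sketch (not from the slides) that turns a few of these set descriptions into Python regular expressions; the sample string is made up for illustration:

import re
s = "Lemma Information for dogs/VBZ"
print(bool(re.search(r"Lemma", s)))        # contains "Lemma" -> True
print(bool(re.search(r"[AEIOU]", s)))      # contains an upper-case vowel -> True
print(bool(re.search(r"^[bdg]", s)))       # begins with b, d, or g -> False
print(bool(re.search(r"VBZ$", s)))         # ends with VBZ -> True
print(bool(re.search(r"(H[^H]*){3}", s)))  # contains at least 3 capital H's -> False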
Applications of regular expressions
• Finding words with certain patterns
– Can you find in nltk.corpus.treebank.words() (see the sketch below):
• All words that end with dollar signs?
• All words that indicate a year (e.g., 1988)?
• Low-level language processing tasks:
– Tokenization
– Stemming
– Normalization
https://regex101.com/
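A possible sketch (not from the slides) answering the two treebank questions above; it assumes the corpus has been downloaded with nltk.download('treebank'):

import re
import nltk
words = set(nltk.corpus.treebank.words())
# Words that end with a dollar sign, e.g. "US$"
dollar_words = [w for w in words if re.search(r"\$$", w)]
# Words that look like a year: four digits starting with 1 or 2, e.g. "1988"
years = [w for w in words if re.fullmatch(r"[12]\d{3}", w)]
print(sorted(dollar_words)[:10])
print(sorted(years)[:10])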
Regular expressions
• Basics
– characters and character classes
– anchors
– quantifiers
• Python
– re.search() and friends
– match object
– groupings
– flags
Character classes
[AEIOU]
Any one of A, E, I, O, or U
H[aeiou]d
H, followed by any one of a, e, i, o, or u,
followed by d
Character classes: complements
• Any character that’s not a vowel
• [^aeiouAEIOU]
In this context, ^ means “not”
Character classes
• […] matches any character contained in the list.
– [abc] means one occurrence of either a, b, or c
• [^…] matches any character not contained in the list.
– [^abc] means one occurrence of any character that is not an a, b, or c.
• [ABCDEFGHIJKLMNOPQRSTUVWXYZ] one upper-case unaccented letter
• [0123456789] means one digit.
• [0123456789]+\.[0123456789]+ matches decimal numbers.
• [Cc]omputer [Ss]cience
– matches Computer Science, computer Science, Computer science, computer
science.
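A minimal sketch (not from the slides) trying the character classes above on some made-up strings:

import re
print(re.findall(r"H[aeiou]d", "Had Hid Hod Hxd"))
# ['Had', 'Hid', 'Hod']
print(re.findall(r"[^aeiouAEIOU]", "Audio"))
# ['d']  (the only non-vowel)
print(re.findall(r"[0123456789]+\.[0123456789]+", "pi is 3.14, e is 2.72"))
# ['3.14', '2.72']
print(re.findall(r"[Cc]omputer [Ss]cience", "computer science and Computer Science"))
# ['computer science', 'Computer Science']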
Character classes: ranges
• All upper-case, all lower-case, all letters, any
digit from zero to 9…
• [A-Z] = [ABCDEFGHIJKLMNOPQRSTUVWXYZ]
• [a-z]
• [A-Za-z] (beware: [A-z] also matches the ASCII characters between Z and a, such as _ and ^, so prefer [A-Za-z])
• [0-9]
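A small sketch (not from the slides) illustrating ranges, including why [A-z] is not the same as [A-Za-z]:

import re
print(re.findall(r"[A-Z]", "Ling 131A"))   # ['L', 'A']
print(re.findall(r"[0-9]+", "Fall 2017"))  # ['2017']
print(re.findall(r"[A-z]", "a_b"))         # ['a', '_', 'b']  (the underscore sneaks in)
print(re.findall(r"[A-Za-z]", "a_b"))      # ['a', 'b']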
Pre-defined character classes
• \d digit, equivalent to [0-9]
• \D non-digit: [^0-9]
• \s whitespace, equivalent to [ \t\n\r\f\v]
• \S non-whitespace, equivalent to [^ \t\n\r\f\v]
• \w alphanumeric character,
equivalent to [a-zA-Z0-9_]
• \W non-alphanumeric character
• . any character except \n
(or any character at all with the re.DOTALL flag)
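A minimal sketch (not from the slides) applying the pre-defined classes to a made-up line:

import re
line = "Assignment 4 is due 2017-11-15!"
print(re.findall(r"\d+", line))  # ['4', '2017', '11', '15']
print(re.findall(r"\w+", line))  # ['Assignment', '4', 'is', 'due', '2017', '11', '15']
print(re.findall(r"\s", line))   # the four spaces
print(re.findall(r"\W", line))   # the spaces, the two hyphens, and '!'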
Anchors
• ^T line that begins with T
• T$ line that ends with T
• \bT word that begins with T
• T\b word that ends with T
• \AT string that starts with T
(similar to ^ unless in multiline mode)
• T\Z string that ends with T
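A hedged sketch (not from the slides) of the anchors on a small two-line string:

import re
text = "Two terms.\nTen tokens"
print(re.findall(r"^T\w+", text))          # ['Two']  (start of the string only)
print(re.findall(r"^T\w+", text, re.M))    # ['Two', 'Ten']  (start of each line)
print(re.findall(r"\bt\w+", text))         # ['terms', 'tokens']  (words beginning with t)
print(re.findall(r"\w+s\b", text))         # ['terms', 'tokens']  (words ending with s)
print(bool(re.search(r"\ATwo", text)))     # True
print(bool(re.search(r"tokens\Z", text)))  # True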
Quantifiers
• a{n} n occurrences of “a”
• a{n,m} between n and m occurrences of “a”
• a{,m} at most m occurrences of “a”
• a{n,} at least n occurrences of “a”
• a+ one or more “a’s”, equivalent to a{1,}
• a* zero or more “a’s”,
equivalent to a{0,} or a{,}
• a? one “a”, or nothing
equivalent to a{0,1} or a{,1}
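A minimal sketch (not from the slides) of the quantifiers on made-up strings:

import re
s = "a aa aaa aaaa"
print(re.findall(r"\ba{2}\b", s))          # ['aa']
print(re.findall(r"\ba{2,3}\b", s))        # ['aa', 'aaa']
print(re.findall(r"\ba{3,}\b", s))         # ['aaa', 'aaaa']
print(re.findall(r"ba?d", "bd bad baad"))  # ['bd', 'bad']
print(re.findall(r"ba+d", "bd bad baad"))  # ['bad', 'baad']
print(re.findall(r"ba*d", "bd bad baad"))  # ['bd', 'bad', 'baad']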
Disjunction and Scope
• a|b: disjunction
• a(b|c)+ parentheses indicate the scope of operators
What do the following do?
• [a-zA-Z]
• [A-Z][a-z]*
• P[aeiou]{,2}t
• \d+(\.\d+)?
• ([^aeiou][aeiou][^aeiou])*
• \w+|[^\w\s]+
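One way (not from the slides) to check your answers is to run each pattern over a made-up sample string; note that findall() returns what the group matched, not the whole match, when a pattern contains parentheses:

import re
sample = "Pat put 3.14 points into Python, didn't he?"
for pattern in [r"[a-zA-Z]", r"[A-Z][a-z]*", r"P[aeiou]{,2}t",
                r"\d+(\.\d+)?", r"([^aeiou][aeiou][^aeiou])*", r"\w+|[^\w\s]+"]:
    print(pattern, re.findall(pattern, sample))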
Regular expressions
• Basics
– characters and character classes
– anchors
– quantifiers
• Python
– re.search() and friends
– match object
– groupings
– flags
Using regular expressions in Python
• https://docs.python.org/3/howto/regex.html
• import re
Regular expression methods
• re.search(REGEXP, STRING)
– Is there a match for REGEXP anywhere in STRING? Returns a match
object (or None if there is no match)
• re.match(REGEXP, STRING)
– Does the beginning of STRING match REGEXP? Returns a match
object
• re.findall(REGEXP, STRING)
– Returns all matches of REGEXP in STRING in a list
• re.sub(REGEXP, REPLACEMENT, STRING)
– Replaces all occurrences of REGEXP with REPLACEMENT in
STRING
• re.finditer(REGEXP, STRING)
– Return an iterator of match objects
• re.split(REGEXP, STRING)
– Split STRING on REGEXP, return a list of substrings
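A minimal sketch (not from the slides) running each of these functions on the same made-up string:

import re
s = "colorless green ideas sleep furiously"
print(re.search(r"green", s).group())    # 'green'
print(re.match(r"color\w+", s).group())  # 'colorless'  (anchored at the start)
print(re.findall(r"\w+s\b", s))          # ['colorless', 'ideas']
print(re.sub(r"green", "red", s))        # 'colorless red ideas sleep furiously'
print([m.span() for m in re.finditer(r"ee", s)])  # spans of the two 'ee' matches
print(re.split(r"\s+", s))               # the five words as a list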
Compiling regular expressions
• Up to now:
– re.search("[aeiou]", "this is a string")
– re.findall("[aeiou]", "this is a string")
• Other option: make a regular expression
object
>>> import re
>>> myreg = re.compile("[aeiou]")
>>> myreg.search("this is a string").group()
'i'
>>> myreg.findall("this is a string")
['i', 'i', 'a', 'i']
Compiling regular expressions
• search() and findall() exist
– as functions of the re module
– as methods of a compiled regular expression (pattern) object
• Make a regular expression object using re.compile(“…”)
• When should you make a compiled regular expression object?
– Same regular expression used many times: regular expression object is
faster
– Regular expression with a clear meaning: a named pattern
object doubles as documentation
word = re.compile(r"\b\w+\b")
– Otherwise, you might as well use the re module functions
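A small sketch (not from the slides) using the compiled pattern above:

>>> import re
>>> word = re.compile(r"\b\w+\b")
>>> word.findall("Compile once, use many times.")
['Compile', 'once', 'use', 'many', 'times']
>>> word.match("Compile once").group()
'Compile'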
Grouping
>>> line = "the 10-12 class"
>>> p = re.compile(r"the ((\d+)-(\d+))")
>>> m = p.search(line)
>>> m.group()
'the 10-12'
>>> m.group(0)
'the 10-12'
>>> m.group(1)
'10-12'
Grouping
>>> m.group(2)
'10'
>>> m.group(3)
'12'
>>> m.group(4)
IndexError
>>> m.group(2, 3)
('10', '12')
More methods on Match Object
>>> m.groups()
('10-12', '10', '12')
>>> m.start(3)
7
>>> m.end(3)
9
>>> m.span(3)
(7, 9)
Give names to groups
>>> line = "Arthur has arrived!"
>>> p = re.compile(r"\b(?P<name>[A-Z]\w+)\b")
>>> m = p.search(line)
>>> m.group(1)
'Arthur'
>>> m.group('name')
'Arthur'
Back references
>>> p = re.compile(r"(\b\w+\b).+(\1)")
>>> line = "test this is a test"
>>> p.search(line).groups()
('test', 'test')
>>> line = "test this is a pest"
>>> p.search(line).groups()
AttributeError
Search is greedy (until you tell it not to)
>>> line = "Sentence One. Sentence Two."
>>> p1 = re.compile(r".*\.")
>>> m1 = p1.search(line)
>>> m1.group()
'Sentence One. Sentence Two.'
>>> p2 = re.compile(r".+?\.")
>>> m2 = p2.search(line)
>>> m2.group()
'Sentence One.'
Use the raw string, always
>>> s = "\section"
>>> m = re.search("\\section", s)
>>> m.group()
AttributeError
>>> m = re.search("\\\\section", s)
>>> m.group()
'\\section'
>>> m = re.search(r"\\section", s)
>>> m.group()
'\\section'
Flags to change the behavior of re
>>> s = "1st of November"
>>> m = re.search(r"""
^\s*              # space at beginning of string
\d+(th|st|nd|rd)  # ordinal number (no commas)
\s+               # at least one space
\w+               # then a word
""", s, re.VERBOSE)
>>> m.group()
'1st of'
Flags to change the behavior of re
• re.VERBOSE, re.X:
– whitespace in the regular expression ignored,
except in bracket expressions or as \s
– comments at end of lines, starting with #
– longer regular expression strings, using triple
quotes
– Highly recommended for complex expressions
• This lets you write regular expressions that are actually
readable
Flags to change the behavior of re
• re.IGNORECASE, re.I:
– Ignore case. [A-Z] is the same as [a-z]
• re.MULTILINE, re.M:
– Use this on multiline strings, i.e. strings that contain newline.
– ^ matches beginning of string and beginning of each line
– $ matches end of string and end of each line
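A minimal sketch (not from the slides) showing the effect of re.MULTILINE:

>>> import re
>>> text = "first line\nsecond line\nthird line"
>>> re.findall(r"^\w+", text)
['first']
>>> re.findall(r"^\w+", text, re.M)
['first', 'second', 'third']
>>> re.findall(r"\w+$", text, re.M)
['line', 'line', 'line']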
Case-sensitive and -insensitive match
>>> line = "Python is not python"
>>> re.findall('python', line)
['python']
>>> re.findall('python', line, re.I)
['Python', 'python']
>>> re.findall('(?i)python', line)
['Python', 'python']
(?i) is an alternative syntax for the re.I flag