Regular Expressions
Peter Dixon April 19, 2021
1 Regular expressions
Regular expressions are a useful tool in lots of situations. They’re also closely connected to regular languages. Regular expressions match patterns. They’re made of symbols and operations. Symbols can be:
1. a character in the alphabet Σ
2. λ
3. ∅, which matches nothing
4. #, which matches any single character in Σ 5. @, which matches everything
These can be combined with the following operations:
1. OR, written as +
2. AND, written as ∩
3. concatenation, denoted by writing nothing in between the two strings
4. NOT, written as ∼
5. Kleene star ∗ (in the superscript)
6. “Kleene star but not empty” + (in the superscript)
7. and it’s not really an operation, but use parentheses if the order of oper- ations is ambiguous
There’s some redundancy here – @ is the same as writing #∗. (This is an important example – note that you do the ∗ before “choosing” what character # represents.)
As usual, we’ll write L(p) to denote “the set of strings matched by p”. Let’s do all the definitions:
1. forx∈Σ,L(x)=x
1
2. L(λ) = λ
(Note that technically these are two different things. The ∅ on the left is a pattern, and on the right is a string. )
3. L(∅) = ∅
(Note that technically these are two different things. The ∅ on the left is a pattern, and on the right is a set.)
4. L(#) = Σ
5. L(@) = Σ∗
6. For any patterns p1, p2:
7. L(p1 + p2) = L(p1) ∪ L(p2)
8. L(p1 ∩ p2) = L(p1) ∩ L(p2)
9. L(p1p2)={x|∃y,z∈Σ∗ :x=yz,y∈L(p1),z∈L(p2)}
10. L(∼ p1) = L(p1)C (that’s “complement”)
11. L(p∗1) = ∪∞i=0L(pi1) (that’s “i concatenations”; formally this is defined
inductively)
12. L(p+1 ) = ∪∞i=1(p1)i
13. and it’s not really an operation, but use parentheses if the order of oper- ations is ambiguous
Ok, let’s do some examples. Warning: the syntax in this class is different from the syntax used by a lot of regular expressions! Be extra super careful about making sure you use this class’s syntax! Let’s say the alphabet is Σ = {a, b, c}.
• If you want to match “strings that end in c”, you could write @c
• If you want to match “strings that start in a and end in c”, you could
write a@c
• If you want to match “strings that start in a, end in c, and contain at
least one b in between, you could write a@b@c
• If you want to match “strings with at least two cs”, you could write
@c@c@.
• If you want to match “strings with at most two cs”, you could write
∼ (@c@c@c@).
• If you want to match “strings with exactly two cs”, you could write (@c@c@∩ ∼ (@c@c@c@))
2
• …or you could write (#−c)∗c(#−c)∗c(#−c)∗
You don’t need all these symbols and operations. In fact, all you need is a symbol for each x ∈ Σ, a symbol for ∅, and the operations ∩,∗,∼ and concatenation.
Let’s say Σ = {a, b}, and p and q are patterns. Then you could write: • p+qusing∼(∼p∩∼q)
• #usinga+b
• @ using #∗
• p+ using pp∗
• λusinga∗∩∼a+
You could also get rid of ∅ fairly easily, or ∼ (much harder).
So the big question is: what languages can be expressed with regular ex- pressions? You may have already guessed: it’s regular languages. We want to show: for every DFA, there’s a regex that accepts the same language, and for every regex there’s a NFA that accepts the same language. So let’s build a NFA matching a regex.
We’ll build it kind of inductively: we’ll describe how to build a NFA for all the symbols
Side note about the practical side of regex: a lot of regex implementations allow “backreferences”: you can write, for example, ⟨#⟩\1\1 to match “3 of the same character”. The ⟨⟩ surround a string that can be referenced in the future. This is more powerful than the regex we’re using (you can do some context-free stuff), so forget you ever saw this.
3