COMP9319 Web Data Compression and Search
Regular expressions and basic indexed search
Including some slides modified from comp3402 at cs.ucf.edu, haimk, Tel-Aviv U. and the slides from .
Copyright By PowCoder代写 加微信 powcoder
Regular Expressions
• Notationtospecifyalanguage – Declarative
– Sort of like a programming language.
• Fundamental in some languages like Perl and applications like grep or lex
– Capable of describing the same thing as an NFA • The two are actually equivalent, so RE = NFA = DFA
– We can define an algebra for regular expressions
Definition of a Regular Expression
R is a regular expression if it is:
1. a for some a in the alphabet å, standing for the language {a}
2. ε, standing for the language {ε}
3. Ø, standing for the empty language
4. R1+R2 where R1 and R2 are regular expressions, and + signifies union (sometimes | is used)
5. R1R2 where R1 and R2 are regular expressions and this signifies concatenation
6. R* where R is a regular expression and signifies closure
7. (R) where R is a regular expression, then a parenthesized R is also a regular expression
This definition may seem circular, but 1-3 form the basis Precedence: Parentheses have the highest precedence, followed by *, concatenation, and then union.
Using Regular Expressions
• Regular expressions are a standard programmer’s tool.
• Built in to Java, Perl, Unix, Python, . . . .
RE Examples
• L(0+10*) = { 0, 1, 10, 100, 1000, 10000, … }
• L(0*10*) = {1, 01, 10, 010, 0010, …} i.e. {w | w has exactly a single 1}
• L(åå)* = {w | w is a string of even length}
• L( (0(0+1))* ) = { ε, 00, 01, 0000, 0001, 0100, 0101, …}
• L((0+ε)(1+ ε)) = {ε, 0, 1, 01}
• L(1Ø) = Ø ; concatenating the empty set to any set yields the empty set.
• Note that R+ε may or may not equal R (we are adding ε to the language)
• Note that RØ will only equal R if R itself is the empty set.
• L(001) = {001}
Exercise 1
• Let ∑ be a finite set of symbols • ∑={10,11},∑*=?
Answer: ∑* = {є, 10, 11, 1010, 1011, 1110, 1111, …}
Exercise 2
• L1={10,1},L2={011,11},L1L2=?
Answer • L1L2 = {10011, 1011, 111}
Exercise 3
• Write RE for
– All strings of 0’s and 1’s
– All strings of 0’s and 1’s with at least 2
consecutive 0’s
– All strings of 0’s and 1’s beginning with 1 and not having two consecutive 0’s
All strings of 0’s and 1’s.
• (0|1)*00(0|1)*
All strings of 0’s and 1’s with at least 2 consecutive 0’s
All strings of 0’s and 1’s beginning with 1 and not having two consecutive 0’s
More Exercises
• 1) (0|1)*011 • 2) 0*1*2*
• 3) 00*11*22*
More Exercises (Answers) 1) (0|1)*011
Answer: all strings of 0’s and 1’s ending in 011
• Answer: any number of 0’s followed by any number of 1’s followed by any number of 2’s
• 3) 00*11*22*
Answer: strings in 0*1*2 with at least one of each symbol
Deterministic Finite Automata (DFA)
• Simple machine with N states.
• Begin in start state.
• Read first input symbol.
• Move to new state, depending on current state and input symbol.
• Repeat until last input symbol read.
• Accept or reject string depending on label of last state.
Theory of DFAs and REs
• RE. Concise way to describe a set of strings.
• DFA. Machine to recognize whether a given string is in a given set.
• Duality: for any DFA, there exists a regular expression to describe the same set of strings; for any regular expression, there exists a DFA that recognizes the same set.
Duality Example • DFAformultipleof3b’s:
• REformultipleof3b’s:
Fundamental Questions
• Which languages CANNOT be described by any RE?
• Set of all bit strings with equal number of 0s and 1s.
• Set of all decimal strings that represent prime numbers.
• Manymore….
• Make a DFA that accepts the strings in the language denoted by regular expression ab*a
• Write the RE for the following automata:
• a(a|b)*a
DFA to RE: State Elimination
• Eliminates states of the automaton and replaces the edges with regular expressions that includes the behavior of the eliminated states.
• Eventually we get down to the situation with just a start and final node, and this is easy to express as a RE
State Elimination
• Consider the figure below, which shows a generic state s about to be eliminated.
• The labels on all edges are regular expressions.
• To remove s, we must make labels from each qi to p1 up to pm that include the paths we could have made through s.
DFA to RE via State Elimination (1)
• Starting with intermediate states and then moving to accepting states, apply the state elimination process to produce an equivalent automaton with regular expression labels on the edges.
• The result will be a one or two state automaton with a start state and accepting state.
DFA to RE State Elimination (2)
• Ifthetwostatesaredifferent,wewillhavean automaton that looks like the following:
• Wecandescribethisautomatonas:(R| SU*T)*SU*
DFA to RE State Elimination (3)
• Ifthestartstateisalsoanacceptingstate,then we must also perform a state elimination from the original automaton that gets rid of every state but the start state. This leaves the following:
• WecandescribethisautomatonassimplyR*
DFA to RE State Elimination (4)
• If there are n accepting states, we must repeat the above steps for each accepting states to get n different regular expressions, R1, R2, … Rn.
• For each repeat we turn any other accepting state to non-accepting.
• The desired regular expression for the automaton is then the union of each of the n regular expressions: R1 U R2… U RN
DFA->RE Example • ConvertthefollowingtoaRE:
• FirstconverttheedgestoRE’s:
DFA -> RE Example (2) • EliminateState1:
• Noteedgefrom3->3
• Answer: (0+10)*11(0+1)*
Second Example
• Automatathat accepts even number of 1’s
• Eliminatestate2:
Second Example (2)
• Twoacceptingstates,turnoffstate3first
• This is just 0*; can ignore going to state 3 since we would “die”
Second Example (3)
• Turn off state 1 second:
• This is just 0*10*1(0|10*1)*
• Combine from previous slide to get 0* | 0*10*1(0|10*1)*
Text search
• Pattern matching directly – Brute force
• Regular expressions
• Indices for pattern matching – Inverted files
– Signature files
– Suffix trees and Suffix arrays
Inverted Index
For each term t, we store a list of all documents that contain t.
dictionary postings
Create postings lists, determine document frequency
Positional indexes
§Postings lists in a nonpositional index: each posting is just a docID
§Postings lists in a positional index: each posting is a docID and a list of positions
Positional indexes: Example
Query: “to1 be2 or3 not4 to5 be6” TO, 993427:
‹ 1: ‹7, 18, 33, 72, 86, 231›; 2: ‹1, 17, 74, 222, 255›;
4: ‹8, 16, 190, 429, 433›; 5: ‹363, 367›;
7: ‹13, 23, 191›; . . . › BE, 178239:
‹ 1: ‹17, 25›;
4: ‹17, 191, 291, 430, 434›;
5: ‹14, 19, 101›; . . . › Document 4 is a match!
Signature files
• Definition
– Word-oriented index structure based on hashing.
– Use liner search.
– Suitable for not very large texts.
• Structure
– Based on a Hash function that maps words to bit masks.
– The text is divided in blocks.
• Bit mask of block is obtained by bitwise ORing the
signatures of all the words in the text block.
• Word not found, if no match between all 1 bits in the query mask and the block mask.
Signature files
• Example:
block 2 block3 block 4
This is a text. A text has many words. Words are made from letters
Text signature
h(text) = 000101 h(many) = 110000 h(words) = 100100 h(made) = 001100 h(letters) = 100001
Signature function
Signature files • False drop Problem
– The corresponding bits are set even though the word is not there!
– The design should insure that the probability of false drop is low.
• Also the Signature file should be as short as possible.
– Enhance the hashing function to minimize the error probability.
Signature files
1. Forasingleword,HashwordtoabitmaskW.
2. Forphrases,
1) Hashwordsinquerytoabitmask.
2) BitwiseORofallthequerymaskstoabitmaskW.
3. CompareWtothebitmasksBiofallthetextblocks.
• If all the bits set in W are also in Bi, then text block may contain the word.
4. Forallcandidatetextblocks,anonlinetraversalmust be performed to verify if the actual matches are there.
• Construction
1. Cutthetextinblocks.
2. Generateanentryofthesignaturefileforeachblock.
• This entry is the bitwise OR of the signatures of all the words
in the block.
• Searching
Suffix trees and suffix arrays
• A tree representing a set of strings.
aeef ad bbfe bbfg c}
Trie (Cont)
• Assume no string is a prefix of another
Each edge is labeled by a letter,
no two edges outgoing from the same node are labeled the same.
Each string corresponds to a leaf.
Compressed Trie
• Compressunarynodes,labeledgesbystrings
Suffix tree
Given a string s a suffix tree of s is a compressed trie of all suffixes of s
To make these suffixes prefix-free we add a special character, say $, at the end of s
Suffix tree (Example)
Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$
ab$ a$ bab$ a$b abab$ } b
Trivial algorithm to build a Suffix tree
Put the largest suffix in
Put the suffix bab$ in
Put the suffix ab$ in
Put the suffix b$ in
Put the suffix $ in
$ We will also label each leaf with the a
starting point of the corresponding suffix.
Takes O(n2) time to build.
*It can also be constructed in O(n) time.
What can we do with it ?
Exact string matching:
Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T.
W e may also want to find all occurrences of P in T
Exact string matching
In preprocessing we just build a suffix tree in O(n) time
Given a pattern P = ab we traverse the tree according to the pattern.
If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(n+k) time.
Generalized suffix tree
Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of sÎS
To make these suffixes prefix-free we add a special char, say $, at the end of s
To associate each suffix with a unique string in S add a different special char to each s
Generalized suffix tree (Example)
Let s1=abab and s2=aab here is a generalized suffix tree for s1 and s2
b$ b# ab$ ab# bab$ aab# abab$
So what can we do with it ?
Matching a pattern against a database of strings
Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string
s2 a$ represents a maximal common substring and b
vice versa.
Find such node with largest “string depth”
b a ab b # $
Lowest common ancestor
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it
The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes
Finding maximal palindromes
• A palindrome: caabaac, cbaabc
• Want to find all maximal palindromes in a
Let s = cbaaba
The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr
Maximal palindromes algorithm
Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc#
For every i find the LCA of suffix i of s and suffix m-i+1 of sr
Let s = cbaaba$ then sr = abaabc# a#
b a a b a $
O(n) time to identify all palindromes
• Suffix trees consume a lot of space
• It is O(n) but the constant is quite big
• Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node
Suffix array
• We loose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically:
ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order
How do we build it ?
• Build a suffix tree
• Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array.
• O(n) time
How do we search for a pattern ?
• If P occurs in T then all its occurrences are consecutive in the suffix array.
• Do a binary search on the suffix array • Takes O(mlogn) time
Let S = mississippi L
Let P=issa
ippi issippi ississippi
mississippi pi
sisippi ssippi
R ssissippi
Supra index
– Suffix arrays are space efficient implementation of suffix
• Structure trees.
– Simply an array containing all the pointers to the text suffixes listed in lexicographical order.
– Supra-indices:
• If the suffix array is large, this binary search can perform
poorly because of the number of random disk accesses.
• Suffix arrays are designed to allow binary searches done by
comparing the contents of each pointer.
• To remedy this situation, the use of supra-indices over the suffix array has been proposed.
Supra index • Example
1 6 911 1719 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters
Suffix Array
Supra-Index Suffix Array
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com