Data Mining and Machine Learning
Sequence Analysis & Dynamic Programming
Peter Jančovič Slide 1
Data Mining and Machine Learning
Objectives
To consider data mining for sequential data
To understand Dynamic Programming (DP)
Using DP to compute distance between sequences To understand what is meant by:
– An alignment path
– The DP recurrence equation
– The distance matrix
– The accumulated distance matrix – The optimal path
Slide 2
Data Mining and Machine Learning
Sequences
Sequences are common in real applications:
– DNA analysis in bioinformatics and forensic science
– Sequences of the letters A, G, C and T – Signature recognition biometrics
– Words and text
– Spelling and grammar checkers, author verification,.. – Speech, music and audio
– Speech/speaker recognition, speech coding and synthesis
– Electronic music
– Radar signature recognition…
Slide 3
Data Mining and Machine Learning
Mining sequential data
Sequences may not be amenable to human interpretation (complexity, dimension, quantity)
Need for automated sequential data retrieval/mining
For clustering and other tools, the fundamental requirement is for a measure of the distance between two sequences
Slide 4
Data Mining and Machine Learning
Basic definitions
In a typical sequence analysis application we have a
basic alphabet consisting of N symbols 1,…,n,…,N
Examples:
– In text A is the set of letters plus punctuation plus
‘white space’
– Bioinformatics A = {A, G, C, T} (elements of DNA sequences)
Slide 5
Data Mining and Machine Learning
Sequences of continuous variables
In some applications, elements of a discrete sequence are taken from a continuous vector space, rather than a finite set
Sequences of continuous variables can be dealt with in two ways:
– Directly
– Vector quantization (VQ):
– Represent space as a set of K centroids:
– Replace each data point by its closest centroid
Slide 6
Data Mining and Machine Learning
Distance between sequences (1)
Sequences from the alphabet A = {A, B, C, D}
How similar are the sequences:
– S1 = ABCD
– S2 = ABD
Intuitively S2 is obtained from S1 by deleting C
Alternatively S2 is obtained from S1 by substituting D for C and then deleting D
Slide 7
Data Mining and Machine Learning
Distance between sequences (2)
Or S2 was obtained from S1 be deleting ABCD and inserting ABC
…
First explanation is intuitively ‘correct’, final
explanation is intuitively ‘wrong’. Why?
We favour the simplest explanation, involving the minimum number of insertions, deletions and substitutions
…but maybe not always
Slide 8
Data Mining and Machine Learning
Distance between sequences (3)
Consider:
– S1 = AABC
– S2 = SABC – S3 = PABC – S4 = ASCB
If these sequences were typed then maybe S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard
Similarly S4 is close to S2 because letter-swapping (SA AS etc) is a common typographical error
Slide 9
Data Mining and Machine Learning
Alignments
Relationship between two sequences can be expressed as an alignment between their elements
Insertion (w.r.t. ABC) is a horizontal step
ABBC A
B
C
Slide 10
Data Mining and Machine Learning
Alignment: deletion and substitution
Deletion is a Substitution or perfect vertical step alignment are diagonal steps
AC AXC AA
BB CC
Slide 11
N.B: Edits described relative to vertical string Data Mining and Machine Learning
Alternative alignment paths
ACXCCD A
B
C
D
Which alignment path is best?
Slide 12
Data Mining and Machine Learning
The Distance Matrix
Let d be a metric, so d(A,B) is the distance between the alphabet symbols A and B
Examples:
– d(A,B) = 0 if A = B, otherwise d(A,B) = 1
– In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B
– For continuous valued sequences d could be Euclidean distance, or City Block distance, or L distance
Slide 13
Data Mining and Machine Learning
Notation
Suppose we have an alphabet:
1,…,n,…,N
The distance matrix for A is an N N matrix
D Dm,n , 1 m,n N
where Dm,n dm ,n is the distance between the
mth and nth alphabet symbols
Slide 14
Data Mining and Machine Learning
The Accumulated Distance
Consider two sequences: – S1 = ABCD
– S2 = ACXCCD
For an alignment path p between S1 and S2 the accumulated distance between S1 and S2, denoted by ADp(S1,S2), is the sum over all the nodes of p of the corresponding distances between elements of S1 and
S2 Slide 15
Data Mining and Machine Learning
Accumulated distance along p ACXCCD
A
B
C
D
Path p
ADp(S1,S2) dC,C dC,CdC,CdD,D
Slide 16
Data Mining and Machine Learning
dA,A
dA,B
dC,X
Accumulated distance (continued)
ACXCCD A
B
C
D
ADp(S1,S2)dA,BdC,X ADp(S1,S2) dA,B dC,X
KDEL
KINS
KINS
KINS
K DEL
KINS
KINS
KINS
Slide 17
Data Mining and Machine Learning
Optimal path and DP distance
Optimal path is path with minimum accumulated distance
pˆ
pˆ a r g m i n A D p S 1 , S 2 , o r A D pˆ S 1 , S 2 m i n A D p S 1 , S 2
Formally the optimal path is where:
pp
The DP distance, or accumulated distance AD(S1,S2)
between S1 and S2 is given by:
A D S 1 , S 2 A D pˆ S 1 , S 2
Slide 18
Data Mining and Machine Learning
Calculating the optimal path
Given
– the distance matrix D,
– the insertion penalty KINS , and – the deletion penalty KDEL*
How can we compute the optimal path between two (potentially very long) sequences S1 and S2?
*If KDEL and KINS are not defined you should assume that they are zero
Slide 19
Data Mining and Machine Learning
Dynamic Programming (DP)
Optimal path calculated using Dynamic Programming (DP), based on principle of optimality
• IfpathsfromXtoYgothroughAorBimmediately before Y, optimal path from X to Y is best of:
• Best path from X to A plus cost to go from A to Y
• Best path from X to B plus cost to go from B to Y
A XY
B
Slide 20
Data Mining and Machine Learning
DP – step 1
Step 1: draw the trellis of all possible paths
ACXCCD A
B
C
D
Slide 21
Data Mining and Machine Learning
ad(1,1)=d(A,A)
ad(2,1)=ad(1,1)+d(2,1)+KDEL
DP – forward pass – initialisation
Accumulated distance matrix
ACXCCD A
B
C D
Path matrix
Slide 22
Data Mining and Machine Learning
ad(i,j)
ad(i,j) is the sum of distances along the best (partial)
path from (1,1) to (i,j)
Calculated using the principle of optimality
(i-1,j-1) adi, j min adi 1, j 1 di, j
(i-1,j)
adi1,jKDEL di,j
adi,j1KINS di,j
Forward path matrix records local optimal paths
(i,j-1)
Slide 23
Data Mining and Machine Learning
DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
B
C D
Path matrix
ad(1,2) = ad(1,1)+d(A,C)+KINS
Slide 24
Data Mining and Machine Learning
DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
Path matrix
B
C D
ad2,2dC,
Slide 25
ad2,3KDEL dC,X a d 3 , 3 m i n X
ad3,2KINS dC,X Data Mining and Machine Learning
DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
Path matrix
B
C D
Slide 26
ad3,6KDEL d(D,D) A D S 1 , S 2 a d 4 , 6 m i n a d 3 , 5 d ( D , D )
ad4,5KINS d(D,D) Data Mining and Machine Learning
DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
Path matrix
B
C D
Optimal path obtained by tracing back through path matrix, starting at the bottom right-hand corner
Slide 27
Data Mining and Machine Learning
Summary
Introduction to sequence analysis
Dynamic Programming (DP) and the principle of optimality
Computing the accumulated distance using DP
– Distance matrix, Accumulated distance matrix,
Path matrix, and Optimal path
Recovering the optimal path
Slide 28
Data Mining and Machine Learning