程序代写代做代考 data mining DNA Bioinformatics Data Mining and Machine Learning

Data Mining and Machine Learning
Sequence Analysis & Dynamic Programming
Peter Jančovič Slide 1
Data Mining and Machine Learning

Objectives
 To consider data mining for sequential data
 To understand Dynamic Programming (DP)
 Using DP to compute distance between sequences  To understand what is meant by:
– An alignment path
– The DP recurrence equation
– The distance matrix
– The accumulated distance matrix – The optimal path
Slide 2
Data Mining and Machine Learning

Sequences
 Sequences are common in real applications:
– DNA analysis in bioinformatics and forensic science
– Sequences of the letters A, G, C and T – Signature recognition biometrics
– Words and text
– Spelling and grammar checkers, author verification,.. – Speech, music and audio
– Speech/speaker recognition, speech coding and synthesis
– Electronic music
– Radar signature recognition…
Slide 3
Data Mining and Machine Learning

Mining sequential data
 Sequences may not be amenable to human interpretation (complexity, dimension, quantity)
 Need for automated sequential data retrieval/mining
 For clustering and other tools, the fundamental requirement is for a measure of the distance between two sequences
Slide 4
Data Mining and Machine Learning

Basic definitions
 In a typical sequence analysis application we have a
basic alphabet consisting of N symbols 1,…,n,…,N
 Examples:
– In text A is the set of letters plus punctuation plus
‘white space’
– Bioinformatics A = {A, G, C, T} (elements of DNA sequences)
Slide 5
Data Mining and Machine Learning

Sequences of continuous variables
 In some applications, elements of a discrete sequence are taken from a continuous vector space, rather than a finite set
 Sequences of continuous variables can be dealt with in two ways:
– Directly
– Vector quantization (VQ):
– Represent space as a set of K centroids:
– Replace each data point by its closest centroid
Slide 6
Data Mining and Machine Learning

Distance between sequences (1)
 Sequences from the alphabet A = {A, B, C, D}
 How similar are the sequences:
– S1 = ABCD
– S2 = ABD
 Intuitively S2 is obtained from S1 by deleting C
 Alternatively S2 is obtained from S1 by substituting D for C and then deleting D
Slide 7
Data Mining and Machine Learning

Distance between sequences (2)
 Or S2 was obtained from S1 be deleting ABCD and inserting ABC
…
 First explanation is intuitively ‘correct’, final
explanation is intuitively ‘wrong’. Why?
 We favour the simplest explanation, involving the minimum number of insertions, deletions and substitutions
 …but maybe not always
Slide 8
Data Mining and Machine Learning

Distance between sequences (3)
 Consider:
– S1 = AABC
– S2 = SABC – S3 = PABC – S4 = ASCB
 If these sequences were typed then maybe S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard
 Similarly S4 is close to S2 because letter-swapping (SA AS etc) is a common typographical error
Slide 9
Data Mining and Machine Learning

Alignments
 Relationship between two sequences can be expressed as an alignment between their elements
 Insertion (w.r.t. ABC) is a horizontal step
ABBC A
B
C
Slide 10
Data Mining and Machine Learning

Alignment: deletion and substitution
Deletion is a Substitution or perfect vertical step alignment are diagonal steps
AC AXC AA
BB CC
Slide 11
N.B: Edits described relative to vertical string Data Mining and Machine Learning

Alternative alignment paths
ACXCCD A
B
C
D
Which alignment path is best?
Slide 12
Data Mining and Machine Learning

The Distance Matrix
 Let d be a metric, so d(A,B) is the distance between the alphabet symbols A and B
 Examples:
– d(A,B) = 0 if A = B, otherwise d(A,B) = 1
– In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B
– For continuous valued sequences d could be Euclidean distance, or City Block distance, or L distance
Slide 13
Data Mining and Machine Learning

Notation
 Suppose we have an alphabet:
1,…,n,…,N
 The distance matrix for A is an N N matrix
D  Dm,n , 1 m,n  N
where Dm,n  dm ,n  is the distance between the
mth and nth alphabet symbols
Slide 14
Data Mining and Machine Learning

The Accumulated Distance
 Consider two sequences: – S1 = ABCD
– S2 = ACXCCD
 For an alignment path p between S1 and S2 the accumulated distance between S1 and S2, denoted by ADp(S1,S2), is the sum over all the nodes of p of the corresponding distances between elements of S1 and
S2 Slide 15
Data Mining and Machine Learning

Accumulated distance along p ACXCCD
A
B
C
D
Path p
ADp(S1,S2)  dC,C dC,CdC,CdD,D
Slide 16
Data Mining and Machine Learning
dA,A
dA,B
dC,X

Accumulated distance (continued)
ACXCCD A
B
C
D
ADp(S1,S2)dA,BdC,X ADp(S1,S2) dA,B   dC,X
KDEL
KINS
KINS
KINS
K DEL
KINS
KINS
KINS
Slide 17
Data Mining and Machine Learning

Optimal path and DP distance
 Optimal path is path with minimum accumulated distance

pˆ  a r g m i n A D p  S 1 , S 2  , o r A D pˆ  S 1 , S 2   m i n A D p  S 1 , S 2 
 Formally the optimal path is where:
pp
 The DP distance, or accumulated distance AD(S1,S2)
between S1 and S2 is given by:
A D  S 1 , S 2   A D pˆ  S 1 , S 2 
Slide 18
Data Mining and Machine Learning

Calculating the optimal path
 Given
– the distance matrix D,
– the insertion penalty KINS , and – the deletion penalty KDEL*
 How can we compute the optimal path between two (potentially very long) sequences S1 and S2?
*If KDEL and KINS are not defined you should assume that they are zero
Slide 19
Data Mining and Machine Learning

Dynamic Programming (DP)
 Optimal path calculated using Dynamic Programming (DP), based on principle of optimality
• IfpathsfromXtoYgothroughAorBimmediately before Y, optimal path from X to Y is best of:
• Best path from X to A plus cost to go from A to Y
• Best path from X to B plus cost to go from B to Y
A XY
B
Slide 20
Data Mining and Machine Learning

DP – step 1
 Step 1: draw the trellis of all possible paths
ACXCCD A
B
C
D
Slide 21
Data Mining and Machine Learning

ad(1,1)=d(A,A)
ad(2,1)=ad(1,1)+d(2,1)+KDEL
DP – forward pass – initialisation
Accumulated distance matrix
ACXCCD A
B
C D
Path matrix
Slide 22
Data Mining and Machine Learning

ad(i,j)
 ad(i,j) is the sum of distances along the best (partial)
path from (1,1) to (i,j)
 Calculated using the principle of optimality
(i-1,j-1) adi, j min adi 1, j 1 di, j
(i-1,j)
adi1,jKDEL di,j
adi,j1KINS di,j
 Forward path matrix records local optimal paths
(i,j-1)
Slide 23
Data Mining and Machine Learning

DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
B
C D
Path matrix
ad(1,2) = ad(1,1)+d(A,C)+KINS
Slide 24
Data Mining and Machine Learning

DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
Path matrix
B
C D
ad2,2dC,
Slide 25
ad2,3KDEL dC,X a d  3 , 3   m i n  X 
ad3,2KINS dC,X Data Mining and Machine Learning

DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
Path matrix
B
C D
Slide 26
ad3,6KDEL d(D,D) A D  S 1 , S 2   a d  4 , 6   m i n  a d  3 , 5   d ( D , D )
ad4,5KINS d(D,D) Data Mining and Machine Learning

DP – forward pass – continued
Accumulated distance matrix
ACXCCD A
Path matrix
B
C D
Optimal path obtained by tracing back through path matrix, starting at the bottom right-hand corner
Slide 27
Data Mining and Machine Learning

Summary
 Introduction to sequence analysis
 Dynamic Programming (DP) and the principle of optimality
 Computing the accumulated distance using DP
– Distance matrix, Accumulated distance matrix,
Path matrix, and Optimal path
 Recovering the optimal path
Slide 28
Data Mining and Machine Learning