Multiple sequence alignment,
UPGMA, Feng- Sci 369, 2022
School of Computer Science, University of Auckland
Copyright By PowCoder代写 加微信 powcoder
Last lecture
Alignment using a ne gap penalty Approaches to multiple sequence alignment
This lecture
Constructing a tree with UPGMA
Feng-Doolitle algoruthm for multiple sequence alignment
MSA via progressive alignment
Most widely used technique for MSA Involves a series of pairwise alignments.
Basic idea: choose a pair of sequences and align them, choose a third and align it to either of rst two, and so on until all sequences aligned.
May want to align two MSA together: eg when aligning 4 sequences, might align two pairs rst then align the resulting alignments together.
MSA via progressive alignment
Issues to address:
Order in which to align the sequences How to align two sequences together How to align a sequence to a MSA How to align two MSAs together
The order of alignment can be determined using a guide tree. We’ll look at a method of construction a guide tree, known as UPGMA.
Building trees with UPGMA
UPGMA: Given objects (sequences) and distances between each pair of objects (a distance matrix), UPGMA builds a tree where distance along tree between a pair is same as distance in matrix.
UPGMA: Unweighted Pair Group Method using Arithmetic Averages Basic idea:
1 Start with each object in own cluster and assign leaf node to each cluster
2 Merge closest cluster together and assign (internal) node at height half the distance
3 if number of clusters is greater than two, go to 2, else stop.
Given objects and distnace matrix where and th object
De ne distance between two clusters (sets) of objects and distance between all pairs between clusters:
is distance between as the average
is the number of sequences in cluster .
jC∈y,iC∈x |jC||iC|
yxd ∑ 1 =jid
UPGMA algorithm
Initialise: Assign each sequence to it’s own cluster . Assign a leaf node to each cluster at height 0.
Repeat until only one cluster remains:
1 Find clusters and such that is minimal (choose randomly between
equidistant candidates).
2 Join and to make the new cluster .
3 De ne a node in the tree placed at height with child nodes and . 4 Update the distance matrix using
lC∈y,kC∈x |lC||kC| yxd∑ 1=lkd
ji 2/jid k
jC ∪ iC = kC j i
UPGMA example
Given 4 sequences, and and distance matrix , construct the UPGMA tree.
First step is to make 4 clusters and assign each a node at 0
−D 6−C 88−B=d 884−A
Pair with distance Join the cluster
is closest.
which has height
Recalculate distance matrix to nd
)D ,E(d )C ,E(d
2 = 2/)B ,A(d }B ,A{ = B ∪ A = E
4 = )B,A(d )B,A(
Recalculate distance matrix to nd and . and similarly for .
−D 6−C 88−E
8 = )8+8(2 = ))C,B(d+)C,A(d(1⋅2 = )C,E(d 11
)D ,E(d )C ,E(d
Now form the cluster and place the node at .
The distance matrix is now the single distance between the remaining clusters: So make the last node and place it at height .
4 = 2/8 }F,E{ = G
8 = )8+8+8+8(4 = ))D,B(d+)C,B(d+)D,A(d+)C,A(d(2⋅2 = )F,E(d 11
3 = 2/)D,C(d }D,C{ = F
Resulting UPGMA tree is
Feng-Doolittle progressive alignment
Steps for aligning sequences are:
1 Calculate the distances between all sequences pairs.
2 Build a guide tree based on the recorded scores (we use UPGMA).
3 Build the alignment in the order that nodes were added to the tree.
2/)1 − n(n n
Distance used in Feng-Doolittle algorithm
The distances are found by aligning each pair and recording a normalized score. The normalized score is
is the expected score for an alignment of the pair when the residues are
randomly shu ed.
is roughly a normalised percentage similarity which decays exponentially towards zero with increasing evolutionary distance.
Taking to make the score increase approximately linearly with evolutionary distance.
is the score from the pairwise alignment
is the average of scores obtained by aligning each sequences of the pair
dnarS−xamS gol−= ffeSgol−=D dnarS − sboS
) f feS(gol −
Aligning sequences and MSAs in Feng- pair of sequences is aligned in the normal way.
A sequence is aligned to an MSA by aligning it to each sequence in the MSA and choosing the highest scoring alignment.
Two alignments are aligned to each other by aligning all pairs of sequences between the two groups and choosing the best alignment.
After alignment completed at each step, gap characters are replaced with a neutral character which can be aligned to any other character (gap or residue) with no
cost. Tends to cluster gaps together.
Can improve inital MSA by making random adjustments, eg remove a sequence at random and re-align it.
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com