Lecture 12: Another Greedy Algorithm
Optimal 2-way Merge
We are given n sorted lists L1, L2, …, Ln , which need to be merged into a combined sorted list, but we can merge only two at a time. Find an optimal merge pattern which minimizes the total number of comparisons
Copyright By PowCoder代写 加微信 powcoder
Example. Suppose there are 3 sorted lists L1, L2, L3 of sizes 30, 20, and 10, respectively.
We can first merge L1 and L2, which uses 30 + 20 = 50 comparisons resulting in a list of size 50. We can then merge this list with list L3, using another 50 + 10 = 60 comparisons, so the total number of comparisons is 50 + 60 = 110.
Alternatively, we can merge lists L2 and L3, using 20 + 10 = 30 comparisons, the resulting list (size 30) can then be merged with list L1, for another 30 + 30 = 60 comparisons. So the total number of comparisons is 30 + 60 = 90.
You could also merge L1 and L3, first and then the result with L2. The total number of comparisons is 100.
Binary Merge Tree
Equivalent problem: You are given a set of leaf nodes 𝑎, … , 𝑎 and associated leaf weights 𝑤(𝑎), … , 𝑤(𝑎) (the leaf nodes correspond to the initial lists, and the weights to their sizes).
Create a binary tree from the leaf nodes towards the root, in which the size of each node is the sum of the sizes of the two children.
A binary merge tree is optimal if it minimizes the weighted external path length. The weighted external path length of the tree is 𝐵 𝑇 = ∑ 𝑤 𝑎 𝑑(𝑎)
Cost = 30*2 + 20*2 + 10*1 = 110
Cost = 30*1 + 20*2 + 10*2 = 90
Merge L1 and L2, then with L3 Merge L2 and L3, then with L1
Optimal Binary Merge Tree Algorithm
Input: n 2 leaf nodes, each with a size (i.e., # list elements) . Output: a binary tree with the given leaf nodes which has a
minimum total weighted external path lengths
Algorithm:
Create a min-heap T[1..n ] based on the n initial sizes. While (the heap size 2) do
extract from the heap two smallest values a and b
create intermediate node of size a + b whose children are a
insert the value (a + b) into the heap
Time complexity O(nlogn)
It can be shown that the Binary Merge Tree is optimal
Iteration 2: merge 5 and 5 into 10
Initially, 5 leaf nodes with sizes
Iteration 1: merge 2 and 3 into 5
Iteration 3: merge 7 and 9 (chosen among 7, 9, and 10) into 16
Iteration 4: merge 10 and 16 into 26
Example of Optimal Merge Tree
Cost = 2*3 + 3*3 + 5*2 + 7*2 + 9*2 = 57.
Frequency (in thousands) Fixed-length codeword Variable-length codeword
45 13 12 16 9 5
000 001 010 011 100 101 0 101 100 111 1101 1100
Encoding: Replace characters by corresponding codewords. Example: Encode the word faded
Fixed-length Code: 101000011100011 Variable-length Code: 110001111101111
Frequency (in thousands)
Fixed-length codeword
Variable-length codeword
45 13 12 16 9 5
000 001 010 011 100 101
0 101 100 111 1101 1100
Encoding: Replace characters by corresponding codewords.
Q: How can one design a code minimizing length of encoded message?
Ex: For a file with 100,000 characters that appear with the frequencies given in the table,
The fixed-length code requires
The variable-length code requires
Decoding: Replace codewords by corresponding characters.
C1 = {a = 00, b = 01, c = 10, d = 11}. C2 = {a = 0, b = 110, c = 10, d = 111}. C3 = {a = 1, b = 110, c = 10, d = 111}
A message is uniquely decodable if it can only be decoded in one way.
In fact, one can show that every message encoded using C1 or C2 is uniquely decodable.
C1: Because it is a fixed-length code. C2: Because it is a prefix-free code.
Relative to C1, 010011 is uniquely decodable to bad.
Relative to C2, 1100111 is uniquely decodable to bad.
But, relative to C3, 1101111 is not uniquely decipherable since it could have encoded to either bad or acad.
Prefix Codes
Def: A code is called a prefix (free) code if no codeword is a prefix of another codeword.
Theorem: Every message encoded by a prefix free code is uniquely decodable. Pf: Since no codeword is a prefix of any other, we can always find the first
codeword in a message, peel it off, and continue decoding. Ex: code: {a = 0, b = 110, c = 10, d = 111}.
Note: There are other kinds of codes that are also uniquely decodable.
Theorem (proof omitted): The best prefix code can achieve the optimal data compression among any code that is uniquely decodable.
01101100 = 0 110 110 0
Problem: For a given input file, find the (a) prefix code that results in the smallest encoded message. (Compression)
Correspondence between Binary Trees and Prefix Codes
Frequency (in thousands) Fixed-length codeword Variable-length codeword
b:001 c:010 d:011 e:100 f:101
45 13 12 16 9 5
000 001 010 011 100 101 0 101 100 111 1101 1100
c:100 b:101 f:1100
d:111 e:1101
Left edge labeled 0; right edge is labeled 1. Binary string read off on path from root to a leaf
is the codeword associated with the character at that leaf. Depth of a leaf is equal to the length of the codeword.
Weighted External Path Length
Given a tree with leaves labeled and associated leaf weights
, the weighted external path length of the tree is
a:45 b:13 c:12 d:16 e:9 f:5
45+13+12+16+9+5 ∗3=300
c:12 b:13 d:16
45∗1+ 12+13+16 ∗3+ 9+5 ∗4=224
Correspondence between Binary Trees and Prefix Codes
Frequency (in thousands) Fixed-length codeword Variable-length codeword
a:000 b:001 c:010 d:011 e:100
a b c d e f TotalCost 45 13 12 16 9 5
000 001 010 011 100 101 0 101 100 111 1101 1100
c:100 b:101
45 13 12 16 9 5
Set weight of leaf to be frequency of associated code word
f:1100 e:1101
External Weighted Path Length (cost ) of tree is EXACTLY total cost of code
=> Finding min cost code is same as finding min-cost tree!
Problem Restated
Problem definition: Given an alphabet of characters with
1 such that
is minimized, where
Greedy idea:
, find a binary tree with leaves labeled
Label the root of this subtree as
Set frequency
Remove from and add to
is the depth of .
Pick two characters from with the smallest weights
Create a subtree that has these two characters as leaves.
Repeat the above procedure (called a “merge”), until only one character is left.
(a) f:5 e:9
c:12 b:13 d:16
b:13 0 14 1 d:16 a:45 f:5 e:9
0 14 1 f:5 e:9
d:16 0 25 1 c:12 b:13
0 30 1 a:45
14 d:16 01
25 30 a:45 0551
c:12 b:13 14 d:16 0 25 1 010
c:12 b:13 14
The Algorithm
Huffman(𝐴):
create a min-priority queue 𝑄 on 𝐴, with weight as key for 𝑖←1 to 𝑛−1
allocate a new node 𝑧 𝑥 ← Extract-Min(𝑄)
𝑦 ← Extract-Min(𝑄)
𝑧. 𝑙𝑒𝑓𝑡 ← 𝑥
𝑧. 𝑟𝑖𝑔h𝑡 ← 𝑦
𝑧. 𝑤𝑒𝑖𝑔h𝑡 ← 𝑥. 𝑤𝑒𝑖𝑔h𝑡 + 𝑦. 𝑤𝑒𝑖𝑔h𝑡 Insert(𝑄, 𝑧)
return Extract-Min(𝑄) // return the root of the tree
Running time:
: Correctness
Lemma 1: An optimal prefix code tree must be “full”, i.e., every internal node has exactly two children.
Pf: If some internal node had only one child,
then we could simply get rid of this node, replacing it with its child. This would decrease the total cost of the encoding
(because no leaf increases depth and some leaves(s) decrease depth)
: Correctness
Observation: Moving a small-frequency character downward in increase tree cost.
Lemma 2: Let be prefix code tree and
be another obtained from by swapping two leaf nodes
doesn’t and .
: Correctness
𝑓(𝑏) b c 𝑓(𝑥) x
: Correctness
Lemma 3: Consider the two characters and with the smallest frequencies. There is an optimal code tree in which these two letters are sibling leaves at the deepest level of the tree.
Pf: Let be any optimal prefix code tree, and be two siblings at the deepest level of the tree (must exist because is full).
T T’ T” xbb
Assume without loss of generality that and
(If necessary) swap with and swap with .
Proof follows from Lemma 2.
(Lemma 2 => cost can’t increase => since old tree optimal, new one is also)
: Correctness
Lemma 4: Let be a prefix code tree in which and are two
sibling leaves. Let naming their parent
be obtained from , and setting
by removing and , . Then
Observation:
: Correctness
• Let be the tree produced by Huffman’s algorithm for alphabet .
• Let and be the first two items merged together by the algorithm.
Note that these are siblings in . • Let be a new character with
• Let ’ be the tree obtained from parent , and setting
by removing .
and , naming their
Then ’ is exactly the tree constructed by the Huffman algorithm on .
: Correctness
Theorem: The Huffman tree is optimal.
Pf: (By induction on , the number of characters)
Base case : Tree with two leaves. Obviously optimal.
: Correctness
Theorem: The Huffman tree is optimal.
Pf: (By induction on , the number of characters)
Induction hypothesis: Huffman’s algorithm produces optimal tree
for all inputs case of characters.
Induction step: Consider input of characters:
– Let be the tree produced by Huffman’s algorithm.
– Need to show: is optimal.
From operation of Huffman’s algorithm:
– There exist two characters and with two
smallest frequencies that are sibling leaves in .
Let be obtained from (i) removing and ,
(ii) naming their parent (iii) setting
Alphabet for ; Alphabet for By Lemma 4,
: Correctness
is the tree produced by Huffman’s algorithm for is the tree produced by Huffman’s algorithm for
By the induction hypothesis, is optimal for .
By Lemma 3, there exists some optimal tree for which and are sibling leaves.
(with x,y)
(with z, without x,y)
Let be obtained from by (i) removing ,
(ii) naming the parent (iii) setting
is a prefix code tree for alphabet
is optimal for
By Lemma 4,
=> must be optimal!
A is original character set
𝐴′ = 𝐴∪{𝑧}−{𝑥,𝑦}
H built by on A => H’ built by on A’ By induction H’ optimal
𝐵(𝐻) = 𝐵(𝐻′) + 𝑓(𝑥) + 𝑓 (𝑦)
T chosen as Optimal tree for A T’ built from T as tree on A’
𝐵𝑇 =𝐵𝑇 +𝑓𝑥 +𝑓 𝑦
𝐵(𝐻) = 𝐵(𝐻′) + 𝑓(𝑥) + 𝑓(𝑦) ≤𝐵 𝑇 +𝑓 𝑥 +𝑓 𝑦
Diagram of the proof
Optimal By Assumption
Optimal By Induction
=>HisoptimalforA
Exercise on Matching Points and Covering Intervals
Givennpointsxi (1≤i≤n)onthereallineandnintervalsIj =[sj ,fj],(1≤j≤n), design an algorithm to determine if each point can be assigned to a distinct interval that covers it.
intervals I3
I1I2 I4 I5
p1 p2 p3 p4 p5
intervals I3
I1I2 I4 I5
p1 p2 p3 p4 p5
Sort points in non-decreasing order x1 ≤ …. ≤ xi ≤ …. ≤ xn For i=1 to n in the sorted order
Find interval Ij such that sj ≤ xi and fj is min among intervals covering xi If such interval exists assign xi to fj
else return false // no assignment possible
G={(p1,I2), (p2,I1),(p3,I4), (p4,I3),…}
O={(p1,I2), (p2,I1),(p3,I3),…} O*={(p1,I2), (p2,I1),(p3,I4), …}
Exercise on Tiling Path
Let X be a set of n intervals on the real line; each interval x has a starting x.s and a finishing time x.f. A subset of intervals Y X is called a tiling path if the intervals in Y cover the intervals in X, that is, any real value that is contained in some interval in X is also contained in some interval in Y . The size of a tiling cover is just the number of intervals. Design an algorithm to compute the minimal tiling path of X. Assume that all start and finishing times are distinct.
Q: Is the above tiling path minimal?
A: No. The 2nd and 3rd intervals in the path can be replaced by a single one.
Q: In which order you consider the intervals of the tiling path?
A: Increasing order of starting time.
Q: When an interval of the tiling path reaches its finishing time, which interval you would select to succeed it in the path?
A: The interval among those encountered with the largest finishing time.
Exercise on Tiling Path – Algorithm
{x1,x2,..,xn} Sort intervals in increasing starting time. Insert x1 to tilling path Y
last x1; next x1;
For i=2 to n
if xi.s < last.f then // last covers the beginning of xi
if xi.f > next.f then next xi // xi may be next in Tiling Path (i) last does not cover the beginning of xi
if next≠last then insert next to Y
if next.f > xi.s then // next covers the beginning of xi
last next
if xi.f > next.f then next xi (ii) else // next does not cover the beginning of xi
insert xi to Y; last xi; next xi (iii) if next≠last then insert next to Y
Case (i) Case (ii)
Case (iii)
lastnext lastnext
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com