
Analysis of Algorithms
CSOR W4231

Computer Science Department


Columbia University
Data compression and Huffman coding

1 Data compression
2 Symbol codes and optimal lossless compression
3 Prefix codes
4 Prefix codes and trees
5 The Huffman algorithm

1 Data compression

Motivation
Data compression: find compact representations of data
Data compression standards
• jpeg for image transmission
• mp3 for audio content, mpeg2 for video transmission
• utilities: gzip, bzip2
All of the above use the Huffman algorithm as a basic building block.

Data representation
• An organism’s genome consists of chromosomes (giant linear DNA molecules).
• Chromosome maps: sequences of hundreds of millions of bases (symbols from {A, C, G, T}).
• Goal: store a chromosome map with 200 million bases. How do we represent a chromosome map?
• Encode every symbol that appears in the sequence separately by a fixed-length binary string.
• Codeword c(x) for symbol x: a binary string encoding x, of length l(x)

Example code
• Alphabet A = {A, C, G, T} with 4 symbols
• Encode each symbol with 2 bits:
alphabet symbol x    A    C    G    T
codeword c(x)        00   01   10   11
Output of encoding: the concatenation of the codewords for every symbol in the input sequence
• Input sequence: ACGTAA
• Output: c(A)c(C)c(G)c(T)c(A)c(A) = 000110110000
• Total length of encoding = 6 · 2 = 12 bits.
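A minimal Python sketch of this fixed-length encoding (the dictionary and helper name are illustrative, not part of the slides):

import itertools  # not needed here; standard library only

# Fixed-length symbol code C0: every base is encoded with exactly 2 bits.
C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}

def encode(sequence, code):
    # Concatenate the codeword of every symbol in the input sequence.
    return "".join(code[symbol] for symbol in sequence)

print(encode("ACGTAA", C0))   # -> 000110110000  (6 symbols · 2 bits = 12 bits)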

2 Symbol codes and optimal lossless compression

Symbol codes
Symbol code: a set of codewords where every input symbol is encoded separately.
Examples of symbol codes
• C0 = {00, 01, 10, 11} is a symbol code for {A, C, G, T}.
• ASCII encoding system: every character and special symbol on the computer keyboard is encoded by a different 7-bit binary string.
C0 and ASCII are fixed-length symbol codes: each codeword has the same length.
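For instance, a character’s 7-bit ASCII codeword can be printed with one line of Python (an illustration, not part of the slides):

print(format(ord("G"), "07b"))   # -> 1000111, the 7-bit fixed-length ASCII codeword for "G"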

Unique decodability
Decoding C0
• read 2 bits of the output;
• print the symbol corresponding to this codeword;
• continue with the next 2 bits.
• This decoding algorithm works for ASCII (replace 2 by 7).
• C0, ASCII: distinct input sequences have distinct encodings.
Definition 1.
A symbol code is uniquely decodable if, for any two distinct input sequences, their encodings are distinct.
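This fixed-width decoding loop might look as follows in Python (a sketch; names are illustrative):

def decode_fixed_length(bits, code, width=2):
    # Invert the code once, then read the output `width` bits at a time.
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    return "".join(inverse[bits[i:i + width]] for i in range(0, len(bits), width))

C0 = {"A": "00", "C": "01", "G": "10", "T": "11"}   # the fixed-length code from above
print(decode_fixed_length("000110110000", C0))       # -> ACGTAA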

Lossless compression
• Lossless compression: compress and decompress without errors.
• Uniquely decodable codes allow for lossless compression.
• A symbol code achieves optimal lossless compression when it produces an encoding of minimum length for its input (among all uniquely decodable symbol codes).
• Huffman algorithm: provides a symbol code that achieves optimal lossless compression.

Fixed-length vs variable-length codes
The chromosome map consists of 200 million bases, distributed as follows:
alphabet symbol x    A              C            G             T
frequency freq(x)    110 million    5 million    25 million    60 million
• A appears much more often than the other symbols.
  ⇒ It might be best to encode A with fewer bits.
• Unlikely that the fixed-length encoding C0 is optimal.

3 Prefix codes

Prefix codes
Variable-length encoding C1:
alphabet symbol x
codeword c(x)
• C1 is not uniquely decodable! E.g., 101110: how do we decode it?

Prefix codes
Variable-length encoding C2:
alphabet symbol x    A    C      G      T
codeword c(x)        0    110    111    10
• C2 is uniquely decodable.
• C2 is such that no codeword is a prefix of another.
Definition 2 (prefix codes).
A symbol code is a prefix code if no codeword is a prefix of another.
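A small sketch of how the prefix property can be tested mechanically (the codewords used for C2 are the ones from the table above; the helper name is illustrative):

def is_prefix_code(codewords):
    # No codeword may be a prefix of another (distinct) codeword.
    return not any(c1 != c2 and c2.startswith(c1)
                   for c1 in codewords for c2 in codewords)

C2 = {"A": "0", "T": "10", "C": "110", "G": "111"}
print(is_prefix_code(C2.values()))             # True: C2 is a prefix code
print(is_prefix_code(["0", "1", "01", "10"]))  # False: e.g. "0" is a prefix of "01"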

Decoding prefix codes
1. Scan the binary string from left to right until you’ve seen enough bits to match a codeword;
2. Output the symbol corresponding to this codeword.
• Since no other codeword is a prefix of this codeword or contains it as a prefix, this sequence of bits cannot be used to encode any other symbol.
3. Continue starting from the next bit of the bit string.
Thus prefix codes allow for
• unique decoding;
• fast decoding (the end of a codeword is instantly recognizable).
Examples of prefix codes: C0, C2
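This decoding procedure is a few lines of Python (a sketch; the dictionary-based lookup and the names are illustrative):

def decode_prefix(bits, code):
    # Scan left to right, emitting a symbol as soon as the bits read so far match a codeword.
    inverse = {codeword: symbol for symbol, codeword in code.items()}
    output, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:        # the end of a codeword is instantly recognizable
            output.append(inverse[current])
            current = ""
    return "".join(output)

C2 = {"A": "0", "T": "10", "C": "110", "G": "111"}
print(decode_prefix("0110111100", C2))   # -> ACGTA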

Prefix codes and optimal lossless compression
• Decoding a prefix code is very fast.
  ⇒ We would like to focus on prefix codes (rather than all uniquely decodable symbol codes) when seeking optimal lossless compression.
• Information theory guarantees this loses nothing: for every uniquely decodable code, there exists a prefix code with the same codeword lengths.
• So we can focus solely on prefix codes for optimal compression.
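The information-theoretic guarantee referred to here is the Kraft-McMillan inequality (not stated explicitly on these slides): any uniquely decodable code satisfies Σ 2^(−li) ≤ 1, and any lengths satisfying this inequality are achievable by a prefix code. A quick check for the lengths of C2:

lengths_C2 = [1, 2, 3, 3]                      # codeword lengths of C2 for A, T, C, G
print(sum(2 ** -l for l in lengths_C2) <= 1)   # True: these lengths are achievable by a prefix code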

Compression gains from variable-length prefix codes
Chromosome map: do we gain anything by using C2 instead of C0 when compressing the map of 200 million bases?
Input                 Code C0   Code C2
A: 110 million        00        0
C:   5 million        01        110
G:  25 million        10        111
T:  60 million        11        10
• C0: 2 bits × 200 million symbols = 400 million bits
• C2: 1·110 + 3·5 + 3·25 + 2·60 = 320 million bits
• Improvement of 20% in this example

The optimal prefix code problem
Input:
• Alphabet A = {a1, …, an}
• Set P = {p1, …, pn} of empirical probabilities over A such that pi = Pr[ai]
Output: a binary prefix code C∗ = {c(a1), c(a2), …, c(an)} for (A, P), where codeword c(ai) has length li, and whose expected length
L(C∗) = Σ_{ai ∈ A} pi · li
is minimum among all binary prefix codes.

Chromosome example
With probabilities pA = 110/200, pC = 5/200, pG = 25/200, pT = 60/200:
• L(C2) = 1·(110/200) + 3·(5/200) + 3·(25/200) + 2·(60/200) = 1.6
• Coming up: C2 is the output of the Huffman algorithm, hence an optimal encoding for (A, P).
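The two totals above can be verified with a couple of lines (a sketch; frequencies are in millions of symbols):

freq = {"A": 110, "C": 5, "G": 25, "T": 60}      # millions of bases, 200 million in total
lengths_C2 = {"A": 1, "C": 3, "G": 3, "T": 2}    # codeword lengths under C2

total_bits = sum(freq[s] * lengths_C2[s] for s in freq)               # 320 (million bits)
expected_length = sum(freq[s] / 200 * lengths_C2[s] for s in freq)    # L(C2) = 1.6 bits per symbol
print(total_bits, expected_length)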

4 Prefix codes and trees

Prefix codes and trees
• A binary tree T is a rooted tree such that each node that is not a leaf has at most two children.
• Binary tree for a prefix code: a branch to the left represents a 0 in the encoding and a branch to the right a 1.
Binary tree for prefix code C0: a complete tree of depth 2 whose four leaves are A, C, G, T, with probabilities Pr[A], Pr[C], Pr[G], Pr[T].

Binary tree for prefix code C2: A is a leaf at depth 1, T at depth 2, and C and G at depth 3, with probabilities Pr[A], Pr[T], Pr[C], Pr[G].

Properties of binary trees representing prefix codes
1. Where do alphabet symbols appear in the tree?
2. What do codewords correspond to in the tree?
3. Consider the tree corresponding to the optimal prefix code. Can it have internal nodes with one child?

Properties of binary trees representing prefix codes
1. Symbols must appear at the leaves of the tree T (why?) ⇒ T has n leaves.
2. Codewords c(ai) are given by root-to-leaf paths.
Recall that li is the length of the codeword c(ai) for input symbol ai. Therefore, on the tree T, li corresponds to the depth of ai (we assume that the root is at depth 0).
⇒ Can rewrite the expected length of the prefix code as:
L(C) = Σ_{ai ∈ A} pi · li = Σ_{1 ≤ i ≤ n} pi · depthT(ai) = L(T).
3. Optimal tree must be full: all internal nodes must have exactly two children (why?).
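Both observations can be checked mechanically; here is a small sketch (the nested-pair tree representation and the names are illustrative, not from the slides):

def codewords(tree, prefix=""):
    # A leaf is a symbol; an internal node is a pair (left, right): left edge = 0, right edge = 1.
    if isinstance(tree, tuple):
        left, right = tree
        return {**codewords(left, prefix + "0"), **codewords(right, prefix + "1")}
    return {tree: prefix}

# The tree of code C2: A at depth 1, T at depth 2, C and G at depth 3.
T2 = ("A", ("T", ("C", "G")))

codes = codewords(T2)                         # {'A': '0', 'T': '10', 'C': '110', 'G': '111'}
P = {"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}
L = sum(P[s] * len(codes[s]) for s in P)      # expected length L(T) = 1.6
print(codes, L)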

More on optimal tree
Claim 1. There is an optimal prefix code, with corresponding tree T∗, in which the two lowest-frequency characters are assigned to leaves that are siblings in T∗ at maximum depth.
Proof idea: an exchange argument; start with a tree for an optimal prefix code and transform it into T∗.

Proof of Claim 1
• Let T be the tree for the optimal prefix code.
• Let α, β be the two symbols with the smallest probabilities, that is, Pr[α] ≤ Pr[β] ≤ Pr[s] for all s ∈ A − {α, β}.
• Let x and y be the two siblings at maximum depth in T.
Exchange step 1: obtain T′ from T by swapping the leaves α and x, so that α and y become the siblings at maximum depth.
We want L(T) ≥ L(T′)

Exchange step 2: obtain T′′ from T′ by swapping the leaves β and y, so that α and β become the siblings at maximum depth.
We want L(T′) ≥ L(T′′)

How do the expected lengths of the two trees compare?
L(T) − L(T′)
= Σ_{ai ∈ A} Pr[ai] · depthT(ai) − Σ_{ai ∈ A} Pr[ai] · depthT′(ai)
= Pr[α] · depthT(α) + Pr[x] · depthT(x) − Pr[α] · depthT′(α) − Pr[x] · depthT′(x)
= Pr[α] · depthT(α) + Pr[x] · depthT(x) − Pr[α] · depthT(x) − Pr[x] · depthT(α)
= (Pr[α] − Pr[x]) · (depthT(α) − depthT(x)) ≥ 0
• The third equality follows from the exchange.
• Similarly, exchanging β and y in T′ yields L(T′) − L(T′′) ≥ 0.
• Hence L(T) − L(T′′) ≥ 0.
• Since T is optimal, it must be that L(T) = L(T′′).
• So T′′ is also optimal.
The claim follows by setting T∗ to be T′′.

Building the optimal tree
Claim 1 tells us how to build the optimal tree greedily!
1. Find the two symbols with the lowest probabilities.
2. Remove them from the alphabet and replace them with a new meta-character with probability equal to the sum of their probabilities.
• Idea: this meta-character will be the parent of the two deleted symbols in the tree.
3. Recursively construct the optimal tree using this process.
Greedy algorithms: make a local (myopic) decision at every step that optimizes some criterion, and then show that this sequence of local choices builds an optimal overall solution.

5 The Huffman algorithm

Huffman algorithm
Huffman(A, P)
  if |A| = 2 then
      encode one symbol using 0 and the other using 1, and return the resulting two-leaf tree
  Let α and β be the two symbols with the lowest probabilities
  Let ν be a new meta-character with probability Pr[α] + Pr[β]
  Let A1 = A − {α, β} + {ν}
  Let P1 be the new set of probabilities over A1
  T1 = Huffman(A1, P1)
  return T as follows: replace leaf node ν in T1 by an internal node, and add two children labelled α and β below ν
The output of the Huffman procedure is a binary tree T; the code for (A, P) is its corresponding prefix code.
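As a concrete sketch (not part of the slides), the same procedure can be implemented with Python's heapq module, i.e., the binary min-heap implementation mentioned under running time; the function name, the tuple representation of trees, and the tie-breaking counter are illustrative choices.

import heapq

def huffman_codes(probs):
    # Build the Huffman tree bottom-up: repeatedly merge the two lowest-probability
    # trees into a new "meta-character" node, as in the pseudocode above.
    # Heap entries are (probability, tie_breaker, tree); a tree is a symbol (leaf)
    # or a pair (left_subtree, right_subtree).
    heap = [(p, i, sym) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, t1 = heapq.heappop(heap)     # lowest probability
        p2, _, t2 = heapq.heappop(heap)     # second lowest probability
        heapq.heappush(heap, (p1 + p2, counter, (t1, t2)))
        counter += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: left edge = 0, right edge = 1
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"     # degenerate single-symbol alphabet
    walk(heap[0][2], "")
    return codes

# Chromosome example: the codeword lengths come out as A: 1, T: 2, C: 3, G: 3,
# the same lengths as code C2 (the exact bits may differ between optimal codes).
print(huffman_codes({"A": 110/200, "C": 5/200, "G": 25/200, "T": 60/200}))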

Example: recursive Huffman for chromosome map
Recursive call 1: Huffman({A, C, G, T}, {110/200, 5/200, 25/200, 60/200})
Recursive call 2: Huffman({A, νCG, T}, {110/200, 30/200, 60/200})
Recursive call 3: Huffman({A, νCGT}, {110/200, 90/200})
Resulting trees:
• End of recursive call 3: a root with two children, A (110/200) and νCGT (90/200).
• End of recursive call 2: νCGT is expanded into T (60/200) and νCG (30/200).
• End of recursive call 1: νCG is expanded into C (5/200) and G (25/200); this is the final tree.

Correctness
Claim: the Huffman algorithm returns an optimal prefix code for (A, P).
Proof: by induction on the size of the alphabet n ≥ 2.
• Base case. For n = 2, Huffman is optimal.
• Hypothesis. Assume that Huffman returns the optimal prefix code for alphabets of n symbols.
• Induction step. Let A be an alphabet of size n + 1 and P the corresponding set of probabilities.
Let T1 be the optimal (by the hypothesis) tree returned by our algorithm for (A1, P1), where A1, P1, T1 are as in the pseudocode. Let T be the final tree returned for (A, P) by our algorithm. We claim that T is optimal.
We will prove the claim by contradiction. Assume T∗ is an optimal tree for (A, P) such that

L(T∗) < L(T).     (1)

A useful fact
Fact 3. Let T be a binary tree representing a prefix code. If we replace sibling leaves α, β in T by a meta-character ν where Pr[ν] = Pr[α] + Pr[β], we obtain a tree T1 such that L(T) = L(T1) + (Pr[α] + Pr[β]).
Notation: dT(ai) = depthT(ai)
• α, β are sibling leaves in T, hence dT(α) = dT(β).
• T differs from T1 only in that α, β are replaced by ν. Since dT1(ν) = dT(α) − 1, we obtain

L(T) − L(T1) = Pr[α]·dT(α) + Pr[β]·dT(β) − (Pr[α] + Pr[β])·dT1(ν) = Pr[α] + Pr[β].     (2)

Correctness (cont’d)
• Claim 1 guarantees there is an optimal tree for (A, P) in which α, β appear as siblings at maximum depth.
• W.l.o.g. assume that T∗ is such an optimal tree. By Fact 3, if we replace siblings α, β in T∗ by ν′ where Pr[ν′] = Pr[α] + Pr[β], the resulting tree T1∗ satisfies L(T∗) = L(T1∗) + (Pr[α] + Pr[β]).
• Similarly, the tree T returned by the Huffman algorithm satisfies L(T) = L(T1) + (Pr[α] + Pr[β]).
• By the induction hypothesis, L(T1∗) ≥ L(T1), since T1 is optimal for alphabets of n symbols. Hence

L(T∗) = L(T1∗) + Pr[α] + Pr[β] ≥ L(T1) + Pr[α] + Pr[β] = L(T),     (3)

where the inequality follows from the induction hypothesis.
• Equation (3) contradicts Assumption (1). Thus T must be optimal.

Implementation and running time
1. Straightforward implementation: O(n²) time.
2. Store the alphabet symbols in a min-priority queue, implemented as a binary min-heap with the probabilities as keys.
• Operations: Initialize (O(n)), Extract-min (O(log n)), Insert (O(log n))
Total time: O(n log n).
For an iterative implementation of Huffman, see your textbook.

Example: iterative Huffman for chromosome map
Input (A, P): the chromosome probabilities {110/200, 5/200, 25/200, 60/200}; output: the corresponding prefix code.
Step 1: merge C (5/200) and G (25/200) into a node of weight 30/200.
Step 2: merge that node with T (60/200) into a node of weight 90/200.
Step 3: merge the 90/200 node with A (110/200) at the root, giving the output code.

Beyond Huffman coding
• The Huffman algorithm provides an optimal symbol code.
• Codes that encode larger blocks of input symbols might achieve better compression.
• Storage on noisy media: what if a bit of the output of the compressor is flipped?
• Decompression cannot carry through.
• Need error correction on top of compression.
L(T ∗) < L(T ). (1) A useful fact Let T be a binary tree representing a prefix code. If we replace sibling leaves α, β in T by a meta-character ν where Pr[ν] = Pr[α] + Pr[β], we obtain a tree T1 such that L(T ) = L(T1) + (Pr[α] + Pr[β]). Notation: dT (ai) = depthT (ai) • α, β are sibling leaves in T , hence dT (α) = dT (β). • T differs from T1 only in that α,β are replaced by ν. Since dT1 (ν) = dT (α) − 1, we obtain L(T ) − L(T1) = Pr[α]dT (α) + Pr[β]dT (β) − (Pr[α] + Pr[β])dT1 (ν) = Pr[α] + Pr[β]. (2) Correctness (cont’d) 􏰉 Claim1guaranteesthereissuchanoptimaltreefor(A,P) where α, β appear as siblings at maximum depth. 􏰉 W.l.o.g. assume that T ∗ is such an optimal tree. By Fact 3, if we replace siblings α, β in T ∗ by ν′ where Pr[ν′] = Pr[α] + Pr[β], the resulting tree T1∗ satisfies L(T ∗) = L(T1∗) + (Pr[α] + Pr[β]). 􏰉 Similarly, the tree T returned by the Huffman algorithm satisfies L(T ) = L(T1) + (Pr[α] + Pr[β]). 􏰉 By the induction hypothesis, we have L(T1∗) ≥ L(T1) since T1 is optimal for alphabets of size n. Hence L(T∗) = L(T1∗) + Pr[α] + Pr[β] ≥ L(T1) + Pr[α] + Pr[β] = L(T), (3) where the inequality follows from the induction hypothesis. 􏰉 Equation (3) contradicts Assumption (1). Thus T must be optimal. Implementation and running time 1. Straightforward implementation: O(n2) time 2. Store the alphabet symbols in a min priority queue implemented as a binary min-heap with keys their probabilities 􏰉 Operations: Initialize (O(n)), Extract-min (O(log n)), Insert (O(log n)) Total time: O(n log n) time For an iterative implementation of Huffman, see your textbook. Example: iterative Huffman for chromosome map Input (A, P ) Output code Step 1 30/200 01 5/200 25/200 Step 2 90/200 Step 3 01 01 30/200 110/200 90/200 01A01 5/200 25/200 60/200 30/200 5/200 25/200 Beyond Huffman coding 􏰉 Huffman algorithm provides an optimal symbol code. 􏰉 Codes that encode larger blocks of input symbols might achieve better compression. 􏰉 Storage on noisy media: what if a bit of the output of the compressor is flipped? 􏰉 Decompression cannot carry through. 􏰉 Need error correction on top of compression. 程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com