
CS 112 : Data Structures
Sesh Venugopal
Huffman Coding:
Tree Building, Encoding Text, Decoding Back to Text

Building the Huffman Tree

Input Symbols with Probabilities
Symbol:       p     f     r     s     a     t     e
Probability:  0.05  0.05  0.1   0.15  0.2   0.2   0.25
Input symbols, with probabilities of occurrence in the text (pre-sorted IN INCREASING ORDER OF PROBABILITIES)

Symbols Queue and Trees Queue – Initialize
Initially, enqueue all symbols, taken in increasing order of probability, into the symbols queue. The trees queue is initially empty. (Symbols are actually wrapped inside tree nodes, and it is these nodes that are enqueued.)
Symbols queue: p f r s a t e
Trees queue: (empty)
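To make the queue contents concrete, here is a minimal Java sketch of this initialization step. The HuffNode class and all names are illustrative assumptions, not the course's actual code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative node type: a leaf wraps a symbol; an internal node stores
// only the combined probability of its two subtrees.
class HuffNode {
    char symbol;            // meaningful only for leaf nodes
    double prob;            // probability of this subtree
    HuffNode left, right;   // both null for a leaf

    HuffNode(char symbol, double prob) {        // leaf node
        this.symbol = symbol;
        this.prob = prob;
    }

    HuffNode(HuffNode left, HuffNode right) {   // internal node
        this.prob = left.prob + right.prob;
        this.left = left;
        this.right = right;
    }
}

public class HuffmanDemo {
    public static void main(String[] args) {
        char[]   syms  = {'p', 'f', 'r', 's', 'a', 't', 'e'};
        double[] probs = {0.05, 0.05, 0.1, 0.15, 0.2, 0.2, 0.25};

        // Symbols queue: leaf nodes enqueued in increasing probability order.
        Queue<HuffNode> symbols = new ArrayDeque<>();
        for (int i = 0; i < syms.length; i++) {
            symbols.add(new HuffNode(syms[i], probs[i]));
        }
        // Trees queue: initially empty.
        Queue<HuffNode> trees = new ArrayDeque<>();
    }
}
```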

Building Huffman Tree – Step 1
Dequeue the first two symbols from the symbols queue and build a subtree out of them.
Symbols queue: r s a t e
Trees queue: (0.1: p, f)   [root node probability = 0.05 (p) + 0.05 (f); subtrees are written here as (probability: left child, right child)]
Left and right subtrees are interchangeable, so (0.1: f, p) is also OK.

Building Huffman Tree – Step 2 (a)
(a) Compare the probability at the front of the symbols queue with that at the front of the trees queue.
Symbols queue: r s a t e   (front: r, probability 0.1)
Trees queue: (0.1: p, f)
Dequeue the lesser of the two. Here, since they are both 0.1, either can be dequeued (pick arbitrarily). Say we pick r from the symbols queue and dequeue it.

Building Huffman Tree – Step 2 (b)
(b) Compare the probability at the front of the symbols queue with that at the front of the trees queue.
Symbols queue: s a t e   (front: s, probability 0.15)
Trees queue: (0.1: p, f)
Dequeue the lesser of the two. Here the trees queue front has the smaller probability, so it is dequeued.

Building Huffman Tree – Step 2 (c)
(c) Build a subtree out of the two dequeued trees (symbols are single-node trees), and enqueue it into the trees queue.
Symbols queue: s a t e
Trees queue: (0.2: (0.1: p, f), r)   [root node probability = 0.1 (r) + 0.1 (p-f subtree)]
This is also OK: (0.2: r, (0.1: p, f))

Building Huffman Tree – Step 3 (a)
(a) Compare the probability at the front of the symbols queue with that at the front of the trees queue.
Symbols queue: s a t e   (front: s, probability 0.15)
Trees queue: (0.2: (0.1: p, f), r)
Dequeue the lesser of the two. Here s has the smaller probability, so it is dequeued.

Building Huffman Tree – Step 3 (b)
(b) Compare the probability at the front of the symbols queue with that at the front of the trees queue.
Symbols queue: a t e   (front: a, probability 0.2)
Trees queue: (0.2: (0.1: p, f), r)   (front probability 0.2)
Dequeue the lesser of the two. Here, since they are both 0.2, either can be dequeued (pick arbitrarily). Say we pick a from the symbols queue and dequeue it.

Building Huffman Tree – Step 3 (c)
(c) Build a subtree out of the two dequeued trees (symbols are single-node trees), and enqueue it into the trees queue.
Symbols queue: t e
Trees queue: (0.2: (0.1: p, f), r), (0.35: s, a)   [new root node probability = 0.15 (s) + 0.2 (a)]
This is also OK: (0.35: a, s)

Building Huffman Tree – Step 4 (a)
(a) Compare the probability at the front of the symbols queue with that at the front of the trees queue.
Symbols queue: t e   (front: t, probability 0.2)
Trees queue: (0.2: (0.1: p, f), r), (0.35: s, a)   (front probability 0.2)
Dequeue the lesser of the two. Here, since they are both 0.2, either can be dequeued (pick arbitrarily). Say we pick t from the symbols queue and dequeue it.

Building Huffman Tree – Step 4 (b)
(b) Compare the probability at the front of the symbols queue with that at the front of the trees queue.
Symbols queue: e   (front: e, probability 0.25)
Trees queue: (0.2: (0.1: p, f), r), (0.35: s, a)   (front probability 0.2)
Dequeue the lesser of the two. Here the trees queue front has the smaller probability, so it is dequeued.

Building Huffman Tree – Step 4 (c)
(c) Build a subtree out of the two dequeued trees (symbols are single-node trees), and enqueue it into the trees queue.
Symbols queue: e
Trees queue: (0.35: s, a), (0.4: (0.2: (0.1: p, f), r), t)   [new root node probability = 0.2 (r-p-f subtree) + 0.2 (t)]
This is also OK: (0.4: t, (0.2: (0.1: p, f), r))

Building Huffman Tree – Step 5 (a)
(a) Compare the probability at the front of the symbols queue with that at the front of the trees queue.
Symbols queue: e   (front: e, probability 0.25)
Trees queue: (0.35: s, a), (0.4: (0.2: (0.1: p, f), r), t)   (front probability 0.35)
Dequeue the lesser of the two. Here e has the smaller probability, so it is dequeued.

Building Huffman Tree – Step 5 (b)
(b) Compare the probability at the front of the symbols queue with that at the front of the trees queue. But the symbols queue is empty, so the only option is to dequeue from the trees queue.
Symbols queue: (empty)
Trees queue: (0.35: s, a), (0.4: (0.2: (0.1: p, f), r), t)
The front tree, (0.35: s, a), is dequeued.

Building Huffman Tree – Step 5 (c)
(c) Build a subtree out of the two dequeued trees (symbols are single-node trees), and enqueue it into the trees queue.
Symbols queue: (empty)
Trees queue: (0.4: (0.2: (0.1: p, f), r), t), (0.6: (0.35: s, a), e)   [new root node probability = 0.35 (s-a subtree) + 0.25 (e)]
This is also OK: (0.6: e, (0.35: s, a))

Building Huffman Tree – Step 6
Since the symbols queue is empty, dequeue pairs of trees from the trees queue, build a subtree, and enqueue it, until the trees queue holds a single item.
Trees queue: (1.0: (0.4: (0.2: (0.1: p, f), r), t), (0.6: (0.35: s, a), e))
The root node probability of the final tree MUST be 1.0.
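Putting Steps 1 through 6 together: a sketch of the full build loop over the two queues, reusing the illustrative HuffNode class and queues from the initialization sketch (again an assumption, not the course's actual code).

```java
// Repeatedly take the two smallest-probability trees across both queues,
// combine them under a new root, and enqueue the result on the trees
// queue. Because both queues stay in increasing probability order, each
// front comparison finds the current minimum in O(1).
static HuffNode buildHuffmanTree(Queue<HuffNode> symbols, Queue<HuffNode> trees) {
    while (symbols.size() + trees.size() > 1) {
        HuffNode first  = dequeueSmaller(symbols, trees);
        HuffNode second = dequeueSmaller(symbols, trees);
        trees.add(new HuffNode(first, second)); // root prob = sum of the two
    }
    // The single remaining tree: its root probability must be 1.0.
    return trees.isEmpty() ? symbols.remove() : trees.remove();
}

// Dequeue from whichever queue has the smaller front probability. On a
// tie this arbitrarily prefers the symbols queue, which is one of the
// tie-breaking choices illustrated in the steps above.
static HuffNode dequeueSmaller(Queue<HuffNode> symbols, Queue<HuffNode> trees) {
    if (trees.isEmpty())   return symbols.remove();
    if (symbols.isEmpty()) return trees.remove();
    return (symbols.peek().prob <= trees.peek().prob)
            ? symbols.remove() : trees.remove();
}
```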

Complete Huffman Tree
1.0
├── 0.4
│   ├── 0.2
│   │   ├── 0.1
│   │   │   ├── p (0.05)
│   │   │   └── f (0.05)
│   │   └── r (0.1)
│   └── t (0.2)
└── 0.6
    ├── 0.35
    │   ├── s (0.15)
    │   └── a (0.2)
    └── e (0.25)

Assigning Huffman Codes

Assigning Bits to Branches
Each left branch is labeled with a 0 bit, and each right branch with a 1 bit.
The left=0/right=1 choice is arbitrary. You could equally label left branches with 1's and right branches with 0's; the absolute code does not matter.

1.0
├─0─ 0.4
│    ├─0─ 0.2
│    │    ├─0─ 0.1
│    │    │    ├─0─ p
│    │    │    └─1─ f
│    │    └─1─ r
│    └─1─ t
└─1─ 0.6
     ├─0─ 0.35
     │    ├─0─ s
     │    └─1─ a
     └─1─ e

Gathering Codes for Symbols
The code for a symbol is the sequence of bits along the path from the root to that symbol:

p: 0000   f: 0001   r: 001   t: 01   s: 100   a: 101   e: 11
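The root-to-leaf path rule translates directly into a recursive traversal. Below is a sketch that collects each symbol's code into a map; it assumes the illustrative HuffNode type from earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Walk the tree, appending '0' on each left branch and '1' on each right
// branch; when a leaf is reached, the accumulated path is that symbol's code.
static void gatherCodes(HuffNode node, String path, Map<Character, String> codes) {
    if (node.left == null && node.right == null) { // leaf: record the code
        codes.put(node.symbol, path);
        return;
    }
    gatherCodes(node.left,  path + "0", codes);
    gatherCodes(node.right, path + "1", codes);
}
```

Calling gatherCodes(root, "", codes) on the tree above fills the map with p→0000, f→0001, r→001, t→01, s→100, a→101, e→11.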

Codes for Symbols
Symbol   Probability   Code
p        0.05          0000
f        0.05          0001
r        0.1           001
s        0.15          100
a        0.2           101
t        0.2           01
e        0.25          11
Average code length
= 4*0.05 + 4*0.05 + 3*0.1 + 3*0.15 + 3*0.2 + 2*0.2 + 2*0.25 = 2.65
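The same computation as a short loop, for checking by program rather than by hand. A sketch, assuming the codes map from the gatherCodes sketch and a probs map of symbol probabilities (both illustrative):

```java
// Average code length = sum over all symbols of (code length * probability).
static double averageCodeLength(Map<Character, String> codes,
                                Map<Character, Double> probs) {
    double avg = 0.0;
    for (Map.Entry<Character, String> entry : codes.entrySet()) {
        avg += entry.getValue().length() * probs.get(entry.getKey());
    }
    return avg; // 2.65 for the table above
}
```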

An Alternative Huffman Tree (subtrees switched)
Symbol   Probability   Code
p        0.05          0100
f        0.05          0101
r        0.1           011
s        0.15          110
a        0.2           111
t        0.2           00
e        0.25          10

1.0
├─0─ 0.4
│    ├─0─ t
│    └─1─ 0.2
│         ├─0─ 0.1
│         │    ├─0─ p
│         │    └─1─ f
│         └─1─ r
└─1─ 0.6
     ├─0─ e
     └─1─ 0.35
          ├─0─ s
          └─1─ a
Average code length
= 4*0.05 + 4*0.05 + 3*0.1 + 3*0.15 + 3*0.2 + 2*0.2 + 2*0.25 = 2.65

An Alternative Huffman Tree (probability ties broken differently in Step 3(b))
Symbol   Probability   Code
p        0.05          00000
f        0.05          00001
r        0.1           0001
s        0.15          001
a        0.2           11
t        0.2           10
e        0.25          01

1.0
├─0─ 0.6
│    ├─0─ 0.35
│    │    ├─0─ 0.2
│    │    │    ├─0─ 0.1
│    │    │    │    ├─0─ p
│    │    │    │    └─1─ f
│    │    │    └─1─ r
│    │    └─1─ s
│    └─1─ e
└─1─ 0.4
     ├─0─ t
     └─1─ a
Average code length
= 5*0.05 + 5*0.05 + 4*0.1 + 3*0.15 + 2*0.2 + 2*0.2 + 2*0.25 = 2.65

Average Code Length is what matters
Various ways in which alternative Huffman trees can be obtained:
– Ties broken differently when dequeueing equal-probability items from the symbols and trees queues
– Left and right subtrees switched when building a bigger tree
– Left and right branches switched when labeling with 0 or 1
Although the actual codes might differ, ALL Huffman trees for a given set of symbols+probabilities will have the SAME average code length!!!

Prefix Property
No two symbols can have codes such that one is a prefix of the other: for instance, 0001 and 000 cannot both be codes, since 000 is a prefix of 0001.
This is true since all symbols are at the leaf nodes of the Huffman tree
[Figure: the bit-labeled Huffman tree from the previous slides; every symbol sits at a leaf.]
It’s critical that a code not be a prefix of another. WHY?

Encoding Text

Encoding Example

Input text: teaser
Encoded: 011110110011001

Scan the text from start to end. For each character, look up its code in the table and write out the code. (The Huffman tree itself is not needed.)

The code must be written to the file in binary format, so that '11' is written out as 2 bits, not 2 bytes (which is what would happen if '11' were written as a string)!

Code table: p 0000, f 0001, r 001, t 01, s 100, a 101, e 11
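A sketch of this encoding pass, using the codes map gathered earlier. The result here is a String of '0'/'1' characters for readability; as the slide warns, a real encoder must pack these into actual bits before writing to a file.

```java
// Look up each character's code in the table and concatenate the codes.
static String encode(String text, Map<Character, String> codes) {
    StringBuilder bits = new StringBuilder();
    for (char c : text.toCharArray()) {
        bits.append(codes.get(c)); // assumes every character has a code
    }
    return bits.toString();
}
// encode("teaser", codes) returns "011110110011001" (15 bits)
```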

Encoding Example (continued)
Input text: teaser
Encoded: 011110110011001
What is the compression compared to ASCII?
ASCII would use 8 bits per character, so the ASCII-coded result would be 8 bits * 6 characters = 48 bits long.
The Huffman coded result above is 15 bits long.
So the ratio of Huffman-coded to ASCII-coded length is 15/48 = 0.3125, which means the compression is about 70%!!
BUT WAIT! This is not apples-to-apples, because ASCII needs 8 bits per character only when all possible characters are considered.
However, the Huffman tree was built to code just 7 symbols (p, f, r, s, a, t, e), so to be fair we need to ask how many bits a fixed-length code would need for 7 possible values (one per symbol). The answer is 3, since 3 bits can store up to 2^3 = 8 values.
In that case, the ratio is 15/18 = 0.83, for roughly 17% compression.
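The compression arithmetic above, spelled out in a few lines of plain Java (purely illustrative):

```java
public class CompressionCheck {
    public static void main(String[] args) {
        int huffmanBits = 15;      // "teaser" under the Huffman code
        int asciiBits   = 8 * 6;   // 8 bits/char * 6 chars = 48 bits
        int fixed3Bits  = 3 * 6;   // 3 bits/symbol * 6 chars = 18 bits

        System.out.println(1.0 - (double) huffmanBits / asciiBits);  // ~0.69
        System.out.println(1.0 - (double) huffmanBits / fixed3Bits); // ~0.17
    }
}
```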

Decoding Back to Original Text

Decoding Example
Encoded: 011110110011001
For decoding, we need the Huffman tree.
Set a pointer to the root of the Huffman tree, then scan the encoded bit string from left to right:
– If you see a 0 in the encoded string, go to the left child in the tree; otherwise go to the right child.
– If you hit the bottom (a symbol leaf), write that symbol out and reset back to the root of the tree.
– Go to the next bit in the encoded bit string.
Decoding the bits above this way recovers the original text, teaser, as the sketch below illustrates.
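These steps map directly onto a pointer walk over the tree. A sketch, using the same illustrative HuffNode type:

```java
// Walk the tree bit by bit: 0 = go left, 1 = go right; on reaching a leaf,
// emit its symbol and reset the pointer to the root.
static String decode(String bits, HuffNode root) {
    StringBuilder out = new StringBuilder();
    HuffNode cur = root;
    for (char bit : bits.toCharArray()) {
        cur = (bit == '0') ? cur.left : cur.right;
        if (cur.left == null && cur.right == null) { // leaf: one full symbol
            out.append(cur.symbol);
            cur = root;                              // reset for next symbol
        }
    }
    return out.toString();
}
// decode("011110110011001", root) returns "teaser"
```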

Encoding/Decoding Example 2
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do. Once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice, “without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that, nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be too late!” But when the Rabbit actually took a watch out of its waistcoat-pocket and looked at it and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and, burning with curiosity, she ran across the field after it and was just in time to see it pop down a large rabbit-hole, under the hedge. In another moment, down went Alice after it!


The file has 6942 characters.
The ASCII-encoded file would be 6942 bytes long (each character is an 8-bit, 1-byte code).

Characters, Probabilities, Huffman Codes
The input text file was analyzed, and probabilities of occurrence were computed for all characters that appeared in the file. A Huffman tree was built, and codes computed:

B|1.440507E-4|1001001010100
C|1.440507E-4|1001001010101
F|1.440507E-4|1001001010110
G|1.440507E-4|1001001010111
K|1.440507E-4|1001001011000
L|1.440507E-4|1001001011001
q|2.881014E-4|100100101101
:|4.3215213E-4|01010010110
H|4.3215213E-4|01010010111
…
g|0.01627773|100111
f|0.016565831|101000
(newline)|0.017862288|101001
u|0.018870642|110010
w|0.021031404|110110
d|0.03615673|10101
l|0.037453182|11000
r|0.039325844|11010
s|0.044655718|0000
i|0.047104582|0001
n|0.04897724|0100
h|0.05157015|0110
a|0.056179777|0111
o|0.061941803|1000
t|0.07346586|1011
e|0.097522326|001
(blank space)|0.17660616|111
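The analysis step amounts to counting character occurrences and dividing by the file length. A sketch (file handling and names are illustrative assumptions):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Compute each character's probability of occurrence in a text file.
static Map<Character, Double> charProbabilities(String fileName) throws IOException {
    String text = new String(Files.readAllBytes(Paths.get(fileName)));
    Map<Character, Integer> counts = new HashMap<>();
    for (char c : text.toCharArray()) {
        counts.merge(c, 1, Integer::sum);            // count occurrences of c
    }
    Map<Character, Double> probs = new HashMap<>();
    for (Map.Entry<Character, Integer> e : counts.entrySet()) {
        probs.put(e.getKey(), (double) e.getValue() / text.length());
    }
    return probs;
}
```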

Encoding File/Decoding back to original text
10010011110000001100101001111110110011100001110101110011001110001010001000001010010011111110111000……
Alice was beginning to get very tired of sitting by her sister on the…
The encoded (compressed) file is a string of bits; the total number of bits is 30887.
The ASCII-encoded file (the uncompressed original) is 6942 bytes long = 6942*8 = 55536 bits.
Ratio of compressed to uncompressed = 30887/55536 = 0.56. Compression factor = 1 – 0.56 = 0.44 = 44%!
NOTE: The encoded file must be written in binary form – you don’t want each 1/0 to be written as a 1-byte character!!