Coding and Compression Lecture 9
Faraz Janan Lecturer (Notes adapted from Dr. John Murray, Senior Lecturer)
Data Compression
• Why do we need data compression?
  – To save space when storing it.
  – To save time when transmitting it.
• Compression reduces the size of data frames to be transmitted over a network link, which reduces the time required to transmit each frame across the network.
Most files have lots of redundancy.
Moore’s law: # transistors on a chip doubles every 18-24 months.
Parkinson’s law: data expands to fill space available.
• Text, images, sound, video, …
  – Images: GIF, JPEG
  – Sound: MP3
  – Video: MPEG, DivX™, HDTV
• Google, Facebook, Cisco, etc.
Algorithms 2nd edition, Chapter 22
2
http://www.cs.princeton.edu/introalgsds/65compression
Data Compression
• Why data compression?
• Storing or transmitting multimedia data requires large space or bandwidth
• The size of one hour of 44,000 samples/sec, 16-bit stereo (two-channel) audio is 3600×44000×2×2 = 633.6 MB, which can be recorded on one CD (650 MB). MP3 compression can reduce this by a factor of 10
• The size of a 500×500 colour image is 750 KB without compression (JPEG can reduce this by a factor of 10 to 20)
• The size of a one-minute real-time, full-size, colour video clip is 60×30×640×480×3 = 1.659 GB. A two-hour movie requires about 200 GB. MPEG-2 compression can bring this down to 4.7 GB (DVD)
Ref: Spatial and Temporal Data Mining, Data Compression by V. Megalooikonomou
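To sanity-check these figures, here is a minimal Python sketch (the helper names are my own) that recomputes the uncompressed sizes quoted above:

# Sketch: recompute the uncompressed sizes quoted on this slide.

def audio_bytes(seconds, sample_rate, bytes_per_sample, channels):
    """Uncompressed PCM audio size in bytes."""
    return seconds * sample_rate * bytes_per_sample * channels

def image_bytes(width, height, bytes_per_pixel):
    """Uncompressed raster image size in bytes."""
    return width * height * bytes_per_pixel

def video_bytes(seconds, fps, width, height, bytes_per_pixel):
    """Uncompressed video size in bytes (no audio)."""
    return seconds * fps * image_bytes(width, height, bytes_per_pixel)

# One hour of 44,000 samples/sec, 16-bit (2-byte) stereo audio: 3600 x 44000 x 2 x 2
print(audio_bytes(3600, 44_000, 2, 2) / 1e6, "MB")   # 633.6 MB

# A 500 x 500 colour image at 3 bytes/pixel
print(image_bytes(500, 500, 3) / 1e3, "KB")          # 750.0 KB

# One minute of 30 fps, 640 x 480, 3 bytes/pixel video
print(video_bytes(60, 30, 640, 480, 3) / 1e9, "GB")  # ~1.659 GB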
3
Data = ASCII, UTF
• What is ASCII?
  – American Standard Code for Information Interchange
  – A standard
  – A way to represent symbols – in this case, the alphabet
  – 128 characters (33 control + 95 printable)
• Eventually going to be replaced by UTF-8
  – Since 2007 it has been taking over, and more applications now use it than ASCII
  – Compatible with ASCII
  – Supports over 120,000 characters
http://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters
4
• A = 01000001
• S = 01010011
• C = 01000011
• I = 01001001
• I = 01001001
ASCII
• How many bytes are needed to send the word:
  – Hello
  – Goodbye
  – The quick brown fox jumps over the lazy dog
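As a hint for the exercise: in plain ASCII every character costs one byte, so the answer is just the length of each string. A minimal Python sketch:

# In plain ASCII each character occupies one byte.
for word in ["Hello", "Goodbye", "The quick brown fox jumps over the lazy dog"]:
    encoded = word.encode("ascii")            # raises an error if a character is not ASCII
    print(f"{word!r}: {len(encoded)} bytes")  # 5, 7, 43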
5
Coding
• Let’s say we have a 100,000 character data file to store.
– How many bytes is this?
• Let's assume the file only contains 6 symbols with the following frequencies:

  Symbol              a   b   c   d   e   f
  Frequency (×1000)  45  13  12  16   9   5
• Can we make this more efficient than 100kB?
6
Coding
• A binary code encodes each character as a binary string or codeword.
• We would like to find a binary code that encodes the file using as few bits as possible
– i.e., compresses it as much as possible.
7
8
Fixed vs Variable Length Coding
• In a fixed-length code each symbol code has the same length.
• In a variable-length code symbol representations can have different lengths.
  Symbol                  a    b    c    d     e     f
  Frequency (×1000)      45   13   12   16     9     5
  Fixed-length coding   000  001  010  011   100   101
  Variable-length code    0  101  100  111  1101  1100

• The fixed-length code requires 300,000 bits (37.5 kB)
• The variable-length code uses:
  – (45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1000 = 224,000 bits (28 kB)
• Saving a LOT of space!
• Can we do better?!
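A small Python sketch (frequencies and codes copied from the table above) that reproduces both totals:

# Frequencies are in thousands of characters.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
fixed    = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100", "f": "101"}
variable = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def total_bits(code):
    # Sum over symbols of frequency x codeword length
    return sum(freq[s] * len(code[s]) for s in freq) * 1000

print(total_bits(fixed))     # 300000 bits = 37.5 kB
print(total_bits(variable))  # 224000 bits = 28 kB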
9
Data Compression
• All data compression methods have 2 components
  – Encoding: at the source
    • In some cases we can afford time and resources, and it can be very expensive (e.g. rendering video clips, uploading to the web, etc.)
    • In other cases we can't afford high-quality compression, e.g. video conferencing
  – Decoding: at the destination, should be fast and cheap
• They are not necessarily symmetric. For example, a movie is encoded once, but may be decoded hundreds of thousands of times when viewed.
10
Data Compression
• Compression Techniques
– Entropy Encoding
• Depends on the information content
• Lossless, fully reversible, bit encoding irrespective of what a bit means
– Source Encoding
• Dropping non-essential detail from the data source
• Analyses the properties of the data to achieve stronger compression
• Lossy, asymmetric model, irreversible recovery
11
Entropy
• Introduced by Claude Shannon
• Entropy quantifies information (average uncertainty)
• The expected value (average) of the information contained in each message.
• Why is it important?
Text 1:
  Symbol                  A    B    C    D
  Occurrence probability  25%  25%  25%  25%

Text 2:
  Symbol                  A    B      C      D
  Occurrence probability  50%  12.5%  12.5%  25%

– If the entropy (the uncertainty of what is being communicated) decreases, the ability to compress increases
– If we want to go beyond entropy, we must lose some information

Which text can be compressed more? Let's answer at the end of the lecture.
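A minimal Python sketch computing the entropy H = −Σ p·log2(p) of the two texts (it anticipates the answer given at the end of the lecture):

from math import log2

def entropy(probs):
    """Average information in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in probs if p > 0)

text1 = [0.25, 0.25, 0.25, 0.25]    # A, B, C, D
text2 = [0.50, 0.125, 0.125, 0.25]

print(entropy(text1))  # 2.0 bits/symbol
print(entropy(text2))  # 1.75 bits/symbol -> more compressible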
12
Google some of these methods if you like
13
Entropy Encoding
• Run Length Encoding
– Encode repeated symbols
– Example (the data contains the literal digits 7, 8 and 4444):
  7WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWB8WWWWWWWWWWWWWW4444
– Encoding: 7*12W*1B*12W*3B*24W*1B*8*14W*44
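A minimal run-length encoder sketch in Python; the '*'-separated count/symbol output roughly mimics the notation above (this is just one of many possible RLE formats):

from itertools import groupby

def rle_encode(data):
    """Collapse each run of identical symbols into '<count><symbol>' tokens."""
    return "*".join(f"{len(list(run))}{symbol}" for symbol, run in groupby(data))

print(rle_encode("WWWWWWWWWWWWBWWWWWWWWWWWWBBB"))  # 12W*1B*12W*3B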
14
Entropy Encoding
• Statistical Encoding
– Short Codes to represent frequent symbols
  Symbol                  a    b    c    d     e     f
  Frequency (×1000)      45   13   12   16     9     5
  Fixed-length coding   000  001  010  011   100   101
  Variable-length code    0  101  100  111  1101  1100

– Morse Code, Huffman Coding, Shannon-Fano, Ziv-Lempel, etc.
15
Entropy Coding
• Colour Look-Up Table (CLUT)
– A phone display uses RGB images which contain 2^24 colours, using 3 bytes/pixel.
– In practice it uses fewer, especially in cartoons, drawings, etc.
– If a picture uses only 256 colours then the data can be compressed significantly (each pixel becomes a 1-byte index into a 256-entry colour table)
– This is an example where encoding takes clearly longer than decoding
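A rough sketch of the CLUT idea using NumPy, assuming the image really does contain at most 256 distinct colours; the function name is my own:

import numpy as np

def to_clut(image):
    """image: (H, W, 3) uint8 RGB array with <= 256 distinct colours.
    Returns (palette, index_img): a (K, 3) colour look-up table and a (H, W) uint8 index image."""
    pixels = image.reshape(-1, 3)
    palette, indices = np.unique(pixels, axis=0, return_inverse=True)
    assert len(palette) <= 256, "too many colours for a 1-byte index"
    index_img = indices.reshape(image.shape[:2]).astype(np.uint8)
    return palette, index_img

# 3 bytes/pixel become 1 byte/pixel plus a small palette (~3x smaller for large images).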
16
Source Coding
• Data is represented by changes in data rather than data levels or frequency
• Hence more reliable detection of transition rather than level (data content can be lost, structure remains intact)
• Examples:
  – Differential coding, e.g. differential PCM (DPCM)
  – Transformation coding, e.g. FFT for music systems, images, signal estimation
  – Vector Quantization
  – Popular methods: JPEG, MPEG, MP3
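A small sketch of the differential idea: store the first value plus successive differences (the differences are usually small and therefore cheap to encode):

def delta_encode(samples):
    """Store the first sample, then only the change from each sample to the next."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

signal = [100, 102, 103, 103, 101, 99]
print(delta_encode(signal))                # [100, 2, 1, 0, -2, -2]
print(delta_decode(delta_encode(signal)))  # the original signal back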
17
JPEG (Joint Photographic Experts Group)
• JPEG is a lossy compression technique for colour images. Although it can reduce file sizes to about 5% of their original size, some detail is lost in the compression.
• It is a combination of methods rather than a single technique
• Basic steps: (1) preprocess, (2) transformation, (3) quantization, (4) encoding.
• JPEG is roughly symmetric: in most cases decoding takes about as much time as encoding
• It is normally lossy (using the DCT), but can be lossless (e.g. JPEG 2000, JPEG-LS)
• Popular because image transformations are possible without regenerating the original image
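A very rough sketch of steps (2) and (3) on a single 8×8 block using SciPy's DCT; the uniform quantisation step size used here is a placeholder of my own, not the standard JPEG quantisation table:

import numpy as np
from scipy.fft import dctn, idctn

def compress_block(block, q=16):
    """Transform an 8x8 block with a 2-D DCT, then quantise the coefficients."""
    coeffs = dctn(block.astype(float) - 128, norm="ortho")  # level-shift, then DCT
    return np.round(coeffs / q).astype(int)                 # lossy step: many zeros appear

def decompress_block(quantised, q=16):
    return idctn(quantised * q, norm="ortho") + 128

block = np.random.randint(0, 256, (8, 8))
restored = decompress_block(compress_block(block))
# 'restored' is close to, but not exactly, the original block (JPEG is lossy).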
18
A bitmap image automatically compressed by Facebook after being uploaded to a chat window
19
MPEG (Motion Picture Experts Group)
• In essence, it is JPEG applied to each movie frame, combined with audio compression
• Can compress both audio and video
• Uncompressed video requires 472 Mbps; MPEG can take it down to 4–6 Mbps using MPEG-2 (for HDTV) and 64 kbps using MPEG-4 (for video conferencing)
• Gives more compression options and a choice of encoders: a piano concerto would use 128 kbps, whereas a rock 'n' roll concert does well with 86 kbps (because of SNR)
20
Compression Examples
21
Encoding
• Given a code (based on some alphabet Θ) and a message, it is easy to encode the message
• Example: Θ = {a, b, c, d}
• Code 1 is:
  – C1 = {a = 00, b = 01, c = 10, d = 11}
  – Then bad is encoded into 010011
• Code 2 is:
  – C2 = {a = 0, b = 110, c = 10, d = 111}
  – Then bad is encoded into 1100111
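Encoding is just a per-symbol dictionary look-up. A minimal Python sketch reproducing both examples:

C1 = {"a": "00", "b": "01", "c": "10", "d": "11"}
C2 = {"a": "0", "b": "110", "c": "10", "d": "111"}

def encode(message, code):
    # Replace each symbol by its codeword and concatenate.
    return "".join(code[symbol] for symbol in message)

print(encode("bad", C1))  # 010011
print(encode("bad", C2))  # 1100111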
22
Decoding
• Given an encoded message, a message is uniquely decodable if it can only be decoded in one way
– C1 = {a = 00, b = 01, c = 10, d = 11}
– C2 = {a = 0, b = 110, c = 10, d = 111}
– C3 = {a = 1, b = 110, c = 10, d = 111}
• Relative to C1, 010011 is what?
• Relative to C2, 1100111 is what?
• Relative to C3, 1101111 is what?
23
Decoding
• In fact, using codes C1 or C2 any possible message is uniquely decipherable.
– This is not the case with code C3
• The unique decipherability property is needed in order for a code to be useful
• It can be achieved with prefix-free codes
24
Prefix-Codes
• Fixed-length codes are always uniquely decipherable, why?
• However, these do not always give the best compression so we use variable length codes
• Prefix Code: A code is called a prefix (free) code if no codeword is a prefix of another one.
• Example: {a = 0, b = 110, c = 10, d = 111} is a prefix code
25
Prefix-Codes
• Every message encoded by a prefix-free code is uniquely decipherable, since no codeword is a prefix of any other.
• We can always find the first codeword in a message, remove it, and continue decoding.
• Example (using {a = 0, b = 110, c = 10, d = 111}):
  01101100 = 0|110|110|0 = abba
• We are therefore interested in finding good (best compression) prefix-free codes.
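A sketch of exactly this rule, assuming a prefix-free code: read bits until the buffer matches a codeword, emit the symbol, and start again:

def decode(bits, code):
    """Greedy prefix-code decoding: keep removing the codeword that starts the string."""
    reverse = {codeword: symbol for symbol, codeword in code.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:          # a full codeword has been read
            out.append(reverse[buffer])
            buffer = ""
    return "".join(out)

C2 = {"a": "0", "b": "110", "c": "10", "d": "111"}
print(decode("01101100", C2))  # abba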
27
Optimum Source Coding Problem
• Problem: Given an alphabet A = {a_1, …, a_n} with frequency distribution f(a_i), find a binary prefix code C for A that minimises the number of bits

  B(C) = \sum_{i=1}^{n} f(a_i) \, L(c(a_i))

  needed to encode a message of \sum_{i=1}^{n} f(a_i) characters, where c(a_i) is the codeword for encoding a_i and L(c(a_i)) is the length of the codeword c(a_i)
28
Shannon-Fano coding
• Lossless prefix code based on a set of symbols and their probabilities (estimated or measured).
• Method:
  – Step 1: Count the occurrences (or probability of occurrence) of each symbol
  – Step 2: Sort the symbols and divide them into two parts with nearly equal total counts
  – Step 3: Assign a 0 to one part and a 1 to the other
  – Step 4: Repeat recursively within each part until every symbol has a code
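A compact Python sketch of the method (the choice of where to cut when the halves cannot be made exactly equal is my own tie-breaking rule and may differ from the original slides):

def shannon_fano(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> codeword string."""
    symbols = sorted(freqs, key=freqs.get, reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total, running, best, cut = sum(freqs[s] for s in group), 0, float("inf"), 1
        for i, s in enumerate(group[:-1], start=1):
            running += freqs[s]
            if abs(total - 2 * running) < best:     # most nearly equal halves
                best, cut = abs(total - 2 * running), i
        for s in group[:cut]:
            codes[s] += "0"
        for s in group[cut:]:
            codes[s] += "1"
        split(group[:cut])
        split(group[cut:])

    split(symbols)
    return codes

print(shannon_fano({"a": 20, "b": 15, "c": 5, "d": 15, "e": 45}))
# {'e': '0', 'a': '10', 'b': '110', 'd': '1110', 'c': '1111'}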
29
Shannon-Fano Coding
Example [1]
1. https://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding
2. http://stackoverflow.com/questions/26635280/shannon-fano-algorithm
30
Shannon-Fano Coding
1. https://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding
2. http://stackoverflow.com/questions/26635280/shannon-fano-algorithm
Example [2]
31
Huffman Code
• Huffman developed a 'greedy algorithm' for solving this problem and producing a minimum-cost (therefore optimum) lossless prefix code.
• Better than Shannon-Fano, as it guarantees a code with the minimum possible total length
• By building the tree from the bottom up instead of the top down, Huffman avoided the major flaw of Shannon-Fano coding.
32
Huffman Code Steps
• Step 1: Pick two symbols x, y from alphabet A with the smallest frequencies and create a subtree that has these two symbols as leaves
– Label the root of this subtree as z
• Step 2: Set frequency f(z) = f(x) + f(y)
• Remove x and y from the alphabet and add z, creating a new alphabet
• Repeat this procedure, called merge, with new alphabet until only one symbol is left
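A standard heap-based sketch of these steps; ties between equal frequencies may be broken differently from the slides, so individual codewords can differ even though the total cost is optimal either way:

import heapq
from itertools import count

def huffman(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> codeword string."""
    ticket = count()  # tie-breaker so tuples never compare their dict payloads
    heap = [(f, next(ticket), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(ticket), merged))
    return heap[0][2]

freqs = {"a": 20, "b": 15, "c": 5, "d": 15, "e": 45}
codes = huffman(freqs)
print(codes)
print(sum(freqs[s] * len(c) for s, c in codes.items()))  # 210 (thousand bits): e gets 1 bit, the rest 3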
33
Example of Huffman Coding
• Let A = {a /20, b/15, c/5, d/15, e/45} be the alphabet and frequency distribution.
[Initial forest of leaves: a/20, b/15, c/5, d/15, e/45]
34
Example of Huffman Coding
• Let A = {a/20, b/15, c/5, d/15, e/45} be the alphabet and frequency distribution.
[c/5 and d/15 are merged under a new node n1/20, with edge labels 0 and 1]
• Alphabet is now A1 = {a/20, b/15, n1/20, e/45}
35
• Alphabet is now A1 = {a/20, b/15, n1/20, e/45}
• Algorithm merges a and b
[Forest: a/20, b/15, n1/20 (over c/5 and d/15), e/45]
36
• Alphabet is now A1 = {a/20, b/15, n1/20, e/45}
• Algorithm merges a and b
[Forest: n2/35 (over a/20 and b/15), n1/20 (over c/5 and d/15), e/45]
• Alphabet is now A2 = {n2/35, n1/20, e/45}
37
• Alphabet is now A2 = {n2/35, n1/20, e/45}
• Algorithm merges n1 and n2
[Forest: n2/35 (over a/20 and b/15), n1/20 (over c/5 and d/15), e/45]
38
• Alphabet is now A2 = {n2/35, n1/20, e/45}
• Algorithm merges n1 and n2
[Forest: n3/55 (over n2/35 and n1/20), e/45]
• Alphabet is now A3 = {n3/55, e/45}
39
• Alphabet is now A3 = {n3/55, e/45}
• Algorithm merges e and n3
[Forest: n3/55 (over n2/35 and n1/20), e/45]
40
• Alphabet is now A3 = {n3/55, e/45}
• Algorithm merges e and n3
[Final Huffman tree: root n4/100 with children n3/55 (edge 0) and e/45 (edge 1); n3/55 has children n2/35 (edge 0) and n1/20 (edge 1); n2/35 has children a/20 (edge 0) and b/15 (edge 1); n1/20 has children c/5 (edge 0) and d/15 (edge 1)]
41
Huffman Code
• The Huffman code is obtained from the Huffman Tree.
– Starting from the root node, work down to each leaf
• Huffman code is:
– a = 000, b = 001, c = 010, d = 011, e = 1
• This is the optimum (minimum-cost) prefix code for this distribution.
42
Compare to Shannon-Fano Code
[Shannon-Fano tree: abcde/100 splits into e/45 (0) and abcd/55 (1); abcd/55 splits into a/20 (0) and bcd/35 (1); bcd/35 splits into b/15 (0) and cd/20 (1); cd/20 splits into d/15 (0) and c/5 (1)]
{a = 10, b = 110, c = 1111, d = 1110, e = 0}
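As a quick check, the total cost B(C) of this Shannon-Fano code can be compared with the Huffman code from the previous slide; for this particular distribution the two happen to tie, although Huffman is never worse in general:

freqs = {"a": 20, "b": 15, "c": 5, "d": 15, "e": 45}
huffman_code      = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "1"}
shannon_fano_code = {"a": "10", "b": "110", "c": "1111", "d": "1110", "e": "0"}

cost = lambda code: sum(freqs[s] * len(code[s]) for s in freqs)
print(cost(huffman_code), cost(shannon_fano_code))  # 210 210 (thousand bits each)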
43
Words to try at Home
• SQUIRRELLED (an inquisitive move)
• SUBDERMATOGLYPHIC (fingerprints mask matrix)
• FLOCCINAUCINIHILIPILIFICATION (estimating something as having little or no value)
• PNEUMONOULTRAMICROSCOPICSILICOVOLCANOCONIOSIS (name of a lung disease)
• LOPADOTEMACHOSELACHOGALEOKRANIOLEIPSANODRIMHYPOTRIMMATOSILPHIOPARAOMELITOKATAKECHYMENOKICHLEPIKOSSYPHOPHATTOPERISTERALEKTRYONOPTEKEPHALLIOKIGKLOPELEIOLAGOIOSIRAIOBAPHETRAGANOPTERYGON
(a sweet dish made from rotted dogfish head in Assemblywomen)
Use both Shannon-Fano and Huffman coding methods, and see how much you could reduce each word
44
Back to entropy question (Slide 12)
• Which text could be compressed more?
• Answer: Text 2 has less information, hence less entropy and could be compressed more.
• How?
  – Count the number of bits (yes/no questions) needed per symbol, then compute the average of each symbol's bit count weighted by its probability
  – (Hint: the answers are 2 and 1.75 bits per symbol)
[Text 1 tree: abcd/100 splits into ab/50 and cd/50, then into a/25, b/25, c/25, d/25 – every symbol needs 2 bits]
[Text 2 tree: abcd/100 splits into a/50 and bcd/50; bcd/50 splits into d/25 and bc/25; bc/25 splits into b/12.5 and c/12.5 – so a needs 1 bit, d needs 2 bits, b and c need 3 bits]
45
Questions
46