Design and Implementation of Next Generation Video Coding Systems (H.265/HEVC Tutorial)
Vivienne Sze (sze@mit.edu)
Madhukar Budagavi (m.budagavi@samsung.com)
ISCAS Tutorial 2014

Instructors
• Vivienne Sze (Assistant Professor at MIT)
  – Involved with video implementation research and standards for 7+ years
    • Contributed over 70 technical documents to HEVC.
    • Within JCT-VC Committee, Primary Coordinator of the core experiments on coefficient scanning and coding; chairman of ad hoc groups on topics related to entropy coding and parallel processing.
    • Published over 25 journal and conference papers.
• Madhukar Budagavi (Research Director at Samsung Research America)
  – Involved with video standards and product development for 15+ years
    • Contributed over 100 technical documents to HEVC.
    • Within JCT-VC Committee, chaired and co-chaired sub-group activities on spatial transforms, quantization, entropy coding, in-loop filtering, intra prediction, screen content coding and scalable HEVC (SHVC).
    • Published over 40 journal and conference papers and book chapters.

Outline of Tutorial
• Part I: Overview of current video coding technology and systems
• Part II: High Efficiency Video Coding (HEVC)
• Part III: Video Codec Implementations
• Part IV: Emerging Applications and HEVC Extensions

Part I: Overview of current video coding technology and systems

Growing Demand for Video
• Video exceeds half of internet traffic and will grow to 86 percent by 2016. Increase in applications, content, fidelity, etc. → Need higher coding efficiency!
• Ultra-HD 4K broadcast expected for Japan in 2014. London Olympics Opening and Closing Ceremonies shot in Ultra-HD 8K. → Need higher throughput!
• 25x increase in mobile data traffic over next five years. Video is a "must have" on portable devices. → Need lower power!
Sources: Cisco Visual Networking Index; Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update
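
Digital Video
[Figure: a video sequence as a series of frames (0, 1, 2, 3). In 4:2:0 format each frame consists of a luma plane Y of H × W samples and two chroma planes, Cb and Cr, of H/2 × W/2 samples each.]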

Video Compression
• Uncompressed 1080p high definition (HD) video at 24 frames/second
  – Pixels per frame: 1920×1080
  – Bits per pixel: 8 bits × 3 (RGB)
  – 1.5 hours: 806 GB
  – Bit-rate: 1.2 Gbits/s
• Blu-ray Disc
  – Capacity: 25 GB (single layer)
  – Read rate: 36 Mbits/s
• Video Streaming or TV Broadcast
  – 1 Mbits/s to 20 Mbits/s
• Require 30x to 1200x compression
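
The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration; the function name `raw_video_stats` is made up for this example, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
def raw_video_stats(width, height, bits_per_pixel, fps, seconds):
    """Return (bit rate in bits/s, total size in bytes) for uncompressed video."""
    bits_per_frame = width * height * bits_per_pixel
    bit_rate = bits_per_frame * fps
    total_bytes = bit_rate * seconds // 8
    return bit_rate, total_bytes


# 1080p, 8 bits x 3 (RGB), 24 frames/second, 1.5 hours
bit_rate, total_bytes = raw_video_stats(1920, 1080, 24, 24, 90 * 60)
print(f"bit rate: {bit_rate / 1e9:.2f} Gbit/s")
print(f"1.5 hours: {total_bytes / 1e9:.0f} GB")

# Compression ratio needed to fit a given channel or read rate
for target in (36e6, 20e6, 1e6):
    print(f"{target / 1e6:.0f} Mbit/s needs about {bit_rate / target:.0f}x compression")
```

The ratios for 36 Mbits/s and 1 Mbits/s bracket the "30x to 1200x" range quoted on the slide.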

Video Compression Basics
• Compression is achieved by removing redundant information from the video sequence
• Types of redundancies in video sequences
  – Spatial redundancy
  – Perceptual redundancy
  – Statistical redundancy
  – Temporal redundancy
[Figure: sequence of frames 0, 1, 2, 3.]

Spatial Redundancy Removal (1)
• Intra prediction
[Figure: Frame 0. The current block to be coded is horizontally predicted from the previous block (intra prediction); the difference between the current block and the predicted block is encoded.]

Spatial Redundancy Removal (2)
• Block Transforms
  – Typically matrix operations
  – Used for correlation reduction and energy compaction in the block

8×8 pixel block:
151 149 145 140 136 133 128 120
150 147 144 140 136 132 127 118
149 145 142 138 135 129 122 116
147 143 139 136 131 126 120 113
141 139 137 132 127 124 116 109
138 135 133 130 125 120 113 106
135 131 130 128 123 117 111 105
132 130 129 126 120 115 109 105

After 8×8 2D Discrete Cosine Transform (DCT):
1037   80    0    9    0    4    0    0
  49    1    3    3    0    0    0    1
   0    0    1    0    0    0    0    0
   0    0    0    1    0    0    0    0
   0    0    1    0    0    0    0    0
   1    1    1    1    2    0    0    0
   0    1    0    0    0    0    0    0
   0    0    0    0    0    0    1    0
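
As a rough illustration of energy compaction, the sketch below applies an orthonormal floating-point 2D DCT-II to the 8×8 pixel block from this slide. It assumes the slide's coefficients come from the same normalization; the helper names are invented for this example, and real codecs use integer approximations instead.

```python
import math

BLOCK = [
    [151, 149, 145, 140, 136, 133, 128, 120],
    [150, 147, 144, 140, 136, 132, 127, 118],
    [149, 145, 142, 138, 135, 129, 122, 116],
    [147, 143, 139, 136, 131, 126, 120, 113],
    [141, 139, 137, 132, 127, 124, 116, 109],
    [138, 135, 133, 130, 125, 120, 113, 106],
    [135, 131, 130, 128, 123, 117, 111, 105],
    [132, 130, 129, 126, 120, 115, 109, 105],
]


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dct2(block):
    """Separable 2D DCT: transform the columns, then the rows (C * X * C^T)."""
    c = dct_matrix(len(block))
    c_t = [list(col) for col in zip(*c)]
    return matmul(matmul(c, block), c_t)


coeffs = dct2(BLOCK)
energy = sum(v * v for row in coeffs for v in row)
dc_share = coeffs[0][0] ** 2 / energy
print("DC coefficient:", round(coeffs[0][0]))
print(f"share of energy in the DC coefficient: {dc_share:.3f}")
```

Because the transform is orthonormal, the total energy is preserved, so the share held by the DC coefficient is a direct measure of how strongly the block's energy is compacted.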

Perceptual Redundancy Removal (1)
• Not all video data are equally significant from a perceptual point of view
• Make use of the properties of the Human Visual System (HVS)
  – HVS is more sensitive to low frequency information
[Figure: example images of low frequency and high frequency content.]

Perceptual Redundancy Removal (2)
• Quantization is a good tool for perceptual redundancy removal
  – Most significant bits (MSBs) are perceptually more important than least significant bits (LSBs)
  – Coefficient dropping (quantization with zero bits) example:
[Figure: original frame next to the image obtained by retaining 36 DCT coefficients for each 8×8 block.]

Statistical Redundancy Removal (1)
• Not all pixel values in an image (or in the transformed image) occur with equal probability
• Use entropy coding (e.g. variable length coding)
  – Shorter codewords used to represent more frequent values
  – Longer codewords used to represent less frequent values
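
A minimal sketch of variable length coding follows, using a Huffman code as one common example of the idea (the slide does not name a specific code). The input data and function names are hypothetical; the point is only that frequent values receive shorter codewords and that the average length approaches the entropy.

```python
import heapq
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(symbols):
    """Return {symbol: codeword length} for a Huffman code built from the data."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, tie-breaker, {symbol: depth so far})
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, a = heapq.heappop(heap)
        c2, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical quantized values: 0 is most frequent, 2 and 3 are rare
data = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2
lengths = huffman_code_lengths(data)
average_bits = sum(lengths[s] for s in data) / len(data)
print(lengths)
print(average_bits, entropy_bits(data))
```

With a fixed-length code these four values would need 2 bits each; here the average drops below that because the most frequent value gets a 1-bit codeword.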

Statistical Redundancy Removal (2)
• Original image: 8 bits/pixel, Entropy coding: 7.14 bits/pixel
• Results more dramatic when entropy coding is applied on transformed and quantized image: 1.82 bits/pixel
[Figure: two histograms, one over pixel values 0 to 250 (counts up to 1800) and one over values -500 to 2000 (counts up to 2.5 × 10^4).]

Temporal Redundancy Removal (1)
• Inter prediction
• Frame difference coding
  – Difference can be encoded using DCT + Quantization + Entropy Coding
[Figure: Frame 3, Frame 4, and the difference image Frame 4 – Frame 3.]

Temporal Redundancy Removal (2)
• Inter prediction using motion compensated prediction
  – Divide the frame into blocks and apply block motion estimation/compensation
  – For each block, find the relative motion between the current block and a matching block of the same size in the previous frame
  – Transmit the motion vector(s) for each block
[Figure: a block in Frame t and its matching block in Frame t-1, linked by a motion vector (mv).]
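
The block motion estimation step can be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD). This is a minimal illustration with integer-pixel vectors only; the frame contents, the function names, and the choice of SAD as the matching cost are assumptions of this example, not something the slide prescribes.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a current block and a reference block."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def full_search(cur, ref, cx, cy, size, search_range):
    """Return (best cost, dx, dy) over all integer offsets within the search range."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, cx, cy, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best


def make_frame(x0, y0, width=16, height=16):
    """Hypothetical test frame: a bright 4x4 square on a flat background."""
    return [[200 if x0 <= x < x0 + 4 and y0 <= y < y0 + 4 else 50
             for x in range(width)] for y in range(height)]


previous = make_frame(4, 4)   # frame t-1
current = make_frame(6, 5)    # frame t: square moved 2 right, 1 down
cost, dx, dy = full_search(current, previous, cx=6, cy=5, size=4, search_range=3)
print("motion vector:", (dx, dy), "SAD:", cost)
```

The motion vector points from the current block back to its match in the previous frame, so an object that moved right and down yields a vector pointing left and up.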

Temporal Prediction and Picture Coding Types
• Intra Picture (I)
  – Picture is coded without reference to other pictures
• Inter picture (P, B, b)
  – Uni-directionally predicted (P) Picture
    • Picture is predicted from one prior coded picture
  – Bi-directionally predicted (B, b) Picture
    • Picture is coded from two prior coded pictures
[Figure: example picture sequence I b B b P.]

Summary of Key Steps in Video Coding
• Intra Prediction and Inter Prediction
[Figure: Inter Prediction (Motion Compensation) from the previous frame to the current frame using a motion vector; Intra Prediction within the current frame using a prediction mode.]
• Transform and Quantization of residual (prediction error)
[Figure: Transform and Quantization map many pixels* to few coefficients.]
• Entropy coding on syntax elements, e.g. prediction modes, motion vectors, coefficients
• In-loop filtering to reduce coding artifacts
* Residual figure from J. Apostolopoulos, "Video Compression," MIT 6.344 Lecture, Spring 2004

Video Compression Standards
• Ensures inter-operability between encoder and decoder
• Support multiple use cases and applications
  – Levels and Profiles
• Video coding standard specifies decoder: mapping of bits to pixels
• ~2x improvement in compression every decade
[Figure: processing chain Source, Pre-Processing, Encoding, Decoding, Post-Processing, Destination, with the "Scope of Standard" marked.]
[Figure: bit-rate by standard: MPEG-2 (1994), H.264/AVC (2003), HEVC (2013).]

History of Video Coding Standards
• MPEG: Moving Picture Experts Group (ISO/IEC)
• VCEG: Video Coding Experts Group (ITU-T)
• Other standards: VC1, VP8/VP9, China AVS, RealVideo
[Figure: timeline from 1984 to 2004. VCEG: H.261, H.263, H.263+, H.263++. MPEG: MPEG-1, MPEG-4. MPEG/VCEG: MPEG-2/H.262, H.264/MPEG-4 Part 10-AVC.]

Video Coding Progress
[Figure: video coding progress over time. Source: T. Wiegand, JVT-W132, 2007]

H.264/MPEG-4 AVC
• Completed (version 1) in May 2003
• H.264/AVC is the most popular video standard in the market
  – 80% of video on the internet is encoded with H.264/AVC
• Applications include
  – HDTV broadcast: satellite, cable, and terrestrial
  – video content acquisition and editing
  – camcorders, security applications, Internet and mobile network video, Blu-ray Discs
  – real-time video chat, video conferencing, and telepresence
• ~50% higher coding efficiency than MPEG-2 (used in DVD, US terrestrial broadcast)

Improvements of H.264/MPEG-4 AVC over previous standards
• Prediction
  – Intra prediction using neighboring samples
  – Temporal prediction using multiple frames
  – Motion compensation on variable block size, quarter-pel
• Transform
  – 4×4/8×8 Integer transform, 2×2/4×4 Secondary Hadamard
• Quantization
  – Finer quantization supported
• Entropy coding
  – Context adaptive variable length coding (CAVLC) and arithmetic coding (CABAC)
• In-loop deblocking filter

Part II: High Efficiency Video Coding (HEVC)

High Efficiency Video Coding (HEVC)
• Achieves 2x higher compression compared to H.264/AVC
• High throughput (Ultra-HD 8K @ 120fps) & low power
  – Implementation friendly features (e.g. built-in parallelism)
• Benefits include
  – reduce the burden on global networks
  – easier streaming of HD video to mobile devices
  – account for advancing screen resolutions (e.g. Ultra-HD)
"HEVC will provide a flexible, reliable and robust solution, future-proofed to support the next decade of video" - ITU-T Press Release (2013)
[Figure: example deployments: Samsung Galaxy S4; live delivery of French Open; Netflix Ultra-HD 4K; Samsung TV Ultra-HD 4K.]

Activity in JCT-VC Committee
• Chairs
  – G. J. Sullivan (Microsoft)
  – J. R. Ohm (Aachen University)
• Meet Quarterly
  – 1st meeting (A) [January 2010] …..
  – 12th meeting (L) [January 2013]
• ~250 attendees per meeting representing ~70 companies
• Several hundred contributions per meeting
• Each meeting is around 9 - 10 days (14+ hours/day)
• Multiple parallel tracks
[Figure: bar chart of attendees and contributions for meetings A to J, vertical axis 0 to 1200.]

HEVC Reference Documents
• Meeting Contributions
  – http://phenix.int-evry.fr/jct/
• Specification
  – http://www.itu.int/ITU-T/recommendations/rec.aspx?rec=11885
• Reference Software (HM)
  – https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/
• References
  – G. J. Sullivan, et al. "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, 2012
  – V. Sze, M. Budagavi, G. J. Sullivan (Editors), "High Efficiency Video Coding (HEVC): Algorithms and Architectures," Springer, 2014
    http://www.springer.com/engineering/signals/book/978-3-319-06894-7

Coding Efficiency of HEVC (Objective)
[Figure: objective coding efficiency comparison.]
J. R. Ohm et al., "Comparison of the Coding Efficiency of Video Coding Standards—Including High Efficiency Video Coding (HEVC)," IEEE Transactions on Circuits and Systems for Video Technology, 2012

$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^{\mathrm{bitdepth}} - 1)^2 \cdot W \cdot H}{\sum_i (O_i - D_i)^2}$$
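
The PSNR definition above translates directly into code. In this sketch `original` and `decoded` are assumed to be equal-sized lists of rows of integer samples; the function name and the convention of returning infinity for identical pictures are choices made for this example.

```python
import math


def psnr(original, decoded, bit_depth=8):
    """PSNR in dB between two equal-sized pictures given as lists of rows."""
    height, width = len(original), len(original[0])
    sse = sum((o - d) ** 2
              for row_o, row_d in zip(original, decoded)
              for o, d in zip(row_o, row_d))
    if sse == 0:
        return math.inf  # identical pictures
    peak = (1 << bit_depth) - 1
    return 10 * math.log10(peak * peak * width * height / sse)


# Hypothetical 2x2 picture in which every sample is off by one
original = [[100, 100], [100, 100]]
decoded = [[101, 101], [101, 101]]
print(f"{psnr(original, decoded):.2f} dB")
```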

Coding Efficiency of HEVC (Subjective)
Subjective Tests for Entertainment Applications (Random Access)

Sequences          Bit-rate Savings
BQ Terrace         63.1%
Basketball Drive   66.6%
Kimono1            55.2%
Park Scene         49.7%
Cactus             50.2%
BQ Mall            41.6%
Basketball Drill   44.9%
Party Scene        29.8%
Race Horse         42.7%
Average            49.3%

J. Ohm et al., "Comparison of the Coding Efficiency of Video Coding Standards—Including High Efficiency Video Coding (HEVC)," IEEE Transactions on Circuits and Systems for Video Technology, 2012
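
H.265/HEVC vs. H.264/AVC Decoder
[Figure: decoder block diagram. The encoded bitstream passes through the Entropy Decoder and Q⁻¹ + T⁻¹; the result is added to the output of Intra Prediction or Motion Comp., then passes through the In-loop Filter (Deblocking Filter, Sample Adaptive Offset) to the Picture Buffer and out as decoded pixels.]
HEVC changes annotated on the diagram:
• High Throughput CABAC & Advanced Motion Vector Prediction
• Larger Transforms and More Sizes
• More Prediction Modes
• Larger Interpolation Filter
• Fewer Edges
• Sample Adaptive Offset
• Larger and Flexible Coding Block Size (64×64)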

Key Features in HEVC

Feature                                              High Coding Efficiency   High Throughput / Low Power
Larger and Flexible Coding Block Size                X
More Sophisticated Intra Prediction                  X
Larger Interpolation Filter for Motion Compensation  X
Larger Transform Size                                X
Parallel Deblocking Filter                                                    X
Sample Adaptive Offset                               X
High Throughput CABAC                                X                        X
High Level Parallel Tools                                                     X
Parallel Merge/Skip                                                           X

M. Zhou, V. Sze, M. Budagavi, "Parallel Tools in HEVC for High-Throughput Processing," SPIE Optical Engineering + Applications, Applications of Image Processing XXXV, 2012.

Larger Coding Blocks
• Each frame is broken up into blocks
• Large block sizes reduce signaling overhead
• In H.264/AVC, macroblock is always 16×16 pixels
  – Each macroblock is either inter or intra coded
• In HEVC, Coding Tree Unit (CTU) can have up to 64×64 pixels
  – CTU can have a combination of inter and intra coded blocks
[Figure: CTU of N × N pixels, N = 16, 32, or 64.]

Flexible Coding Block Structure
• Better adaptation to different video content
• CTU divided into Coding Units (CU) with quad tree
• Coding units divided into prediction units (PU)
• PU have different motion data or prediction modes
[Figure: a Coding Tree Unit (CTU) shown as a coding tree composed of Coding Units (CU), with Prediction Units (PU) inside the CUs, including a skip CU and an asymmetric motion partition.]

Prediction Units
• Intra-coded CU can only be divided into square partition units
  – For a CU, make decision to split into four PU (8×8 CUs only) or single PU
• Inter-coded CU can be divided into square and non-square PU as long as one side is at least 4 pixels wide (note: no 4×4 PU)
[Figure: two methods of partitioning for intra-coded CU: one N × N PU, or four N/2 × N/2 PUs.]
[Figure: eight methods of partitioning for inter-coded CU: N × N; two N × N/2; two N/2 × N; four N/2 × N/2; and four asymmetric splits that divide one dimension into N/4 and 3N/4.]
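
The eight inter partitionings can be enumerated programmatically. In the sketch below the mode names follow the 2N×2N convention commonly used for HEVC partition modes, where the CU is 2N wide (the figure instead labels the CU side as N). The size rule is an interpretation of the slide's note, assumed here to mean that both PU sides must be at least 4 pixels and that 4×4 is excluded; the function name is made up for this example.

```python
def inter_pu_partitions(cu_size):
    """Return {mode name: list of (width, height) PUs} allowed for a square CU."""
    half, quarter = cu_size // 2, cu_size // 4
    modes = {
        "2Nx2N": [(cu_size, cu_size)],
        "2NxN": [(cu_size, half)] * 2,
        "Nx2N": [(half, cu_size)] * 2,
        "NxN": [(half, half)] * 4,
        # Asymmetric motion partitions: one dimension split 1/4 : 3/4
        "2NxnU": [(cu_size, quarter), (cu_size, cu_size - quarter)],
        "2NxnD": [(cu_size, cu_size - quarter), (cu_size, quarter)],
        "nLx2N": [(quarter, cu_size), (cu_size - quarter, cu_size)],
        "nRx2N": [(cu_size - quarter, cu_size), (quarter, cu_size)],
    }

    def allowed(pus):
        # Assumed rule: no PU side below 4 pixels and no 4x4 PU
        return all(min(w, h) >= 4 and (w, h) != (4, 4) for w, h in pus)

    return {name: pus for name, pus in modes.items() if allowed(pus)}


for size in (64, 16, 8):
    print(size, sorted(inter_pu_partitions(size)))
```

Under this assumed rule a 16×16 CU keeps all eight modes, while an 8×8 CU keeps only the three modes whose PUs stay at or above 8×4.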

Large Transforms
• HEVC supports 4×4, 8×8, 16×16, 32×32 integer transforms
  – Two types of 4×4 transforms (IDST-based for Intra, IDCT-based for Inter); IDCT-based transform for 8×8, 16×16, 32×32 block sizes
  – Integer transform avoids encoder-decoder mismatch and drift caused by slightly different floating point representations.
  – Parallel friendly matrix multiplication/partial butterfly implementation
  – Transform size signaled using Residual Quad Tree
• Achieves 5 to 10% increase in coding efficiency
• Increased complexity compared to H.264/AVC
  – 8x more computations per coefficient
  – 16x larger transpose memory
[Figure: Transform and Quantization map many pixels to few coefficients; the residual of a CU is represented with a TU quad tree.]
M. Budagavi et al., "Core Transform Design in the High Efficiency Video Coding (HEVC) Standard," IEEE JSTSP, 2013

Intra Prediction
• H.264/AVC has 10 modes
  – angular (8 modes), DC, planar
• HEVC has 35 modes
  – angular (33 modes), DC, planar
• Angular prediction
  – Interpolate from reference pixels at locations based on angle
• DC
  – Constant value which is an average of neighboring pixels (reference samples)
• Planar
  – Average of horizontal and vertical prediction
[Figure: intra prediction mode directions. 0: Planar, 1: DC, 2..34: Angular, with the horizontal mode and the vertical mode marked; the figure legend also lists 35: Intra_FromLuma.]

Intra Prediction Modes
[Figure: examples of intra prediction modes.]
J. Lainema, W.-J. Han, "Intra Prediction in HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Removing Intra Artifacts (Pre-Processing)
• Reference Sample Smoothing
  – Smooth out neighboring pixels (i.e., reference samples) before using them for prediction
  – Reduce contouring artifacts caused by edges in the reference sample arrays
  – Two modes
    • Three-tap smoothing filter
    • Strong intra smoothing with corner reference pixels
  – Application of smoothing depends on PU size and prediction mode
[Figure: prediction w/o pre-filter and w/ pre-filter. Image source: M. Wien, TCSVT, July 2003]
J. Lainema, W.-J. Han, "Intra Prediction in HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Removing Intra Artifacts (Post-Processing)
• Boundary Smoothing
  – Intra prediction may introduce discontinuities along block boundaries
  – Filter first prediction row and column with three-tap filter for DC prediction, and two-tap for horizontal and vertical prediction
[Figure: boundary smoothing example. Image source: JCTVC-F172, July 2011]

Inter Prediction
• Motion vectors can have up to ¼ pixel accuracy (interpolation required)
• In H.264/AVC, luma uses 6-tap filter, and chroma uses bilinear filter
• In HEVC, luma uses 8/7-tap and chroma uses 4-tap
  – Different coefficients for ¼ and ½ positions
• Restricted prediction on small PU sizes
[Figure: a 4×4 block in the current frame and its reference block in the previous frame, once for vector (1, -1) and once for vector (0.5, -0.5).]

Interpolation Filter
• Require integer pixels (highlighted in red) to interpolate fractional pixels (highlighted in blue)
• To interpolate NxN pixels requires up to (N+7)x(N+7) reference pixels
• Use 1-D filters (order matters for greater than 8-bit video)
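
To make the (N+7)x(N+7) figure concrete, the sketch below tabulates how many reference pixels an N×N block needs with an 8-tap separable filter and how that overhead shrinks as blocks grow. The function name is hypothetical; the 8-tap length is the luma filter length from the Inter Prediction slide.

```python
def reference_pixels(n, taps=8):
    """Reference pixels needed to interpolate an n x n block with a separable filter."""
    side = n + taps - 1  # (taps - 1) extra rows and columns in total
    return side * side


for n in (4, 8, 16, 32, 64):
    needed = reference_pixels(n)
    print(f"{n}x{n}: {needed} reference pixels, {needed / (n * n):.2f} per predicted pixel")
```

A 4×4 block would need 121 reference pixels for 16 predicted pixels, which illustrates why the smallest block sizes are the most expensive to fetch.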

Mode Coding
• Predict modes from neighbors to reduce syntax element bits
  – Intra Prediction Mode
  – Advanced Motion Vector Prediction (AMVP), Merge/Skip Mode
[Figure: left, current PU with neighbors A and B (3 candidates); right, current PU with spatial neighbors A0, A1, B0, B1, B2 and co-located PU positions CR and H (2 to 5 candidates).]

Merge Mode
[Figure: partitioning around a moving object, without Merge (many extra motion parameters) and with Merge.]
B. Bross et al., "Inter Prediction in HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

AMVP, Merge, Skip Mode
• Syntax elements
  – AMVP: mvp_l0_flag, mvp_l1_flag
  – Merge: merge_flag, merge_idx
  – Skip: cu_skip_flag, merge_idx
• Use of neighbor candidates
  – AMVP: predict motion vector
  – Merge: copy motion data (motion vector, reference index, direction)
  – Skip: copy motion data (motion vector, reference index, direction); no residual
• Number of candidates
  – AMVP: up to 2
  – Merge and Skip: up to 5 (signaled in slice header)
• Spatial candidates
  – AMVP: up to 2 of 5 (scaling if reference index different)
  – Merge and Skip: up to 4 of 5 (no scaling, only redundancy check)
• Temporal candidates
  – AMVP: up to 1 of 2 (if < 2 spatial candidates)
  – Merge and Skip: up to 1 of 2 (always added to list if available)
• Additional candidates
  – AMVP: zero motion vector (if < 2 spatial or temporal candidates)
  – Merge and Skip: bi-predictive candidates and zero motion vector

In-loop Filtering: Deblocking Filter
• Removes blocking artifacts due to block based processing
  – Computationally intensive in H.264/AVC
• In H.264/AVC, performed on every 4x4 block edge
  – Each macroblock has 128 pixel edges, 32 edge calculations
  – Each 4x4 depends on neighboring 4x4
• In HEVC, performed on every 8x8 block edge
  – Each 16x16 CTU has 64 pixel edges, 8 edge calculations
  – All 8x8 are independent (can be processed in parallel)
[Figure: picture w/o deblocking and w/ deblocking; 16 × 16 block with its filtered edges.]
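
The edge counts quoted above follow from the filtering grid. The sketch below reproduces them under two assumptions that are not stated on the slide: each block contributes one edge line per grid column and per grid row, and one edge calculation covers a segment whose length equals the grid size. The function name is invented for this example.

```python
def deblock_edge_counts(block_size, grid):
    """Return (pixel edges, edge calculations) for a square block and a filter grid."""
    lines_per_direction = block_size // grid          # vertical, and again horizontal
    pixel_edges = 2 * lines_per_direction * block_size
    edge_calculations = pixel_edges // grid           # one per grid-length segment (assumed)
    return pixel_edges, edge_calculations


print("H.264/AVC 16x16 macroblock, 4x4 grid:", deblock_edge_counts(16, 4))
print("HEVC 16x16 block, 8x8 grid:", deblock_edge_counts(16, 8))
```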

In-loop Filtering: Sample Adaptive Offset (SAO)
• Filter to address local discontinuities
  – Edge Offset and Band Offset
• Check neighbors in one of 4 directions (0, 90, 135, 45 degrees)
• Based on the values of the neighbors, apply one of 4 offsets
[Figure: edge offset categories 1 to 4, each shown as the pixel level of the current sample c at index x relative to its neighbors at x-1 and x+1.]
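
The edge offset classification can be sketched as follows. The category rules are the usual HEVC edge offset definitions (local minimum, two kinds of corner, local maximum); the offset values, the one-dimensional sample row, and the function names are hypothetical, and clipping to the valid sample range is left out for brevity.

```python
def sao_edge_category(a, c, b):
    """Classify sample c against its two neighbors a and b along one direction."""
    if c < a and c < b:
        return 1  # local minimum
    if (c < a and c == b) or (c == a and c < b):
        return 2  # corner: below one neighbor, equal to the other
    if (c > a and c == b) or (c == a and c > b):
        return 3  # corner: above one neighbor, equal to the other
    if c > a and c > b:
        return 4  # local maximum
    return 0      # flat or monotonic: no offset applied


def apply_edge_offset(row, offsets):
    """Apply edge offsets along a 1-D row; classification uses the unfiltered samples."""
    out = list(row)
    for x in range(1, len(row) - 1):
        category = sao_edge_category(row[x - 1], row[x], row[x + 1])
        if category:
            out[x] = row[x] + offsets[category]
    return out


offsets = {1: 2, 2: 1, 3: -1, 4: -2}   # hypothetical offsets chosen by an encoder
row = [10, 8, 10, 10, 12, 10, 10]
print(apply_edge_offset(row, offsets))
```

Positive offsets for categories 1 and 2 and negative offsets for categories 3 and 4 pull local extremes back toward their neighbors, which is the smoothing effect the filter aims for.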

In-loop Filtering: Sample Adaptive Offset (SAO)
[Figure: picture with SAO and without SAO.]
C.-M. Fu et al., "Sample Adaptive Offset in the HEVC Standard," IEEE Transactions on Circuits and Systems for Video Technology, 2012

Entropy Coding
• Lossless compression of syntax elements
• HEVC uses Context Adaptive Binary Arithmetic Coding (CABAC)
  – 10 to 15% higher coding efficiency compared to CAVLC
V. Sze, D. Marpe, "Entropy Coding in HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

CABAC Throughput Improvements
• Reduce total number of bins
• Reduce context coded bins
• Reduce context dependencies
• Grouping bypass bins
• Reduce parsing dependencies
• Reduce memory requirements
[Figure: CABAC decoder. Bits enter the Arithmetic Decoder (AD), which outputs bins to the De-Binarizer (DB), which outputs syntax elements; Context Modeling (CM), with Context Memory and Context Selection (CS), supplies the probability to the arithmetic decoder, with a separate bypass path.]

Reduction in worst case bins for 16x16 pixels:
            Total bins   Context bins   Bypass bins
H.264/AVC   20861        7805           13056
HEVC        14301        884            13417
Ratio       1.5x         9x             1x

• 3x reduction in context memory
• 20x reduction in line buffer for context selection
[Figure: example bin sequences decoded in 15 cycles and in 9 cycles.]
V. Sze, M. Budagavi, "High Throughput CABAC Entropy Coding in HEVC," IEEE TCSVT, 2012

High Level Parallel Tools (Multi-Core)
• Slices (also in H.264/AVC)
• Tiles
• Wavefront Parallel Processing (Interleaved Entropy Slices*)
[Figure: a picture divided into slices 0 to 3, into tiles 0 to 3, and into substreams 0 to 3 for wavefront parallel processing.]
* D. Finchelstein, V. Sze, A. P. Chandrakasan, "Multi-core Processing and Efficient On-chip Caching for H.264 and Future Video Decoders," IEEE Trans. CSVT, 2009

Additional Modes
• For wireless display and cloud computing, screen content coding should be considered
• Screen content typically has more edges
• Lossless
  – Bypass transform, quantization and in-loop filters
• Transform Skip
  – Bypass transform, but continue to perform quantization and in-loop filters
• I_PCM
  – Signal raw pixels
[Figure: screen content example. Source: www.techprollc.com]

Profiles, Levels, Tiers
• Profile defines set of tools for different applications
  – Main, Main 10, Main Still Picture
  – 8 bits/sample → 16.78 million colors
  – 10 bits/sample → 1.07 billion colors
• Level defines the maximum supported resolution and frame rate
  – e.g. Level 4.0, 1920x1080 @ 32 fps
  – Level 5.0, 4096x2160 @ 30 fps
• Bit-rates defined by level and tier
  – Main and High (professional)

Main Still Picture (Intra Coding Only)
• HEVC also provides improved compression for still images

Compared against        BD-Rate Reduction
H.264/AVC (intra only)  15.8%
JPEG 2000               22.6%
JPEG XR                 30.0%
WebP                    31.0%
JPEG                    43.0%

T. Nguyen, D. Marpe, "Performance Comparison of HM 6.0 with Existing Still Image Compression Schemes Using a Test Set of Popular Still Images," JCTVC-I0595, 2012

Part III: Video Codec Implementations

Decoder Design Considerations
• Function
  – Mapping of bitstream to pixels fixed by the standard
• Implementation Requirements
  – Conformance: Support all tools for a given profile in the standard
  – Throughput: Real-time processing for video playback; level specifies pixel-rate and bit-rate
[Figure: the Decoder takes a bitstream (10101011) at specified bit-rate and outputs pixels at specified pixel-rate.]

Encoder Design Considerations (1)
• Function
  – Mapping of pixels to standard compliant bitstream
  – Flexibility of selecting which set of encoding tools to use and how to use them (e.g. how to search for best compression mode)
[Figure: the Encoder takes pixels at specified pixel-rate for real-time applications and outputs a bitstream (10101011) at specified bit-rate or compression ratio.]

Encoder Design Considerations (2)
• Implementation Requirements
  – Conformance: Must generate a bitstream that is decodable by a standard compliant decoder (for a given profile)
  – Throughput: For real-time applications, need to meet pixel-rate requirements; can be done off-line for storage applications
  – Bit-rate/Compression Ratio: For given application, must meet minimum compression requirements
  – Compression ratio vs. Complexity: Find compression mode that meets compression requirements under complexity constraint
Decoder design requires architecture innovations, while encoder design requires both algorithm and architecture innovations

Multimedia Platforms

                    Desktop CPU [1] | Mobile CPU [1] | GPU+CPU [2] | DSP [3] | FPGA [4] | ASIC [5,6]
Flexibility         High            | High           | Med/High    | Med     | Med      | Low
Development Cost    Low             | Low            | Low/Med     | Med     | Med      | High
Speed/Throughput    Low/Med         | Low            | Med         | Med     | Med      | High
Power Consumption   High            | Med            | High        | Med     | Med      | Low

Examples of HEVC implementations
[1] F. Bossen et al., "HEVC Complexity and Implementation Analysis," IEEE TCSVT, 2012
[2] Ittiam Systems, "Compute accelerated HEVC decoder on ARM® Mali™-T600 GPUs"
[3] F. Pescador et al., "On an implementation of HEVC video decoders with DSP technology," IEEE ICCE, 2013
[4] S. Cho, H. Kim, "Implementation of a HEVC Hardware Decoder," JCTVC-L0098, 2013
[5] C.-T. Huang et al. "A 249Mpixel/s HEVC video-decoder chip for Quad Full HD applications," IEEE ISSCC, 2013.
[6] S.-F. Tsai et al. "A 1062Mpixels/s 8192×4320p High Efficiency Video Coding (H.265) encoder chip," IEEE VLSIC, 2013.

Implementation Requirements
• Throughput
  – Achieve target pixel-rate and bit-rate for real-time applications
  – Reduce latency of bits to pixels and pixels to bits for interactive applications
  – Techniques: parallelism, pipelining, eliminate stalls
• Energy and Power Consumption
  – Minimize energy consumption to extend battery life for portable devices
  – Minimize power consumption to reduce heat dissipation
  – Techniques: voltage scaling, frequency scaling, power gating, number of ops
• Platform Cost
  – Reduce amount of data to be stored in memory and amount of logic (e.g. gates in ASIC, number of cores for processors) to reduce size of chip
  – Reduce bandwidth requirements such as reads/writes from memory to reduce demands on off-chip components
  – Techniques: shared computations, on-the-fly processing, caching

Software HEVC Decoder
• ARMv7 1.3GHz (mobile processor) [Bossen, JCTVC-K0327, 2012]
  – Dual core, but decoding on single thread (other thread for display)
  – 1080p @ 24 fps at 2Mbps (16 picture buffer to average workload)
• Intel i7 Core 2.6 GHz (desktop processor) [Bossen et al., TCSVT, 2012]
  – Single core, single thread
  – 1080p @ 60 fps at 7Mbps
• Multi-thread Intel Core i7 2.7 GHz [Suzuki et al., JCTVC-L0098, 2013]
  – 4 cores / 4 threads (parallel GOPs)
  – 3840x2160 @ 76 fps at 12Mbps [cropped 8K content]
• Multi-thread Intel X5680 3.3 GHz [Chi et al., TCSVT, 2012]
  – 2x6 cores/12 threads (parallel Tiles, WPP)
  – 3840x2160 @ 24 fps at ~12Mbps (QP=37)
  – 3840x2160 @ 14 fps at ~170Mbps (QP=22)

Software HEVC Decoder
[Figure: workload for different modules.]
F. Bossen et al., "HEVC Complexity and Implementation Analysis," IEEE Transactions on Circuits and Systems for Video Technology, 2012

Hardware HEVC Decoder Architecture
[Figure: decoder block diagram. Group I: Entropy Decoder (with line buffer), MV Dispatch, ColMV DMA. Group II: Inverse Transform, Prediction, In-loop Filters (with line buffer for prediction and in-loop filters), MC Cache, Rec DMA. Blocks exchange Coeff, Residue, MV Info, Ref Pixels and VPB/Top Info under Top Control and connect to memory through the Memory Interface Arbiter. Legend: pixel flow, info flow, DMA flow, SRAM, processing engine.]
M. Tikekar et al., "Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Pipelining HEVC Decoder
• Variable-size pipelining to support a diverse set of CTU, CU, and PU sizes (select size to balance memory cost vs. data reuse)
[Figure: Variable-size Pipeline Block (VPB) of 64x64, 64x32, or 64x16 for CTU sizes of 64x64, 32x32, and 16x16. System level pipeline (between Inv. Transform, Prediction and In-Loop Filters) and prediction level pipeline (within Prediction module), with PPB (Stage 1) and Sub-PPB (Stage 2) blocks processed in order 0 to 5 for Y and U/V; 16x16 pipeline.]
Source: C.-T. Huang et al., "A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications," IEEE ISSCC, 2013.

Decoupling Entropy Coding
• Workload of entropy decoding is based on bit-rate (bin-rate), while the rest of the decoder depends on pixel-rate
• Use FIFO to absorb variations in workload
  – Higher FIFO depth results in fewer stalls due to averaging, but longer latency and higher memory cost
[Figure: pipeline timing. Group I (Entropy Decoder, MC Dispatch) is decoupled from Group II (Inverse Transform, Prediction, Deblock, REC DMA) by a FIFO that holds the coefficients in a TU.]
Source: C.-T. Huang et al., "A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications," IEEE ISSCC, 2013.

Intra Prediction
• Reference sample processing
  – Reference pixel buffer to store neighboring pixels (padding when not available)
  – Apply smoothing filter on pixels depending on mode
• Feedback loop at TU granularity
  – Update reference pixel buffer accordingly
[Figure: the output of Intra Prediction or Inter Prediction is added to the Inverse Transform output; the reconstructed pixels feed back as intra reference pixels (TU granularity feedback).]
M. Tikekar et al., "Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Inter Prediction
• Read samples from reference picture (typically stored in off-chip picture buffer)
  – Use cache to reduce off-chip memory bandwidth
• Interpolation uses a 2-D separable filter for fractional motion vectors
  – Multiple pixels can be interpolated in parallel (share input pixels)
• Smaller blocks have larger read overhead (for fractional mv)
  – NxN requires (N+7)x(N+7) pixel reads → 4x4 inter-PU not supported in HEVC
[Figure: motion vectors from the Entropy Decoder pass through Dispatch, MC Cache, Fetch and 2-D Filter to produce inter predicted pixels; the MC Cache connects to the Reference Picture Buffer (on-chip SRAM/external DRAM).]

MC Cache and Picture Buffer
• Minimize redundant reads from off-chip memory (DRAM)
• MC Cache design considerations
  – Sufficient throughput to support worst case PU
  – Detect redundant reads and handle latency of DRAM
• Store pixels in DRAM to minimize row changes (cycle overhead)
  – Avoid reading two rows from same bank for a given reference region
• 20% reduction in overhead cycles
[Figure: mapping of picture regions to DRAM banks 0 to 7 (# = bank in DRAM).]
M. Tikekar et al., "Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Inverse Transform
• Larger transform → More computation
  – Share coefficients across transform sizes and within transform to reduce area cost
• 30% reduction in area cost
[Figure: recursive inverse transform datapath with an even-odd index sort; IDCT4/IDST4, IDCT8, IDCT16 and IDCT32 are built from partial 2x2, 4x4, 8x8 and 16x16 stages and add-sub units; constant multiplications (e.g. by 18, 50, 75, 89) are implemented with a LUT and MAC.]
M. Tikekar et al., "Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
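
The even-odd (partial butterfly) idea can be sketched on the smallest case, the 4-point inverse transform. The matrix below uses the HEVC 4-point core transform coefficients (64, 83, 36); the rounding shifts and clipping of a real decoder are omitted, and the function names are made up for this example.

```python
# 4-point HEVC core transform matrix (rows are basis vectors)
M4 = [
    [64, 64, 64, 64],
    [83, 36, -36, -83],
    [64, -64, -64, 64],
    [36, -83, 83, -36],
]


def inverse4_direct(y):
    """Inverse transform as a plain matrix product x = M4^T * y (16 multiplications)."""
    return [sum(M4[k][n] * y[k] for k in range(4)) for n in range(4)]


def inverse4_butterfly(y):
    """Same result via even/odd decomposition (6 multiplications)."""
    odd0 = 83 * y[1] + 36 * y[3]
    odd1 = 36 * y[1] - 83 * y[3]
    even0 = 64 * (y[0] + y[2])
    even1 = 64 * (y[0] - y[2])
    # Outputs are mirrored pairs: sums give the first half, differences the second
    return [even0 + odd0, even1 + odd1, even1 - odd1, even0 - odd0]


coefficients = [10, -3, 2, 1]
print(inverse4_direct(coefficients))
print(inverse4_butterfly(coefficients))
```

The even part of a larger transform is itself the next smaller transform, which is what makes it possible to share hardware across transform sizes.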

Inverse Transform
• Larger transform → Larger transpose memory
  – Use SRAM rather than registers to reduce area cost
  – SRAM has limited read/write ports (requires careful mapping)
• 4 pixels/cycle throughput per 1-D transform
[Figure: mapping of a 32 × 32 pixel transpose memory onto four SRAM banks (Bank 0 to Bank 3) in units of 4x4 blocks, with row/column select. Datapath: Coeffs, Dequantize, Transform with Transpose Memory, Residue.]
M. Tikekar et al., "Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Hardware HEVC Decoder

Video Coding Standard      HEVC (HM4)
Technology                 TSMC 40-nm
Core Area                  1.33 x 1.33 mm
Gate Count                 715k
On-Chip Memory (SRAM)      124 kB
Resolution / Frame Rate    4kx2k @ 30fps (3840x2160)
Frequency                  200 MHz
Core Voltage               0.9 V
Power                      76 mW

[Figure: die photo, 2.18 mm × 2.18 mm, showing Dispatch/MC Cache, Entropy Decoder, Prediction, Inverse Transform, Deblock and SRAM.]
C.-T. Huang et al., "A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications," IEEE ISSCC, 2013
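
Two of the headline numbers for this chip can be cross-checked from the table: the pixel rate in the paper title and the energy per pixel. This is a small arithmetic sketch; the function name is invented for the example.

```python
def energy_per_pixel_nj(power_watts, width, height, fps):
    """Return (energy per pixel in nJ, pixel rate in pixels/s)."""
    pixel_rate = width * height * fps
    return power_watts / pixel_rate * 1e9, pixel_rate


energy_nj, pixel_rate = energy_per_pixel_nj(76e-3, 3840, 2160, 30)
print(f"pixel rate: {pixel_rate / 1e6:.0f} Mpixels/s")
print(f"core energy: {energy_nj:.2f} nJ/pixel")
```

The pixel rate works out to roughly 249 Mpixels/s, matching the title of the ISSCC paper.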

Area Breakdown

Logic [kgates]
MC cache                    126
Deblock                     49.9
Entropy Decoder             94.5
Inverse Transform           121.1
Memory Interface Arbiter    13.7
Prediction                  191.9
RegFiles                    75.5
Others                      42

Memory (SRAM) [kbits]
Pipeline Buffers            447.3
MC-related SRAM             200.4
Line Buffers                337
Others                      32.8

M. Tikekar et al., "Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Power Breakdown

Prediction                  23%
Deblocking                  3%
MC Cache                    26%
Inverse Transform           17%
Memory Interface Arbiter    2%
Entropy Decoder             3%
Line Buffers                2%
Pipeline Buffers            10%
Others                      13%

M. Tikekar et al., "Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Hardware vs. Software
[Figure: two breakdowns side by side, Hardware (power) and Software (cycles). The hardware chart repeats the power breakdown: Prediction 23%, Deblocking 3%, MC Cache 26%, Inverse Transform 17%, Memory Interface Arbiter 2%, Entropy Decoder 3%, Line Buffers 2%, Pipeline Buffers 10%, Others 13%.]

ASIC Decoder Comparison

                             This Work              | ISSCC'12 [2]     | ISSCC'10 [3]         | ISSCC'06 [4]
Standard                     HEVC ("H.265") WD4     | H.264/AVC HP/MVC | H.264/AVC HP/SVC/MVC | H.264/AVC MP
Max Specification            3840x2160 @30fps       | 7680x4320 @60fps | 4096x2160 @24fps     | 1920x1080 @30fps
Gate Count                   715K                   | 1338K            | 414K                 | 160K
On-Chip SRAM                 124KB                  | 80KB             | 9KB                  | 5KB
Technology                   40nm/0.9V              | 65nm/1.2V        | 90nm/1.0V            | 0.18µm/1.8V
Normalized Core Power*       0.31nJ/pixel           | 0.21nJ/pixel     | 0.28nJ/pixel         | 5.11nJ/pixel
Normalized DRAM Power*       0.88nJ/pixel           | 1.27nJ/pixel     | N/A                  | N/A
Normalized System Power***   1.19nJ/pixel           | 1.48nJ/pixel     | N/A                  | N/A
DRAM Configuration           32b DDR3               | 64b DDR2         | N/A                  | 32b DDR + 32b SDR

*   Power for max specification
**  Modeled by [5]
*** System Power = Core Power + DRAM Power
Slide Source: C.-T. Huang et al., "A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications," IEEE ISSCC, 2013.

Decoder Power Comparison
[Figure: energy per pixel (nJ, 0.0 to 2.5) versus year (2006 to 2014) for H.264/AVC and H.265/HEVC decoders, with die photos. Labels in the figure, in order: H.265/HEVC [WD4] Decoder (76mW), C.T. Huang et al. (MIT), ISSCC 2013; H.264/AVC Decoder (51mW), P.K. Tsung et al. (NTU), ISSCC 2011; TSMC 40nm, 0.9V, Ultra-HD 4K @ 30 fps; 0.7-V 720p-HD @ 30 fps H.264/AVC Decoder (2mW), Sze et al. (MIT), JSSC 2009.]
Low Power Approaches
• Operate at voltage near minimum energy point
• Utilize parallelism and pipelining to achieve performance
• Adaptive/dynamic voltage frequency scaling
• Optimize access patterns to reduce memory power
Reduce Cycles → Reduce Freq. → Reduce Voltage → Reduce Power
Figure: delay and energy per operation versus supply voltage, with delay marks T and 2T.
V. Sze et al., “A 0.7-V 1.8-mW H.264/AVC 720p Video Decoder,” IEEE Journal of Solid State Circuits, 2009.
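The chain "fewer cycles, lower frequency, lower voltage, lower power" can be illustrated with the standard first-order CMOS dynamic-power model P ≈ C·V²·f. The model choice, the function name `dynamic_power`, and all numbers below are illustrative assumptions; they are not taken from the slide or from the cited chip, and leakage is ignored.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order CMOS dynamic power: P = C_eff * Vdd^2 * f (leakage ignored)."""
    return c_eff * vdd ** 2 * freq


# Hypothetical design point: halving the cycles per frame lets the clock be
# halved at the same throughput, which may in turn allow a lower supply voltage.
baseline = dynamic_power(c_eff=1.0, vdd=1.0, freq=1.0)
scaled = dynamic_power(c_eff=1.0, vdd=0.7, freq=0.5)
relative = scaled / baseline
print(f"relative power after scaling: {relative:.3f}")
```

Because voltage enters quadratically, most of the saving in this toy example comes from the voltage reduction that the lower frequency makes possible, not from the frequency reduction itself.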
Encoder Decisions
• Encoder must search for the mode that gives the “best” compression. Some of the key decisions include
– CU and PU size
– Inter or Intra CU
– Motion Vector
– Intra Prediction Mode
• “Best” compression is defined using a rate-distortion cost
$$D + \lambda \cdot R$$
• where
– D is the distortion between the original and the compressed image (a measure of the visual quality of the compression)
– R is a measure of the number of bits required to signal the compressed image
– λ is the Lagrangian multiplier that weights the distortion and rate costs
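A minimal sketch of how the cost D + λ·R drives a mode decision. The candidate names and their distortion and rate numbers are hypothetical, and `rd_cost` and `best_mode` are illustrative helper names, not HM functions.

```python
def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost D + lambda * R."""
    return distortion + lam * rate_bits


def best_mode(candidates, lam):
    """candidates: iterable of (name, distortion, rate_bits).

    Returns the name of the candidate with the least rate-distortion cost.
    """
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))[0]


# Hypothetical candidates: (name, distortion, rate in bits).
modes = [("intra", 900.0, 40), ("inter", 500.0, 120), ("skip", 1500.0, 2)]

for lam in (1, 10, 20):
    print(lam, best_mode(modes, lam))
```

A small λ favours the low-distortion candidate even if it costs many bits; a large λ pushes the decision towards the cheapest candidate to signal.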
Full vs. Fast RDO
Perform rate-distortion optimization (RDO)
• Full RDO
– Distortion based on sum of squared differences (SSD), includes quantization
– Rate based on entropy coded bits of prediction info and quantized coefficients
• Fast RDO
– Distortion approximation based on sum of absolute differences (SAD) or sum of absolute transformed differences (SATD)
– Rate approximation based on prediction info bits (intra mode or motion vector); can include number of non-zero coefficients to predict coefficient bits
Figure: RDO Flow in HM. Intra Prediction and Motion Estimation feed a Fast RDO stage (30+ modes), followed by a Full RDO Pass (T, Q, IQ, IT, SSD, CABAC Rate) and the Final Mode Decision. T/Q: Transform/Quantization; IT/IQ: Inverse Transform / Quantization.
S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
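The three distortion measures named above can be sketched in plain Python for a 4×4 block. This is an unnormalized illustration: HM applies its own scaling and uses optimized routines, and the helper names here are assumptions.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]


def residual(orig, pred):
    """Element-wise difference between original and predicted blocks."""
    return [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]


def sad(orig, pred):
    """Sum of absolute differences."""
    return sum(abs(d) for row in residual(orig, pred) for d in row)


def ssd(orig, pred):
    """Sum of squared differences."""
    return sum(d * d for row in residual(orig, pred) for d in row)


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def satd4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences (no normalization)."""
    t = matmul(matmul(H4, residual(orig, pred)), H4)
    return sum(abs(v) for row in t for v in row)
```

SAD needs only subtractions and additions, SATD adds a transform that tracks coefficient cost more closely, and SSD in full RDO is measured after quantization and reconstruction, which is why it is the most expensive of the three.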
CU and PU decisions
• The encoder must decide how best to divide a CTU into CUs, and how to divide the CUs into PUs (based on full RDO in HM)
• For CTU of 64x64
– CU options: 64x64, 32x32, 16x16, 8x8
• For Inter-coded CU
– PU options
• For Intra-coded CU
– PU options
Figure: PU partition diagrams for inter-coded and intra-coded CUs, with block edges labeled N, N/2, N/4 and 3N/4.
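The CU split decision can be viewed as a recursive comparison of "code this block whole" against "split into four and code each quadrant". The sketch below assumes a caller-supplied `cost_fn(x, y, size)` that returns the best rate-distortion cost of coding that square as one CU (including any PU choice and signaling bits); that function and the name `best_cu_partition` are hypothetical, and this is not the HM implementation.

```python
def best_cu_partition(x, y, size, cost_fn, min_size=8):
    """Return (cost, leaves) for the cheapest quadtree split of a square block.

    leaves is a list of (x, y, size) tuples describing the chosen CUs.
    """
    whole = cost_fn(x, y, size)
    if size <= min_size:
        return whole, [(x, y, size)]
    half = size // 2
    split_cost, split_leaves = 0, []
    for dy in (0, half):
        for dx in (0, half):
            c, leaves = best_cu_partition(x + dx, y + dy, half, cost_fn, min_size)
            split_cost += c
            split_leaves += leaves
    if split_cost < whole:
        return split_cost, split_leaves
    return whole, [(x, y, size)]


# Toy cost models: a flat cost never benefits from splitting, while a cost
# that grows faster than the block area always does.
flat = best_cu_partition(0, 0, 64, lambda x, y, s: 1)
steep = best_cu_partition(0, 0, 64, lambda x, y, s: s ** 3)
print(flat[0], len(flat[1]), steep[0], len(steep[1]))
```

For a 64×64 CTU with an 8×8 minimum CU, this exhaustive recursion evaluates 1 + 4 + 16 + 64 = 85 blocks, which is one reason hardware encoders prune the search.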
Motion Estimation
• Search for block in reference frame(s) to predict current block with least rate-distortion cost
– Signal block in previous frame using a motion vector
• Typically most computationally intensive function in encoder
Search algorithm considerations
1. Number of candidates
– Number of computations
– Number of memory accesses
2. Off-chip bandwidth
3. On-chip bandwidth
Motion Estimation in HM
• Integer pixel motion estimation
– Rate is the bits required to transmit the motion data (including impact of motion predictor)
– Distortion is calculated from the SAD of original and motion-compensated prediction (subsampled when block size > 8)
$$\arg\min_{MV,\,REF} \; \sum_{i,j} \left| \mathrm{Diff}(i,j) \right| + \lambda \cdot R(MV, REF)$$
where
– MV = motion vector (include impact of advanced mv predictor)
– REF = reference index
Figure: candidate positions A0, A1, B0, B1, B2 around the current PU, and positions CR and H of the co-located PU.
K. McCann et al., “High Efficiency Video Coding (HEVC) Test Model 14 (HM 14) Encoder Description,” JCTVC-P1002, 2014
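A minimal sketch of the integer-pel cost SAD + λ·R(MV), evaluated by exhaustive search over a small window. The rate model is an assumption: it charges the signed Exp-Golomb code length of each motion vector difference component, which is not HM's actual bit estimate. Frames are plain lists of rows, a single reference picture is assumed, and the caller must keep the search window inside the reference frame.

```python
def se_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code for integer v."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def mv_rate(mv, mvp):
    """Assumed rate model: bits for the difference between mv and its predictor."""
    return se_golomb_bits(mv[0] - mvp[0]) + se_golomb_bits(mv[1] - mvp[1])


def block_sad(cur, ref, x0, y0, mv):
    """SAD between the current block and the reference block displaced by mv."""
    return sum(abs(cur[j][i] - ref[y0 + mv[1] + j][x0 + mv[0] + i])
               for j in range(len(cur)) for i in range(len(cur[0])))


def full_search(cur, ref, x0, y0, mvp, search_range, lam):
    """Return (cost, mv) minimizing SAD + lam * rate over a square window."""
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            mv = (dx, dy)
            cost = block_sad(cur, ref, x0, y0, mv) + lam * mv_rate(mv, mvp)
            if best is None or cost < best[0]:
                best = (cost, mv)
    return best


# Toy data: the 2x2 block at (3, 3) was copied from the reference at (4, 2).
ref = [[x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[2][4], ref[2][5]], [ref[3][4], ref[3][5]]]
result = full_search(cur, ref, 3, 3, (0, 0), 2, 1)
print(result)
```

The rate term depends on the predictor, so a better motion vector predictor lowers the cost of the same motion vector.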
Motion Estimation in HM
• Integer pixel motion estimation
– Search Strategy
1. Search center is motion vector predictor
2. Diamond search around center (search range = 64 → 7 steps [1, 2, 4, ..., 64]); early termination if best candidate doesn’t change in 3 steps.
3. If best candidate > 5 pixels away from search center, do raster scan search (5 pixel steps).
4. Perform diamond search around best candidate from step 2 or 3. If new best candidate found, repeat step 4.
Reference
• K. McCann et al., “High Efficiency Video Coding (HEVC) Test Model 14 (HM 14) Encoder Description,” JCTVC-P1002, 2014
• M. Sinangil, PhD Thesis, MIT, 2012
Image Source: N. Purnachand et al., IEEE ICCE-Berlin, 2012
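The search strategy above can be sketched in simplified form. This is not HM's search code. It makes several labeled simplifications: only four points per diamond step, no raster scan stage (step 3), and no clamping of the refinement to the original search window. `cost` is any caller-supplied function mapping a candidate `(x, y)` to its rate-distortion cost.

```python
def diamond_pass(cost, center, best, best_cost, search_range, patience=3):
    """Expanding diamond around `center` with step sizes 1, 2, 4, ...

    Stops early once `patience` consecutive steps bring no improvement.
    """
    unchanged = 0
    step = 1
    while step <= search_range:
        improved = False
        for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand = (center[0] + dx, center[1] + dy)
            c = cost(cand)
            if c < best_cost:
                best, best_cost, improved = cand, c, True
        unchanged = 0 if improved else unchanged + 1
        if unchanged >= patience:
            break
        step *= 2
    return best, best_cost


def diamond_search(cost, predictor, search_range=64):
    """Start at the predictor, then repeat diamond passes around the best point."""
    best, best_cost = predictor, cost(predictor)
    best, best_cost = diamond_pass(cost, predictor, best, best_cost, search_range)
    while True:
        new_best, new_cost = diamond_pass(cost, best, best, best_cost, search_range)
        if new_best == best:
            return best, best_cost
        best, best_cost = new_best, new_cost


# Toy cost surface with its minimum at (5, -3).
found = diamond_search(lambda p: abs(p[0] - 5) + abs(p[1] + 3), (0, 0))
print(found)
```

On a smooth cost surface this visits far fewer candidates than a full search of the same range, at the risk of stopping in a local minimum on real content.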
Motion Estimation in HM
• Half pixel motion estimation
– Rate is the bits required to transmit the motion data (including impact of motion predictor)
– Distortion is calculated from SATD
• Block-wise 4×4 or 8×8 Hadamard transform on difference between original and motion-compensated prediction, and sum absolute coefficients
– Search 8 points surrounding best integer motion vector
• Quarter pixel motion estimation
– Same rate and distortion calculation as half pixel
– Search 8 points surrounding best half pixel motion vector
• Also do search for merge/skip candidates
K. McCann et al., “High Efficiency Video Coding (HEVC) Test Model 14 (HM 14) Encoder Description,” JCTVC-P1002, 2014
Multiple Searches in Parallel
Compared to HM
• 2x fewer candidates
• 1%-3% coding loss
M. E. Sinangil et al., “Cost and Coding Efficient Motion Estimation Design Considerations for High Efficiency Video Coding (HEVC) Standard,” IEEE Journal of Selected Topics in Signal Processing, 2013.
Parallel Motion Estimation
• Perform motion estimation for each PU in inter-coded CU
• Process CUs in parallel to increase throughput
– Share search pixels across engines to reduce memory bandwidth by 8x
M. E. Sinangil et al., “Cost and Coding Efficient Motion Estimation Design Considerations for High Efficiency Video Coding (HEVC) Standard,” IEEE Journal of Selected Topics in Signal Processing, 2013.
Reduce Number of PUs Processed
M. E. Sinangil et al., “Cost and Coding Efficient Motion Estimation Design Considerations for High Efficiency Video Coding (HEVC) Standard,” IEEE Journal of Selected Topics in Signal Processing, 2013.
Figure: Trade-off between coding efficiency (BD-rate) and complexity (area cost) for different numbers of inter predicted partition units. Axes: Coding Loss (BD-rate) and Area Savings (Mgates); configurations are labeled 1 to 11 (legend: Number of Partition Units), with one marked “Only Square PUs”.
Smallest slope provides best trade-off: #3
M. E. Sinangil et al., “Cost and Coding Efficient Motion Estimation Design Considerations for High Efficiency Video Coding (HEVC) Standard,” IEEE Journal of Selected Topics in Signal Processing, 2013.
Motion Estimation with CU
• In HM, motion estimation done serially for PU within CU to get AMVP for accurate rate estimate
Can’t process PU1 and PU2 in parallel
Figure: a CU containing PU1 and PU2, with candidate positions A0, A1, B0, B1, B2 around the current PU and positions CR and H of the co-located PU.
Parallel Motion Estimation
• HEVC has “Parallel Motion Estimation” feature to turn off dependency within a Motion Estimation Region (MER)
– PU within region cannot use data from other PU in region
– All PUs in region can be processed in parallel at encoder
Can process PU1 and PU2 in parallel
Figure: PU1 and PU2 inside one MER, and a CTU divided into MER0, MER1, MER2 and MER3 with positions marked X (multiple MERs per CTU).
M. Zhou, “Parallelized merge/skip mode for HEVC,” JCTVC-F069, 2011
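The MER rule can be expressed as a coordinate test: a neighboring position is not used as a candidate when it falls in the same region as the current PU. The sketch below assumes square regions whose size is a power of two (given as `log2_mer_size`) and uses hypothetical helper names; other availability checks, such as picture or slice boundaries, are left out.

```python
def in_same_mer(cur_xy, nb_xy, log2_mer_size):
    """True if both luma positions fall inside the same estimation region."""
    return ((cur_xy[0] >> log2_mer_size) == (nb_xy[0] >> log2_mer_size) and
            (cur_xy[1] >> log2_mer_size) == (nb_xy[1] >> log2_mer_size))


def usable_neighbors(cur_xy, neighbors, log2_mer_size):
    """Keep only neighbor positions outside the current PU's region."""
    return [nb for nb in neighbors if not in_same_mer(cur_xy, nb, log2_mer_size)]


# 16x16 regions (log2 size 4); the current PU starts at (20, 20).
kept = usable_neighbors((20, 20), [(19, 27), (15, 20), (20, 15)], 4)
print(kept)
```

Because no PU inside a region depends on another PU in the same region, an encoder can evaluate all of them concurrently; the price is that some nearby candidates are no longer available.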
CTU Processing Order
• In HM, CTU processed in raster scan order
• Change CTU processing order to reduce reads from picture buffer (off-chip memory bandwidth) due to increased data locality
• Requires frame decoupling with entropy encoder (as entropy encoder must generate bitstream in raster scan order to be standard compliant)
Figure: Raster Scan versus Alternative Scan (n=4, m=2).
S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
Additional Complexity Reductions
• Bottom-up approach
– Derive distortion cost for PU from sub-PUs (e.g. compute distortion of 16×16 PU from four 8×8 PUs)
– Requires storage of sub-PU SADs
• Reduce bit-width for distortion calculation
• Use bilinear interpolation for fractional motion estimation
$$\mathrm{SAD}_{16}(X) = \mathrm{SAD}_{8}(A) + \mathrm{SAD}_{8}(B) + \mathrm{SAD}_{8}(C) + \mathrm{SAD}_{8}(D)$$
Figure: a 16×16 block X divided into four 8×8 blocks A, B (top) and C, D (bottom).
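A small sketch of the bottom-up SAD reuse: the four 8×8 SADs are computed once and summed to obtain the 16×16 SAD. It assumes both blocks are already aligned, that is, the same motion vector candidate applies to all four sub-blocks; the function names are illustrative.

```python
def sad_block(cur, ref, x0, y0, size):
    """SAD over the size x size square whose top-left corner is (x0, y0)."""
    return sum(abs(cur[y][x] - ref[y][x])
               for y in range(y0, y0 + size) for x in range(x0, x0 + size))


def sad16_from_sad8(cur, ref):
    """Return (SAD16, dict of the four stored SAD8 values keyed by corner)."""
    sad8 = {(x0, y0): sad_block(cur, ref, x0, y0, 8)
            for y0 in (0, 8) for x0 in (0, 8)}
    return sum(sad8.values()), sad8


# Arbitrary deterministic 16x16 test blocks.
cur = [[(3 * x + 5 * y) % 17 for x in range(16)] for y in range(16)]
ref = [[(7 * x + y) % 13 for x in range(16)] for y in range(16)]
total, parts = sad16_from_sad8(cur, ref)
print(total, total == sad_block(cur, ref, 0, 0, 16))
```

The same stored 8×8 values can also be combined into 16×8 and 8×16 SADs, so the pixel differences are only computed once per candidate.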
Intra Prediction Search in HM
• Rough mode decision: select the N best modes out of 35
– N equals 8 for 4×4, 8×8
– N equals 4 for 16×16, 32×32, 64×64
– Hadamard Cost Ranking (SATD distortion and mode bits for rate)
• Determine three Most Probable Modes (MPM)
– Spatial neighbors to the left (A) and above (B)
– If neighbors not available or redundant (A=B), use DC, Planar, vertical or adjacent angles (+/-1)
• Decide between rough mode + MPM candidates
– Full RDO (SSD for distortion and mode + coefficient bits for rate)
Figure: current PU with neighbors A (left) and B (above).
Y. Piao et al., “Encoder Improvement of Unified Intra Prediction,” JCTVC-C207, Oct. 2010.
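A sketch of the three-MPM derivation summarized above, following the HEVC luma rule as the editor understands it: mode numbers are 0 (Planar), 1 (DC) and 2 to 34 (angular, with 26 vertical), and an unavailable neighbor is treated as DC. The function name is an assumption, and the standard's extra handling of an above neighbor lying in a different CTU is omitted.

```python
PLANAR, DC, VERTICAL = 0, 1, 26


def most_probable_modes(left, above):
    """Return the three MPMs given the left (A) and above (B) intra modes.

    Pass None for a neighbor that is unavailable or not intra coded.
    """
    a = DC if left is None else left
    b = DC if above is None else above
    if a == b:
        if a < 2:  # both Planar or both DC
            return [PLANAR, DC, VERTICAL]
        # Angular mode: add its two adjacent angles, wrapping within 2..34.
        return [a, 2 + ((a + 29) % 32), 2 + ((a - 2 + 1) % 32)]
    if PLANAR not in (a, b):
        third = PLANAR
    elif DC not in (a, b):
        third = DC
    else:
        third = VERTICAL
    return [a, b, third]


print(most_probable_modes(None, None))
print(most_probable_modes(10, 10))
print(most_probable_modes(10, 0))
```

These three modes are added to the rough-mode candidates because they are the cheapest modes to signal.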
Additional Complexity Reduction
• To reduce search space, use coarse search with angular prediction, then refinement around coarse angles
• Skip 64×64 PU size
– Since max TU is 32×32, prediction done at 32×32; thus only benefit of 64×64 intra-PU is signaling
• To increase throughput, use original pixels for intra prediction (rather than reconstructed pixels) to avoid dependence on reconstruction feedback loop
Above techniques have cumulative coding loss of 1%
S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
Hardware-Friendly RDO Pipeline
Only do full RDO on best Inter and Intra mode for each CU-depth (6% coding loss)
Figure: PU-Mode Pre-decision (Fast RDO over Intra Pred Dirs. and Inter PU Sizes & MVs) followed by CU-Layer High Complexity Mode Decision (HCMD Cost via Full RDO) for the 64X64 CU (CU0), 32X32 CU (CU1) and 16X16 CU (CU2), ending in the Final Mode Decision.
S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
Hardware HEVC Encoder

| Video Coding Standard | HEVC (WD4) |
| Technology | TSMC 28-nm HPM |
| Core Area | 5×5 mm² |
| Gate Count | 8350k |
| On-Chip Memory (SRAM) | 7.14 MB |
| Resolution / Frame Rate | 8192×4320 @ 30fps |
| Frequency | 312 MHz |
| Power | 708 mW |

S.-F. Tsai et al., “A 1062Mpixels/s 8192×4320p High Efficiency Video Coding (H.265) encoder chip,” IEEE VLSIC, 2013
ASIC Encoder Comparison
S.-F. Tsai et al., “A 1062Mpixels/s 8192×4320p High Efficiency Video Coding (H.265) encoder chip,” 2013 Symposium on VLSIC, 2013
Part IV: Emerging Applications and HEVC Extensions
What’s Next
• More compression efficiency
– Yes, in 5-10 years. Especially since video delivery is moving from traditional broadcast model to IP delivery and one-to-one streaming
– Analogy: Public transport versus individual cars
• Other considerations have become important too:
– Power consumption, complexity, throughput
– Ability to support new functionalities, modalities etc.
Image: Dallas High Five
Changing Landscape of Video Coding Applications (1)
• Need for supporting diverse clients with varying capabilities (resolution, computational power etc.)
Image source: Samsung, Youtube
Changing Landscape of Video Coding Applications (2)
• Immersive experience
– Multiple cameras and at higher video resolutions (1080p → 4K → 8K)
– Multiple displays, bigger displays (1080p → 4K → 8K)
– Free-viewpoint video, 360degree video, augmented reality, 3D movies
– Demos
• http://replay-technologies.com/
• http://www.kolor.com/video
Image source: Cisco, Kolor
Changing Landscape of Video Coding Applications (3)
• Growing requirement to support mixed format content consisting of natural video + graphics/text
Scalable Video Coding
Supporting Diverse Clients - Simulcasting
Figure: the server runs three separate encodes (640×480, 1280×960, 2560×1920) and sends Bitstream 1, Bitstream 2 and Bitstream 3 to the clients.
Can we do better?
Scalable Video Coding
Figure: a single bitstream (… 0110111 …) supporting quality (SNR) scalability, temporal scalability and spatial scalability.
Spatial Scalability
Figure source: T. Wiegand, JVT-W132 [1].
Layer N – E.g. 640×480 (Base layer)
Layer N+1 – 1280×960 (Enhancement layer)
• Layered coding
• Higher layers have higher spatial resolution when compared to lower layers
• Upper layers reuse data from lower layers
Temporal Scalability
• IPPP coding: I P P P P P P P P
• IBBP coding: P I B B P I B B B B P
• Hierarchical B-frames: I b B b P b B b P
• Hierarchical P-frames: I p P p P p P p P
• p, b – Non-reference frames
HEVC Scalable Extension (SHVC)
Figure: block diagram with a base layer decoder (BL bitstream, BL frame buffer, BL decoded pictures), an upsampler, and an enhancement layer decoder (EL bitstream, EL frame buffer, EL decoded pictures).
• SHVC: Scalable extension: Expected July 2014
• EL – Enhancement layer, BL – Base layer
SHVC Performance
D.-K. Kwon, M. Budagavi, “Combined scalable and mutiview extension of High Efficiency Video Coding (HEVC),” IEEE Picture Coding Symposium, pp. 414 – 417, 2013.
• 2x scalability (i.e. base layer is half the size of enhancement layer) compared to simulcast
• Quality (SNR) scalability compared to simulcast

| Coding configuration | BD-Rate savings |
| All Intra coding | 23% |
| Random access (Hierarchical-B) | 16% |

| Coding configuration | BD-Rate savings |
| All Intra coding | 28% |
| Random access (Hierarchical-B) | 20% |
Multiview Video Coding
Multiview Video Capture
Figure: examples of stereo/3D video, 360degree video and free viewpoint video capture.
Image source: Fuji, Kolor
Stereoscopic Video Coding
Figure: camera modules produce a left view and a right view, which pass through stereo video encoding into a stereo video bitstream; stereo video decoding recovers the left view and right view for a 3D display.
Image source: Samsung
Redundancy in Stereo Video
Figure: left view and right view.
Multiview Video Coding – Picture Prediction Structures (1)
• Linear camera array
Figure: Simulcast prediction structure across views S0 S1 S2 S3 S4 S5 S6 S7.
predicBon
of
anchor
frames
Mul/view
Video
Coding
–
Picture
Predic/on
Structures
(1)
• Linear
camera
array
S0 S1 S2 S3 S4 S5 S6 S7
114
Multiview Video Coding – Picture Prediction Structures (1)
• Linear camera array
Figure: Both anchor and non-anchor views predicted from other views, across views S0 S1 S2 S3 S4 S5 S6 S7.
HEVC Multiview Extension (MV-HEVC)
Figure: block diagram with a View 0 decoder (View 0 bitstream, View 0 framebuffer, View 0 decoded pictures) and a View 1 decoder (View 1 bitstream, View 1 framebuffer, View 1 decoded pictures) feeding a 3D display.
• MV-HEVC: Multiview extension: Expected July 2014
• View 0: Left view, View 1: Right view
Combined Scalable and Multiview Extension of HEVC
D.-K. Kwon, M. Budagavi, “Combined Scalable and Mutiview Extension of High Efficiency Video Coding (HEVC)”, IEEE Picture Coding Symposium, 2013.
• Applications of the combined scalable and multiview HEVC coding include:
– Scalable stereoscopic video (e.g. 1080p stereo to the emerging 4K stereo),
– Mixed resolution multiview coding
• H.264/AVC does not support combined scalable and multiview coding
• HEVC allows for combined scalable and multiview coding
MV-HEVC + Depth (3D-HTM)
Figure: left view, depth map, and synthesized right view.
• Standardization is ongoing
MV-HEVC + Depth Encoding
Figure: depth estimation, depth coding and view coding for N views + M depth maps.
• Views that are transmitted will be coded using MV-HEVC
• Expect additional 20% gain
MV-HEVC + Depth Decoding
Figure: view decoding and depth decoding followed by view synthesis, producing multiple views.
Screen Content Video Coding
Screen Content Coding
• Applications such as automotive infotainment, wireless displays, remote desktop, remote gaming, cloud computing etc. are becoming popular
• Video in these applications often has mixed content consisting of natural video, text, graphics etc.
– In text and graphics regions, patterns (e.g. text characters, icons, lines etc.) can repeat within a picture
– Also blocks with limited set of colors are possible
Intra Block Copy
Figure: two diagrams showing the current CU and its search area relative to the LCU (64×64).

Bit-rate savings
| | Intra | Random access | Low delay |
| SC RGB 444 | 27.0% | 21.5% | 17.0% |
| SC YUV 444 | 23.5% | 20.2% | 15.9% |

M. Budagavi, D.-K. Kwon, “Intra motion compensation and entropy coding improvements for HEVC screen content coding”, IEEE Picture Coding Symposium, 2013.
Palette Coding
• Input video:
– 8 bits per pixel, per color component
– 4×4 block: 8*3*16 = 384 bits
• Palette coding:
– Color palette: 2 colors in our example: 2*24 = 48 bits
– Color index: 1 bit per pixel in our example: 16 bits
– Total bits: 64 bits
• Note: This slide shows a very simple example for explanatory purposes. Techniques currently being evaluated can use more colors in the palette and more bits for the color index.
Figure: a palette of Color 0 and Color 1, and a 4×4 grid of color indices i0 to i15.
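The bit counts in this example can be reproduced with a few lines of arithmetic. The sketch assumes fixed-length, uncompressed palette entries and indices with no entropy coding, which is a simplification for illustration and not the scheme under standardization; the function names are hypothetical.

```python
def raw_bits(width, height, bits_per_sample=8, components=3):
    """Bits needed to store every sample of the block directly."""
    return width * height * bits_per_sample * components


def palette_bits(width, height, num_colors, bits_per_sample=8, components=3):
    """Bits for a palette of num_colors entries plus one fixed-length index per pixel."""
    palette = num_colors * bits_per_sample * components
    index_bits = max(1, (num_colors - 1).bit_length())
    return palette + width * height * index_bits


print(raw_bits(4, 4), palette_bits(4, 4, 2))
```

With more colors the palette and the per-pixel index both grow; under these assumptions a 4×4 block with four colors needs 4*24 + 16*2 = 128 bits.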
HEVC Screen Content Coding
• HEVC Screen content coding activity
– Started in April 2014
– Expected completion early-mid 2015
• Key tools being studied
– Intra Block Copy with extended search area
– Palette based coding
Summary
• Video content continues to impose a severe burden on today’s global networks
– Rapid growth in the usage and diversity of video applications and services
– Increasing popularity of HD video and emergence of beyond-HD formats accompanied by stereo and multi-view content
• HEVC is the latest video coding standard, which gives 50% improvement in coding efficiency, and is expected to support video applications for the next decade.
• In addition to improving coding efficiency, implementation challenges were also considered to maximize processing speed and minimize hardware cost.
References
• V. Sze, M. Budagavi, G. J. Sullivan (Editors), “High Efficiency Video Coding (HEVC): Algorithms and Architectures,” Springer, 2014
• G. J. Sullivan, et al. “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, 2012
• J. Ohm et al., “Comparison of the Coding Efficiency of Video Coding Standards—Including High Efficiency Video Coding (HEVC),” IEEE Transactions on Circuits and Systems for Video Technology, 2012
HEVC Book
• Introduction
• High-Level Syntax in HEVC
• Block Structures and Parallelism Features in HEVC
• Intra-Picture Prediction in HEVC
• Inter-Picture Prediction in HEVC
• Transform and Quantization in HEVC
• In-Loop Filters in HEVC
• Entropy Coding in HEVC
• Compression Performance Analysis in HEVC
• Decoder Hardware Architecture in HEVC
• Encoder Hardware Architecture in HEVC
http://www.springer.com/engineering/signals/book/978-3-319-06894-7
HEVC Book
The book serves the video engineering community by:
• Providing video application developers an invaluable reference to the latest video standard, High Efficiency Video Coding (HEVC);
• Serving as a companion reference that is complementary to the HEVC standards document produced by the JCT-VC – a joint team of ITU-T VCEG and ISO/IEC MPEG;
• Including in-depth discussion of algorithms and architectures for HEVC by some of the key video experts who have been directly involved in developing and deploying the standard;
• Giving insight into the reasoning behind the development of the HEVC feature set, which will aid in understanding the standard and how to use it.