
Design and Implementation of Next Generation Video Coding Systems (H.265/HEVC Tutorial)

Vivienne Sze (sze@mit.edu)
Madhukar Budagavi (m.budagavi@samsung.com)

ISCAS Tutorial 2014

Instructors

•  Vivienne Sze (Assistant Professor at MIT)
–  Involved with video implementation research and standards for 7+ years
    •  Contributed over 70 technical documents to HEVC.
    •  Within the JCT-VC Committee, Primary Coordinator of the core experiments on coefficient scanning and coding; chairman of ad hoc groups on topics related to entropy coding and parallel processing.
    •  Published over 25 journal and conference papers.
•  Madhukar Budagavi (Research Director at Samsung Research America)
–  Involved with video standards and product development for 15+ years
    •  Contributed over 100 technical documents to HEVC.
    •  Within the JCT-VC Committee, chaired and co-chaired sub-group activities on spatial transforms, quantization, entropy coding, in-loop filtering, intra prediction, screen content coding and scalable HEVC (SHVC).
    •  Published over 40 journal and conference papers and book chapters.

Outline of Tutorial

•  Part I: Overview of current video coding technology and systems
•  Part II: High Efficiency Video Coding (HEVC)
•  Part III: Video Codec Implementations
•  Part IV: Emerging Applications and HEVC Extensions

Part I: Overview of current video coding technology and systems

Growing Demand for Video

•  Video exceeds half of internet traffic and will grow to 86 percent by 2016. Increase in applications, content, fidelity, etc. → Need higher coding efficiency!
•  Ultra-HD 4K broadcast expected for Japan in 2014. London Olympics Opening and Closing Ceremonies shot in Ultra-HD 8K. → Need higher throughput!
•  25x increase in mobile data traffic over next five years. Video is a “must have” on portable devices. → Need lower power!

Sources: Cisco Visual Networking Index; Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update

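Digital Video

[Figure: frames 0 to 3 of a video sequence, each made up of Y, Cb, and Cr planes in 4:2:0 format. The Y plane is W × H samples; the Cb and Cr planes are each W/2 × H/2 samples.]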

Video Compression

•  Uncompressed 1080p high definition (HD) video at 24 frames/second
–  Pixels per frame: 1920×1080
–  Bits per pixel: 8 bits × 3 (RGB)
–  1.5 hours: 806 GB
–  Bit-rate: 1.2 Gbits/s
•  Blu-ray Disc
–  Capacity: 25 GB (single layer)
–  Read rate: 36 Mbits/s
•  Video Streaming or TV Broadcast
–  1 Mbits/s to 20 Mbits/s
•  Require 30x to 1200x compression
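The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below only uses the numbers quoted above; the variable and function names are illustrative.

```python
# Raw bit rate and storage for uncompressed 1080p video, from the slide's numbers.
WIDTH, HEIGHT = 1920, 1080
BITS_PER_PIXEL = 8 * 3          # 8 bits for each of R, G, B
FRAMES_PER_SECOND = 24
DURATION_S = 1.5 * 3600         # 1.5 hours

raw_bps = WIDTH * HEIGHT * BITS_PER_PIXEL * FRAMES_PER_SECOND
raw_bytes = raw_bps * DURATION_S / 8


def compression_ratio(target_bps):
    """How much the raw stream must shrink to fit a target bit rate."""
    return raw_bps / target_bps


print(f"raw bit rate: {raw_bps / 1e9:.2f} Gbit/s")
print(f"1.5 hours:    {raw_bytes / 1e9:.0f} GB")
print(f"36 Mbit/s (Blu-ray read rate): {compression_ratio(36e6):.0f}x")
print(f"1 Mbit/s (streaming):          {compression_ratio(1e6):.0f}x")
```

The two ratios bracket the "30x to 1200x" range on the slide: roughly 33x to fit the Blu-ray read rate and roughly 1200x to fit a 1 Mbit/s stream.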

Video Compression Basics

•  Compression is achieved by removing redundant information from the video sequence
•  Types of redundancies in video sequences
–  Spatial redundancy
–  Perceptual redundancy
–  Statistical redundancy
–  Temporal redundancy

Spatial Redundancy Removal (1)

•  Intra prediction

[Figure: in frame 0, the current block to be coded is predicted horizontally from the previous block (intra prediction), and only the difference is encoded.]

Spatial Redundancy Removal (2)

•  Block Transforms
–  Typically matrix operations
–  Used for correlation reduction and energy compaction in the block

8×8 block of pixel values:

151 149 145 140 136 133 128 120
150 147 144 140 136 132 127 118
149 145 142 138 135 129 122 116
147 143 139 136 131 126 120 113
141 139 137 132 127 124 116 109
138 135 133 130 125 120 113 106
135 131 130 128 123 117 111 105
132 130 129 126 120 115 109 105

After the 8×8 2D Discrete Cosine Transform (DCT):

1037 80 0 9 0 4 0 0
49 1 3 3 0 0 0 1
0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0
1 1 1 1 2 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 0 1 0
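The energy compaction in this example can be checked with a direct, unoptimized 2-D DCT. This is a minimal sketch assuming the orthonormal DCT-II; the slide does not say which normalization or rounding was used for its coefficient matrix, so only the DC term is compared.

```python
import math

BLOCK = [
    [151, 149, 145, 140, 136, 133, 128, 120],
    [150, 147, 144, 140, 136, 132, 127, 118],
    [149, 145, 142, 138, 135, 129, 122, 116],
    [147, 143, 139, 136, 131, 126, 120, 113],
    [141, 139, 137, 132, 127, 124, 116, 109],
    [138, 135, 133, 130, 125, 120, 113, 106],
    [135, 131, 130, 128, 123, 117, 111, 105],
    [132, 130, 129, 126, 120, 115, 109, 105],
]


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block (direct O(n^4) form)."""
    n = len(block)
    scale = [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)
    basis = [[math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
             for k in range(n)]
    return [[scale[u] * scale[v] * sum(block[y][x] * basis[u][y] * basis[v][x]
                                       for y in range(n) for x in range(n))
             for v in range(n)]
            for u in range(n)]


coeffs = dct_2d(BLOCK)
energy = sum(c * c for row in coeffs for c in row)
dc_share = coeffs[0][0] ** 2 / energy   # fraction of energy in the DC coefficient
print(round(coeffs[0][0]), f"{dc_share:.3f}")
```

With this normalization the DC coefficient is the block sum divided by 8, which rounds to the 1037 shown on the slide, and almost all of the block's energy sits in that single coefficient.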

Perceptual Redundancy Removal (1)

•  Not all video data are equally significant from a perceptual point of view
•  Make use of the properties of the Human Visual System (HVS)
–  HVS is more sensitive to low frequency information

[Figure: image content at low frequency and at high frequency.]
Perceptual Redundancy Removal (2)

•  Quantization is a good tool for perceptual redundancy removal
–  Most significant bits (MSBs) are perceptually more important than least significant bits (LSBs)
–  Coefficient dropping (quantization with zero bits) example:

[Figure: original frame next to the image obtained by retaining 36 DCT coefficients for each 8×8 block.]

Statistical Redundancy Removal (1)

•  Not all pixel values in an image (or in the transformed image) occur with equal probability
•  Use entropy coding (e.g. variable length coding)
–  Shorter codewords used to represent more frequent values
–  Longer codewords used to represent less frequent values

Statistical Redundancy Removal (2)

•  Original image: 8 bits/pixel, Entropy coding: 7.14 bits/pixel
•  Results more dramatic when entropy coding is applied on transformed and quantized image: 1.82 bits/pixel

[Figure: histogram of the original image (values 0 to 250) and histogram of the transformed and quantized image (values -500 to 2000, counts on a ×10^4 scale).]
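The bits/pixel figures above are entropy estimates taken from a histogram. A minimal sketch of that calculation follows; the two sample distributions are made up for illustration and are not the images used on the slide.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Shannon entropy of the empirical distribution, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical data: a flat histogram versus a peaked one, as you would expect
# after transform and quantization (most values are zero).
flat = list(range(256))
peaked = [0] * 90 + [1] * 4 + [-1] * 4 + [2] + [-2]

print(f"flat:   {entropy_bits_per_symbol(flat):.2f} bits/symbol")
print(f"peaked: {entropy_bits_per_symbol(peaked):.2f} bits/symbol")
```

A flat histogram needs the full 8 bits per symbol, while the peaked one needs well under 1 bit, which is why entropy coding pays off most after transform and quantization.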

Temporal Redundancy Removal (1)

•  Inter prediction
•  Frame difference coding
–  Difference can be encoded using DCT + Quantization + Entropy Coding

[Figure: Frame 3, Frame 4, and the difference Frame 4 – Frame 3.]

Temporal Redundancy Removal (2)

•  Inter prediction using motion compensated prediction
–  Divide the frame into blocks and apply block motion estimation/compensation
–  For each block, find the relative motion between the current block and a matching block of the same size in the previous frame
–  Transmit the motion vector(s) for each block

[Figure: a block in Frame t and its matching block in Frame t-1.]
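The block-matching step described above can be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD). This is an illustrative full search on a synthetic frame; the function names, the 16×16 frame, and the search range are assumptions, and practical encoders use faster search patterns.

```python
import random


def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n×n block at (bx, by) in cur and its displaced match in ref."""
    return sum(abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
               for y in range(n) for x in range(n))


def full_search(cur, ref, bx, by, n, search_range):
    """Return (cost, dx, dy) of the best integer motion vector in the search window."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= bx + dx and bx + dx + n <= width and
                      0 <= by + dy and by + dy + n <= height)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic test: the current frame is the previous frame shifted
# 2 pixels right and 1 pixel down (with wrap-around at the borders).
rng = random.Random(0)
H = W = 16
ref = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
cur = [[ref[(y - 1) % H][(x - 2) % W] for x in range(W)] for y in range(H)]

best = full_search(cur, ref, bx=8, by=8, n=4, search_range=3)
print(best)
```

The search recovers the vector (-2, -1) with zero SAD, so only the motion vector would need to be transmitted for this block.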

Temporal Prediction and Picture Coding Types

•  Intra Picture (I)
–  Picture is coded without reference to other pictures
•  Inter picture (P, B, b)
–  Uni-directionally predicted (P) Picture
    •  Picture is predicted from one prior coded picture
–  Bi-directionally predicted (B, b) Picture
    •  Picture is coded from two prior coded pictures

[Figure: example picture sequence with pictures labeled I, b, B, b, P.]

Summary of Key Steps in Video Coding

•  Intra Prediction and Inter Prediction
•  Transform and Quantization of residual (prediction error)
•  Entropy coding on syntax elements, e.g. prediction modes, motion vectors, coefficients
•  In-loop filtering to reduce coding artifacts

[Figure: inter prediction (motion compensation) with a motion vector from the previous to the current frame; intra prediction with a prediction mode; transform and quantization turning many pixels* into few coefficients.]

* Residual figure from J. Apostolopoulos, “Video Compression,” MIT 6.344 Lecture, Spring 2004

Video Compression Standards

•  Ensures inter-operability between encoder and decoder
•  Support multiple use cases and applications
–  Levels and Profiles
•  Video coding standard specifies decoder: mapping of bits to pixels
•  ~2x improvement in compression every decade

[Figure: Source → Pre-Processing → Encoding → Decoding → Post-Processing → Destination, with the scope of the standard marked.]

[Figure: bit-rate for MPEG-2 (1994), H.264/AVC (2003), and HEVC (2013).]

History of Video Coding Standards

•  MPEG: Moving Picture Experts Group (ISO/IEC)
•  VCEG: Video Coding Experts Group (ITU-T)
•  Other standards: VC1, VP8/VP9, China AVS, RealVideo

[Figure: timeline from 1984 to 2004. VCEG: H.261, H.263, H.263+, H.263++. MPEG: MPEG-1, MPEG-4. MPEG/VCEG: MPEG-2/H.262, H.264/MPEG-4 Part 10-AVC.]

Video Coding Progress

Source: T. Wiegand, JVT-W132, 2007

H.264/MPEG-4 AVC

•  Completed (version 1) in May 2003
•  H.264/AVC is the most popular video standard in the market
–  80% of video on the internet is encoded with H.264/AVC
•  Applications include
–  HDTV broadcast over satellite, cable, and terrestrial
–  video content acquisition and editing
–  camcorders, security applications, Internet and mobile network video, Blu-ray Discs
–  real-time video chat, video conferencing, and telepresence
•  ~50% higher coding efficiency than MPEG-2 (used in DVD, US terrestrial broadcast)

Improvements of H.264/MPEG-4 AVC over previous standards

•  Prediction
–  Intra prediction using neighboring samples
–  Temporal prediction using multiple frames
–  Motion compensation on variable block size, quarter-pel
•  Transform
–  4×4/8×8 Integer transform, 2×2/4×4 Secondary Hadamard
•  Quantization
–  Finer quantization supported
•  Entropy coding
–  Context adaptive variable length coding (CAVLC) and arithmetic coding (CABAC)
•  In-loop deblocking filter

Part II: High Efficiency Video Coding (HEVC)

High Efficiency Video Coding (HEVC)

•  Achieves 2x higher compression compared to H.264/AVC
•  High throughput (Ultra-HD 8K @ 120fps) & low power
–  Implementation friendly features (e.g. built-in parallelism)
•  Benefits include
–  reduce the burden on global networks
–  easier streaming of HD video to mobile devices
–  account for advancing screen resolutions (e.g. Ultra-HD)

“HEVC will provide a flexible, reliable and robust solution, future-proofed to support the next decade of video” - ITU-T Press Release (2013)

[Figure: Samsung Galaxy S4; live delivery of French Open; Netflix Ultra-HD 4K; Samsung TV Ultra-HD 4K.]

Activity in JCT-VC Committee

•  Chairs
–  G. J. Sullivan (Microsoft)
–  J. R. Ohm (Aachen University)
•  Meet Quarterly
–  1st meeting (A) [January 2010]
   …..
–  12th meeting (L) [January 2013]
•  ~250 attendees per meeting representing ~70 companies
•  Several hundred contributions per meeting
•  Each meeting is around 9 - 10 days (14+ hours/day)
•  Multiple parallel tracks

[Figure: bar chart of attendees and contributions for meetings A through J; vertical axis from 200 to 1200.]

HEVC Reference Documents

•  Meeting Contributions
–  http://phenix.int-evry.fr/jct/
•  Specification
–  http://www.itu.int/ITU-T/recommendations/rec.aspx?rec=11885
•  Reference Software (HM)
–  https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/
•  References
–  G. J. Sullivan, et al. “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, 2012
–  V. Sze, M. Budagavi, G. J. Sullivan (Editors), “High Efficiency Video Coding (HEVC): Algorithms and Architectures,” Springer, 2014 http://www.springer.com/engineering/signals/book/978-3-319-06894-7

Coding Efficiency of HEVC (Objective)

J. R. Ohm et al., “Comparison of the Coding Efficiency of Video Coding Standards—Including High Efficiency Video Coding (HEVC),” IEEE Transactions on Circuits and Systems for Video Technology, 2012

$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^{\mathrm{bitdepth}} - 1)^2 \cdot W \cdot H}{\sum_i (O_i - D_i)^2}$$
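The PSNR definition above translates directly into code. In this sketch O and D are the original and decoded pictures given as lists of rows; the function name and the infinite result for identical pictures are choices made here, not part of the slide.

```python
import math


def psnr(original, decoded, bit_depth=8):
    """PSNR in dB: 10 * log10(peak^2 * W * H / sum of squared errors)."""
    peak = (1 << bit_depth) - 1
    height, width = len(original), len(original[0])
    sse = sum((o - d) ** 2
              for row_o, row_d in zip(original, decoded)
              for o, d in zip(row_o, row_d))
    if sse == 0:
        return math.inf          # identical pictures
    return 10 * math.log10(peak * peak * width * height / sse)


original = [[100, 100], [100, 100]]
off_by_one = [[101, 99], [101, 99]]       # every sample wrong by 1
print(f"{psnr(original, off_by_one):.2f} dB")
```

An error of one level on every 8-bit sample gives about 48.13 dB, a useful reference point when reading rate-distortion curves.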

Coding Efficiency of HEVC (Subjective)

Subjective Tests for Entertainment Applications (Random Access)

Sequence | Bit-rate Savings
BQ Terrace | 63.1%
Basketball Drive | 66.6%
Kimono1 | 55.2%
Park Scene | 49.7%
Cactus | 50.2%
BQ Mall | 41.6%
Basketball Drill | 44.9%
Party Scene | 29.8%
Race Horse | 42.7%
Average | 49.3%

J. Ohm et al., “Comparison of the Coding Efficiency of Video Coding Standards—Including High Efficiency Video Coding (HEVC),” IEEE Transactions on Circuits and Systems for Video Technology, 2012

H.265/HEVC vs. H.264/AVC Decoder

[Figure: decoder block diagram. Encoded bitstream → Entropy Decoder → inverse quantization and inverse transform (Q⁻¹, T⁻¹) → adder fed by Intra Prediction or Motion Comp. → Deblocking Filter → Sample Adaptive Offset (in-loop filter) → Picture Buffer → decoded pixels. HEVC changes annotated on the diagram: High Throughput CABAC & Advanced Motion Vector Prediction; Larger Transforms and More Sizes; More Prediction Modes; Larger Interpolation Filter; Fewer Edges; Larger and Flexible Coding Block Size (64×64).]

Key Features in HEVC

Table columns: feature; High Coding Efficiency; High Throughput / Low Power (X marks a benefit)

Larger and Flexible Coding Block Size: X
More Sophisticated Intra Prediction: X
Larger Interpolation Filter for Motion Compensation
Larger Transform Size: X
Parallel Deblocking Filter: X
Sample Adaptive Offset: X
High Throughput CABAC: X X
High Level Parallel Tools: X
Parallel Merge/Skip: X

M. Zhou, V. Sze, M. Budagavi, “Parallel Tools in HEVC for High-Throughput Processing,” SPIE Optical Engineering + Applications, Applications of Image Processing XXXV, 2012.

Larger Coding Blocks

•  Each frame is broken up into blocks
•  Large block sizes reduce signaling overhead
•  In H.264/AVC, macroblock is always 16×16 pixels
–  Each macroblock is either inter or intra coded
•  In HEVC, Coding Tree Unit (CTU) can have up to 64×64 pixels
–  CTU can have a combination of inter and intra coded blocks

[Figure: N × N block, N = 16, 32, or 64.]

Flexible Coding Block Structure

•  Better adaptation to different video content
•  CTU divided into Coding Units (CU) with Quad tree
•  Coding units divided into prediction units (PU)
•  PU have different motion data or prediction modes

[Figure: a Coding Tree Unit (CTU); its coding tree composed of Coding Units (CU); Prediction Units (PU), including skip and asymmetric motion partition.]

Prediction Units

•  Intra-coded CU can only be divided into square partition units
–  For a CU, make decision to split into four PUs (8×8 CUs only) or a single PU
•  Inter-coded CU can be divided into square and non-square PUs as long as one side is at least 4 pixels wide (note: no 4×4 PU)

[Figure: two methods of partitioning for intra-coded CU (one N × N PU or four N/2 × N/2 PUs) and eight methods of partitioning for inter-coded CU (N × N, two N × N/2, two N/2 × N, four N/2 × N/2, and four asymmetric splits into N/4 and 3N/4).]

Large Transforms

•  HEVC supports 4×4, 8×8, 16×16, 32×32 integer transforms
–  Two types of 4×4 transforms (IDST-based for Intra, IDCT-based for Inter); IDCT-based transform for 8×8, 16×16, 32×32 block sizes
–  Integer transform avoids encoder-decoder mismatch and drift caused by slightly different floating point representations.
–  Parallel friendly matrix multiplication/partial butterfly implementation
–  Transform size signaled using Residual Quad Tree
•  Achieves 5 to 10% increase in coding efficiency
•  Increased complexity compared to H.264/AVC
–  8x more computations per coefficient
–  16x larger transpose memory

[Figure: transform and quantization turn many pixels into few coefficients; the residual of a CU is represented with a TU quad tree.]

M. Budagavi et al., “Core Transform Design in the High Efficiency Video Coding (HEVC) Standard,” IEEE JSTSP, 2013
Intra Prediction

•  H.264/AVC has 10 modes
–  angular (8 modes), DC, planar
•  HEVC has 35 modes
–  angular (33 modes), DC, planar
•  Angular prediction
–  Interpolate from reference pixels at locations based on angle
•  DC
–  Constant value which is an average of neighboring pixels (reference samples)
•  Planar
–  Average of horizontal and vertical prediction

Intra Prediction Modes

[Figure: directions of the angular modes, with modes 2 to 17 along the left side and modes 18 to 34 along the top, and the horizontal and vertical modes marked. 0: Planar, 1: DC, 2..34: Angular. Legend: 0: Intra_Planar, 1: Intra_DC, 35: Intra_FromLuma.]

J. Lainema, W.-J. Han, “Intra Prediction in HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
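The DC and planar modes described above can be sketched in a few lines. The rounding and the use of the top-right and bottom-left reference samples follow the HEVC specification as best understood here, not these slides; reference smoothing and boundary filtering are left out, and the function names are illustrative.

```python
def predict_dc(top, left):
    """DC mode: every sample is the rounded mean of the top and left references."""
    n = len(top)                       # block size, a power of two
    shift = n.bit_length()             # log2(n) + 1, i.e. divide by 2n
    dc = (sum(top) + sum(left) + n) >> shift
    return [[dc] * n for _ in range(n)]


def predict_planar(top, left, top_right, bottom_left):
    """Planar mode: average of a horizontal and a vertical linear interpolation."""
    n = len(top)
    shift = n.bit_length()
    pred = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            horizontal = (n - 1 - x) * left[y] + (x + 1) * top_right
            vertical = (n - 1 - y) * top[x] + (y + 1) * bottom_left
            pred[y][x] = (horizontal + vertical + n) >> shift
    return pred


top = [10, 20, 30, 40]
left = [50, 60, 70, 80]
print(predict_dc(top, left)[0])
print(predict_planar(top, left, top_right=40, bottom_left=80)[0])
```

For flat reference samples both modes reproduce the flat value; they differ once the references carry a gradient, where planar follows the slope and DC does not.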

Removing Intra Artifacts (Pre-Processing)

•  Reference Sample Smoothing
–  Smooth out neighboring pixels (i.e., reference samples) before using them for prediction
–  Reduce contouring artifacts caused by edges in the reference sample arrays
–  Two modes
    •  Three-tap smoothing filter
    •  Strong intra smoothing with corner reference pixels
–  Application of smoothing depends on PU size and prediction mode

[Figure: prediction without pre-filter and with pre-filter. Image source: M. Wien, TCSVT, July 2003]

J. Lainema, W.-J. Han, “Intra Prediction in HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Removing Intra Artifacts (Post-Processing)

•  Boundary Smoothing
–  Intra prediction may introduce discontinuities along block boundaries
–  Filter first prediction row and column with three-tap filter for DC prediction, and two-tap for horizontal and vertical prediction

Image source: JCTVC-F172, July 2011

Inter Prediction

•  Motion vectors can have up to ¼ pixel accuracy (interpolation required)
•  In H.264/AVC, luma uses 6-tap filter, and chroma uses bilinear filter
•  In HEVC, luma uses 8/7-tap and chroma uses 4-tap
–  Different coefficients for ¼ and ½ positions
•  Restricted prediction on small PU sizes

[Figure: a 4×4 block in the current frame and its reference block in the previous frame, once for vector (1, -1) and once for vector (0.5, -0.5).]

Interpolation Filter

Require integer pixels (highlighted in red) to interpolate fractional pixels (highlighted in blue)

To interpolate NxN pixels requires up to (N+7)x(N+7) reference pixels

Use 1-D filters (order matters for greater than 8-bit video)
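The (N+7)×(N+7) figure follows from the 8-tap filter length: each output sample needs 7 extra neighbors in each dimension. The sketch below counts those reads and applies the half-sample luma filter to one row. The tap values are from the HEVC specification, not from these slides, and the single-stage rounding is a simplification for 8-bit video; the function names are illustrative.

```python
HALF_PEL_LUMA = [-1, 4, -11, 40, 40, -11, 4, -1]   # taps sum to 64


def reference_pixels(n, taps=8):
    """Reference samples needed to interpolate an n×n block (worst case)."""
    side = n + taps - 1
    return side * side


def half_pel_row(samples):
    """Half-sample positions along one row of 8-bit integer samples."""
    out = []
    for i in range(len(samples) - 7):
        acc = sum(c * s for c, s in zip(HALF_PEL_LUMA, samples[i:i + 8]))
        out.append(min(255, max(0, (acc + 32) >> 6)))   # round, normalize, clip
    return out


for n in (4, 8, 64):
    print(n, reference_pixels(n), f"{reference_pixels(n) / (n * n):.2f} reads/pixel")
print(half_pel_row([0, 10, 20, 30, 40, 50, 60, 70]))
```

A 4×4 block needs 121 reference samples, about 7.6 reads per predicted pixel, against roughly 1.2 for a 64×64 block. On the linear ramp the filter returns 35, the midpoint between the two central samples.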

Mode Coding

•  Predict modes from neighbors to reduce syntax element bits
–  Intra Prediction Mode
–  Advanced Motion Vector Prediction (AMVP), Merge/Skip Mode

[Figure: candidate positions around the current PU (including A1, B1, B0) and the co-located PU. Labels: 3 candidates; 2 to 5 candidates.]

Merge Mode

[Figure: a moving object coded without merge (many extra motion parameters) and with merge.]

B. Bross et al., “Inter Prediction in HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

AMVP, Merge, Skip Mode

 | AMVP | Merge | Skip
Syntax elements | mvp_l0_flag, mvp_l1_flag | merge_flag, merge_idx | cu_skip_flag, merge_idx
Use of neighbor candidates | Predict motion vector | Copy motion data (motion vector, reference index, direction) | Copy motion data (motion vector, reference index, direction); no residual
Number of candidates | Up to 2 | Up to 5 (signaled in slice header) | Up to 5 (signaled in slice header)
Spatial | Up to 2 of 5 (scaling if reference index different) | Up to 4 of 5 (no scaling, only redundancy check) | Up to 4 of 5 (no scaling, only redundancy check)
Temporal | Up to 1 of 2 (if < 2 spatial candidates) | Up to 1 of 2 (always added to list if available) | Up to 1 of 2 (always added to list if available)
Additional | Zero motion vector (if < 2 spatial or temp candidates) | Bi-predictive candidates and zero motion vector | Bi-predictive candidates and zero motion vector

In-loop Filtering: Deblocking Filter

•  Removes blocking artifacts due to block based processing
–  Computationally intensive in H.264/AVC
•  In H.264/AVC, performed on every 4x4 block edge
–  Each macroblock has 128 pixel edges, 32 edge calculations
–  Each 4x4 depends on neighboring 4x4
•  In HEVC, performed on every 8x8 block edge
–  Each 16x16 CTU has 64 pixel edges, 8 edge calculations
–  All 8x8 are independent (can be processed in parallel)

[Figure: decoded picture without deblocking and with deblocking; edge grids on a 16 × 16 block.]

In-loop Filtering: Sample Adaptive Offset (SAO)

•  Filter to address local discontinuities
–  Edge Offset and Band Offset
•  Check neighbors in one of 4 directions (0, 90, 135, 45 degrees)
•  Based on the values of the neighbors, apply one of 4 offsets

[Figure: pixel level versus pixel index (x-1, x, x+1) for edge offset categories 1 to 4 of the current pixel c.]

In-loop Filtering: Sample Adaptive Offset (SAO)

[Figure: decoded picture with SAO and without SAO.]

C.-M. Fu et al., “Sample Adaptive Offset in the HEVC Standard,” IEEE Transactions on Circuits and Systems for Video Technology, 2012

Entropy Coding

•  Lossless compression of syntax elements
•  HEVC uses Context Adaptive Binary Arithmetic Coding (CABAC)
–  10 to 15% higher coding efficiency compared to CAVLC

V. Sze, D. Marpe, “Entropy Coding in HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
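Going back to the SAO edge offset slide above, the classification of a pixel against its two neighbors along the chosen direction can be sketched as follows. The category rules follow the usual description of HEVC edge offset (category 1 is a local minimum, category 4 a local maximum); the offset values and function names are made up for illustration, and only the horizontal (0 degree) direction is shown.

```python
def sign(v):
    return (v > 0) - (v < 0)


def edge_category(a, c, b):
    """Edge offset category of pixel c given its neighbors a and b (0 = none)."""
    return {-2: 1, -1: 2, 1: 3, 2: 4}.get(sign(c - a) + sign(c - b), 0)


def sao_edge_row(row, offsets):
    """Apply edge offsets along one row of 8-bit samples (horizontal direction)."""
    out = list(row)
    for x in range(1, len(row) - 1):
        category = edge_category(row[x - 1], row[x], row[x + 1])
        if category:
            out[x] = min(255, max(0, row[x] + offsets[category]))
    return out


# Hypothetical offsets: raise valleys, lower peaks.
OFFSETS = {1: 2, 2: 1, 3: -1, 4: -2}
print(sao_edge_row([10, 5, 10], OFFSETS))
print(sao_edge_row([5, 10, 5], OFFSETS))
```

Each pixel is classified from the unfiltered neighbors only, so all pixels can be processed independently.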
CABAC Throughput Improvements

•  Reduce total number of bins
•  Reduce context coded bins
•  Reduce context dependencies
•  Grouping bypass bins
•  Reduce parsing dependencies
•  Reduce memory requirements

[Figure: CABAC decoder. bits → Arithmetic Decoder (AD) → bins → De-Binarizer (DB) → syntax elements; Context Modeling (CM), made up of Context Selection (CS) and Context Memory, supplies the probability; bypass path around context modeling.]

V. Sze, M. Budagavi, “High Throughput CABAC Entropy Coding in HEVC,” IEEE TCSVT, 2012

Reduction in worst case bins for 16x16 pixels

 | Total bins | Context bins | Bypass bins
H.264/AVC | 20861 | 7805 | 13056
HEVC | 14301 | 884 | 13417
Ratio | 1.5x | 9x | 1x

•  3x reduction in context memory
•  20x reduction in line buffer for context selection

[Figure: the bin sequence 0 1 1 0 1 0 1 0 0 0 1 0 1 0 decoded in 15 cycles and in 9 cycles.]

High Level Parallel Tools (Multi-Core)

[Figure: three ways to partition a picture. Slices (also in H.264/AVC): slice 0 to slice 3. Tiles: tile 0 to tile 3. Wavefront Parallel Processing (Interleaved Entropy Slices*): substream 0 to substream 3.]

*D. Finchelstein, V. Sze, A. P. Chandrakasan, “Multi-core Processing and Efficient On-chip Caching for H.264 and Future Video Decoders,” IEEE Trans. CSVT, 2009

Additional Modes

•  For wireless display and cloud computing, screen content coding should be considered
•  Screen content typically has more edges
•  Lossless
–  Bypass transform, quantization and in-loop filters
•  Transform Skip
–  Bypass transform, but continue to perform quantization and in-loop filters
•  I_PCM
–  Signal raw pixels

source: www.techprollc.com

Profiles, Levels, Tiers

•  Profile defines set of tools for different applications
–  Main, Main 10, Main Still Picture
–  8-bits/sample → 16.78 million colors
–  10-bits/sample → 1.07 billion colors
•  Level defines the maximum supported resolution and frame rate
–  e.g. Level 4.0, 1920x1080 @ 32 fps
–  Level 5.0, 4096x2160 @ 30 fps
•  Bit-rates defined by level and tier
–  Main and High (professional)

Main Still Picture (Intra Coding Only)

•  HEVC also provides improved compression for still images

Compared with | BD-Rate Reduction
H.264/AVC (intra only) | 15.8%
JPEG 2000 | 22.6%
JPEG XR | 30.0%
WebP | 31.0%
JPEG | 43.0%

T. Nguyen, D. Marpe, “Performance Comparison of HM 6.0 with Existing Still Image Compression Schemes Using a Test Set of Popular Still Images” JCTVC-I0595, 2012

Part III: Video Codec Implementations

Decoder Design Considerations

•  Function
–  Mapping of bitstream to pixels fixed by the standard
•  Implementation Requirements
–  Conformance: Support all tools for a given profile in the standard
–  Throughput: Real-time processing for video playback; level specifies pixel-rate and bit-rate

[Figure: bitstream (10101011) at specified bit-rate → Decoder → pixels at specified pixel-rate.]

Encoder Design Considerations (1)

•  Function
–  Mapping of pixels to standard compliant bitstream
–  Flexibility of selecting which set of encoding tools to use and how to use them (e.g. how to search for best compression mode)

[Figure: pixels at specified pixel-rate for real-time applications → Encoder → bitstream (10101011) at specified bit-rate or compression ratio.]

Encoder Design Considerations (2)

•  Implementation Requirements
–  Conformance: Must generate a bitstream that is decodable by a standard compliant decoder (for a given profile)
–  Throughput: For real-time applications, need to meet pixel-rate requirements; can be done off-line for storage applications
–  Bit-rate/Compression Ratio: For given application, must meet minimum compression requirements
–  Compression ratio vs. Complexity: Find compression mode that meets compression requirements under complexity constraint

Decoder design requires architecture innovations, while encoder design requires both algorithm and architecture innovations

Multimedia Platforms

 | Desktop CPU [1] | Mobile CPU [1] | GPU+CPU [2] | DSP [3] | FPGA [4] | ASIC [5,6]
Flexibility | High | High | Med/High | Med | Med | Low
Development Cost | Low | Low | Low/Med | Med | Med | High
Speed/Throughput | Low/Med | Low | Med | Med | Med | High
Power Consumption | High | Med | High | Med | Med | Low

Examples of HEVC implementations
[1] F. Bossen et al., "HEVC Complexity and Implementation Analysis," IEEE TCSVT, 2012
[2] Ittiam Systems, “Compute accelerated HEVC decoder on ARM® Mali™-T600 GPUs”
[3] F. Pescador et al., "On an implementation of HEVC video decoders with DSP technology,” IEEE ICCE, 2013
[4] S. Cho, H. Kim, “Implementation of a HEVC Hardware Decoder,” JCTVC-L0098, 2013
[5] C.-T. Huang et al. "A 249Mpixel/s HEVC video-decoder chip for Quad Full HD applications,” IEEE ISSCC, 2013.
[6] S.-F. Tsai et al. "A 1062Mpixels/s 8192×4320p High Efficiency Video Coding (H.265) encoder chip,” IEEE VLSIC, 2013.

Implementation Requirements

•  Throughput
–  Achieve target pixel-rate and bit-rate for real-time applications
–  Reduce latency of bits to pixels and pixels to bits for interactive applications
–  Techniques: parallelism, pipelining, eliminate stalls
•  Energy and Power Consumption
–  Minimize energy consumption to extend battery life for portable devices
–  Minimize power consumption to reduce heat dissipation
–  Techniques: voltage scaling, frequency scaling, power gating, number of ops
•  Platform Cost
–  Reduce amount of data to be stored in memory and amount of logic (e.g. gates in ASIC, number of cores for processors) to reduce size of chip
–  Reduce bandwidth requirements such as reads/writes from memory to reduce demands on off-chip components
–  Techniques: shared computations, on-the-fly processing, caching

Software HEVC Decoder

•  ARMv7 1.3GHz (mobile processor) [Bossen, JCTVC-K0327, 2012]
–  Dual core, but decoding on single thread (other thread for display)
–  1080p @ 24 fps at 2Mbps (16 picture buffer to average workload)
•  Intel i7 Core 2.6 GHz (desktop processor) [Bossen et al., TCSVT, 2012]
–  Single core, single thread
–  1080p @ 60 fps at 7Mbps
•  Multi-thread Intel Core i7 2.7 GHz [Suzuki et al., JCTVC-L0098, 2013]
–  4 cores / 4 threads (parallel GOPs)
–  3840x2160 @ 76 fps at 12Mbps [cropped 8K content]
•  Multi-thread Intel X5680 3.3 GHz [Chi et al., TCSVT, 2012]
–  2x6 cores/12 threads (parallel Tiles, WPP)
–  3840x2160 @ 24 fps at ~12Mbps (QP=37)
–  3840x2160 @ 14 fps at ~170Mbps (QP=22)

Software HEVC Decoder

[Figure: workload for different modules.]

F. Bossen et al., "HEVC Complexity and Implementation Analysis," IEEE Transactions on Circuits and Systems for Video Technology, 2012

Hardware HEVC Decoder Architecture

[Figure: decoder block diagram with Top Control, Entropy Decoder, MV Dispatch, MC Cache, Inverse Transform, Prediction, and In-loop Filters; line buffers for the entropy decoder and for prediction and in-loop filters; ColMV, Ref Pixels, and Rec DMA engines behind a Memory Interface Arbiter; processing engines split into Group I and Group II. Signals shown: Coeff, Residue, MV Info, VPB/Top Info. Legend: pixel flow, info flow, DMA flow, SRAM, processing engine.]

M. Tikekar et al., “Decoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Pipelining HEVC Decoder

•  Variable-size pipelining to support a diverse set of CTU, CU, and PU sizes (select size to balance memory cost vs. data reuse)

[Figure: Variable-size Pipeline Block (VPB) of 64x64, 64x32, and 64x16 for CTU sizes 64x64, 32x32, and 16x16. System level pipeline (between Inv. Transform, Prediction and In-Loop Filters) and prediction level pipeline (within Prediction module): each VPB is split into PPBs (Stage 1) and sub-PPBs (Stage 2) for Y and U/V; 16x16 pipeline.]

Source: C.-T. Huang et al., “A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications,” IEEE ISSCC, 2013.

Decoupling Entropy Coding

•  Workload of entropy decoding based on bit-rate (bin-rate), while rest of decoder depends on pixel-rate
•  Use FIFO to absorb variations in workload
–  Higher FIFO depth results in fewer stalls due to averaging, but longer latency and higher memory cost

[Figure: pipeline timing. Group I: Entropy Decoder, MC Dispatch. Group II: Inverse Transform, Prediction, Deblock, REC DMA. Coefficients pass from Group I to Group II through the TU FIFO.]

Source: C.-T. Huang et al., “A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications,” IEEE ISSCC, 2013.

Intra Prediction

•  Reference sample processing
–  Reference pixel buffer to store neighboring pixels (padding when not available)
–  Apply smoothing filter on pixels depending on mode
•  Feedback loop at TU granularity
–  Update reference pixel buffer accordingly

[Figure: the Inverse Transform output is added to the Intra Prediction or Inter Prediction output; reconstructed samples return to Intra Prediction as intra reference pixels.]

M. Tikekar et al., “Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
[Figure label: TU granularity feedback]

Inter Prediction

•  Read samples from reference picture (typically stored in off-chip picture buffer)
–  Use cache to reduce off-chip memory bandwidth
•  Interpolated pixels use a 2-D separable filter for fractional motion vectors
–  Multiple pixels can be interpolated in parallel (share input pixels)
•  Smaller blocks have larger read overhead (for fractional mv)
–  NxN requires (N+7)x(N+7) pixel reads → 4x4 inter-PU not supported in HEVC

[Figure: Motion Vectors from Entropy Decoder → Dispatch → MC Cache → Fetch → 2-D Filter → Inter Predicted Pixels; the MC Cache connects to the Reference Picture Buffer (on-chip SRAM/external DRAM).]

MC Cache and Picture Buffer

•  Minimize redundant reads from off-chip memory (DRAM)
•  MC Cache design considerations
–  Sufficient throughput to support worst case PU
–  Detect redundant reads and handle latency of DRAM
•  Store pixels in DRAM to minimize row changes (cycle overhead)
–  Avoid reading two rows from same bank for a given reference region

[Figure: mapping of picture regions to DRAM banks (# = bank in DRAM); 20% reduction in overhead cycles.]

M. Tikekar et al., “Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Inverse Transform

•  Larger transform → More computation
–  Share coefficients across transform sizes and within transform to reduce area cost

M. Tikekar et al., “Decoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
[Figure: recursive inverse transform datapath (partial 2x2, 4x4, 8x8, and 16x16 stages with even-odd index sort and add-sub stages; IDCT4, IDST4, IDCT8, IDCT16, IDCT32) and a multiplier built from a coefficient LUT (18, 50, 75, 89 with sign changes) and MAC units; 30% reduction in area cost.]

Inverse Transform

•  Larger transform → Larger transpose memory
–  Use SRAM rather than registers to reduce area cost
–  SRAM has limited read/write ports (requires careful mapping)

M. Tikekar et al., “Decoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

[Figure: mapping of a 32 × 32 pixel block, in 4x4 blocks, onto SRAM banks 0 to 3 of the Transpose Memory; datapath Coeffs → Dequantize → Transform (row/column select) with Transpose Memory → Residue; 4 pixels/cycle throughput per 1-D transform.]

Hardware HEVC Decoder

Video Coding Standard | HEVC (HM4)
Technology | TSMC 40-nm
Core Area | 1.33 x 1.33 mm
Gate Count | 715k
On-Chip Memory (SRAM) | 124 kB
Resolution / Frame Rate | 4kx2k @ 30fps (3840x2160)
Frequency | 200 MHz
Core Voltage | 0.9 V
Power | 76 mW

[Figure: die photo, 2.18 mm × 2.18 mm, showing Dispatch/MC Cache, Entropy Decoder, Prediction, Inverse Transform, Deblock, and SRAM.]

C.-T. Huang et al., “A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications,” IEEE ISSCC, 2013

Area Breakdown

Logic: MC cache 126; Deblock 49.9; Entropy Decoder 94.5; Inverse Transform 121.1; Memory Interface Arbiter 13.7; Prediction 191.9; RegFiles 75.5; Others 42
Memory (SRAM): Pipeline Buffers 447.3; MC-related SRAM 200.4; Line Buffers 337; Others 32.8

M. Tikekar et al., “Decoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
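The chip specification above is enough to derive the throughput in the paper title and the energy per pixel. This is a small sketch using only the resolution, frame rate, clock frequency, and power quoted in the table; the variable names are illustrative.

```python
WIDTH, HEIGHT, FPS = 3840, 2160, 30
CLOCK_HZ = 200e6
CORE_POWER_W = 76e-3

pixel_rate = WIDTH * HEIGHT * FPS                        # pixels per second
cycles_per_pixel = CLOCK_HZ / pixel_rate
energy_per_pixel_nj = CORE_POWER_W / pixel_rate * 1e9    # nJ per pixel

print(f"{pixel_rate / 1e6:.0f} Mpixel/s")
print(f"{cycles_per_pixel:.2f} cycles/pixel")
print(f"{energy_per_pixel_nj:.2f} nJ/pixel")
```

At 200 MHz the decoder has about 0.8 clock cycles per pixel, which is why every stage must process several pixels per cycle. The result also matches the 249 Mpixels/s in the paper title and works out to about 0.31 nJ/pixel of core energy.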
Area breakdown units: logic in kgates, memory in kbits.

Power Breakdown

Prediction 23%; Deblocking 3%; MC Cache 26%; Inverse Transform 17%; Memory Interface Arbiter 2%; Entropy Decoder 3%; Line Buffers 2%; Pipeline Buffers 10%; Others 13%

M. Tikekar et al., “Decoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Hardware vs. Software

[Figure: hardware breakdown by power (the same values as the power breakdown: Prediction 23%, Deblocking 3%, MC Cache 26%, Inverse Transform 17%, Memory Interface Arbiter 2%, Entropy Decoder 3%, Line Buffers 2%, Pipeline Buffers 10%, Others 13%) next to the software breakdown by cycles.]

ASIC Decoder Comparison

 | This Work | ISSCC'12 [2] | ISSCC'10 [3] | ISSCC'06 [4]
Standard | HEVC ("H.265") WD4 | H.264/AVC HP/MVC | H.264/AVC HP/SVC/MVC | H.264/AVC MP
Max Specification | 3840x2160 @30fps | 7680x4320 @60fps | 4096x2160 @24fps | 1920x1080 @30fps
Gate Count | 715K | 1338K | 414K | 160K
On-Chip SRAM | 124KB | 80KB | 9KB | 5KB
Technology | 40nm/0.9V | 65nm/1.2V | 90nm/1.0V | 0.18µm/1.8V
Normalized Core Power* | 0.31nJ/pixel | 0.21nJ/pixel | 0.28nJ/pixel | 5.11nJ/pixel
Normalized DRAM Power** | 0.88nJ/pixel | 1.27nJ/pixel | N/A | N/A
Normalized System Power*** | 1.19nJ/pixel | 1.48nJ/pixel | N/A | N/A
DRAM Configuration | 32b DDR3 | 64b DDR2 | N/A | 32b DDR + 32b SDR

* Power for max specification
** Modeled by [5]
*** System Power = Core Power + DRAM Power

Slide Source: C.-T. Huang et al., “A 249Mpixels/s HEVC Video Decoder Chip for Quad Full HD Applications,” IEEE ISSCC, 2013.

Decoder Power Comparison

[Figure: energy per pixel (nJ, 0.0 to 2.5) versus year (2006 to 2014) for H.264/AVC and H.265/HEVC decoders, with die photos. H.265/HEVC [WD4] Decoder (76mW), C.T. Huang et al. (MIT), ISSCC 2013: TSMC 40nm, 0.9V, Ultra-HD 4K @ 30 fps. H.264/AVC Decoder (51mW), P.K. Tsung et al. (NTU), ISSCC 2011. H.264/AVC Decoder (2mW), Sze et al. (MIT), JSSC 2009: 0.7-V, 720p-HD @ 30 fps, 3.3 mm × 3.3 mm die with memory controller domain, core domain, SRAM, and 176 I/O pads.]

Low Power Approaches

•  Operate at voltage near minimum energy point
•  Utilize parallelism and pipelining to achieve performance
•  Adaptive/Dynamic voltage frequency scaling
•  Optimize access patterns to reduce memory power

Reduce Cycles → Reduce Freq. → Reduce Voltage → Reduce Power

[Figure: delay and energy per operation versus supply voltage, with delays T and 2T marked.]

V. Sze et al., “A 0.7-V 1.8-mW H.264/AVC 720p Video Decoder,” IEEE Journal of Solid State Circuits, 2009.

Encoder Decisions

•  Encoder must search for mode that gives the “best” compression. Some of the key decisions include
–  CU and PU size
–  Inter or Intra CU
–  Motion Vector
–  Intra Prediction Mode
•  “Best” compression is defined using a rate-distortion cost

$$D + \lambda \cdot R$$

•  where
–  D is the distortion between the original and the compressed image (a measure of the visual quality of the compression)
–  R is a measure of the number of bits required to signal the compressed image
–  λ is the Lagrangian multiplier that weights the distortion and rate costs
•  Perform rate-distortion optimization (RDO)

Full vs. Fast RDO

•  Full RDO
–  Distortion based on sum of squared differences (SSD), includes quantization
–  Rate based on entropy coded bits of prediction info and quantized coefficients
•  Fast RDO
–  Distortion approximation based on sum of absolute differences (SAD) or sum of absolute transformed differences (SATD)
–  Rate approximation based on prediction info bits (intra mode or motion vector); can include number of non-zero coefficients to predict coefficient bits

[Figure: Intra Prediction and Motion Estimation → Fast RDO (30+ modes) → Full RDO Pass (T, Q, CABAC Rate, IQ, IT, SSD) → Final Mode Decision. T/Q: Transform/Quantization; IT/IQ: Inverse Transform / Quantization.]

S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC," High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.
RDO Flow in HM

•  The encoder must decide how best to divide a CTU into CUs, and how to divide the CUs into PUs (based on full RDO in HM)
•  For CTU of 64x64
–  CU options: 64x64, 32x32, 16x16, 8x8
•  For Inter-coded CU
–  PU options
•  For Intra-coded CU
–  PU options

CU and PU decisions

[Figure: PU partition shapes for inter-coded and intra-coded CUs, with block edges split at N/2 or at N/4 and 3N/4]

•  Search for block in reference frame(s) to predict current block with least rate-distortion cost
–  Signal block in previous frame using a motion vector
•  Typically most computationally intensive function in encoder

Motion Estimation

Search algorithm considerations
1.  Number of candidates
–  Number of computations
–  Number of memory accesses
2.  Off-chip bandwidth
3.  On-chip bandwidth

•  Integer pixel motion estimation
–  Rate is the bits required to transmit the motion data (including impact of motion predictor)
–  Distortion is calculated from the SAD of original and motion-compensated prediction (subsampled when block size > 8)

$$\arg\min_{MV,\,REF} \; \sum_{i,j} \left| \mathrm{Diff}(i,j) \right| + \lambda \cdot R(MV, REF)$$

where
–  MV = motion vector (include impact of advanced mv predictor)
–  REF = reference index

Motion Estimation in HM

K. McCann et al., “High Efficiency Video Coding (HEVC) Test Model 14 (HM 14) Encoder Description,” JCTVC-P1002, 2014

[Figure: spatial neighbors A1, B1, B0 of the current PU and the co-located PU]
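
The integer-pel cost on this slide (SAD plus λ times the motion vector rate) can be illustrated with a small sketch. This is not HM code: the exhaustive search loop, the function names and the `mv_bits` rate model (a signed exp-Golomb-style bit count of the difference to the predictor) are assumptions made for illustration only, and the sketch ignores reference indices and SAD subsampling.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(block_a.astype(np.int64) - block_b.astype(np.int64)).sum())

def mv_bits(mv, mvp):
    """Hypothetical rate model: exp-Golomb-style bit count of (mv - mvp)."""
    def bits(d):
        return 2 * ((2 * abs(d) + 1).bit_length() - 1) + 1
    return bits(mv[0] - mvp[0]) + bits(mv[1] - mvp[1])

def best_integer_mv(cur, ref_frame, x, y, mvp, lam, search_range):
    """Exhaustive search minimizing SAD + lam * R(MV) around position (x, y).

    The caller must make sure the search window stays inside ref_frame.
    """
    h, w = cur.shape
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cand = ref_frame[y + dy:y + dy + h, x + dx:x + dx + w]
            cost = sad(cur, cand) + lam * mv_bits((dx, dy), mvp)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best

# Toy data: the current 8x8 block is the reference content displaced by (2, 1).
rng = np.random.default_rng(0)
ref_frame = rng.integers(0, 256, size=(32, 32))
cur = ref_frame[13:21, 14:22].copy()
mv, cost = best_integer_mv(cur, ref_frame, x=12, y=12, mvp=(0, 0), lam=4,
                           search_range=3)
```

With a perfect match the distortion term is zero, so the remaining cost is purely the rate term; a larger λ makes vectors far from the predictor less attractive.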

•  Integer pixel motion estimation
–  Search Strategy
1.  Search center is motion vector predictor
2.  Diamond search around center (search range = 64 → 7 steps [1, 2, 4, ..., 64]); early termination if best candidate doesn’t change in 3 steps.
3.  If best candidate > 5 pixels away from search center, do raster scan search (5 pixel steps).
4.  Perform diamond search around best candidate from step 2 or 3. If new best candidate found, repeat 4.

Motion Estimation in HM

Reference
•  K. McCann et al., “High Efficiency Video Coding (HEVC) Test Model 14 (HM 14) Encoder Description,” JCTVC-P1002, 2014
•  M. Sinangil, PhD Thesis, MIT, 2012

Image Source: N. Purnachand et al., IEEE ICCE-Berlin, 2012
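
The four-step search strategy can be sketched roughly as follows. This is a simplified illustration, not the HM implementation: it uses only the four diamond corner points per step, takes the cost function as a callable, and the names `search`, `diamond` and `in_window` are hypothetical.

```python
def search(cost, predictor, search_range=64, raster_step=5):
    """Simplified four-step integer search; cost(mv) returns the RD cost."""
    def in_window(mv):
        return (abs(mv[0] - predictor[0]) <= search_range and
                abs(mv[1] - predictor[1]) <= search_range)

    def diamond(center, best_mv, best_cost):
        # Steps of 1, 2, 4, ...; stop early after 3 steps without improvement.
        stale, step = 0, 1
        while step <= search_range and stale < 3:
            improved = False
            for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
                mv = (center[0] + dx, center[1] + dy)
                if in_window(mv):
                    c = cost(mv)
                    if c < best_cost:
                        best_mv, best_cost, improved = mv, c, True
            stale = 0 if improved else stale + 1
            step *= 2
        return best_mv, best_cost

    # Step 1: start at the motion vector predictor.
    best_mv, best_cost = predictor, cost(predictor)
    # Step 2: diamond search around the search center.
    best_mv, best_cost = diamond(predictor, best_mv, best_cost)
    # Step 3: raster scan if the best candidate is far from the center.
    far = max(abs(best_mv[0] - predictor[0]), abs(best_mv[1] - predictor[1]))
    if far > raster_step:
        for dy in range(-search_range, search_range + 1, raster_step):
            for dx in range(-search_range, search_range + 1, raster_step):
                mv = (predictor[0] + dx, predictor[1] + dy)
                c = cost(mv)
                if c < best_cost:
                    best_mv, best_cost = mv, c
    # Step 4: refine around the best candidate until nothing improves.
    while True:
        new_mv, new_cost = diamond(best_mv, best_mv, best_cost)
        if new_cost >= best_cost:
            break
        best_mv, best_cost = new_mv, new_cost
    return best_mv, best_cost

# Toy cost surface with its minimum at (7, -3).
result = search(lambda mv: abs(mv[0] - 7) + abs(mv[1] + 3), (0, 0),
                search_range=16)
```

On a smooth cost surface like the toy one, the refinement loop converges to the minimum; on real video the result is a local minimum, which is the price paid for testing far fewer candidates than a full search.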

•  Half pixel motion estimation
–  Rate is the bits required to transmit the motion data (including impact of motion predictor)
–  Distortion is calculated from SATD
•  Block-wise 4×4 or 8×8 Hadamard transform on difference between original and motion-compensated prediction, and sum absolute coefficients
–  Search 8 points surrounding best integer motion vector
•  Quarter pixel motion estimation
–  Same rate and distortion calculation as half pixel
–  Search 8 points surrounding best half pixel motion vector
•  Also do search for merge/skip candidates

Motion Estimation in HM

K. McCann et al., “High Efficiency Video Coding (HEVC) Test Model 14 (HM 14) Encoder Description,” JCTVC-P1002, 2014
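
As a rough illustration of the SATD distortion and the 8-point fractional refinement, the sketch below applies a 4×4 Hadamard transform to each 4×4 tile of the difference block and sums the absolute coefficients. It is a minimal sketch under stated assumptions: any normalization or scaling an encoder applies to the result is omitted, motion vectors are assumed to be stored in quarter-pel units, and the function names are hypothetical.

```python
import numpy as np

# 4x4 Hadamard matrix (entries +1/-1).
H4 = np.array([[1,  1,  1,  1],
               [1, -1,  1, -1],
               [1,  1, -1, -1],
               [1, -1, -1,  1]], dtype=np.int64)

def satd4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences of one 4x4 block."""
    diff = orig.astype(np.int64) - pred.astype(np.int64)
    return int(np.abs(H4 @ diff @ H4.T).sum())

def satd(orig, pred):
    """Block-wise SATD: accumulate satd4x4 over all 4x4 tiles."""
    h, w = orig.shape
    total = 0
    for y in range(0, h, 4):
        for x in range(0, w, 4):
            total += satd4x4(orig[y:y + 4, x:x + 4], pred[y:y + 4, x:x + 4])
    return total

def fractional_candidates(center_qpel, step):
    """The 8 points around a center MV in quarter-pel units.

    step=2 gives the half-pel neighbors, step=1 the quarter-pel neighbors.
    """
    cx, cy = center_qpel
    return [(cx + dx, cy + dy)
            for dy in (-step, 0, step) for dx in (-step, 0, step)
            if (dx, dy) != (0, 0)]

zeros = np.zeros((8, 8), dtype=np.uint8)
ones = np.ones((8, 8), dtype=np.uint8)
half_pel = fractional_candidates((8, -4), step=2)
```

A constant difference concentrates all energy in the DC coefficient of each tile, which is why SATD tracks the cost of coding the residual after a transform more closely than SAD does.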

Multiple Searches in Parallel

M. E. Sinangil et al., “Cost and Coding Efficient Motion Estimation Design Considerations for High Efficiency Video Coding (HEVC) Standard,” IEEE Journal of Selected Topics in Signal Processing, 2013.

Compared to HM
•  2x fewer candidates
•  1%-3% coding loss

•  Perform motion estimation for each PU in inter-coded CU
•  Process CUs in parallel to increase throughput
–  Share search pixels across engines to reduce memory bandwidth by 8x

Parallel Motion Estimation

Reduce Number of PUs Processed

[Figure: coding loss (BD-rate), 0 to 40, versus area savings (Mgates), 0 to 8, for configurations 1 to 11 with different numbers of partition units; one configuration is labeled “Only Square PUs”]

Smallest slope provides best trade-off: #3

Trade-off between coding efficiency (BD-rate) and complexity (area cost) for different numbers of inter-predicted partition units

•  In HM, motion estimation is done serially for the PUs within a CU to get AMVP for accurate rate estimate

Motion Estimation with CU

[Figure: CU split into PU1 and PU2; spatial neighbors A1, B1, B0, B2 of the current PU and the co-located PU]

Can’t process PU1 and PU2 in parallel

Parallel Motion Estimation

•  HEVC has “Parallel Motion Estimation” feature to turn off dependency within a Motion Estimation Region (MER)
–  PU within region cannot use data from other PU in region
–  All PUs in region can be processed in parallel at encoder

[Figure: PU1 and PU2 inside one MER; a CTU divided into MER0, MER1, MER2 and MER3, with dependencies inside a region crossed out]

Can process PU1 and PU2 in parallel

Multiple MERs per CTU

M. Zhou, “Parallelized merge/skip mode for HEVC,” JCTVC-F069, 2011
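
The MER rule (“PU within region cannot use data from other PU in region”) amounts to an availability check on neighboring positions. The sketch below assumes square regions whose size is a power of two and compares region indices obtained by right-shifting the sample coordinates; the parameter name `log2_mer_size` and both function names are hypothetical.

```python
def same_mer(pos_a, pos_b, log2_mer_size):
    """True if two sample positions (x, y) fall in the same MER."""
    return ((pos_a[0] >> log2_mer_size) == (pos_b[0] >> log2_mer_size) and
            (pos_a[1] >> log2_mer_size) == (pos_b[1] >> log2_mer_size))

def usable_neighbors(cur_pos, neighbor_positions, log2_mer_size):
    """Keep only neighbors outside the current PU's MER."""
    return [n for n in neighbor_positions
            if not same_mer(cur_pos, n, log2_mer_size)]

# 32x32 regions (log2 size 5): the left neighbor at x=15 shares the MER of a
# PU at (16, 16) and is dropped; the neighbor above-left of (32, 0) is not.
kept_inside = usable_neighbors((16, 16), [(15, 16), (16, 15)], 5)
kept_border = usable_neighbors((32, 0), [(31, 0)], 5)
```

A larger region allows more PUs to be searched in parallel but removes more candidates, which is the coding-efficiency cost of the feature.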

•  In HM, CTU processed in raster scan order
•  Change CTU Processing Order to reduce reads from picture buffer (off-chip memory bandwidth) due to increased data locality
•  Requires frame decoupling with entropy encoder (as entropy encoder must generate bitstream in raster scan order to be standard compliant)

CTU Processing Order

[Figure: Raster Scan versus Alternative Scan of CTUs, with labels n=4 and m=2]

S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Additional Complexity Reductions

•  Bottom-up approach
–  Derive distortion cost for PU from sub-PUs (e.g. compute distortion of 16×16 PU from four 8×8 PUs)
–  Requires storage of the sub-PU SADs
•  Reduce bit-width for distortion calculation
•  Use bilinear interpolation for fractional motion estimation

SAD16(X) = SAD8(A) + SAD8(B) + SAD8(C) + SAD8(D)

[Figure: 16×16 block X divided into four 8×8 blocks A, B, C, D]
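
The bottom-up relation SAD16(X) = SAD8(A) + SAD8(B) + SAD8(C) + SAD8(D) can be checked with a short sketch. The helper names `sad_grid` and `merge_up` are hypothetical, and block dimensions are assumed to be multiples of the sub-block size.

```python
import numpy as np

def sad_grid(cur, ref, size):
    """SAD of every size x size sub-block, returned as a 2-D grid."""
    diff = np.abs(cur.astype(np.int64) - ref.astype(np.int64))
    h, w = diff.shape
    return diff.reshape(h // size, size, w // size, size).sum(axis=(1, 3))

def merge_up(sads):
    """Combine each 2x2 group of child SADs into the SAD of the parent block."""
    rows, cols = sads.shape
    return sads.reshape(rows // 2, 2, cols // 2, 2).sum(axis=(1, 3))

rng = np.random.default_rng(1)
cur = rng.integers(0, 256, size=(32, 32))
ref = rng.integers(0, 256, size=(32, 32))

sad8 = sad_grid(cur, ref, 8)    # 4x4 grid of 8x8 SADs (computed once)
sad16 = merge_up(sad8)          # 2x2 grid of 16x16 SADs (additions only)
sad32 = merge_up(sad16)         # single 32x32 SAD
```

Only the smallest block size touches pixel data; every larger size costs three additions per block, at the price of storing the child SADs.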

•  Rough mode decision: select N best modes out of 35
–  N equals 8 for 4×4, 8×8
–  N equals 4 for 16×16, 32×32, 64×64
–  Hadamard Cost Ranking (SATD distortion and mode bits for rate)
•  Determine three Most Probable Modes (MPM)
–  Spatial neighbors to the left (A) and above (B)
–  If neighbors not available or redundant (A=B), use DC, Planar, vertical or adjacent angles (+/- 1)
•  Decide between rough mode + MPM candidates
–  Full RDO (SSD for distortion and mode + coefficient bits for rate)

Intra Prediction Search in HM

[Figure: current PU and its neighboring blocks]

Y. Piao et al., “Encoder Improvement of Unified Intra Prediction,” JCTVC-C207, Oct. 2010.
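
The candidate selection on this slide can be sketched as below. The MPM derivation follows the HEVC rule as best recalled here (mode numbering 0 = Planar, 1 = DC, 26 = vertical, unavailable neighbors treated as DC) and should be read as an illustration rather than normative text; the function names and the cost dictionary are hypothetical.

```python
PLANAR, DC, VERTICAL = 0, 1, 26

def num_rough_modes(pu_size):
    """N from the slide: 8 for 4x4 and 8x8 PUs, 4 for larger PUs."""
    return 8 if pu_size <= 8 else 4

def most_probable_modes(a, b):
    """Three MPMs from the left (a) and above (b) modes; None = unavailable."""
    a = DC if a is None else a
    b = DC if b is None else b
    if a == b:
        if a < 2:                      # Planar or DC: use the default set
            return [PLANAR, DC, VERTICAL]
        # Angular mode: the mode itself plus its two adjacent angles (+/- 1)
        return [a, 2 + ((a + 29) % 32), 2 + ((a - 2 + 1) % 32)]
    for third in (PLANAR, DC, VERTICAL):
        if third not in (a, b):
            return [a, b, third]

def full_rdo_candidates(hadamard_costs, pu_size, a, b):
    """Rough-mode winners plus any MPMs that are not already included."""
    ranked = sorted(hadamard_costs, key=hadamard_costs.get)
    candidates = ranked[:num_rough_modes(pu_size)]
    for mode in most_probable_modes(a, b):
        if mode not in candidates:
            candidates.append(mode)
    return candidates

# Toy Hadamard costs for all 35 modes with a minimum at mode 12.
costs = {mode: abs(mode - 12) for mode in range(35)}
cands = full_rdo_candidates(costs, pu_size=16, a=10, b=10)
```

Only the short list returned by `full_rdo_candidates` goes through full RDO, so the expensive transform, quantization and entropy coding steps run on a handful of modes instead of all 35.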

•  To reduce search space, use coarse search with angular prediction, then refinement around coarse angles
•  Skip 64×64 PU size
–  Since max TU is 32×32, prediction done at 32×32; thus only benefit of 64×64 intra-PU is signaling
•  To increase throughput, use original pixels for intra prediction (rather than reconstructed pixels) to avoid dependence on reconstruction feedback loop

Additional Complexity Reduction

Above techniques have cumulative coding loss of 1%

S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Hardware-Friendly RDO Pipeline

S.-F. Tsai et al., “Encoder Hardware Architecture for HEVC,” High Efficiency Video Coding (HEVC): Algorithms and Architectures, Springer, 2014.

Only do full RDO on best Inter and Intra mode for each CU-depth (6% coding loss)

[Figure: two-stage pipeline. PU-Mode Pre-decision (Fast RDO) supplies Intra Pred Dirs. and Inter PU Sizes & MVs to the CU-Layer High Complexity Mode Decision (Full RDO), which evaluates CU0, four CU1 and the CU2 blocks (64X64, 32X32 and 16X16 CUs) and passes the HCMD Cost to the Final Mode Decision]

Hardware HEVC Encoder

S.-F. Tsai et al., “A 1062Mpixels/s 8192×4320p High Efficiency Video Coding (H.265) encoder chip,” IEEE VLSIC, 2013

Video Coding Standard | HEVC (WD4)
Technology | TSMC 28-nm HPM
Core Area | 5 × 5 mm²
Gate Count | 8350k
On-Chip Memory (SRAM) | 7.14 MB
Resolution / Frame Rate | 8192×4320 @ 30fps
Frequency | 312 MHz
Power | 708 mW

ASIC Encoder Comparison

S.-F. Tsai et al., “A 1062Mpixels/s 8192×4320p High Efficiency Video Coding (H.265) encoder chip,” 2013 Symposium on VLSIC, 2013

Part IV: Emerging applications and HEVC extensions

What’s Next

•  More compression efficiency
–  Yes, in 5-10 years. Especially since video delivery is moving from traditional broadcast model to IP delivery and one-to-one streaming
–  Analogy: Public transport versus individual cars
•  Other considerations have become important too:
–  Power consumption, complexity, throughput
–  Ability to support new functionalities, modalities etc.

•  Need for supporting diverse clients with varying capabilities (resolution, computational power etc.)

Changing Landscape of Video Coding Applications (1)

Image source: Samsung, Youtube

•  Immersive experience
–  Multiple cameras and at higher video resolutions (1080p → 4K → 8K)
–  Multiple displays, bigger displays (1080p → 4K → 8K)
–  Free-viewpoint video, 360-degree video, augmented reality, 3D movies
–  Demos
•  http://replay-technologies.com/
•  http://www.kolor.com/video

Changing Landscape of Video Coding Applications (2)

Image source: Cisco, Kolor

•  Growing requirement to support mixed format content consisting of natural video + graphics/text

Changing Landscape of Video Coding Applications (3)

Scalable Video Coding

Supporting Diverse Clients - Simulcasting

[Figure: the server encodes the video separately at 640×480, 1280×960 and 2560×1920 and sends Bitstream 1, Bitstream 2 and Bitstream 3 to the clients]

Can we do better?

Scalable Video Coding

•  Quality (SNR) scalability
•  Temporal scalability
•  Spatial scalability

Single Bitstream: … 0110111 …

Spatial Scalability

Figure source: T. Wiegand, JVT-W132 [1].

Layer N, e.g. 640×480 (Base layer)
Layer N+1, e.g. 1280×960 (Enhancement layer)

•  Layered coding
•  Higher layers have higher spatial resolution than lower layers
•  Upper layers reuse data from lower layers

Temporal Scalability

•  IPPP coding: I P P P P P P P P
•  IBBP coding: P I B B P I B B B B P
•  Hierarchical P-frames: I p P p P p P p P
•  Hierarchical B-frames: I b B b P b B b P
•  p, b: Non-reference frames

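HEVC Scalable Extension (SHVC)

[Figure: SHVC decoder. The base layer decoder takes the BL bitstream and outputs BL decoded pictures via the BL frame buffer; the enhancement layer decoder takes the EL bitstream and outputs EL decoded pictures via the EL frame buffer; an upsampler sits between the base layer and the enhancement layer decoder]

•  SHVC: Scalable extension: Expected July 2014
•  EL: Enhancement layer, BL: Base layer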

SHVC Performance

D.-K. Kwon, M. Budagavi, “Combined scalable and multiview extension of High Efficiency Video Coding (HEVC),” IEEE Picture Coding Symposium, pp. 414-417, 2013.

•  2x scalability (i.e. base layer is half the size of enhancement layer) compared to simulcast
•  Quality (SNR) scalability compared to simulcast

Coding configuration | BD-Rate savings
All Intra coding | 23%
Random access (Hierarchical-B) | 16%

Coding configuration | BD-Rate savings
All Intra coding | 28%
Random access (Hierarchical-B) | 20%

Multiview Video Coding

Multiview Video Capture

•  Stereo, 3D video
•  360-degree video
•  Free viewpoint video

Image source: Fuji, Kolor

Stereoscopic Video Coding

[Figure: camera modules capture Left View and Right View → stereo video encoding → stereo video bitstream → stereo video decoding → Left View and Right View → 3D display]

Image source: Samsung

Redundancy in Stereo Video

[Figure: left view and right view of the same scene]

Multiview Video Coding – Picture Prediction Structures (1)

•  Linear camera array S0 S1 S2 S3 S4 S5 S6 S7
•  Simulcast

Multiview Video Coding – Picture Prediction Structures (1)

•  Linear camera array S0 S1 S2 S3 S4 S5 S6 S7
•  Interview prediction of anchor frames

Multiview Video Coding – Picture Prediction Structures (1)

•  Linear camera array S0 S1 S2 S3 S4 S5 S6 S7
•  Both anchor and non-anchor views predicted from other views

HEVC Multiview Extension (MV-HEVC)

[Figure: MV-HEVC decoder. The View 0 decoder takes the View 0 bitstream and the View 1 decoder takes the View 1 bitstream; each has its own framebuffer; View 0 and View 1 decoded pictures go to the 3D display]

•  MV-HEVC: Multiview extension: Expected July 2014
•  View 0: Left view, View 1: Right view

Combined Scalable and Multiview Extension of HEVC

D.-K. Kwon, M. Budagavi, “Combined Scalable and Multiview Extension of High Efficiency Video Coding (HEVC),” IEEE Picture Coding Symposium, 2013.

•  Applications of the combined scalable and multiview HEVC coding include:
–  Scalable stereoscopic video (e.g. 1080p stereo to the emerging 4K stereo),
–  Mixed resolution multiview coding
•  H.264/AVC does not support combined scalable and multiview coding
•  HEVC allows for combined scalable and multiview coding


MV-HEVC + Depth (3D-HTM)

[Figure: left view and depth map used to produce a synthesized right view]

•  Standardization is ongoing

MV-HEVC + Depth Encoding

[Figure: depth estimation, followed by view coding and depth coding of N views + M depth maps]

•  Views that are transmitted will be coded using MV-HEVC
•  Expect additional 20% gain

MV-HEVC + Depth Decoding

[Figure: view decoding and depth decoding, followed by view synthesis to produce multiple views]

Screen Content Video Coding

Screen Content Coding

•  Applications such as automotive infotainment, wireless displays, remote desktop, remote gaming, cloud computing etc. are becoming popular
•  Video in these applications often has mixed content consisting of natural video, text, graphics etc.
–  In text and graphics regions, patterns (e.g. text characters, icons, lines etc.) can repeat within a picture
–  Also blocks with limited set of colors are possible

Intra Block Copy

[Figure: two examples of a current CU and its search area relative to the LCU (64×64)]

Bit-rate savings

 | Intra | Random access | Low delay
SC RGB 444 | 27.0% | 21.5% | 17.0%
SC YUV 444 | 23.5% | 20.2% | 15.9%

M. Budagavi, D.-K. Kwon, “Intra motion compensation and entropy coding improvements for HEVC screen content coding,” IEEE Picture Coding Symposium, 2013.

Palette Coding

•  Input video:
–  8 bits per pixel, per color component
–  4×4 block: 8*3*16 = 384 bits
•  Palette coding:
–  Color palette: 2 colors in our example: 2*24 = 48 bits
–  Color index: 1 bit per pixel in our example: 16 bits
–  Total bits: 64 bits
•  Note: This slide shows a very simple example for explanatory purposes. Techniques currently being evaluated can use more colors in the palette and more bits for the color index.

[Figure: palette with Color 0 and Color 1, and the 4×4 index map i0 to i15]
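
The bit counts in the palette example can be reproduced with a few lines. This is only the arithmetic of the simple example (raw palette entries plus fixed-length indices, no entropy coding); the function names are hypothetical.

```python
def raw_bits(num_pixels, bits_per_component=8, components=3):
    """Bits needed to send every pixel directly."""
    return num_pixels * bits_per_component * components

def palette_encode(pixels):
    """Map each pixel (a color tuple) to an index into a list of unique colors."""
    palette, indices, lookup = [], [], {}
    for p in pixels:
        if p not in lookup:
            lookup[p] = len(palette)
            palette.append(p)
        indices.append(lookup[p])
    return palette, indices

def palette_bits(palette, indices, bits_per_component=8, components=3):
    """Palette entries sent raw, plus a fixed-length index per pixel."""
    index_bits = (len(palette) - 1).bit_length()
    return (len(palette) * bits_per_component * components
            + len(indices) * index_bits)

# 4x4 block containing two colors, as in the example.
black, white = (0, 0, 0), (255, 255, 255)
block = [black, white, white, black] * 4
palette, indices = palette_encode(block)
decoded = [palette[i] for i in indices]

before = raw_bits(len(block))            # 8 * 3 * 16
after = palette_bits(palette, indices)   # 2 * 24 + 16 * 1
```

The saving shrinks as the number of distinct colors grows, since both the palette and the per-pixel index get larger, which is why the technique targets text and graphics regions.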

HEVC Screen Content Coding

•  HEVC Screen content coding activity
–  Started in April 2014
–  Expected completion early-mid 2015
•  Key tools being studied
–  Intra Block Copy with extended search area
–  Palette based coding

Summary

•  Video content continues to impose a severe burden on today’s global networks
–  Rapid growth in the usage and diversity of video applications and services
–  Increasing popularity of HD video and emergence of beyond-HD formats accompanied by stereo and multi-view content
•  HEVC is the latest video coding standard, which gives 50% improvement in coding efficiency, and is expected to support video applications for the next decade.
•  In addition to improving coding efficiency, implementation challenges were also considered to maximize processing speed and minimize hardware cost.

•  V. Sze, M. Budagavi, G. J. Sullivan (Editors), “High Efficiency Video Coding (HEVC): Algorithms and Architectures,” Springer, 2014
•  G. J. Sullivan et al., “Overview of the High Efficiency Video Coding (HEVC) standard,” IEEE Transactions on Circuits and Systems for Video Technology, 2012
•  J. Ohm et al., “Comparison of the Coding Efficiency of Video Coding Standards—Including High Efficiency Video Coding (HEVC),” IEEE Transactions on Circuits and Systems for Video Technology, 2012

References

•  Introduction
•  High-Level Syntax in HEVC
•  Block Structures and Parallelism Features in HEVC
•  Intra-Picture Prediction in HEVC
•  Inter-Picture Prediction in HEVC
•  Transform and Quantization in HEVC
•  In-Loop Filters in HEVC
•  Entropy Coding in HEVC
•  Compression Performance Analysis in HEVC
•  Decoder Hardware Architecture in HEVC
•  Encoder Hardware Architecture in HEVC

HEVC Book

http://www.springer.com/engineering/signals/book/978-3-319-06894-7

The book serves the video engineering community by:

•  Providing video application developers an invaluable reference to the latest video standard, High Efficiency Video Coding (HEVC);
•  Serving as a companion reference that is complementary to the HEVC standards document produced by the JCT-VC, a joint team of ITU-T VCEG and ISO/IEC MPEG;
•  Including in-depth discussion of algorithms and architectures for HEVC by some of the key video experts who have been directly involved in developing and deploying the standard;
•  Giving insight into the reasoning behind the development of the HEVC feature set, which will aid in understanding the standard and how to use it.

HEVC Book
