PARMA: Parallelization-Aware Run-time Management for Energy-Efficient Many-Core Systems
Newcastle PRiME team, IEEE TC, 69(10), Oct 2020.
Parallelization and runtime
• Multiple cores in the h/w
• S/w with different degrees of parallelizability
• How to obtain optimal runtime decisions with regard to task-to-core mapping?
Intuition and hypothesis
• If an application is not parallelizable, giving it multiple cores would be wasteful
• If an application is parallelizable, giving it a single core does not exploit the h/w fully
• It is therefore reasonable to expect that runtime decisions based on the parallelizability of apps may lead to energy/performance optimality
What does parallelizable mean?
• Amdahl’s Law
– “In computer architecture, Amdahl’s law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.”
– For a task with parallelizable fraction p, the speedup on n cores relative to 1 core is
  S(n, 1) = t(1) / t(n) = 1 / ((1 − p) + p/n)
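A quick worked example of the formula (my own numbers, for illustration): with p = 0.9 and n = 4, S = 1 / (0.1 + 0.9/4) = 1 / 0.325 ≈ 3.1; with p = 0.5, S = 1 / (0.5 + 0.5/4) = 1.6. A weakly parallelizable app gains little from extra cores.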
What does parallelizable mean?
• Amdahl’s Law
– Larger p: more parallelizable
– Smaller p: less parallelizable
– The simplicity of Amdahl’s Law makes it suitable for runtime use
But how can this be used for runtime?
• The runtime control needs the following inputs
– App parallelizability (the p factor)
– Availability of cores
• It makes the following decision
– Map each app to the optimal number of cores
So each app has a p value?
• Not so simple
– Apps may have different phases (with different p) during their execution
– One p per app is not optimal
[Figure: traces of Power (W), Instructions Per Second (IPS) and Parallelization Factor (p) against Execution Time (s), panels (a)–(f), showing p changing between phases of an application’s execution.]
Runtime p factor sensing
• If we could determine the instantaneous p
value for each app
– We’d be able to schedule it optimally on a per-control-cycle basis
– So the idea is to find the p value of an app for each control cycle and make runtime decisions based on that
• Usually you can find the p value of an app in offline static characterization by comparing relative speedup between runs on different numbers of cores
– Using Amdahl’s Law backwards – knowing S, find p.
– This is not practical at runtime on a per control cycle basis
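Solving S(n, 1) = 1 / ((1 − p) + p/n) for p gives p = (1 − 1/S) / (1 − 1/n). For example (my own numbers): if a run on 4 cores is measured to be 2.5× faster than a run on 1 core, then p = (1 − 1/2.5) / (1 − 1/4) = 0.6 / 0.75 = 0.8.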
Not the entire app
• An app can be run on two different core configurations for very small amounts of time, and the relative speedup can be calculated from the observations
(From the paper: determining p at runtime – Fig. 2, “Determining p by splitting the workload in two arbitrary parts I1 and I2”)
• IPS is obtained directly from the architectural performance counters: with I instructions retired in wall-clock time t(n) on n cores, IPS(n) = I / t(n)
• The speedup between executions of the same workload on n and m cores is then
  S(n, m) = IPS(n) / IPS(m) = t(m) / t(n)   (3)
• Combining this with Amdahl’s model and solving for the unknown p gives
  p = (1 − 1/S(n, m)) / ((n − 1)/n − (1/S(n, m)) · (m − 1)/m)
• The assumption behind PARMA is that p stays approximately constant across the two most recent control cycles; the workload is split into two parts I1 and I2 (with I1 + I2 = I) executed on n and m cores, so p can be found during an n → m transition by measuring the speedup S2(n, m) = (I2/t2(n)) / (I2/t2(m)) from the counters of those two cycles
• The model can be extended to cover DVFS frequency changes between control cycles by normalizing single-core performance to frequency: IPSn(1)/Fn = IPSm(1)/Fm
• In real applications p changes over time, so a single p cannot be applied to the entire workload; instead p is approximated as constant within a sliding window of the most recent control cycles
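To make the above concrete, here is a minimal sketch (my own code, not the paper’s implementation) of estimating p from the performance-counter readings of two consecutive control cycles that ran on different core counts; the struct fields stand in for whatever the counter interface actually provides:

#include <stdio.h>

/* One control-cycle sample, assumed to be derived from hardware performance
 * counters: instructions retired, elapsed (wall-clock) time and core count. */
struct cycle_sample {
    double instructions;
    double seconds;
    int    cores;
};

/* Estimate p from two consecutive control cycles executed on n and m cores,
 * under the PARMA assumption that p stayed approximately constant across
 * both cycles.  The speedup is taken as the ratio of measured IPS values,
 * and Amdahl's model is solved for p.                                       */
static double estimate_p_runtime(struct cycle_sample on_n, struct cycle_sample on_m)
{
    double S = (on_n.instructions / on_n.seconds) /
               (on_m.instructions / on_m.seconds);          /* IPS(n) / IPS(m) */
    int n = on_n.cores, m = on_m.cores;
    return (1.0 - 1.0 / S) /
           ((double)(n - 1) / n - (1.0 / S) * (double)(m - 1) / m);
}

int main(void)
{
    /* Made-up counter readings: 2.5e9 instr/s on 4 cores, 1.0e9 instr/s on 1 core. */
    struct cycle_sample c4 = { 2.5e9, 1.0, 4 };
    struct cycle_sample c1 = { 1.0e9, 1.0, 1 };
    printf("p = %.2f\n", estimate_p_runtime(c4, c1));       /* prints p = 0.80 */
    return 0;
}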
Optimization objective functions
• You can give performance (throughput) and power different degrees of importance
• Multiple factors in the optimization objective
– Weighted sum – seems intuitive, but difficult to argue for in this case (performance and power are not in the same dimension, so the weights do not mean much)
– Weighted product – makes sense when the factors are different kinds of physical quantities, but the weights are not multiplied with the factors, they are applied as exponents
Optimization objective functions
• Examples:
– PNP – power-normalized performance in the form of IPS/Watt (how much performance you get for each watt of power); this has the dimension of 1/energy (1/J)
– This is the inverse of energy per operation, so both describe the same optimization target: maximizing one minimizes the other
– EDP – energy-delay product: energy per operation multiplied by latency per operation; this puts more emphasis on speed than PNP, and minimizing it is equivalent to maximizing (IPS)²/Watt
– You can generalize to ExDP with x being any real number (usually greater than 0); larger x puts more weight on energy, e.g. minimizing E²DP corresponds to maximizing (IPS)³/Watt²
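To make the two metrics concrete, here is a small sketch (my own, with made-up measurement numbers) that scores candidate (n, F) configurations by PNP or EDP and picks the best one:

#include <stdio.h>

struct config {
    int    cores;
    double freq_ghz;
    double ips;     /* measured or predicted instructions per second */
    double watts;   /* measured or predicted power                   */
};

/* PNP = IPS / Watt (maximize).  EDP in these units is Watt / IPS^2,
 * so maximizing IPS^2 / Watt is equivalent to minimizing EDP.         */
static double pnp(struct config c)     { return c.ips / c.watts; }
static double inv_edp(struct config c) { return c.ips * c.ips / c.watts; }

int main(void)
{
    /* Hypothetical characterization points for one application. */
    struct config table[] = {
        {1, 2.0, 1.0e9,  5.0},
        {2, 2.0, 1.7e9,  9.0},
        {4, 2.0, 2.5e9, 16.0},
    };
    int count = sizeof table / sizeof table[0];
    int best = 0;
    for (int i = 1; i < count; i++)
        if (inv_edp(table[i]) > inv_edp(table[best]))   /* swap in pnp() to optimize PNP */
            best = i;
    printf("best: %d cores @ %.1f GHz\n", table[best].cores, table[best].freq_ghz);
    return 0;
}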
But how to relate p to objectives?
• Run pthreads to characterize system parameters with regard to the p factor
– A synthetic pthreads benchmark allows the p ratio to be set by the user
– It also allows running on an arbitrary number of cores, and on specific cores
– From these measurements, build a set of lookup tables that relate p to optimal control decisions
(From the paper: multiple running apps could be classified according to their p values and mapped based on the classification results, or concurrently running apps could be viewed as a single combined workload with changing p; the measurements above are needed for characterizing the models before using the PARMA run-time control.)
Fig. 3. Flowchart of the synthetic benchmark with programmable p, considering a total workload of X computation cycles: create N pthreads pinned to Core0 … Core(N−1); the sequential part executes (1−P)·X cycles on Core0, and the parallel part executes P·X/N cycles on each of the N threads before they are joined.
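A minimal sketch of such a benchmark, following the flowchart above (my own code, not the authors’; the “work” is a spin loop and X is an iteration count rather than a true cycle count):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static volatile unsigned long long sink;   /* keeps the loop from being optimized away */

/* Spin for a given number of iterations, standing in for "execute X cycles". */
static void burn(unsigned long long iters)
{
    for (unsigned long long i = 0; i < iters; i++)
        sink += i;
}

struct arg { int core; unsigned long long iters; };

static void *parallel_part(void *p)
{
    struct arg *a = p;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(a->core, &set);                      /* pin this thread to its core */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    burn(a->iters);                              /* P*X/N iterations per thread */
    return NULL;
}

int main(int argc, char **argv)
{
    int N = (argc > 1) ? atoi(argv[1]) : 4;              /* number of threads/cores */
    double P = (argc > 2) ? atof(argv[2]) : 0.9;         /* programmable p ratio    */
    unsigned long long X = 1000000000ULL;                /* total "work" iterations */

    /* Sequential part on Core0: (1-P)*X iterations. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);
    burn((unsigned long long)((1.0 - P) * X));

    /* Parallel part: N threads pinned to Core0..Core(N-1), P*X/N each. */
    pthread_t tid[N];
    struct arg args[N];
    for (int i = 0; i < N; i++) {
        args[i].core  = i;
        args[i].iters = (unsigned long long)(P * X / N);
        pthread_create(&tid[i], NULL, parallel_part, &args[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(tid[i], NULL);
    return 0;
}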
• Once you have a way of sensing p and a map of optimal decisions for p values, you can have the following:
• Many-Core Processor → Hardware Performance Counters
– Instructions retired
– Unhalted cycles
• RTM Algorithm
1. Calculate IPS and parallel fraction p
2. Use PNP or EDP lookup table to find the optimal n and F
• Specify n and F back to the processor
Fig. 4. The run-time management system.
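Putting the pieces together, a control loop in the spirit of Fig. 4 might look like the sketch below (my own illustrative C, not the paper’s code; read_counters(), lookup_optimal() and apply_config() are hypothetical placeholders for the counter interface and the characterized lookup tables):

/* Per-control-cycle snapshot, assumed to come from the hardware counters. */
struct counters { double instructions; double seconds; int cores; };

struct decision { int cores; double freq_ghz; };

/* Provided elsewhere (platform-specific): counter access, lookup tables, actuation. */
extern struct counters read_counters(void);
extern struct decision lookup_optimal(double p);   /* PNP or EDP lookup table */
extern void apply_config(struct decision d);

void rtm_control_loop(void)
{
    struct counters prev = read_counters();
    double p = 0.5;                                 /* initial guess until measured */
    for (;;) {
        struct counters cur = read_counters();      /* one control cycle later */

        /* 1. Calculate IPS and the parallel fraction p from the two most recent
         *    control cycles (valid when the core count changed between them and
         *    p stayed roughly constant across both).                            */
        double S = (cur.instructions / cur.seconds) /
                   (prev.instructions / prev.seconds);
        int n = cur.cores, m = prev.cores;
        if (n != m)
            p = (1.0 - 1.0 / S) /
                ((double)(n - 1) / n - (1.0 / S) * (double)(m - 1) / m);

        /* 2. Look up the optimal configuration for this p and apply it. */
        apply_config(lookup_optimal(p));
        prev = cur;
    }
}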