18-646 – How to Write Fast Code II?
1
Carnegie Mellon University Ian Lane
18646 – Week 2
18-646 – How to Write Fast Code II? 2
Homework 1
18-646 – How to Write Fast Code II? 3
Lecture Questions
https://canvas.cmu.edu/courses/21510/quizzes/55580
18-646 – How to Write Fast Code II? 4
Outline – Follow on from 2/4
Landscape of Computing Platforms
Hardware Architectures
Multicore vs Manycore
Instruction level parallelism
SIMD
Simultaneous multithreading
Memory hierarchy
System hierarchy
How to Write Fast Code?
18-646 – How to Write Fast Code II? 5
(5) Memory Hierarchy
Cache:
A piece of fast memory close to the compute modules in a microprocessor
Allows faster access to a limited amount of data
Physical properties of wires and transistors determine the trade-off between cache capacity and access throughput/latency
Capacity: Size, e.g. number of bytes of data
Latency: Time from start to finish, e.g. in CPU clock cycles
Throughput: Tasks accomplished per unit time, e.g. GB/s
[Figure: CPU connected to main memory — Capacity: 8 GB, Throughput: 34 GB/s, Latency: ~200 cycles]
18-646 – How to Write Fast Code II?
6
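A quick back-of-the-envelope check on these numbers (my arithmetic, using the 3.4 GHz clock quoted on the next slide): a ~200-cycle memory latency is roughly 200 / 3.4 GHz ≈ 59 ns, while sustaining 34 GB/s means a 64-byte cache line must arrive about every 64 B / (34 GB/s) ≈ 1.9 ns. Hiding that gap is exactly what caches, prefetching, and memory-level parallelism are for.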
(5) Memory Hierarchy
Cache:
A piece of fast memory close to the compute modules in a microprocessor
Allows faster access to a limited amount of data
Physical properties of wires and transistors determine the trade-off between cache capacity and access throughput/latency

Memory hierarchy of an Intel Core i7-2600K @ 3.4 GHz, viewed from one core — (capacity), (throughput), (latency):
Network: exa-bytes, 0.1 GB/s, ~1-10B cycles
Disk: 3 TB, 6 GB/s, ~1M cycles
DRAM: 16 GB, 34.08 GB/s, ~200 cycles
L3 cache: 8 MB, 108.8 GB/s, 26-31 cycles
L2 cache: 256 KB, 108.8 GB/s, 12 cycles
L1 data cache: 32 KB, 163.2 GB/s, 4-6 cycles
Registers
18-646 – How to Write Fast Code II? 7
Working with a Memory Hierarchy
Writing fast code → get data from the fastest (closest) level
When requesting data from a level of limited size in the memory hierarchy, there are two outcomes:
Hit: Data is available in the level
Miss: Data is missing from the level
For fast code → minimize misses, maximize hits
What application behavior can a memory hierarchy help with? Locality
18-646 – How to Write Fast Code II?
8
Working with a Memory Hierarchy
Principle of Locality
The phenomenon of the same value or related storage locations being
frequently accessed, usually amenable to performance optimization
Temporal Locality
Reuse of specific data and/or resources within
relatively small time durations
Spatial Locality
Use of data elements within relatively close storage locations
18-646 – How to Write Fast Code II? 9
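As a small illustration of spatial locality (a hypothetical example, not from the course code), the two loops below sum the same row-major C array and differ only in traversal order:

#include <stddef.h>

enum { N = 1024 };

/* Row-major traversal: consecutive iterations touch adjacent addresses,
   so each cache line brought in for a[i][...] is fully used (spatial locality). */
double sum_row_major(double a[N][N]) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: consecutive iterations are
   N*sizeof(double) bytes apart, so most of each fetched cache line is wasted. */
double sum_col_major(double a[N][N]) {
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}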
When would you get a miss?
Three types of misses in a memory hierarchy
Compulsory misses: caused by the first reference
Capacity misses: due to the finite size of the memory hierarchy
Conflict misses: due to policy of replacement, potentially avoidable
18-646 – How to Write Fast Code II? 10
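Capacity misses can often be reduced by loop blocking (tiling) so that the working set of the inner loops fits in cache. A minimal sketch, assuming a square matrix and a tunable tile size BS (both hypothetical, not from the slides):

enum { BS = 64 };   /* tile edge; tune so ~2*BS*BS*sizeof(float) fits in L1/L2 */

/* Blocked (tiled) transpose: each BS x BS tile of src and dst stays
   cache-resident while it is being worked on, turning capacity misses
   back into hits compared with striding over whole rows and columns. */
void transpose_blocked(int n, const float *src, float *dst) {
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < ii + BS && i < n; i++)
                for (int j = jj; j < jj + BS && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}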
Exploiting Benefits of Mem Hierarchy
Motivation:
When applications exhibit data locality, the cache/memory hierarchy provides faster access to frequently used data
Advantages:
Allows faster access to data for computation
Caches are managed storage: transparent to the end-user for functional purposes
Lowers energy consumption on cache hits
Disadvantages:
When multiple threads share a cache, they compete for cache space
18-646 – How to Write Fast Code II? 11
Outline – Follow on from 2/4
Landscape of Computing Platforms
Hardware Architectures
Multicore vs Manycore
Instruction level parallelism
SIMD
Simultaneous multithreading
Memory hierarchy
System hierarchy
How to Write Fast Code?
18-646 – How to Write Fast Code II? 12
(6) System Architecture
A Journey through the opportunities to write fast code with:
Multicore
Manycore
The Cloud
18-646 – How to Write Fast Code II? 13
Section 1 – Multicore
Writing fast code to exploit multicore parallelism
SIMD level parallelism
Core level parallelism
Memory hierarchy
[Figure: a multicore chip — cores with SIMD units and per-core caches, a shared memory bus, and off-chip DRAM]
18-646 – How to Write Fast Code II? 14
Section 2 – Manycore
GPU System Architecture
Moving data: the host system (CPU) and the device system (GPU) each have their own memory hierarchy; data moves between them over PCI-E 2.0 at a bandwidth of ~5.2 GB/s
[Figure: Host System (CPU) connected to Device System (GPU) over PCI-E 2.0]
18-646 – How to Write Fast Code II? 15
Section 2 – Manycore
GPU System Architecture
[Figure repeated with memories shown: Host System (CPU) with its DRAM and Device System (GPU) with its own DRAM, connected over PCI-E 2.0 at ~5.2 GB/s]
18-646 – How to Write Fast Code II? 16
Section 3 – The Cloud
[Figure: multiple CPU nodes connected over a network]
18-646 – How to Write Fast Code II? 17
Section 3 – The Cloud
[Figure: multiple host nodes, each with several GPUs, connected over a network]
18-646 – How to Write Fast Code II? 18
A Game of Granularity
[Figure (© Jike Chong 2009): synchronization overhead in seconds (~10^-9 to 10^0) plotted against task management overhead in instructions (~10^1 to 10^9).
Manycore throughput-optimized implementations sit near on-die cache latency (HW task queue);
multicore task-queue-based and Pthread-based implementations sit near shared off-chip memory latency (SW task queue);
MPI-based implementations sit near system latency (OS processes);
remote-procedure-call-based implementations sit near network latency (jobs over networks).]
18-646 – How to Write Fast Code II? 19
Outline
Landscape of Computing Platforms
Hardware Architectures
Multicore vs Manycore
Instruction level parallelism
SIMD
Simultaneous multithreading
Memory hierarchy
System hierarchy
How to Write Fast Code?
18-646 – How to Write Fast Code II? 20
How to Write Fast Code?
Fast Platforms
Multicore platforms
Manycore platforms
Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
We focus on the fast hardware in this lecture
Combines with good techniques to produce fast code…
…in order to solve a problem or improve an outcome
18-646 – How to Write Fast Code II? 21
How to Write Fast Code?
Fast Platforms
Multicore platforms
Manycore platforms
Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
Recognizing levels of concurrency in an application
Effective mapping of concurrency in an application with the parallelism of a platform
→ Fast code
18-646 – How to Write Fast Code II?
22
Concurrency Opportunity Recognition
The Application Developer
Application-level Concurrency
The Problem-Solving Process
18-646 – How to Write Fast Code II? 23
Distinction: Concurrency vs. Parallelism
Concurrency
The property of an application that allows for tasks to have the potential to be executed simultaneously
The application architecture in which more than one task is active and able to make progress at one time
We expose concurrency in our applications.

Parallelism
The property of a platform that allows for tasks to have the potential to be executed simultaneously
The platform architecture in which more than one task can be active and make progress at the same time
We exploit parallelism in our platforms.
18-646 – How to Write Fast Code II?
24
The Application Developer
Writing fast code is a process coherent with “general problem solving behavior”
– Newell and Simon, Human Problem Solving (1972), pp. 72-73
The process of problem solving involves:
1. Understand the current state
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices
18-646 – How to Write Fast Code II? 25
k-means Problem
Find k cluster centers that minimize the distance from each data point to a cluster center
Important algorithm in machine learning:
Statistical data analysis
Vector quantization (Speech Recognition)
NP-hard for arbitrary input
The k-means algorithm frequently finds a reasonable solution quickly
Issues:
Worst-case running time is super-polynomial
Approximation can be arbitrarily bad
18-646 – How to Write Fast Code II? 26
k-means Problem
Find k cluster centers that minimize the distance from each data point to a cluster center
[Figure: data points grouped into clusters, each with a cluster center (centroid)]
k: Number of clusters (defined a priori)
Cluster: Assignment of data points to a class
Cluster center: mean (μ) of the data points in a cluster
18-646 – How to Write Fast Code II?
27
Related Problems and Algorithms
k-means++: Maximize scattering on initial cluster centers
KD-trees: Fast k-means – pre-compute distances between data points
x-means: k-means with efficient estimation of the number of classes
Gaussian Mixture Models:
Probabilistic assignments to clusters
Multivariate Gaussian distributions instead of means
Expectation Maximization algorithms (EM algorithms)
Find maximum likelihood estimates of parameters in a statistical model, where the model depends on unobserved latent variables.
Expectation Maximization Algorithms for Conditional Likelihoods
Estimate parameters in a statistical model to optimize conditional likelihood
(where the objective function is a rational function)
18-646 – How to Write Fast Code II? 28
k-means Algorithm (“Lloyd’s algorithm”)
Given an initial set of k means:
Expectation Step: Assign each observation to the cluster with the closest mean
Maximization Step: Calculate the new means to be the centroid of the observations in the cluster.
Iterate until convergence or stopping criteria met
18-646 – How to Write Fast Code II? 29
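A compact sketch of one such iteration in plain C, assuming row-major flat arrays and illustrative names (the course's reference implementation, kmeans/seq_kmeans.c, appears later in these slides and differs in detail):

#include <float.h>
#include <stdlib.h>

/* One Lloyd iteration. Data is row-major: objects[n*D + d], clusters[c*D + d].
   E-step: assign each of the N points to its nearest of the K centers.
   M-step: recompute each center as the mean of the points assigned to it.
   Returns the number of points whose assignment changed. */
int lloyd_iteration(int N, int D, int K,
                    const float *objects, float *clusters, int *membership)
{
    int changed = 0;
    float *sums   = calloc((size_t)K * D, sizeof(float));
    int   *counts = calloc((size_t)K, sizeof(int));

    for (int n = 0; n < N; n++) {                       /* E-step */
        int best = 0;
        float best_d = FLT_MAX;
        for (int c = 0; c < K; c++) {
            float d = 0.0f;
            for (int j = 0; j < D; j++) {               /* squared Euclidean distance */
                float diff = objects[n*D + j] - clusters[c*D + j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        if (membership[n] != best) { membership[n] = best; changed++; }
        counts[best]++;
        for (int j = 0; j < D; j++)
            sums[best*D + j] += objects[n*D + j];
    }

    for (int c = 0; c < K; c++)                         /* M-step */
        if (counts[c] > 0)
            for (int j = 0; j < D; j++)
                clusters[c*D + j] = sums[c*D + j] / counts[c];

    free(sums);
    free(counts);
    return changed;
}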
The Algorithm
Example: k=5
Distance metric = Euclidean; Dimensions = 2
1. Randomly select k cluster Centers
2. Assign closest Center to each data point
3. Update Centers based on assignments from (2)
4. Re-iterate steps 2-3 until convergence or stopping criteria met
18-646 – How to Write Fast Code II? 30
The Phases
1. Initialization: Randomly select k cluster centers
   Select k samples from the data as initial centers [Forgy Partition]
2. Expectation: Assign each data point to the closest center
   Compare each data point (N) to each cluster center (k)
   Distance metric: Euclidean distance (D dimensions)
3. Maximization: Update centers based on assignments
   For each cluster (k), compute the mean (D dimensions) of the data points assigned to that cluster
4. Evaluate: Re-iterate steps 2-3 until convergence or stopping criteria met
   Percentage of data points re-assigned
   Number of iterations (2-3)
18-646 – How to Write Fast Code II? 35
A Fast Implementation of k-means
Following the process of problem solving with k-means:
1. Understand the current state
   Running on a platform
   Using a specific set of resources
   Achieving a specific performance
   Meeting a specific criteria/requirement
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices

Assumption:
Starting from a functionally correct reference implementation
Implication:
Must observe the current state and implementation requirements before starting to solve a problem
18-646 – How to Write Fast Code II? 36
Understanding the Current State
Running on a platform
Platform: Linux + GCC on x86 multicore processor
Using a specific set of resources (ghcXX)
Computation: ? cores, 2 to ?-way SIMD
Data: ??KB L1, ??KB L2, shared L3 cache, ??GB DRAM
Synchronization: on-chip shared-memory abstraction
Achieving a specific performance
As measured in Mini-Project 1
Meeting a specific criteria/requirement
Matrix-Multiply
k-means
18-646 – How to Write Fast Code II? 37
What to Measure for Performance?
Performance Analysis: Roofline Model
A few simple techniques:
Observe the phases of execution
Characterize the execution time breakdowns
Reason about why a piece of code is slow
Identify performance bottlenecks
[Figure: RTF (Real Time Factor = Proc Time / Real Time) by phase — Sequential implementation: 2.623, 0.474, 0.073; Parallel implementation: 0.148, 0.103, 0.043, 0.008; legend: Phase 1, Phase 2, Phase 3, Seq. Overhead]
18-646 – How to Write Fast Code II?
38
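One simple way to obtain this kind of per-phase breakdown is to bracket each phase with wall-clock timestamps. A minimal sketch using omp_get_wtime(); the phase functions here are hypothetical placeholders, not the mini-project harness:

#include <stdio.h>
#include <omp.h>

/* Placeholders for the real phase code (hypothetical names). */
static void expectation_step(void)  { /* ... assign points to nearest centers ... */ }
static void maximization_step(void) { /* ... recompute cluster centers ... */ }

/* Bracket each phase with wall-clock timestamps and report the breakdown. */
static void run_and_profile(int iters) {
    double t_e = 0.0, t_m = 0.0;
    for (int it = 0; it < iters; it++) {
        double t0 = omp_get_wtime();
        expectation_step();                  /* Phase 2 */
        double t1 = omp_get_wtime();
        maximization_step();                 /* Phase 3 */
        double t2 = omp_get_wtime();
        t_e += t1 - t0;
        t_m += t2 - t1;
    }
    printf("Expectation:  %.3f s\nMaximization: %.3f s\n", t_e, t_m);
}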
Current state: k-means algorithm
4 phases (Initialization, Expectation, Maximization, Evaluate)
Majority of time spent in the Expectation and Maximization phases
Entire data set can fit in memory on a single machine
Number of samples (N) and feature dimensions (D) vary significantly
Evaluation for any number of clusters (2 ≤ k ≤ 300)
Example data sets:
ionosphere_scale: 351 samples, 34 dimensions
svmguide: 7089 samples, 4 dimensions
cod-rna: 59535 samples, 8 dimensions
ijcnn1: 191681 samples, 22 dimensions
18-646 – How to Write Fast Code II?
39
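As a rough check on the "fits in memory" point (my arithmetic, not from the slides): the largest set, ijcnn1, is about 191681 × 22 × 4 B ≈ 17 MB as single-precision floats, so the entire data set fits comfortably in DRAM, though not entirely within an 8 MB L3 cache.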
A Fast Implementation of k-means
Following the process of problem solving with k-means:
1. Understand the current state
2. Observe the internal representation
   Application structure
   Identified four phases of execution
   Implementation concerns
   Task considerations
   Data representations
   Concurrency opportunities
3. Search among alternatives
4. Select from a set of choices
18-646 – How to Write Fast Code II? 40
Example Code (Initialize)
kmeans/seq_kmeans.c lines: 116-119
…
/* pick first numClusters elements of objects[] as initial cluster centers */
for (i=0; i<numClusters; i++)
    for (j=0; j<numCoords; j++)
        clusters[i][j] = objects[i][j];
…

kmeans/seq_kmeans.c lines: 148-162 – prepare for next iteration
…
for (i=0; i<numClusters; i++) {
    for (j=0; j<numCoords; j++) {
        clusters[i][j] = newClusters[i][j] / newClusterSize[i];
        newClusters[i][j] = 0.0;    /* set back to 0 */
    }
    newClusterSize[i] = 0;          /* set back to 0 */
}
…
18-646 – How to Write Fast Code II?
45
Example Code (Maximization)
…
/* update new cluster centers : sum of objects located within */
newClusterSize[index]++;
for (j=0; j<numCoords; j++)
    newClusters[index][j] += objects[i][j];
…
clusters[i][j] = newClusters[i][j] / newClusterSize[i];
newClusters[i][j] = 0.0;    /* set back to 0 */
…
newClusterSize[i] = 0;      /* set back to 0 */
…
Possible concurrencies:
N (histogram computation into k bins)
D (independent)
18-646 – How to Write Fast Code II?
46
The phases – concurrency
1. Initialization: Randomly select k cluster centers
2. Expectation: Assign closest center to each data point
   N (independent)
   k (min reduction)
   D (sum reduction)
3. Maximization: Update centers based on assignments
   D (independent)
   N (histogram computation into k bins)
4. Evaluate: Re-iterate steps 2-3 until convergence
18-646 – How to Write Fast Code II? 47
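Because the maximization step is effectively a histogram into k bins, one common thread-parallel mapping gives each thread private partial sums that are merged afterwards. A hedged OpenMP sketch (array names mirror the reference code, but this is illustrative, not the course solution):

#include <omp.h>
#include <stdlib.h>

/* Thread-parallel M-step accumulation: each thread sums into private copies
   of newClusters / newClusterSize, then the private partials are merged into
   the shared arrays. This avoids an atomic update for every data point. */
void accumulate_clusters(int N, int D, int K,
                         const float *objects, const int *membership,
                         float *newClusters, int *newClusterSize)
{
    #pragma omp parallel
    {
        float *localSum  = calloc((size_t)K * D, sizeof(float));
        int   *localSize = calloc((size_t)K, sizeof(int));

        #pragma omp for nowait
        for (int n = 0; n < N; n++) {
            int c = membership[n];
            localSize[c]++;
            for (int j = 0; j < D; j++)
                localSum[c*D + j] += objects[n*D + j];
        }

        #pragma omp critical    /* merge this thread's partials */
        {
            for (int c = 0; c < K; c++) {
                newClusterSize[c] += localSize[c];
                for (int j = 0; j < D; j++)
                    newClusters[c*D + j] += localSum[c*D + j];
            }
        }
        free(localSum);
        free(localSize);
    }
}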
A Fast Implementation of k-means
Following the process of problem solving with k-means:
1. Understand the current state
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices
18-646 – How to Write Fast Code II? 48
Search Among Alternatives
Given the observed internal representations and the concurrency opportunities…
What are the implementation alternatives?
Different mappings of application concurrency to platform parallelism
The search process
More complex than one application concurrency to one platform parallelism
May want to sequentialize some operations:
Some parallel operations are not as “work-efficient” as sequential operations
Reduction – sequential: O(N), parallel: O(N log N)
One level of concurrency could map to multiple levels of parallelism
18-646 – How to Write Fast Code II? 49
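For the reduction example above, OpenMP's reduction clause is the usual compromise: each thread accumulates a private partial sum in O(N) total work, and only the per-thread partials are combined at the end. A minimal sketch (illustrative, not from the course code):

#include <omp.h>

/* Parallel sum with O(N) total work: each thread keeps a private partial
   sum, and OpenMP combines the per-thread partials when the loop ends. */
double parallel_sum(const double *x, long n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++)
        sum += x[i];
    return sum;
}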
Multicore and Manycore Parallelism
Similar in scaling trends:
Increasing vector unit width
Increasing numbers of cores per die
Increasing bandwidth to off-chip memory
Different in optimization points
[Figure: multicore chip — cores with SIMD units and per-core caches on a shared memory bus to DRAM — alongside a manycore chip with many small cores, local memories, and caches]
18-646 – How to Write Fast Code II? 50
Memory Hierarchy
Cache:
A piece of fast memory close to the compute modules in a microprocessor
Allows faster access to a limited amount of data
Physical properties of wires and transistors determine the trade-off between cache capacity and access throughput/latency

Memory hierarchy of an Intel Core i7-2600K @ 3.4 GHz, viewed from one core — (capacity), (throughput), (latency):
Network: exa-bytes, 0.1 GB/s, ~1-10B cycles
Disk: 3 TB, 6 GB/s, ~1M cycles
DRAM: 16 GB, 34.08 GB/s, ~200 cycles
L3 cache: 8 MB, 108.8 GB/s, 26-31 cycles
L2 cache: 256 KB, 108.8 GB/s, 12 cycles
L1 data cache: 32 KB, 163.2 GB/s, 4-6 cycles
Registers
18-646 – How to Write Fast Code II? 51
Mapping Concurrency to Parallelism
How does it map to the platform?
SIMD level parallelism
Core level parallelism
How does it map to the cache hierarchy?
What data is required for each concurrent operation?
What are the synchronization points in the algorithm?
Expectation & Maximization Phases
SIMD & core-level parallelism across data points (N)
Update membership for each data point sequentially
Compute distance to each cluster center and select index with min. distance
Histogram computation (summation / assignment count for new clusters)
Other possible concurrency mappings?
18-646 – How to Write Fast Code II? 52
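A hedged sketch of the mapping just described for the expectation phase: core-level parallelism over the N data points via OpenMP, with the innermost loop over D left to the compiler (or an explicit simd pragma) to vectorize. Names mirror the reference code, but this is illustrative, not the mini-project solution:

#include <float.h>
#include <omp.h>

/* E-step: each data point independently finds its nearest center.
   The outer loop over N maps to cores; the innermost loop over D is the
   SIMD-friendly part (unit-stride loads, multiply-add, sum reduction). */
void assign_membership(int N, int D, int K,
                       const float *objects, const float *clusters,
                       int *membership)
{
    #pragma omp parallel for
    for (int n = 0; n < N; n++) {
        int best = 0;
        float best_d = FLT_MAX;
        for (int c = 0; c < K; c++) {
            float d = 0.0f;
            #pragma omp simd reduction(+ : d)
            for (int j = 0; j < D; j++) {
                float diff = objects[n*D + j] - clusters[c*D + j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        membership[n] = best;
    }
}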
A Fast Implementation of k-means
Following the process of problem solving with k-means:
1. Understand the current state
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices
   Does the solution meet the required criteria?
How to evaluate a mapping?
Efficiency: Runs quickly, makes good use of computational resources
Simplicity: Easy to understand code is easier to develop, debug, verify and modify
Portability: Should run on widest range of parallel computers
Scalability: Should be effective on a wide range of processing elements
Other considerations: Practicality, Hardware, Engineering cost
18-646 – How to Write Fast Code II? 53
Evaluate Choice
Expectation & Maximization Phases
SIMD & core-level parallelism across data-points (N)
Update membership for each data point sequentially
Histogram computation (summation / assignment count for new clusters)
→ OpenMP
How can we evaluate the choice and make a decision?
Efficiency
Simplicity / Maintainability
Portability
Scalability
18-646 – How to Write Fast Code II? 54
How to write fast code
Expose concurrencies in applications and algorithms
2/9 – “Concurrency Opportunity Recognition”
Mini-Projects (1-3) & Term Project
Exploit parallelisms on the application platform
2/4 – “Advanced Parallel Hardware Architectures”
Mini-Projects (1-3) & Term Project
Explore mappings between concurrency and parallelism
The rest of the semester…
Abstractions to support mapping of concurrencies to parallelisms:
OpenMP [Section 1]
CUDA [Section 2]
Map-Reduce [Section 3]
18-646 – How to Write Fast Code II?
55
How to Write Fast Code?
Fast Platforms
Multicore platforms
Manycore platforms
Cloud platforms
Good Techniques
Data structures
Algorithms
Software Architecture
Recognizing levels of concurrency in an application
Effective mapping of concurrency in an application with the parallelism of a platform
→ Fast code
18-646 – How to Write Fast Code II?
56