
18-646 – How to Write Fast Code II?
Carnegie Mellon University Ian Lane

18-646 – Week 2

Homework 1

Lecture Questions
https://canvas.cmu.edu/courses/21510/quizzes/55580

Outline – Follow on from 2/4
— Landscape of Computing Platforms
— Hardware Architectures
  — Multicore vs Manycore
  — Instruction level parallelism
  — SIMD
  — Simultaneous multithreading
  — Memory hierarchy
  — System hierarchy
— How to Write Fast Code?

(5) Memory Hierarchy
— Cache:
  — A piece of fast memory close to the compute modules in a microprocessor
  — Allows faster access to a limited amount of data
  — Physical properties of wires and transistors determine the trade-off between cache capacity and access throughput/latency
— Definitions:
  — Capacity: size, e.g. number of bytes of data
  — Latency: from start to finish, in units of time, e.g. CPU clock cycles
  — Throughput: tasks accomplished per unit time, e.g. GB/s

[Figure: CPU connected to main memory. Memory capacity: 8 GB, latency: 200 cycles, throughput: 34 GB/s]
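As a quick sanity check on these units, using the example numbers above (and assuming a 3.4 GHz clock and 64-byte cache lines, which are not stated on this slide):

\[
\text{latency} = \frac{200~\text{cycles}}{3.4~\text{GHz}} \approx 59~\text{ns},
\qquad
\text{throughput} = \frac{34~\text{GB/s}}{64~\text{B per cache line}} \approx 5.3 \times 10^{8}~\text{cache lines/s}
\]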

(5) Memory Hierarchy
— Cache:
— A piece of fast memory close to the compute modules in a microprocessor
— Allows faster access to a limited amount of data
— Physical properties of wires and transistors determine the trade-off between cache capacity and access throughput/latency

Memory hierarchy, viewed from one core of an Intel Core i7-2600K @ 3.4 GHz (capacity, throughput, latency):
— Network:        Exa-Bytes, 0.1 GB/s,   ~1-10B cycles
— Disk:           3 TB,      6 GB/s,     ~1M cycles
— DRAM:           16 GB,     34.08 GB/s, ~200 cycles
— L3 cache:       8 MB,      108.8 GB/s, 26-31 cycles
— L2 cache:       256 KB,    108.8 GB/s, 12 cycles
— L1 data cache:  32 KB,     163.2 GB/s, 4-6 cycles
— Registers

Working with a Memory Hierarchy
— Writing fast code → get data from the fastest (closest) level
— When requesting data from a level of limited size in the memory hierarchy, there are two outcomes:
  — Hit: Data is available in the level
  — Miss: Data is missing from the level
— For fast code → minimize misses, maximize hits
— What application behavior can a memory hierarchy help with? → Locality

Working with a Memory Hierarchy
— Principle of Locality
  — The phenomenon of the same value or related storage locations being frequently accessed; usually amenable to performance optimization
— Temporal Locality
  — Reuse of specific data and/or resources within relatively small time durations
— Spatial Locality
  — Use of data elements within relatively close storage locations (see the sketch below)
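As a small illustration (a minimal sketch, not taken from the course code), the two loop nests below touch the same data but differ in spatial locality because C arrays are row-major:

/* A minimal locality sketch (illustrative only, not from the lecture code). */
#include <stdio.h>

#define N 1024
static float a[N][N];

int main(void) {
    float sum = 0.0f;

    /* Good spatial locality: a[i][j] and a[i][j+1] are adjacent in memory
       (C is row-major), so each cache line fetched is fully used. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: consecutive accesses are N floats apart,
       so most of each fetched cache line goes unused before eviction. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    /* Temporal locality: sum is reused on every iteration and stays in a
       register or the closest cache level. */
    printf("%f\n", sum);
    return 0;
}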

When would you get a miss?
— Three types of misses in a memory hierarchy
— Compulsory misses: caused by the first reference
— Capacity misses: due to the finite size of the memory hierarchy
— Conflict misses: due to policy of replacement, potentially avoidable
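Capacity misses in particular can often be reduced by restructuring loops so the working set fits in cache. A hedged sketch (the tile size and the transpose example are illustrative, not from the lecture):

#define N 4096
#define B 64   /* tile size: two B x B float tiles (~32 KB) fit in a typical L1/L2 */

static float src[N][N], dst[N][N];

/* Blocked (tiled) transpose: each B x B tile is reused while it is still
   resident in cache, instead of streaming the whole N x N arrays repeatedly. */
void transpose_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}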

Exploiting Benefits of Mem Hierarchy
— Motivation:
  — When applications exhibit data locality, the cache/memory hierarchy provides faster access to frequently used data
— Advantages:
  — Allows faster access to data for computation
  — Caches are managed storage: transparent to the end-user for functional purposes
  — Lower energy consumption on a cache hit
— Disadvantages:
  — When multiple threads share a cache, they compete for cache space

Outline – Follow on from 2/4
— Landscape of Computing Platforms
— Hardware Architectures
  — Multicore vs Manycore
  — Instruction level parallelism
  — SIMD
  — Simultaneous multithreading
  — Memory hierarchy
  — System hierarchy
— How to Write Fast Code?

(6) System Architecture
— A Journey through the opportunities to write fast code with:
  — Multicore
  — Manycore
  — The Cloud

Section 1 – Multicore
— Writing fast code to exploit multicore parallelism
— SIMD level parallelism
— Core level parallelism
— Memory hierarchy

[Figure: Multicore chip with multiple cores and SIMD units per core, per-core caches, a shared memory bus, and off-chip DRAM]

Section 2 – Manycore
— GPU System Architecture
[Figure: Host System (CPU) connected to Device Systems (GPUs) over PCI-E 2.0, bandwidth ~5.2 GB/s. Key considerations: moving data across this link, and each device's own memory hierarchy]

Section 2 – Manycore
— GPU System Architecture
[Figure: Host System (CPU) with its DRAM connected to Device Systems (GPUs), each with its own DRAM, over PCI-E 2.0, bandwidth ~5.2 GB/s]

Section 3 – The Cloud
[Figure: The Cloud: multiple CPU nodes connected over a network]

Section 3 – The Cloud
[Figure: The Cloud: multiple host nodes, each with several GPUs, connected over a network]

A Game of Granularity
[Figure: A Game of Granularity (© Jike Chong 2009). Implementation styles plotted by synchronization overhead (seconds, 10^-9 to 10^0) against task management overhead (instructions, 10^1 to 10^9):
  — Manycore throughput-optimized implementations (HW task queue)
  — Multicore task queue-based implementations (SW task queue)
  — Pthread-based implementations (OS processes)
  — MPI-based and remote-procedure-call-based implementations (jobs over networks)
Reference points on the synchronization axis: on-die cache latency, shared off-chip memory latency, system latency, and network latency]

Outline
— Landscape of Computing Platforms
— Hardware Architectures
  — Multicore vs Manycore
  — Instruction level parallelism
  — SIMD
  — Simultaneous multithreading
  — Memory hierarchy
  — System hierarchy
— How to Write Fast Code?

How to Write Fast Code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms

Good Techniques
— Data structures
— Algorithms
— Software Architecture

— We focus on the fast hardware in this lecture
— Combined with good techniques, fast hardware produces fast code…
  …in order to solve a problem or improve an outcome

How to Write Fast Code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms

Good Techniques
— Data structures
— Algorithms
— Software Architecture

— Recognizing levels of concurrency in an application
— Effective mapping of concurrency in an application with the parallelism of a platform
→ Fast code

Concurrency Opportunity Recognition
— The Application Developer
— Application-level Concurrency
— The Problem-Solving Process

Distinction: Concurrency vs. Parallelism
Concurrency
— The property of an application that allows for tasks to have the potential to be executed simultaneously
— The application architecture in which more than one task is active and able to make progress at one time
— We expose concurrency in our applications.

Parallelism
— The property of a platform that allows for tasks to have the potential to be executed simultaneously
— The platform architecture in which more than one task can be active and make progress at the same time
— We exploit parallelism in our platforms.

The Application Developer
— Writing fast code is a process coherent with “general problem solving behavior”
– Newell and Simon, Human Problem Solving (1972), pp. 72-73
— The process of problem solving involves:
1. Understand the current state
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices

k-means Problem
— Find k cluster centers that minimize the distance from each data point to a cluster center
— Important algorithm in machine learning:
  — Statistical data analysis
  — Vector quantization (Speech Recognition)
— NP-hard for arbitrary input
— The k-means algorithm frequently finds reasonable solutions quickly
— Issues:
  — Worst-case running time is super-polynomial
  — Approximation can be arbitrarily bad

k-means Problem
— Find k cluster centers that minimize the distance from each data point to a cluster center
— k: Number of clusters (defined a-priori)
— Cluster: Assignment of data points to a class
— Cluster Center: μ (mean) of the data points in a cluster

[Figure: data points grouped into clusters, each with its cluster center (centroid)]
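Written out, the quantity being minimized is the standard k-means objective (implied by, though not printed on, the slide): for data points x_1, …, x_N and centers μ_1, …, μ_k,

\[
J(\mu_1,\ldots,\mu_k) \;=\; \sum_{i=1}^{N} \; \min_{c \,\in\, \{1,\ldots,k\}} \; \lVert x_i - \mu_c \rVert^2 .
\]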

Related Problems and Algorithms
— k-means++: Maximize scattering on initial cluster centers
— KD-trees: Fast k-means – Pre-compute distance between data points
— x-means: k-means with efficient estimation of the number of classes
— Gaussian Mixture Models:
— Probabilistic assignments to clusters
— Multivariate Gaussian distributions instead of means
— Expectation Maximization algorithms (EM algorithms)
— Find maximum likelihood estimates of parameters in a statistical model, where the model depends on unobserved latent variables.
— Expectation Maximization Algorithms for Conditional Likelihoods
— Estimate parameters in a statistical model to optimize conditional likelihood
(where the objective function is a rational function)

k-means Algorithm (“Lloyd’s algorithm”)

— Given an initial set of k means
— Expectation Step: Assign each observation to the cluster with the closest mean
— Maximization Step: Calculate the new means to be the centroid of the observations in the cluster.
— Iterate until convergence or stopping criteria met

The Algorithm
Example: k=5, Distance metric = Euclidean, Dimensions = 2
1. Randomly select k cluster Centers
2. Assign closest Center to each data point
3. Update Centers based on assignments from (2)
4. Re-iterate steps 2-3 until convergence or stopping criteria met
[Figure sequence: steps 1-4 illustrated in turn on the example data set (k=5, 2 dimensions)]

The Phases
1. Initialization: Randomly select k cluster centers
   — Select k samples from the data as initial centers [Forgy Partition]
2. Expectation: Assign each data point to the closest center
   — Compare each data point (N) to each cluster center (k)
   — Distance Metric: Euclidean distance (D dimensions)
3. Maximization: Update centers based on assignments
   — For each cluster (k), compute the mean (D dimensions) of the data points assigned to that cluster
4. Evaluate: Re-iterate steps 2-3 until convergence or stopping criteria met
   — Percentage of data points re-assigned
   — Number of iterations (2-3)
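Putting the four phases together, a minimal sequential sketch in C (the function and variable names are illustrative, not quotes from the seq_kmeans.c reference; membership is assumed initialized to -1 and newClusters/newClusterSize zero-initialized by the caller):

#include <float.h>

/* squared Euclidean distance in D dimensions */
static float dist2(int D, const float *a, const float *b) {
    float d = 0.0f;
    for (int j = 0; j < D; j++) { float t = a[j] - b[j]; d += t * t; }
    return d;
}

void kmeans_sketch(float **objects, int N, int D, int K,
                   float **clusters, int *membership,
                   float **newClusters, int *newClusterSize,
                   float threshold, int maxIters)
{
    /* 1. Initialization: use the first K objects as initial centers (Forgy-style) */
    for (int c = 0; c < K; c++)
        for (int j = 0; j < D; j++)
            clusters[c][j] = objects[c][j];

    float delta = (float)N;                      /* count re-assigned, forces first pass */
    for (int iter = 0; iter < maxIters && delta / N > threshold; iter++) {
        delta = 0.0f;

        /* 2. Expectation: assign each point to the closest center */
        for (int i = 0; i < N; i++) {
            int best = 0;
            float bestDist = FLT_MAX;
            for (int c = 0; c < K; c++) {
                float d = dist2(D, objects[i], clusters[c]);
                if (d < bestDist) { bestDist = d; best = c; }
            }
            if (membership[i] != best) delta += 1.0f;   /* count re-assignments */
            membership[i] = best;
            newClusterSize[best]++;
            for (int j = 0; j < D; j++)
                newClusters[best][j] += objects[i][j];
        }

        /* 3. Maximization: new center = centroid of the points assigned to it */
        for (int c = 0; c < K; c++) {
            for (int j = 0; j < D; j++) {
                if (newClusterSize[c] > 0)
                    clusters[c][j] = newClusters[c][j] / newClusterSize[c];
                newClusters[c][j] = 0.0f;               /* reset for next iteration */
            }
            newClusterSize[c] = 0;                      /* reset for next iteration */
        }
        /* 4. Evaluate: keep iterating while the fraction re-assigned exceeds threshold */
    }
}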

A Fast Implementation of k-means

— Following the process of problem solving with k-means:
1. Understand the current state
   — Running on a platform
   — Using a specific set of resources
   — Achieving a specific performance
   — Meeting a specific criteria/requirement
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices

Assumption: Starting from a functionally correct reference implementation
Implication: Must observe the current state and implementation requirements before starting to solve a problem

Understanding the Current State
— Running on a platform
— Platform: Linux + GCC on x86 multicore processor
— Using a specific set of resources (ghcXX)
  — Computation: ? cores, 2 to ?-way SIMD
  — Data: ??KB L1, ??KB L2, shared L3 cache, ??GB DRAM
  — Synchronization: on-chip shared-memory abstraction
— Achieving a specific performance
  — As measured in Mini-Project 1
— Meeting a specific criteria/requirement
  — Matrix-Multiply
  — k-means

What to Measure for Performance?
— Performance Analysis: Roofline Model
— A few simple techniques:
— Observe the phases of execution
— Characterize the execution time breakdowns
— Reason about why a piece of code is slow
— Identify performance bottlenecks
[Figure: execution time breakdown in RTF (Real Time Factor = Proc Time / Real Time) across Phase 1, Phase 2, Phase 3, and sequential overhead. Sequential implementation: 2.623, 0.474, 0.073. Parallel implementation: 0.148, 0.103, 0.043, 0.008]
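One simple way to get such a breakdown (a sketch, not the course measurement harness; the phase functions are placeholders) is to wrap each phase with a wall-clock timer and, if desired, divide by the real time of the input to get an RTF-style number:

#include <stdio.h>
#include <time.h>

static double now_sec(void) {                 /* POSIX monotonic wall clock */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

/* placeholder phases; in practice these are the application's real phases */
static void phase1(void) {}
static void phase2(void) {}
static void phase3(void) {}

int main(void) {
    double t0 = now_sec(); phase1();
    double t1 = now_sec(); phase2();
    double t2 = now_sec(); phase3();
    double t3 = now_sec();
    printf("phase1 %.3f s  phase2 %.3f s  phase3 %.3f s  total %.3f s\n",
           t1 - t0, t2 - t1, t3 - t2, t3 - t0);
    return 0;
}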

Current state: k-means algorithm

— 4 Phases (Initialization, Expectation, Maximization, Evaluate)
— Majority of time spent on Expectation and Maximization phases
— Entire data set can fit in memory on a single machine
— Number of samples (N) and feature dimensions (D) vary significantly
— Evaluation for any number of clusters (2 ≤ k ≤ 300)
— Example Data Sets:
  — ionosphere_scale: 351 Samples, 34 Dimensions
  — svmguide: 7089 Samples, 4 Dimensions
  — cod-rna: 59535 Samples, 8 Dimensions
  — Ijcnn1: 191681 Samples, 22 Dimensions

A Fast Implementation of k-means

— Following the process of problem solving with k-means:
1. Understand the current state
2. Observe the internal representation
   — Application structure
     — Identified four phases of execution
   — Implementation concerns
     — Task considerations
     — Data representations
     — Concurrency opportunities
3. Search among alternatives
4. Select from a set of choices

Example Code (Initialize)
kmeans/seq_kmeans.c lines: 116-119
….
/* pick first numClusters elements of objects[] as initial cluster centers */
for (i=0; i<numClusters; i++)
    for (j=0; j<numCoords; j++)
        clusters[i][j] = objects[i][j];

Example Code (Maximization)
….
kmeans/seq_kmeans.c lines: 148-162

/* update new cluster centers : sum of objects located within */
newClusterSize[index]++;
for (j=0; j<numCoords; j++)
    newClusters[index][j] += objects[i][j];
...
/* average the sums to replace the old cluster centers,
   then set back to 0 to prepare for the next iteration */
for (i=0; i<numClusters; i++) {
    for (j=0; j<numCoords; j++) {
        if (newClusterSize[i] > 0)
            clusters[i][j] = newClusters[i][j] / newClusterSize[i];
        newClusters[i][j] = 0.0;   /* set back to 0 */
    }
    newClusterSize[i] = 0;         /* set back to 0 */
}

Possible Concurrencies:
— N (Histogram computation into k bins)
— D (independent)

The phases – concurrency
1. Initialization: Randomly select k cluster centers
2. Expectation: Assign closest center to each data point
   — N (independent)
   — k (min reduction)
   — D (sum reduction)
3. Maximization: Update centers based on assignments
   — D (independent)
   — N (Histogram computation into k bins)
4. Evaluate: Re-iterate steps 2-3 until convergence

A Fast Implementation of k-means
— Following the process of problem solving with k-means:
1. Understand the current state
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices

Search Among Alternatives
— Given the observed internal representations and the concurrency opportunities…
— What are the implementation alternatives?
  — Different mappings of application concurrency to platform parallelism
— The search process
  — More complex than mapping one application concurrency to one platform parallelism
  — May want to sequentialize some operations:
    — Some parallel operations are not as “work-efficient” as sequential operations
    — Reduction – sequential: O(N), parallel: O(N log N)
  — One level of concurrency could map to multiple levels of parallelism

Multicore and Manycore Parallelism
— Similar in scaling trends:
— Increasing vector unit width
— Increasing numbers of cores per die
— Increasing bandwidth to off-chip memory
— Different in optimization points
[Figure: Multicore chip with multiple cores and SIMD units per core, per-core caches, a shared memory bus, and off-chip DRAM]

Memory Hierarchy
— Cache:
— A piece of fast memory close to the compute modules in a microprocessor
— Allows faster access to a limited amount of data
— Physical properties of wires and transistors determine the trade-off between cache capacity and access throughput/latency

Memory hierarchy, viewed from one core of an Intel Core i7-2600K @ 3.4 GHz (capacity, throughput, latency):
— Network:        Exa-Bytes, 0.1 GB/s,   ~1-10B cycles
— Disk:           3 TB,      6 GB/s,     ~1M cycles
— DRAM:           16 GB,     34.08 GB/s, ~200 cycles
— L3 cache:       8 MB,      108.8 GB/s, 26-31 cycles
— L2 cache:       256 KB,    108.8 GB/s, 12 cycles
— L1 data cache:  32 KB,     163.2 GB/s, 4-6 cycles
— Registers

Mapping Concurrency to Parallelism
— How does it map to the platform?
  — SIMD level parallelism
  — Core level parallelism
— How does it map to the cache hierarchy?
— What data is required for each concurrent operation?
— What are the synchronization points in the algorithm?
— Expectation & Maximization Phases
— SIMD & core-level parallelism across data-points (N)
— Update membership for each data point sequentially
— Compute distance to each cluster center and select index with min. distance
— Histogram computation (summation / assignment count for new clusters)
— Other possible concurrency mappings?

A Fast Implementation of k-means

— Following the process of problem solving with k-means:
1. Understand the current state
2. Observe the internal representation
3. Search among alternatives
4. Select from a set of choices
   — Does the solution meet the required criteria?
— How to evaluate a mapping?
— Efficiency: Runs quickly, makes good use of computational resources
— Simplicity: Easy to understand code is easier to develop, debug, verify and modify
— Portability: Should run on widest range of parallel computers
— Scalability: Should be effective on a wide range of processing elements
— Other considerations: Practicality, Hardware, Engineering cost

Evaluate Choice
— Expectation & Maximization Phases
— SIMD & core-level parallelism across data-points (N)
— Update membership for each data point sequentially
— Histogram computation (summation / assignment count for new clusters)
→ OpenMP (a sketch follows this list)
— How we can evaluate the choice and make a decision:
  — Efficiency
  — Simplicity / Maintainability
  — Portability
  — Scalability
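A minimal sketch of this mapping with OpenMP (assuming seq_kmeans.c-style arrays; the function name, per-thread buffers, and pragmas are illustrative, not the reference implementation; newClusters and newClusterSize are assumed zeroed before the call):

#include <float.h>
#include <omp.h>
#include <stdlib.h>

/* One Expectation + Maximization iteration, parallelized over data points (N).
   Per-thread partial sums avoid contention on the k-bin histogram. */
void em_iteration_omp(float **objects, int N, int D, int K,
                      float **clusters, int *membership,
                      float **newClusters, int *newClusterSize)
{
    #pragma omp parallel
    {
        /* thread-private accumulators for the histogram-style reduction */
        float *localSums  = calloc((size_t)K * D, sizeof(float));
        int   *localSizes = calloc((size_t)K, sizeof(int));

        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++) {
            int best = 0;
            float bestDist = FLT_MAX;
            for (int c = 0; c < K; c++) {            /* min-reduction over k */
                float d = 0.0f;
                for (int j = 0; j < D; j++) {        /* sum-reduction over D; SIMD-friendly */
                    float t = objects[i][j] - clusters[c][j];
                    d += t * t;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            membership[i] = best;
            localSizes[best]++;
            for (int j = 0; j < D; j++)
                localSums[best * D + j] += objects[i][j];
        }

        #pragma omp critical                         /* merge per-thread partials */
        for (int c = 0; c < K; c++) {
            newClusterSize[c] += localSizes[c];
            for (int j = 0; j < D; j++)
                newClusters[c][j] += localSums[c * D + j];
        }
        free(localSums);
        free(localSizes);
    }

    /* Maximization: divide sums by counts to get the new centers */
    for (int c = 0; c < K; c++)
        for (int j = 0; j < D; j++)
            if (newClusterSize[c] > 0)
                clusters[c][j] = newClusters[c][j] / newClusterSize[c];
}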

How to write fast code
— Expose concurrencies in applications and algorithms
  — 2/9 – “Concurrency Opportunity Recognition”
  — Mini-Projects (1-3) & Term Project
— Exploit parallelisms on the application platform
  — 2/4 – “Advanced Parallel Hardware Architectures”
  — Mini-Projects (1-3) & Term Project
— Explore the mapping between concurrency and parallelism
  — The rest of the semester…
  — Abstractions to support mapping of concurrencies to parallelisms:
    — OpenMP [Section 1]
    — CUDA [Section 2]
    — Map-Reduce [Section 3]

How to Write Fast Code?
Fast Platforms
— Multicore platforms
— Manycore platforms
— Cloud platforms
Good Techniques
— Data structures
— Algorithms
— Software Architecture
— Recognizing levels of concurrency in an application
— Effective mapping of concurrency in an application with the parallelism of a platform
→ Fast code