Problem 1: Locking and Transactional Memory
Consider the following very simple banking system, where all account values are whole dollars.
// Account balances in whole dollars
int balance[NACCT];
The only supported banking operations are to transfer money between accounts and to check that the sum
of all balances equals or exceeds some threshold value.
// Transfer money between accounts.
// Do not allow from_acct to become negative
boolean transfer(int to_acct, int from_acct, int amount);
// Check that sum of all accounts equals or exceeds some threshold
boolean check_sum(int threshold);
If a transfer operation would cause an account to become overdrawn, function transfer should return
false without changing any balances. (You can assume that amount is not negative.)
Function check_sum must be implemented in such a way that it gets a value for the sum that would be a
valid snapshot of the state of the balances for some sequentially consistent ordering of the transactions.
2
1A: A Banking System with Locks
Suppose you are provided a mutual exclusion lock for each account:
// Support operations acquire() and release()
int lock[NACCT];
Fill in code for these two operations using locking. You may not add any additional locks. Make sure your
code cannot deadlock. Try to minimize the amount of time that locks are held. 1. Transfer
boolean transfer(int to_acct, int from_acct, int amount) { boolean ok;
int amin = min(from_acct, to_acct);
int amax = max(from_acct, to_acct);
// Acquire locks in ascending order of account number acquire(&lock[amin]); acquire(&lock[amax]);
ok = balance[from_acct] >= amount;
if (ok) {
balance[from_acct] -= amount;
balance[to_acct] += amount;
}
release(&lock[amax]); release(&lock[amin]);
return ok; }
2. Check Sum
boolean check_sum(int threshold); int sum = 0;
int a;
// Must lock down all accounts. Can sum as acquire locks for (a = 0; a < NACCT; a++) {
acquire(lock[a]);
sum += balance[a];
}
for (a = 0; a < NACCT; a++) release(lock[a]);
return sum >= threshold; }
3
1B: A Banking System with Transactional Memory
Suppose we have a transactional memory system supporting the following operations:
// Begin transaction
void xbegin();
// Abort transaction
void xabort();
// (Attempt to) commit transaction.
// Returns true if successful, false if failed boolean xend();
Rewrite the two functions using transactional memory primitives rather than locks. 1. Transfer
boolean transfer(int to_acct, int from_acct, int amount) { boolean ok;
while (true) { xbegin();
ok = balance[from_acct] >= amount; if (ok) {
balance[from_acct] -= amount;
balance[to_acct] += amount;
}
if (xend()) break;
}
return ok;
}
2. Check Sum
boolean check_sum(int threshold) int sum = 0;
while (true) { int a;
sum = 0;
xbegin();
for (a = 0; a < NACCT; a++) {
sum += balance[a];
}
if (xend()) break;
}
return sum >= threshold; }
4
3. If the system performs millions of transactions per second over thousands of bank accounts, what problem to you forsee for the implementation based on transactional memory? Assume the transac- tional memory is implemented using lazy versioning and optimistic conflict detection.
Chances are that check sum would keep failing, because ongoing transfers would keep causing writes to the array balance.
5
1C: Improving transactional memory performance
Suppose that you can be guaranteed that there will be only one outstanding call to check sum at a time, and that it will be relatively infrequent (around once per hour.) Devise a simple scheme, without any locks, where the system can temporarily force all transfers to wait while a sum is being computed, thereby minimizing the chance of the sum computation failing. You may make use of the following additional global variable:
boolean summing = false;
1. Transfer
boolean transfer(int to_acct, int from_acct, int amount) { boolean ok;
while (true) { xbegin();
if (summing) xabort();
ok = balance[from_acct] >= amount; if (ok) {
balance[from_acct] -= amount;
balance[to_acct] += amount;
}
if (xend()) break;
}
return ok;
}
2. Check Sum
boolean check_sum(int threshold) int sum = 0;
summing = true; // Must be done outside of transaction while (true) {
int a; sum = 0;
xbegin();
for (a = 0; a < NACCT; a++) {
sum += balance[a];
}
if (xend()) break;
}
summing = false;
return sum >= threshold;
}
6
Problem 2: Processor Scaling
This problem explores the two scaling models presented on slides 5–9 in the lecture on Heterogeneous Parallelism:
http://www.cs.cmu.edu/ ̃418/lectures/21 heterogeneity.pdf.
These are based on the paper “Amdahl’s Law in the Multicore Era,” by Mark D. Hill and Michael R. Marty
of the University of Wisconsin, published in IEEE Computer, July, 2008.
In both models, we express computing performance and resources (e.g., chip area) relative to a baseline
processor. In scaling to a processor with r resource units (scaled such that the baseline processor has
r = 1), we obtain a processor with single-threaded performance perf (r) (scaled such perf (1) = 1.)
√
The slides show graphs for the case of perf (r) =
increasing processing resources will improve performance, but not in a linear way. Increasing to large values of r yields diminishing returns in terms of single-threaded processor performance. We will generalize this model to one with perf (r) = rα, where α is value with 0.0 ≤ α ≤ 1.0. (For example, the slides are based on α = 0.5.)
For scaling the system design to use n resource units, two options are considered:
Homogeneous: We create a new processor design that uses r resource units, and populate the system with p = n/r identical processors. In this case, p must be an integer, but r need not be. For example, to get p = 5 when n = 16, we would have r = 16/5 = 3.2.
Heterogeneous: We create a design that uses r resource units (where r is an integer), and then create a system with one of these more powerful processors, plus n−r baseline processors, giving p = n−r+1 total processors.
As with the conventional presentation of Amdahl’s Law, we assume that the problem to be solved has some fraction f that can use arbitrary levels of parallelism, while the remaining fraction 1 − f must be executed sequentially.
With performance normalized to a single baseline processor, Amdahl’s Law states that the speedup over the baseline is given by the equation
S= 1 (1) Tseq + Tpar
where Tseq is the time required to execute the sequential portion of the code and Tpar is the time required to execute the parallel portion of the code, with both of these using all available resources.
r. This function is plausible—it captures the idea that
7
2A: Homogeneous Model Speedup
Slide 6 gives the following equation for the speedup in the homogeneous model:
Sho = 1 (2)
Explain how Equation 2 follows the form of Equation 1.
1. What resources are used and what is the duration of Tseq?
The sequential part of the computation involves computing fraction 1 − f of the total using a processor with performance perf (r), giving Tseq = (1 − f )/perf (r).
2. What resources are used and what is the duration of Tpar?
The parallel part of the computation involves computing fraction f, but using p = n/r processors, each with computing power perf (r).
1−f + f n perf (r) perf (r)· r
8
2B: Heterogeneous Model Speedup
Slide 8 gives the following equation for the speedup in the heterogenous model:
She = 1 (3)
Explain how Equation 3 follows the form of Equation 1.
1. What resources are used and what is the duration of Tseq?
As before, we have Tseq = (1 − f )/perf (r).
2. What resources are used and what is the duration of Tpar?
The parallel part of the computation involves computing fraction f, but using both the large processor and the smaller ones. To maximize performance, we split the load, mapping part onto the larger processor, and the remainder on the others, fully using the combined computing power perf (r) + (n − r).
1−f + f
perf (r) perf (r)+n−r
9
2C: Optimizing Parameter r for the Homogeneous Model
For either of these models, we can find a value r∗ that achieves maximum speedup as a function of f, n, and α. Keeping n fixed at 256 but varying parameters α and f, fill in the table below giving the values of r∗ and p = n/r∗ for the homogeneous model, and the resulting speedups.
Hint: The number of processors p = n/r must be an integer ranging from 1 to n. You can simply compute 2 for all possible values and select the one with maximum speedup.
You may find the following spreadsheet a useful starting point:
http://www.cs.cmu.edu/ ̃418/exercises/ex5-model.xlsx
This spreadsheet generates the graphs shown on slide 9 for the case of n = 256. You can adapt this spreadsheet to determine the values of r∗.
αf
r∗ p = n/r∗ Sho
0.5 0.700 0.5 0.970 0.5 0.995
128.000
2
17.41
8.00
32
46.90
1.286
199
113.42
0.7 0.700 0.7 0.970 0.7 0.995
256.0
1
48.50
18.266
14
77.02
3.012
85
129.51
10
2D: Optimizing Parameter r for the Heterogeneous Model
Fill in the table below giving the values of r∗ and p = n − r∗ + 1 for the Heterogeneous model, and the resulting speedups.
αf
r∗ p = n − r∗ + 1 She
0.5 0.700 0.5 0.970 0.5 0.995
170
87
33.25
72
185
116.62
28
229
191.94
0.7 0.700 0.7 0.970 0.7 0.995
163
94
71.75
65
192
160.18
27
230
214.59
11
Problem 3: Cache Performance of Multi-Threaded Program
This problem relates to the image processing task described in Slides 40–46 of Lecture 22: http://www.cs.cmu.edu/ ̃418/lectures/22 dsl.pdf.
You have been hired by a movie production company to write programs used for rendering images. Each frame is h × w pixels, with a pixel represented with data type float. Your first task is to write filtering programs that apply k × k filters to an image. Code to do this is shown for on Slide 42 for the case of k = 3. The filters you must implement can be divided into two phases: one horizontal filtering stage and one vertical stage, as shown in Slide 43.
Consider the code of Slide 43. In this version, you allocate a temporary buffer large enough to hold all w × (h + k − 1) pixels of the result generated by the first phase and used by the second. The total work performed is proportional to 2k · w · h. We call this the baseline code.
In the ideal case, the cache would be large enough to generate the temporary buffer and keep it resident in cache while it is being consumed in the second phase. Unfortunately, that is not feasible. You have to work within the following constraints:
• Theimagesare4K×4K,i.e.,w=h=4096.
• The machine has 16 cores sharing a single 2 MB (221-byte) cache.
You therefore consider the code of Slide 46, where you choose a chunk size c, and size the temporary buffer to hold c + k − 1 rows. The code then processes the image in multiple passes. Each pass has two phases, with the first filling the temporary buffer based on c + k − 1 rows of the input, and the second generating c rows of the output. By selecting a small enough chunk size, the temporary buffer will be stored in the cache during phase 1 and remain in the cache as it is being used in phase 2.
There are many images to process, and so you plan to use the multiple cores to run multiple threads, each operating on separate images. However, you want to select a chunk size c such that all of the threads will keep their temporary buffers resident in cache.
12
A. Consider the single-threaded case. Your colleague states that as long as the cache can hold 2(c + k − 1) rows, the temporary buffer will remain resident in cache, because the first phase only accesses the input and the temporary buffer, while the second phase only accesses the tempoary buffer and the output. Assuming the cache uses an LRU replacement policy, and ignoring conflict misses, explain why this estimate may be too optimisic.
The last rows of the input are accessed after some of the upper rows of the temporary buffer have been generated. With LRU, the cache might start evicting parts of the temporary buffer during phase 2, even the program is done with the input.
B. What do you propose as a safe rule for the chunk size? (Some advice: it’s better to keep this simple and conservative. If you have to run a cache simulator to answer this question, you’re overdoing it.)
Make sure the input, temporary buffer, and output of a phase can all be held in cache, requiring 3c + 2(k − 1) rows.
C. How many rows can safely fit in the cache? Remember that the input rows must be padded with an extra k − 1 pixels, and you want to avoid false sharing. Assume a cache block size of 64 bytes.
Pad each row with 64 bytes, giving a total of 22+12 + 64 bytes. The cache can hold 127 of these.
13
D. When the caching works out, you find that the total amount of work is a good predictor of the program execution time. Give a formula, in terms of c and k, of how much work the chunked version must do, relative to the baseline version.
Baseline: Processing c rows requires work 2c · w. Chunked: Processing c rows requires (2c + k − 1) · w. Ratio: (2c + k − 1)/2c.
E. Consider the single-threaded case, and consider k ∈ {3, 5, 7}. For each of these cases, what would be the maximum chunk size? Give an estimate of the execution time, relative to the baseline running on a machine with a very large cache.
k=3: c=38,timeratio=1.03 k=5: c=34,timeratio=1.06 k=7: c=30,timeratio=1.10
F. For values of k ∈ {3, 5, 7}, determine the number of threads that would lead to optimum performance, considering that increasing the number of threads requires decreasing the chunk size. What would be the chunk size? Give an estimate of the execution time, relative to the baseline running on a single-threaded machine with a very large cache.
k=3: 7threads,c=2,timeratio=0.21 k=5: 3threads,c=6,timeratio=0.44 k=7: 2threads,c=9,timeratio=0.67
14