CMPSC 450
Concurrent Scientific Programming
Introduction
Welcome to the class!
• Class meets MWF 8-8:50AM on Zoom
• Office hours
• Tuesdays and Thursdays, 8PM – 9PM, on Zoom
• By appointment
• Email: use Canvas
• About me:
• Master of Engineering, CSE, Penn State 2001
• 20 years industry experience
• Focus on implementing/optimizing algorithms
Class Overview and Topics:
• Week 1: Introduction, Modern Computer Architecture, ACI-ICDS Introduction
• Week 2: Serial Optimization and Data Access Optimization
• Week 3 & 4: Shared Memory Parallel Algorithm Theory, Exam 1 (2/12)
• Week 4 & 5: Shared Memory Parallel Algorithm Practice: std::threads
• Week 6 & 7: OpenMP
• Week 8: Prefix Sums, Pointer Jumping, Exam 2 (3/10 or 3/15 TBD)
• Week 9 & 10: Networks, Distributed Memory Theory
• Week 11 & 12: MPI
• Week 12 & 13: SUMMA & Cannon Matrix Multiplication, Exam 3 (4/5)
• Week 14: Simulations and Locality
• Week 15: CUDA Overview
• Finals Week: Final (Time TBD).
Class Material
• Textbook: Georg Hager and Gerhard Wellein, Introduction to High Performance Computing for Scientists and Engineers
• ISBN: 978-1-4398-1192-4
• Useful for terminology, practical programming recommendations, analysis of scientific codes
• Available as an e-book (See Library Resources on Canvas page)
• Lecture slides available in Canvas
• Miscellaneous Reading
• Class discussions
This is a new edition!
Grading
• Programming assignments*, 40%
• Class contribution*, 10%
• Mini project, 10%
• 3 In-Class Exams, 30%
• Final exam, 10%
* A score of less than 50% in this section results in an F for the course.
Concerns regarding earned grades must be raised with the instructor within two weeks of the date the grade is posted.
Programming Assignments
• All programming assignments shall run on the ACI-ICS cluster (Roar Supercomputer).
• All programming assignments shall be written in C/C++.
• All source code shall have:
• Consistent formatting
• Author header
• Makefiles
• All write-ups submitted shall be in pdf format following the “How to Write Technical Papers” format.
• Late penalty: 50% (unless prior approval granted)
• Example Topics:
• Benchmarking – memory, functions/subroutines, algorithms
• Code optimization
• Threading
• OpenMP
• MPI
• Advanced parallel algorithms
Generic Rubric (available on Canvas)
CMPSC450 Generic Assignment Rubric
Performance levels: Unsatisfactory (0%), Poor (40%), Satisfactory (75%), Excellent (100%)
Note: Any Unsatisfactory criterion will result in a zero grade for the assignment, and no further grading will be done.

Code Performance: 50% (10 pts)
• Unsatisfactory: code does not compile; code does not meet functional requirements; code execution results in a segmentation fault; no Makefile provided; no code provided; no author header.
• Poor: code is not optimized; parallelism (if applicable) does not work; code does not scale with parallelization; code does not scale with input data size (where applicable); code is commented poorly (no comments); poor code structure.
• Satisfactory: some optimizations missed; code execution time does not meet times specified in the assignment; code scales appropriately with parallelization (where applicable); code scales to medium data sizes but does not scale fully.
• Excellent: code is fully optimized; execution times meet or exceed times specified in the assignment; code execution scales to appropriately large data sizes; code is easy to read and commented appropriately.

Documentation (write-up): 50% (10 pts)
• Unsatisfactory: no write-up provided; no author name on write-up; write-up not in PDF format.
• Poor: write-up is missing insight into performance results or is just a text description of a graph; write-up is missing discussion of the code implementation or just includes a copy of the source code; performance details presented poorly (table); data presentation (graph) is difficult to read; data presentation provides too few data points or does not cover the full range of performance.
• Satisfactory: write-up could use more details regarding the code implementation; write-up could use more insight into the performance results.
• Excellent: excellent write-up; excellent presentation of performance results (graph); efficiently explains critical code implementation details; accurately explains insight into performance results.
Mini project (due date April TBD)
• Go and learn good things!
• Learn something about High Performance Computing or apply what you have learned to your research and then tell us about what you learned.
• Groups of up to two.
• More info to come…
Class Contribution
• A short paragraph due every Friday (midnight) highlighting how your participation in class contributed to the success of the class.
• Evaluated out of 2 points.
• Acceptable contributions:
• Answer questions in class
• Ask relevant questions in class
• Point out code errors during live coding demos
• Point out math errors in execution time calculations
• Participate in discussions on Canvas
• Unacceptable contributions:
• Read the book.
• Took notes during class.
• Did not disrupt class with questions.
Attendance
• University policy: attendance is mandatory.
• Zoom attendance will be recorded to validate class contribution
reports.
• Tips for morning classes:
• Have breakfast before class.
• Have a morning beverage… I prefer coffee.
• Treat showing up as your job.
Academic Integrity
• The work submitted is assumed to be YOUR work.
• Homework will be scanned for plagiarism.
• If the data in your reports doesn’t match the code you provided, that is also a violation!
• If code is used from online, citations in the comments are required!
• Discussion of assignments is encouraged. Writing source code and reports must be done individually!
• Penalty: zero on assignment, 1 letter grade drop on final grade.
• Ask for help!
You should take this class if …
• You think you will like writing high-performance code
• You are comfortable with foundational CS material (Algorithms, Computer Organization and Architecture)
• You are proficient in C/C++:
• for loops
• variable declarations (int, float, double)
• arrays
• memory allocation
float *A = new float[N];
float *B = new float[N];
float *C = new float[N];
for (int ii = 0; ii < N; ii++)
    C[ii] = A[ii] * B[ii];
Today’s tasks
• Review material posted on Canvas
• Read Chapter 1 of the HPC book for Wednesday
• Read “How To Write Technical Papers” found in Canvas
• Read Chapters 2 & 3 for next week.
• Register for account on ACI cluster.
Questions?
High Performance Computing Today
• https://www.hpcwire.com/2020/04/08/supercomputer-modeling-tests-how-covid-19-spreads-in-grocery-stores/
Why C?
Why C?
https://insights.stackoverflow.com/survey/2018#most-popular-technologies
Why C?
• C is still one of the fastest languages around
• Assembly is faster but much more complex!
Matrix Multiply: relative speedup to a Python version (18-core Intel)
[Chart: successive optimizations reach a relative speedup of roughly 63,000× over the Python baseline.]
from: “There’s Plenty of Room at the Top,” Leiserson, et al., to appear.
ACI Cluster
• 26,000 Cores, 2.4MW of redundant power
• Dual 10/12-core Xeon E5-2680, Quad 10-core Xeon E7-4830
• 128GB / 256GB / 1TB Memory
• Utilize ACI open allocation.
• Limited to 100 cores, 48-hour jobs
• User Guide:
• https://ics.psu.edu/computing-services/icds-aci-user-guide/
• Sign up for an account:
• Affiliation: -- Class –
• Role: Undergrad/Grad Student
• Sponsor: wrs122
• Computational and Data Requirements: 1 GB storage, Open Allocation
What should you get out of this class?
In-depth understanding of:
• When is parallel computing useful?
• Programming models (software) and tools, and experience using some of them
• Performance analysis and tuning
• Important parallel applications and commonly-used algorithms
• Current computing hardware and trends
• Exposure to various open research questions
Why parallel computing?
Why parallel computing?
• Fastest Processor Overclocked to 8.8GHz (2011)
• 25MHz in 1993
• 75MHz in 1995
• 1GHz in 1999
• The power density of a CPU falls somewhere between that of a hot plate and a nuclear reactor.
• They’re already here...
• Intel Xeon Phi 7250 – 68 cores.
Parallel computing
• Parallel processors are everywhere
• Including servers, desktops, laptops, smartphones, refrigerators
• We require powerful computers to solve large-scale science and engineering problems
• Big problems, big data require better use of parallel hardware
• Writing good parallel programs is hard • ‘Efficient’ programs require heroic coding
Writing fast code: practical issues
In-depth understanding of architecture and performance programming
• Optimizations for serial code
• Data access optimizations
• Locality, Memory hierarchy
• Shared memory programming
• Distributed memory programming
Definitions from Wikipedia (petascale, exascale)
• In computing, petascale refers to a computer system capable of reaching performance in excess of one petaflops, i.e. one quadrillion floating-point operations per second. The standard benchmark tool is LINPACK, and Top500.org is the organization that tracks the fastest supercomputers.
• Exascale computing refers to computing systems capable of at least one exaFLOPS, or a billion billion calculations per second. Such capacity represents a thousandfold increase over the first petascale computer, which came into operation in 2008. (One exaflops is a thousand petaflops, or a quintillion, 10^18, floating-point operations per second.) At a supercomputing conference in 2009, Computerworld projected exascale implementation by 2018. This proved accurate: Oak Ridge National Labs performed a 1.8×10^18-flop calculation on the Summit supercomputer while analyzing genomic information in 2018, earning a Gordon Bell finalist slot at Supercomputing 2018.
• Exascale computing would be considered a significant achievement in computer engineering, for it is believed to be on the order of the processing power of the human brain at the neural level (functional capacity might be lower). It is, for instance, the target power of the Human Brain Project.
Today’s petascale systems are solving challenging research problems . . .
Physics of high-temperature superconducting cuprates
Protein structure and function for cellulose-to-ethanol conversion
Fundamental instability of supernova shocks
Global simulation of CO2 dynamics
Next-generation combustion devices burning alternative fuels
Optimization of plasma heating systems for fusion experiments
Many compelling problems become tractable only at the exascale
Fundamental science
• Decipher and comprehend the core laws governing the universe and unravel its origins
Materials science
• Design, characterize, and manufacture materials tailored and optimized for specific applications
Earth science
• Understand the complex biogeochemical cycles that underpin global ecosystems and control the sustainability of life on Earth
Biology and medicine
• Understand relationships from individual proteins through whole cells into ecosystems and environments
Energy assurance
• Attain, without costly disruption, the energy required by the nation in guaranteed, economically viable, and environmentally benign ways to satisfy residential, commercial, and transportation requirements
National security
• Analyze, design, test, and optimize critical systems for communications, homeland security, and defense
• Understand and uncover human behavioral systems underlying asymmetric operation environments
Engineering design
• Design, deploy, and operate safe and economical structures, machines, processes, and systems with reduced concept-to-deployment time
What do commercial and computational science applications have in common?
[Chart: thirteen computational motifs, mapped against application areas: Embedded, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser.]
1. Finite State Machine
2. Combinational Logic
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Programming
9. N-Body
10. MapReduce
11. Backtrack / Branch & Bound
12. Graphical Models
13. Unstructured Grid
‘Big Data’ analysis requires new computational methods and powerful computers
Recommendation Systems
Web Search
Targeted Advertising
Viral Marketing
Why is writing parallel code difficult?
• Finding enough parallelism (Amdahl’s Law)
• Granularity – how big should each parallel task be
• Locality – moving data costs more than arithmetic
• Load balance – don’t want 1000 processors to wait for one slow one
• Coordination and synchronization – sharing data safely
• Performance modeling/debugging/tuning
Impediments to parallelization:
• Legacy serial code
• Rapidly-evolving architectures
• Potential speedup only a ‘constant factor’ on current platforms, especially for memory-intensive computations
Pre-existing Parallelism in Modern Machines
• Bit level parallelism
• within operations, etc.
• Instruction level parallelism (ILP)
• multiple instructions execute per clock cycle
• Memory system parallelism
• overlap of memory operations with computation
• OS parallelism
• multiple jobs run in parallel on commodity SMPs
Limits to all of these: for very high performance, the user must identify, schedule, and coordinate parallel tasks.
Automatic Performance Tuning: Motivation
• Writing high-performance software is hard
• Make programming easier while getting high speed
• Ideal case: program in your favorite high-level language (MATLAB, Python, Java) and get a high fraction of peak performance
• Reality: Best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, ...
• How much of this can we automate?
Parallel software development: an emerging view
• Efficiency layer (10% of today’s programmers)
• Expert programmers build libraries, frameworks, OS
• Highest fraction of peak performance
• Productivity layer (90% of today’s programmers)
• Domain experts develop parallel applications by composing frameworks and libraries
• Emphasis on productivity, willing to sacrifice some performance for productive programming
Algorithm Design
[Diagram: from problem to efficient algorithm.]
Algorithm ‘Engineering’
[Diagram: the algorithm engineering cycle, driven by applications: Design, Analysis (performance guarantees), Implementation (algorithm libraries), and Experimentation (realistic models, real inputs), feeding back into Design.]
Increasing hardware complexity
[Die images: AMD Magny-Cours, NVIDIA Fermi]
Models designed by computer architects are too complex
Additional Resources
• See Canvas
• Some parallel computing text books
Parallel computing
Software tools, Languages, Libraries for mini project
• NAMD, LAMMPS, ParMETIS, SCOTCH, NWChem, ScaLAPACK, PLASMA, MAGMA, SLEPc
• Intel MKL, PETSc, Trilinos, Hypre
• SPIRAL, FFTW, OSKI, ROSE, Orio
• Visit, FastBit
• RAxML, ABySS, VASP, BLAST
• UPC, Cilk, Chapel, Habanero, X10, OpenACC, OpenCL
Tunnel Vision by Experts
“I think there is a world market for maybe five computers.”
Thomas Watson, chairman of IBM, 1943.
“There is no reason for any individual to have a computer in their home.”
Ken Olson, president and founder of Digital Equipment Corporation, 1977.
“640K [of memory] ought to be enough for anybody.”
Bill Gates, chairman of Microsoft, 1981.
“On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it.”
Ken Kennedy, CRPC Director, 1994
Parallel Algorithm Design and Analysis
• Costs of computation: Asymptotics, Time, Space, Power, Energy, Speedup, Tradeoffs
• Scalability
• Parallel models of computations: PRAM, BSP
• Scheduling: Task graphs, work, span
• Analysis
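Two of the cost measures above have standard definitions worth fixing now. For a problem that takes time $T_1$ on one processor and $T_p$ on $p$ processors:

```latex
S(p) = \frac{T_1}{T_p} \quad \text{(speedup)}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p} \quad \text{(efficiency)}
```

Ideal scaling means $S(p) = p$, i.e. $E(p) = 1$; real codes fall short of this, and quantifying why is a recurring theme of the course.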
Algorithmic Design Strategies
• Divide and conquer
• Parallel prefix
• Stencil-based iterations
• Pipelining
• Randomization
Parallel Programming
• Target machine model
• Shared memory (threading: OpenMP)
• Distributed memory (message passing: MPI)
• Heterogeneous
• Hello world → Proficient use of parallel libraries
• Synchronization
• Load balancing
• Data layout, locality, memory management
Resources
• The HPC book.
• The Internet
• Course Instructor and TA.