Overview of Distributed and Parallel
Computing Systems – Lecture 5-6 Professor Richard O. Sinnott
Director, Melbourne eResearch Group University of Melbourne
Some things…
• Next week's lectures/workshops – (it starts getting practical!!!)
• Assignment 1 – handed out yesterday (for thinking)
– Attending next week's lecture is essential for the assignment
• Find a “buddy”
• Workshop next week
– Discussion of assignment 1
• Tips and tricks
• … don’t expect answers
Some more things…
• Assignment 2 – teams reminder
– Organise or be organised
• Find many buddies
• UniMelb data centre visit
• Happened this week and will take place again next week
Data Centre
• Accounts in the process of being set up on Karaage – (the HPC admin system)
• Access and use will look like
– (may need PuTTY or VPN if coming in from elsewhere)
– More in next lecture
Compute Scaling
• Vertical Computational Scaling
– Have faster processors
• Switch your n GHz CPU for a 2n GHz one = 2x faster
– Easy to do, but costs more
– Limits of fundamental physics/matter (nanoCMOS)
• 3.6 GHz (Intel i7-4960X)
Compute Scaling
• Horizontal Computational Scaling
– Have more processors
• Easy to add more; cost increase not so great
• But…
– harder to design, develop, test, debug, deploy, manage, understand, …
• Note that for many/most folks
– HTC far more important than HPC
– (I don’t do HPC, but I’ve done a lot of HTC)
Add More = ???
• Single machine multiple cores
– Typical laptop/PC/server these days
• Loosely coupled collection/cluster of machines
– Pooling/sharing of resources
• Dedicated vs available only when not in use by others
• Web services, … Condor, BOINC
• Tightly coupled cluster of machines
– Typical HPC/HTC set-up (SPARTAN, NCI, …)
• Many servers in same rack/server room (often with fast message passing interconnects)
• Widely distributed clusters of machines
– UK NGS, EGEE, … distributed systems more generally
• Hybrid combinations of the above
– Leads to many challenges with distributed systems
• Shared state (or lack thereof)
• Message passing paradigms – dangers of delayed/lost messages
Add More…limitations
• Add more processes…?
– If n processors (cores) are thrown at a problem how much faster will it go?
• Some terminology:
• T(1) = time for serial computation
• T(N) = time for N parallel computations
• S(N) = speedup
• Proportion of speedup depends on the parts of the program that can't be parallelised
Amdahl’s Law
If σ is the fraction of running time a sequential program spends on the non-parallel parts of a computation, then the speedup on N processors approximates to S(N) = 1/(σ + (1 − σ)/N), which tends to S = 1/σ as N → ∞.
If 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20×, no matter how many processors are used, i.e. if the non-parallelisable part takes 1 hour, then no matter how many cores you throw at it, it won’t complete in <1 hour.
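As a quick worked sketch of the formula above, taking σ = 0.05 (the 95% example):

\[
S(8) = \frac{1}{0.05 + 0.95/8} \approx 5.9\times, \qquad
S(1024) = \frac{1}{0.05 + 0.95/1024} \approx 19.6\times, \qquad
S(\infty) = \frac{1}{0.05} = 20\times
\]

Even with 1024 processors the program never beats the 20× ceiling set by the 5% serial part.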
Over-Simplification of Amdahl’s Law
• Consider a program that executes a single loop, where all iterations can be computed independently, i.e., the code can be parallelised. By splitting the loop into several parts, e.g., one loop iteration per processor, each processor now has to deal with loop overheads such as calculation of bounds, tests for loop completion, etc. This overhead is replicated as many times as there are processors; in effect, loop overhead acts as a further (serial) overhead in running the code. Getting data to/from the many processors adds yet more overhead.
In a nutshell, Amdahl's Law greatly simplifies the real world!
• It also assumes a fixed problem size – sometimes the length of time required for jobs can't be predicted,
– e.g., state space exploration or differential equations that don't converge...
Gustafson-Barsis's Law
• Gives the "scaled speed-up", assuming a fixed parallel run time per process
• If p is the fraction of running time the program spends on parallel parts (and σ = 1 − p on sequential parts), the speedup S using N processes is the linear formula S(N) = σ + pN = N − σ(N − 1)
• Gustafson's Law proposes that programmers tend to set the size of problems to use the available equipment to solve problems within a practical fixed time. As faster (more parallel) equipment becomes available, larger problems can be solved in the same time.
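A quick sketch contrasting the two laws for σ = 0.05 and N = 64:

\[
S_{\text{Amdahl}}(64) = \frac{1}{0.05 + 0.95/64} \approx 15.4\times, \qquad
S_{\text{Gustafson}}(64) = 64 - 0.05 \times 63 \approx 60.9\times
\]

The difference is that Gustafson assumes the problem grows to fill the 64 processors rather than staying at a fixed size.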
Computer Architecture - 101
• At the simplest level a computer comprises
– CPU for executing programs
• ALU, FPU, ...
• Load/store unit
• Registers (fast memory locations)
• Program counter (address of instruction that is executing)
• Memory interface
– Memory that stores executing programs and related data
– I/O systems (keyboards, networks, ...)
– Permanent storage for reading/writing data into/out of memory
– Of key importance (especially for HPC systems!) is the balance of all of these
• Superfast CPUs starved of data
• Bioinformatics jobs killing clusters!!!
– There are many different ways to design/architect computers
• different flavours suitable to different problems
Computer Architectures
• Flynn’s Taxonomy
– Single Instruction, Single Data stream (SISD)
– Single Instruction, Multiple Data streams (SIMD)
– Multiple Instruction, Single Data stream (MISD)
– Multiple Instruction, Multiple Data streams (MIMD)
Flynn’s::SISD Architectures
• Single Instruction, Single Data stream (SISD)
• Sequential computer which exploits no parallelism in either the instruction or data streams
• Single control unit (CU/CPU) fetches single Instruction Stream from memory. The CU/CPU then generates appropriate control signals to direct single processing element to operate on single Data Stream, i.e., one operation at a time.
• Pretty much obsolete...!
– Basic idea of the von Neumann computer
Flynn’s::MISD Architectures
• Multiple Instruction, Single Data stream (MISD)
– Parallel computing architecture where many functional units (PU/CPU) perform different operations on the same data
– Examples include fault tolerant computer architectures, e.g., running multiple error checking processes on same data stream
• (Not very common though!)
Flynn’s::SIMD Architectures
• Single Instruction, Multiple Data stream (SIMD)
– multiple processing elements that perform the same operation on multiple data points simultaneously
– focus is on data level parallelism, i.e., many parallel computations, but only a single process (instruction) at a given moment
– Many modern computers use SIMD instructions, e.g., to improve performance of multimedia use such as for image processing
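As a sketch of the kind of loop a vectorising compiler maps onto SIMD instructions (the function name is illustrative):

/* Compile with e.g. gcc -O3: the compiler can emit SSE/AVX
   instructions that apply the same multiply-add to several
   elements of x and y at once. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* same operation, multiple data */
}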
Flynn’s::MIMD Architectures
• Multiple Instruction, Multiple Data stream (MIMD)
– number of processors that function asynchronously and independently
– at any time, different processors may be executing different instructions on different pieces of data
– machines can be shared memory or distributed memory categories
• depends on how MIMD processors access memory
– Most systems these days operate on MIMD
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism
– Hardware
– Operating System
– Software/Applications
– Some or all of these
Explicit vs Implicit Parallelisation
• Implicit Parallelism
– Supported by parallel languages and parallelizing compilers that take care of identifying parallelism, the scheduling of calculations and the placement of data
• Pretty hard to do (more later!)
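As a rough illustration of the direction such approaches take, OpenMP lets the programmer merely mark a loop while the compiler and runtime handle thread creation, scheduling and work division (a minimal sketch; compile with gcc -fopenmp):

/* The reduction clause tells the runtime to give each thread a
   private copy of total and combine the copies at the end,
   avoiding a data race. */
double sum(const double *a, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}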
• Explicit Parallelism
– In this approach, the programmer is responsible for most of the parallelization effort such as task decomposition, mapping tasks to processors, inter-process communications
– This approach assumes user is the best judge of how parallelism can be exploited for a particular application
• Typically non-trivial to achieve!
– Consider SPARTAN HPC cluster assignment
Slurm script (#SBATCH --ntasks=2 --cpus-per-task=4 vs #SBATCH --ntasks=1 --cpus-per-task=8) – a minimal script is sketched below
– (More next week!)
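A minimal sketch of the first variant as a batch script (the module and executable names are illustrative, not SPARTAN specifics):

#!/bin/bash
#SBATCH --ntasks=2          # two tasks (e.g. MPI processes)
#SBATCH --cpus-per-task=4   # four cores per task
#SBATCH --time=00:10:00     # wall-clock limit

module load openmpi         # illustrative module name
srun ./hello_world          # srun launches the requested tasks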
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism
– Hardware
– Operating System
– Software/Applications
– Some or all of these
(Heavily Simplified) Hardware Parallelisation
[Figure: simplified CPU block diagrams – control unit; integer, floating point and vector units; hardware threading]
Cache – much faster than reading/writing to main memory; instruction cache, data cache (multi-level) and translation lookaside buffer used for virtual-physical address translation (more later on Cloud and hypervisors).
Hardware threading – parallelisation by adding extra CPU state (thread contexts) to allow more instructions to be processed per cycle. Usually shares arithmetic units. Heavy use of one type of computation can tie up all the available units of the CPU, preventing other threads from using them.
Basic CPU Cache / Multi-Core
[Figure: a single core with its cache, alongside a dual-core chip where each core has its own control unit and integer/floating point/vector units]
(Heavily Simplified) Hardware Parallelisation
Multiple cores that can process data and (in principle!!!) perform computational tasks in parallel. Typically share same cache, but issue of cache read/write performance and cache coherence. Possibility of cache stalls (CPU not doing anything whilst waiting for caching); many chips have mixture (L1 cache on single cores; L2 cache on pairs of cores; L3 cache shared by all cores); typical to have different cache speeds and cache sizes (higher hit rates but potentially higher latency).
(Heavily Simplified) Symmetric Multiprocessing (SMP)
[Figure: two identical CPUs, each with its own control unit and integer/floating point/vector units, connected via a bus to a single shared main memory]
Two (or more) identical processors connected to a single, shared main memory, with full access to all I/O devices, controlled by a single OS instance that treats all processors equally. Each processor executes different programs and works on different data but with capability of sharing common resources (memory, I/O device, ...). Processors can be connected in a variety of ways: buses, crossbar switches, meshes. More complex to program since need to program both for CPU and inter-processor communication (bus).
Non-Uniform Memory Access (NUMA)
[Figure: two NUMA nodes, each CPU paired with its own local memory and linked by an interconnect]
Non-uniform memory access (NUMA) provides speed-up by allowing a processor to access its own local memory faster than non-local memory. Improved performance as long as data are localized to specific processes/processors. Key is allocating memory/processors in NUMA friendly ways, e.g., to avoid scheduling/locking and (expensive) inter-processor communication. Approaches such as ccNUMA with range of cache coherency protocols/products.
Example – consider code such as:
X = X + 1; Y = 1;
If Y == 0 print X
The statement X = X + 1 compiles to a load/modify/store sequence:
LOAD R03, %A10 – load X into register
ADD R03, #1 – add 1 to value of R03 (=X)
STORE R03, %A10 – store result
Without cache coherence and memory ordering, another processor may observe Y = 1 while the store of X is still pending.
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism – Hardware
– Operating System
– Software/Applications
– Some or all of these
Operating System Parallelism Approaches
• Most modern multi-core operating systems support different “forms” of parallelisation
– parallel vs interleaved semantics
• A || B vs A ||| B
• Compute parallelism
– Processes
• Used to realise tasks, structure activities, ...
• Native threads
– Fork, Spawn, Join (see the POSIX threads sketch below)
• Green threads
– Scheduled by a virtual machine instead of natively by the OS
• Data parallelism
– Caching (cache coherency)
– OS-level parallelism implies "a" (single) computer
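A minimal sketch of the native-thread spawn/join pattern using POSIX threads (compile with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

/* worker runs concurrently with main once spawned */
void *worker(void *arg)
{
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, (void *)1L);  /* spawn/fork */
    pthread_join(t, NULL);                         /* join: wait for it */
    return 0;
}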
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism
– Hardware
– Operating System
– Software/Applications
– Some or all of these
Software Parallelism Approaches
• Many (most!) languages now support a range of parallelisation/concurrency features
– Threads, thread pools, locks, semaphores, ...
• Many languages developed specifically for parallel/concurrent systems
– http://en.wikipedia.org/wiki/Concurrent_computing
• Key issues that need to be tackled
– Deadlock – processes involved constantly waiting for each other
– Livelock – processes involved in livelock constantly change with regard to one another, but none are progressing
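A minimal sketch of how deadlock arises when two threads take the same two locks in opposite orders (the standard fix is to always acquire locks in one global order):

#include <pthread.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

/* thread1 takes a then b; thread2 takes b then a. If each grabs
   its first lock, both block forever waiting for the second. */
void *thread1(void *arg)
{
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);   /* waits if thread2 holds b */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

void *thread2(void *arg)
{
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);   /* waits if thread1 holds a */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);   /* may never return */
    pthread_join(t2, NULL);
    return 0;
}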
Message Passing Interface
• Widely adopted approach for message passing in parallel systems
• Mappings to major languages: Fortran, C, C++, Python, Java
– Standardised, widely adopted, portable, performant, ...
– Parallelisation = user's problem
• Key MPI functions
– MPI_Init : initiate MPI computation
– MPI_Finalize : terminate computation
– MPI_Comm_size : determine number of processes
– MPI_Comm_rank : determine my process identifier
– MPI_Send : send a message
– MPI_Recv : receive a message
• Supports point-to-point and broadcast communications
Examples of programming MPI next week (for assignment)
HelloWorld MPI

/* C Example */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* get number of processes */
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

$ mpicc -o hello_world hello_world.c
$ mpirun -np 4 hello_world
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
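To preview MPI_Send/MPI_Recv from the function list above, a minimal point-to-point sketch (run with mpirun -np 2):

/* rank 0 sends one integer to rank 1 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}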
(HT)Condor
• A specialized workload management system for compute-intensive jobs, developed at the University of Wisconsin
– Offers job queueing mechanisms, scheduling policies, priority schemes, resource monitoring/management
– User submits jobs to Condor and it chooses when and where to run the jobs, monitors their progress, and informs the user upon completion
– Allows harvesting of "free" (?) CPU power from otherwise idle desktop workstations
• e.g. use desktop machines when keyboard and mouse are idle
– if a key press is detected, checkpoint and migrate the job to a different (idle) machine
– No need for a shared file system across machines
• Data can be staged to machines as/when needed
– Can work across organisational boundaries
– ClassAds
• Advertise resources and accept jobs (according to policy)
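A sketch of what a Condor submit description file can look like (file names and attribute values are illustrative):

# submit with: condor_submit job.sub
executable   = analyse                # illustrative program name
arguments    = input.dat
output       = job.out
error        = job.err
log          = job.log
requirements = (KeyboardIdle > 600)   # only match idle desktops
queue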
Data Parallelism Approaches
• Challenges of big data
– The most important kind of parallelism challenge?
• Distributed data
– Consistency, Availability, Partition tolerance
• CAP Theorem – more later
– ACID <-> BASE
• Distributed File Systems – e.g. Hadoop, Lustre, Ceph…
Challenges with Distribution
• "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable" (Leslie Lamport)
Challenges with Distribution
• General assumptions that typically don’t hold in the real world…
(Some) Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Network reliability
• If I send some data over the network
– It will arrive
– It will arrive in the order I sent it
– It will arrive uncorrupted
• If the network is unreliable, it will be so in only one of the above ways
– Consistently
• The lower layers in the networking stack protect me from these issues
• None of these statements are always true!
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Latency is zero
• If I send some data, it will arrive "now"
– Or so fast as to be effectively "now"
• Really…?
• Consider accessing Twitter (San Francisco) from Melbourne
– ~12,600km
The Route (traceroute)
1. 128.250.7.66
2. 172.18.65.89
3. fr151-v2010-vpn-general.unimelb.net.au
4. fr151-v2011-vpn-fusion-general.unimelb.net.au
5. br-151-te-2-3.unimelb.net.au
6. gigabitethernet1.er1.unimelb.cpe.aarnet.net.au
7. ge-5-1-0.bb1.a.mel.aarnet.net.au
8. so-0-1-0.bb1.a.syd.aarnet.net.au
9. so-2-2-0.bb1.a.pao.aarnet.net.au
10. paix1.cr1.pao1.twttr.com
11. ae52.smf1-er1.twttr.com
Latency Consequences
• On a 1Gbps link
– 9MB of data can be in flight
• Or 145 TCP (maximum) packets
• You could have sent 18MB of data before you start receiving responses, all of which will be in your TCP stack’s buffer
• You can't get a message there in less than ~78 ms
• You can't get a reply in less than ~156 ms (see the rough calculation below)
• Latency can be much higher
• In a system with many nodes/hops each link can have different latencies
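A rough sketch of where these numbers come from, assuming ~200,000 km/s signal speed in fibre (the 78 ms allows for the longer real route):

\[
t_{\text{one-way}} \gtrsim \frac{12\,600\ \text{km}}{200\,000\ \text{km/s}} \approx 63\ \text{ms}
\]
\[
\text{data in flight} \approx 1\ \text{Gbps} \times 78\ \text{ms} = 125\ \text{MB/s} \times 0.078\ \text{s} \approx 9.75\ \text{MB}
\]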
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Bandwidth is infinite
• I can send any amount of data I wish between any nodes
– Think of BIG DATA (TB → PB+)
• SKA, LHC, Genomics, …
– And consequences for others, e.g., Melbourne hospitals
• Network bandwidth is typically of the order
– 1 Gbps, 10 Gbps, 100 Gbps, …
– For hospitals, homes varies wildly…
• Networking capacity carefully planned
– JANET (UK) backbone planned so that no more than ~25% of its bandwidth capacity is actually used
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
The network is secure
• Really…?
• So, I don’t have to worry about…
– People sending data to my services
• Repeated password attempts, SQL injections, …!?
– People actively attacking me
• Distributed denial of service attacks
– People reading the data sent over the network
• Man-in-the-middle attacks
– People masquerading as one of my nodes
• Spoofing
– People breaking into one of my nodes
• Trojans, viruses, brute-force attacks, …
– People stealing the physical hardware
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Topology doesn’t change
• Node x is always there
– There = ?
• Latency
• Services
– Typically can’t guarantee the route taken and hence latency
• unless specific routing protocols selected, e.g., differentiated services with reservation protocols
• Mostly not available (TCP/IP)
– Next-hop behaviour = ?
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
There is one administrator
– Pretty much never the case
– Who is the administrator for the University of Melbourne?
– Firewall changes, server reconfigurations, services, access control (students/staff/others…)
– Inter-organisational administrators?
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Transport cost is zero
• I can send as much data as I like for free
– Can appear so in some places, but…
• Australia…!?!?!?!
– Capped uploads/downloads
– UniMelb researchers with $50k bill for data access that wasn’t over AARNET
– Amazon EC2/S3 … $$$$$
– It is NEVER free!