Overview of Distributed and Parallel
Computing Systems – Lecture 5-6 Professor Richard O. Sinnott
Director, Melbourne eResearch Group University of Melbourne
Some things…
• Next week's lectures/workshops – (it starts getting practical!!!)
• Assignment 1 – handed out yesterday (for thinking)
– Attending next week's lecture is essential for the assignment
• Find a “buddy”
• Workshop next week
– Discussion of assignment 1
• Tips and tricks
• … don’t expect answers
Some more things…
• Assignment 2 – teams reminder
– Organise or be organised
• Find many buddies
• UniMelb data centre visit
• Happened this week and will take place again next week
Data Centre
• Accounts in the process of being set up on Karaage – (the HPC admin system)
• Access and use will look like
– (may need PuTTY or VPN if coming in from elsewhere)
– More in next lecture
Compute Scaling
• Vertical Computational Scaling
– Have faster processors
• Switch your n GHz CPU for a 2n GHz one = 2x faster
– Easy to do, but costs more
– Limits of fundamental physics/matter (nanoCMOS)
• 3.6 GHz (Intel i7-4960X)
Compute Scaling
• Horizontal Computational Scaling
– Have more processors
• Easy to add more; cost increase not so great
• But…
– harder to design, develop, test, debug, deploy, manage, understand, …
• Note that for many/most folks
– HTC far more important than HPC
– (I don’t do HPC, but I’ve done a lot of HTC)
Add More = ???
• Single machine multiple cores
– Typical laptop/PC/server these days
• Loosely coupled collection/cluster of machines
– Pooling/sharing of resources
• Dedicated vs available only when not in use by others
• Web services, … Condor, BOINC
• Tightly coupled cluster of machines
– Typical HPC/HTC set-up (SPARTAN, NCI, …)
• Many servers in same rack/server room (often with fast message passing interconnects)
• Widely distributed clusters of machines
– UK NGS, EGEE, … distributed systems more generally
• Hybrid combinations of the above
– Leads to many challenges with distributed systems
• Shared state (or lack thereof)
• Message passing paradigms – dangers of delayed/lost messages
Add More…limitations
• Add more processes…?
– If n processors (cores) are thrown at a problem how much faster will it go?
• Some terminology:
• T(1) = time for serial computation
• T(N) = time for N parallel computations
• S(N) = speedup
• Proportion of speedup depends on the parts of the program that can't be parallelised
Amdahl’s Law
If σ is the fraction of running time a sequential program spends on the non-parallel parts of a computation, then the speedup on N processors approximates to S(N) = 1/(σ + (1 − σ)/N), which tends to S = 1/σ as N → ∞.
If 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20×, no matter how many processors are used, i.e. if the non-parallelisable part takes 1 hour, then no matter how many cores you throw at it, it won’t complete in <1 hour.
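As a quick worked sketch of the formula above, taking σ = 0.05 (the 95% example):

\[
S(8) = \frac{1}{0.05 + 0.95/8} \approx 5.9\times, \qquad
S(1024) = \frac{1}{0.05 + 0.95/1024} \approx 19.6\times, \qquad
S(\infty) = \frac{1}{0.05} = 20\times
\]

Even with 1024 processors the program never beats the 20× ceiling set by the 5% serial part.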
Over-Simplification of Amdahl’s Law
• Consider a program that executes a single loop, where all iterations can be computed independently, i.e., the code can be parallelised. By splitting the loop into several parts, e.g., one loop iteration per processor, each processor now has to deal with loop overheads such as calculation of bounds, tests for loop completion, etc. This overhead is replicated as many times as there are processors; in effect, loop overhead acts as a further (serial) overhead in running the code. Getting data to/from the many processors adds yet more overhead.
In a nutshell, Amdahl's Law greatly simplifies the real world!
• It also assumes a fixed problem size – sometimes the length of time required for jobs can't be predicted,
– e.g., state space exploration or differential equations that don't converge...
Gustafson-Barsis's Law
• Gives the "scaled speed-up", assuming a fixed parallel run time per process
• If p is the fraction of running time the program spends on parallel parts (and σ = 1 − p on sequential parts), the speedup S using N processes is the linear formula S(N) = σ + pN = N − σ(N − 1)
• Gustafson's Law proposes that programmers tend to set the size of problems to use the available equipment to solve problems within a practical fixed time. As faster (more parallel) equipment becomes available, larger problems can be solved in the same time.
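A quick sketch contrasting the two laws for σ = 0.05 and N = 64:

\[
S_{\text{Amdahl}}(64) = \frac{1}{0.05 + 0.95/64} \approx 15.4\times, \qquad
S_{\text{Gustafson}}(64) = 64 - 0.05 \times 63 \approx 60.9\times
\]

The difference is that Gustafson assumes the problem grows to fill the 64 processors rather than staying at a fixed size.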
Computer Architecture - 101
• At the simplest level a computer comprises
– CPU for executing programs
• ALU, FPU, ...
• Load/store unit
• Registers (fast memory locations)
• Program counter (address of instruction that is executing)
• Memory interface
– Memory that stores executing programs and related data
– I/O systems (keyboards, networks, ...)
– Permanent storage for reading/writing data into/out of memory
– Of key importance (especially for HPC systems!) is the balance of all of these
• Superfast CPUs starved of data
• Bioinformatics jobs killing clusters!!!
– There are many different ways to design/architect computers
• different flavours suitable to different problems
Computer Architectures
• Flynn’s Taxonomy
– Single Instruction, Single Data stream (SISD)
– Single Instruction, Multiple Data streams (SIMD)
– Multiple Instruction, Single Data stream (MISD)
– Multiple Instruction, Multiple Data streams (MIMD)
Flynn’s::SISD Architectures
• Single Instruction, Single Data stream (SISD)
• Sequential computer which exploits no parallelism in either the instruction or data streams
• Single control unit (CU/CPU) fetches single Instruction Stream from memory. The CU/CPU then generates appropriate control signals to direct single processing element to operate on single Data Stream, i.e., one operation at a time.
• Pretty much obsolete...!
– Basic idea of the von Neumann computer
Flynn’s::MISD Architectures
• Multiple Instruction, Single Data stream (MISD)
– Parallel computing architecture where many functional units (PU/CPU) perform different operations on the same data
– Examples include fault tolerant computer architectures, e.g., running multiple error checking processes on same data stream
• (Not very common though!)
Flynn’s::SIMD Architectures
• Single Instruction, Multiple Data stream (SIMD)
– multiple processing elements that perform the same operation on multiple data points simultaneously
– focus is on data level parallelism, i.e., many parallel computations, but only a single process (instruction) at a given moment
– Many modern computers use SIMD instructions, e.g., to improve performance of multimedia use such as for image processing
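As a sketch of the kind of loop a vectorising compiler maps onto SIMD instructions (the function name is illustrative):

/* Compile with e.g. gcc -O3: the compiler can emit SSE/AVX
   instructions that apply the same multiply-add to several
   elements of x and y at once. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* same operation, multiple data */
}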
Flynn’s::MIMD Architectures
• Multiple Instruction, Multiple Data stream (MIMD)
– number of processors that function asynchronously and independently
– at any time, different processors may be executing different instructions on different pieces of data
– machines can be shared memory or distributed memory categories
• depends on how MIMD processors access memory
– Most systems these days operate on MIMD
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism
– Hardware
– Operating System
– Software/Applications
– Some or all of these
Explicit vs Implicit Parallelisation
• Implicit Parallelism
– Supported by parallel languages and parallelizing compilers that take care of identifying parallelism, the scheduling of calculations and the placement of data
• Pretty hard to do (more later!)
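As a rough illustration of the direction such approaches take, OpenMP lets the programmer merely mark a loop while the compiler and runtime handle thread creation, scheduling and work division (a minimal sketch; compile with gcc -fopenmp):

/* The reduction clause tells the runtime to give each thread a
   private copy of total and combine the copies at the end,
   avoiding a data race. */
double sum(const double *a, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}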
• Explicit Parallelism
– In this approach, the programmer is responsible for most of the parallelization effort such as task decomposition, mapping tasks to processors, inter-process communications
– This approach assumes user is the best judge of how parallelism can be exploited for a particular application
• Typically non-trivial to achieve!
– Consider SPARTAN HPC cluster assignment
Slurm script (#SBATCH --ntasks=2 --cpus-per-task=4 vs #SBATCH --ntasks=1 --cpus-per-task=8) – a minimal script is sketched below
– (More next week!)
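A minimal sketch of the first variant as a batch script (the module and executable names are illustrative, not SPARTAN specifics):

#!/bin/bash
#SBATCH --ntasks=2          # two tasks (e.g. MPI processes)
#SBATCH --cpus-per-task=4   # four cores per task
#SBATCH --time=00:10:00     # wall-clock limit

module load openmpi         # illustrative module name
srun ./hello_world          # srun launches the requested tasks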
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism
– Hardware
– Operating System
– Software/Applications
– Some or all of these
(Heavily Simplified) Hardware Parallelisation
[Figure: simplified CPU block diagrams – control unit; integer, floating point and vector units; hardware threading]
Cache – much faster than reading/writing to main memory; instruction cache, data cache (multi-level) and translation lookaside buffer used for virtual-physical address translation (more later on Cloud and hypervisors).
Hardware threading – parallelisation by adding extra CPU state (thread contexts) to allow more instructions to be processed per cycle. Usually shares arithmetic units. Heavy use of one type of computation can tie up all the available units of the CPU, preventing other threads from using them.
Basic CPU Cache / Multi-Core
[Figure: a single core with its cache, alongside a dual-core chip where each core has its own control unit and integer/floating point/vector units]
(Heavily Simplified) Hardware Parallelisation
Multiple cores that can process data and (in principle!!!) perform computational tasks in parallel. Typically share same cache, but issue of cache read/write performance and cache coherence. Possibility of cache stalls (CPU not doing anything whilst waiting for caching); many chips have mixture (L1 cache on single cores; L2 cache on pairs of cores; L3 cache shared by all cores); typical to have different cache speeds and cache sizes (higher hit rates but potentially higher latency).
(Heavily Simplified) Symmetric Multiprocessing (SMP)
[Figure: two identical CPUs, each with its own control unit and integer/floating point/vector units, connected via a bus to a single shared main memory]
Two (or more) identical processors connected to a single, shared main memory, with full access to all I/O devices, controlled by a single OS instance that treats all processors equally. Each processor executes different programs and works on different data but with capability of sharing common resources (memory, I/O device, ...). Processors can be connected in a variety of ways: buses, crossbar switches, meshes. More complex to program since need to program both for CPU and inter-processor communication (bus).
Non-Uniform Memory Access (NUMA)
[Figure: two NUMA nodes, each CPU paired with its own local memory and linked by an interconnect]
Non-uniform memory access (NUMA) provides speed-up by allowing a processor to access its own local memory faster than non-local memory. Improved performance as long as data are localized to specific processes/processors. Key is allocating memory/processors in NUMA friendly ways, e.g., to avoid scheduling/locking and (expensive) inter-processor communication. Approaches such as ccNUMA with range of cache coherency protocols/products.
Example – consider code such as:
X = X + 1; Y = 1;
If Y == 0 print X
The statement X = X + 1 compiles to a load/modify/store sequence:
LOAD R03, %A10 – load X into register
ADD R03, #1 – add 1 to value of R03 (=X)
STORE R03, %A10 – store result
Without cache coherence and memory ordering, another processor may observe Y = 1 while the store of X is still pending.
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism – Hardware
– Operating System
– Software/Applications
– Some or all of these
Operating System Parallelism Approaches
• Most modern multi-core operating systems support different “forms” of parallelisation
– parallel vs interleaved semantics
• A || B vs A ||| B
• Compute parallelism
– Processes
• Used to realise tasks, structure activities, ...
• Native threads
– Fork, Spawn, Join (see the POSIX threads sketch below)
• Green threads
– Scheduled by a virtual machine instead of natively by the OS
• Data parallelism
– Caching (cache coherency)
– OS-level parallelism implies "a" (single) computer
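A minimal sketch of the native-thread spawn/join pattern using POSIX threads (compile with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

/* worker runs concurrently with main once spawned */
void *worker(void *arg)
{
    printf("hello from thread %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, (void *)1L);  /* spawn/fork */
    pthread_join(t, NULL);                         /* join: wait for it */
    return 0;
}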
Approaches for Parallelism
• Where and how
– Explicit vs Implicit parallelism
– Hardware
– Operating System
– Software/Applications
– Some or all of these
Software Parallelism Approaches
• Many (most!) languages now support a range of parallelisation/concurrency features
– Threads, thread pools, locks, semaphores, ...
• Many languages developed specifically for parallel/concurrent systems
– http://en.wikipedia.org/wiki/Concurrent_computing
• Key issues that need to be tackled
– Deadlock – processes involved constantly waiting for each other
– Livelock – processes involved in livelock constantly change with regard to one another, but none are progressing
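A minimal sketch of how deadlock arises when two threads take the same two locks in opposite orders (the standard fix is to always acquire locks in one global order):

#include <pthread.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

/* thread1 takes a then b; thread2 takes b then a. If each grabs
   its first lock, both block forever waiting for the second. */
void *thread1(void *arg)
{
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);   /* waits if thread2 holds b */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

void *thread2(void *arg)
{
    pthread_mutex_lock(&b);
    pthread_mutex_lock(&a);   /* waits if thread1 holds a */
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);   /* may never return */
    pthread_join(t2, NULL);
    return 0;
}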
Message Passing Interface
• Widely adopted approach for message passing in parallel systems
• Mappings to major languages: Fortran, C, C++, Python, Java
– Standardised, widely adopted, portable, performant, ...
– Parallelisation = user's problem
• Key MPI functions
– MPI_Init : initiate MPI computation
– MPI_Finalize : terminate computation
– MPI_Comm_size : determine number of processes
– MPI_Comm_rank : determine my process identifier
– MPI_Send : send a message
– MPI_Recv : receive a message
• Supports point-to-point and broadcast communications
Examples of programming MPI next week (for assignment)
HelloWorld MPI

/* C Example */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* get number of processes */
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

$ mpicc -o hello_world hello_world.c
$ mpirun -np 4 hello_world
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
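To preview MPI_Send/MPI_Recv from the function list above, a minimal point-to-point sketch (run with mpirun -np 2):

/* rank 0 sends one integer to rank 1 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}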
(HT)Condor
• A specialized workload management system for compute-intensive jobs, developed at the University of Wisconsin
– Offers job queueing mechanisms, scheduling policies, priority schemes, resource monitoring/management
– User submits jobs to Condor and it chooses when and where to run the jobs, monitors their progress, and informs the user upon completion
– Allows harvesting of "free" (?) CPU power from otherwise idle desktop workstations
• e.g. use desktop machines when keyboard and mouse are idle
– if a key press is detected, checkpoint and migrate the job to a different (idle) machine
– No need for a shared file system across machines
• Data can be staged to machines as/when needed
– Can work across organisational boundaries
– ClassAds
• Advertise resources and accept jobs (according to policy)
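A sketch of what a Condor submit description file can look like (file names and attribute values are illustrative):

# submit with: condor_submit job.sub
executable   = analyse                # illustrative program name
arguments    = input.dat
output       = job.out
error        = job.err
log          = job.log
requirements = (KeyboardIdle > 600)   # only match idle desktops
queue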
Data Parallelism Approaches
• Challenges of big data
– The most important kind of parallelism challenge?
• Distributed data
– Consistency, Availability, Partition tolerance
• CAP Theorem – more later
– ACID <-> BASE
• Distributed File Systems – e.g. Hadoop, Lustre, Ceph…
Challenges with Distribution
• "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable" (Leslie Lamport)
Challenges with Distribution
• General assumptions that typically don’t hold in the real world…
(Some) Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Network reliability
• If I send some data over the network
– It will arrive
– It will arrive in the order I sent it
– It will arrive uncorrupted
• If the network is unreliable, it will be so in only one of the above ways
– Consistently
• The lower layers in the networking stack protect me from these issues
• None of these statements are always true!
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Latency is zero
• If I send some data, it will arrive "now"
– Or so fast as to be effectively "now"
• Really…?
• Consider accessing Twitter (San Francisco) from Melbourne
– ~12,600km
The Route (traceroute)
1. 128.250.7.66
2. 172.18.65.89
3. fr151-v2010-vpn-general.unimelb.net.au
4. fr151-v2011-vpn-fusion-general.unimelb.net.au
5. br-151-te-2-3.unimelb.net.au
6. gigabitethernet1.er1.unimelb.cpe.aarnet.net.au
7. ge-5-1-0.bb1.a.mel.aarnet.net.au
8. so-0-1-0.bb1.a.syd.aarnet.net.au
9. so-2-2-0.bb1.a.pao.aarnet.net.au
10. paix1.cr1.pao1.twttr.com
11. ae52.smf1-er1.twttr.com
Latency Consequences
• On a 1Gbps link
– 9MB of data can be in flight
• Or 145 TCP (maximum) packets
• You could have sent 18MB of data before you start receiving responses, all of which will be in your TCP stack’s buffer
• You can't get a message there in less than ~78 ms
• You can't get a reply in less than ~156 ms (see the rough calculation below)
• Latency can be much higher
• In a system with many nodes/hops each link can have different latencies
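A rough sketch of where these numbers come from, assuming ~200,000 km/s signal speed in fibre (the 78 ms allows for the longer real route):

\[
t_{\text{one-way}} \gtrsim \frac{12\,600\ \text{km}}{200\,000\ \text{km/s}} \approx 63\ \text{ms}
\]
\[
\text{data in flight} \approx 1\ \text{Gbps} \times 78\ \text{ms} = 125\ \text{MB/s} \times 0.078\ \text{s} \approx 9.75\ \text{MB}
\]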
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Bandwidth is infinite
• I can send any amount of data I wish between any nodes
– Think of BIG DATA (TB → PB+)
• SKA, LHC, Genomics, …
– And consequences for others, e.g., Melbourne hospitals
• Network bandwidth is typically of the order
– 1 Gbps, 10 Gbps, 100 Gbps, …
– For hospitals, homes varies wildly…
• Networking capacity carefully planned
– JANET (UK) backbone planned so that no more than ~25% of its bandwidth capacity is actually used
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
The network is secure
• Really…?
• So, I don’t have to worry about…
– People sending data to my services
• Repeated password attempts, SQL injections, …!?
– People actively attacking me
• Distributed denial of service attacks
– People reading the data sent over the network
• Man-in-the-middle attacks
– People masquerading as one of my nodes
• Spoofing
– People breaking into one of my nodes
• Trojans, viruses, brute-force attacks, …
– People stealing the physical hardware
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Topology doesn’t change
• Node x is always there
– There = ?
• Latency
• Services
– Typically can’t guarantee the route taken and hence latency
• unless specific routing protocols selected, e.g., differentiated services with reservation protocols
• Mostly not available (TCP/IP)
– Next-hop behaviour = ?
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
There is one administrator
– Pretty much never the case
– Who is the administrator for the University of Melbourne?
– Firewall changes, server reconfigurations, services, access control (students/staff/others…)
– Inter-organisational administrators?
Erroneous Assumptions of Distributed Systems
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
9. Time is ubiquitous
Transport cost is zero
• I can send as much data as I like for free
– Can appear so in some places, but…
• Australia…!?!?!?!
– Capped uploads/downloads
– UniMelb researchers with $50k bill for data access that wasn’t over AARNET
– Amazon EC2/S3 … $$$$$
– It is NEVER free!