
PYTHON PARALLEL COMPUTING – PART 02

PROBLEM DECOMPOSITION
• A central task when working with MPI is to break the problem into “chunks” to be handled by individual processes.


• There are two main ways to decompose a problem:
• Domain decomposition: Data associated with a problem is split into chunks and
each parallel process works on a chunk of the data.
• Functional decomposition: Focus is on the computation rather than on the data. Used when pieces of data require different processing.
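
• A minimal sketch of domain decomposition (assuming mpi4py and NumPy are available; the data and the local work are illustrative): each rank takes one chunk of a shared array and works only on that chunk.

# Hypothetical sketch: split the problem data into one chunk per MPI rank
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = np.arange(100, dtype='d')            # the full problem data
chunk = np.array_split(data, size)[rank]    # this rank's piece of the domain
local_result = chunk.sum()                  # work on the local chunk only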

EXAMPLE OF A DOMAIN DECOMPOSITION WITH MPI

HERMITE INTERPOLATION

• Finding the piecewise Hermite polynomial interpolation of a set of points.

SEQUENTIAL VERSION
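
• A minimal sequential sketch of piecewise cubic Hermite evaluation; the function names and the sine test data are illustrative assumptions, not necessarily the lecture's exact code.

import numpy as np

def hermite_segment(xq, x0, x1, y0, y1, d0, d1):
    # Cubic Hermite polynomial on one interval [x0, x1] with end values and slopes
    h = x1 - x0
    t = (xq - x0) / h
    h00 = 2*t**3 - 3*t**2 + 1
    h10 = t**3 - 2*t**2 + t
    h01 = -2*t**3 + 3*t**2
    h11 = t**3 - t**2
    return h00*y0 + h10*h*d0 + h01*y1 + h11*h*d1

def hermite_interp(xq, x, y, dydx):
    # Piecewise evaluation: find the interval each query point falls in
    i = np.clip(np.searchsorted(x, xq) - 1, 0, len(x) - 2)
    return hermite_segment(xq, x[i], x[i+1], y[i], y[i+1], dydx[i], dydx[i+1])

x = np.linspace(0.0, 2*np.pi, 10)        # knots
y = np.sin(x)                            # values at the knots
dydx = np.cos(x)                         # derivatives at the knots
xq = np.linspace(0.0, 2*np.pi, 1000)     # points to interpolate at
yq = hermite_interp(xq, x, y, dydx)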

MPI VERSION
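
• A corresponding domain-decomposed sketch, assuming the sequential evaluator above is saved as hermite_seq.py: the root splits the query points, each rank interpolates its own chunk, and the root gathers the results.

# Hypothetical MPI version of the Hermite interpolation example
from mpi4py import MPI
import numpy as np
from hermite_seq import hermite_interp   # the sequential sketch above (assumed file name)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    x = np.linspace(0.0, 2*np.pi, 10)
    y = np.sin(x)
    dydx = np.cos(x)
    xq_chunks = np.array_split(np.linspace(0.0, 2*np.pi, 1000), size)
else:
    x = y = dydx = xq_chunks = None

# Every rank needs the (small) knot data; only the query points are split
x, y, dydx = comm.bcast((x, y, dydx), root=0)
xq_local = comm.scatter(xq_chunks, root=0)

yq_local = hermite_interp(xq_local, x, y, dydx)

yq_chunks = comm.gather(yq_local, root=0)
if rank == 0:
    yq = np.concatenate(yq_chunks)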

COLLECTIVE OPERATIONS: POINT-TO-POINT VS COLLECTIVE COMMUNICATION
• Collective communication allows data to be sent between multiple processes of a group simultaneously.

COLLECTIVE COMMUNICATION
• Synchronization
• Processes wait until all members of the group have reached the synchronization point.
• Global communication functions
• Broadcast data from one member to all members of a group
• Gather data from all members to one member of a group
• Scatter data from one member to all members of a group

COLLECTIVE COMMUNICATION
• Collective computation (reductions)
• One member of the group collects data from the other members and performs an
operation (min, max, add, multiply, etc.) on that data.
• Collective Input/Output
• Each member of the group reads or writes a section of a file.

SYNCHRONIZATION
• MPI has a special function that is dedicated to synchronizing processes: comm.Barrier().
• No process advances until all have called the function.
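
• A minimal sketch of Barrier in use:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# ... each rank does some local setup work here ...
comm.Barrier()               # no rank passes this point until every rank has reached it
if rank == 0:
    print("all ranks finished their setup")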


GLOBAL COMMUNICATION FUNCTIONS

BROADCASTING
• One process sends the same data to all processes in a communicator using the command comm.Bcast(buf, root=0).
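
• A minimal sketch, assuming a NumPy buffer of four doubles:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.empty(4, dtype='d')
if rank == 0:
    data[:] = [1.0, 2.0, 3.0, 4.0]   # only the root fills the buffer
comm.Bcast(data, root=0)             # afterwards every rank holds the same values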

• Broadcast sends the same piece of data to all processes, while scatter sends chunks of an array to different processes.

• The Comm.Scatter(sendbuf, recvbuf, root=0) method takes three arguments.
• The first is an array of data that resides on the root process.
• The second parameter is used to hold the received data.
• The last parameter indicates the root process that is scattering the array of data.
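
• A minimal sketch that scatters three values to each process:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sendbuf = None
if rank == 0:
    sendbuf = np.arange(size * 3, dtype='d')   # 3 values per process, held on the root
recvbuf = np.empty(3, dtype='d')
comm.Scatter(sendbuf, recvbuf, root=0)          # each rank receives its own 3 values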

• Gather is the inverse of scatter, taking elements from many processes and gathering them to one single process.

• The Comm.Gather(sendbuf, recvbuf, root=0) method takes the same arguments as Comm.Scatter.
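
• A minimal sketch in which every rank contributes three values and the root collects them all:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sendbuf = np.full(3, rank, dtype='d')            # every rank contributes 3 values
recvbuf = np.empty(size * 3, dtype='d') if rank == 0 else None
comm.Gather(sendbuf, recvbuf, root=0)            # the root ends up with all contributions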

REDUCTION
• Comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0) handles almost all of the common reductions that a programmer needs to do in a parallel application.

REDUCTION
• Comm.Reduce takes an array of input elements and returns an array of reduced elements to the root process.
• MPI.MAX – Returns the maximum element.
• MPI.MIN – Returns the minimum element.
• MPI.SUM – Sums the elements.
• MPI.PROD – Multiplies all elements.
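
• A minimal sketch that sums one partial result per rank onto the root:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.array([float(rank)])                  # each rank's partial result
total = np.empty(1, dtype='d') if rank == 0 else None
comm.Reduce(local, total, op=MPI.SUM, root=0)    # element-wise sum across all ranks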

OTHER COLLECTIVE OPERATIONS
• Comm.Alltoall(sendbuf, recvbuf)
• File.Open(comm, filename, amode, info)
• File.Write_all(buffer)
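
• A hedged sketch of these calls; the file name, offsets and data layout are assumptions, not part of the lecture material.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Alltoall: every rank sends one element to every other rank (and receives one back)
sendbuf = np.arange(size, dtype='d') + rank * size
recvbuf = np.empty(size, dtype='d')
comm.Alltoall(sendbuf, recvbuf)

# Collective I/O: each rank writes its own block of a shared binary file
buf = np.full(10, rank, dtype='d')
fh = MPI.File.Open(comm, 'output.bin', MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Seek(rank * buf.nbytes, MPI.SEEK_SET)         # position each rank at its own block
fh.Write_all(buf)                                # collective write: all ranks participate
fh.Close()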

COMPUTING AN INTEGRAL USING PARALLEL COLLECTIVE VERSION

SERIAL VERSION
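
• A minimal serial sketch using the trapezoidal rule; the integrand and interval are illustrative assumptions, not necessarily the lecture's example.

import numpy as np

def f(x):
    return np.sin(x)

a, b, n = 0.0, np.pi, 1_000_000
x = np.linspace(a, b, n + 1)
h = (b - a) / n
integral = h * (0.5 * f(a) + 0.5 * f(b) + f(x[1:-1]).sum())   # trapezoidal rule
print(integral)                                                # ~2.0 for sin(x) on [0, pi]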

PARALLEL COLLECTIVE VERSION
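
• A hedged parallel sketch of the same integral: each rank integrates its own sub-interval and the partial sums are combined with Reduce.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def f(x):
    return np.sin(x)

a, b, n = 0.0, np.pi, 1_000_000
# Domain decomposition: rank i handles the i-th sub-interval of [a, b]
local_a = a + rank * (b - a) / size
local_b = a + (rank + 1) * (b - a) / size
x = np.linspace(local_a, local_b, n // size + 1)
h = x[1] - x[0]
local = np.array([h * (0.5 * f(x[0]) + 0.5 * f(x[-1]) + f(x[1:-1]).sum())])

total = np.empty(1, dtype='d') if rank == 0 else None
comm.Reduce(local, total, op=MPI.SUM, root=0)
if rank == 0:
    print(total[0])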

MATRIX-VECTOR MULTIPLICATION
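
• A hedged sketch of a row-decomposed matrix-vector product, assuming the matrix size divides evenly among the ranks: the rows are scattered, the vector is broadcast, and the partial products are gathered.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 8                                     # assumed divisible by the number of ranks
A = np.arange(n * n, dtype='d').reshape(n, n) if rank == 0 else None
x = np.ones(n, dtype='d') if rank == 0 else np.empty(n, dtype='d')

local_A = np.empty((n // size, n), dtype='d')
comm.Scatter(A, local_A, root=0)          # each rank gets a block of rows
comm.Bcast(x, root=0)                     # every rank needs the whole vector

local_y = local_A @ x                     # local piece of the result

y = np.empty(n, dtype='d') if rank == 0 else None
comm.Gather(local_y, y, root=0)           # the root assembles the full result vector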

COMMUNICATION OF BUFFER-LIKE OBJECTS
• When using the uppercase version of the methods (Send, Irecv, Gather, etc.), the data object must support the single-segment buffer interface.
• This interface is a standard Python mechanism provided by some types (e.g., strings and numeric arrays), which is why we have been using NumPy arrays in the examples.

COMMUNICATION OF GENERIC PYTHON OBJECTS
• It is also possible to transmit an arbitrary Python data type using the lowercase version of the methods (send, irecv, gather, etc.).
• mpi4py will serialize the data type, send it to the remote process, then deserialize it back to the original data type (a process known as pickling and unpickling).
• While this is simple, it also adds significant overhead to the MPI operation.
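
• A minimal sketch contrasting the two interfaces, assuming at least two ranks:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.Send(np.arange(5, dtype='d'), dest=1)        # uppercase: buffer-like object
    comm.send({'step': 3, 'label': 'demo'}, dest=1)   # lowercase: any picklable object
elif rank == 1:
    arr = np.empty(5, dtype='d')
    comm.Recv(arr, source=0)                          # receive into a preallocated buffer
    obj = comm.recv(source=0)                         # deserialized back into a dict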

WHAT IS JAX?
• JAX is a Python library designed for high-performance ML research.
• JAX is nothing more than a numerical computing library, just like NumPy, but with some key improvements.
• It was developed by Google and is used internally by both Google and DeepMind teams.

JAX BASICS

THE DEVICEARRAY
• One of JAX's main advantages is that we can run the same program, without any change, on hardware accelerators like GPUs and TPUs.
• This is accomplished by an underlying structure called DeviceArray, which essentially replaces NumPy's standard array.
• DeviceArrays are lazy, which means that they keep the values on the accelerator and pull them only when needed.


DEVICEARRAYS
• We can use DeviceArrays just like we use standard arrays.
• We can pass it to other libraries, plot graphs, perform differentiation and things
will work.
• Also note that the majority of NumPy's API (functions and operations) is supported by JAX, so your JAX code will be almost identical to NumPy code.
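
• A minimal sketch of creating and using a JAX device-backed array:

import jax.numpy as jnp

x = jnp.arange(10.0)          # lives on the default device (CPU/GPU/TPU)
y = jnp.sin(x) * 2.0          # stays on the device; nothing is copied back yet
print(type(y))                # a device-backed array (DeviceArray in older JAX versions)
print(float(y[0]))            # pulling a value forces the transfer to the host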

• JAX can be seen as a set of function transformations of regular code.

AUTO DIFFERENTIATION WITH GRAD() FUNCTION
• JAX is able to differentiate through all sorts of Python and NumPy functions, including loops, branches, recursion, and more.
• This is incredibly useful for Deep Learning apps as we can run backpropagation pretty much effortlessly.
• The main function to accomplish this is called grad().

AUTO DIFFERENTIATION
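
• A minimal sketch of grad() applied to a toy scalar-valued function:

import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum(w ** 2) / 2.0                   # a toy scalar-valued function

grad_loss = jax.grad(loss)                         # a new function returning d(loss)/dw
print(grad_loss(jnp.array([1.0, 2.0, 3.0])))       # -> [1. 2. 3.]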

ACCELERATED LINEAR ALGEBRA (XLA COMPILER)
• One of the factors that makes JAX so fast is Accelerated Linear Algebra, or XLA.
• XLA is a domain-specific compiler for linear algebra that has been used extensively by TensorFlow.

ACCELERATED LINEAR ALGEBRA (XLA COMPILER)
• In order to perform matrix operations as fast as possible, the code is compiled into a set of computation kernels that can be extensively optimized based on the nature of the code.
• Examples of such optimizations include:
• Fusion of operations: intermediate results are not saved into memory.
• Optimized layout: optimizing the “shape” in which an array is represented in memory.

JUST IN TIME COMPILATION (JIT)
• In order to take advantage of the power of XLA, the code must be compiled into XLA kernels.
• Just-in-time (JIT) compilation is a way of executing computer code that involves compilation during the execution of a program – at run time – rather than before execution.
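
• A minimal sketch of jit: the first call traces and compiles the function, and later calls reuse the compiled kernel.

import jax
import jax.numpy as jnp

@jax.jit
def normalise(x):
    return (x - x.mean()) / x.std()

x = jnp.arange(1000.0)
y = normalise(x)           # first call traces and compiles; later calls reuse the kernel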

• pmap is another transformation that enables us to replicate a computation across multiple cores or devices and execute the copies in parallel.
• The p in pmap stands for parallel.
• It automatically distributes the computation across all the available devices and handles all the communication between them.
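
• A minimal sketch of pmap, assuming the machine exposes more than one JAX device (on a single device the leading axis simply has size 1):

import jax
import jax.numpy as jnp

n_dev = jax.device_count()
xs = jnp.arange(n_dev * 4.0).reshape(n_dev, 4)   # one row per device

f = jax.pmap(lambda x: x * 2.0)                  # replicate the computation per device
print(f(xs))                                     # each device doubles its own row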

COLLECTIVE COMMUNICATION
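
• A hedged sketch of a collective operation inside pmap, using jax.lax.psum to sum a value across the named device axis (the normalisation example is an assumption, not necessarily the lecture's):

import jax
import jax.numpy as jnp

n_dev = jax.device_count()
xs = jnp.arange(1.0, n_dev + 1.0)                  # one value per device

def normalise(x):
    total = jax.lax.psum(x, axis_name='devices')   # sum of x over all devices
    return x / total

print(jax.pmap(normalise, axis_name='devices')(xs))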

AUTOMATIC VECTORIZATION WITH VMAP
• A function transformation that enables us to vectorize functions.
• v stands for vector.
• We can take a function that operates on a single data point and vectorize it.
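
• A minimal sketch of vmap applied to a per-example prediction function (the function and data are illustrative):

import jax
import jax.numpy as jnp

def predict(w, x):                                        # operates on a single data point x
    return jnp.dot(w, x)

w = jnp.ones(3)
batch = jnp.arange(12.0).reshape(4, 3)                    # 4 data points

batched_predict = jax.vmap(predict, in_axes=(None, 0))    # map over the batch axis of x only
print(batched_predict(w, batch))                          # 4 predictions, no Python loop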
