
PYTHON PARALLEL COMPUTING – PART 02

PROBLEM DECOMPOSITION
• A central task when working with MPI is to break the problem into “chunks” to be handled by individual processes.


• There are two main ways to decompose a problem:
• Domain decomposition: Data associated with a problem is split into chunks and
each parallel process works on a chunk of the data.
• Functional decomposition: Focus is on the computation rather than on the data. Used when pieces of data require different processing.
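
• A minimal sketch of domain decomposition (assuming mpi4py and NumPy are available; the data and the local work are illustrative): each rank takes one chunk of a shared array and works only on that chunk.

# Hypothetical sketch: split the problem data into one chunk per MPI rank
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = np.arange(100, dtype='d')            # the full problem data
chunk = np.array_split(data, size)[rank]    # this rank's piece of the domain
local_result = chunk.sum()                  # work on the local chunk only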

EXAMPLE OF A DOMAIN DECOMPOSITION WITH MPI

HERMITE INTERPOLATION

• Finding the piecewise Hermite polynomial interpolation of a set of points.

SEQUENTIAL VERSION
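
• A minimal sequential sketch of piecewise cubic Hermite evaluation; the function names and the sine test data are illustrative assumptions, not necessarily the lecture's exact code.

import numpy as np

def hermite_segment(xq, x0, x1, y0, y1, d0, d1):
    # Cubic Hermite polynomial on one interval [x0, x1] with end values and slopes
    h = x1 - x0
    t = (xq - x0) / h
    h00 = 2*t**3 - 3*t**2 + 1
    h10 = t**3 - 2*t**2 + t
    h01 = -2*t**3 + 3*t**2
    h11 = t**3 - t**2
    return h00*y0 + h10*h*d0 + h01*y1 + h11*h*d1

def hermite_interp(xq, x, y, dydx):
    # Piecewise evaluation: find the interval each query point falls in
    i = np.clip(np.searchsorted(x, xq) - 1, 0, len(x) - 2)
    return hermite_segment(xq, x[i], x[i+1], y[i], y[i+1], dydx[i], dydx[i+1])

x = np.linspace(0.0, 2*np.pi, 10)        # knots
y = np.sin(x)                            # values at the knots
dydx = np.cos(x)                         # derivatives at the knots
xq = np.linspace(0.0, 2*np.pi, 1000)     # points to interpolate at
yq = hermite_interp(xq, x, y, dydx)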

MPI VERSION
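
• A corresponding domain-decomposed sketch, assuming the sequential evaluator above is saved as hermite_seq.py: the root splits the query points, each rank interpolates its own chunk, and the root gathers the results.

# Hypothetical MPI version of the Hermite interpolation example
from mpi4py import MPI
import numpy as np
from hermite_seq import hermite_interp   # the sequential sketch above (assumed file name)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    x = np.linspace(0.0, 2*np.pi, 10)
    y = np.sin(x)
    dydx = np.cos(x)
    xq_chunks = np.array_split(np.linspace(0.0, 2*np.pi, 1000), size)
else:
    x = y = dydx = xq_chunks = None

# Every rank needs the (small) knot data; only the query points are split
x, y, dydx = comm.bcast((x, y, dydx), root=0)
xq_local = comm.scatter(xq_chunks, root=0)

yq_local = hermite_interp(xq_local, x, y, dydx)

yq_chunks = comm.gather(yq_local, root=0)
if rank == 0:
    yq = np.concatenate(yq_chunks)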

COLLECTIVE OPERATIONS: POINT-TO-POINT VS COLLECTIVE COMMUNICATION
• Collective communication allows data to be sent between multiple processes of a group simultaneously.

COLLECTIVE COMMUNICATION
• Synchronization
• Processes wait until all members of the group have reached the synchronization point.
• Global communication functions
• Broadcast data from one member to all members of a group
• Gather data from all members to one member of a group
• Scatter data from one member to all members of a group

COLLECTIVE COMMUNICATION
• Collective computation (reductions)
• One member of the group collects data from the other members and performs an
operation (min, max, add, multiply, etc.) on that data.
• Collective Input/Output
• Each member of the group reads or writes a section of a file.

SYNCHRONIZATION
• MPI has a special function that is dedicated to synchronizing processes: comm.Barrier().
• No process advances until all have called the function.
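
• A minimal sketch of Barrier in use:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# ... each rank does some local setup work here ...
comm.Barrier()               # no rank passes this point until every rank has reached it
if rank == 0:
    print("all ranks finished their setup")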


GLOBAL COMMUNICATION FUNCTIONS

BROADCASTING
• One process sends the same data to all processes in a communicator using the command comm.Bcast(buf, root=0).
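
• A minimal sketch, assuming a NumPy buffer of four doubles:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.empty(4, dtype='d')
if rank == 0:
    data[:] = [1.0, 2.0, 3.0, 4.0]   # only the root fills the buffer
comm.Bcast(data, root=0)             # afterwards every rank holds the same values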

• Broadcast sends the same piece of data to all processes, while scatter sends chunks of an array to different processes.

• The Comm.Scatter(sendbuf, recvbuf, root=0) method takes three arguments.
• The first is an array of data that resides on the root process.
• The second parameter is used to hold the received data.
• The last parameter indicates the root process that is scattering the array of data.
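
• A minimal sketch that scatters three values to each process:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sendbuf = None
if rank == 0:
    sendbuf = np.arange(size * 3, dtype='d')   # 3 values per process, held on the root
recvbuf = np.empty(3, dtype='d')
comm.Scatter(sendbuf, recvbuf, root=0)          # each rank receives its own 3 values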

• Gather is the inverse of scatter, taking elements from many processes and gathering them to one single process.

• The Comm.Gather(sendbuf, recvbuf, root=0) method takes the same arguments as Comm.Scatter.
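
• A minimal sketch in which every rank contributes three values and the root collects them all:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sendbuf = np.full(3, rank, dtype='d')            # every rank contributes 3 values
recvbuf = np.empty(size * 3, dtype='d') if rank == 0 else None
comm.Gather(sendbuf, recvbuf, root=0)            # the root ends up with all contributions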

REDUCTION
• Comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0) handles almost all of the common reductions that a programmer needs to do in a parallel application.

REDUCTION
• Comm.Reduce takes an array of input elements and returns an array of reduced elements to the root process.
• MPI.MAX – Returns the maximum element.
• MPI.MIN – Returns the minimum element.
• MPI.SUM – Sums the elements.
• MPI.PROD – Multiplies all elements.
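
• A minimal sketch that sums one partial result per rank onto the root:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.array([float(rank)])                  # each rank's partial result
total = np.empty(1, dtype='d') if rank == 0 else None
comm.Reduce(local, total, op=MPI.SUM, root=0)    # element-wise sum across all ranks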

OTHER COLLECTIVE OPERATIONS
• Comm.Alltoall(sendbuf, recvbuf)
• File.Open(comm, filename, amode, info)
• File.Write_all(buffer)
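
• A hedged sketch of these calls; the file name, offsets and data layout are assumptions, not part of the lecture material.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Alltoall: every rank sends one element to every other rank (and receives one back)
sendbuf = np.arange(size, dtype='d') + rank * size
recvbuf = np.empty(size, dtype='d')
comm.Alltoall(sendbuf, recvbuf)

# Collective I/O: each rank writes its own block of a shared binary file
buf = np.full(10, rank, dtype='d')
fh = MPI.File.Open(comm, 'output.bin', MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Seek(rank * buf.nbytes, MPI.SEEK_SET)         # position each rank at its own block
fh.Write_all(buf)                                # collective write: all ranks participate
fh.Close()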

COMPUTING AN INTEGRAL USING PARALLEL COLLECTIVE VERSION

SERIAL VERSION
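
• A minimal serial sketch using the trapezoidal rule; the integrand and interval are illustrative assumptions, not necessarily the lecture's example.

import numpy as np

def f(x):
    return np.sin(x)

a, b, n = 0.0, np.pi, 1_000_000
x = np.linspace(a, b, n + 1)
h = (b - a) / n
integral = h * (0.5 * f(a) + 0.5 * f(b) + f(x[1:-1]).sum())   # trapezoidal rule
print(integral)                                                # ~2.0 for sin(x) on [0, pi]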

PARALLEL COLLECTIVE VERSION
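
• A hedged parallel sketch of the same integral: each rank integrates its own sub-interval and the partial sums are combined with Reduce.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def f(x):
    return np.sin(x)

a, b, n = 0.0, np.pi, 1_000_000
# Domain decomposition: rank i handles the i-th sub-interval of [a, b]
local_a = a + rank * (b - a) / size
local_b = a + (rank + 1) * (b - a) / size
x = np.linspace(local_a, local_b, n // size + 1)
h = x[1] - x[0]
local = np.array([h * (0.5 * f(x[0]) + 0.5 * f(x[-1]) + f(x[1:-1]).sum())])

total = np.empty(1, dtype='d') if rank == 0 else None
comm.Reduce(local, total, op=MPI.SUM, root=0)
if rank == 0:
    print(total[0])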

MATRIX-VECTOR MULTIPLICATION
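
• A hedged sketch of a row-decomposed matrix-vector product, assuming the matrix size divides evenly among the ranks: the rows are scattered, the vector is broadcast, and the partial products are gathered.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 8                                     # assumed divisible by the number of ranks
A = np.arange(n * n, dtype='d').reshape(n, n) if rank == 0 else None
x = np.ones(n, dtype='d') if rank == 0 else np.empty(n, dtype='d')

local_A = np.empty((n // size, n), dtype='d')
comm.Scatter(A, local_A, root=0)          # each rank gets a block of rows
comm.Bcast(x, root=0)                     # every rank needs the whole vector

local_y = local_A @ x                     # local piece of the result

y = np.empty(n, dtype='d') if rank == 0 else None
comm.Gather(local_y, y, root=0)           # the root assembles the full result vector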

COMMUNICATION OF BUFFER-LIKE OBJECTS
• When using the uppercase version of the methods (Send, Irecv, Gather, etc.), the data object must support the single-segment buffer interface.
• This interface is a standard Python mechanism provided by some types (e.g., strings and numeric arrays), which is why we have been using NumPy arrays in the examples.

COMMUNICATION OF GENERIC PYTHON OBJECTS
• It is also possible to transmit an arbitrary Python data type using the lowercase version of the methods (send, irecv, gather, etc.).
• mpi4py will serialize the data type, send it to the remote process, then deserialize it back to the original data type (a process known as pickling and unpickling).
• While this is simple, it also adds significant overhead to the MPI operation.
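
• A minimal sketch contrasting the two interfaces, assuming at least two ranks:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.Send(np.arange(5, dtype='d'), dest=1)        # uppercase: buffer-like object
    comm.send({'step': 3, 'label': 'demo'}, dest=1)   # lowercase: any picklable object
elif rank == 1:
    arr = np.empty(5, dtype='d')
    comm.Recv(arr, source=0)                          # receive into a preallocated buffer
    obj = comm.recv(source=0)                         # deserialized back into a dict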

WHAT IS JAX?
• JAX is a Python library designed for high-performance ML research.
• JAX is nothing more than a numerical computing library, just like NumPy, but with some key improvements.
• It was developed by Google and is used internally by both Google and DeepMind teams.

JAX BASICS

THE DEVICEARRAY
• One of JAX's main advantages is that we can run the same program, without any change, on hardware accelerators like GPUs and TPUs.
• This is accomplished by an underlying structure called DeviceArray, which essentially replaces NumPy's standard array.
• DeviceArrays are lazy, which means that they keep the values on the accelerator and pull them only when needed.


DEVICEARRAYS
• We can use DeviceArrays just like we use standard arrays.
• We can pass it to other libraries, plot graphs, perform differentiation and things
will work.
• Also note that the majority of NumPy's API (functions and operations) is supported by JAX, so your JAX code will be almost identical to NumPy code.
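
• A minimal sketch of creating and using a JAX device-backed array:

import jax.numpy as jnp

x = jnp.arange(10.0)          # lives on the default device (CPU/GPU/TPU)
y = jnp.sin(x) * 2.0          # stays on the device; nothing is copied back yet
print(type(y))                # a device-backed array (DeviceArray in older JAX versions)
print(float(y[0]))            # pulling a value forces the transfer to the host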

• JAX can be seen as a set of function transformations of regular code.

AUTO DIFFERENTIATION WITH GRAD() FUNCTION
• JAX is able to differentiate through all sorts of Python and NumPy functions, including loops, branches, recursion, and more.
• This is incredibly useful for Deep Learning apps as we can run backpropagation pretty much effortlessly.
• The main function to accomplish this is called grad().

AUTO DIFFERENTIATION
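
• A minimal sketch of grad() applied to a toy scalar-valued function:

import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum(w ** 2) / 2.0                   # a toy scalar-valued function

grad_loss = jax.grad(loss)                         # a new function returning d(loss)/dw
print(grad_loss(jnp.array([1.0, 2.0, 3.0])))       # -> [1. 2. 3.]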

ACCELERATED LINEAR ALGEBRA (XLA COMPILER)
• One of the factors that makes JAX so fast is Accelerated Linear Algebra, or XLA.
• XLA is a domain-specific compiler for linear algebra that has been used extensively by TensorFlow.

ACCELERATED LINEAR ALGEBRA (XLA COMPILER)
• In order to perform matrix operations as fast as possible, the code is compiled into a set of computation kernels that can be extensively optimized based on the nature of the code.
• Examples of such optimizations include:
• Fusion of operations: intermediate results are not saved into memory.
• Optimized layout: optimizing the “shape” in which an array is represented in memory.

JUST IN TIME COMPILATION (JIT)
• In order to take advantage of the power of XLA, the code must be compiled into XLA kernels.
• Just-in-time (JIT) compilation is a way of executing computer code that involves compilation during the execution of a program – at run time – rather than before execution.
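
• A minimal sketch of jit: the first call traces and compiles the function, and later calls reuse the compiled kernel.

import jax
import jax.numpy as jnp

@jax.jit
def normalise(x):
    return (x - x.mean()) / x.std()

x = jnp.arange(1000.0)
y = normalise(x)           # first call traces and compiles; later calls reuse the kernel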

• pmap is another transformation that enables us to replicate a computation across multiple cores or devices and execute the copies in parallel.
• The p in pmap stands for parallel.
• It automatically distributes the computation across all the available devices and handles all the communication between them.
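
• A minimal sketch of pmap, assuming the machine exposes more than one JAX device (on a single device the leading axis simply has size 1):

import jax
import jax.numpy as jnp

n_dev = jax.device_count()
xs = jnp.arange(n_dev * 4.0).reshape(n_dev, 4)   # one row per device

f = jax.pmap(lambda x: x * 2.0)                  # replicate the computation per device
print(f(xs))                                     # each device doubles its own row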

COLLECTIVE COMMUNICATION
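
• A hedged sketch of a collective operation inside pmap, using jax.lax.psum to sum a value across the named device axis (the normalisation example is an assumption, not necessarily the lecture's):

import jax
import jax.numpy as jnp

n_dev = jax.device_count()
xs = jnp.arange(1.0, n_dev + 1.0)                  # one value per device

def normalise(x):
    total = jax.lax.psum(x, axis_name='devices')   # sum of x over all devices
    return x / total

print(jax.pmap(normalise, axis_name='devices')(xs))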

AUTOMATIC VECTORIZATION WITH VMAP
• A function transformation that enables us to vectorize functions.
• v stands for vector.
• We can take a function that operates on a single data point and vectorize it.
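
• A minimal sketch of vmap applied to a per-example prediction function (the function and data are illustrative):

import jax
import jax.numpy as jnp

def predict(w, x):                                        # operates on a single data point x
    return jnp.dot(w, x)

w = jnp.ones(3)
batch = jnp.arange(12.0).reshape(4, 3)                    # 4 data points

batched_predict = jax.vmap(predict, in_axes=(None, 0))    # map over the batch axis of x only
print(batched_predict(w, batch))                          # 4 predictions, no Python loop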
