Collective operation

Collective operations are building blocks for interaction patterns that are often used in SPMD algorithms in the parallel programming context. Hence, there is an interest in efficient realizations of these operations.

A realization of the collective operations is provided by the Message Passing Interface[1] (MPI).

Definitions

In all asymptotic runtime functions, we denote the latency $\alpha$ (or startup time per message, independent of message size), the communication cost per word $\beta$, the number of processing units $p$ and the input size per node $n$. In cases where we have initial messages on more than one node, we assume that all local messages are of the same size. To address individual processing units we use $p_i \in \{p_0, p_1, \dots, p_{p-1}\}$.

If we do not have an equal distribution, i.e. node $p_i$ has a message of size $n_i$, we get an upper bound for the runtime by setting $n = \max(n_0, n_1, \dots, n_{p-1})$.
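
Under this model, the time to transfer a single message of $n$ words between two processing units is commonly charged as $T(n) = \alpha + \beta n$ (a standard reading of these parameters, assumed here for illustration rather than stated explicitly in the source); the bounds below combine such terms, e.g. a non-pipelined broadcast over $\log p$ tree levels costs $\mathcal{O}((\alpha + \beta n)\log p)$, as stated in § Broadcast.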

A distributed memory model is assumed. The concepts are similar for the shared memory model. However, shared memory systems can provide hardware support for some operations like broadcast (§ Broadcast) for example, which allows convenient concurrent read.[2] Thus, new algorithmic possibilities can become available.

Broadcast

Information flow of Broadcast operation performed on three nodes.

The broadcast pattern[3] is used to distribute data from one processing unit to all processing units, which is often needed in SPMD parallel programs to dispense input or global values. Broadcast can be interpreted as an inverse version of the reduce pattern (§ Reduce). Initially only the root $r$ with $\mathrm{id} = 0$ stores message $m$. During broadcast, $m$ is sent to the remaining processing units, so that eventually $m$ is available to all processing units.
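
In the MPI realization cited above, this pattern is provided by MPI_Bcast. A minimal C sketch, assuming root rank 0 and a single integer as the message $m$ (the payload size, datatype and value are illustrative choices, not prescribed by the text):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m = 0;
    if (rank == 0) m = 42;   /* initially only the root stores m */

    /* after the call, m is available on all processing units */
    MPI_Bcast(&m, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: m = %d\n", rank, m);
    MPI_Finalize();
    return 0;
}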

Since an implementation by means of a sequential for-loop with $p-1$ iterations becomes a bottleneck, divide-and-conquer approaches are common. One possibility is to utilize a binomial tree structure with the requirement that $p$ has to be a power of two. When a processing unit is responsible for sending $m$ to processing units $i..j$, it sends $m$ to processing unit $\lceil (i+j)/2 \rceil$ and delegates responsibility for the processing units $\lceil (i+j)/2 \rceil..j$ to it, while its own responsibility is cut down to $i..\lceil (i+j)/2 \rceil - 1$.
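
A hedged C sketch of such a divide-and-conquer broadcast, written in the common "recursive doubling" indexing (which differs slightly from the interval splitting described above), assuming $p$ is a power of two, root rank 0, an integer buffer, and an already initialized MPI environment; the function name binomial_bcast and the use of blocking point-to-point calls are illustrative choices:

/* Binomial-tree broadcast of n ints from rank 0; p must be a power of two. */
void binomial_bcast(int *buf, int n, int rank, int p, MPI_Comm comm) {
    for (int k = 1; k < p; k <<= 1) {
        if (rank < k) {
            /* this rank already holds the message: hand it to rank + k */
            if (rank + k < p)
                MPI_Send(buf, n, MPI_INT, rank + k, 0, comm);
        } else if (rank < 2 * k) {
            /* receive the message from the rank that delegated to us */
            MPI_Recv(buf, n, MPI_INT, rank - k, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}

Each of the $\log p$ iterations doubles the number of ranks that hold $m$, matching the logarithmic depth of the binomial tree.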

Binomial trees have a problem with long messages $m$. The receiving unit of $m$ can only propagate the message to other units after it has received the whole message. In the meantime, the communication network is not utilized. Therefore, pipelining on binary trees is used, where $m$ is split into an array of $k$ packets of size $\lceil n/k \rceil$. The packets are then broadcast one after another, so that data is distributed quickly in the communication network.

Pipelined broadcast on a balanced binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$, whereas the non-pipelined case takes $\mathcal{O}((\alpha + \beta n) \log p)$.

Reduce

Information flow of Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.

The reduce pattern[4] is used to collect data or partial results from different processing units and to combine them into a global result by a chosen operator. Given $p$ processing units, message $m_i$ is on processing unit $p_i$ initially. All $m_i$ are aggregated by $\otimes$ and the result is eventually stored on $p_0$. The reduction operator $\otimes$ must be at least associative. Some algorithms require a commutative operator with a neutral element. Operators like $\mathrm{sum}$, $\min$ and $\max$ are common.
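
In MPI this corresponds to MPI_Reduce; a minimal sketch, assuming each rank contributes one integer $m_i$, the operator $\otimes$ is addition (MPI_SUM), and the result is collected on rank 0 (values and datatype are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank + 1;   /* local contribution m_i (illustrative value) */
    int result = 0;       /* only meaningful on the root afterwards */
    MPI_Reduce(&m_i, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over all ranks: %d\n", result);
    MPI_Finalize();
    return 0;
}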

Implementation considerations are similar to broadcast (§ Broadcast). For pipelining on binary trees, the message must be representable as a vector of smaller objects for component-wise reduction.

Pipelined reduce on a balanced binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$.

All-Reduce

Information flow of All-Reduce operation performed on three nodes. f is the associative operator and α is the result of the reduction.

The all-reduce pattern[5] (also called allreduce) is used if the result of a reduce operation (§ Reduce) must be distributed to all processing units. Given $p$ processing units, message $m_i$ is on processing unit $p_i$ initially. All $m_i$ are aggregated by an operator $\otimes$ and the result is eventually stored on all $p_i$. Analogous to the reduce operation, the operator $\otimes$ must be at least associative.
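
The corresponding MPI call is MPI_Allreduce; a minimal sketch, again assuming one integer per rank and MPI_SUM as the operator (illustrative choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank + 1;   /* local contribution (illustrative value) */
    int sum = 0;          /* holds the reduction result on every rank */
    MPI_Allreduce(&m_i, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global sum = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}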

All-reduce can be interpreted as a reduce operation with a subsequent broadcast (§ Broadcast). For long messages a corresponding implementation is suitable, whereas for short messages, the latency can be reduced by using a hypercube topology (Hypercube (communication pattern) § All-Gather / All-Reduce) if $p$ is a power of two. All-reduce can also be implemented with a butterfly algorithm and achieve optimal latency and bandwidth.[6]

All-reduce is possible in $\mathcal{O}(\alpha \log p + \beta n)$, since reduce and broadcast are possible in $\mathcal{O}(\alpha \log p + \beta n)$ with pipelining on balanced binary trees. All-reduce implemented with a butterfly algorithm achieves the same asymptotic runtime.

Prefix-Sum/Scan

Information flow of Prefix-Sum/Scan operation performed on three nodes. The operator + can be any associative operator.

The prefix-sum or scan operation[7] is used to collect data or partial results from different processing units and to compute intermediate results by an operator, which are stored on those processing units. It can be seen as a generalization of the reduce operation (§ Reduce). Given $p$ processing units, message $m_i$ is on processing unit $p_i$. The operator $\otimes$ must be at least associative, whereas some algorithms also require a commutative operator and a neutral element. Common operators are $\mathrm{sum}$, $\min$ and $\max$. Eventually processing unit $p_i$ stores the prefix sum $\bigotimes_{i' \le i} m_{i'}$. In the case of the so-called exclusive prefix sum, processing unit $p_i$ stores the prefix sum $\bigotimes_{i' < i} m_{i'}$. Some algorithms require storing the overall sum at each processing unit in addition to the prefix sums.
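
MPI exposes both variants as MPI_Scan (inclusive) and MPI_Exscan (exclusive); a minimal sketch with one integer per rank and addition as the operator (illustrative choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int m_i = rank + 1;      /* local message m_i (illustrative value) */
    int incl = 0, excl = 0;

    /* inclusive prefix sum: rank i obtains m_0 + ... + m_i */
    MPI_Scan(&m_i, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* exclusive prefix sum: rank i obtains m_0 + ... + m_{i-1}
       (the result on rank 0 is undefined by the MPI standard) */
    MPI_Exscan(&m_i, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: inclusive = %d, exclusive = %d\n", rank, incl, excl);
    MPI_Finalize();
    return 0;
}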

For short messages, this can be achieved with a hypercube topology if $p$ is a power of two. For long messages, the hypercube topology (Hypercube (communication pattern) § Prefix sum, Prefix sum § Distributed memory: Hypercube algorithm) is not suitable, since all processing units are active in every step and therefore pipelining can't be used. A binary tree topology is better suited for arbitrary $p$ and long messages (Prefix sum § Large Message Sizes: Pipelined Binary Tree).

Prefix-sum on a binary tree can be implemented with an upward and downward phase. In the upward phase reduction is performed, while the downward phase is similar to broadcast, where the prefix sums are computed by sending different data to the left and right children. With this approach pipelining is possible, because the operations are equal to reduction (§ Reduce) and broadcast (§ Broadcast).

Pipelined prefix sum on a binary tree is possible in $\mathcal{O}(\alpha \log p + \beta n)$.

Barrier

The barrier[8] as a collective operation is a generalization of the concept of a barrier that can be used in distributed computing. When a processing unit calls barrier, it waits until all other processing units have called barrier as well. Barrier is thus used to achieve global synchronization in distributed computing.

One way to implement barrier is to call all-reduce (§ All-Reduce) with an empty or dummy operand. We know the runtime of all-reduce is $\mathcal{O}(\alpha \log p + \beta n)$. Using a dummy operand reduces the size $n$ to a constant, which leads to a runtime of $\mathcal{O}(\alpha \log p)$.
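
MPI provides MPI_Barrier directly; the reduction-based construction above can be sketched as an all-reduce on a single dummy integer (a sketch of the idea only, not how MPI libraries necessarily implement barriers; the function name is illustrative and an initialized MPI environment is assumed):

/* No rank can leave before every rank has contributed its dummy value. */
void barrier_via_allreduce(MPI_Comm comm) {
    int dummy_in = 0, dummy_out = 0;   /* constant-size dummy operand */
    MPI_Allreduce(&dummy_in, &dummy_out, 1, MPI_INT, MPI_SUM, comm);
}

In practice, MPI_Barrier(MPI_COMM_WORLD) achieves the same effect.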

Gather

Information flow of Gather operation performed on three nodes.

The gather communication pattern[9] is used to store data from all processing units on a single processing unit. Given $p$ processing units, message $m_i$ is initially on processing unit $p_i$. For a fixed processing unit $p_j$, we want to store the message $m_1 \cdot m_2 \cdot \ldots \cdot m_p$ on $p_j$. Gather can be thought of as a reduce operation (§ Reduce) that uses the concatenation operator. This works because concatenation is associative. By using the same binomial tree reduction algorithm we get a runtime of $\mathcal{O}(\alpha \log p + \beta p n)$. We see that the asymptotic runtime is similar to the asymptotic runtime of reduce, $\mathcal{O}(\alpha \log p + \beta n)$, but with an additional factor $p$ in the term $\beta n$. This additional factor is due to the message size increasing in each step as messages get concatenated. Compare this to reduce, where the message size is constant for operators like $\min$.
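
The MPI call for this pattern is MPI_Gather; a minimal sketch, assuming rank 0 as the fixed receiver and one integer per rank (illustrative choices):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int m_i = rank;                       /* one word per rank (illustrative) */
    int *all = NULL;
    if (rank == 0)
        all = malloc(p * sizeof(int));    /* root receives the concatenation */

    MPI_Gather(&m_i, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < p; i++)
            printf("m_%d = %d\n", i, all[i]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}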

All-Gather

Information flow of All-Gather operation performed on three nodes.

The all-gather communication pattern[9] is used to collect data from all processing units and to store the collected data on all processing units. Given $p$ processing units $p_i$, with message $m_i$ initially stored on $p_i$, we want to store the message $m_1 \cdot m_2 \cdot \ldots \cdot m_p$ on each $p_j$.
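
The MPI call for this pattern is MPI_Allgather; a minimal sketch with one integer per rank (illustrative choices):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int m_i = rank;                        /* one word per rank (illustrative) */
    int *all = malloc(p * sizeof(int));    /* every rank gets the concatenation */

    MPI_Allgather(&m_i, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d now holds m_0..m_%d\n", rank, p - 1);
    free(all);
    MPI_Finalize();
    return 0;
}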

It can be thought of in multiple ways. The first is as an all-reduce operation (§ All-Reduce) with concatenation as the operator, in the same way that gather can be represented by reduce. The second is as a gather operation followed by a broadcast of the new message of size $pn$. With this we see that all-gather is possible in $\mathcal{O}(\alpha \log p + \beta p n)$.

Scatter

Information flow of Scatter operation performed on three nodes.

The scatter communication pattern[10] is used to distribute data from one processing unit to all the processing units. It differs from broadcast in that it does not send the same message to all processing units. Instead, it splits the message and delivers one part of it to each processing unit.

Given $p$ processing units $p_i$ and a fixed processing unit $p_j$ that holds the message $m = m_1 \cdot m_2 \cdot \ldots \cdot m_p$, we want to transport the message $m_i$ onto $p_i$. The same implementation concerns as for gather (§ Gather) apply. This leads to an optimal runtime in $\mathcal{O}(\alpha \log p + \beta p n)$.
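
The MPI call for this pattern is MPI_Scatter; a minimal sketch, assuming rank 0 holds the full message and each rank receives one integer (values and layout are illustrative):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *m = NULL;
    if (rank == 0) {                      /* the fixed root holds m = m_1 ... m_p */
        m = malloc(p * sizeof(int));
        for (int i = 0; i < p; i++) m[i] = 10 * i;   /* illustrative parts */
    }

    int m_i = 0;                          /* each rank receives its own part */
    MPI_Scatter(m, 1, MPI_INT, &m_i, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d\n", rank, m_i);
    if (rank == 0) free(m);
    MPI_Finalize();
    return 0;
}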

All-to-all

All-to-all[11] is the most general communication pattern. For $0 \leq i, j < p$, message $m_{i,j}$ is the message that is initially stored on node $i$ and has to be delivered to node $j$. We can express all communication primitives that do not use operators through all-to-all. For example, broadcast of message $m$ from node $p_k$ is emulated by setting $m_{i,j} = m$ for $i = k$ and setting $m_{l,j}$ empty for $l \neq k$.

Assuming we have a fully connected network, the best possible runtime for all-to-all is in $\mathcal{O}(p(\alpha + \beta n))$. This is achieved through $p$ rounds of direct message exchange. For $p$ a power of 2, in communication round $k$, node $p_i$ exchanges messages with node $p_j$ with $j = i \oplus k$.
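
MPI offers this pattern as MPI_Alltoall. The round structure described above can be sketched with pairwise exchanges, assuming $p$ is a power of two, one integer per node pair, and an already initialized MPI environment; the function name alltoall_pairwise and the buffer layout are illustrative:

/* send[j] holds the item destined for rank j; recv[j] receives the item from rank j. */
void alltoall_pairwise(const int *send, int *recv, int rank, int p, MPI_Comm comm) {
    recv[rank] = send[rank];            /* the item kept locally needs no communication */
    for (int k = 1; k < p; k++) {
        int partner = rank ^ k;         /* in round k, node i exchanges with node i XOR k */
        MPI_Sendrecv(&send[partner], 1, MPI_INT, partner, 0,
                     &recv[partner], 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

Because XOR with $k$ is an involution, the ranks pair up perfectly in every round, so each of the $p-1$ communication rounds is a direct message exchange between matched partners.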

If the message size is small and latency dominates the communication, a hypercube algorithm can be used to distribute the messages in time $\mathcal{O}(\log(p)(\alpha + \beta p n))$.

Information flow of All-to-All operation performed on three nodes. Letters indicate nodes and numbers indicate information items.

Runtime Overview

This table[12] gives an overview of the best known asymptotic runtimes, assuming we have free choice of network topology.

Example topologies for an optimal runtime are the binary tree, the binomial tree, and the hypercube.

In practice, we have to adjust to the available physical topologies, e.g. dragonfly, fat tree, or grid network.

More information can be found under Network topology.

For each operation, the optimal algorithm can depend on the input size $n$. For example, broadcast for short messages is best implemented using a binomial tree, whereas for long messages a pipelined communication on a balanced binary tree is optimal.

The complexities stated in the table depend on the latency $\alpha$ and the communication cost per word $\beta$ in addition to the number of processing units $p$ and the input message size per node $n$. The # senders and # receivers columns represent the number of senders and receivers that are involved in the operation, respectively. The # messages column lists the number of input messages, and the Computations? column indicates if any computations are done on the messages or if the messages are just delivered without processing. Complexity gives the asymptotic runtime complexity of an optimal implementation under free choice of topology.

Name         # senders   # receivers   # messages   Computations?   Complexity
Broadcast    $1$         $p$           $1$          no              $\mathcal{O}(\alpha \log p + \beta n)$
Reduce       $p$         $1$           $p$          yes             $\mathcal{O}(\alpha \log p + \beta n)$
All-reduce   $p$         $p$           $p$          yes             $\mathcal{O}(\alpha \log p + \beta n)$
Prefix sum   $p$         $p$           $p$          yes             $\mathcal{O}(\alpha \log p + \beta n)$
Barrier      $p$         $p$           $0$          no              $\mathcal{O}(\alpha \log p)$
Gather       $p$         $1$           $p$          no              $\mathcal{O}(\alpha \log p + \beta p n)$
All-Gather   $p$         $p$           $p$          no              $\mathcal{O}(\alpha \log p + \beta p n)$
Scatter      $1$         $p$           $p$          no              $\mathcal{O}(\alpha \log p + \beta p n)$
All-To-All   $p$         $p$           $p^2$        no              $\mathcal{O}(\log(p)(\alpha + \beta p n))$ or $\mathcal{O}(p(\alpha + \beta n))$

Notes

  1. ^ Intercommunicator Collective Operations. The Message Passing Interface (MPI) standard, chapter 7.3.1. Mathematics and Computer Science Division, Argonne National Laboratory.
  2. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 395
  3. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 396-401
  4. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 402-403
  5. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 403-404
  6. ^ Yuan, Xin (February 2009). "Bandwidth optimal all-reduce algorithms for clusters of workstations" (PDF). Journal of Parallel and Distributed Computing. 69 (2).
  7. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 404-406
  8. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 408
  9. ^ a b Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 412-413
  10. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 413
  11. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, pp. 413-418
  12. ^ Sanders, Mehlhorn, Dietzfelbinger, Dementiev 2019, p. 394

References

Sanders, Peter; Mehlhorn, Kurt; Dietzfelbinger, Martin; Dementiev, Roman (2019). Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox. Springer Nature Switzerland AG. ISBN 978-3-030-25208-3.
