fbpx
Wikipedia

Hypercube (communication pattern)

-dimensional hypercube is a network topology for parallel computers with processing elements. The topology allows for an efficient implementation of some basic communication primitives such as Broadcast, All-Reduce, and Prefix sum.[1] The processing elements are numbered through . Each processing element is adjacent to processing elements whose numbers differ in one and only one bit. The algorithms described in this page utilize this structure efficiently.

Algorithm outline edit

Most of the communication primitives presented in this article share a common template.[2] Initially, each processing element possesses one message that must reach every other processing element during the course of the algorithm. The following pseudo code sketches the communication steps necessary. Hereby, Initialization, Operation, and Output are placeholders that depend on the given communication primitive (see next section).

Input: message  . Output: depends on Initialization, Operation and Output. Initialization   for   do   Send   to   Receive   from   Operation  endfor Output 

Each processing element iterates over its neighbors (the expression   negates the  -th bit in  's binary representation, therefore obtaining the numbers of its neighbors). In each iteration, each processing element exchanges a message with the neighbor and processes the received message afterwards. The processing operation depends on the communication primitive.

 
Algorithm outline applied to the  -dimensional hypercube. In the first step (before any communication), each processing element possesses one message (blue). Communication is marked red. After each step, the processing elements store the received message, but other operations are also possible.

Communication primitives edit

Prefix sum edit

In the beginning of a prefix sum operation, each processing element   owns a message  . The goal is to compute  , where   is an associative operation. The following pseudo code describes the algorithm.

Input: message   of processor  . Output: prefix sum   of processor  .     for   do   Send   to   Receive   from     if bit   in   is set then   endfor 

The algorithm works as follows. Observe that hypercubes of dimension   can be split into two hypercubes of dimension  . Refer to the sub cube containing nodes with a leading 0 as the 0-sub cube and the sub cube consisting of nodes with a leading 1 as 1-sub cube. Once both sub cubes have calculated the prefix sum, the sum over all elements in the 0-sub cube has to be added to the every element in the 1-sub cube, since every processing element in the 0-sub cube has a lower rank than the processing elements in the 1-sub cube. The pseudo code stores the prefix sum in variable   and the sum over all nodes in a sub cube in variable  . This makes it possible for all nodes in 1-sub cube to receive the sum over the 0-sub cube in every step.

This results in a factor of   for   and a factor of   for  :  .

 
Example for a prefix sum calculation. Upper number: tentatetive prefix sum (variable  ). Lower number: sum over all elements in the sub cube (variable  ).

All-gather / all-reduce edit

All-gather operations start with each processing element having a message  . The goal of the operation is for each processing element to know the messages of all other processing elements, i.e.   where   is concatenation. The operation can be implemented following the algorithm template.

Input: message   at processing unit  . Output: all messages  .   for   do   Send   to   Receive   from     endfor 

With each iteration, the transferred message doubles in length. This leads to a runtime of  .

The same principle can be applied to the All-Reduce operations, but instead of concatenating the messages, it performs a reduction operation on the two messages. So it is a Reduce operation, where all processing units know the result. Compared to a normal reduce operation followed by a broadcast, All-Reduce in hypercubes reduces the number of communication steps.

All-to-all edit

Here every processing element has a unique message for all other processing elements.

Input: message   at processing element   to processing element  . for   do Receive from processing element  : all messages for my  -dimensional sub cube Send to processing element  : all messages for its  -dimensional sub cube endfor 

With each iteration a messages comes closer to its destination by one dimension, if it hasn't arrived yet. Hence, all messages have reached their target after at most   steps. In every step,   messages are sent: in the first iteration, half of the messages aren't meant for the own sub cube. In every following step, the sub cube is only half the size as before, but in the previous step exactly the same number of messages arrived from another processing element.

This results in a run-time of  .

ESBT-broadcast edit

The ESBT-broadcast (Edge-disjoint Spanning Binomial Tree) algorithm[3] is a pipelined broadcast algorithm with optimal runtime for clusters with hypercube network topology. The algorithm embeds   edge-disjoint binomial trees in the hypercube, such that each neighbor of processing element   is the root of a spanning binomial tree on   nodes. To broadcast a message, the source node splits its message into   chunks of equal size and cyclically sends them to the roots of the binomial trees. Upon receiving a chunk, the binomial trees broadcast it.

Runtime edit

In each step, the source node sends one of its   chunks to a binomial tree. Broadcasting the chunk within the binomial tree takes   steps. Thus, it takes   steps to distribute all chunks and additionally   steps until the last binomial tree broadcast has finished, resulting in   steps overall. Therefore, the runtime for a message of length   is  . With the optimal chunk size  , the optimal runtime of the algorithm is  .

Construction of the binomial trees edit

 
A  -dimensional hypercubes with three ESBT embedded.

This section describes how to construct the binomial trees systematically. First, construct a single binomial spanning tree von   nodes as follows. Number the nodes from   to   and consider their binary representation. Then the children of each nodes are obtained by negating single leading zeroes. This results in a single binomial spanning tree. To obtain   edge-disjoint copies of the tree, translate and rotate the nodes: for the  -th copy of the tree, apply a XOR operation with   to each node. Subsequently, right-rotate all nodes by   digits. The resulting binomial trees are edge-disjoint and therefore fulfill the requirements for the ESBT-broadcasting algorithm.

References edit

  1. ^ Grama, A.(2003). Introduction to Parallel Computing. Addison Wesley; Auflage: 2 ed. ISBN 978-0201648652.
  2. ^ Foster, I.(1995). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison Wesley; ISBN 0201575949.
  3. ^ Johnsson, S.L.; Ho, C.-T. (1989). "Optimum broadcasting and personalized communication in hypercubes". IEEE Transactions on Computers. 38 (9): 1249–1268. doi:10.1109/12.29465. ISSN 0018-9340.

hypercube, communication, pattern, displaystyle, dimensional, hypercube, network, topology, parallel, computers, with, displaystyle, processing, elements, topology, allows, efficient, implementation, some, basic, communication, primitives, such, broadcast, red. d displaystyle d dimensional hypercube is a network topology for parallel computers with 2d displaystyle 2 d processing elements The topology allows for an efficient implementation of some basic communication primitives such as Broadcast All Reduce and Prefix sum 1 The processing elements are numbered 0 displaystyle 0 through 2d 1 displaystyle 2 d 1 Each processing element is adjacent to processing elements whose numbers differ in one and only one bit The algorithms described in this page utilize this structure efficiently Contents 1 Algorithm outline 2 Communication primitives 2 1 Prefix sum 2 2 All gather all reduce 2 3 All to all 3 ESBT broadcast 3 1 Runtime 3 2 Construction of the binomial trees 4 ReferencesAlgorithm outline editMost of the communication primitives presented in this article share a common template 2 Initially each processing element possesses one message that must reach every other processing element during the course of the algorithm The following pseudo code sketches the communication steps necessary Hereby Initialization Operation and Output are placeholders that depend on the given communication primitive see next section Input message m displaystyle m nbsp Output depends on Initialization Operation and Output Initialization s m displaystyle s m nbsp for 0 k lt d displaystyle 0 leq k lt d nbsp do y i XOR 2k displaystyle y i text XOR 2 k nbsp Send s displaystyle s nbsp to y displaystyle y nbsp Receive m displaystyle m nbsp from y displaystyle y nbsp Operation s m displaystyle s m nbsp endfor Output Each processing element iterates over its neighbors the expression i XOR 2k displaystyle i text XOR 2 k nbsp negates the k displaystyle k nbsp th bit in i displaystyle i nbsp s binary representation therefore obtaining the numbers of its neighbors In each iteration each processing element exchanges a message with the neighbor and processes the received message afterwards The processing operation depends on the communication primitive nbsp Algorithm outline applied to the 3 displaystyle 3 nbsp dimensional hypercube In the first step before any communication each processing element possesses one message blue Communication is marked red After each step the processing elements store the received message but other operations are also possible Communication primitives editPrefix sum edit In the beginning of a prefix sum operation each processing element i displaystyle i nbsp owns a message mi displaystyle m i nbsp The goal is to compute 0 j imj displaystyle bigoplus 0 leq j leq i m j nbsp where displaystyle oplus nbsp is an associative operation The following pseudo code describes the algorithm Input message mi displaystyle m i nbsp of processor i displaystyle i nbsp Output prefix sum 0 j imj displaystyle bigoplus 0 leq j leq i m j nbsp of processor i displaystyle i nbsp x mi displaystyle x m i nbsp s mi displaystyle sigma m i nbsp for 0 k d 1 displaystyle 0 leq k leq d 1 nbsp do y i XOR 2k displaystyle y i text XOR 2 k nbsp Send s displaystyle sigma nbsp to y displaystyle y nbsp Receive m displaystyle m nbsp from y displaystyle y nbsp s s m displaystyle sigma sigma oplus m nbsp if bit k displaystyle k nbsp in i displaystyle i nbsp is set then x x m displaystyle x x oplus m nbsp endfor The algorithm works as follows Observe that hypercubes of dimension d displaystyle d nbsp can be split into two hypercubes of dimension d 1 displaystyle d 1 nbsp Refer to the sub cube containing nodes with a leading 0 as the 0 sub cube and the sub cube consisting of nodes with a leading 1 as 1 sub cube Once both sub cubes have calculated the prefix sum the sum over all elements in the 0 sub cube has to be added to the every element in the 1 sub cube since every processing element in the 0 sub cube has a lower rank than the processing elements in the 1 sub cube The pseudo code stores the prefix sum in variable x displaystyle x nbsp and the sum over all nodes in a sub cube in variable s displaystyle sigma nbsp This makes it possible for all nodes in 1 sub cube to receive the sum over the 0 sub cube in every step This results in a factor of log p displaystyle log p nbsp for Tstart displaystyle T text start nbsp and a factor of nlog p displaystyle n log p nbsp for Tbyte displaystyle T text byte nbsp T n p Tstart nTbyte log p displaystyle T n p T text start nT text byte log p nbsp nbsp Example for a prefix sum calculation Upper number tentatetive prefix sum variable x displaystyle x nbsp Lower number sum over all elements in the sub cube variable s displaystyle sigma nbsp All gather all reduce edit All gather operations start with each processing element having a message mi displaystyle m i nbsp The goal of the operation is for each processing element to know the messages of all other processing elements i e x m0 m1 mp displaystyle x m 0 cdot m 1 dots m p nbsp where displaystyle cdot nbsp is concatenation The operation can be implemented following the algorithm template Input message x mi displaystyle x m i nbsp at processing unit i displaystyle i nbsp Output all messages m1 m2 mp displaystyle m 1 cdot m 2 dots m p nbsp x mi displaystyle x m i nbsp for 0 k lt d displaystyle 0 leq k lt d nbsp do y i XOR 2k displaystyle y i text XOR 2 k nbsp Send x displaystyle x nbsp to y displaystyle y nbsp Receive x displaystyle x nbsp from y displaystyle y nbsp x x x displaystyle x x cdot x nbsp endfor With each iteration the transferred message doubles in length This leads to a runtime of T n p j 0d 1 Tstart n 2jTbyte log p Tstart p 1 nTbyte displaystyle T n p approx sum j 0 d 1 T text start n cdot 2 j T text byte log p T text start p 1 nT text byte nbsp The same principle can be applied to the All Reduce operations but instead of concatenating the messages it performs a reduction operation on the two messages So it is a Reduce operation where all processing units know the result Compared to a normal reduce operation followed by a broadcast All Reduce in hypercubes reduces the number of communication steps All to all edit Here every processing element has a unique message for all other processing elements Input message mij displaystyle m ij nbsp at processing element i displaystyle i nbsp to processing element j displaystyle j nbsp for d gt k 0 displaystyle d gt k geq 0 nbsp do Receive from processing element i XOR 2k displaystyle i text XOR 2 k nbsp all messages for my k displaystyle k nbsp dimensional sub cube Send to processing element i XOR 2k displaystyle i text XOR 2 k nbsp all messages for its k displaystyle k nbsp dimensional sub cube endfor With each iteration a messages comes closer to its destination by one dimension if it hasn t arrived yet Hence all messages have reached their target after at most d log p displaystyle d log p nbsp steps In every step p 2 displaystyle p 2 nbsp messages are sent in the first iteration half of the messages aren t meant for the own sub cube In every following step the sub cube is only half the size as before but in the previous step exactly the same number of messages arrived from another processing element This results in a run time of T n p log p Tstart p2nTbyte displaystyle T n p approx log p T text start frac p 2 nT text byte nbsp ESBT broadcast editThe ESBT broadcast Edge disjoint Spanning Binomial Tree algorithm 3 is a pipelined broadcast algorithm with optimal runtime for clusters with hypercube network topology The algorithm embeds d displaystyle d nbsp edge disjoint binomial trees in the hypercube such that each neighbor of processing element 0 displaystyle 0 nbsp is the root of a spanning binomial tree on 2d 1 displaystyle 2 d 1 nbsp nodes To broadcast a message the source node splits its message into k displaystyle k nbsp chunks of equal size and cyclically sends them to the roots of the binomial trees Upon receiving a chunk the binomial trees broadcast it Runtime edit In each step the source node sends one of its k displaystyle k nbsp chunks to a binomial tree Broadcasting the chunk within the binomial tree takes d displaystyle d nbsp steps Thus it takes k displaystyle k nbsp steps to distribute all chunks and additionally d displaystyle d nbsp steps until the last binomial tree broadcast has finished resulting in k d displaystyle k d nbsp steps overall Therefore the runtime for a message of length n displaystyle n nbsp is T n p k nkTbyte Tstart k d displaystyle T n p k left frac n k T text byte T text start right k d nbsp With the optimal chunk size k nd TbyteTstart displaystyle k sqrt frac nd cdot T text byte T text start nbsp the optimal runtime of the algorithm is T n p n Tbyte log p Tstart nlog p Tstart Tbyte displaystyle T n p n cdot T text byte log p cdot T text start sqrt n log p cdot T text start cdot T text byte nbsp Construction of the binomial trees edit nbsp A 3 displaystyle 3 nbsp dimensional hypercubes with three ESBT embedded This section describes how to construct the binomial trees systematically First construct a single binomial spanning tree von 2d displaystyle 2 d nbsp nodes as follows Number the nodes from 0 displaystyle 0 nbsp to 2d 1 displaystyle 2 d 1 nbsp and consider their binary representation Then the children of each nodes are obtained by negating single leading zeroes This results in a single binomial spanning tree To obtain d displaystyle d nbsp edge disjoint copies of the tree translate and rotate the nodes for the k displaystyle k nbsp th copy of the tree apply a XOR operation with 2k displaystyle 2 k nbsp to each node Subsequently right rotate all nodes by k displaystyle k nbsp digits The resulting binomial trees are edge disjoint and therefore fulfill the requirements for the ESBT broadcasting algorithm References edit Grama A 2003 Introduction to Parallel Computing Addison Wesley Auflage 2 ed ISBN 978 0201648652 Foster I 1995 Designing and Building Parallel Programs Concepts and Tools for Parallel Software Engineering Addison Wesley ISBN 0201575949 Johnsson S L Ho C T 1989 Optimum broadcasting and personalized communication in hypercubes IEEE Transactions on Computers 38 9 1249 1268 doi 10 1109 12 29465 ISSN 0018 9340 Retrieved from https en wikipedia org w index php title Hypercube communication pattern amp oldid 1192735892, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.