
Cannon's algorithm

In computer science, Cannon's algorithm is a distributed algorithm for matrix multiplication for two-dimensional meshes first described in 1969 by Lynn Elliot Cannon.[1][2]

It is especially suitable for computers laid out in an N × N mesh.[3] While Cannon's algorithm works well in homogeneous 2D grids, extending it to heterogeneous 2D grids has been shown to be difficult.[4]

The main advantage of the algorithm is that its storage requirements remain constant and are independent of the number of processors.[2]

The Scalable Universal Matrix Multiplication Algorithm (SUMMA)[5] is a more practical algorithm that requires less workspace and overcomes the need for a square 2D grid. It is used by the ScaLAPACK, PLAPACK, and Elemental libraries.

Algorithm overview

When multiplying two n × n matrices A and B, we need p = n × n processing nodes arranged in a 2D grid, one per matrix element.

    // PE(i, j)
    k := (i + j) mod N;
    a := a[i][k];
    b := b[k][j];
    c[i][j] := 0;
    for (l := 0; l < N; l++) {
        c[i][j] := c[i][j] + a * b;
        concurrently {
            send a to PE(i, (j + N − 1) mod N);
            send b to PE((i + N − 1) mod N, j);
        } with {
            receive a' from PE(i, (j + 1) mod N);
            receive b' from PE((i + 1) mod N, j);
        }
        a := a';
        b := b';
    }
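The per-PE pseudocode above can be simulated sequentially. The following Python sketch keeps one scalar per PE in plain lists and applies the same initial skew and cyclic shifts; the function name cannon_multiply is illustrative, and a real implementation would use message passing rather than whole-grid list rebuilds:

```python
# Sequential simulation of Cannon's algorithm: one matrix element per PE,
# initial skew k = (i + j) mod N, then N rounds of multiply-accumulate
# with a shifted cyclically left and b shifted cyclically up.
def cannon_multiply(A, B):
    N = len(A)
    # Initial skew: PE(i, j) starts with a[i][k] and b[k][j], k = (i + j) mod N.
    a = [[A[i][(i + j) % N] for j in range(N)] for i in range(N)]
    b = [[B[(i + j) % N][j] for j in range(N)] for i in range(N)]
    C = [[0] * N for _ in range(N)]
    for _ in range(N):
        for i in range(N):
            for j in range(N):
                C[i][j] += a[i][j] * b[i][j]
        # Shift a cyclically to the left and b cyclically upwards.
        a = [[a[i][(j + 1) % N] for j in range(N)] for i in range(N)]
        b = [[b[(i + 1) % N][j] for j in range(N)] for i in range(N)]
    return C
```

After N rounds every PE(i, j) has accumulated all products a[i][k] · b[k][j], so C equals the ordinary matrix product.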

We need to select k in every iteration for every processing element (PE) so that processors don't access the same data for computing a_{ik} · b_{kj}.

Therefore, processors in the same row or column must begin the summation with different indices. If, for example, PE(0,0) calculates a_{00} · b_{00} in the first step, PE(0,1) chooses a_{01} · b_{11} first. The selection of k := (i + j) mod n for PE(i, j) satisfies this constraint for the first step.
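This property of the skew is easy to check directly. The short Python sketch below, for a hypothetical grid size N = 4, verifies that k = (i + j) mod N is distinct within every row and every column:

```python
# Check that the initial skew k = (i + j) mod N assigns every PE in a
# given row (and in a given column) a different starting index, so no
# two of them fetch the same a or b in the first step.
N = 4
k = [[(i + j) % N for j in range(N)] for i in range(N)]
for i in range(N):
    assert sorted(k[i]) == list(range(N))                        # row i: all distinct
    assert sorted(k[r][i] for r in range(N)) == list(range(N))   # column i: all distinct
```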

In the first step we distribute the input matrices between the processors based on the previous rule.

In the next iterations we choose a new k' := (k + 1) mod n for every processor. This way every processor continues to access different values of the matrices, and the needed data is always at a neighbouring processor: PE(i, j) needs the a from PE(i, (j + 1) mod n) and the b from PE((i + 1) mod n, j) for the next step. This means that a has to be passed cyclically to the left and b cyclically upwards. The results of the multiplications are summed up as usual. After n steps, each processor has calculated every product a_{ik} · b_{kj} exactly once, and their sum is thus the sought entry c_{ij}.

After the initial distribution to each processor, only the data for the next step has to be stored: the intermediate result of the previous sum, an a_{ik}, and a b_{kj}. This means that all three matrices need to be stored in memory only once, evenly distributed across the processors.

Generalisation

In practice we have far fewer processors than matrix elements. We can replace the matrix elements with submatrices, so that every processor processes more values. The scalar multiplication and addition then become sequential matrix multiplication and addition. The width and height of the submatrices will be N = n/√p.
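The block variant can be sketched in Python under these assumptions: blockify, matmul_add, and cannon_block_multiply are hypothetical helper names, each PE now holds one submatrix, and the cyclic shifts move whole blocks instead of scalars:

```python
def matmul_add(C, A, B):
    """C += A * B for square list-of-list blocks (sequential multiply-add)."""
    m = len(A)
    for i in range(m):
        for k in range(m):
            aik = A[i][k]
            for j in range(m):
                C[i][j] += aik * B[k][j]

def blockify(M, N):
    """Split an n x n matrix into an N x N grid of (n/N) x (n/N) blocks."""
    n = len(M)
    s = n // N
    return [[[row[j * s:(j + 1) * s] for row in M[i * s:(i + 1) * s]]
             for j in range(N)] for i in range(N)]

def cannon_block_multiply(A, B, N):
    """Cannon's algorithm on an N x N grid, one submatrix per PE."""
    n = len(A)
    s = n // N
    Ab, Bb = blockify(A, N), blockify(B, N)
    # Initial skew of whole blocks, exactly as in the scalar version.
    a = [[Ab[i][(i + j) % N] for j in range(N)] for i in range(N)]
    b = [[Bb[(i + j) % N][j] for j in range(N)] for i in range(N)]
    c = [[[[0] * s for _ in range(s)] for _ in range(N)] for _ in range(N)]
    for _ in range(N):
        for i in range(N):
            for j in range(N):
                matmul_add(c[i][j], a[i][j], b[i][j])  # block multiply-add
        # Shift blocks of a cyclically left and blocks of b cyclically up.
        a = [[a[i][(j + 1) % N] for j in range(N)] for i in range(N)]
        b = [[b[(i + 1) % N][j] for j in range(N)] for i in range(N)]
    # Reassemble the full n x n result from the per-PE blocks.
    C = [[0] * n for _ in range(n)]
    for i in range(N):
        for j in range(N):
            for r in range(s):
                for col in range(s):
                    C[i * s + r][j * s + col] = c[i][j][r][col]
    return C
```

Because block products sum exactly like scalar products, the result matches the ordinary matrix product whenever N divides n.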

The runtime of the algorithm is T(n, p) = T_coll(n/N, p) + N · T_seq(n/N) + 2(N − 1) · (T_start + T_byte · (n/N)²), where T_coll is the time for the initial distribution of the matrices in the first step, T_seq is the time for the calculation of the intermediate results, and T_start and T_byte stand for the time it takes to establish a connection and to transmit one byte, respectively.
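As a purely illustrative reading of this cost model, the sketch below evaluates it for given machine parameters. The concrete forms of T_coll and T_seq are assumptions (a linear-in-bytes collective and a cubic local multiply), not part of the model itself:

```python
# Illustrative evaluation of the runtime model
#   T(n, p) = T_coll(n/N, p) + N * T_seq(n/N) + 2(N - 1)(T_start + T_byte * (n/N)^2)
# T_coll and T_seq below are hypothetical stand-ins for machine-specific costs.
def cannon_runtime(n, N, T_start, T_byte):
    m = n // N                                       # submatrix width/height
    T_coll = T_start + T_byte * m * m                # assumed distribution cost
    T_seq = m ** 3                                   # assumed local multiply cost
    return T_coll + N * T_seq + 2 * (N - 1) * (T_start + T_byte * m * m)
```

Such a model makes the trade-off visible: the communication term grows with the number of shift rounds N, while the per-round message size shrinks with (n/N)².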

A disadvantage of the algorithm is that it performs many connection setups with small message sizes. It would be better to transmit more data in each message.

See also

  • Systolic array

References

  1. Lynn Elliot Cannon, A Cellular Computer to Implement the Kalman Filter Algorithm, Ph.D. thesis, Montana State University, 14 July 1969.
  2. Gupta, H.; Sadayappan, P.: Communication Efficient Matrix-Multiplication on Hypercubes, dbpubs.stanford.edu.
  3. 4.2 Matrix Multiplication on a Distributed Memory Machine, www.phy.ornl.gov.
  4. Jean-François Pineau, Communication-Aware Scheduling on Heterogeneous Master-Worker Platforms, Ph.D. thesis, October 2010.
  5. Robert A. van de Geijn and Jerrell Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, Concurrency: Practice and Experience, Volume 9, Issue 4, pages 255–274, April 1997.

External links

  • Lecture at Berkeley
