fbpx
Wikipedia

Parallel external memory

In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.[1] It is the parallel-computing analogy to the single-processor external memory (EM) model. In a similar way, it is the cache-aware analogy to the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

PEM Model

Model edit

Definition edit

The PEM model[1] is a combination of the EM model and the PRAM model. The PEM model is a computation model which consists of   processors and a two-level memory hierarchy. This memory hierarchy consists of a large external memory (main memory) of size   and   small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor. A processor can't access another’s cache. The caches have a size   which is partitioned in blocks of size  . The processors can only perform operations on data which are in their cache. The data can be transferred between the main memory and the cache in blocks of size  .

I/O complexity edit

The complexity measure of the PEM model is the I/O complexity,[1] which determines the number of parallel blocks transfers between the main memory and the cache. During a parallel block transfer each processor can transfer a block. So if   processors load parallelly a data block of size   form the main memory into their caches, it is considered as an I/O complexity of   not  . A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.

Read/write conflicts edit

In the PEM model, there is no direct communication network between the P processors. The processors have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently read/write conflicts[1] occur. Like in the PRAM model, three different variations of this problem are considered:

  • Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
  • Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
  • Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms[1] solve the CREW and EREW problem if   processors write to the same block simultaneously. A first approach is to serialize the write operations. Only one processor after the other writes to the block. This results in a total of   parallel block transfers. A second approach needs   parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block. In the first round   processors combine their blocks into   blocks. Then   processors combine the   blocks into  . This procedure is continued until all the data is combined in one block.

Comparison to other models edit

Model Multi-core Cache-aware
Random-access machine (RAM) No No
Parallel random-access machine (PRAM) Yes No
External memory (EM) No Yes
Parallel external memory (PEM) Yes Yes

Examples edit

Multiway partitioning edit

Let   be a vector of d-1 pivots sorted in increasing order. Let A be an unordered set of N elements. A d-way partition[1] of A is a set   , where   and   for  .   is called the i-th bucket. The number of elements in   is greater than   and smaller than  . In the following algorithm[1] the input is partitioned into N/P-sized contiguous segments   in main memory. The processor i primarily works on the segment  . The multiway partitioning algorithm (PEM_DIST_SORT[1]) uses a PEM prefix sum algorithm[1] to calculate the prefix sum with the optimal   I/O complexity. This algorithm simulates an optimal PRAM prefix sum algorithm.

// Compute parallelly a d-way partition on the data segments   for each processor i in parallel do Read the vector of pivots M into the cache. Partition   into d buckets and let vector   be the number of items in each bucket. end for Run PEM prefix sum on the set of vectors   simultaneously. // Use the prefix sum vector to compute the final partition for each processor i in parallel do Write elements   into memory locations offset appropriately by   and  . end for Using the prefix sums stored in   the last processor P calculates the vector B of bucket sizes and returns it. 

If the vector of   pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with   I/O complexity. The content of the final buckets have to be located in contiguous memory.

Selection edit

The selection problem is about finding the k-th smallest item in an unordered list A of size N. The following code[1] makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in  , and SELECT, which is a cache optimal single-processor selection algorithm.

if   then   return   end if //Find median of each   for each processor i in parallel do   end for // Sort medians   // Partition around median of medians   if   then return   else return   end if 

Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:

 

Distribution sort edit

Distribution sort partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If   the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm[1] is used:

// Sample   elements from A for each processor i in parallel do if   then   Load   in M-sized pages and sort pages individually else   Load and sort   as single page end if Pick every  'th element from each sorted memory page into contiguous vector   of samples end for in parallel do Combine vectors   into a single contiguous vector   Make   copies of  :   end do // Find   pivots   for   to   in parallel do   end for Pack pivots in contiguous array   // Partition Aaround pivots into buckets     // Recursively sort buckets for   to   in parallel do recursively call   on bucket jof size   using   processors responsible for elements in bucket j end for 

The I/O complexity of PEMDISTSORT is:

 

where

 

If the number of processors is chosen that  and   the I/O complexity is then:

 

Other PEM algorithms edit

PEM Algorithm I/O complexity Constraints
Mergesort[1]    
List ranking[2]    
Euler tour[2]    
Expression tree evaluation[2]    
Finding a MST[2]    

Where   is the time it takes to sort N items with P processors in the PEM model.

See also edit

References edit

  1. ^ a b c d e f g h i j k l Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures. New York, New York, USA: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739. S2CID 11067041.
  2. ^ a b c d Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425. S2CID 587572.

parallel, external, memory, computer, science, parallel, external, memory, model, cache, aware, external, memory, abstract, machine, parallel, computing, analogy, single, processor, external, memory, model, similar, cache, aware, analogy, parallel, random, acc. In computer science a parallel external memory PEM model is a cache aware external memory abstract machine 1 It is the parallel computing analogy to the single processor external memory EM model In a similar way it is the cache aware analogy to the parallel random access machine PRAM The PEM model consists of a number of processors together with their respective private caches and a shared main memory PEM Model Contents 1 Model 1 1 Definition 1 2 I O complexity 1 3 Read write conflicts 1 4 Comparison to other models 2 Examples 2 1 Multiway partitioning 2 2 Selection 2 3 Distribution sort 2 4 Other PEM algorithms 3 See also 4 ReferencesModel editDefinition edit The PEM model 1 is a combination of the EM model and the PRAM model The PEM model is a computation model which consists of P displaystyle P nbsp processors and a two level memory hierarchy This memory hierarchy consists of a large external memory main memory of size N displaystyle N nbsp and P displaystyle P nbsp small internal memories caches The processors share the main memory Each cache is exclusive to a single processor A processor can t access another s cache The caches have a size M displaystyle M nbsp which is partitioned in blocks of size B displaystyle B nbsp The processors can only perform operations on data which are in their cache The data can be transferred between the main memory and the cache in blocks of size B displaystyle B nbsp I O complexity edit The complexity measure of the PEM model is the I O complexity 1 which determines the number of parallel blocks transfers between the main memory and the cache During a parallel block transfer each processor can transfer a block So if P displaystyle P nbsp processors load parallelly a data block of size B displaystyle B nbsp form the main memory into their caches it is considered as an I O complexity of O 1 displaystyle O 1 nbsp not O P displaystyle O P nbsp A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches Read write conflicts edit In the PEM model there is no direct communication network between the P processors The processors have to communicate indirectly over the main memory If multiple processors try to access the same block in main memory concurrently read write conflicts 1 occur Like in the PRAM model three different variations of this problem are considered Concurrent Read Concurrent Write CRCW The same block in main memory can be read and written by multiple processors concurrently Concurrent Read Exclusive Write CREW The same block in main memory can be read by multiple processors concurrently Only one processor can write to a block at a time Exclusive Read Exclusive Write EREW The same block in main memory cannot be read or written by multiple processors concurrently Only one processor can access a block at a time The following two algorithms 1 solve the CREW and EREW problem if P B displaystyle P leq B nbsp processors write to the same block simultaneously A first approach is to serialize the write operations Only one processor after the other writes to the block This results in a total of P displaystyle P nbsp parallel block transfers A second approach needs O log P displaystyle O log P nbsp parallel block transfers and an additional block for each processor The main idea is to schedule the write operations in a binary tree fashion and gradually combine the data into a single block In the first round P displaystyle P nbsp processors combine their blocks into P 2 displaystyle P 2 nbsp blocks Then P 2 displaystyle P 2 nbsp processors combine the P 2 displaystyle P 2 nbsp blocks into P 4 displaystyle P 4 nbsp This procedure is continued until all the data is combined in one block Comparison to other models edit Model Multi core Cache awareRandom access machine RAM No NoParallel random access machine PRAM Yes NoExternal memory EM No YesParallel external memory PEM Yes YesExamples editMultiway partitioning edit Let M m1 md 1 displaystyle M m 1 m d 1 nbsp be a vector of d 1 pivots sorted in increasing order Let A be an unordered set of N elements A d way partition 1 of A is a set P A1 Ad displaystyle Pi A 1 A d nbsp where i 1dAi A displaystyle cup i 1 d A i A nbsp and Ai Aj displaystyle A i cap A j emptyset nbsp for 1 i lt j d displaystyle 1 leq i lt j leq d nbsp Ai displaystyle A i nbsp is called the i th bucket The number of elements in Ai displaystyle A i nbsp is greater than mi 1 displaystyle m i 1 nbsp and smaller than mi2 displaystyle m i 2 nbsp In the following algorithm 1 the input is partitioned into N P sized contiguous segments S1 SP displaystyle S 1 S P nbsp in main memory The processor i primarily works on the segment Si displaystyle S i nbsp The multiway partitioning algorithm PEM DIST SORT 1 uses a PEM prefix sum algorithm 1 to calculate the prefix sum with the optimal O NPB log P displaystyle O left frac N PB log P right nbsp I O complexity This algorithm simulates an optimal PRAM prefix sum algorithm Compute parallelly a d way partition on the data segments Si displaystyle S i nbsp for each processor i in parallel do Read the vector of pivots M into the cache Partition Si displaystyle S i nbsp into d buckets and let vector Mi j1i jdi displaystyle M i j 1 i j d i nbsp be the number of items in each bucket end for Run PEM prefix sum on the set of vectors M1 MP displaystyle M 1 M P nbsp simultaneously Use the prefix sum vector to compute the final partition for each processor i in parallel do Write elements Si displaystyle S i nbsp into memory locations offset appropriately by Mi 1 displaystyle M i 1 nbsp and Mi displaystyle M i nbsp end for Using the prefix sums stored in MP displaystyle M P nbsp the last processor P calculates the vector B of bucket sizes and returns it If the vector of d O MB displaystyle d O left frac M B right nbsp pivots M and the input set A are located in contiguous memory then the d way partitioning problem can be solved in the PEM model with O NPB dB gt log P dlog B displaystyle O left frac N PB left lceil frac d B right rceil gt log P d log B right nbsp I O complexity The content of the final buckets have to be located in contiguous memory Selection edit The selection problem is about finding the k th smallest item in an unordered list A of size N The following code 1 makes use of PRAMSORT which is a PRAM optimal sorting algorithm which runs in O log N displaystyle O log N nbsp and SELECT which is a cache optimal single processor selection algorithm if N P displaystyle N leq P nbsp then PRAMSORT A P displaystyle texttt PRAMSORT A P nbsp return A k displaystyle A k nbsp end if Find median of each Si displaystyle S i nbsp for each processor i in parallel do mi SELECT Si N2P displaystyle m i texttt SELECT S i frac N 2P nbsp end for Sort medians PRAMSORT m1 m2 P displaystyle texttt PRAMSORT lbrace m 1 dots m 2 rbrace P nbsp Partition around median of medians t PEMPARTITION A mP 2 P displaystyle t texttt PEMPARTITION A m P 2 P nbsp if k t displaystyle k leq t nbsp then return PEMSELECT A 1 t P k displaystyle texttt PEMSELECT A 1 t P k nbsp else return PEMSELECT A t 1 N P k t displaystyle texttt PEMSELECT A t 1 N P k t nbsp end if Under the assumption that the input is stored in contiguous memory PEMSELECT has an I O complexity of O NPB log PB log NP displaystyle O left frac N PB log PB cdot log frac N P right nbsp Distribution sort edit Distribution sort partitions an input list A of size N into d disjoint buckets of similar size Every bucket is then sorted recursively and the results are combined into a fully sorted list If P 1 displaystyle P 1 nbsp the task is delegated to a cache optimal single processor sorting algorithm Otherwise the following algorithm 1 is used Sample 4Nd displaystyle tfrac 4N sqrt d nbsp elements from A for each processor i in parallel do if M lt Si displaystyle M lt S i nbsp then d M B displaystyle d M B nbsp Load Si displaystyle S i nbsp in M sized pages and sort pages individually else d Si displaystyle d S i nbsp Load and sort Si displaystyle S i nbsp as single page end if Pick every d 4 displaystyle sqrt d 4 nbsp th element from each sorted memory page into contiguous vector Ri displaystyle R i nbsp of samples end for in parallel do Combine vectors R1 RP displaystyle R 1 dots R P nbsp into a single contiguous vector R displaystyle mathcal R nbsp Make d displaystyle sqrt d nbsp copies of R displaystyle mathcal R nbsp R1 Rd displaystyle mathcal R 1 dots mathcal R sqrt d nbsp end do Find d displaystyle sqrt d nbsp pivots M j displaystyle mathcal M j nbsp for j 1 displaystyle j 1 nbsp to d displaystyle sqrt d nbsp in parallel do M j PEMSELECT Ri Pd j 4Nd displaystyle mathcal M j texttt PEMSELECT mathcal R i tfrac P sqrt d tfrac j cdot 4N d nbsp end for Pack pivots in contiguous array M displaystyle mathcal M nbsp Partition A around pivots into buckets B displaystyle mathcal B nbsp B PEMMULTIPARTITION A 1 N M d P displaystyle mathcal B texttt PEMMULTIPARTITION A 1 N mathcal M sqrt d P nbsp Recursively sort buckets for j 1 displaystyle j 1 nbsp to d 1 displaystyle sqrt d 1 nbsp in parallel do recursively call PEMDISTSORT displaystyle texttt PEMDISTSORT nbsp on bucket j of size B j displaystyle mathcal B j nbsp using O B j N P displaystyle O left left lceil tfrac mathcal B j N P right rceil right nbsp processors responsible for elements in bucket j end for The I O complexity of PEMDISTSORT is O NPB logd P logM B NPB f N P d logd P displaystyle O left left lceil frac N PB right rceil left log d P log M B frac N PB right f N P d cdot log d P right nbsp where f N P d O log PBdlog NP dBlog P dlog B displaystyle f N P d O left log frac PB sqrt d log frac N P left lceil frac sqrt d B log P sqrt d log B right rceil right nbsp If the number of processors is chosen that f N P d O NPB displaystyle f N P d O left left lceil tfrac N PB right rceil right nbsp and M lt BO 1 displaystyle M lt B O 1 nbsp the I O complexity is then O NPBlogM B NB displaystyle O left frac N PB log M B frac N B right nbsp Other PEM algorithms edit PEM Algorithm I O complexity ConstraintsMergesort 1 O NPBlogMB NB sortP N displaystyle O left frac N PB log frac M B frac N B right textrm sort P N nbsp P NB2 M BO 1 displaystyle P leq frac N B 2 M B O 1 nbsp List ranking 2 O sortP N displaystyle O left textrm sort P N right nbsp P N B2log B logO 1 N M BO 1 displaystyle P leq frac N B 2 log B cdot log O 1 N M B O 1 nbsp Euler tour 2 O sortP N displaystyle O left textrm sort P N right nbsp P NB2 M BO 1 displaystyle P leq frac N B 2 M B O 1 nbsp Expression tree evaluation 2 O sortP N displaystyle O left textrm sort P N right nbsp P NB2log B logO 1 N M BO 1 displaystyle P leq frac N B 2 log B cdot log O 1 N M B O 1 nbsp Finding a MST 2 O sortP V sortP E log V pB displaystyle O left textrm sort P V textrm sort P E log tfrac V pB right nbsp p V E B2log B logO 1 N M BO 1 displaystyle p leq frac V E B 2 log B cdot log O 1 N M B O 1 nbsp Where sortP N displaystyle textrm sort P N nbsp is the time it takes to sort N items with P processors in the PEM model See also editParallel random access machine PRAM Random access machine RAM External memory EM References edit a b c d e f g h i j k l Arge Lars Goodrich Michael T Nelson Michael Sitchinava Nodari 2008 Fundamental parallel algorithms for private cache chip multiprocessors Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures New York New York USA ACM Press pp 197 206 doi 10 1145 1378533 1378573 ISBN 9781595939739 S2CID 11067041 a b c d Arge Lars Goodrich Michael T Sitchinava Nodari 2010 Parallel external memory graph algorithms 2010 IEEE International Symposium on Parallel amp Distributed Processing IPDPS IEEE pp 1 11 doi 10 1109 ipdps 2010 5470440 ISBN 9781424464425 S2CID 587572 Retrieved from https en wikipedia org w index php title Parallel external memory amp oldid 1180384636, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.