Bulk synchronous parallel

The bulk synchronous parallel (BSP) abstract computer is a bridging model for designing parallel algorithms. It is similar to the parallel random access machine (PRAM) model, but unlike PRAM, BSP does not take communication and synchronization for granted. In fact, quantifying the requisite synchronization and communication is an important part of analyzing a BSP algorithm.

History

The BSP model was developed by Leslie Valiant of Harvard University during the 1980s. The definitive article was published in 1990.[1]

Between 1990 and 1992, Leslie Valiant and Bill McColl of Oxford University worked on ideas for a distributed memory BSP programming model, in Princeton and at Harvard. Between 1992 and 1997, McColl led a large research team at Oxford that developed various BSP programming libraries, languages and tools, and also numerous massively parallel BSP algorithms, including many early examples of high-performance communication-avoiding parallel algorithms [2] and recursive "immortal" parallel algorithms that achieve the best possible performance and optimal parametric tradeoffs.[3]

With interest and momentum growing, McColl then led a group from Oxford, Harvard, Florida, Princeton, Bell Labs, Columbia and Utrecht that developed and published the BSPlib Standard for BSP programming in 1996.[4]

Valiant developed an extension to the BSP model in the 2000s, leading to the publication of the Multi-BSP model in 2011.[5]

In 2017, McColl developed a major new extension of the BSP model that provides fault tolerance and tail tolerance for large-scale parallel computations in AI, analytics, and high-performance computing (HPC).[6][7]

The BSP model

Overview

A BSP computer consists of the following:

  • Components capable of processing and/or local memory transactions (i.e., processors),
  • A network that routes messages between pairs of such components, and
  • A hardware facility that allows for the synchronization of all or a subset of components.

This is commonly interpreted as a set of processors that may follow different threads of computation, with each processor equipped with fast local memory and interconnected by a communication network.

BSP algorithms rely heavily on the third feature; a computation proceeds in a series of global supersteps, each of which consists of three components:

  • Concurrent computation: every participating processor may perform local computations, i.e., each process can only make use of values stored in the local fast memory of the processor. The computations occur asynchronously with respect to one another but may overlap with communication.
  • Communication: The processes exchange data to facilitate remote data storage.
  • Barrier synchronization: When a process reaches this point (the barrier), it waits until all other processes have reached the same barrier.

The computation and communication actions do not have to be ordered in time. Communication typically takes the form of the one-sided PUT and GET remote direct memory access (RDMA) calls rather than paired two-sided send and receive message-passing calls.
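
To make the superstep structure concrete, below is a minimal sketch in C against the BSPlib interface discussed later in this article. The signatures follow the published BSPlib standard;[4] actual implementations (e.g., MulticoreBSP) may differ in minor details such as the size types. Each process performs a local computation, PUTs its result one-sidedly into a neighbour's registered memory, and the superstep ends with a barrier:

    #include <stdio.h>
    #include <bsp.h>  /* BSPlib header (Oxford BSP Toolset, BSPonMPI, MulticoreBSP) */

    /* The SPMD part: every process executes this function. */
    static void spmd_part(void) {
        bsp_begin(bsp_nprocs());          /* start SPMD execution on all processors */

        int p   = bsp_nprocs();           /* number of processes */
        int pid = bsp_pid();              /* this process's id, 0..p-1 */

        int received = -1;
        bsp_push_reg(&received, sizeof(int));  /* make 'received' a PUT/GET target */
        bsp_sync();                       /* registration takes effect after a barrier */

        /* Superstep: (1) local computation ... */
        int value = pid * pid;
        /* ... (2) one-sided communication: PUT 'value' into the right neighbour ... */
        bsp_put((pid + 1) % p, &value, &received, 0, sizeof(int));
        /* ... (3) barrier synchronization: all PUTs are delivered after this. */
        bsp_sync();

        printf("process %d received %d from its left neighbour\n", pid, received);

        bsp_pop_reg(&received);
        bsp_end();
    }

    int main(int argc, char **argv) {
        bsp_init(spmd_part, argc, argv);  /* required before any other BSPlib call */
        spmd_part();
        return 0;
    }

Note that the destination process issues no matching receive; the bsp_sync barrier alone guarantees that all PUTs of the superstep have been delivered.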

 
Figure: A BSP superstep. Processes lack linear order and may be mapped to processors in any way.

The barrier synchronization concludes the superstep: it ensures that all one-sided communications are properly concluded. Systems based on two-sided communication include this synchronization cost implicitly for every message sent. The barrier synchronization method relies on the BSP computer's hardware facility. In Valiant's original paper, this facility periodically checks whether the end of the current superstep has been reached globally. The period of this check is denoted by $L$.[1]

The BSP model is also well-suited for automatic memory management for distributed-memory computing through over-decomposition of the problem and oversubscription of the processors. The computation is divided into more logical processes than there are physical processors, and processes are randomly assigned to processors. This strategy can be shown statistically to lead to almost perfect load balancing, both of work and communication.

Communication

In many parallel programming systems, communications are considered at the level of individual actions, such as sending and receiving a message or memory-to-memory transfer. This is difficult to work with since there are many simultaneous communication actions in a parallel program, and their interactions are typically complex. In particular, it is difficult to say much about the time any single communication action will take to complete.

The BSP model considers communication actions en masse. This has the effect that an upper bound on the time taken to communicate a set of data can be given. BSP considers all communication actions of a superstep as one unit and assumes all individual messages sent as part of this unit have a fixed size.

The maximum number of incoming or outgoing messages for a superstep is denoted by $h$. The ability of a communication network to deliver data is captured by a parameter $g$, defined such that it takes time $hg$ for a processor to deliver $h$ messages of size 1.

A message of length $m$ obviously takes longer to send than a message of size 1. However, the BSP model does not make a distinction between a message of length $m$ and $m$ messages of length 1. In either case, the cost is said to be $mg$.

The parameter $g$ depends on the following:

  • The protocols used to interact within the communication network.
  • Buffer management by both the processors and the communication network.
  • The routing strategy used in the network.
  • The BSP runtime system.

In practice, $g$ is determined empirically for each parallel computer. Note that $g$ is not the normalized single-word delivery time but the single-word delivery time under continuous traffic conditions.
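
As a concrete illustration (the parameter values here are made up, not measured): suppose a machine has $g = 4$ time units per word under continuous traffic. A superstep in which the most heavily loaded processor sends or receives $h = 250$ words then has communication cost

$$hg = 250 \times 4 = 1000 \text{ time units},$$

whether those words travel as 250 unit-size messages or as a single message of length 250.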

Barriers

The one-sided communication of the BSP model requires barrier synchronization. Barriers are potentially costly but avoid the possibility of deadlock or livelock, since barriers cannot create circular data dependencies; tools to detect and deal with such dependencies are therefore unnecessary. Barriers also permit novel forms of fault tolerance.

The cost of barrier synchronization is influenced by two factors:

  • The cost imposed by the variation in the completion time of the participating concurrent computations. Take the example where all but one of the processes have completed their work for this superstep, and are waiting for the last process, which still has a lot of work to complete. The best that an implementation can do is ensure that each process works on roughly the same problem size.
  • The cost of reaching a globally consistent state in all of the processors. This depends on the communication network but also on whether there is special-purpose hardware available for synchronizing and on the way in which interrupts are handled by processors.

The cost of a barrier synchronization is denoted by $l$. Note that $l < L$ if the synchronization mechanism of the BSP computer is as suggested by Valiant.[1] In practice, the value of $l$ is determined empirically.

On large computers, barriers are expensive, and increasingly so at large scales. There is a large body of literature on removing synchronization points from existing algorithms, in the context of BSP computing and beyond. For example, many algorithms allow the global end of a superstep to be detected locally, simply by comparing local information to the number of messages already received. This drives the cost of global synchronization, relative to the minimally required communication latency, to zero.[8] Yet this minimal latency, too, is expected to increase for future supercomputer architectures and network interconnects; the BSP model, along with other models of parallel computation, requires adaptation to cope with this trend. Multi-BSP is one BSP-based solution.[5]

Algorithmic cost

The cost of a superstep is determined as the sum of three terms:

  • The cost of the longest-running local computation
  • The cost of global communication between the processors
  • The cost of the barrier synchronization at the end of the superstep

Thus, the cost of one superstep for $p$ processors is

$$\max_{i=1}^{p} w_i + \left(\max_{i=1}^{p} h_i\right) g + l$$

where $w_i$ is the cost of the local computation in process $i$, and $h_i$ is the number of messages sent or received by process $i$. Note that homogeneous processors are assumed here. It is more common for the expression to be written as $w + hg + l$, where $w$ and $h$ are maxima. The cost of an entire BSP algorithm is the sum of the costs of its supersteps:

$$W + Hg + Sl = \sum_{s=1}^{S} w_s + g \sum_{s=1}^{S} h_s + Sl$$

where $S$ is the number of supersteps.

$W$, $H$, and $S$ are usually modeled as functions that vary with problem size. These three characteristics of a BSP algorithm are usually described in terms of asymptotic notation, e.g., $H \in O(n/p)$.
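
As a worked example (illustrative; not drawn from this article's references), consider summing $n$ numbers on $p$ processors, each initially holding $n/p$ of them. In superstep 1, every processor adds its local values ($w_1 = n/p$) and PUTs its partial sum to processor 0, which receives $h_1 = p - 1$ words. In superstep 2, processor 0 adds the $p$ partial sums ($w_2 = p$, $h_2 = 0$). The total cost over $S = 2$ supersteps is

$$\left(\frac{n}{p} + p\right) + (p-1)g + 2l,$$

i.e., $W = n/p + p$ and $H = p - 1 \in O(p)$.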

Extensions and uses

Interest in BSP has soared, with Google adopting it as a major technology for graph analytics at massive scale via Pregel and MapReduce. Also, with the next generation of Hadoop decoupling the MapReduce model from the rest of the Hadoop infrastructure, there are now active open-source projects to add explicit BSP programming, as well as other high-performance parallel programming models, on top of Hadoop. Examples are Apache Hama and Apache Giraph.[9]

BSP has been extended by many authors to address concerns about BSP's unsuitability for modelling specific architectures or computational paradigms. One example of this is the decomposable BSP model. The model has also been used in the creation of a number of new programming languages and interfaces, such as Bulk Synchronous Parallel ML (BSML), BSPlib, Apache Hama,[9] and Pregel.[10]

Notable implementations of the BSPlib standard are the Paderborn University BSP library[11] and the Oxford BSP Toolset by Jonathan Hill.[12] Modern implementations include BSPonMPI[13] (which simulates BSP on top of the Message Passing Interface), and MulticoreBSP[14][15] (a novel implementation targeting modern shared-memory architectures). MulticoreBSP for C is especially notable for its capability of starting nested BSP runs, thus allowing for explicit Multi-BSP programming.

See also

  • Automatic mutual exclusion
  • Apache Hama
  • Apache Giraph
  • Computer cluster
  • Concurrent computing
  • Concurrency (computer science)
  • Dataflow programming
  • Grid computing
  • LogP machine
  • Parallel computing
  • Parallel programming model

References

  1. ^ a b c Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990 [1]
  2. ^ W F McColl. Scalable Computing. Computer Science Today: Recent Trends and Developments. J van Leeuwen (editor). LNCS Volume 1000, Springer-Verlag pp.46-61 (1995) [2]
  3. ^ W F McColl and A Tiskin. Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3) pp.287-297 (1999) [3]
  4. ^ J M D Hill, W F McColl, D C Stefanescu, M W Goudreau, K Lang, S B Rao, T Suel, T Tsantilas and R H Bisseling. BSPlib: The BSP Programming Library. Parallel Computing 24 (14) pp. 1947-1980 (1998) [4]
  5. ^ a b Valiant, L. G. (2011). A bridging model for multi-core computing. Journal of Computer and System Sciences, 77(1), 154-166 [5]
  6. ^ A Bridging Model for High Performance Cloud Computing by Bill McColl in 18th SIAM Conference on Parallel Processing for Scientific Computing (2018), http://meetings.siam.org/sess/dsp_talk.cfm?p=88973, archived 2019-12-11 at the Wayback Machine.
  7. ^ Bill McColl. Mathematics, Models and Architectures. Chapter 1, pp. 6-53. Mathematics for Future Computing and Communications, edited by Liao Heng and Bill McColl. Cambridge University Press (2022). [6]
  8. ^ Alpert, R., & Philbin, J. (1997). cBSP: Zero-cost synchronization in a modified BSP model. NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 [7].
  9. ^ a b Apache Hama
  10. ^ Pregel
  11. ^ The Paderborn University BSP (PUB) Library - Design, Implementation and Performance. Heinz Nixdorf Institute, Department of Computer Science, University of Paderborn, Germany, technical report, archived 2001-06-05 at the Wayback Machine.
  12. ^ Jonathan Hill: The Oxford BSP Toolset, 1998.
  13. ^ Wijnand J. Suijlen: BSPonMPI, 2006.
  14. ^ MulticoreBSP for C: a high-performance library for shared-memory parallel programming by A. N. Yzelman, R. H. Bisseling, D. Roose, and K. Meerbergen in International Journal of Parallel Programming, in press (2013), doi:10.1109/TPDS.2013.31.
  15. ^ An Object-Oriented Bulk Synchronous Parallel Library for Multicore Programming by A. N. Yzelman & Rob H. Bisseling in Concurrency and Computation: Practice and Experience 24(5), pp. 533-553 (2012), doi:10.1002/cpe.1843.

External links

  • D.B. Skillicorn, Jonathan Hill, W. F. McColl, Questions and answers about BSP (1996)
  • BSP Worldwide
  • BSP related papers
  • Bulk Synchronous Parallel ML (official website, in French; English version available)
  • Apache Hama
  • Apache Giraph
  • BSPonMPI
  • MulticoreBSP
