At the core patterns level, patterns to build a graph of ''ff_node''s are defined. Since the graph of ''ff_node''s is a streaming network, any FastFlow graph is built using two streaming patterns (''farm'' and ''pipeline'') and one pattern modifier (''loopback'', to build cyclic networks). These patterns can be arbitrarily nested to build large and complex graphs. However, not all graphs can be built: this enforces the correctness (by construction) of all streaming networks that can be generated. In particular, they are deadlock-free and data-race free.
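
For illustration, here is a minimal sketch (assuming the classic C++ API of the 2014-era FastFlow releases; the ''Source''/''Worker''/''Sink'' stage names are hypothetical) that nests a ''farm'' inside a ''pipeline''; the commented-out ''wrap_around()'' call is the ''loopback'' modifier that would make the network cyclic:

<code cpp>
#include <vector>
#include <ff/pipeline.hpp>
#include <ff/farm.hpp>
using namespace ff;

struct Source: ff_node {                 // first pipeline stage: generates the stream
    void* svc(void*) {
        for (long i = 1; i <= 1000; ++i) ff_send_out((void*)i);
        return (void*)FF_EOS;            // end of stream
    }
};
struct Worker: ff_node {                 // farm worker: a stateless filter
    void* svc(void* t) { return t; }     // forward the (processed) task
};
struct Sink: ff_node {                   // last pipeline stage: consumes results
    void* svc(void* t) { /* consume t */ return GO_ON; }
};

int main() {
    ff_farm<> farm;                      // streaming pattern: farm
    std::vector<ff_node*> w;
    for (int i = 0; i < 4; ++i) w.push_back(new Worker);
    farm.add_workers(w);
    farm.add_collector(NULL);            // default collector feeds the next stage

    ff_pipeline pipe;                    // streaming pattern: pipeline
    pipe.add_stage(new Source);
    pipe.add_stage(&farm);               // patterns nest arbitrarily
    pipe.add_stage(new Sink);
    // pipe.wrap_around();               // loopback modifier: would make the graph cyclic
    return pipe.run_and_wait_end();
}
</code>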

=== Nonblocking and Blocking behaviour ===

Blocking synchronisations fit coarse-grained parallelism well (tasks of milliseconds or more), whereas nonblocking synchronisations fit fine-grained parallelism. Blocking synchronisations make it possible to exploit over-provisioning (e.g. for load balancing) and to reduce energy consumption; however, they exhibit large overheads (also due to OS involvement). Mixing blocking and nonblocking synchronisations is not trivial.

The FastFlow run-time is designed to exhibit nonblocking behaviour, with the possibility to switch to blocking behaviour. Overall, a FastFlow run is a sequence of nonblocking running phases. Between two phases the run-time can switch to a blocking phase by way of an original, data-flow distributed protocol. In FastFlow terminology, a pattern (or a composition of patterns) can //freeze// (i.e. suspend), to be later resumed in the next nonblocking phase. This model makes it possible to address fine-grained workloads, bursts of fine-grained workloads, and coarse-grained workloads. During nonblocking phases, the FastFlow run-time employs only lock-free and wait-free algorithms in all synchronisation critical paths (whereas it uses pthread locks in the blocking phases).
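
A minimal sketch of the freeze/resume mechanism, assuming the classic API in which ''run_then_freeze()'' (re)starts a frozen pattern and ''wait_freezing()'' waits for it to freeze; the burst-generating ''Emitter'' is hypothetical:

<code cpp>
#include <vector>
#include <ff/farm.hpp>
using namespace ff;

struct Emitter: ff_node {                    // re-executed at each nonblocking phase
    void* svc(void*) {
        for (long i = 1; i <= 100; ++i) ff_send_out((void*)i);  // one burst
        return (void*)FF_EOS;                // terminates this phase
    }
};
struct Worker: ff_node {
    void* svc(void* t) { /* fine-grained work on t */ return GO_ON; }
};

int main() {
    ff_farm<> farm;
    farm.add_emitter(new Emitter);
    std::vector<ff_node*> w;
    for (int i = 0; i < 4; ++i) w.push_back(new Worker);
    farm.add_workers(w);

    for (int phase = 0; phase < 10; ++phase) {
        farm.run_then_freeze();   // (re)start a nonblocking running phase
        farm.wait_freezing();     // pattern freezes: threads suspend at the OS level
        // between phases the cores are free for other activities
    }
    return 0;
}
</code>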

=== Deadlock avoidance ===

The implementation of a pattern in terms of a streaming network (i.e. a network of threads or processes) can be cyclic (e.g. master-worker, D&C, etc.). For this reason FastFlow uses its own unbounded SPSC buffer to avoid deadlocks due to dependency cycles [ADK12].
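
The unbounded queue of [ADK12] is exposed as ''uSWSR_Ptr_Buffer'' (header ''ff/ubuffer.hpp''); a minimal usage sketch, under the assumption of the 2014-era API:

<code cpp>
#include <ff/ubuffer.hpp>
using namespace ff;

int main() {
    // Unbounded single-producer/single-consumer queue: a push never fails
    // for lack of space, so a cycle of such queues cannot deadlock on a
    // full buffer (the usual cause of deadlock in cyclic networks).
    uSWSR_Ptr_Buffer q(1024);     // 1024: size of each internal circular buffer
    if (!q.init()) return 1;      // allocate the first internal buffer

    long x = 42;
    q.push(&x);                   // producer side: always succeeds

    void* p = NULL;
    while (!q.pop(&p)) ;          // consumer side: pop returns false while empty
    return (*(long*)p == 42) ? 0 : 1;
}
</code>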
  
=== Accelerator mode ===
The offloading feature is concerned with abstracting over auxiliary hardware or software accelerators (e.g. GPGPUs or sets of cores). Offloading allows the model to be heterogeneous over the hardware: code to be executed on a CPU that makes use of this layer may be partially offloaded onto accelerators.
  
At the core patterns level, any pattern composition can work in //accelerator mode//. A software accelerator can be fed with asynchronous offloading requests from the main thread (or any other thread), which basically generate an input stream for the accelerator. The primary aim of offloading is to provide the high-level pattern programmer (or application programmer) with an easy and semi-automatic path to introducing parallelism into C/C++ sequential code by moving or copying parts of the original code into the body of C++ methods, which will be executed in parallel according to a FastFlow pattern (or pattern composition). Results computed by an accelerator can be collected either synchronously or asynchronously.

A FastFlow accelerator provides the programmer with one (untyped) streaming input channel and one (untyped) streaming output channel that can be dynamically //created// (and //destroyed//) from C++ code (either sequential or multi-threaded) as a C++ object. Thanks to the underlying shared-memory architecture, messages flowing into these channels may carry both values and pointers to data structures.

An accelerator, which is a collection of threads, has a global lifecycle with two stable states, //running// and //frozen//, plus several transient states. The running state happens when all threads are logically able to run (i.e. they are ready or running at the OS level). The frozen state happens when all threads are suspended (at the OS level). Transitions between these two states involve calls to the underlying threading library (and to the OS).

Once created, an accelerator can be run, making it capable of accepting tasks on the input channel. When running, the threads belonging to an accelerator might fall into an //active waiting// state. These state transitions exhibit a very low overhead and do not involve the OS. Threads not belonging to the accelerator can //wait// for it, i.e. suspend until the accelerator completes its input tasks (receives the //End-of-Stream//, which is propagated to all threads during the transient states of the lifecycle), and then put it in the frozen state. At creation time, the accelerator is configured and its threads are bound to one or more cores. Since the FastFlow run-time is implemented via non-blocking threads, they will, if not frozen, fully load the cores on which they are placed, no matter whether they are actually processing something or not. Because of this, the accelerator is usually configured to use "spare" cores (although over-provisioning can be forced). If necessary, output tasks can be popped from the accelerator output channel.
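
The following sketch shows this lifecycle for a farm used as a software accelerator (assuming the classic accelerator API: the ''true'' constructor flag, ''offload()'', ''load_result()'', ''wait_freezing()''; the ''Worker'' is hypothetical):

<code cpp>
#include <vector>
#include <ff/farm.hpp>
using namespace ff;

struct Worker: ff_node {
    void* svc(void* t) {
        long* v = (long*)t;
        *v *= 2;                          // offloaded computation
        return v;                         // push the result to the output channel
    }
};

int main() {
    ff_farm<> farm(true);                 // true: accelerator mode (input channel)
    std::vector<ff_node*> w;
    for (int i = 0; i < 4; ++i) w.push_back(new Worker);
    farm.add_workers(w);
    farm.add_collector(NULL);             // default collector: output channel endpoint

    farm.run_then_freeze();               // run: start accepting tasks
    for (long i = 0; i < 100; ++i)
        farm.offload(new long(i));        // asynchronous offloading requests
    farm.offload((void*)FF_EOS);          // close the input stream

    void* res = NULL;
    while (farm.load_result(&res))        // synchronous result collection
        delete (long*)res;
    farm.wait_freezing();                 // accelerator is now frozen (can be thawed)
    return 0;
}
</code>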
More details on the FastFlow accelerator technology can be found in [ADK11].
  
  
  
=== Distributed platforms ===
  
Documentation in progress.
  
==== Hardware ====
the x86 model). On other models (e.g., Itanium and Power4, 5, and 6), a store fence before an enqueue is needed [GMV08].
== GPGPUs ==

GPGPUs are supported by way of OpenCL and/or CUDA. At the current development status, kernel business code should be written either in OpenCL or CUDA. FastFlow takes care of H2D/D2H (asynchronous) data transfers and synchronisations. The ''stencil-reduce'' pattern makes it possible to write most typical GPGPU kernels as if they were C/C++ code, since intra-block and inter-block synchronisations (including the reduce code) are transparently provided by the pattern. Still, the programmer can use OpenCL/CUDA directives in the kernel.
== Distributed ==

Distributed platforms built on top of the TCP/IP and Infiniband/OFED protocols are also supported.
FPGA support is planned but not yet fully developed.
  
  
===== Rationale =====
  
Parallelism exploitation patterns (a.k.a. {{http://en.wikipedia.org/wiki/Algorithmic_skeleton|skeletons}}) are usually categorised in three main classes: **Task**, **Data**, and **Stream** Parallelism.
  
  * **Stream Parallelism** can be used when there exists a partial or total order in a computation. By processing data elements in order, local state may be maintained in each filter.
  * **Data Parallelism** is a method for parallelizing a single task by processing independent data elements of this task in parallel. The flexibility of the technique relies upon stateless processing routines, implying that the data elements must be fully independent. Data Parallelism also supports Loop-level Parallelism, where successive iterations of a loop working on independent or read-only data are parallelized in different flows-of-control and concurrently executed.
  
FastFlow is designed to support all of them on top of a stream parallel programming model (provided by the core patterns tier). The set of patterns provided by FastFlow can be further extended by building new C++ templates.
  
While many of the programming frameworks for multi-core offer Data and Task Parallel skeletons, only a few of them offer Stream Parallel skeletons (such as TBB's //pipeline//). None of them offers the //farm// skeleton, which exploits functional replication of a set of //workers// and abstracts out the parallel filtering of successive //independent// items of the stream under the control of a scheduler, as a first-class concept.
  
  
[AB+09] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. [[http://doi.acm.org/10.1145/1562764.1562783|DOI]]
  
[ADK11] M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. Accelerating code on multi-cores with FastFlow. In Proc. of 17th Intl. Euro-Par 2011 Parallel Processing, volume 6853 of LNCS, pages 170–181, Bordeaux, France, Aug. 2011. Springer.
  
[ADK12] M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. An efficient unbounded lock-free queue for multi-core systems. In Proc. of 18th Intl. Euro-Par 2012 Parallel Processing, volume 7484 of LNCS, pages 662–673, Rhodes Island, Greece, Aug. 2012. Springer.