FastFlow architecture is organised in three main tiers:

  - **High-level patterns** They are clearly characterised in a specific usage context and target the parallelisation of sequential (legacy) code. Examples are the exploitation of loop parallelism, stream parallelism, data-parallel algorithms, execution of general workflows of tasks, etc. They are typically equipped with self-optimisation capabilities (e.g. load balancing, grain auto-tuning, parallelism-degree auto-tuning) and exhibit limited nesting capability. Examples are: ''parallel_for'', ''pipeline'', ''stencil-reduce'', ''mdf'' (macro-data-flow). Some of them target specific devices (e.g. GPGPUs). They are implemented on top of **core patterns**.
  - **Core patterns** They provide a general //data-centric// parallel programming model with its run-time support, which is designed to be minimal and to reduce to a minimum the typical sources of overhead in parallel programming. At this level there are two patterns (''farm'' and ''pipeline'') and one pattern-modifier (''feedback''). They make it possible to build very general (deadlock-free) cyclic process networks. They are not graphs of tasks but graphs of parallel executors (processes/threads); tasks or data items flow across them. Overall, the programming model can be envisioned as a shared-memory streaming model, i.e. a shared-memory model equipped with message-passing synchronisations. They are implemented on top of **building blocks**.
  - **Building blocks** This tier provides the basic blocks to build (and generate via C++ header-only templates) the run-time support of core patterns. Typical objects at this level are queues (e.g. wait-free, fence-free SPSC queues, bounded and unbounded), process and thread containers (as C++ classes), and mediator threads/processes (extensible and configurable schedulers and gatherers). The shared-memory run-time support extensively uses nonblocking lock-free (and fence-free) algorithms, the distributed run-time support employs zero-copy messaging, and the GPGPU support exploits asynchrony and SIMT-optimised algorithms.

</code>|
  
In this specific case, the only syntactic difference between OpenMP and FastFlow is that FastFlow provides programmers with C++ templates instead of compiler pragmas. It is worth noting that, despite the similar syntax, the implementation of ''parallel_for'' and of all the other high-level patterns in FastFlow is quite different from OpenMP and other mainstream programming frameworks (Intel TBB, etc.). Instead of relying on a general task-execution engine, FastFlow generates at compile time a specific streaming network based on core patterns for each pattern. In the case of ''parallel_for'' this network is a parametric master-worker with an active or passive (in-memory) task scheduler (more details in the [[http://calvados.di.unipi.it/storage/paper_files/2014_ff_looppar_pdp.pdf|PDP2014 paper]]).
  
As in OpenMP, ''parallel_for'' comes in many variants (see the [[ffnamespace:refman|reference manual]]). Other patterns at this level, to date, are: ''parallel_reduce'', ''mdf'' (macro-data-flow), ''pool evolution'' (genetic algorithms), ''stencil''. They cover the most common parallel programming paradigms in data, stream and task parallelism. Notably, FastFlow patterns are C++ class templates and can be extended by end users according to the Object-Oriented methodology.
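To make the semantics concrete, here is a minimal sketch of a block-static parallel-for over a range, written in plain standard C++ with ''std::thread''. It is only an illustration of what the pattern computes; FastFlow's actual ''parallel_for'' is a class template backed by the master-worker streaming network described above, and all names here (''parallel_for_sketch'', the chunking policy) are ours, not FastFlow's API.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Illustrative sketch only: a block-static parallel-for over [first, last)
// using plain std::thread. FastFlow's ParallelFor instead reuses a
// persistent master-worker network built from core patterns.
void parallel_for_sketch(long first, long last, int nworkers,
                         const std::function<void(long)>& body) {
    std::vector<std::thread> workers;
    const long n = last - first;
    const long chunk = (n + nworkers - 1) / nworkers;   // ceiling division
    for (int w = 0; w < nworkers; ++w) {
        const long lo = first + w * chunk;
        const long hi = std::min(last, lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back([=, &body] {
            for (long i = lo; i < hi; ++i) body(i);     // each worker owns one block
        });
    }
    for (auto& t : workers) t.join();                    // implicit barrier
}
```

A real implementation would also auto-tune the grain and reuse the worker threads across calls, which is precisely what the pattern's self-optimisation capabilities refer to.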
  
Iterative execution of kernels onto GPGPUs is addressed by a single but very flexible pattern, i.e. ''stencil-reduce'', which also takes care of feeding GPGPUs with data and of D2H/H2D synchronisations. More details can be found in the [[http://calvados.di.unipi.it/storage/talks/2014_S4585-Marco-Aldinucci.pdf|GTC 2014 talk]].
  
==== Core Patterns ====
At its foundations FastFlow implements a (mid/low-level) concurrent programming model, which extends the C++ language. From the orchestration viewpoint, the process model is a CSP/Actor hybrid model where processes (so-called ''ff_node''s) are named and the data paths between processes are clearly identified. The abstract units of communication and synchronisation are known as ''channels'' and represent a stream of data dependencies between two processes. An ''ff_node'' is a C++ class; after construction it enters an infinite loop that 1) gets a task (i.e. a pointer) from the input channel; 2) executes the business code on the task; 3) puts a task (i.e. a pointer) onto the output channel. Representing communication and synchronisation as a channel ensures that synchronisation is tied to communication and allows layers of abstraction at higher levels to compose parallel programs where synchronisation is implicit.
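The get/compute/put life cycle of a node can be sketched in portable C++ as follows. The ''Channel'' here is a deliberately simple blocking queue and ''nullptr'' stands in for the end-of-stream marker; FastFlow's real channels are lock-free SPSC buffers and it uses a dedicated EOS token, so this is a structural illustration, not FastFlow code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative channel: a blocking queue of raw pointers. FastFlow channels
// are wait-free SPSC buffers; only the node's loop structure is the point here.
struct Channel {
    std::queue<void*> q;
    std::mutex m;
    std::condition_variable cv;
    void put(void* p) {
        { std::lock_guard<std::mutex> lk(m); q.push(p); }
        cv.notify_one();
    }
    void* get() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        void* p = q.front(); q.pop();
        return p;
    }
};

// The node's life cycle: loop until an end-of-stream marker (nullptr here,
// a dedicated EOS token in FastFlow) flows through the channel.
void run_node(Channel& in, Channel& out, void* (*svc)(void*)) {
    for (;;) {
        void* task = in.get();      // 1) get a task from the input channel
        if (task == nullptr) {      //    end-of-stream: forward it and stop
            out.put(nullptr);
            return;
        }
        out.put(svc(task));         // 2) run the business code, 3) put the result
    }
}
```

Because the data item itself is the synchronisation token, the node needs no extra locks around its business code.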
  
At the core patterns level, patterns to build a graph of ''ff_node''s are defined. Since the graph of ''ff_node''s is a streaming network, any FastFlow graph is built using two streaming patterns (''farm'' and ''pipeline'') and one pattern-modifier (''feedback'', to build cyclic networks). These patterns can be arbitrarily nested to build large and complex graphs. However, not all graphs can be built; this enforces the correctness (by construction) of all the streaming networks that can be generated. In particular, they are deadlock-free and data-race free.
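The following sketch shows the //semantics// of a farm, i.e. replicated workers consuming a stream of independent items, using only standard C++ (''std::async''). A real FastFlow farm is a persistent emitter/workers/collector network over lock-free channels, not one task per future; the function name and the throttling policy are our own illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Illustrative farm semantics: stream items are spread across replicated
// workers and results gathered in order. FastFlow farms keep worker threads
// alive and connect them with lock-free channels instead of spawning tasks.
std::vector<int> farm_sketch(const std::vector<int>& stream,
                             int (*worker)(int), unsigned nworkers) {
    std::vector<int> results(stream.size());
    std::vector<std::future<void>> inflight;
    for (std::size_t i = 0; i < stream.size(); ++i) {
        inflight.push_back(std::async(std::launch::async,
            [&, i] { results[i] = worker(stream[i]); }));
        if (inflight.size() == nworkers) {   // cap the parallelism degree
            for (auto& f : inflight) f.get();
            inflight.clear();
        }
    }
    for (auto& f : inflight) f.get();        // gather the remaining results
    return results;
}
```

Nesting, e.g. a ''pipeline'' whose stage is a ''farm'', composes such networks rather than task graphs, which is why correctness by construction can be enforced.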
  
=== Accelerator mode ===

The offloading feature is concerned with abstracting over auxiliary hardware or software accelerators (e.g. GPGPUs or sets of cores). Offloading makes the model heterogeneous over the hardware: code executed on a CPU that makes use of this layer may be partially offloaded onto accelerators.
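The offload idea can be sketched with standard C++ futures: the main flow of control hands a task to an asynchronous executor (a "software accelerator"), keeps computing, and collects the result later. FastFlow's accelerator mode achieves this by running a farm or pipeline as an accelerator with explicit offload/collect calls; the names below (''heavy_kernel'', ''compute_with_offload'') are purely illustrative.

```cpp
#include <cassert>
#include <future>

// Stand-in for the business code that would be offloaded to an accelerator.
long heavy_kernel(long n) {
    long acc = 0;
    for (long i = 1; i <= n; ++i) acc += i;   // sum 1..n
    return acc;
}

// Offload the kernel, overlap host work with it, then synchronise.
long compute_with_offload(long n) {
    std::future<long> pending =
        std::async(std::launch::async, heavy_kernel, n);  // offload the task
    long local = n * 2;               // host keeps working meanwhile
    return local + pending.get();     // collect the offloaded result
}
```

The essential property, as in FastFlow's accelerator mode, is that offloading is asynchronous: synchronisation happens only when the result is actually needed.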
  
=== Nonblocking and Blocking behaviour ===
  
==== Building Blocks ====
  
At this level, the FastFlow programming model can be thought of as a hybrid shared-memory/message-passing model. A process (''ff_node'') is sequential; a channel models a true data dependency between processes. Processes typically stream data items (they are not tasks) onto channels; the items can be either references (e.g. pointers in shared memory) or messages with a payload (e.g. on a distributed platform). In both cases, the data item acts as a synchronisation token. In general, no further synchronisation primitives (e.g. locks, semaphores) are needed, even though their usage is not forbidden (they are simply useless and a source of additional overhead). Overall, at this level, FastFlow building blocks make it possible to realise arbitrary streaming networks over lock-less channels.
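As a concrete taste of this tier, here is a minimal bounded single-producer/single-consumer ring buffer of pointers in standard C++ atomics. It only illustrates the idea; FastFlow's actual SPSC queues additionally avoid memory fences where the architecture allows it and come in bounded and unbounded flavours, and the class name here is ours.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Minimal bounded SPSC ring buffer of pointers, in the spirit of FastFlow's
// wait-free SPSC queues (illustrative sketch, not FastFlow's implementation).
// One slot is kept empty to distinguish "full" from "empty".
template <std::size_t N>
class SpscQueue {
    void* buf[N];
    std::atomic<std::size_t> head{0};   // advanced by the consumer only
    std::atomic<std::size_t> tail{0};   // advanced by the producer only
public:
    bool push(void* p) {                // producer side
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire)) return false; // full
        buf[t] = p;
        tail.store(next, std::memory_order_release);  // publish the item
        return true;
    }
    bool pop(void*& p) {                // consumer side
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;    // empty
        p = buf[h];
        head.store((h + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }
};
```

Because each index is written by exactly one side, no locks are needed: the pointer flowing through the queue is itself the synchronisation token, exactly the property the building-blocks tier relies on.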
  
ffnamespace/architecture.txt · Last modified: 2014/09/12 19:06 by aldinuc