====== Frequently Asked Questions ======
===== Which platforms/OSes/compilers are supported? =====

  * Linux (i386, x86_64, Arm, PPC) with gcc supporting C++11 (version > 4.6). Other C++11-enabled compilers (e.g. Intel ICC) typically work.
  * MacOS X (> 10.4, i386, x86_64, PPC) with a C++ compiler supporting C++11 (e.g. clang 5.1, gcc).
  * Usage of GPUs (NVidia, AMD) requires either CUDA or OpenCL.
  * Microsoft Windows (Windows 7 64-bit, x86_64) with Visual Studio Express 2013. Other Windows versions and Visual Studio compilers might work (minor fixes might be required). Windows code is not fully optimised for performance.
  * Other platforms/OSes/compilers might work but are not extensively tested (e.g. iOS). The FastFlow core is a header-only library and is likely to work on any platform with a good C++ compiler. C++11 is required to use all FastFlow features; the core patterns do not require C++11. The main development platform is Linux/x86_64/gcc.
  * Dependencies on third-party libraries:
    * Shared memory: pthreads (native threads on Windows)
    * Distributed: zeromq/TCP and/or IB/OFED
    * GPU: CUDA or OpenCL

<note important>Work in progress</note>

===== Programming effort =====
==== FastFlow vs OpenMP and Intel TBB (and CnC) ====
=== FastFlow vs Intel CnC ===
To appear; we are working on it.
  
===== Accelerators and offloading =====
==== What is a FastFlow accelerator? ====
The FastFlow accelerator is an extension of the FastFlow framework aimed at simplifying the porting of existing sequential code to multicore. A FastFlow accelerator is a software device defined as a composition of FastFlow patterns (e.g. ''pipe(S1,S2)'', ''farm(S)'', ''pipe(S1,farm(S2))'', ...) that can be started independently from the main flow of control; one or more accelerators can be (dynamically) started in one application. Each accelerator exhibits a well-defined parallel semantics that depends on its particular pattern composition. Tasks can be asynchronously offloaded (so-called self-offloaded) onto an accelerator. Results from accelerators can be returned to the caller thread in either a blocking or non-blocking fashion. FastFlow accelerators enable programmers to 1) create a stream of tasks from a loop or a recursive call; 2) parallelize kernels of code while changing the original code only in a very local way (for example, a part of a loop body). A FastFlow accelerator typically works in a non-blocking fashion on a subset of the cores of the CPUs, but can be transiently suspended to release hardware resources in order to efficiently manage non-contiguous bursts of tasks.
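As an illustration, here is a minimal sketch of the self-offloading pattern, assuming the classic accelerator interface of ''ff_farm'' (a constructor flag enabling accelerator mode, plus ''offload()'' and ''load_result()''); the ''Worker'' and ''task_t'' names are purely illustrative:

<code cpp>
#include <vector>
#include <ff/farm.hpp>
using namespace ff;

struct task_t { long value; };

// Each worker runs the sequential kernel on every offloaded task.
struct Worker: ff_node {
    void* svc(void* t) {
        task_t* task = static_cast<task_t*>(t);
        task->value *= 2;              // the kernel to parallelize
        return task;                   // result goes back to the caller
    }
};

int main() {
    ff_farm<> farm(true);              // true = accelerator mode
    std::vector<ff_node*> workers;
    for (int i = 0; i < 4; ++i) workers.push_back(new Worker);
    farm.add_workers(workers);
    farm.add_collector(NULL);          // default collector gathers results
    farm.run();                        // start the accelerator

    for (long i = 0; i < 100; ++i) {   // create a stream of tasks from a loop
        task_t* t = new task_t;
        t->value = i;
        farm.offload(t);               // asynchronous (self-)offload
    }
    farm.offload((void*)FF_EOS);       // end-of-stream

    void* r = NULL;
    while (farm.load_result(&r))       // blocking retrieval of results
        delete static_cast<task_t*>(r);
    farm.wait();                       // wait for accelerator termination
    return 0;
}
</code>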
==== How does Fastflow compare to OpenCL or CUDA? ====
Fastflow cannot be directly compared to OpenCL or CUDA; rather, it complements them (currently, at least). OpenCL and CUDA are designed to execute SIMD code on a hardware accelerator; they are at the same level of abstraction as the Fastflow low-level programming layer. Fastflow, on the contrary, has been designed mainly to efficiently execute parallel code on the cores of the CPU (of an SMP platform, currently).
==== How does Fastflow deal with hardware accelerators (GPUs, etc.)? ====
Fastflow (at the high-level programming layer) provides the programmer with an effective way of exploiting parallel program orchestration (efficient synchronizations, scheduling, etc.) by means of parallel programming orchestration templates (i.e. skeletons). Skeletons are higher-order entities (i.e. object factories) that should be filled with your own C++/C sequential code. This code can include your own accelerator code (OpenCL, CUDA, SSE, etc.), but Fastflow currently neither provides any special assistance in exploiting that code nor does it help the programmer in the parallelization of that code. An example of the joint use of Fastflow+SSE3 can be found in the Smith-Waterman application (see the Fastflow tarball).
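For instance, a two-stage ''pipe'' skeleton filled with plain sequential C++ code might look like the following minimal sketch (the stage names are illustrative):

<code cpp>
#include <iostream>
#include <ff/pipeline.hpp>
using namespace ff;

// First stage: generates the stream.
struct Source: ff_node {
    void* svc(void*) {
        for (long i = 1; i <= 10; ++i)
            ff_send_out(new long(i));
        return NULL;                   // NULL terminates the stream
    }
};

// Second stage: your own sequential code goes here.
struct Sink: ff_node {
    void* svc(void* t) {
        long* n = static_cast<long*>(t);
        std::cout << (*n) * (*n) << std::endl;
        delete n;
        return GO_ON;                  // keep accepting tasks
    }
};

int main() {
    ff_pipeline pipe;                  // pipe(Source, Sink)
    pipe.add_stage(new Source);
    pipe.add_stage(new Sink);
    return pipe.run_and_wait_end();
}
</code>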
==== Can Fastflow ease the use of hardware accelerators (GPUs, etc.)? ====
Yes, theoretically. In designing Fastflow we envisage it also as a means of easing the high-level programming of hardware accelerators, which is currently almost a nightmare. To do that, we need to extend the generation strategy from the high-level to the low-level layer in the Fastflow stack by considering an extended low-level layer that includes accelerator instructions (or an accelerator access API). Extending that generation strategy while maintaining high performance is not trivial. The //FastFlow accelerator// is a step ahead in this direction.
Using the //self-offloading// technique, a FastFlow program can be almost automatically transformed into a programmable software accelerator, i.e. a device running on idle CPU cores that can be used as if it were a programmable hardware accelerator. The function to be offloaded onto the FastFlow accelerator can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of this approach. More details can be found in {{http://compass2.di.unipi.it/TR/Files/TR-10-03.pdf.gz|TR-10-03}}.
===== FastFlow queues =====
Building a streaming network on top of SPSC queues means potentially using many queues, even if they are automatically generated and assembled in MPMC queues. Details about the implementation of FastFlow's SPSC queues can be found {{http://calvados.di.unipi.it/dokuwiki/lib/tpl/torquati/paper_files/TR-10-20.pdf|here}}.
==== How big are the SPSC queues? ====
An empty SPSC queue on a 64-bit platform has a size of 144 bytes. Queues are meant to store memory pointers, so a queue of size ''k'' requires ''144+64*k'' bytes. Typically, an SPSC queue is just a few KBytes.
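A rough usage sketch, assuming the ''SWSR_Ptr_Buffer'' class from ''ff/buffer.hpp'' and the size formula above:

<code cpp>
#include <cstdio>
#include <ff/buffer.hpp>
using namespace ff;

int main() {
    const unsigned long k = 512;    // capacity, in stored pointers
    SWSR_Ptr_Buffer q(k);
    q.init();                       // allocate the internal buffer

    long x = 42;
    q.push(&x);                     // queues store memory pointers
    void* p = NULL;
    q.pop(&p);

    // Footprint according to the formula above: 144 + 64*k bytes.
    std::printf("approx. footprint: %lu bytes\n", 144 + 64 * k);
    return 0;
}
</code>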
==== How much memory may be consumed on a many-core system? ====
Not very much, since threads are typically not connected by a //complete// graph but according to the skeleton synchronization schema, which is typically not complete. As an example, the ''pipeline'' skeleton with ''n'' stages requires ''n-1'' queues, the ''farm'' skeleton requires ''2+2*n_workers'' queues, and Divide&Conquer (i.e. a farm with feedback channels) requires ''2*n_workers'' queues. Typically the consumed size grows linearly with the number of threads.
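A back-of-the-envelope sketch combining the queue counts above with the per-queue size from the previous answer (illustrative numbers only):

<code cpp>
#include <cstdio>

int main() {
    const unsigned long k         = 512;                // slots per queue
    const unsigned long q_bytes   = 144 + 64 * k;       // one bound SPSC queue
    const unsigned long n_workers = 16;
    const unsigned long n_queues  = 2 + 2 * n_workers;  // farm skeleton
    std::printf("farm(%lu workers): %lu queues, ~%lu KBytes\n",
                n_workers, n_queues, n_queues * q_bytes / 1024);
    return 0;
}
</code>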
==== How are MPMC queues realized? ====
Multiple-Producer-Multiple-Consumer (MPMC) queues are realized using one SPSC queue per producer and one SPSC queue per consumer. These queues are put together using an arbiter thread in a fully lock-free and fence-free fashion (no CAS at all). SPSC queues are enriched with additional methods aimed at improving cache locality and throughput, such as multi-push. In addition, FastFlow provides several variants of classic lock-free queues (using CAS operations), such as the Michael&Scott queue, which leverage the deferred reclamation and memory alignment provided by the FastFlow allocators.
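The arbiter idea can be sketched conceptually as follows. This is not FastFlow's actual implementation (FastFlow's SPSC queues are lock-free, while a trivial mutex-based stand-in is used here for brevity); it only illustrates how per-producer and per-consumer SPSC queues are glued together:

<code cpp>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in SPSC queue; FastFlow's real one is lock-free and fence-free.
struct SPSC {
    std::deque<void*> d;
    std::mutex m;
    void push(void* p) { std::lock_guard<std::mutex> g(m); d.push_back(p); }
    bool pop(void** p) {
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return false;
        *p = d.front(); d.pop_front(); return true;
    }
};

int main() {
    const int P = 2, C = 2, N = 10;
    std::vector<SPSC> in(P), out(C);    // one queue per producer/consumer

    // Arbiter: drains the producer queues and deals items to the consumer
    // queues round-robin, so producers and consumers never contend.
    std::thread arbiter([&] {
        int moved = 0, next = 0;
        while (moved < P * N)
            for (int i = 0; i < P; ++i) {
                void* p;
                if (in[i].pop(&p)) {
                    out[next].push(p);
                    next = (next + 1) % C;
                    ++moved;
                }
            }
    });

    std::vector<std::thread> prods, cons;
    for (int i = 0; i < P; ++i)
        prods.emplace_back([&, i] {
            for (int j = 0; j < N; ++j) in[i].push(new int(i * N + j));
        });
    for (int i = 0; i < C; ++i)
        cons.emplace_back([&, i] {
            for (int got = 0; got < (P * N) / C; ) {
                void* p;
                if (out[i].pop(&p)) { delete static_cast<int*>(p); ++got; }
            }
        });

    for (auto& t : prods) t.join();
    for (auto& t : cons) t.join();
    arbiter.join();
    std::printf("all %d items moved through the arbiter\n", P * N);
    return 0;
}
</code>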
==== Is this approach scalable? ====
In general, the scalability of the approach depends on the quality of the mapping from the skeleton implementation onto the underlying memory connectivity. Skeletons requiring higher connectivity (i.e. more synchronizations) may require a higher connectivity degree at the hardware memory data-path level. Note, however, that this is true for any concurrent programming model. The big advance of the skeleton approach consists precisely in the possibility of exploiting different implementation templates for the same skeleton in order to match the peculiarities of different memory sub-systems. This enhances portability and performance portability, since the code does not have to be re-designed for different multi-core platforms.
-==== Is FastFlow ​supporting ​unbound/​dynamic queues? ==== +==== Does FastFlow ​support ​unbound/​dynamic queues? ==== 
There exists an unbound version of the SPSC FastFlow queue. This kind of queue can dynamically and automatically grow and shrink to match actual size requirements. As with the other queues, the unbound queue (so-called //uSPSC//) is lock-free and fence-free and exhibits almost the same performance as the other queues. The uBuffer implementation is available within the FastFlow tarball; the correctness proof is described {{http://calvados.di.unipi.it/dokuwiki/lib/tpl/torquati/paper_files/TR-10-20.pdf|here}}.
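A minimal usage sketch, assuming the ''uSWSR_Ptr_Buffer'' class from ''ff/ubuffer.hpp'':

<code cpp>
#include <ff/ubuffer.hpp>
using namespace ff;

int main() {
    uSWSR_Ptr_Buffer q(1024);       // 1024 = size of each internal chunk
    q.init();

    // push never fails for lack of space: the buffer grows on demand...
    for (long i = 0; i < 100000; ++i)
        q.push((void*)(i + 1));     // non-NULL pointers only

    // ...and shrinks back as elements are consumed.
    void* p = NULL;
    while (q.pop(&p)) ;
    return 0;
}
</code>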
==== Why use both bound and unbound queues? ====
Bound and unbound queues target different problems. Bound queues can be used to exploit a limited degree of asynchrony among threads, and so are useful for enforcing temporal synchronizations. Unbound queues enforce data dependencies only (the asynchrony degree is unbound); they are very useful in deadlock avoidance strategies for cyclic streaming networks, but do not induce temporal synchronicity among threads. A good system should find a fair trade-off between the two kinds of queue and properly define the size of the bound queues. As an example, a queue with length 1 can be used to model a temporal synchronization device, since the producer can check when the consumer has received the data.
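A minimal sketch of the length-1 idea, assuming ''SWSR_Ptr_Buffer'' and its ''empty()'' method:

<code cpp>
#include <thread>
#include <ff/buffer.hpp>
using namespace ff;

int main() {
    SWSR_Ptr_Buffer q(1);           // capacity 1
    q.init();

    std::thread consumer([&] {
        void* p = NULL;
        while (!q.pop(&p)) ;        // take the single datum
    });

    long data = 7;
    q.push(&data);                  // hand the datum over
    while (!q.empty()) ;            // spin until the datum is taken:
                                    // the producer now knows it was received
    consumer.join();
    return 0;
}
</code>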
==== Do FastFlow queues represent a novel research result? ====
Bound SPSC queues are inspired by the //P1C1// queues of Higham and Kavalsh (1997), although the implementation differs in many important details. FastFlow MPMC queues are, to the best of our knowledge, an original use of SPSC queues. The FastFlow unbound SPSC queue idea and design is, to the best of our knowledge, fully novel. Unbound queues can be combined exactly like other SPSC queues to compose MPSC unbound queues (and this is again a novel result).