This is an old revision of the document!
Of course, the argument for programmability will only be fully “proven” via a large study in which programmers with equal starting knowledge of differing technologies develop a range of applications in a range of technologies and compare experiences. Such a study is difficult to arrange and execute. For the moment the argument for scalability can only be based on a subjective assessment of the abstraction levels of the differing technologies and limited empirical experience. For the latter we can report that the entire YaDT-FF parallelisation (more details about YaDT-FF can be found in ECMLPKDD 2010 paper.) required just a few days of work, which the data mining experts report is significantly less than the time needed to parallelise the same application with OpenMP or TBB; and that this is due mainly to the fact that FastFlow provides a native way of implementing Divide&Conquer that can be used to structure the YaDT-FF application.
From a programming viewpoint FastFlow is similar to TBB (well, it is also correct to add that TBB is similar to many other programming frameworks that we, and the skeleton research community in general, have been developing for two decades, see architecture page). They both are C++ template libraries and advocate a pattern-based (a.k.a. skeleton) approach to parallel programming. There are two main differences between FastFlow and TBB:
FastFlow and OpenMP are not very similar. OpenMP, different from FastFlow which is a library, is a pragma-based compiled language extension (à la HPF). In OpenMP, if you are expert enough (or lucky enough) to find the correct pragmas and the correct places to put them in the code, you usually succeed in parallelising code in a very compact way. The time and effort required to find and tune these pragmas is not guaranteed to be small; performance tuning can be particularly annoying since the programmer has little or no control over decisions made by the compiler (unless using quite complex pragma combinations). OpenMP is usually quite good in the parallelisation of coarse grain loops. In version 3 parts of a loop can also be parallelized via task extension, which makes possible offloading on an “unstructured” accelerator; the performance achievable in this way is usually suboptimal. FastFlow follows another approach providing the programmer with a set of patterns with a standard semantics and the possibility to modify it via OO mechanisms (overloading methods). OpenMP does not currently effectively support streaming and Divide&Conqueur paradigms (even if several proposals in this regards exist).
To appear, we are working on it.
Linux (i386, x86_64) and MacOS X (> 10.4, i386-x86_64, PPC) are directly supported. The support for Windows (32 and 64 bit) is available as beta in FastFlow 1.0.9 at revision 31 of Souceforge svn; it will be released with FastFlow 1.1. cc-NUMA platforms are supported (although optimization of the runtime for these platform is currently ongoing).
The FastFlow accelerator is an extension of the FastFlow framework aiming at simplifying the porting of existing sequential code to multicore. A FastFlow accelerator is software device defined as a composition of FastFlow patterns (e.g.
pipe(S1,farm(S2)), …) that can be started independently from the main flow of control; one or more accelerators can be (dynamically) started in one application. Each accelerator exhibits a well-defined parallel semantics that depend from its particular patter composition. Tasks can be asynchronously offloaded (so-called self-offloaded) onto an accelerator. Results from accelerators can be returned to the caller thread either in a blocking or non-blocking fashion. FastFlow accelerators enable programmers to 1) create a stream of tasks from a loop or a recursive call; 2) parallelize kernels of code changing the original code in very local way (as an example a part of a loop body). A FastFlow accelerator typically work in non-blocking fashion on a subset of cores of the CPUs, but can be transiently suspended to release hardware resources to efficiently manage non-contiguous bursts of tasks.
Fastflow cannot be directly compared to OpenCL or CUDA, it rather complements them (currently, at least). OpenCL and CUDA are designed to execute a SIMD code onto a hardware accelerator, they are at the same level of abstraction of Fastflow low-level programming layer. On the contrary, Fastflow has been mainly designed to efficiently execute parallel code onto the cores of the CPU (of a SMP platform, currently).
Fastflow (at high-level programming layer) provides the programmer with an effective way of exploiting parallel program orchestration (efficient synchronizations, scheduling, etc.) by means of parallel programming orchestration templates (i.e. skeletons). Skeletons are higher-order entities (i.e. object factories) that should be filled with your own C++/C sequential code. This code can include your own accelerator code (OpenCL, CUDA, SSE, etc.), but Fastflow currently neither provides any special assist in exploiting that code nor help the programmer in the parallelization of that code. An example of the joint usage of Fastflow+SSE3 can be found in the Smith-Waterman application (see Fastflow tarball).
Yes, theoretically yes. Designing Fastflow we envision it also as a mean to ease and experiment the high-level programming of hardware accelerators, that is currently nearly a nightmare. To do that, we need to extend the generation strategy from high-level to low-level layer in the Fastflow stack considering an extended low-level layer that includes accelerator instruction (or accelerator access API). Extending that generation strategy while maintaining high-performance is not trivial. FastFlow accelerator is a step ahead in this direction. Using the self-offloading technique, FastFlow program can be almost automatically transformed into a programmable software accelerator, i.e. a device running on idle CPU cores that can be used as it were a programmable hardware accelerator. The function to be offloaded on the FastFlow accelerator can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach. More details can be found in TR-10-03.
Building a streaming network on top of SPSC queues means potentially using many queues, even if they are automatically generated and assembled in MPMC queues. Details about the implementation of the FastFlow's SPSC queues can be found here.
An empty SPSC queue on a 64bit platform has a size of 144 bytes. Queues are thought to store memory pointers, thus a queue of size
144+64*k bytes. Typically, a SPSC queue is just few KBytes.
Not that much since thread are typically not connected by a complete graph, but according to skeleton synchronization schema that is not typically complete. As an example, the
pipeline skeleton with n stages requires
n-1 queues, the
farm skeleton requires
2+2*n_workers queues, Divide&Conquer (i.e. farm with feedback channels) requires
2*n_workers queues. Typically the consumed size linearly grows with number of threads.
Multiple-Producer-Multiple-Consumer (MPMC) queues are realized using one SPSC queue per producer and one SPSC queue per consumer. These queues are put together using an arbiter thread in a fully lock-free and fence-free fashion (no CAS at all). SPSC queues are enriched with additional methods aiming at improving cache locality and throughput, such as multi-push. In addition, FastFlow provides several variants of classic lock-free queues (using CAS operations) such as Michael&Scott queue, which leverage on deferred reclamation and memory alignment provided by FastFlow allocators.
In general, the scalability of the approach depend by the quality of the mapping from skeleton implementation onto underlying memory connectivity. Skeletons requiring higher connectivity (i.e. more synchronizations) may requires a higher connectivity degree at the hardware memory data-path level. Observe however, this is true for any concurrent programming model. The big advance of skeletal approach indeed consists in the possibility to exploit different implementation templates for the same skeleton in order to match the peculiarity of different memory sub-systems. This enhance portability and performance portability since the code should not be re-designed for different multi-core platforms.
There exist an unbound version of SPSC FastFlow queue. This kind of queue can dynamically and automatically grow and shrink to match actual size needs. As other queues, the unbound queue (so-called uSPSC) is lock-free and fence-free and exhibits almost the same performance of other queues. uBuffer implementation is available within FastFlow tarball, the correctness proof is described here.
Bound and unbound queues target different problems. Bound queues can be used to exploit a limited degree of asynchrony among threads, thus are useful to enforce temporal synchronizations. Unbounded queues enforces data-dependency only (asynchrony degree is unbound), they are very useful in deadlock avoidance strategies of cyclic streaming networks, but does not induce temporal synchronicity among threads. A good system should find a fair trade-off between the two kind of queues as well as properly defines the size of bound queues. As an example, a queues with length 1 can be used to model a temporal synchronization device since the producer can check when the consumer has received the data.
Bound SPSC queues are inspired to P1C1 queues by Higham and Kavalsh (1997), despite the implementation differ from many important details. FastFlow MPMC queues are, to the best of our knowledge, an original usage of SPSC queues. FastFlow unbound SPSC queues idea and design, to the best of our knowledge, is fully novel. Unbound queues can be combined exactly as other SPSC queues to compose MPSC unbound queues (and this is again a novel result).