Table of Contents

Frequently Asked Questions

Which platforms/OSes/compilers are supported?

Work in progress

Programming effort

FastFlow vs OpenMP and Intel TBB (and CnC)

Of course, the argument for programmability will only be fully “proven” via a large study in which programmers with equal starting knowledge of differing technologies develop a range of applications in a range of technologies and compare experiences. Such a study is difficult to arrange and execute. For the moment the argument for scalability can only be based on a subjective assessment of the abstraction levels of the differing technologies and limited empirical experience. For the latter we can report that the entire YaDT-FF parallelisation (more details about YaDT-FF can be found in ECMLPKDD 2010 paper.) required just a few days of work, which the data mining experts report is significantly less than the time needed to parallelise the same application with OpenMP or TBB; and that this is due mainly to the fact that FastFlow provides a native way of implementing Divide&Conquer that can be used to structure the YaDT-FF application.

FastFlow vs Intel TBB

From a programming viewpoint FastFlow is similar to TBB (well, it is also correct to add that TBB is similar to many other programming frameworks that we, and the skeleton research community in general, have been developing for two decades, see architecture page). They both are C++ template libraries and advocate a pattern-based (a.k.a. skeleton) approach to parallel programming. There are two main differences between FastFlow and TBB:

  1. Runtime. TBB is not fence-free; also using lock-free components such as TBB concurrent containers, the TBB programmer cannot avoid the use of atomic instructions (that affect cache performance). In our experience, FastFlow is typically faster than TBB (see PDP 2010 paper).
  2. FastFlow primarily targets streaming, whereas TBB streaming support is limited to simple pipeline. FastFlow natively supports pipeline, farm, Divide&Conquer, their nesting and composition. Clearly, these structures can be also realized with TBB, although there their definition is completely up to the programmer.

FastFlow vs OpenMP

FastFlow and OpenMP are not very similar. OpenMP, different from FastFlow which is a library, is a pragma-based compiled language extension (à la HPF). In OpenMP, if you are expert enough (or lucky enough) to find the correct pragmas and the correct places to put them in the code, you usually succeed in parallelising code in a very compact way. The time and effort required to find and tune these pragmas is not guaranteed to be small; performance tuning can be particularly annoying since the programmer has little or no control over decisions made by the compiler (unless using quite complex pragma combinations). OpenMP is usually quite good in the parallelisation of coarse grain loops. In version 3 parts of a loop can also be parallelized via task extension, which makes possible offloading on an “unstructured” accelerator; the performance achievable in this way is usually suboptimal. FastFlow follows another approach providing the programmer with a set of patterns with a standard semantics and the possibility to modify it via OO mechanisms (overloading methods). OpenMP does not currently effectively support streaming and Divide&Conqueur paradigms (even if several proposals in this regards exist).

FastFlow vs Intel CnC

To appear, we are working on it.

Accelerators and offloading

What is a FastFlow accelerator?

The FastFlow accelerator is an extension of the FastFlow framework aimed at simplifying the porting of existing sequential code to multicore. A FastFlow accelerator is a software device defined as a composition of FastFlow patterns (e.g. pipe(S1,S2), farm(S), pipe(S1,farm(S2)), …) that can be started independently from the main flow of control; one or more accelerators can be (dynamically) started in one application. Each accelerator exhibits a well-defined parallel semantics that depend on its particular pattern composition. Tasks can be asynchronously offloaded (so-called self-offloaded) onto an accelerator. Results from accelerators can be returned to the caller thread either in a blocking or non-blocking fashion. FastFlow accelerators enable programmers to 1) create a stream of tasks from a loop or a recursive call; 2) parallelize kernels of code changing the original code only in very local way (for example, a part of a loop body). A FastFlow accelerator typically works in non-blocking fashion on a subset of cores of the CPUs, but can be transiently suspended to release hardware resources to efficiently manage non-contiguous bursts of tasks.

How does Fastflow compare to OpenCL or CUDA?

Fastflow cannot be directly compared to OpenCL or CUDA, it rather complements them (currently, at least). OpenCL and CUDA are designed to execute a SIMD code onto a hardware accelerator: they are at the same level of abstraction as the Fastflow low-level programming layer. On the contrary, Fastflow has been designed mainly to efficiently execute parallel code onto the cores of the CPU (of a SMP platform, currently).

How does Fastflow deal with hardware accelerators (GPUs, etc.)?

Fastflow (at the high-level programming layer) provides the programmer with an effective means of exploiting parallel program orchestration (efficient synchronizations, scheduling, etc.) by means of parallel programming orchestration templates (i.e. skeletons). Skeletons are higher-order entities (i.e. object factories) that should be filled with your own C++/C sequential code. This code can include your own accelerator code (OpenCL, CUDA, SSE, etc.), but Fastflow currently neither provides any special assist in exploiting that code nor does it help the programmer in the parallelization of that code. An example of the joint use of Fastflow+SSE3 can be found in the Smith-Waterman application (see Fastflow tarball).

Can Fastflow ease the use of hardware accelerators (GPUs, etc.)?

Yes, theoretically. In designing Fastflow we envisage it also as a means of easing the high-level programming of hardware accelerators, which is currently almost a nightmare. To do that, we need to extend the generation strategy from the high-level to the low-level layer in the Fastflow stack by considering an extended low-level layer that includes accelerator instruction (or accelerator access API). Extending that generation strategy while maintaining high-performance is not trivial. FastFlow accelerator is a step ahead in this direction. Using the self-offloading technique, a FastFlow program can be almost automatically transformed into a programmable software accelerator, i.e. a device running on idle CPU cores that can be used as if it were a programmable hardware accelerator. The function to be offloaded on the FastFlow accelerator can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach. More details can be found in TR-10-03.

FastFlow queues

Building a streaming network on top of SPSC queues means potentially using many queues, even if they are automatically generated and assembled in MPMC queues. Details about the implementation of the FastFlow's SPSC queues can be found here.

How big are the SPSC queues?

An empty SPSC queue on a 64bit platform has a size of 144 bytes. Queues are considered to store memory pointers, and so a queue of size k requires 144+64*k bytes. Typically, a SPSC queue is just few KBytes.

How much memory may be consumed on a many-core system?

Not very much since threads are typically not connected by a complete graph, but according to the skeleton synchronization schema that is not typically complete. As an example, the pipeline skeleton with n stages requires n-1 queues, the farm skeleton requires 2+2*n_workers queues, Divide&Conquer (i.e. farm with feedback channels) requires 2*n_workers queues. Typically the consumed size linearly grows with the number of threads.

How are MPMC queues realized?

Multiple-Producer-Multiple-Consumer (MPMC) queues are realized using one SPSC queue per producer and one SPSC queue per consumer. These queues are put together using an arbiter thread in a fully lock-free and fence-free fashion (no CAS at all). SPSC queues are enriched with additional methods aimed at improving cache locality and throughput, such as multi-push. In addition, FastFlow provides several variants of classic lock-free queues (using CAS operations) such as the Michael&Scott queue, which leverage on deferred reclamation and memory alignment provided by FastFlow allocators.

Is this approach scalable?

In general, the scalability of the approach depends on the quality of the mapping from skeleton implementation onto underlying memory connectivity. Skeletons requiring higher connectivity (i.e. more synchronizations) may require a higher connectivity degree at the hardware memory data-path level. Note, however, that this is true for any concurrent programming model. The big advance of the skeleton approach indeed consists in the possibility to exploit different implementation templates for the same skeleton in order to match the peculiarity of different memory sub-systems. This enhances portability and performance portability since the code does not have to be re-designed for different multi-core platforms.

Does FastFlow support unbound/dynamic queues?

There exists an unbound version of the SPSC FastFlow queue. This kind of queue can dynamically and automatically grow and shrink to match actual size requirements. As with other queues, the unbound queue (so-called uSPSC) is lock-free and fence-free and exhibits almost the same performance of other queues. uBuffer implementation is available within the FastFlow tarball: the correctness proof is described here.

Why use both bound and unbound queues?

Bound and unbound queues target different problems. Bound queues can be used to exploit a limited degree of asynchrony among threads, and so are useful for enforcing temporal synchronizations. Unbounded queues enforce data-dependency only (asynchrony degree is unbound), they are very useful in deadlock avoidance strategies of cyclic streaming networks, but do not induce temporal synchronicity among threads. A good system should find a fair trade-off between the two kinds of queue as well as properly defining the size of bound queues. As an example, a queue with length 1 can be used to model a temporal synchronization device since the producer can check when the consumer has received the data.

Do FastFlow queues represent a novel research result?

Bound SPSC queues are inspired by P1C1 queues by Higham and Kavalsh (1997), although the implementation differs in many important details. FastFlow MPMC queues are, to the best of our knowledge, an original use of SPSC queues. The FastFlow unbound SPSC queue idea and design, to the best of our knowledge, is fully novel. Unbound queues can be combined exactly as other SPSC queues to compose MPSC unbound queues (and this is again a novel result).