# FastFlow

## Fastflow architecture

Fastflow is conceptually designed as a stack of layers that progressively abstract out the shared memory parallelism at the level of cores up to the definition of useful programming abstractions and parallel patterns. The abstraction process has two main goals:

1. to promote high-level, platform-independent parallel programming, and in particular skeletal programming (i.e. pattern-based explicit parallel programming).
2. to promote efficient programming of applications for homogenous and heterogenous of platforms, including multicore, may-core (e.g. GPGPU, FPGA), and distributed clusters of them.

These two goals have been perceived as a dichotomy for many years by many computer practitioners. We - as people in the skeletal and high-level parallel programming community - never perceived it this way. Indeed, Fastflow is the nth-in-a-row high-level pattern-based high-level parallel programming frameworks we designed in the last fifteen years (detailed in contributors home pages: MarcoA, Massimo, MarcoD). Along this path we was happy to see that high-level parallel programming is becoming a mainstream approach, as demonstrated by large industry involvement in the field: Google (with MapReduce), Intel (with Threading Building Blocks), Microsoft (with TPL). A more comprehensive introduction to high-level parallel programming can be found in one of our recent talk.

To achieve both of the above mentioned goals we should:

• Identify good parametric patterns, which are clearly characterized in a specific usage context and can be possibly combined with well-defined (functional and extra-functional) behavior [High-level patterns].
• Provide a parallel programming model and run-time support that support several implementation alternatives that exhibit predictable performances on a given class of platforms, that can be statically or dynamically selected and combined to build higher level patterns [Core patterns].
• Design a set of mechanisms supporting concurrency and parallelism on a given class of platforms that is both minimal (kernel features) and extremely efficient. At this level, the cost of operations is likely to be overhead that cannot be avoided. Minimizing overheads significantly widens design options at higher-levels [Building blocks].

As we shall see, Fastflow addresses these issues by way of a stack of layers, which are described in the next sections:

Observe that this stack of layers is purely conceptual. As we shall see, many of these stacks are statically compiled and optimised in a cross-layer fashion into the binary code by way of generative programming techniques, such as inline C++ templates.

### Hardware

FastFlow is specifically designed for cache-coherent multiprocessors, and in particular commodity homogenous multi-core (e.g. Intel core, AMD K10, etc.). It supports multiprocessors exploiting any memory consistency, including very weak consistency models. FastFlow implementation is always lock-free, and for several memory consistency models is also memory fence-free (e.g., sequential consistency, total store ordering, and the x86 model). On other models (e.g., Itanium and Power4, 5, and 6), a store fence before an enqueue is needed [GMV08]. At the current development status, GPGPUs are supported by way of OpenCL and/or CUDA. Distributed platforms build on top of TCP/IP and Infiniband/OFED protocols are also supported. FPGA support is planned but not yet implemented.

### Building Blocks

At its foundations FastFlow realises a (low-level) concurrent programming model, which extends C++ language. From the orchestration viewpoint, the process model to be employed in this Work Package is a CSP/Actor hybrid model where processes (so-called ff_nodes) are named and the data paths between processes are clearly identified. The abstract units of communication and synchronisation are known as channels and represent a data dependency between two processes. Representing communication and synchronisation as a channel ensures that synchronisation is tied to communication and allows layers of abstraction at higher levels to compose parallel programs where synchronisation is implicit. The offloading feature is concerned with abstracting over auxiliary hardware (or software) accelerators (e.g. GPGPUs). Accelerators are typically co-processors, they cannot host processes and they require to be coupled with a process running on a CPU. Offloading allows the model to be heterogeneous over the hardware in that code that is to be executed on a CPU which makes use of this layer may be partially offloaded onto accelerators.

At this level, FastFlow programming model can be thought as a hybrid shared-memory/message-passing model. A process (ff_node) is sequential, a channel models a true data dependency between processes. Processes typically stream data items (they are not tasks) onto channels, they can be either references (e.g. pointers in the shared-memory) or messages with a payload (e.g. in a distributed platform). In both cases, the data item acts as synchronisation token. In general, no further synchronisation primitives are needed (e.g. locks, semaphores) even thought their usage is not forbidden (they are simply useless and a source of additional overhead). Overall, at this level, FastFlow building blocks make it possible to realise arbitrary streaming networks over lock-less channels.

In summary, the FastFlow building blocks layer realizes the two basic features:

1. parallelism exploitation, i.e. the creation, destruction and life cycle control of different flows of controls, and
2. asynchronous communication channels, supporting the synchronization of different flows of control.

Multicore support

Implementation-wise, a ff_node is a C++ object that is mapped onto a OS thread (POSIX or OS native threads). Typically ff_nodes have a nonblocking behaviour, i.e. do not suspend on pushing or popping messages from channels. Empty and full channels are managed via busy waiting. If needed, a graph of nodes can be switched from nonblocking to blocking behaviour, and vice-versa (via an embedded distributed protocol). Nonblocking behaviour, coupled with lock-less (actually wait-free) channels enforce a very high throughput and very low latency onto cache-coherent shared-memory multicore. The possibility to switch from blocking to nonblocking behaviour is useful to manage bursts of activity interweaved by periods of inactivity.

Channels are inspired to P1C1 and Fastforward queues [GMV08] and Lamport's wait-free protocols [Lam83], and provides mechanisms to define simple streaming networks whose run-time support is implemented through correct and efficient lock-free Single-Producer-Single-Consumer (SPSC) queues queues equipped with non-blocking push and pop operations (more details about FastFlow's SPSC queues can be found in TR-10-20).

Shared-memory channels exhibits a number of performance pitfalls on commodity shared-memory cache-coherent multiprocessors (as many commodity multi-core are). In particular, traditional lock-free implementations (such as Lamport's solution [Lam83]) of SPSC queues are correct under sequential consistency only, where none of the current multi-cores implement sequential consistency. Also, some correct queue implementations induce a very high invalidation rate - and thus reduced performance - because they exhibit the sharing of locations that are subject to alternative invalidations from communication partners (e.g. head and tail of a circular buffers).

The FastFlow implementation does not suffer from the ABA problem [MS98], and it remains correct also in the case that only a reference instead of the full message is communicated using the queues. The FastFlow SPSC queue implementation (shown in the figure) is largely inspired by Fastforward queues [GMV08]. As with Fastforward queues, the push operation (issued by the producer) always reads and writes pwrite (i.e. tail pointer) only, and the pop (issued by the consumer) always reads and writes pread (i.e. head pointer) only. This approach substantially differs from the traditional one (e.g. in Lamport's queues) where both the producer and the consumer access both the head and tail pointers causing the continuous invalidation of cache lines holding head and tail pointers.

Fastflow SPSC queues can be directly used to write parallel programs by writing a C++ program that spawns a set of ff_nodes and orchestrates them in pairs each of them sharing (at least) a SPSC queue descriptor. Each thread in the pair has a fixed role in using the queue, either producer or consumer. The bulk of created threads can then start and eventually synchronise using SPSP queues, for example in pipeline fashion. Examples of streaming networks that can be build are:

Other orchestrations are also possible, although their correct exploitation may be non-trivial. In particular expressing N-to-1, 1-to-M, and N-to-M streaming networks might be complex and require mutual exclusion, which is typically a source of a non-negligible overhead in multi-core architectures.

One small, but significant, abstraction step consists in providing one-to-many (SPMC), many-to-one (MPSC), and many-to-many (MPMC) channels. SPMC, MPSC, and MPMC queues can be realized in several different ways, for example using locks, or in a lock-free fashion in order to avoid lock overhead. However, these queues could not be directly programmed in a lock-free fashion without using at least one atomic operation, which is typically used to enforce the correct serialization of updates from either many producers or many consumers at the same end of the queue (see an example). These operations, however, induce a memory fence, thus a cache invalidation/update, which can seriously impair the performance of parallel programs exhibiting frequent synchronizations (e.g. for fine-grain parallelism). Notice that building a lock also requires an atomic operation unless working under sequential consistency for which a number of algorithms that do not require atomic operations exist, e.g. Lamport's Bakery algorithm[Lam74].

With FastFlow we advocate a different approach to the implementation of these queues, which require neither locks nor atomic operations. SPMC, MPSC, and MPMC channels are realised by using only SPSC queues and a mediator thread, which enforce the correct serialization of producers and consumers. As shown in architectural figure, this arbiter thread is called Emitter (E) when it is used to dispatch data from one channel to many channels, Collector (C) when it is used to gather data from many channels and push the messages into one channel, and Collector-Emitter (CE) when it behaves both as Collector and Emitter.

It is interesting to observe that

• a collective channel (e.g. SPMC, MPSC) implemented via SPSCs+mediator is typically faster with respect to CAS-based implementation.
• Mediator thread make it possible to easily program scheduling policy for both item distribution and gathering.
• Nonblocking mediator threads couples very well with hyper threading technology because they typically execute lot of instructions that never arrives to execute stage in the processor pipeline.
• Fastflow relies on wait-free, non-blocking synchronizations. The approach has pros and cons. The main advantage consists in performance: avoiding memory fences dramatically reduces cache coherence overhead.

To be detailed

To be detailed

To be detailed

### High-level Patterns

To be updated

The next layer up, i.e. Core Patterns, provides a programming framework based on parallelism exploitation patterns (a.k.a. skeletons). They are usually categorized in three main classes: Task, Data, and Stream Parallelism. FastFlow specifically focuses on Stream Parallelism, and in particular provides: farm, farm-with-feedback (i.e. Divide\&Conquer), pipeline, and their arbitrary nesting and composition. The set of skeletons provided by FastFlow could be further extended by building new C++ templates.

• Stream Parallelism can be used when there exists a partial or total order in a computation. By processing data elements in order, local state may be maintained in each filter. The set of skeletons provided by FastFlow could be further extended by building new C++ templates on top of the Fastflow low-level programming layer.
• Task Parallelism is explicit in the algorithm and consists of running the same or different code on different executors (cores, processors, machines, etc.). Different flows-of-control (threads, processes, etc.) may communicate with one another as they work. Communication usually takes place to pass data from one thread to the next as part of the same data-flow graph.
• Data Parallelism is a method for parallelizing a single task by processing independent data elements of this task in parallel. The flexibility of the technique relies upon stateless processing routines implying that the data elements must be fully independent. Data Parallelism also supports Loop-level Parallelism where successive iterations of a loop working on independent or read-only data are parallelized in different flows-of-control and concurrently executed.

While many of the programming frameworks for multi-core offer Data and Task Parallel skeletons, only few of them offer Stream Parallel skeletons (such as TBB's pipeline). None of them offers the farm skeleton, which exploits functional replication of a set of workers and abstracts out the parallel filtering of successive independent items of the stream under the control of a scheduler, as a first-class concept.

## Problem Solving Environments (PSEs)

### Fastflow accelerator

A FastFlow accelerator is a software device wrapping a high-level FastFlow program, i.e. a skeleton or a composition of skeletons, and providing the application programmer with a functional self-offloading feature, since the offload happens on the same hardware device, i.e. CPU cores. The primary aim of self-offloading is to provide the programmer with an easy and semi-automatic path to introducing parallelism into a C/C++ sequential code by moving or copying parts of the original code into the body of C++ methods, which will be executed in parallel according to a FastFlow skeleton (or skeleton composition). This requires limited programming effort and it may speed up the original code by exploiting unused cores.

A FastFlow accelerator provides the programmer with one (untyped) streaming input channel and one (untyped) streaming output channel that can be dynamically created (and destroyed) from a C++ code (either sequential or multi-threaded) as a C++ object. Thanks to the underlying shared memory architecture, messages flowing into these channels may carry both values and pointers to data structures.

An accelerator, which is a collection of threads, has a global lifecycle with two stable states: \emph{running} and \emph{frozen}, plus several transient states. The running state happens when all threads are logically able to run (i.e. they are ready or running at the O.S. level). The frozen state happens when all threads are suspended (at the O.S. level). Transitions from these two states involve calls to the underlying threading library (and to the O.S.).

More details on HOWTO use a FastFlow accelerator can be found in TR-10-03, and in several examples within the {{http://sourceforge.net/projects/mc-fastflow/|FastFlow tarball available from sourceforge}.

## References

[AB+09] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. DOI

[GMV08] J. Giacomoni, T. Moseley, and M. Vachharajani. Fastforward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In Proc. of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming (PPoPP), pages 43-52, New York, NY, USA, 2008. ACM.

[AMT09] M. Aldinucci, M. Meneghin, and M. Torquati. Efficient Smith-Waterman on multi-core with fastflow. In Proc. of Intl. Euromicro PDP 2010: Parallel Distributed and network-based Processing, Pisa, Italy, Feb. 2010. IEEE. To appear. (Paper Draft)

[Lam83] L. Lamport. Specifying concurrent program modules. ACM Trans. Program. Lang. Syst., 5(2):190–222, 1983.

[MS98] M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed Computing, 51(1):1–26, 1998.

[Lam74] L. Lamport. A new solution of dijkstra’s concurrent programming problem. Commun. ACM, 17(8):453–455, 1974.