## FastFlow architecture

FastFlow is conceptually designed as a stack of layers that progressively abstracts shared-memory parallelism, from the level of cores up to the definition of useful programming constructs and patterns. The abstraction process has two main goals:

1. to promote high-level parallel programming, and in particular skeletal programming (i.e. pattern-based explicit parallel programming);
2. to promote efficient programming of applications for multi-core.

These two goals have been perceived as a dichotomy for many years by many computer practitioners. We, as members of the skeletal and high-level parallel programming community, never saw it this way. Indeed, FastFlow is the latest in a series of pattern-based high-level parallel programming frameworks we have designed over the last fifteen years. Along this path we were happy to see several renowned players join us: for example Google (with MapReduce), Intel (with Threading Building Blocks), and, very recently, the Berkeley Par Lab (with pattern-based programming [AB+09]).

To achieve both of the above-mentioned goals, we should:

• Identify good parametric patterns, clearly characterized in a specific usage context, that can be combined with well-defined (functional and extra-functional) behavior [skeleton set design].
• Design an implementation methodology that supports several implementation alternatives exhibiting predictable performance on a given class of platforms, which can be statically or dynamically selected and combined [factory and optimization].
• Design a set of operations supporting concurrency and parallelism on a given class of platforms that is both minimal (kernel features) and extremely efficient. At this level, the cost of operations is pure overhead that cannot be avoided; minimizing it significantly widens the design options at higher levels [synchronization mechanisms].

As we shall see, FastFlow addresses these issues by way of a stack of layers, which are described in the next sections.

Most of these issues are addressed within the low-level and high-level programming layers, while the Problem Solving Environment layer mainly focuses on enhancing programmability and ease of use.

Observe that this stack of layers is purely conceptual. As we shall see, many of these layers are statically compiled and optimized in a cross-layer fashion into the binary code by way of generative programming techniques, such as inline C++ templates.

### Hardware

FastFlow is specifically designed for cache-coherent multiprocessors, and in particular commodity homogeneous multi-cores (e.g. Intel Core, AMD K10, etc.). It supports multiprocessors with any memory consistency model, including very weak ones. The FastFlow implementation is always lock-free, and on several memory consistency models it is also memory-fence-free (e.g., sequential consistency, total store ordering, and the x86 model). On other models (e.g., Itanium and Power 4, 5, and 6), a store fence before an enqueue is needed [GMV08].

At its current development status, FastFlow does not include any specific optimization targeting cc-NUMA platforms, although such support is planned and under design. Also, it does not currently generate code for GPUs (or other accelerators) automatically, even if it admits and supports linking code that runs on hardware accelerators. Full support for heterogeneous platforms is currently under evaluation.

### Run-time support

Taking inspiration from FastForward queues [GMV08] and Lamport's wait-free protocols [Lam83], the second tier provides mechanisms to define simple streaming networks whose run-time support is implemented through correct and efficient lock-free Single-Producer-Single-Consumer (SPSC) queues (more details about FastFlow's SPSC queues can be found in TR-10-20).

The FastFlow run-time support layer realizes the two basic features:

1. parallelism exploitation, i.e. the creation, destruction and life cycle control of different flows of control sharing the memory, and
2. asynchronous one-to-one communication channels, supporting the synchronization of different flows of control. They are implemented by way of lock-free Single-Producer-Single-Consumer (SPSC) queues equipped with non-blocking push and pop operations.

While the former can be addressed using fairly standard technology (i.e. wrapping existing threading libraries, such as POSIX threads), the latter exhibits a number of performance pitfalls on commodity shared-memory cache-coherent multiprocessors (as many commodity multi-cores are). In particular, traditional lock-free implementations of SPSC queues (such as Lamport's solution [Lam83]) are correct under sequential consistency only, which none of the current multi-cores implement. Also, some correct queue implementations induce a very high cache-invalidation rate, and thus reduced performance, because producer and consumer alternately invalidate shared locations (e.g. the head and tail of a circular buffer).

The FastFlow SPSC queue implementation (shown in the figure) is largely inspired by FastForward queues [GMV08]. As with FastForward queues, the push operation (issued by the producer) always reads and writes pwrite (i.e. the tail pointer) only, and the pop operation (issued by the consumer) always reads and writes pread (i.e. the head pointer) only. This substantially differs from the traditional approach (e.g. in Lamport's queues), where both the producer and the consumer access both the head and tail pointers, causing continuous invalidation of the cache lines that hold them. The implementation does not suffer from the ABA problem [MS98], and it remains correct also when only a reference, instead of the full message, is communicated through the queue.
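The single-index-per-side idea can be sketched in a few lines of C++. This is a conceptual sketch in the FastForward style, not FastFlow's actual source: an empty slot (nullptr) marks a free cell, so the producer touches only `pwrite` and the consumer only `pread`, and neither side ever reads the other's index.

```cpp
#include <atomic>
#include <cstddef>

// Bounded lock-free SPSC queue, FastForward-style sketch.
// nullptr in a slot means "empty"; messages must be non-null pointers.
template <std::size_t N>
class spsc_queue {
    std::atomic<void*> buf[N];
    std::size_t pwrite = 0;   // accessed by the producer only
    std::size_t pread  = 0;   // accessed by the consumer only
public:
    spsc_queue() {
        for (auto& s : buf) s.store(nullptr, std::memory_order_relaxed);
    }

    // Producer side: succeeds unless the target slot is still occupied.
    bool push(void* msg) {
        if (buf[pwrite].load(std::memory_order_acquire) != nullptr)
            return false;                       // queue full
        buf[pwrite].store(msg, std::memory_order_release);
        pwrite = (pwrite + 1) % N;
        return true;
    }

    // Consumer side: succeeds unless the current slot is empty.
    bool pop(void** msg) {
        void* m = buf[pread].load(std::memory_order_acquire);
        if (m == nullptr) return false;         // queue empty
        *msg = m;
        buf[pread].store(nullptr, std::memory_order_release);
        pread = (pread + 1) % N;
        return true;
    }
};
```

Note that `pwrite` and `pread` are plain (non-atomic) fields precisely because each is owned by exactly one thread; the only shared state is the buffer itself.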

FastFlow SPSC queues can be used directly to write parallel programs. At this level, the programmer writes a C++ program that spawns a set of threads and orchestrates them in pairs, each pair sharing (at least) an SPSC queue descriptor. Each thread in a pair has a fixed role in using the queue, either producer or consumer. The created threads can then start and eventually synchronize using SPSC queues, for example in a pipeline fashion. Other orchestrations are also possible, although their correct exploitation may be non-trivial.

Notice that at this level it is already possible to express arbitrary streaming networks (i.e. networks of queues), provided that all threads in the network are pairwise connected by SPSC queues. One-to-many multicast and many-to-one gather are not explicitly supported. In addition, at this level FastFlow does not make any decision about thread scheduling or the mapping of threads onto cores; the programmer must be fully aware of all programming aspects and their potential performance drawbacks, such as load balancing, memory alignment, and hot-spots. This is very low-level programming: let us call it concurrent assembler.

### Low-level programming

One small, but significant, abstraction step is made in the low-level programming layer, which provides one-to-many, many-to-one, and many-to-many synchronizations and data flows. In the FastFlow approach these forms of communication are supported by SPMC (Single-Producer-Multiple-Consumer), MPSC (Multiple-Producer-Single-Consumer), and MPMC (Multiple-Producer-Multiple-Consumer) queues, respectively. They can be used directly as general asymmetric asynchronous channels among threads. Clearly, messages flowing through these channels may carry memory pointers (which also behave as synchronization tokens), since we are exploiting the underlying hardware cache-coherent shared memory. Abstractly, these queues realize a general message-passing API on top of a hardware shared-memory layer.

SPMC, MPSC, and MPMC queues can be realized in several different ways, for example using locks, or in a lock-free fashion to avoid lock overhead (which is non-negligible on multi-core architectures). However, such queues cannot be implemented in a lock-free fashion without using at least one atomic operation, which is typically needed to enforce the correct serialization of updates from the many producers or many consumers at the same end of the queue.

These operations, however, induce a memory fence, and thus a cache invalidation/update, which can seriously impair the performance of parallel programs that synchronize frequently (e.g. with fine-grained parallelism). Notice that building a lock also requires an atomic operation, unless one works under sequential consistency, for which algorithms that do not require atomic operations exist, e.g. Lamport's Bakery algorithm [Lam74].
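The point about the unavoidable atomic operation can be made concrete. The snippet below is an illustration, not FastFlow code: with multiple producers, even a lock-free enqueue needs at least one atomic read-modify-write to serialize concurrent claims on the shared tail index. On x86 this `fetch_add` compiles to a `LOCK XADD`, which acts as a full fence and forces a cache-line transfer between producers.

```cpp
#include <atomic>
#include <cstddef>

// Shared tail of a hypothetical multi-producer queue.
struct mp_tail {
    std::atomic<std::size_t> tail{0};   // shared among all producers

    // Claim a unique write slot: this atomic RMW is the unavoidable
    // serialization point among concurrent producers.
    std::size_t claim_slot() {
        return tail.fetch_add(1, std::memory_order_acq_rel);
    }
};
```

It is exactly this per-operation fence (and the associated cache-coherence traffic) that FastFlow's arbiter-based approach, described next, avoids.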

With FastFlow we advocate a different approach to the implementation of these queues, one that requires neither locks nor atomic operations. SPMC, MPSC, and MPMC queues are realized using only SPSC queues and an arbiter thread, which enforces the correct serialization of producers and consumers. As shown in the architectural figure, this arbiter thread is called Emitter (E) when it dispatches data from one channel to many channels, Collector (C) when it gathers data from many channels and pushes the messages into one channel, and Collector-Emitter (CE) when it behaves as both (a.k.a. the Master-Workers pattern).
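A minimal sketch of the Emitter idea follows (hypothetical names, not FastFlow source): the Emitter owns the consumer end of one channel and the producer end of one channel per worker, so every queue still has exactly one producer and one consumer, and SPMC behaviour is obtained without locks or atomic read-modify-write operations. For brevity the sketch uses a plain `std::deque` as a stand-in for the lock-free SPSC channel and runs single-threaded.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Stand-in for an SPSC channel; in FastFlow this would be the
// lock-free SPSC queue of the run-time support layer.
using channel = std::deque<void*>;

// One scheduling step of the Emitter: pop one item from the input
// channel and dispatch it round-robin to a worker's private channel.
// Returns false when the input channel is empty.
inline bool emitter_step(channel& in, std::vector<channel>& workers,
                         std::size_t& rr) {
    if (in.empty()) return false;
    workers[rr].push_back(in.front());
    in.pop_front();
    rr = (rr + 1) % workers.size();   // next worker in round-robin order
    return true;
}
```

Round-robin is only one possible policy; an Emitter may equally implement on-demand or user-defined scheduling, since the dispatch decision is ordinary sequential code inside the arbiter thread.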

Notice that, at this level, FastFlow does not make any decision about thread scheduling or the mapping of threads onto cores; the programmer must be fully aware of all programming aspects and their potential performance drawbacks, such as load balancing, memory alignment, and hot-spots.

#### FastFlow approach: pros and cons

FastFlow relies on lock-free, wait-free, non-blocking synchronizations. The approach has both pros and cons. The main advantage is performance: avoiding memory fences dramatically reduces cache-coherence overhead.

### High-level programming

The next layer up, i.e. high-level programming, provides a programming framework based on parallelism exploitation patterns (a.k.a. skeletons). These are usually categorized in three main classes: Task, Data, and Stream Parallelism. FastFlow specifically focuses on Stream Parallelism, and in particular provides: farm, farm-with-feedback (i.e. Divide&Conquer), pipeline, and their arbitrary nesting and composition. The set of skeletons provided by FastFlow can be further extended by building new C++ templates on top of the FastFlow low-level programming layer.

• Stream Parallelism can be used when there exists a partial or total order in a computation. By processing data elements in order, local state may be maintained in each filter.
• Task Parallelism is explicit in the algorithm and consists of running the same or different code on different executors (cores, processors, machines, etc.). Different flows-of-control (threads, processes, etc.) may communicate with one another as they work. Communication usually takes place to pass data from one thread to the next as part of the same data-flow graph.
• Data Parallelism is a method for parallelizing a single task by processing independent data elements of this task in parallel. The flexibility of the technique relies upon stateless processing routines implying that the data elements must be fully independent. Data Parallelism also supports Loop-level Parallelism where successive iterations of a loop working on independent or read-only data are parallelized in different flows-of-control and concurrently executed.

While many programming frameworks for multi-core offer Data and Task Parallel skeletons, only a few offer Stream Parallel skeletons (such as TBB's pipeline). None of them offers the farm skeleton as a first-class concept, which exploits the functional replication of a set of workers and abstracts out the parallel filtering of successive independent stream items under the control of a scheduler.
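To fix intuition about what the pipeline skeleton expresses, here is a deliberately sequential reading of its semantics in plain C++ (a hypothetical helper, not FastFlow's API): a stream of items flows through an ordered sequence of filters. FastFlow runs each stage in its own thread connected by SPSC queues; this sketch captures only the functional behaviour.

```cpp
#include <functional>
#include <vector>

// A pipeline stage is a filter mapping one stream item to the next.
using stage = std::function<int(int)>;

// Sequential semantics of a pipeline skeleton: every item of the input
// stream passes through all stages, in order. The parallel version
// overlaps the stages on successive items.
std::vector<int> run_pipeline(const std::vector<stage>& stages,
                              std::vector<int> stream) {
    for (auto& item : stream)
        for (const auto& s : stages)
            item = s(item);
    return stream;
}
```

A farm would instead replicate one such filter across workers and let a scheduler (the Emitter) distribute independent items among the replicas.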

## Problem Solving Environments (PSEs)

### FastFlow accelerator

A FastFlow accelerator is a software device that wraps a high-level FastFlow program, i.e. a skeleton or a composition of skeletons, and provides the application programmer with a functional self-offloading feature: the offload happens on the same hardware device, i.e. the CPU cores. The primary aim of self-offloading is to give the programmer an easy and semi-automatic path to introducing parallelism into sequential C/C++ code, by moving or copying parts of the original code into the body of C++ methods, which will be executed in parallel according to a FastFlow skeleton (or skeleton composition). This requires limited programming effort and may speed up the original code by exploiting unused cores.

A FastFlow accelerator provides the programmer with one (untyped) streaming input channel and one (untyped) streaming output channel that can be dynamically created (and destroyed) from a C++ code (either sequential or multi-threaded) as a C++ object. Thanks to the underlying shared memory architecture, messages flowing into these channels may carry both values and pointers to data structures.
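The offload/collect usage pattern can be sketched as follows. This is a hedged, hypothetical miniature (the class name and methods are illustrative, not the FastFlow accelerator API, and it uses a locked queue where FastFlow would use its lock-free channels): a worker thread sits behind an input and an output channel, so the main thread can offload tasks, keep working, and collect results later.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative stand-in for an accelerator: one input channel of tasks,
// one output channel of results, serviced by a dedicated worker thread.
class mini_accelerator {
    std::queue<std::function<int()>> in;   // input channel (tasks)
    std::queue<int> out;                   // output channel (results)
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
    std::thread worker;
public:
    mini_accelerator() : worker([this] {
        for (;;) {
            std::function<int()> task;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return done || !in.empty(); });
                if (in.empty()) return;    // shut down once drained
                task = std::move(in.front());
                in.pop();
            }
            int r = task();                // run the offloaded work
            std::lock_guard<std::mutex> lk(m);
            out.push(r);
            cv.notify_all();
        }
    }) {}

    // Push a task into the input channel; returns immediately.
    void offload(std::function<int()> task) {
        std::lock_guard<std::mutex> lk(m);
        in.push(std::move(task));
        cv.notify_all();
    }

    // Pop one result from the output channel, blocking until available.
    int get_result() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !out.empty(); });
        int r = out.front();
        out.pop();
        return r;
    }

    ~mini_accelerator() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
        worker.join();
    }
};
```

In the real accelerator the "worker" side is an entire skeleton composition rather than a single thread, and the channels carry untyped pointers, but the caller-side protocol (create, offload, collect, destroy) is the same.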

An accelerator, which is a collection of threads, has a global lifecycle with two stable states, *running* and *frozen*, plus several transient states. The running state occurs when all threads are logically able to run (i.e. they are ready or running at the OS level). The frozen state occurs when all threads are suspended (at the OS level). Transitions between these two states involve calls to the underlying threading library (and to the OS).

More details on how to use a FastFlow accelerator can be found in TR-10-03, and in several examples within the [FastFlow tarball available from SourceForge](http://sourceforge.net/projects/mc-fastflow/).

### FastFlow dynamic memory allocator

To appear soon.

## References

[AB+09] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Commun. ACM, 52(10):56-67, Oct. 2009.

[GMV08] J. Giacomoni, T. Moseley, and M. Vachharajani. Fastforward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In Proc. of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming (PPoPP), pages 43-52, New York, NY, USA, 2008. ACM.

[AMT09] M. Aldinucci, M. Meneghin, and M. Torquati. Efficient Smith-Waterman on multi-core with FastFlow. In Proc. of the 18th Euromicro Intl. Conference on Parallel, Distributed and Network-based Processing (PDP), Pisa, Italy, Feb. 2010. IEEE. To appear.

[Lam83] L. Lamport. Specifying concurrent program modules. ACM Trans. Program. Lang. Syst., 5(2):190–222, 1983.

[MS98] M. M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed Computing, 51(1):1–26, 1998.

[Lam74] L. Lamport. A new solution of Dijkstra's concurrent programming problem. Commun. ACM, 17(8):453-455, 1974.
