Tutorial

Work in progress

Programming model

The FastFlow programming model is a structured parallel programming model. The framework provides several predefined, general-purpose, customizable and composable parallel patterns (or algorithmic skeletons). Any application whose parallel structure can be modelled with the provided parallel patterns, used alone or in composition, can be implemented using FastFlow. Applications whose parallel structure cannot be suitably modelled with the FastFlow patterns, or with proper customizations of them, cannot be implemented using FastFlow.

Concurrent activities and streams

Wrappers

Skeletons

Customizing skeletons

Distributed FastFlow

Hardware accelerators

Related technologies

Sequential Consistency and Lock-Free Synchronization: Lamport's Bakery Algorithm

It is possible to implement mutual exclusion without special atomic machine instructions like CAS or LL/SC.

One of the most famous lock-free algorithms that use only read and write operations to ensure mutual exclusion among N processes is due to Leslie Lamport and is known as the Bakery Algorithm (see the page on Wikipedia).

The algorithm is elegant and apparently very simple. It is based on the policy sometimes used in a bakery: upon entering, a customer gets a number greater than the numbers of all customers waiting for service, and the holder of the lowest number is the next to be served.
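To make the discussion below concrete, here is a minimal sketch of the algorithm (our illustration, distinct from the implementation linked at the end of this section), written with C++11 std::atomic, whose default memory ordering is sequentially consistent, as the algorithm requires:

#include <atomic>
#include <algorithm>

// Sketch of Lamport's Bakery lock for N threads.
template <int N>
class BakeryLock {
    std::atomic<bool> choosing[N];  // choosing[i]: thread i is picking a ticket
    std::atomic<int>  number[N];    // number[i]: thread i's ticket (0 = none)
public:
    BakeryLock() {
        for (int i = 0; i < N; ++i) { choosing[i] = false; number[i] = 0; }
    }
    void lock(int i) {              // i is the calling thread's id, 0 <= i < N
        choosing[i] = true;         // doorway: take a ticket greater than
        int max = 0;                // every ticket seen so far
        for (int j = 0; j < N; ++j) max = std::max(max, number[j].load());
        number[i] = max + 1;
        choosing[i] = false;
        for (int j = 0; j < N; ++j) {
            while (choosing[j]) ;   // wait while j is still picking a ticket
            // wait while j holds a smaller ticket (ties broken by thread id)
            while (number[j] != 0 &&
                   (number[j] < number[i].load() ||
                    (number[j] == number[i].load() && j < i))) ;
        }
    }
    void unlock(int i) { number[i] = 0; }  // exit: discard the ticket
};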

The Bakery Algorithm satisfies FIFO ordering and is wait-free, but since the numbers can grow without bound, its implementation uses registers of unbounded size (this might be a non-issue if a 64-bit counter is used). Gadi Taubenfeld proposed the Black-White Bakery Algorithm (link), which preserves the simplicity and elegance of the original algorithm and bounds the amount of space required by coloring the numbers (using only one additional shared bit).

The Bakery Algorithm works correctly under the Sequential Consistency (SC) memory model, which was formally defined by Lamport as follows:

A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

As reported in "Shared Memory Consistency Models: A Tutorial" by Sarita V. Adve and Kourosh Gharachorloo (link), "There are two aspects to sequential consistency: (1) maintaining program order among operations from individual processors, and (2) maintaining a single sequential order among operations from all processors. The latter aspect makes it appear as if a memory operation executes atomically or instantaneously with respect to other memory operations."

Unfortunately, implementing SC is costly in terms of performance because it prevents widely used compiler optimizations based on memory instruction reordering. Furthermore, modern pipelined microprocessors feature out-of-order instruction execution to maximize CPU utilization. So, in most modern multiprocessor/multicore architectures, memory reads and writes are not sequentially consistent.

When programmers need the SC memory model, they must enforce it explicitly by using special architecture instructions called memory barriers (or memory fences). Such special instructions tell the processor to immediately propagate updates to and from the memory hierarchy in order to guarantee that "every load and store instruction that precedes in program order the fence instruction is globally visible before any load or store instruction that follows the fence instruction is globally visible".
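As an illustration (ours, not part of the original text), the classic store-buffering pattern shows both the problem and the fence-based fix in C++11: without the fences, each thread's store may be buffered past its subsequent load on a TSO machine, so both reads may return 0; with the fences, at least one thread is guaranteed to observe the other's store.

#include <atomic>

// Store-buffering litmus test (illustrative). x and y both start at 0.
std::atomic<int> x{0}, y{0};
int r1, r2;

void thread0() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full fence
    r1 = y.load(std::memory_order_relaxed);
}

void thread1() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst); // full fence
    r2 = x.load(std::memory_order_relaxed);
}
// With the two fences, the outcome r1 == 0 && r2 == 0 is forbidden.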

In the paper "Consistency Requirements of Distributed Shared Memory for Lamport's Bakery Algorithm for Mutual Exclusion" by Jerzy Brzezinski and Dariusz Wawrzyniak (link), it is proved that the Bakery Algorithm is correct (without any explicit synchronisation) provided that write operations to the array "choosing" are sequentially consistent and write operations to the array "number" are at least PRAM consistent. So the paper states that sequential consistency is not necessary for all memory operations, but only for some of them. It is worth noting that PRAM consistency is a quite relaxed memory model; in particular, it is much weaker than the TSO-like models often used in modern multiprocessors.

The Bakery Algorithm is not used in practice, not only because it does not work on modern architectures that exploit relaxed memory consistency models (such as TSO), but also because it requires reading and writing N distinct memory locations, where N is the maximum number of concurrent processes.

Maurice Herlihy and Nir Shavit, in "The Art of Multiprocessor Programming", proved that "any deadlock-free Lock algorithm requires allocating and then reading or writing at least N distinct locations in the worst case. This result is crucially important, because it motivates us to add to our multiprocessor machines, synchronization operations stronger than read-write, and use them as the basis of our mutual exclusion algorithms."

We have implemented the Bakery Algorithm under Linux using GNU g++; the code can be found here.

Old Tutorial

This is an old tutorial; we are rewriting a new, more up-to-date one.

Simple pipeline

Pipelining is one of the simplest parallel patterns: data flows through a series of stages (or nodes), and each stage processes the input data in some way, producing as output a modified version of it or new data. We will call these data flows streams of data, or simply streams. A pipeline's stages can operate sequentially or in parallel, and may or may not have an internal state.

Suppose you want to improve the performance of a Network Packet Processing Engine that performs some kind of packet analysis on each input packet. As a first step, you want to separate the packet capture phase from the packet analysis phase in order to execute them as a pipeline, overlapping their execution on successive input packets.

Depending on the kind of packet analysis, the second phase could be further split into multiple phases or replicated (using the farm skeleton) to improve the service time.

Let's start writing our first FastFlow pipeline for our Network Packet Engine case study.

First, we have to include the necessary header file and, optionally, import the FastFlow namespace to avoid writing the ff:: prefix everywhere in the code.

#include <pipeline.hpp> // necessary include file
 
using namespace ff; // FastFlow's namespace

Next, we have to define the first and second stages of our pipeline by deriving from FastFlow's ff_node class.

// define a stage deriving from FastFlow's ff_node class
class PacketCaptureStage: public ff_node { 
public:
   PacketCaptureStage(int Npackets):Npackets(Npackets),capturedPackets(0) {} 
 
   /* optional method, called once after the thread is started */
   int svc_init() { 
      /* initialize network device */
      return 0; // 0 means success
   }
   /* you must define the svc method */
   void * svc(void * task) {
       if (capturedPackets == Npackets) return NULL; // End Of Stream
       // allocate task;  capture a network packet filling the task data 
       ++capturedPackets;
       return task; // task is not NULL
   }
   /* optional method: called once at the end of thread execution */
   void  svc_end() {
      /* close network device */   
   }
private:
   int Npackets;
   int capturedPackets; // number of packets captured so far
};

In the same way as above, you have to define the PacketAnalysisStage, and then write the main function where you add the two stages to the pipeline object. Take a look at the following main function.

int main(int argc, char * argv[]) {
    /* type definition, arguments check, .... */
 
    ff_pipeline pipe; // define the pipeline object 
 
    ff_node * packetCaptureStage  = new PacketCaptureStage(Npackets); // build 1st stage
    ff_node * packetAnalysisStage = new PacketAnalysisStage(...);     // build 2nd stage
 
    pipe.add_stage(packetCaptureStage);  // this is the 1st stage of the pipeline
    pipe.add_stage(packetAnalysisStage); // this is the 2nd stage of the pipeline
 
    if (pipe.run_and_wait_end()<0) { // let's go...
        error("running pipeline\n");
        return -1;
    }
    return 0;
}

Simple farm

This section still needs to be written.
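In the meantime, here is a minimal farm sketch (our illustration, assuming the classic ff_farm<> interface used elsewhere in this tutorial): the Emitter generates the stream of tasks (and is also the natural place for application-level scheduling), the Workers process them, and the Collector gathers the results.

#include <vector>
#include <ff/farm.hpp>  // assumed include path for the farm skeleton

using namespace ff;

class Emitter: public ff_node {
public:
    Emitter(long ntasks): ntasks(ntasks) {}
    void * svc(void *) {
        if (ntasks-- <= 0) return NULL;  // NULL signals End-Of-Stream
        return new long(ntasks);         // produce a new task
    }
private:
    long ntasks;
};

class Worker: public ff_node {
public:
    void * svc(void * t) {
        /* compute on the task */
        return t;                        // send the result to the Collector
    }
};

class Collector: public ff_node {
public:
    void * svc(void * t) {
        delete (long *)t;                // consume the result
        return GO_ON;                    // keep going, produce no output
    }
};

int main() {
    ff_farm<> farm;
    farm.add_emitter(new Emitter(1000));
    std::vector<ff_node *> workers;
    for (int i = 0; i < 4; ++i) workers.push_back(new Worker);
    farm.add_workers(workers);
    farm.add_collector(new Collector);
    return farm.run_and_wait_end();      // run the farm and wait
}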

Some FastFlow schemas

Farm schemas

Classical farm skeleton, with the Emitter and the Collector filters.

Note that the Worker function can be different for each Worker filter; this, together with the possibility of defining personalized application-level scheduling in the Emitter filter, leads to a very flexible and powerful farm schema.

FastFlow farm without the Collector filter.

FastFlow farm with feedback channel between the Collector and the Emitter.

FastFlow farm accelerator. The farm has the Collector filter and a feedback channel.

Classical master-worker skeleton.

Pipeline schemas

Simple 3-stage pipeline.

Simple 3-stage torus.

FastFlow pipeline accelerator with feedback channel.

Some possible compositions

A 3-stage pipeline where the middle stage is actually a farm skeleton.

A farm skeleton where the workers are a 2-stage pipeline.
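As a sketch of the first composition above (our illustration, assuming the classic interfaces; FirstStage, Middle and LastStage are hypothetical nodes), note that a farm is itself an ff_node and can therefore be used directly as a pipeline stage:

#include <vector>
#include <ff/pipeline.hpp>
#include <ff/farm.hpp>

using namespace ff;

class FirstStage: public ff_node {      // generates a stream of 100 tasks
public:
    FirstStage(): n(100) {}
    void * svc(void *) { return (n-- > 0) ? new long(n) : NULL; }
private:
    long n;
};
class Middle: public ff_node {          // farm worker: processes one task
public:
    void * svc(void * t) { /* compute on t */ return t; }
};
class LastStage: public ff_node {       // consumes the results
public:
    void * svc(void * t) { delete (long *)t; return GO_ON; }
};

int main() {
    ff_pipeline pipe;
    ff_farm<> farm;
    std::vector<ff_node *> w;
    for (int i = 0; i < 4; ++i) w.push_back(new Middle);
    farm.add_workers(w);
    farm.add_collector(NULL);   // default Collector (assumed API) so that the
                                // farm emits a single output stream
    pipe.add_stage(new FirstStage);
    pipe.add_stage(&farm);      // a farm is itself an ff_node: it composes
    pipe.add_stage(new LastStage);
    return pipe.run_and_wait_end();
}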

Using FastFlow as an accelerator

You can use FastFlow to accelerate existing sequential code. In a nutshell, programmers may identify potentially concurrent tasks and request their execution from an appropriate FastFlow pattern composition in such a way that those tasks are computed in parallel.

Parallel computation of tasks happens on the available cores on the multi/many-core architecture at hand.

By analogy with GPUs and FPGAs used to offload computations from the main processing unit, the cores used to run the user-defined tasks through FastFlow define a software accelerator device. This device provides greater acceleration as more resources (cores and main memory) become available.

From a more formal perspective, a FastFlow accelerator is defined by a skeletal composition augmented with an input and an output stream that can be, respectively, pushed and popped from outside the accelerator. Both the functional and extra-functional behaviour of the accelerator is fully determined by the chosen skeletal composition. For example, the farm skeleton provides the parallel execution of the same code (within a worker object) on independent items of the input stream. The pipeline skeleton provides the parallel execution of filters (or stages) exhibiting a direct data dependency. More complex behaviours can be defined by creating compositions of skeletons whose behaviour could be described using (acyclic or cyclic) data flow graphs. A clear knowledge of accelerator behaviour makes it possible to correctly parallelize segments of code.
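To fix ideas, the following sketch (ours, not tutorial code) shows how the accelerator's input and output streams might be driven from the caller, assuming the classic accelerator interface with offload() and load_result(); Task and ntasks are hypothetical placeholders:

ff_farm<> farm(true);                // true: build the farm in accelerator mode
/* ... add the workers; a Collector is needed to expose the output stream ... */
farm.run();                          // start the accelerator's threads

void * result = NULL;
for (long i = 0; i < ntasks; ++i)
    farm.offload(new Task(i));       // push a task onto the input stream
farm.offload((void *)FF_EOS);        // push the End-Of-Stream mark
while (farm.load_result(&result)) {  // pop results from the output stream
    /* use result */
}
farm.wait();                         // join the accelerator's threads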

To illustrate how FastFlow can be used to build a software accelerator for existing code, we consider a trivial matrix multiplication code and go through the steps needed to derive a FastFlow accelerator from that code.

The use of a farm accelerator is exemplified in the following figure.

The code in the left column of the figure (lines 1-17) corresponds to a sequential program computing matrix multiplication with the three-loop naive algorithm.

To build a software accelerator for this matrix multiplication code, we first identify the possibly concurrent tasks in the original code. We identify lines 12-15 as possible candidates to generate concurrent tasks.

As those tasks are all independent, we deduce that a farm pattern is suitable for implementing the accelerator: the emitter gets the stream of tasks generated within the (single-threaded) main code and automatically dispatches them to the available farm workers in the most efficient way.

Thus we devise the accelerator code as shown in the above figure (right side).

First we instantiate the software accelerator pattern (lines 9-14) with a suitable parallelism degree. In this case we use a simple farm and the parallelism degree corresponds to the number of workers used in the farm.

Then we move the logical task code to the farm worker (lines 40-43 of worker defined at lines 35-48), and we apply a suitable variable renaming to avoid conflicts.

Finally, we change the main code so that the task computation is replaced by the delegation code that sends the task data to the accelerator (lines 19-20).
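Since the figure is not reproduced on this page, the following sketch reconstructs the idea in code form (our reconstruction with hypothetical names and sizes; the figure's actual code and line numbering may differ):

#include <vector>
#include <ff/farm.hpp>

using namespace ff;

const int N = 512;
static double A[N][N], B[N][N], C[N][N];

struct task_t { int i, j; };             // one task per element of C

class Worker: public ff_node {
public:
    void * svc(void * t) {
        task_t * task = (task_t *)t;
        double sum = 0.0;                // local accumulator: the renaming
        for (int k = 0; k < N; ++k)      // that avoids conflicts with main
            sum += A[task->i][k] * B[k][task->j];
        C[task->i][task->j] = sum;
        delete task;
        return GO_ON;                    // no result stream needed here
    }
};

int main() {
    /* ... initialize A and B ... */
    ff_farm<> farm(true);                // accelerator mode
    std::vector<ff_node *> w;
    for (int i = 0; i < 4; ++i) w.push_back(new Worker);
    farm.add_workers(w);
    farm.run();                          // start the accelerator
    for (int i = 0; i < N; ++i)          // delegation code: offload one task
        for (int j = 0; j < N; ++j) {    // per element of C
            task_t * t = new task_t;
            t->i = i; t->j = j;
            farm.offload(t);
        }
    farm.offload((void *)FF_EOS);        // End-Of-Stream
    farm.wait();                         // when wait() returns, C is complete
    return 0;
}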

Because it is composed of threads, the accelerator shares memory with its caller (and with the other threads of the same process). As is well known, transforming a sequential program into a parallel one requires regulating possibly concurrent memory accesses. In low-level programming models this is usually done by using critical sections and monitors, under the responsibility of the programmer. FastFlow does not prohibit these mechanisms, but promotes a methodology to avoid them. In very general terms, sequential code can be correctly accelerated with FastFlow-only mechanisms provided that the offloaded code and the offloading code (e.g. the main thread) do not break any data dependency. FastFlow supports the programmer in enforcing these conditions by means of its two main abstractions: skeletons and streams.

The skeletal structure of the accelerator induces a well-defined partial ordering among offloaded parts of code. The synchronization among threads is enforced by streams along the paths of the particular skeleton composition, as in a data-flow graph. True dependencies (read-after-write) are admissible only along these paths. Streams can carry values or pointers, which act as synchronization tokens for indirect access to the shared memory.

It is worth pointing out that accelerating code in this way differs slightly from plain parallelization of existing code. When the accelerating methodology is used, tasks are identified, the code computing those tasks is moved into suitable skeletal code (the software accelerator), and finally the code computing the tasks is replaced with delegation code that offloads the data structure representing each task onto the accelerator input stream.

In the case of from-scratch parallelization, more extensive programmer knowledge is needed, whereas in the case of code acceleration, different kinds of pre-defined accelerator templates may be provided directly by the FastFlow framework. As a consequence, the programmer only has to identify the concurrent tasks in the code and provide a suitable representation of those tasks to be submitted through the accelerator input stream.
