### **HPCC 2014** The 16th IEEE International Conference on **High Performance Computing and Communications** Paris, France, August 20-22, 2014



Marco Aldinucci, University of Torino, Computer Science Department

HPCC keynote talk, Paris, France. Aug 20, 2014









# Outline

- Parallel patterns and structured parallel programming
  - On the hierarchy of abstractions for parallel programming \*
- Data-centric concurrency \*
  - FastFlow: a data-centric run-time support for heterogenous platforms \*
- Applications
  - With FastFlow and other frameworks •
- \* ... and lot of cars



# MPI is like a car, you can drive it as you like







```
#include <stdio.h>
#include "mpi.h"
#define MAXPROC 8
                     /* Max number of procsses */
                    /* Max length of machine name */
#define NAMELEN 80
                     /* Lengt of send buffer is divisible by 2, 4, 6 and 8 */
#define LENGTH 24
main(int argc, char* argv[]) {
 int i, j, np, me;
                              /* Tag value for sending name */
 const int nametag = 42;
                             /* Tag value for sending data */
  const int datatag = 43;
                              /* Root process in scatter */
 const int root = 0;
                              /* Status object for receive */
 MPI Status status;
 char myname[NAMELEN];
                                    /* Local host name string */
 char hostname[MAXPROC][NAMELEN]; /* Received host names */
 int x[LENGTH];
                        /* Send buffer */
                       /* Receive buffer */
 int y[LENGTH];
 MPI Init(&argc, &argv);
                                         /* Initialize MPI */
                                        /* Get nr of processes */
 MPI Comm size(MPI COMM WORLD, &np);
 MPI_Comm_rank(MPI_COMM_WORLD, &me);
                                        /* Get own identifier */
                                    /* Get host name */
  gethostname(&myname, NAMELEN);
                   /* Process 0 does this */
 if (me == 0) {
   /* Initialize the array x with values 0 .. LENGTH-1 */
   for (i=0; i<LENGTH; i++) {</pre>
     x[i] = i;
    }
   /* Check that we have an even number of processes and at most MAXPROC */
   if (np>MAXPROC || np%2 != 0) {
     printf("You have to use an even np (at most %d)\n", MAXPROC);
     MPI Finalize();
     exit(0);
    }
   printf("P %d on host %s is distributing array x to all %d processes\n\n",
         me, myname, np);
   /* Scatter the array x to all proceses, place it in y */
   MPI Scatter(&x, LENGTH/np, MPI INT, &y, LENGTH/np, MPI INT, root, \
            MPI_COMM_WORLD);
   /* Print out own portion of the scattered array */
   printf("Process %d on host %s has elements", me, myname);
```

```
for (i=0; i<LENGTH/np; i++) {</pre>
    printf(" %d", y[i]);
  printf("\n");
  /* Receive messages with hostname and the scattered data */
  /* from all other processes */
  for (i=1; i<np; i++) {</pre>
  MPI_Recv (&hostname[i], NAMELEN, MPI_CHAR, i, nametag,
    MPI COMM WORLD, &status);
  MPI_Recv (&y, LENGTH/np, MPI_INT, i, datatag, MPI_COMM_WORLD, &status);
    printf("Process %d on host %s has elements", i, hostname[i]);
    for (j=0; j<LENGTH/np; j++) {</pre>
    printf(" %d", y[j]);
    printf("\n");
  printf("Ready\n");
} else { /* all other processes do this */
  /* Check sanity of the user */
  if (np>MAXPROC || np%2 != 0) {
    MPI_Finalize();
    exit(0);
  /* Receive the scattered array from process 0, place it in array y */
  MPI_Scatter(&x, LENGTH/np, MPI_INT, &y, LENGTH/np, MPI_INT, root, \
    MPI_COMM_WORLD);
  /* Send own name back to process 0 */
  MPI_Send (&myname, NAMELEN, MPI_CHAR, 0, nametag, MPI_COMM_WORLD);
  /* Send the received array back to process 0 */
  MPI_Send (&y, LENGTH/np, MPI_INT, 0, datatag, MPI_COMM_WORLD);
MPI_Finalize();
exit(0);
```



```
bool push(void *const data) {
 unsigned long pw, seq;
   element_t * node;
   unsigned long bk = BACKOFF_MIN;
   do {
       pw = pwrite.load(std::memory_order_relaxed);
       node = &buf[pw & mask];
       seq = node->seq.load(std::memory_order_acquire);
       if (pw == seq) { // CAS
          break;
         for(volatile unsigned i=0;i<bk;++i) ;</pre>
         bk <<= 1;
         bk &= BACKOFF_MAX;
       } else
         if (pw > seq) return false; // queue full
   } while(1);
   node -> data = data;
   node->seq.store(seq+1,std::memory_order_release);
   return true;
```

### CAS & memory consistency

if (pwrite.compare\_exchange\_weak(pw, pw+1, std::memory\_order\_relaxed))



```
__global__ void kernel1(int *lock, int *x) {
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
 do {
   //nop
 } while(atomicCAS(lock, 0, 1));
 //critical region
  *lock = 0;
}
__global__ void kernel2(int *var, int *x) {
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int oldval, newval = 0;
  do {
    oldval = *var; //read shared var
    //critical region
    newval = oldval + 1; // f(oldval)
  } while(atomicCAS(var, oldval, newval) != oldval);
  x[idx] = newval;
```

- Deadlock on GPGPU (unless nvcc -G).

Data-dependency can be managed only via lock-free algorithms

### Deadlock

### Works?

A different execution model. Impossible to make any assumptions about scheduling



# Message-passing (e.g. MPI) Shared-memory (e.g. threads + mutex/CAS)

- (Can be) Efficient
- Available in almost all platforms
  - Often the only access to net, e.g. HRLS Cray XE6 Hermit: 32 x 3552 nodes = 113 664 cores
  - De-facto standards, used for decades •
- Parallel primitives fully inter-waived with business code
- How compose "computing phases", "SW modules" ...
  - Parallel behaviour and data layout not explicit in the code. Primitives often do not compose.
- Often to be coupled with shared-memory for intra-node
  - e.g. MPI & Pthreads, MPI & OpenMP, PGAS-MPI & OpenCL, ...



Lot of

# TY MPT I V threads

No whiners

```
(POP_SO_ISO(pbOut).N), & (POP_J
    195(pbIn);
    10C(pbOut):
   if IsFULL(pbOut)
        if (15_05_SK(pbOut))
         MPI_Vaitall(pbGut->ndest,pbGut)
        else
         SET_DS_SM(pbOut);
        EVAP(pb0i/s):
        SetLAST(pbOut,0);
        for[skie_i=0;skie_i<[pbOut->sdear
         MPI_lsend(pbOut->c.sizeof(stre
                    TPI_BTTE,pbOut->deat
                   #(pbOut->req_Out[skB
        RESUT(pbOut);
   if (isSMPTT(pbis))
       MPI_Wait(&(req_task_r)_&(sta_task
       MPI_Wait(&(pbln->req_Is),&(pbln-
       pbIn->sexttag=pbIn->sta_In.MP1_T
        SWAP(pbln);
        if (!IsLAST(pbIn))
           MP1_Isend(&task_r,sizeof(task
                      _S_REQUEST_TAG_MPI
           MP1_Irecv[pb1a->c,sizeof(stm
                     phis-raitt. MP1_ANY
        MASST(pbIn);
/* IDS Manageneot */
pbOut->acttag=pbIn->nexttag;
WriteMARK(pbOut,MARK(pbIn));
if (IS_GS_SM(pbOut))
 MPI_Waital1(pbOut->odest.pbOut->req_Du
SWAP(pbOut);
SetLAST(pbOut.1);
for(skie_i=0;skie_i<(ptOut->ndest);skie_
 MPI_Isend(pbDut->c,sizeof(strean_type)
           MP1_BYTE,pbOut->dest[skie_i]
           &(pbOut->req_Out[skie_i]));
MP1_Waitall(pbOut->ndest,pbOut->req_Out,)
```

### Design and validation of the MPI backend for the SkIE compiler

Marco Aldinucci

Dec 18, 1997 Quadrics Supercomputing World (QSW) Technical Report



# Unstructured programming considered harmful

### Edgar Dijkstra: Go To Statement Considered Harmful

### Go To Statement Considered Harmful

Key Words and Phrases: go to statement, jump instruction, branch instruction, conditional chanse, alternative clause, repetjure clause, program intelligibility, program sequencing CE Categories: 4.32, 5.25, 5.38

### Excrete a

For a number of years I have been familiar with the observation that the quality of programmers is a deverasing function of the density of go to statements in the programs they produce. More security I discovered why the use of the go to statement has such deastrow efforts, and I became convinced that the go to statesect should be abslabed from all "higher level" recomming dynamic program is only characterized when we also give to which call of the procedure we refer. With the indusion of procedures we can characterise the program of the process via a sequence of textual indices, the length of this sequence being equal to the dynamic depth of procedure calling.

Let us now consider repetition classes (like, while 3 repeat A or repeat A sectil #1. Logically speaking, such classes are now superfluous, because we can express repetition with the aid of recursive procedures. For reasons of realism I don't wish to exclude them: on the one hand, repetition clauses can be implemented quite confortably with present day finite equipment; on the other hand, the reasoning pattern known as "induction"

### Send-Receive Considered Harmful: Myths and Realities of Message Passing

SERGEI GORLATCH Universität Münster

During the software crisis of the 1960s, Dijkstra's famous thesis "pote considered hareafal" paved the way for structured programming. This short communication suggests that many current difficulties of parallel programming based on message passing are caused by poorly structured communication, which is a consequence of using low-level send receive primitives. We argue that, like goto in sequential programs, send-receive should be avoided as far as possible and replaced by collective operations in the setting of message passing. We dispute some widely held opinions about the apparent superiority of pairwise communication over collective communication and present substantial theoretical and empirical evidence to the contrary in the context of MPI (Message Passing Interface).

Categories and Subject Descriptors: D.1.3 (Programming Techniques): Concurrent Programming-Parallel programming

General Terms: Algorithms, Languages, Performance, Experimentation, Measurement

Additional Key Words and Phrases: Programming methodology, Message Passing Interface (MPI)











REALITY

Loop parallelism helps

OpenMP is great example
It simply works well on loops
Not really useful for all problems
Tasks, Graphs, ...



# Programmings model is paramount



Figure 1. A view from Berkeley: seven critical questions for 21<sup>st</sup> Century parallel computing. (This figure is inspired by a view of the Golden Gate Bridge from Berkeley.)

DOI:10.1145/1562764.1562783

Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.

BY KRSTE ASANOVIC, RASTISLAV BODIK, JAMES DEMMEL, TONY KEAVENY, KURT KEUTZER, JOHN KUBIATOWICZ, **NELSON MORGAN, DAVID PATTERSON, KOUSHIK SEN,** JOHN WAWRZYNEK, DAVID WESSEL, AND KATHERINE YELICK

# A View of the Parallel Computing Landscape

3. What are the building blocks?

tech form plic that pow as a sequ con inef pow issu exec and perf que hit diss cha pro tion con two was ceed in 2 toda  $200^{\circ}$ arch

01

# And it is paramount also at the large scale

### The Abstract Machine Model & Execution Mo



- A computer language is not a programming model
  - "C++ is not scalable to exascale"
- A communication library is not a programming model
  - "MPI won't scale to exascale"
- A unique application inventory response...
  - Should we be talking "Execution Model"?
- What is a programming model?

Thought: real issue is mapping science problem to execution model and run-time system

Courtesy: P. Beckman, Director, Exascale Technology, ANL

| d | el |
|---|----|
|---|----|

| nputing |  |
|---------|--|
| emory   |  |

 Coarse grain concurrency is nearly exhausted

Often, it is not about FLOPS, it is about data movement

 Programming systems should be designed to support fast data movement and enforce locality

Variable coherency & intersocket messaging



# Excerpts

- MPI, the current dominant programming model for parallel scientific programming, forces coders to be aware of the exact mapping of [Gursoy and Kale 2004]

computational tasks to processors. This style has been recognised for years to increase the cognitive load on programmers, and has persisted primarily because it is expressive and delivers the best performance. [Snir et al 1998]

Because we anticipate a massive increase in exploitable concurrency, we believe that this model will break down in the near future, as programmers have to explicitly deal with decomposing data, mapping tasks, and performing synchronisation over thousands of processing elements. [Asanovic et al 2006]



# Abstraction (D. Reed)

"To date, attempts to develop higher level programming abstractions, tools and environments for HPC have largely failed.

There are many reasons for this failure, but I believe many are rooted in our excessive focus on hardware performance measures.

By definition, the raison d'être for high-performance computing is high performance, but FLOPS need not be the only measure.

Human productivity, total cost and time to solution are equally, if not more important."

### COMMUNICATIONS ACM NEWS BLOGS OPINION RESEARCEPRAC

Home / Hogs / HLOG(SCACM / High-Performance Computing: Where / Full Text

### BLOG@CACM High-Performance Computing: Where

By Daniel Reed May 7, 2009 Comments 🚊 SHARE: 💷 🤮 🚇 VIEW



You are a young entrepreneur with a "can't miss" idea for the next great social networking service, one so hip, so cool and so awesome that it will spread like a virus, while venture capital firms line up to beg for a piece of the action. It's just a simple matter of programming to turn your idea into code that scales seamlessly across tens of thousands of data center servers.

You know the drill. Fire up the coffee percolator, break out your copy of Kernighan and Ritchie, grab your cheat sheet of TCP/IP functions and parameters, slam a cassette into the tape player and start programming. Oh, wait; that's so 70s and 80s! In the hierarchy of abstractions, it's only slightly above toggling absolute binary into the front panel of the

machine.

In the web service world, we have moved beyond these low level tools. Over the past twenty years, we have built and embraced a suite of powerful libraries, scripting languages, software services and tools that allow developers to create complex software systems, while hiding the low level attributes of networks and computer systems. We focus on composition, abstraction, rapid deployment, software scaling and human productivity.

Meanwhile, in the world of high-performance computing, message passing has remained the programming paradigm of choice for over twenty years. The durable Message Passing Interface (MPI) standard, with send/receive, broadcast and reduction operators is still used to construct parallel programs composed of tens to hundreds of thousands of communicating processes. Each interprocess communication is orchestrated by the developer based on knowledge of code function and overhead.

To date, attempts to develop higher level programming abstractions, tools and environments for high-performance computing have largely failed. There are many reasons for this failure, but I believe many are rooted in our excessive focus on hardware performance measures. By definition, the raison d'être for high-performance computing is high performance, but floating point operations per second (FLOPS) need not be the only measure. Human productivity, total cost and time to solution are equally, if not more important.

I am confident that high-performance computing can and should learn a few tricks from the world of web services. We need a Ruby on Rails for defining parallel application frameworks and an Erlang for concurrent specification. We need to focus on high-productivity computing, balancing human and machine performance.

You know the drill. Pour some hot water in the French press, break out your copy of Lattice QCD for Dummies, grab your cheat sheet of high-performance computing abstractions, download some digital tunes and start programming. In all seriousness, a new world of opportunity and scientific discovery awaits those who first embrace and master the abstractions needed to create rich, multidisciplinary parallel applications.



### ACM R Visual Course

Those V

Not Mat

Bertrand

I Didn't State

# Low-level programming models: assessment

- Efficient (can be) and widely supported •
- Not likely to scale to mainstream (industrial) development
- Not compositional
- Parallel behaviour not explicit (can be hard to catch) \*
- At the bottom line, higher-level "mental" overlays are already there •
  - In the mind of designers (i.e. data organisation, partition, data patterns, ...)



# Programming model: my wish list

### Should enforce to think to problems in parallel & and at high-level of abstraction \*

- \*
- implementation

### Should support containment and composition \*

At large scale: clear fault model with containment & recovery \*

### Should integrate synchronisation/communication with scheduling \*

- System programmers should use the techniques they advocate

Clear semantics: functional and extra-functional (parallel), describing collective behaviours Trading memory coherency for power, and power for parallelism should be a matter of the

Weak execution model rather than per device. Multicore, GPGPUs, distributed with an unifying vision



# An alternative mobility approach



# **Skeletons and Patterns**

- From HPC community
- Started in early '90 (M. Cole's PhD thesis)
- Pre-defined parallel higher-order functions, exposed to programmers as constructs/lib calls

### Algorithmic Skeletons

- From SW engineering community
- Started in early '00

"Recipes" to handle parallelism (name, problem, algorithms, solutions, ...)

### Parallel Design Patterns



# Evolution of the concept



• Targeting cluster of heterogeneous multicore + GPGPU • Quite complex tool chain (compiler + run-time system)



# Evolution of the concept



Intel/TBB/CnC (?), Microsoft/TPL (?), Google/MapReduce (?), ...







# Algorithmic skeletons



Divide&Conquer, Pool, MDF, ...

Pipeline, farm, ...

Data parallel

Task parallel

Stream parallel

Problem

App developer



### High-level code



# Structured parallel programming

# Parallel Design methodology to compose and extend Patterns

### Problem



### **Programming tools**





supporting structure and implementation mechanisms



### Flexible high-level code



instantiate



# MapReduce a success story

**Innovative implementation** (even if theoretically an instance of Map+Reduce)

FAN programs obey the single-a defining each variable. All outp in the program's body. Each eg Expressions (r) can be const function applications or skeleto as map, reduce, etc.,

 $E_2 = pair | proj1$ 

The scope of each variable definition extends across all subsequent definitions in the same body. At the end of the body, we can specify local definitions using a where clause. Names of FAN programs can be used as functions in expressions.

FIGURE 1 A FAN program to compute the inner product of two vectors.

Algorithm: and Applications, Vol. 16, pp. 87-131 Repton available density from the publisher Partosopping permitted by large unit

nine and its

### TOWARDS PARALLEL PROGRAMMING BY TRANSFORMATION: THE FAN SKELETON FRAMEWORK\*

M. ALDINUCCI\*, S. GORLATCH\*, C. LENGAUER\* and S. PELAGATTI\*

\*Dipersimento di Informatica, Università di Piso, 40 Corso Italia, 3-36123 Pisa, Italy; \*Fakultür für Mathematik und Informatik. Universität Pantas, D-94930 Pattas, Germany

```
e = c |x| ee |\lambda x.e| E_1 ee |E_2 e| E_3 eee
E<sub>1</sub> = map | map<sub>4</sub> | reduce | scanL | copy | split | part | rearrange
E<sub>3</sub> = loopfor | loopwhile | looprepeat
```

```
inner.product (in a, b : Array n Scalar, out c : Scalar)
  t = map(*) (pair(a,b));
  c = reduce (+) t;
```

| -       | ni N   | ٧  |  |
|---------|--------|----|--|
| lane of | 1.00   | ÷. |  |
| each i  | Scie   |    |  |
| here:   | stephy | ÷  |  |





# Patterns are natural for GPGPUs

### Courtesy: J. Owen, UC Davis

### **Think In Parallel**



- Thousands of parallel threads
- Thousands of data elements to process
- All data processed by the same program SPMD computation model
- Contrast with task parallelism and ILP

### Best results when you "Think Data Parallel"

- Design your algorithm for data-parallelism
- Understand parallel algorithmic complexity and efficiency
- Use data-parallel algorithmic primitives as building blocks





### **Data-Parallel Algorithms**



Efficient algorithms require efficient building blocks

- This talk: data-parallel building blocks
  - Map
  - Gather & Scatter
  - Reduce
  - Scan



# Maybe also on FPGA

### Courtesy: A. Koch, TU Darmstadt





### Role of FPGA in FastFlow

### FPGA can act as entire FastFlow subgraph





# Separation of concerns Application prog System programmer Inversion of Structure suggest

Inversion of control

Structure suggested by programmer - no idiom recognition
Clear functional and parallel semantics

### Performance

Close to hand tuned code (better for not expert programmers)
Reduced development time. Graceful tuning curve.

Application programmer: what to do
System programmer: how to make it in parallel



# FastFlow (FF)

### and its data-centric run-time support



### **High-level patterns**

parallel\_for, parallel\_forReduce, ...

astFlow

### **Core patterns**

pipeline, farm, feedback

Building blocks queues, ff\_node, ...

CUDA

OpenCL



Multicore and many-core platforms Clusters of multicore + many-core



61



a native protocol)

(OpenCL, CUDA)

TCP/IP, OFED/IB, MPI (ongoing), HW/SW PGAS (ongoing)



# Building blocks Everything is SPSC (in the shared-memory)

### FF bound shmem FIFO channel Single-Producer-Single-Consumer • *lock-free fence-free queue* \* FF unbound shmem FIFO channel \* Single-Producer-Single-Consumer *lock-free fence-free queue* \* Distributed zero-copy channel OMQ/TCP or native IB/OFED \* shmem channels communicate

pointers in a message passing style

- Enough to support producer-consumer
  - Inherently weaker w.r.t. mutual exclusion
  - Weaker execution model requirements
    - Mutex not really possible on SIMT model (GPU)
    - Mutex requires memory-fences and leverages on (expensive) cache coherency on multicore
  - Deadlock is cyclic networks avoided via unbound queue (wait-free)

M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, M. Torquati. An efficient unbounded lock-free queue for multi-core systems. Euro-Par 2012, LNCN 7484, 2012.



# Queues: performance

### Shared-memory



- MVAPICH (shmem) ~ 190ns
- Faster and more scalable than
   CAS/test-and-set implementation

### Distributed

| Message size | ib_write_bw | MPI    | FastFlow   | FastFlow/ZMQ  |
|--------------|-------------|--------|------------|---------------|
| (bytes)      | (Mb/s)      | (Mb/s) | /IB (Mb/s) | /IPoIB (Mb/s) |
| 10           | 300         | 192    | 129        | 0.7           |
| 100          | 3,600       | 1,816  | 1,300      | 7.0           |
| 1,024        | 22,900      | 13,936 | 10,591     | 70.0          |
| 5,000        | 25,200      | 23,880 | 19,761     | 300.0         |
| 10,000       | 25,500      | 25,128 | 20,479     | 500.0         |
| 25,000       | 25,700      | 12,408 | 20,051     | 1,100.0       |
| 50,000       | 25,800      | 16,232 | 21,019     | 1,950.0       |
| 65,536       | 22,900      | 17,472 | 20,889     | 1,980.0       |
| 200,000      | 25,800      | 21,208 | 21,211     | 3,800.0       |
| 400,000      | 25,800      | 22,532 | 21,226     | 6,200.0       |

TABLE I: Comparing throughput of different implementations of the *unidirectional bandwidth* test for several message sizes.

Comparable with MVAPICH (distributed)



# Core patterns

- \*
  - Full control of locality and memory affinity \*
- Fully compositional and deadlock-free (by construction) •
- Minimalistic run-time •
  - Two patterns and one pattern-modifier: farm, pipeline and feedback \*
  - Configurable scheduling and gathering policies and implementations \*
- Full expressivity

# Streaming patterns: fast data movements + computation-to-core pinning

Enough to build higher-level abstraction: Master-worker, D&C, Map, Reduce, Pool, Stencil, ...



## Core patterns: farm, pipeline, feedback + Nonblocking/blocking, dynamic/static scheduling, E/C policy configuration, core pinning





farm with in-memory scheduling



farm of GPU nodes

Specialisation (00)

S2 GPU

pipeline with GPU nodes



## Core patterns can be arbitrarily composed

#### pipe(farm(s),farm(s)) farm(pipe(s1,s2))





pipe(farm(s),farm(s),loopback)

pipe(s1,farm(s2))





farm(farm(s))





#### Master-worker (0.5 µS workload)



Number of Cores



### Master-worker (5 µS workload)





### Master-worker (5 µS workload)





### Core patterns: a programming model

- A low/medium-level data-centric programming model \*
  - Concurrent computation modelled as a (cyclic) graph \*
  - Nodes are parallel activities. Edges are true data dependencies •
  - Synchronisation are messages, data can be moved as messages or shared memory \*
    - Can be realised with or without coherency, in shared-memory, distributed, PGAS, ... \*
- Not exactly a Kahn's net, more a CSP-actor hybrid model
  - Processes are named and the data paths between processes are identified



# High-level patterns

#### Address application needs \*

- Loop parallelism (OpenMP fashion) •
  - Parallel-For, Parallel-For-Reduce \*
- Data Parallelism •
  - Stencil, Stencil-Reduce, Map, MapReduce (pipelined) \*
- Task & Stream •
  - \*
  - Farm, Pipeline \*

Implementation-wise, just OO extensions of composition of core patterns •

Pool (e.g. genetic alg.), Macro-Data-Flow (e.g. linear algebra, dynamic programming, ...)



# High-level patterns

- Think in parallel & high-level
  - Efficiency, portability, time-to-market •
- High-level parallel patterns
  - Describing collective behaviour •
  - Can be: expressive, efficient, compositional \*
  - Multicore, GPGPUs, distributed with an unifying vision •
- Thanks to clear semantics, can be autonomically reconfigured
  - not discussed in this talk see ParaPhrase EU STREP FP7 project

#### Autonomic management of non-functional concerns in distributed & parallel application programming

Marco Aldinucci Dept. Computer Science University of Torino Torino - Italy aldinac @ di.anito.it

Marco Danelutto Dept. Computer Science University of Pisa Plag - Italy marcod@di.anipi.it

Peter Kilpatrick Dept. Computer Science Queen's University of Belfast Bellast - UK p.kilpatrick@quh.ac.ak

#### Abstract

An approach to the management of non-functional concerns in massively parallel and/or distributed architectures that marries parallel programming patients with autonomic computing is presented. The necessity and suitability of the adoption of autonomic techniques are evidenced. Issues arising in the implementation of autonomic managers taking care of multiple concerns and of coordination among kierarchies of such autonomic managers are discussed. Experimental results are presented that demonstrate the feasibility of the approach.

Keywords: autonomic management, algorithmic skele-

fanctional concerns are those concerns not directly related to the result computed by an application, but rather to the way this result is computed. Examples of non-functional concerns include performance, security, fault tolerance. Management of such concerns usually requires extensive knowledge of the target execution environment and appropriate interaction with the functional code of the application. Non-functional concern management becomes increasingly complex as the target architecture becomes more and more dynamic and heterogeneous, and the features of the target execution environment are progressively hidden from the application programmer (moving from clusters to "invisible grids" [6] and from grids to clouds). This sug-

M. Aldinucci, M. Danelutto, P. Kilpatrick. Autonomic management of non-functional concerns in distributed and parallel application programming. IPDPS, 2009. IEEE.





### Example: parallel for

// FastFlow (--std=c++11) ff::ParallelFor pf; pf.parallel\_for(0L,N,[&A](const long i) { A[i]+=1; },nworkers);

• Currently a method call on a C++11 Lambda function (loop body)

All other high-level patterns in the same style \*

Moving to C++11 generalised attribute syntax (N2761) \* Within REPARA EU-FP7 STREP project •

[[ff::target(ff::cpu,ff::gpu), ff::input(A), ff::output(A), ...]] for( ; ; ; ) { ... }

// OpenMP (-fopenmp) #pragma omp parallel for num\_threads(nworkers) for(long i=0;i<N;++i) {</pre> A[i]+=1;



## Applications

Stream: Sequence Alignment Tools

Tasks: Linear Algebra

Data: Image Filtering

System programmers should use the techniques they advocate: memory allocation



# Stream: Bowtie (BT) and BWA sequence alignment tools



- Top tools for parallel DNA alignment
- Hand-tuned C/C++/SSE2 code
- Spin locks + Pthreads
- Sequences (reads) are streamed from MMIO files to workers
- Memory bound

- FastFlow master-worker
- Memory affinity, pinning, affinity scheduling (embedded in the pattern)
- ✤ BT: up to 200% speedup BWA: up to 10% speedup over originals



C. Misale, G. Ferrero, M. Torquati, M. Aldinucci. Sequence alignment tools: one parallel pattern to rule them all? BioMed Research International, 2014.



46

## Graphs of tasks LU and Cholesky

- Macro-Data-Flow (MDF) pattern encoding dependence DAG
  - pipeline(TaskGen, farm(TaskExec))
  - Configurable scheduling, affinity, ...
- Dynamic generation of the DAG
- Comparable or faster than specialised linear algebra frameworks (e.g. PLASMA)
  - MDF is general, can be used for a wide range of applications, Dynamic Programming, Divide&Conquer, ...









30

20

15

10

Speedup

## Data Parallelism: Two-stage restoring



- Statistic detection + variational restoration
  - High quality, edge-preserving filtering \*
  - Much more computational demanding, not really viable without parallelism
    - Matlab on a single 256x256 image with 50% of noise requires dozen of minutes
  - Stages can be pipelined



## Effective noise filtering (variational) Salt-andPepper, Gaussian, ...



10% impulsive noise

Original Baboon standard test image 1024x1024

#### Restored

50% impulsive noise

90% impulsive noise



PNSR 32.75dB MAE 2.67

PNSR 23.4 MAE 11.21





M. Aldinucci, M. Torquati, M. Drocco, G. Peretti Pezzi, and C. Spampinato. Fastflow: Combining pattern-level abstraction and efficiency in GPGPUs. GTC 2014, San Jose, CA, USA, 2014

```
using namespace ff:
```

template<typenome DenoiserCUDAtaskType, typenome DenoiserCUDAmapF> class cudaDenoiserAUTO: public Denoiser, public FFSTENCILREDUCECUDA(DenoiserCUDAtaskType, DenoiserCUDAmapF, reduceF) { public:

```
bool trace_time) :
```

```
kernel_params(kernel_params_),
```

FFSTENCILREDUCECUDA(DenoiserCUDAtoskType, DenoiserCUDAmopF, reduceF)(mox\_cycles), Denoiser(height, w dth. fixed\_cycles, max\_cycles, trace\_time) [

```
void "swc(void "t) (
  ((denoise_task *)t)->kennel_params = kennel_params;
```

```
((denoise_task *)t)->fixed_cycles = fixed_cycles;
return Denoiser::svc(t);
```

```
() () Chm. svc. md
```

unsigned int restore unsigned chor "in, unsigned chor "out, int "noisymop, unsigned int "noisy, unsigned int n\_noisy, void \*task) {

```
unsigned int height = this->height;
   unsigned int width = this-swidth;
   memcpy(out, in, height * width * sizeof(unsigned chor));
   FFSTENCILREOUCECUDA(DenoiserCUDAtaskType, DenoiserCUDAmapF, reduceF)::svc(task);
   return this->getIter();
privote:
 void kernel parans
3:
         CUDADENOISERAUTO_H
```

cudaDenoiserAUTOCvoid \*kernel\_params\_, unsigned int height, unsigned int width, bool fixed\_cycles, unsigned int max\_cycles,

#### Stencil-Reduce pattern init

Stencil-Reduce pattern run





#### Performance (CPUs + GPGPUs) Video frames 768x512





no difference w.r.t. hand-written CUDA code

#### SandyBridge 16 cores + 2 Tesla M2090



















- A network that connects data allocations-deallocations paths
- Faster than posix, often faster
   than hoard and TBB
  - unpublished, code available on sourceforge
  - Implements deferred deallocation to avoid ABA problem







#### Conclusion

- Low-level approach for a better performance/scalability \*
  - \*
  - Not proved, but we believe it will be even more evident at the large scale (exa-scale) \*
- FastFlow: header-only C++11 library
  - Research framework, portable everywhere exists a C++11 compiler, tiny codebase \*
  - Efficient, scalable \*
- A data-centric parallel programming model is paramount •
  - movements

It is already a myth, a medium scale. Does someone still believe assembler is faster than C++?

High-level with a clear parallel semantics, compositional, enhancing locality and fast data



#### Think different.

Stay foolish, play with cars Stay hungry, keep looking to new toys



#### Thanks



University of Turin



M. Aldinucci



G. Peretti



A. Secco



F. Tordini



University of Pisa



M. Danelutto



M. Torquati





P. Kilpatrick

40,403 Pageviews 81,457 Avg. Visit Duration 00:02:26

Visits

https://sourceforge.net/projects/mc-fastflow/







M. Drocco









EU-FP7 - 3.5M€







EU-FP7 Network of Excellense



