  - to support the development of portable and efficient applications for homogeneous and heterogeneous platforms, including multicore, many-core (e.g. GPGPU, FPGA), and distributed clusters of them.
  
These two goals have been perceived as a dichotomy for many years by many computing practitioners. Researchers in the [[http://en.wikipedia.org/wiki/Algorithmic_skeleton|skeletal and high-level parallel programming community]] never perceived it this way. Indeed, FastFlow is the latest of a series of high-level pattern-based parallel programming frameworks we have designed over the last fifteen years (detailed in the contributors' home pages: [[http://alpha.di.unito.it/parallel-computing-tools-marco-aldinucci/|MarcoA]], [[http://calvados.di.unipi.it/dokuwiki/doku.php/torquatinamespace:software|Massimo]], [[http://backus.di.unipi.it/~marcod/wiki/doku.php?id=paralleltools|MarcoD]]).
Along this path we were happy to see high-level parallel programming become a mainstream approach, as demonstrated by large industry involvement in the field: Google (with MapReduce), Intel (with TBB and CnC), and Microsoft (with TPL). A more comprehensive introduction to high-level parallel programming can be found in one of our recent [[http://calvados.di.unipi.it/storage/talks/2014_repara_skeletonintro_marcod.pdf|talks]].
  
 The FastFlow architecture is organised in three main tiers:
  
  - **High-level patterns**. These are clearly characterised by a specific usage context and are targeted at the parallelisation of sequential (legacy) code. Examples are the exploitation of loop parallelism, stream parallelism, data-parallel algorithms, execution of general workflows of tasks, etc. They are typically equipped with self-optimisation capabilities (e.g. load balancing, grain auto-tuning, parallelism-degree auto-tuning) and exhibit limited nesting capability. Examples are: ''parallel-for'', ''pipeline'', ''stencil-reduce'', ''mdf'' (macro-data-flow). Some of them target specific devices (e.g. GPGPUs). They are implemented on top of **core patterns**.
  - **Core patterns**. They provide a general //data-centric// parallel programming model with its run-time support, which is designed to be minimal and to reduce the typical sources of overhead in parallel programming to a minimum. At this level there are two patterns (''farm'' and ''pipeline'') and one pattern modifier (''feedback''). They make it possible to build very general (deadlock-free) cyclic process networks. They are not graphs of tasks but graphs of parallel executors (processes/threads); tasks or data items flow across them. Overall, the programming model can be envisioned as a shared-memory streaming model, i.e. a shared-memory model equipped with message-passing synchronisations. They are implemented on top of **building blocks**.
  - **Building blocks**. This tier provides the basic blocks used to build (and generate, via C++ header-only templates) the run-time support of core patterns. Typical objects at this level are queues (e.g. wait-free, fence-free SPSC queues, bounded and unbounded), process and thread containers (as C++ classes), and mediator threads/processes (extensible and configurable schedulers and gatherers). The shared-memory run-time support extensively uses nonblocking lock-free (and fence-free) algorithms, the distributed run-time support employs zero-copy messaging, and the GPGPU support exploits asynchrony and SIMT-optimised algorithms.
  
  
 </code>|
  
In this specific case, the only syntactic difference between OpenMP and FastFlow is that FastFlow provides programmers with C++ templates instead of compiler pragmas. It is worth noticing that, despite the similar syntax, the implementation of ''parallel_for'' and of all other high-level patterns in FastFlow is quite different from that of OpenMP and other mainstream programming frameworks (Intel TBB, etc.). Instead of relying on a general task execution engine, FastFlow generates at compile time a specific streaming network based on core patterns for each pattern. In the case of ''parallel_for'' this network is a parametric master-worker with an active or passive (in-memory) task scheduler (more details in the [[http://calvados.di.unipi.it/storage/paper_files/2014_ff_looppar_pdp.pdf|PDP2014 paper]]).
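To make the "templates instead of pragmas" point concrete, here is a deliberately naive parallel-for built on plain ''std::thread''. The helper name ''naive_parallel_for'' is ours and the static-chunking strategy is only illustrative: it is NOT FastFlow's implementation, which, as said above, compiles into a master-worker streaming network with configurable scheduling.

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Toy static-partitioning parallel-for on raw std::thread: same lambda-based
// calling style as a template-based framework, none of the run-time smarts.
inline void naive_parallel_for(long first, long last, unsigned nworkers,
                               const std::function<void(long)>& body) {
    std::vector<std::thread> pool;
    const long chunk = (last - first + nworkers - 1) / nworkers;
    for (unsigned w = 0; w < nworkers; ++w) {
        const long lo = first + w * chunk;
        const long hi = std::min(lo + chunk, last);
        if (lo >= hi) break;
        pool.emplace_back([lo, hi, &body] {
            for (long i = lo; i < hi; ++i) body(i);  // each worker owns a chunk
        });
    }
    for (auto& t : pool) t.join();                   // barrier at loop end
}
```

Usage follows the same shape as a pragma-free loop body, e.g. ''naive_parallel_for(0, N, 4, [&](long i){ A[i] = f(i); });''.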
  
 As in OpenMP, ''parallel_for'' comes in many variants (see the [[ffnamespace:refman|reference manual]]). Other patterns at this level, to date, are: ''parallel_reduce'', ''mdf'' (macro-data-flow), ''pool evolution'' (genetic algorithm), and ''stencil''. They cover most common parallel programming paradigms in data, stream and task parallelism. Notably, FastFlow patterns are C++ class templates and can be extended by end users according to the object-oriented methodology.
 ==== Core Patterns ====
 At its foundations FastFlow implements a (mid/low-level) concurrent programming model that extends the C++ language. From the orchestration viewpoint, the process model is a CSP/Actor hybrid model where processes (so-called ''ff_node''s) are named and the data paths between processes are clearly identified. The abstract units of communication and synchronisation are known as ''channels'' and represent a stream of data dependencies between two processes. A ''ff_node'' is a C++ class; after construction it enters an infinite loop that 1) gets a task from the input channel (i.e. a pointer); 2) executes the business code on the task; 3) puts a task into the output channel (i.e. a pointer). Representing communication and synchronisation as channels ensures that synchronisation is tied to communication and allows higher layers of abstraction to compose parallel programs where synchronisation is implicit.
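The get/execute/put loop above can be sketched with a stand-in for the ''ff_node'' contract. The names ''node'' and ''run_node'' below are hypothetical, not FastFlow's code: in the real run-time every node runs in its own thread over lock-free channels, whereas this single-threaded sketch uses a plain deque as the channel.

```cpp
#include <deque>

// Stand-in for the ff_node contract: svc() is the business-code callback the
// run-time invokes for every task popped from the input channel; the pointer
// it returns is pushed onto the output channel.
struct node {
    virtual ~node() = default;
    virtual void* svc(void* task) = 0;   // 2) execute business code
};

// Drive one node over an input "stream" until it is exhausted.
inline std::deque<void*> run_node(node& n, std::deque<void*> input) {
    std::deque<void*> output;
    while (!input.empty()) {
        void* t = input.front(); input.pop_front();   // 1) get task (a pointer)
        if (void* r = n.svc(t)) output.push_back(r);  // 3) put result
    }
    return output;
}
```

A concrete node then only supplies ''svc'', e.g. a stage that doubles the integer a task points to.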
  
 At the core patterns level, patterns to build a graph of ''ff_node''s are defined. Since the graph of ''ff_node''s is a streaming network, any FastFlow graph is built using two streaming patterns (''farm'' and ''pipeline'') and one pattern modifier (''loopback'', to build cyclic networks). These patterns can be arbitrarily nested to build large and complex graphs. However, not all graphs can be built; this enforces the correctness (by construction) of all streaming networks that can be generated. In particular, they are deadlock-free and data-race free.
 
 === Nonblocking and Blocking behaviour ===
 
Blocking synchronisations fit coarse-grain parallelism well (tasks of milliseconds or more), whereas nonblocking synchronisations fit fine-grain parallelism. Blocking synchronisations make it possible to exploit over-provisioning (e.g. for load balancing) and to reduce energy consumption. However, they exhibit large overheads (also due to OS involvement). Mixing blocking and nonblocking synchronisations is not trivial.

The FastFlow run-time is designed to exhibit nonblocking behaviour, with the possibility to switch to blocking behaviour. Overall, a FastFlow run is a sequence of nonblocking running phases. Between two phases the run-time can switch to a blocking phase by way of an (original, data-flow) distributed protocol. In FastFlow terminology, a pattern (or a composition of patterns) can //freeze// (i.e. suspend), to be later resumed in the next nonblocking phase. This model makes it possible to address fine-grain workloads, bursts of fine-grain workloads, and coarse-grain workloads. During nonblocking phases, the FastFlow run-time employs only lock-free and wait-free algorithms in all synchronisation critical paths (whereas it uses pthread locks in blocking phases).
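The two waiting styles can be contrasted with two toy channels (illustrative only; FastFlow switches between whole nonblocking and blocking phases via its distributed protocol, it does not pick a style per ''pop''). The nonblocking consumer spins, trading a busy core for minimal latency; the blocking consumer sleeps on a condition variable, paying OS overhead but freeing the core.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Nonblocking style: spin on an empty queue (busy waiting). The mutex only
// protects the queue; the *wait* itself is a spin loop.
struct spin_channel {
    std::queue<int> q;
    std::mutex m;
    void push(int v) { std::lock_guard<std::mutex> l(m); q.push(v); }
    int pop() {
        for (;;) {  // burns a core while empty, but no OS involvement
            std::lock_guard<std::mutex> l(m);
            if (!q.empty()) { int v = q.front(); q.pop(); return v; }
        }
    }
};

// Blocking style: sleep on a condition variable until data arrives. The OS
// can reschedule the waiting core (good for over-provisioning and energy).
struct blocking_channel {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;
    void push(int v) {
        { std::lock_guard<std::mutex> l(m); q.push(v); }
        cv.notify_one();
    }
    int pop() {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return !q.empty(); });
        int v = q.front(); q.pop(); return v;
    }
};
```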
 
 === Deadlock avoidance ===
 
The implementation of a pattern in terms of a streaming network (i.e. a network of threads or processes) can be cyclic (e.g. master-worker, D&C, etc.). For this reason, FastFlow uses its own unbounded SPSC buffer to avoid deadlocks due to dependency cycles [ADK12].
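A minimal unbounded SPSC queue shows why this removes the deadlock: ''push'' always succeeds, so a producer inside a cycle can never block on a full channel. This linked-list variant is only illustrative; FastFlow's actual uSPSC buffer [ADK12] is built from a pool of bounded SPSC buffers for better cache behaviour.

```cpp
#include <atomic>

// Illustrative unbounded SPSC queue: a wait-free linked list with a dummy
// node. Producer and consumer each touch their own end only.
template <typename T>
class uspsc_queue {
    struct node { T value{}; std::atomic<node*> next{nullptr}; };
    node* head;   // touched only by the consumer
    node* tail;   // touched only by the producer
public:
    uspsc_queue() : head(new node), tail(head) {}
    ~uspsc_queue() {
        while (node* n = head) { head = n->next.load(); delete n; }
    }
    void push(const T& v) {                  // producer side: never fails
        node* n = new node;
        n->value = v;
        tail->next.store(n, std::memory_order_release);  // publish the item
        tail = n;
    }
    bool pop(T& out) {                       // consumer side: never blocks
        node* n = head->next.load(std::memory_order_acquire);
        if (!n) return false;                // empty
        out = n->value;
        delete head;
        head = n;                            // n becomes the new dummy
        return true;
    }
};
```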
  
 === Accelerator mode ===
  
 Once created, an accelerator can be run, making it capable of accepting tasks on its input channel. When running, the threads belonging to an accelerator might fall into an //active waiting// state. These state transitions exhibit very low overhead and do not involve the OS.
Threads not belonging to the accelerator can //wait// for an accelerator, i.e. suspend until the accelerator completes its input tasks (receives the //End-of-Stream//, which is propagated to all threads across the transient states of the lifecycle), and then put it in the frozen state. At creation time, the accelerator is configured and its threads are bound to one or more cores. Since the FastFlow run-time is implemented via non-blocking threads, they will, if not frozen, fully load the cores on which they are placed, no matter whether they are actually processing something or not. Because of this, the accelerator is usually configured to use "spare" cores
 (although over-provisioning could be forced). If necessary, output tasks can be popped from the accelerator output channel.
  
 More details on FastFlow accelerator technology can be found in [ADK11].
  
  
  
  
  
At this level, the FastFlow programming model can be thought of as a hybrid shared-memory/message-passing model. A process (''ff_node'') is sequential; a channel models a true data dependency between processes. Processes typically stream data items (not tasks) onto channels; items can be either references (e.g. pointers in shared memory) or messages with a payload (e.g. on a distributed platform). In both cases, the data item acts as a synchronisation token. In general, no further synchronisation primitives (e.g. locks, semaphores) are needed, even though their usage is not forbidden (they are simply useless and a source of additional overhead). Overall, at this level, the FastFlow building blocks make it possible to realise arbitrary streaming networks over lock-less channels.
  
 In summary, the FastFlow building blocks layer realizes the two basic features:
 Implementation-wise, a ''ff_node'' is a C++ object that is mapped onto an OS thread (POSIX or OS-native threads). Typically ff_nodes have a nonblocking behaviour, i.e. they do not suspend on pushing or popping messages from channels; empty and full channels are managed via busy waiting. If needed, a graph of nodes can be switched from nonblocking to blocking behaviour, and vice versa (via a native distributed protocol). Nonblocking behaviour, coupled with lock-less (actually wait-free) channels, enforces very high throughput and very low latency on cache-coherent shared-memory multicores (~20 clock cycles per message). The possibility to switch between blocking and nonblocking behaviour is useful to manage bursts of activity interleaved with periods of inactivity.
  
Channels are inspired by P1C1 [HK97], FastForward queues [GMV08], and Lamport's wait-free protocols [Lam83], and provide mechanisms to define simple streaming networks whose //run-time support// is implemented through correct and efficient lock-free Single-Producer-Single-Consumer (SPSC) queues equipped with non-blocking ''push'' and ''pop'' operations (more details about FastFlow's SPSC queues can be found in [ADK12]).
  
 Shared-memory channels exhibit a number of performance pitfalls on commodity shared-memory cache-coherent multiprocessors (as many commodity multi-cores are). In particular, traditional lock-free implementations (such as Lamport's solution [Lam83]) of SPSC queues are correct under sequential consistency only, and none of the current multi-cores implements sequential consistency. Also, some correct queue implementations induce a very high invalidation rate - and thus reduced performance - because they exhibit sharing of locations that are subject to alternating invalidations by the communication partners (e.g. the head and tail of a circular buffer).
 It is interesting to observe that:
  
  * a collective channel (e.g. SPMC, MPSC) implemented via SPSCs plus a mediator is sometimes faster than a CAS-based implementation (this really depends on the platform, the parallelism degree and the computation grain);
  * a mediator thread makes it possible to easily program scheduling policies for both item distribution and gathering;
  * nonblocking mediator threads couple very well with hyper-threading technology because they typically execute many instructions that never reach the execute stage of the processor pipeline;
  * FastFlow relies on wait-free, non-blocking synchronisations. The approach has pros and cons. The main advantage is performance: avoiding memory fences dramatically reduces cache-coherence overhead.
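The SPSC-plus-mediator idea can be sketched in a few lines (the helper ''mediate_round_robin'' is hypothetical; in FastFlow the mediator is a real thread, e.g. the farm's emitter, and the per-worker queues are its lock-free SPSC channels). Because the scheduling policy is ordinary code inside the mediator, round-robin is trivially replaced by on-demand or any user-defined policy.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Single-threaded sketch of a collective SPMC channel built as one queue per
// worker plus a mediator that distributes the input stream.
inline void mediate_round_robin(std::deque<int>& input,
                                std::vector<std::deque<int>>& worker_qs) {
    std::size_t next = 0;
    while (!input.empty()) {
        worker_qs[next].push_back(input.front());  // one SPSC queue per worker
        input.pop_front();
        next = (next + 1) % worker_qs.size();      // the scheduling policy
    }
}
```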
  
 the x86 model). On other models (e.g., Itanium and Power4, 5, and 6),
 a store fence before an enqueue is needed [GMV08].
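The write-payload-then-publish protocol can be expressed portably with C++11 atomics, letting the compiler emit the store fence only on architectures whose memory model needs it. The ring buffer below is an illustrative Lamport-style bounded SPSC queue with explicit release/acquire ordering, not FastFlow's optimised implementation; note that one slot is kept empty to distinguish "full" from "empty".

```cpp
#include <atomic>
#include <cstddef>

// Bounded SPSC circular buffer. Producer writes only tail, consumer writes
// only head; release/acquire pairs order the payload write before the index
// publication (a no-op fence on x86 TSO, a real fence on weaker models).
template <typename T, std::size_t N>
class spsc_ring {
    T buf[N];
    std::atomic<std::size_t> head{0};  // written by the consumer
    std::atomic<std::size_t> tail{0};  // written by the producer
public:
    bool push(const T& v) {            // producer only
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t n = (t + 1) % N;
        if (n == head.load(std::memory_order_acquire)) return false;  // full
        buf[t] = v;                                // write the payload first…
        tail.store(n, std::memory_order_release);  // …then publish it
        return true;
    }
    bool pop(T& out) {                 // consumer only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;  // empty
        out = buf[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```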
 
 == GPGPUs ==
 
 GPGPUs are supported by way of OpenCL and/or CUDA. At the current development status, kernel business code should be written either in OpenCL or CUDA. FastFlow takes care of H2D/D2H (asynchronous) data transfers and synchronisations. The ''stencil-reduce'' pattern makes it possible to write most of the typical GPGPU kernels as if they were plain C/C++ code, since intra-block and inter-block synchronisations (including reduce code) are transparently provided by the pattern. Still, the programmer can use OpenCL/CUDA directives in the kernel.
 
 == Distributed ==
 
 Distributed platforms built on top of TCP/IP and InfiniBand/OFED protocols are also supported.
 FPGA support is planned but not yet fully developed.
 [GMV08] J. Giacomoni, T. Moseley, and M. Vachharajani. FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In Proc. of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 43-52, New York, NY, USA, 2008. ACM.
  
 [AB+09] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. [[http://doi.acm.org/10.1145/1562764.1562783|DOI:10.1145/1562764.1562783]]
  
 [ADK11] M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. Accelerating code on multi-cores with FastFlow. In Proc. of 17th Intl. Euro-Par 2011 Parallel Processing, volume 6853 of LNCS, pages 170–181, Bordeaux, France, Aug. 2011. Springer.
  
 [ADK12] M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. An efficient unbounded lock-free queue for multi-core systems. In Proc. of 18th Intl. Euro-Par 2012 Parallel Processing, volume 7484 of LNCS, pages 662–673, Rhodes Island, Greece, Aug. 2012. Springer.
ffnamespace/architecture.txt · Last modified: 2014/09/12 19:06 by aldinuc