Differences

This shows you the differences between two versions of the page.

ffnamespace:performance [2013/12/29 20:26]
aldinuc
ffnamespace:performance [2014/08/31 02:51]
aldinuc
Line 1: Line 1:
 +===== Applications and Performances ===== 
 +==== NGS tools (Bowtie2, BWA) - 2014 ====
 +Bowtie2.0.6, Bowtie-2.2.1, and BWA are compared in performance against their ports to the FastFlow library. Tested on:
 +     * Intel 4-socket 8-core Nehalem (64 HT) @2.0GHz, 72MB L3, 64 GB mem, Linux x86_64
 +     * Intel 2-socket 8-core Sandy Bridge (32 HT) @2.2GHz, 40MB L3, 64 GB mem, Linux x86_64 ​
 +
 +More details in:
 +
 +  * C. Misale, “Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity,” in Proc. of Intl. Euromicro PDP 2014: Parallel Distributed and network-based Processing, Torino, Italy, 2014. doi:10.1109/PDP.2014.50 [[http://calvados.di.unipi.it/storage/paper_files/2014_pdp_bowtieff.pdf|PDF]]
 +  * C. Misale, G. Ferrero, M. Torquati, and M. Aldinucci, “Sequence alignment tools: one parallel pattern to rule them all?,” BioMed Research International, 2014. doi:10.1155/2014/539410 [[http://downloads.hindawi.com/journals/bmri/2014/539410.pdf|PDF]]
 +
 +|{{:​ffnamespace:​bowtie2-speedup.png?​300|}}|{{:​ffnamespace:​bowtie-bwa-maxspeedup.png?​300|}}|
 +
 +==== Yadt-ff (parallel C4.5)  - 2012 ====
 +The well-known C4.5 statistical classifier is doubly hard to parallelize. First, data miners are reluctant to spend time on yet another brand-new parallel version :-) Past experience has shown that small improvements to the sequential algorithm can yield far more performance than a substantial investment in parallelization. This does not mean that parallelization is useless but, at least in our understanding, that a low-effort and conservative parallelization is the only kind the data-mining community is likely to welcome. Second, and unfortunately, that kind of parallelization, i.e. loop and recursion parallelization, is technically complex: the independent tasks it generates may exhibit several undesirable properties, including a huge variability in task size, which in turn may induce both severe synchronization overheads and non-trivial load-balancing problems that limit the speedup.
 +
 +The YaDT-FastFlow application addresses both problems. [[http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F9460%2F30023%2F01374196.pdf|YaDT]] is a third-party, main-memory implementation of a C4.5-like decision tree algorithm by Salvatore Ruggieri. YaDT-FastFlow is a //low-effort// parallelization of the sequential algorithm that required less than 10 hours of development (including tuning and testing) while producing a significant speedup over the sequential version.
 +
 +This application aims at demonstrating the ability of FastFlow and FastFlow accelerator to support rapid and efficient development via semi-automatic parallelization of loops and Divide&​Conquer in third-party and legacy codes. ​
 +
 +Stay tuned for a brand new Technical Report on this work. The code will be made publicly available together with the Technical Report. The C4.5-FastFlow application has been developed in cooperation with Salvatore Ruggieri, University of Pisa, Italy.
 +
 +=== Performances ===
 +Tests on andromeda (2 x quad-core HT - 16 contexts, Linux) and ottavinareale (2 x quad-core, Linux).
 +
 +|{{:ffnamespace:model_cr2_speedup.png?320|Speedup on andromeda}}|{{:ffnamespace:ottavina_cr2_speedup.png?320|Speedup on ottavinareale}}|
 +| On Andromeda (HT, 8 cores, 16 contexts) | On Ottavinareale (8 cores) |
 +
 <note important>
-This page will be entirely renewed
+The rest is outdated
 </note>
  
-===== Applications and Performances =====
-We have been developing several applications using FastFlow and the FastFlow accelerator. Their complexity ranges from simple micro-benchmarks to quite complex scientific and business applications. Clearly, our main business is developing FastFlow itself rather than any big or complex application. However, we believe that developing and running applications is the only effective way to demonstrate that FastFlow is a viable and convenient approach to high-level parallel programming for multi-core. For this reason, each application is carefully chosen to demonstrate a particular aspect or feature of FastFlow, and we try to make them available to third parties in a timely manner, with all the information needed to understand the code and reproduce our experiments. We also try to publish a Technical Report for each significant advance. That said, we are also very interested in supporting independent programmers, scientists, and industries that would like to try FastFlow in their own application domains. If you are interested, just write to us.
+
+We have been developing several applications using FastFlow and the FastFlow accelerator. Their complexity ranges from simple micro-benchmarks to quite complex scientific and business applications. Clearly, our main business is developing FastFlow itself rather than any big or complex application. However, we believe that developing and running applications is the only effective way to demonstrate that FastFlow is a viable and convenient approach to high-level parallel programming for multi-core. For this reason, each application is carefully chosen to demonstrate a particular aspect or feature of FastFlow, and we try to make them available to third parties in a timely manner, with all the information needed to understand the code and reproduce our experiments. We also try to publish a Technical Report for each significant advance.
+That said, we are also very interested in supporting independent programmers, scientists, and industries that would like to try FastFlow in their own application domains. If you are interested, just write to us.
  
Line 21: Line 49:
 Tests on ottavinareale (8-cores, Linux)
  
-{{:ffnamespace:sw_ff_tbb_omp_cilk_50.png?320|}}
-{{:ffnamespace:sw_ff_tbb_omp_cilk_5.png?320|}}
-{{:ffnamespace:sw_ff_tbb_omp_cilk_05.png?320|}}
+|{{:ffnamespace:sw_ff_tbb_omp_cilk_50.png?320|}}|{{:ffnamespace:sw_ff_tbb_omp_cilk_5.png?320|}}|
+|{{:ffnamespace:sw_ff_tbb_omp_cilk_05.png?320|}}| |
+
  
 ==== N-Queens ====
Line 110: Line 137:
  
  
-==== Yadt-ff (parallel C4.5)  ==== 
-The well-known C4.5 statistical classifier is doubly hard to parallelize. First, data miners are reluctant to spend time on yet another brand-new parallel version :-) Past experience has shown that small improvements to the sequential algorithm can yield far more performance than a substantial investment in parallelization. This does not mean that parallelization is useless but, at least in our understanding, that a low-effort and conservative parallelization is the only kind the data-mining community is likely to welcome. Second, and unfortunately, that kind of parallelization, i.e. loop and recursion parallelization, is technically complex: the independent tasks it generates may exhibit several undesirable properties, including a huge variability in task size, which in turn may induce both severe synchronization overheads and non-trivial load-balancing problems that limit the speedup.
  
-The YaDT-FastFlow application addresses both problems. [[http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F9460%2F30023%2F01374196.pdf|YaDT]] is a third-party, main-memory implementation of a C4.5-like decision tree algorithm by Salvatore Ruggieri. YaDT-FastFlow is a //low-effort// parallelization of the sequential algorithm that required less than 10 hours of development (including tuning and testing) while producing a significant speedup over the sequential version.
- 
-This application aims at demonstrating the ability of FastFlow and FastFlow accelerator to support rapid and efficient development via semi-automatic parallelization of loops and Divide&​Conquer in third-party and legacy codes. ​ 
- 
-Stay tuned for a brand new Technical Report on this work. The code will be made publicly available together with the Technical Report. The C4.5-FastFlow application has been developed in cooperation with Salvatore Ruggieri, University of Pisa, Italy.
- 
-=== Performances === 
-Tests on andromeda (2 x quad-core HT - 16 contexts, Linux) and ottavinareale (2 x quad-core, Linux). 
- 
-|{{:ffnamespace:model_cr2_speedup.png?320|Speedup on andromeda}}|{{:ffnamespace:ottavina_cr2_speedup.png?320|Speedup on ottavinareale}}|
-| On Andromeda (HT, 8 cores, 16 contexts) | On Ottavinareale (8 cores) | 
 ==== Smith-Waterman ====
 In bioinformatics, sequence database searches are used to find the
Line 300: Line 314:
  
 {{:ffnamespace:iphone-2012.06.27-14.18.40.png?240|}}
- 
- 
-====== Platforms ====== 
- 
-==== Andromeda ==== 
-Andromeda is an Intel workstation with 2 quad-core Xeon E5520 Nehalem processors (16 HyperThreads) @2.26GHz, with 8MB L3 cache and 24 GBytes of main memory. The platform implements the QuickPath processor interconnect with an extended version of the MESI cache coherence protocol: a new read-only forward state has been introduced to enable cache-to-cache clean line forwarding. This eliminates invalidations in the case of read-only sharing, which significantly simplifies the performance tuning of FastFlow code. Courtesy of University of Pisa.
- 
-==== Ottavinareale ====
-
-[[http://cotognata.di.unipi.it/~marcodanelutto/wiki/doku.php?id=regoleottavina|Ottavinareale]] is a shared-memory Intel platform with two quad-core Xeon E5420 Harpertown processors @2.5GHz, 6MB L2 cache, and 8 GBytes of main memory, running Linux CentOS release 5.2 (kernel 2.6.18-92.1.22.el5) and gcc version 4.1.2 with POSIX thread model. Courtesy of University of Pisa.
- 
-==== Biocluster ==== 
-Courtesy of University of Torino. 
- 
-==== Magnana ====
-Magnana is a Macbook 13''​ unibody with Core 2 Duo P8600 2.4GHz 3MB L2 cache and 4GBytes of main memory. It currently runs Mac OS X 10.6.2 Snow Leopard, Macports QT 4.6.1 (well, it is just one of our laptops). Courtesy of University of Torino. 
- 
-==== Calvados ​ ==== 
- 
-Calvados is a Power Macintosh G4 (Mirrored Drive Doors) with 2x 1.25 GHz PowerPC G4 (7455 v3.2), 256KBytes L2, 1MBytes L3, and 1.5 GBytes of memory. It currently runs Mac OS X 10.5.8 Leopard. Calvados is particularly important as testing platform because it is equipped with processors that have a weaker memory consistency than Intel core platforms. Courtesy of University of Pisa. 
ffnamespace/performance.txt · Last modified: 2014/08/31 02:52 by aldinuc