Commit c87c8e3f authored May 28, 2019 by FritzFlorian

Add performance notes on matrix multiplication.

parent afd0331b — Pipeline #1228 passed with stages in 3 minutes 50 seconds
Showing 2 changed files with 34 additions and 0 deletions:

- PERFORMANCE.md (+33, -0)
- lib/pls/include/pls/pls.h (+1, -0)

PERFORMANCE.md

@@ -122,3 +122,36 @@ Average results Unbalanced:
There seems to be only a minor performance difference between the two,
suggesting that our two-level approach is not the part causing our
weaker performance.
### Commit afd0331b - Some notes on scaling problems
After tweaking individual values and parameters we still cannot find
the main cause of our slowdown on multiple processors.
We also use Intel's VTune Amplifier to measure performance on our runs
and find that we always spend far too much time 'waiting for work',
e.g. in the backoff mechanism when it is enabled or in the locks for
stealing work when backoff is disabled. This leads us to believe that our
problems might be connected to some issue with work distribution in the FFT
case, as the unbalanced tree search (with a lot of 'local' work) performs well.

To get more data we add benchmarks on matrix multiplication implemented
in two fashions: once with a 'native' array stealing task and once with
a fork-join task. Both implementations use the same minimum array
sub-size of 4 elements, so we can hopefully see whether they have any
performance differences.
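The fork-join variant can be pictured as a recursive split of the output rows down to the minimum sub-size of 4. The following is a hypothetical standalone sketch of that pattern (not the pls implementation; it forks plain `std::thread`s instead of scheduler tasks, purely to illustrate the split/leaf structure):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical fork-join matrix multiply: recursively split the output
// row range and multiply blocks of at most 4 rows sequentially,
// mirroring the minimum sub-size of 4 used in the benchmarks.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& c,
                   std::size_t begin, std::size_t end) {
    if (end - begin <= 4) {  // leaf: sequential multiply of <= 4 rows
        for (std::size_t i = begin; i < end; i++)
            for (std::size_t k = 0; k < b.size(); k++)
                for (std::size_t j = 0; j < b[0].size(); j++)
                    c[i][j] += a[i][k] * b[k][j];
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    // fork: first half on a new thread, second half inline
    std::thread left(multiply_rows, std::cref(a), std::cref(b),
                     std::ref(c), begin, mid);
    multiply_rows(a, b, c, mid, end);
    left.join();  // join both halves before returning
}
```

In the real benchmarks the fork/join points are scheduler tasks rather than raw threads, but the work shape per leaf is the same in both variants.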
Best case fork-join:

<img src="media/afd0331b_matrix_best_case_fork.png" width="400" />

Average case fork-join:

<img src="media/afd0331b_matrix_average_case_fork.png" width="400" />

Best case Native:

<img src="media/afd0331b_matrix_best_case_native.png" width="400" />

Average case Native:

<img src="media/afd0331b_matrix_average_case_native.png" width="400" />
lib/pls/include/pls/pls.h

```diff
@@ -25,6 +25,7 @@ using internal::scheduling::fork_join_task;
 using algorithm::invoke_parallel;
 using algorithm::parallel_for_fork_join;
+using algorithm::parallel_for;
 }
```
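The header change only re-exports `algorithm::parallel_for` at the top-level `pls` namespace. For intuition, a minimal standalone sketch of what such a loop primitive does (not the pls implementation; it chunks the index range over plain threads rather than work-stealing tasks) might look like this:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical parallel_for: split [first, last) into one chunk per
// worker thread and apply f to every index in the range.
void parallel_for(std::size_t first, std::size_t last,
                  const std::function<void(std::size_t)>& f,
                  unsigned workers = std::thread::hardware_concurrency()) {
    if (workers == 0) workers = 1;
    std::size_t n = last - first;
    std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> threads;
    for (std::size_t begin = first; begin < last; begin += chunk) {
        std::size_t end = std::min(begin + chunk, last);
        threads.emplace_back([&f, begin, end] {
            for (std::size_t i = begin; i < end; i++) f(i);
        });
    }
    for (auto& t : threads) t.join();  // implicit barrier at loop end
}
```

The real pls version hands chunks to the scheduler's stealing mechanism instead of spawning threads, which is exactly the code path the benchmarks above exercise.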