Commit c87c8e3f by FritzFlorian

Add performance notes on matrix multiplication.

parent afd0331b
Pipeline #1228 passed with stages
in 3 minutes 50 seconds
@@ -122,3 +122,36 @@ Average results Unbalanced:
There seems to be only a minor performance difference between the two,
suggesting that our two-level approach is not the part causing our
weaker performance.
### Commit afd0331b - Some notes on scaling problems
After tweaking individual values and parameters we still cannot find
the main cause of our slowdown on multiple processors.
We also used Intel's VTune Amplifier to profile our runs
and found that we always spend far too much time 'waiting for work':
in the backoff mechanism when it is enabled, or in the locks for stealing
work when backoff is disabled. This leads us to believe that our problems
might be connected to an issue with work distribution in the FFT case,
as the unbalanced tree search (with a lot of 'local' work) performs well.
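
To make the 'waiting for work' cost concrete, here is a minimal sketch of a steal loop with backoff. It is not the scheduler's actual code: `ExponentialBackoff`, `steal_with_backoff` and the doubling policy are assumptions chosen for illustration, and the real mechanism may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not from the scheduler): doubles the wait time after
// each failed steal attempt, up to a fixed cap.
class ExponentialBackoff {
 public:
  ExponentialBackoff(std::chrono::microseconds initial,
                     std::chrono::microseconds max)
      : initial_(initial), max_(max), current_(initial) {
    assert(initial.count() > 0 && initial <= max);
  }

  // Returns the delay for the current failed attempt and grows the next one.
  std::chrono::microseconds next_delay() {
    const std::chrono::microseconds delay = current_;
    current_ = std::min(current_ * 2, max_);
    return delay;
  }

  // Called after a successful steal so the next wait starts short again.
  void reset() { current_ = initial_; }

 private:
  std::chrono::microseconds initial_;
  std::chrono::microseconds max_;
  std::chrono::microseconds current_;
};

// Hypothetical steal loop: try_steal is any callable returning true on
// success. All time spent in sleep_for is time 'waiting for work'.
template <typename TrySteal>
bool steal_with_backoff(TrySteal try_steal, int max_attempts,
                        ExponentialBackoff& backoff) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (try_steal()) {
      backoff.reset();
      return true;
    }
    std::this_thread::sleep_for(backoff.next_delay());
  }
  return false;
}
```

With backoff disabled, the same idle time would show up in whatever lock guards the steal attempt instead of in the sleep, which matches the two profiles described above.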
To gather more data we add benchmarks on matrix multiplication implemented
in two ways: once with a 'native' array stealing task and once with
a fork-join task. Both implementations use the same minimum array
sub-size of 4 elements, so we can hopefully see whether they differ in
performance.
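
As a rough illustration of the fork-join variant, the sketch below splits the row range of the result matrix recursively until a sub-range holds at most 4 rows. It uses `std::async` from the standard library in place of our scheduler, and the names `Matrix`, `kMinRows`, `multiply_rows` and `multiply_fork_join` are hypothetical, so treat it as a sketch of the splitting pattern rather than the benchmark code itself.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical representation: square n x n matrix stored row-major.
using Matrix = std::vector<double>;

// Minimum sub-range size, matching the 4 elements used in the benchmark.
constexpr std::size_t kMinRows = 4;

// Sequential base case: computes rows [first, last) of out = a * b.
void multiply_rows(const Matrix& a, const Matrix& b, Matrix& out,
                   std::size_t n, std::size_t first, std::size_t last) {
  for (std::size_t i = first; i < last; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      double sum = 0.0;
      for (std::size_t k = 0; k < n; ++k) {
        sum += a[i * n + k] * b[k * n + j];
      }
      out[i * n + j] = sum;
    }
  }
}

// Fork-join split: halve the row range, run one half asynchronously (fork),
// compute the other half on the current thread, then wait (join).
void multiply_fork_join(const Matrix& a, const Matrix& b, Matrix& out,
                        std::size_t n, std::size_t first, std::size_t last) {
  assert(first <= last && last <= n);
  if (last - first <= kMinRows) {
    multiply_rows(a, b, out, n, first, last);
    return;
  }
  const std::size_t mid = first + (last - first) / 2;
  // Capturing by reference is safe here because we join before returning.
  auto left = std::async(std::launch::async, [&] {
    multiply_fork_join(a, b, out, n, first, mid);
  });
  multiply_fork_join(a, b, out, n, mid, last);
  left.get();
}
```

Calling `multiply_fork_join(a, b, out, n, 0, n)` fills the whole result. Each row range is written by exactly one task, so the halves need no synchronization beyond the final join.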
Best case fork-join:
<img src="media/afd0331b_matrix_best_case_fork.png" width="400"/>
Average case fork-join:
<img src="media/afd0331b_matrix_average_case_fork.png" width="400"/>
Best case native:
<img src="media/afd0331b_matrix_best_case_native.png" width="400"/>
Average case native:
<img src="media/afd0331b_matrix_average_case_native.png" width="400"/>
@@ -25,6 +25,7 @@ using internal::scheduling::fork_join_task;
using algorithm::invoke_parallel;
using algorithm::parallel_for_fork_join;
using algorithm::parallel_for;
}