From c87c8e3f8006e6547726a02dbf8f085cc2bc2f5b Mon Sep 17 00:00:00 2001
From: FritzFlorian
Date: Tue, 28 May 2019 12:45:44 +0200
Subject: [PATCH] Add performance notes on matrix multiplication.

---
 PERFORMANCE.md            | 33 +++++++++++++++++++++++++++++++++
 lib/pls/include/pls/pls.h |  1 +
 2 files changed, 34 insertions(+)

diff --git a/PERFORMANCE.md b/PERFORMANCE.md
index abe4521..dfe7d0c 100644
--- a/PERFORMANCE.md
+++ b/PERFORMANCE.md
@@ -122,3 +122,36 @@ Average results Unbalanced:
 There seems to be only a minor performance difference between the two,
 suggesting tha our two-level approach is not the part causing our weaker
 performance.
+
+### Commit afd0331b - Some notes on scaling problems
+
+After tweaking individual values and parameters we still cannot find
+the main cause for our slowdown on multiple processors.
+We also use Intel's VTune Amplifier to measure performance on our runs
+and find that we always spend way too much time 'waiting for work',
+e.g. in the backoff mechanism when enabled or in the locks for stealing
+work when backoff is disabled. This leads us to believe that our problems
+might be connected to some issue with work distribution in the FFT case,
+as the unbalanced tree search (with a lot of 'local' work) performs well.
+
+To get more data, we add benchmarks on matrix multiplication implemented
+in two fashions: once with a 'native' array stealing task and once with
+a fork-join task. Both implementations use the same minimum array
+sub-size of 4 elements and we can hopefully see if they have any
+performance differences.
+
+Best case fork-join:
+
+
+
+Average case fork-join:
+
+
+
+Best case Native:
+
+
+
+Average case Native:
+
+
diff --git a/lib/pls/include/pls/pls.h b/lib/pls/include/pls/pls.h
index 2cc0757..accbcc3 100644
--- a/lib/pls/include/pls/pls.h
+++ b/lib/pls/include/pls/pls.h
@@ -25,6 +25,7 @@ using internal::scheduling::fork_join_task;
 
 using algorithm::invoke_parallel;
 using algorithm::parallel_for_fork_join;
+using algorithm::parallel_for;
 
 }
 
-- 
libgit2 0.26.0