Commit a4b03ffe, authored Jun 06, 2019 by FritzFlorian
parent 5044f0a1

Found general problem for FFT performance.

Pipeline #1253 failed with stages in 3 minutes 38 seconds.

Showing 2 changed files with 45 additions and 2 deletions:

- PERFORMANCE.md (+37, -0)
- lib/pls/include/pls/algorithms/invoke_parallel_impl.h (+8, -2)
PERFORMANCE.md
@@ -318,3 +318,40 @@ threads (threads without any actual work) and the threads actually
performing work. Most likely there is a resource on the same cache
line that hinders the working threads, but we cannot really
figure out which one it is.
### Commit be2cdbfe - Locking Deque

Switching to a locking deque has not improved (and has even slightly hurt)
performance, so we think that the deque itself is not the
part slowing down our execution.
### Commit 5044f0a1 - Performance Bottleneck in FFT FIXED
By moving from directly calling one of the parallel functions inline

```c++
scheduler::spawn_child(sub_task_2);
function1(); // Execute first function 'inline' without spawning a sub_task object
```
to spawning two tasks
```c++
scheduler::spawn_child(sub_task_2);
scheduler::spawn_child(sub_task_1);
```
we were able to fix the bad performance of our framework in the
FFT benchmark (where there is a lot of spinning/idling of some
worker threads).

We think this is due to some sort of cache misses/bus contention
on the finishing counters. This would make sense, as the drop
at the hyperthreading mark indicates problems with this part of the
CPU pipeline (although it did not show clearly in our profiling runs).
We will now try to find the exact spot where the problem originates and
fix the source rather than 'circumventing' it with these extra tasks.
(This, in turn, should hopefully even boost the performance of all other
workloads, as contention on the bus/cache is always bad.)
lib/pls/include/pls/algorithms/invoke_parallel_impl.h
@@ -13,10 +13,13 @@ template<typename Function1, typename Function2>
 void invoke_parallel(const Function1 &function1, const Function2 &function2) {
   using namespace ::pls::internal::scheduling;

   auto sub_task_1 = lambda_task_by_reference<Function1>(function1);
   auto sub_task_2 = lambda_task_by_reference<Function2>(function2);

   scheduler::spawn_child(sub_task_2);
-  function1(); // Execute first function 'inline' without spawning a sub_task object
+  scheduler::spawn_child(sub_task_1);
+  // TODO: Research the exact cause of this being faster
+  // function1(); // Execute first function 'inline' without spawning a sub_task object
+
   scheduler::wait_for_all();
 }
@@ -24,12 +27,15 @@ template<typename Function1, typename Function2, typename Function3>
 void invoke_parallel(const Function1 &function1, const Function2 &function2, const Function3 &function3) {
   using namespace ::pls::internal::scheduling;

   auto sub_task_1 = lambda_task_by_reference<Function1>(function1);
   auto sub_task_2 = lambda_task_by_reference<Function2>(function2);
   auto sub_task_3 = lambda_task_by_reference<Function3>(function3);

   scheduler::spawn_child(sub_task_2);
   scheduler::spawn_child(sub_task_3);
-  function1(); // Execute first function 'inline' without spawning a sub_task object
+  scheduler::spawn_child(sub_task_1);
+  // TODO: Research the exact cause of this being faster
+  // function1(); // Execute first function 'inline' without spawning a sub_task object
+
   scheduler::wait_for_all();
 }