diff --git a/PERFORMANCE.md b/PERFORMANCE.md
index abe4521..dfe7d0c 100644
--- a/PERFORMANCE.md
+++ b/PERFORMANCE.md
@@ -122,3 +122,36 @@ Average results Unbalanced:
There seems to be only a minor performance difference between the two,
suggesting tha our two-level approach is not the part causing our
weaker performance.
+
+### Commit afd0331b - Some notes on scaling problems
+
+After tweaking individual values and parameters we still cannot find
+the main cause of our slowdown on multiple processors.
+We also use Intel's VTune Amplifier to measure performance on our runs
+and find that we always spend far too much time 'waiting for work',
+e.g. in the backoff mechanism when enabled or in the locks for stealing
+work when backoff is disabled. This leads us to believe that our problems
+might be connected to some issue with work distribution in the FFT case,
+as the unbalanced tree search (with a lot of 'local' work) performs well.
+
+To gather more data we add benchmarks on matrix multiplication implemented
+in two fashions: once with a 'native' array stealing task and once with
+a fork-join task. Both implementations use the same minimum array
+sub-size of 4 elements, so we can hopefully see whether they differ in
+performance.
+
+Best case fork-join:
+
+
+
+Average case fork-join:
+
+
+
+Best case Native:
+
+
+
+Average case Native:
+
+
diff --git a/lib/pls/include/pls/pls.h b/lib/pls/include/pls/pls.h
index 2cc0757..accbcc3 100644
--- a/lib/pls/include/pls/pls.h
+++ b/lib/pls/include/pls/pls.h
@@ -25,6 +25,7 @@ using internal::scheduling::fork_join_task;
using algorithm::invoke_parallel;
using algorithm::parallel_for_fork_join;
+using algorithm::parallel_for;
}