diff --git a/PERFORMANCE.md b/PERFORMANCE.md
index abe4521..dfe7d0c 100644
--- a/PERFORMANCE.md
+++ b/PERFORMANCE.md
@@ -122,3 +122,36 @@ Average results Unbalanced:
 There seems to be only a minor performance difference between the two,
 suggesting tha our two-level approach is not the part causing our
 weaker performance.
+
+### Commit afd0331b - Some notes on scaling problems
+
+After tweaking individual values and parameters we still cannot find
+the main cause of our slowdown on multiple processors.
+We also use Intel's VTune Amplifier to measure performance on our runs
+and find that we always spend far too much time 'waiting for work',
+either in the backoff mechanism when enabled or in the locks for stealing
+work when backoff is disabled. This leads us to believe that our problems
+might be connected to some issue with work distribution in the FFT case,
+as the unbalanced tree search (with a lot of 'local' work) performs well.
+
+To gather more data we add benchmarks on matrix multiplication implemented
+in two ways: once with a 'native' array-stealing task and once with
+a fork-join task. Both implementations use the same minimum array
+sub-size of 4 elements, so we can hopefully see whether they differ in
+performance.
+
+Best case fork-join:
+
+<img src="media/afd0331b_matrix_best_case_fork.png" width="400"/>
+
+Average case fork-join:
+
+<img src="media/afd0331b_matrix_average_case_fork.png" width="400"/>
+
+Best case native:
+
+<img src="media/afd0331b_matrix_best_case_native.png" width="400"/>
+
+Average case native:
+
+<img src="media/afd0331b_matrix_average_case_native.png" width="400"/>
diff --git a/lib/pls/include/pls/pls.h b/lib/pls/include/pls/pls.h
index 2cc0757..accbcc3 100644
--- a/lib/pls/include/pls/pls.h
+++ b/lib/pls/include/pls/pls.h
@@ -25,6 +25,7 @@ using internal::scheduling::fork_join_task;
 
 using algorithm::invoke_parallel;
 using algorithm::parallel_for_fork_join;
+using algorithm::parallel_for;
 
 }